CN105956085A - Reverse indexing construction method and apparatus as well as retrieval method and apparatus - Google Patents

Reverse indexing construction method and apparatus as well as retrieval method and apparatus

Info

Publication number
CN105956085A
Authority
CN
China
Prior art keywords
combination
value
key
bit vector
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610282316.3A
Other languages
Chinese (zh)
Other versions
CN105956085B (en
Inventor
文德民
张云锋
周盛
潘柏宇
王冀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
1Verge Internet Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 1Verge Internet Technology Beijing Co Ltd
Priority to CN201610282316.3A
Publication of CN105956085A
Application granted
Publication of CN105956085B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a construction method and apparatus for an inverted index, as well as a retrieval method and apparatus. The construction method comprises: creating an array comprising m elements; mapping the m pieces of data in a to-be-indexed data set to the m elements of the array, wherein the attributes of the m pieces of data form an attribute set comprising n attributes, at least one attribute corresponds to s attribute values, m is a positive integer, n is greater than or equal to zero, and s is greater than or equal to zero; for each piece of data in the to-be-indexed data set, traversing the combinations of that data's attributes and the corresponding attribute values; and, for each combination, setting the bit vector corresponding to the combination to a set value to obtain an updated key-value pair set. By exploiting the small storage footprint and fast bitwise computation of bit vectors, the embodiments can greatly increase inverted-index retrieval speed and reduce memory usage, which in turn can improve system throughput and stability.

Description

Construction method and apparatus for an inverted index, and retrieval method and apparatus
Technical field
The present invention relates to the field of information retrieval, and in particular to a construction method and apparatus for an inverted index and a retrieval method and apparatus.
Background
An inverted index is an indexing method used in full-text search to store a mapping from a word to its storage positions in a document or a group of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be obtained quickly from that word.
The inverted index is the core of a retrieval system and is responsible for finding the records that match the attribute values input to the system. Its life cycle includes two stages, creation and use. First, the inverted index system creates an inverted index from an existing record set. Then the retrieval system looks up the IDs of matching records in the inverted index according to the attribute values input by the user. Among existing inverted index technologies, the most widely used is Lucene, which is based on word segmentation and documents.
Lucene-based inverted index technology is well suited to full-text search over large data sets to be indexed. In that case the value ranges of attributes and attribute values are very large, and the inverted index tree has to be stored on disk. For retrieval systems that are highly sensitive to response time, however, the retrieval efficiency of Lucene cannot meet the demand. For example, in an advertisement delivery engine, the data set to be retrieved is typically within the order of tens of thousands of records, the attributes to be retrieved are fixed, the value range of the attribute values is within the order of thousands, and queries are essentially exact matches, with no fuzzy search or full-text search. An advertisement delivery engine is very sensitive to response speed and requires inverted index retrieval to complete within milliseconds, which a Lucene-based inverted index cannot achieve.
Summary of the invention
The embodiments of the present invention provide a construction method for an inverted index, the method comprising:
Creating an array, the array comprising m elements;
Mapping the m pieces of data in a data set to be indexed to the m elements of the array, wherein the attributes of the m pieces of data form an attribute set, the attribute set comprises n attributes, at least one attribute corresponds to s attribute values, m is a positive integer, n is greater than or equal to zero, and s is greater than or equal to zero;
For each piece of data in the data set to be indexed, traversing the combinations of that data's attributes and the corresponding attribute values;
For each combination, setting the bit vector corresponding to the combination to a set value to obtain an updated key-value pair set.
Preferably, before setting the bit vector corresponding to the combination to the set value, the method further comprises:
For each combination, determining whether an initial key-value pair set contains a key-value pair corresponding to the combination, the key-value pair consisting of the combination and the bit vector corresponding to the combination;
If the initial key-value pair set contains no key-value pair corresponding to the combination, creating the key-value pair corresponding to the combination;
Setting the bit vector corresponding to the combination to the set value then comprises:
Setting all bits of the bit vector in the key-value pair corresponding to the combination to a first set value.
Preferably, the method further comprises:
If the initial key-value pair set contains the key-value pair corresponding to the combination, obtaining from the array the subscript value of the data, in the data set to be indexed, that corresponds to the combination;
Setting the bit vector corresponding to the combination to the set value then comprises:
Setting the bit of the bit vector that corresponds to the subscript value to a second set value.
Preferably, the attribute value of at least one attribute is empty, and the method further comprises:
Traversing the attribute set;
For each attribute in the attribute set, creating a key-value pair of a particular combination in the updated key-value pair set and setting all bits of its bit vector to the first set value, wherein the particular combination is a combination formed by an attribute whose attribute value is empty and that attribute value.
Preferably, the method further comprises:
Traversing the data set to be indexed;
Determining that the attribute value of an attribute of a piece of data in the data set to be indexed is empty;
Determining the subscript value in the array that corresponds to that piece of data;
In the updated key-value pair set, setting to the second set value the bit at the position corresponding to the subscript value in the bit vector corresponding to the subscript value.
In addition, the embodiments of the present invention provide a construction device for an inverted index, the device comprising:
A first creating unit, for creating an array, the array comprising m elements;
A mapping unit, for mapping the m pieces of data in a data set to be indexed to the m elements of the array, wherein the attributes of the m pieces of data form an attribute set, the attribute set comprises n attributes, at least one attribute corresponds to s attribute values, m is a positive integer, n is greater than or equal to zero, and s is greater than or equal to zero;
A first traversal unit, for traversing, for each piece of data in the data set to be indexed, the combinations of that data's attributes and the corresponding attribute values;
A setting unit, for setting, for each combination, the bit vector corresponding to the combination to a set value to obtain an updated key-value pair set.
Preferably, the device further comprises:
A judging unit, for determining, for each combination, whether an initial key-value pair set contains a key-value pair corresponding to the combination, the key-value pair consisting of the combination and the bit vector corresponding to the combination;
A second creating unit, for creating the key-value pair corresponding to the combination when the initial key-value pair set contains no such key-value pair;
The setting unit being specifically configured to set all bits of the bit vector in the key-value pair corresponding to the combination to a first set value.
Preferably, the device further comprises:
An acquiring unit, for obtaining from the array, if the initial key-value pair set contains the key-value pair corresponding to the combination, the subscript value of the data, in the data set to be indexed, that corresponds to the combination;
The setting unit being specifically configured to set the bit of the bit vector that corresponds to the subscript value to a second set value.
Preferably, the attribute value of at least one attribute is empty, and the device further comprises:
A second traversal unit, for traversing the attribute set;
A third creating unit, for creating, for each attribute in the attribute set, a key-value pair of a particular combination in the updated key-value pair set;
A second setting unit, for setting the bit vector to the first set value, wherein the particular combination is a combination formed by an attribute whose attribute value is empty and that attribute value.
Preferably, the device further comprises:
A third traversal unit, for traversing the data set to be indexed;
A first determining unit, for determining that the attribute value of an attribute of a piece of data in the data set to be indexed is empty;
A second determining unit, for determining the subscript value in the array that corresponds to that piece of data;
A third setting unit, for setting to the second set value, in the updated key-value pair set, the bit at the position corresponding to the subscript value in the bit vector corresponding to the subscript value.
In addition, the embodiments of the present invention provide a retrieval method for an inverted index, the inverted index being built with the construction method provided by the embodiments of the present invention, the retrieval method comprising:
Traversing the combinations of attributes and attribute values input by the user;
Looking up, in the key-value pair set, the initial bit vector corresponding to any one of the combinations to obtain an initial bit vector group, the group containing p initial bit vectors, p being a positive integer;
Performing a bitwise AND operation on the p initial bit vectors to obtain a new bit vector;
Traversing all bits of the new bit vector that equal the second set value;
Retrieving the corresponding data from the array according to the subscript values.
Preferably, before looking up the initial bit vector corresponding to any one of the combinations in the key-value pair set to obtain the initial bit vector group, the method further comprises:
Determining that the key-value pair set contains no key-value pair corresponding to the combination of attribute and attribute value input by the user;
Looking up the bit vector corresponding to any one of the combinations in the key-value pair set to obtain the bit vector group then comprises:
Looking up, in the key-value pair set, the bit vector corresponding to the particular combination and using it as the bit vector corresponding to that combination.
Preferably, before looking up the bit vector corresponding to any one of the combinations in the key-value pair set to obtain the bit vector group, the method further comprises:
Determining that the attribute value of an attribute input by the user is missing;
Looking up the bit vector corresponding to any one of the combinations in the key-value pair set to obtain the bit vector group then comprises:
Looking up, in the key-value pair set, the bit vector corresponding to the particular combination of that attribute and using it as the bit vector of that combination.
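The retrieval steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a Python `int` stands in for a bit vector, keys are `(attribute, value)` tuples, the second set value is taken to be 1, and the marker name `OTHER` for the particular combination is hypothetical. The sketch implements only the fallback described above (use the particular combination when the key or the value is missing).

```python
from typing import Dict, List, Optional, Tuple

OTHER = "Other"  # hypothetical marker for the particular combination of an attribute


def search(index: Dict[Tuple[str, str], int],
           array: List[object],
           query: Dict[str, Optional[str]]) -> List[object]:
    """Return the records whose bits survive the bitwise AND of all looked-up bit vectors."""
    result = (1 << len(array)) - 1  # all m bits set; neutral element for AND
    for attr, value in query.items():
        # A missing attribute value maps to the attribute's particular combination.
        key = (attr, value) if value is not None else (attr, OTHER)
        # An unknown combination also falls back to the particular combination.
        vec = index.get(key, index.get((attr, OTHER), 0))
        result &= vec
    # Traverse the bits equal to 1 and fetch the records by subscript value.
    return [array[i] for i in range(len(array)) if (result >> i) & 1]
```

Because the whole query reduces to a handful of integer AND operations followed by one pass over m bits, the cost of a lookup does not depend on how many records each individual combination matches.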
In addition, the embodiments of the present invention provide a retrieval device for an inverted index, the inverted index being built with the construction method provided by the embodiments of the present invention, the retrieval device comprising:
A first traversal unit, for traversing the combinations of attributes and attribute values input by the user;
A lookup unit, for looking up, in the key-value pair set, the initial bit vector corresponding to any one of the combinations to obtain an initial bit vector group, the group containing p initial bit vectors, p being a positive integer;
An operation unit, for performing a bitwise AND operation on the p initial bit vectors to obtain a new bit vector;
A second traversal unit, for traversing all bits of the new bit vector that equal the second set value;
A retrieval unit, for retrieving the corresponding data from the array according to the subscript values.
Preferably, the device further comprises:
A first determining unit, for determining that the key-value pair set contains no key-value pair corresponding to the combination of attribute and attribute value input by the user;
The lookup unit being specifically configured to look up, in the key-value pair set, the bit vector corresponding to the particular combination and use it as the bit vector corresponding to that combination.
Preferably, the device further comprises:
A second determining unit, for determining that the attribute value of an attribute input by the user is missing;
The lookup unit being specifically configured to look up, in the key-value pair set, the bit vector corresponding to the particular combination of that attribute and use it as the bit vector of that combination.
The embodiments of the present invention provide a construction method, a construction device, a retrieval method and a retrieval device for an inverted index. The embodiments use an array to establish the correspondence between data of arbitrary type in the data set and the bits of a bit vector. By exploiting the small storage footprint and fast bitwise computation of bit vectors, the embodiments can greatly increase inverted-index retrieval speed and reduce memory usage, and can thereby improve system throughput and stability.
Brief description of the drawings
The accompanying drawings described herein are provided for further understanding of the present application and constitute a part of it. The schematic embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic flowchart of the construction method of the inverted index provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the construction device of the inverted index provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the retrieval method of the inverted index provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the retrieval device of the inverted index provided by an embodiment of the present invention.
Detailed description of the embodiments
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This description and the claims do not distinguish components by name, but by difference in function. "Comprising", as used throughout the description and claims, is an open-ended term and should be interpreted as "including but not limited to". "Substantially" means within an acceptable error range, within which those skilled in the art can solve the stated technical problem and basically achieve the stated technical effect. In addition, the term "coupled" herein covers any direct or indirect means of electrical coupling. Therefore, if the text states that a first device is coupled to a second device, the first device may be electrically coupled to the second device directly, or indirectly through other devices or coupling means. The following describes preferred embodiments for implementing the present application; the description is intended to illustrate the general principles of the application and not to limit its scope. The scope of protection of the application is defined by the appended claims.
As shown in Fig. 1, an embodiment of the present invention provides a construction method for an inverted index. The method may comprise the following steps:
Step S100: create an array, the array comprising m elements.
Step S102: map the m pieces of data in the data set to be indexed to the m elements of the array, wherein the attributes of the m pieces of data form an attribute set, the attribute set comprises n attributes, at least one attribute corresponds to s attribute values, m is a positive integer, n is greater than or equal to zero, and s is greater than or equal to zero.
The created array A has the same size as the data set T to be indexed, where T = {T_1, T_2, ..., T_{m-1}, T_m}. All data in T are placed into A, which establishes the correspondence between each piece of data in T, each element of A and each subscript value of A shown in the following table:
Table one
Data in T             T_1   T_2   ...   T_{m-1}   T_m
Element in A          T_1   T_2   ...   T_{m-1}   T_m
Subscript value in A  0     1     ...   m-2       m-1
The data in T can be in any data format. Through the array A, each piece of data in T can be represented by its subscript value in A.
It should be noted that, in the correspondence described in Table one, the order of the data in array A can be arbitrary; it is only required that the data in T map one-to-one to the elements of A. Consequently, for the same data set T to be indexed, the subscript value in A of a given piece of data in T may differ between one generation of the inverted index and the next.
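As a minimal sketch of steps S100 and S102, the mapping between records and subscript values could look like the following in Python. The function name `build_array`, the `id` field and the sample records are hypothetical and serve only as illustration; the source requires only a one-to-one correspondence in any order.

```python
from typing import Dict, List, Tuple


def build_array(records: List[dict]) -> Tuple[List[dict], Dict[str, int]]:
    """Copy the data set T into array A and remember each record's subscript value."""
    array_a = list(records)  # A has the same size as T; any order is acceptable
    subscript_of = {rec["id"]: i for i, rec in enumerate(array_a)}
    return array_a, subscript_of


# Hypothetical sample data: three records identified by an "id" field.
array_a, subscript_of = build_array([{"id": "ad-7"}, {"id": "ad-3"}, {"id": "ad-9"}])
```

From this point on a record is referred to only by its integer subscript, which is what allows one bit per record in the bit vectors described next.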
Step S104: for each piece of data in the data set to be indexed, traverse the combinations of that data's attributes and the corresponding attribute values.
Each piece of data T_i (1 ≤ i ≤ m) in the data set T to be indexed may have n attributes. If all n attributes are searchable, they form the attribute set D = {D_1, D_2, ..., D_{n-1}, D_n}. The attribute set D may be variable or fixed. In an advertisement engine scenario, for example, D can be fixed and does not change with the size of T, so the size of D is greater than or equal to the number of searchable attributes of any piece of data in T.
The attribute value sets of all data in T form V = {V_1, V_2, ..., V_{n-1}, V_n}. Each element of V is itself a set: for example, the set of all attribute values of attribute D_k (1 ≤ k ≤ n) is V_k, and if the size of V_k is S_k, then V_k can be expressed as {V_{k-1}, V_{k-2}, ..., V_{k-S_k}}, where the subscript "k-j" denotes the j-th value of attribute D_k. V thus represents the attribute values of all data in T. The size of V is less than or equal to the size of the set of attribute values that users may input for retrieval.
Step S106: for each combination, set the bit vector corresponding to the combination to a set value to obtain an updated key-value pair set.
The key-value pair set I can be implemented with a HashMap, or with a data structure such as an array.
Traverse T and, for each piece of data T_i, traverse all combinations of T_i's searchable attributes and attribute values.
In the inverted index, the combination of an attribute and each attribute value of that attribute corresponds to one bit vector. A bit vector is a vector made up of binary digits (bits). Concretely it is a bit array whose elements are bits, each of which can take only the value 0 or 1 and can therefore represent two states. Combined with the correspondence array A, the value of the i-th bit of a bit vector indicates whether the piece of data in T stored at A[i] is hit by a search for that attribute value.
In some embodiments, the following steps one and two may be performed before step S106:
Step one: for each combination, determine whether the initial key-value pair set contains a key-value pair corresponding to the combination, the key-value pair consisting of the combination and the bit vector corresponding to the combination;
Step two: if the initial key-value pair set contains no key-value pair corresponding to the combination, create the key-value pair corresponding to the combination.
In this case, step S106 can be implemented as follows:
Set all bits of the bit vector in the key-value pair corresponding to the combination to the first set value.
Here the first set value can be 0.
In addition, if the initial key-value pair set contains the key-value pair corresponding to the combination, the subscript value of the data, in the data set to be indexed, that corresponds to the combination is obtained from the array.
That is, if the initial key-value pair set contains the corresponding key-value pair, the subscript value of T_i (when the data in question is T_i) can be obtained from array A, and the i-th bit of the bit vector is then set to the second set value. Here the second set value can be 1.
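The construction loop of steps S104 and S106 might be sketched as follows. This is an illustration under assumptions rather than the patent's code: each record is modelled as a dict from attribute name to a list of values, keys are `(attribute, value)` tuples in a Python dict standing in for the HashMap, the first and second set values are 0 and 1, and a newly created key-value pair has its bit set in the same pass.

```python
from typing import Dict, List, Tuple


def build_index(array: List[Dict[str, List[str]]]) -> Dict[Tuple[str, str], int]:
    """Build the key-value pair set: (attribute, value) -> bit vector stored as an int."""
    index: Dict[Tuple[str, str], int] = {}
    for i, record in enumerate(array):       # i is the subscript value in A
        for attr, values in record.items():  # traverse attribute/value combinations
            for value in values:
                key = (attr, value)
                if key not in index:
                    index[key] = 0           # create the pair, all bits 0 (first set value)
                index[key] |= 1 << i         # set bit i to 1 (second set value)
    return index
```

For example, two records that both carry the value `bj` for a `city` attribute end up sharing one bit vector with bits 0 and 1 set.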
In some embodiments, setting the i-th bit of the bit vector to 1 indicates that the corresponding data (the piece of data T_i stored at subscript i of array A) hits this key (the combination of attribute and attribute value); that is, if the user searches with this combination, T_i can be retrieved. It should be noted that some attributes have attribute values, while the attribute values of other attributes do not exist or are empty. In this case, the above construction method of the inverted index may further comprise the following steps:
First, traverse the attribute set D;
Then, for each attribute in the attribute set, create in the updated key-value pair set I a key-value pair for a particular combination (for example, the particular combination for D_1 is written D_1-Other) and set all bits of its bit vector to the first set value, wherein the particular combination is a combination formed by an attribute whose attribute value is empty and that attribute value.
In addition, in some embodiments, the above construction method of the inverted index may further comprise the following steps:
First, traverse the data set T to be indexed;
Then, determine that the attribute value of some attribute D_k of some piece of data T_i in T is empty or does not exist;
Determine the subscript value i in array A that corresponds to T_i;
In the updated key-value pair set, set to the second set value the bit at the position corresponding to subscript value i in the corresponding bit vector. Here the second set value can be 1.
The above steps express that the piece of data T_i in the data set T places no restriction on attribute D_k in the attribute set D, so T_i can be retrieved through any attribute value of D_k.
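The handling of empty attribute values described above could be sketched like this. The assumptions are the same as in the rest of this illustration and are not taken from the source: records are dicts from attribute name to a list of values, a Python `int` is the bit vector, and `OTHER` is a hypothetical marker for the particular combination.

```python
from typing import Dict, List, Tuple

OTHER = "Other"  # hypothetical marker for the particular combination


def add_other_entries(index: Dict[Tuple[str, str], int],
                      array: List[Dict[str, List[str]]],
                      attributes: List[str]) -> Dict[Tuple[str, str], int]:
    """Create an (attribute, Other) pair per attribute and flag records lacking a value."""
    for attr in attributes:                  # traverse the attribute set D
        index[(attr, OTHER)] = 0             # all bits 0 (first set value)
    for i, record in enumerate(array):       # traverse the data set T
        for attr in attributes:
            if not record.get(attr):         # value empty or attribute absent
                index[(attr, OTHER)] |= 1 << i  # bit i = 1 (second set value)
    return index
```

Keeping these records in a dedicated bit vector per attribute means the regular `(attribute, value)` vectors never have to be rewritten when a record omits that attribute.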
With the above construction method, the data structure of the resulting inverted index can be a key-value pair set I of the form Key (attribute-attribute value) to Value (bit vector), as follows:
Key             Value
D_1-V_{1-1}     bit vector
D_1-V_{1-2}     bit vector
...             ...
D_1-V_{1-S_1}   bit vector
D_1-Other       bit vector
...             ...
D_n-V_{n-1}     bit vector
D_n-V_{n-2}     bit vector
...             ...
D_n-V_{n-S_n}   bit vector
D_n-Other       bit vector
Here, the Key ranges over all combinations of attributes and attribute values of all data in the data set T, so the number of combinations equals the sum of the sizes of all elements of the attribute value set V. The Value is the corresponding bit vector. If a bit of the bit vector is 1, the corresponding piece of data in T satisfies the retrieval requirement of that attribute and attribute value; a search by that attribute and attribute value will hit this piece of data.
It should be noted that each piece of data may have only a few attributes and may not cover all attributes in the attribute set. Moreover, the attribute value sets of all attributes in the data set T generally cannot cover every possible value of each attribute. Both situations mean that, in practice, an input retrieval condition (an "attribute + attribute value" combination) may not be covered by the inverted index.
For this situation, a particular "attribute + attribute value" combination can be created as a key, for example an "attribute + Other" combination, and the value of this particular combination (that is, its bit vector) can be initialized to all 0.
The role of D_k-Other is to mark the data in the data set that have no attribute value on attribute D_k. If a piece of data T_i in the data set T has no attribute value for an attribute D_k, then T_i places no retrieval restriction on D_k, and any attribute value of D_k should be able to retrieve T_i. The bit for T_i at the corresponding position of the bit vector of D_k-Other is therefore set to 1, where 1 ≤ k ≤ n.
D_k-Other always exists: for every attribute there is a key-value pair whose Key is attribute-Other.
In some embodiments, the length of a bit vector is fixed and equal to the size of the data set. Bits with value 1 represent data that hit the retrieval condition; bits with value 0 represent data that miss it.
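As a rough illustration of the fixed-length representation, the following Python sketch computes the storage needed for one bit vector and renders a vector as a string of hits and misses. The data set size of 50,000 records is a hypothetical figure chosen to match the "tens of thousands" scale mentioned in the background, and the helper names are assumptions.

```python
def bit_vector_bytes(m: int) -> int:
    """Bytes needed to hold one bit per record, rounded up to whole bytes."""
    return (m + 7) // 8


def render(vec: int, m: int) -> str:
    """Render a length-m bit vector as text, with subscript 0 on the left."""
    return "".join("1" if (vec >> i) & 1 else "0" for i in range(m))


size_for_50k = bit_vector_bytes(50_000)  # hypothetical data set of 50,000 records
```

Under these assumptions one combination costs 6,250 bytes regardless of how many records it matches, so the total index size is simply the number of keys times that constant.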
Some embodiments of the invention use the "value" of a key-value pair (the bits of the bit vector) to indicate whether a piece of data hits the corresponding "attribute + attribute value" combination. To associate bits with data of arbitrary format, the data are placed in an array, and the subscript value (a number) of each piece of data in the array determines its position in the bit vector.
It addition, the embodiment of the present invention also provides for the construction device of a kind of inverted index, as in figure 2 it is shown, the construction device of inverted index may include that
First creating unit 201, is used for creating array, and described array comprises m element;
Corresponding unit 202, for by corresponding for the m data in data acquisition system to be indexed m the element to described array;Wherein, in described data acquisition system to be indexed, the attribute of m data constitutes community set, and described community set comprises n attribute, at least one described attribute correspondence s property value, and m is positive integer, and n is more than or equal to zero, and s is more than or equal to zero;
First Traversal Unit 203, for for each data in described data acquisition system to be indexed, travels through attribute and the combination of respective attributes value of these data;
Arranging unit 304, for each described combination, arranging bit vector corresponding to this combination is that setting value is with the key-value pair set after being updated.
Preferably, the apparatus may further include:
a judging unit, configured to judge, for each combination, whether the initial key-value pair set contains a key-value pair corresponding to that combination, the key-value pair of the combination consisting of the combination and the bit vector corresponding to the combination;
a second creating unit, configured to create the key-value pair corresponding to the combination when the initial key-value pair set contains no key-value pair corresponding to the combination;
wherein the setting unit is specifically configured to set all bits of the bit vector in the key-value pair corresponding to the combination to a first setting value.
Preferably, the apparatus may further include:
an acquiring unit, configured to obtain from the array, when the initial key-value pair set contains the key-value pair corresponding to the combination, the subscript value of the data item in the data set to be indexed that corresponds to the combination;
wherein the setting unit is specifically configured to set the bit of the bit vector at the position corresponding to the subscript value to a second setting value.
Preferably, in some embodiments, the attribute value of at least one of the attributes is empty, and the apparatus may further include:
a second traversal unit, configured to traverse the attribute set;
a third creating unit, configured to create, for each attribute in the attribute set, a key-value pair of a particular combination in the updated key-value pair set;
a second setting unit, configured to set the bit vector to the first setting value, wherein the particular combination is the combination of an attribute whose attribute value is empty and that attribute value.
Preferably, the apparatus may further include:
a third traversal unit, configured to traverse the data set to be indexed;
a first determining unit, configured to determine that the attribute value of an attribute of a data item in the data set to be indexed is empty;
a second determining unit, configured to determine the subscript value of that data item in the array;
a third setting unit, configured to set, in the updated key-value pair set, the position corresponding to the subscript value in the bit vector corresponding to the subscript value to the second setting value.
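A hedged Java sketch of the empty-value handling follows. It assumes the `attribute-value` key format and the `Other` suffix used in the worked example later in this description, and it follows that example's rule that an item with an empty value for an attribute gets a 1 in every bit vector of that attribute. The names `WildcardPassSketch`, `addOtherKeys` and `markUntargeted` are hypothetical, and attribute names are assumed not to contain "-".

```java
import java.util.BitSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names are not from the patent. */
class WildcardPassSketch {
    /** Creates an "<attribute>-Other" entry, with all bits 0, for every attribute. */
    static void addOtherKeys(Map<String, BitSet> index, Set<String> attributes, int m) {
        for (String attribute : attributes) {
            index.putIfAbsent(attribute + "-Other", new BitSet(m));
        }
    }

    /**
     * Records that the item at the given subscript has an empty value for the
     * attribute: its bit is set to 1 in every bit vector of that attribute,
     * including the "<attribute>-Other" vector.
     */
    static void markUntargeted(Map<String, BitSet> index, String attribute, int subscript) {
        String prefix = attribute + "-";
        for (Map.Entry<String, BitSet> entry : index.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                entry.getValue().set(subscript);
            }
        }
    }
}
```

Calling `addOtherKeys` once before `markUntargeted` ensures the `Other` vector exists when the data set is traversed.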
As shown in Fig. 3, an embodiment of the present invention further provides a retrieval method for an inverted index, the inverted index being built with the construction method described in the above embodiments. The method may include the following steps:
Step S300: traverse the combinations of attributes and attribute values input by the user.
Step S302: look up, in the key-value pair set, the initial bit vector corresponding to any of the combinations of attribute and attribute value to obtain an initial bit vector group; the initial bit vector group contains p initial bit vectors, where p is a positive integer.
Step S304: perform a bitwise AND over the p initial bit vectors to obtain a new bit vector.
Step S306: traverse all positions of the new bit vector that hold the second setting value. Here the second setting value is 1.
Step S308: fetch the corresponding data from the array according to the subscript values.
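Steps S304 to S308 map directly onto `java.util.BitSet`. The sketch below assumes the p initial bit vectors have already been looked up; the names `RetrieveSketch` and `retrieve` are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative sketch only; names are not from the patent. */
class RetrieveSketch {
    static <T> List<T> retrieve(List<BitSet> initialVectors, List<T> array) {
        // Start from all ones so the first AND simply copies the first vector.
        BitSet result = new BitSet(array.size());
        result.set(0, array.size());
        for (BitSet vector : initialVectors) {
            result.and(vector);              // S304: and() changes result, not vector
        }
        List<T> hits = new ArrayList<>();
        // S306: visit every position that is 1; S308: fetch the element at that subscript.
        for (int i = result.nextSetBit(0); i >= 0; i = result.nextSetBit(i + 1)) {
            hits.add(array.get(i));
        }
        return hits;
    }
}
```

Accumulating into a fresh `BitSet` keeps the vectors stored in the index unchanged, which matters because `BitSet.and` mutates its receiver.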
In some embodiments, before step S302 is performed, the method may first determine that the key-value pair set contains no key-value pair corresponding to a combination of attribute and attribute value input by the user;
in that case, step S302 specifically looks up, in the key-value pair set, the bit vector corresponding to the particular combination and uses it as the bit vector corresponding to that combination.
In other embodiments, before step S302 is performed, the method may first determine that the attribute value of an attribute is missing from the user input;
in that case, looking up the bit vector corresponding to any of the combinations in the key-value pair set to obtain the bit vector group specifically means looking up, in the key-value pair set, the bit vector corresponding to the particular combination of that attribute and using it as the bit vector of that combination.
The set of attribute values that a user may submit is at least as large as the attribute value set V, so a combination of attribute and attribute value input by the user may have no corresponding record in the key-value pair set I. In that case the particular combination D_k-Other is used as the key to search I again, and the result is taken as the bit vector B_k.
If the number of attributes input by the user is less than n (the size of the attribute set D), that is, the attribute value of some attribute, say D_k, is missing, the particular combination D_k-Other is used as the key to search I, and the result is taken as the bit vector B_k. B_k here is the bit vector of that combination.
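Both fallback cases reduce to the same key substitution, which can be sketched in Java as below. The `attribute-value` key format, the use of `null` to model a missing attribute value, and the names `FallbackLookupSketch` and `lookup` are assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.Map;

/** Illustrative sketch only; names are not from the patent. */
class FallbackLookupSketch {
    /**
     * Returns the bit vector B_k for one attribute D_k of the request.
     * A null value models an attribute the request does not specify.
     */
    static BitSet lookup(Map<String, BitSet> index, String attribute, String value) {
        BitSet bits = (value == null) ? null : index.get(attribute + "-" + value);
        if (bits == null) {
            // Unknown or missing value: fall back to the particular combination D_k-Other.
            bits = index.get(attribute + "-Other");
        }
        return bits;
    }
}
```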
After the traversal in the first step, n bit vectors are obtained (n being the size of the attribute set D): {B_1, B_2, ..., B_(n-1), B_n}. A bitwise AND over these n bit vectors yields the bit vector B.
Traverse all positions of B that are 1 and fetch the corresponding data from array A according to the subscript values. The resulting set is the retrieval result.
In addition, some embodiments provide a retrieval apparatus for an inverted index. As shown in Fig. 4, the retrieval apparatus may have the following structure:
a first traversal unit 401, configured to traverse the combinations of attributes and attribute values input by the user;
a lookup unit 402, configured to look up, in the key-value pair set, the initial bit vector corresponding to any of the combinations to obtain an initial bit vector group, the initial bit vector group containing p initial bit vectors, p being a positive integer;
an arithmetic unit 403, configured to perform a bitwise AND over the p initial bit vectors to obtain a new bit vector;
a second traversal unit 404, configured to traverse all positions of the new bit vector that hold the second setting value;
a fetching unit 405, configured to fetch the corresponding data from the array according to the subscript values.
Preferably, the retrieval apparatus may further include a first determining unit, configured to determine that the key-value pair set contains no key-value pair corresponding to a combination of attribute and attribute value input by the user;
the lookup unit being specifically configured to look up, in the key-value pair set, the bit vector corresponding to the particular combination and use it as the bit vector corresponding to that combination.
In some embodiments, the retrieval apparatus may further include a second determining unit, configured to determine that the attribute value of an attribute is missing from the user input;
the lookup unit being specifically configured to look up, in the key-value pair set, the bit vector corresponding to the particular combination of that attribute and use it as the bit vector of that combination.
As a concrete example, the inverted index of a DSP ad placement engine can be implemented as follows:
In the DSP ad placement engine, the data set T to be retrieved is the set of all placement order IDs.
The attribute set D is the set of targeting conditions. Its size is 13, including region, channel, operating system, device type, keyword and so on. To simplify the explanation, assume D has size 3: {area, channel, os}.
The targeting condition values of all placements form the attribute value set V, as follows:
          c3       c5        c4    c2    c1
area      Beijing  Shanghai  -     -     -
channel   s        -         s     p,s   -
os        -        1         1,0   2     2,0
("-" marks an attribute for which the placement sets no targeting value.)
For area targeting, c3 targets Beijing, c5 targets Shanghai, and c4, c2 and c1 are untargeted.
For channel targeting, c3 targets s, c4 targets s, c2 targets p or s, and c5 and c1 are untargeted.
For os targeting, c5 targets 1, c4 targets 1 or 0, c2 targets 2, c1 targets 2 or 0, and c3 is untargeted. (The os targeting values are an enumeration: for example, 0 denotes Windows, 1 denotes Android, 2 denotes iOS, and so on.)
Traversing the targeting condition data of all placements yields the key-value pair set I. In the placement engine, I is implemented with a HashMap, which gives O(1) lookup efficiency. Bit vectors are represented with Java's built-in BitSet class.
Assume the data set T is as follows: {c1, c2, c3, c4, c5}.
The resulting array A is as follows: [c3, c5, c4, c2, c1].
The correspondence is as follows:
Data in T        c3  c5  c4  c2  c1
Element in A     c3  c5  c4  c2  c1
Subscript in A   0   1   2   3   4
The resulting key-value pair set I is as follows:
Key Value
area-Beijing 10111
area-Shanghai 01111
area-Other 00111
channel-p 01011
channel-s 11111
channel-t 01001
channel-Other 01001
os-1 11100
os-2 10011
os-0 10101
os-Other 10000
The digits in the Value column denote a bit vector; each digit is one binary bit, and the leftmost digit corresponds to subscript 0 of array A.
As the bit vectors in I show, if a placement can be served under the corresponding targeting value, the bit at its position is 1. When a placement is untargeted for an attribute, its position is 1 in every bit vector of that attribute.
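Because `BitSet.toString()` prints the indices of the set bits rather than a digit string, a small conversion helper makes it easier to compare a `BitSet` with the Value column above. The sketch below reads the leftmost digit as subscript 0, matching the left-to-right counting used in this example; the names `BitStringSketch`, `fromBitString` and `toBitString` are hypothetical.

```java
import java.util.BitSet;

/** Illustrative sketch only; names are not from the patent. */
class BitStringSketch {
    /** Parses a digit string such as "10111"; the leftmost digit becomes bit 0. */
    static BitSet fromBitString(String digits) {
        BitSet bits = new BitSet(digits.length());
        for (int i = 0; i < digits.length(); i++) {
            if (digits.charAt(i) == '1') {
                bits.set(i);
            }
        }
        return bits;
    }

    /** Renders the first m bits as a digit string, bit 0 first. */
    static String toBitString(BitSet bits, int m) {
        StringBuilder sb = new StringBuilder(m);
        for (int i = 0; i < m; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }
}
```

For instance, `fromBitString("10111")` sets bits 0, 2, 3 and 4, which are the subscripts of c3, c4, c2 and c1 in array A.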
During retrieval:
In some embodiments, the conditions of the retrieval request are as follows:
Key Value
area Beijing
channel v
os 1
The attributes and attribute values of the request form 3 combinations:
{area-Beijing, channel-v, os-1}.
With area-Beijing as the key, retrieval from I yields the bit vector 10111.
Nothing is found for channel-v, because the requested channel value v lies outside the attribute value set of the inverted index. The bit vector 01001 is therefore retrieved with channel-Other.
Likewise, retrieval with os-1 yields the bit vector 11100.
A bitwise AND over the three bit vectors gives 10111 & 01001 & 11100 = 00000, i.e. no data can be retrieved under these targeting conditions.
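The arithmetic of this first request can be checked with plain integers, treating each digit string as a binary number. This is a quick verification aid rather than the engine's implementation; the class name `BitwiseAndCheck` and the method `and` are hypothetical, and the width is assumed to be below 31 so the value fits in an `int`.

```java
/** Illustrative sketch only; names are not from the patent. */
class BitwiseAndCheck {
    static String and(int width, String... vectors) {
        int acc = (1 << width) - 1;               // all ones, e.g. 11111 for width 5
        for (String vector : vectors) {
            acc &= Integer.parseInt(vector, 2);   // parse the digits as a binary number
        }
        String digits = Integer.toBinaryString(acc);
        // Left-pad with zeros so the result is as wide as the inputs.
        return String.format("%" + width + "s", digits).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(and(5, "10111", "01001", "11100")); // prints 00000
    }
}
```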
In some embodiments, the conditions of the retrieval request are as follows (compared with the first request, the area targeting becomes Shanghai):
Key Value
area Shanghai
channel v
os 1
A bitwise AND over the three bit vectors obtained gives 01111 & 01001 & 11100 = 01000. The 2nd bit (counting from the left) is 1, so the 2nd element (counting from the left) of array A, c5, is fetched as the retrieval result.
In some embodiments, the conditions of the retrieval request are as follows:
Key Value
area Beijing
channel s
The area and channel targeting yield two bit vectors: 10111 and 11111.
The retrieval request lacks a value for os targeting, so the bit vector for os is obtained with os-Other: 10000.
A bitwise AND over these 3 bit vectors gives 10000. The 1st bit (counting from the left) is 1, so the 1st element (counting from the left) of array A, c3, is fetched as the retrieval result.
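The three requests above can be replayed end to end against the key-value pair set I from the table. The sketch below hard-codes I and array A from this example and, for brevity, stores each bit vector as an `int` rather than a `BitSet`; the class name `DspExampleSketch` and the method `query` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are not from the patent. */
class DspExampleSketch {
    static final String[] A = {"c3", "c5", "c4", "c2", "c1"};
    static final String[] ATTRIBUTES = {"area", "channel", "os"};
    static final Map<String, Integer> I = new HashMap<>();

    static {
        String[][] rows = {
            {"area-Beijing", "10111"}, {"area-Shanghai", "01111"}, {"area-Other", "00111"},
            {"channel-p", "01011"}, {"channel-s", "11111"}, {"channel-t", "01001"},
            {"channel-Other", "01001"},
            {"os-1", "11100"}, {"os-2", "10011"}, {"os-0", "10101"}, {"os-Other", "10000"}
        };
        for (String[] row : rows) {
            I.put(row[0], Integer.parseInt(row[1], 2));
        }
    }

    /** request maps attribute name to requested value; absent attributes fall back to Other. */
    static List<String> query(Map<String, String> request) {
        int acc = (1 << A.length) - 1;                       // 11111
        for (String attribute : ATTRIBUTES) {
            String value = request.get(attribute);
            Integer bits = (value == null) ? null : I.get(attribute + "-" + value);
            if (bits == null) {
                bits = I.get(attribute + "-Other");          // unknown or missing value
            }
            acc &= bits;
        }
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < A.length; i++) {
            // The leftmost digit of the bit string corresponds to subscript 0 of A.
            if ((acc & (1 << (A.length - 1 - i))) != 0) {
                hits.add(A[i]);
            }
        }
        return hits;
    }
}
```

With this data, the request {area: Beijing, channel: v, os: 1} returns an empty list, {area: Shanghai, channel: v, os: 1} returns [c5], and {area: Beijing, channel: s} returns [c3].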
The above description shows and describes some preferred embodiments of the application. As stated earlier, it should be understood that the application is not limited to the forms disclosed herein; these should not be regarded as excluding other embodiments, and the application can be used in various other combinations, modifications and environments, and can be modified within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the application shall all fall within the protection scope of the appended claims.

Claims (16)

1. A method for constructing an inverted index, characterized in that the method comprises:
creating an array, the array comprising m elements;
mapping the m data items in a data set to be indexed to the m elements of the array, wherein the attributes of the m data items in the data set to be indexed form an attribute set, the attribute set comprises n attributes, at least one of the attributes corresponds to s attribute values, m is a positive integer, n is greater than or equal to zero, and s is greater than or equal to zero;
for each data item in the data set to be indexed, traversing the combinations of the attributes of that data item and the corresponding attribute values;
for each combination, setting the bit vector corresponding to that combination to a setting value, so as to obtain an updated key-value pair set.
2. The method according to claim 1, characterized in that, before the setting of the bit vector corresponding to the combination to a setting value, the method further comprises:
for each combination, judging whether an initial key-value pair set contains a key-value pair corresponding to that combination, the key-value pair of the combination consisting of the combination and the bit vector corresponding to the combination;
if the initial key-value pair set contains no key-value pair corresponding to the combination, creating the key-value pair corresponding to the combination;
wherein the setting of the bit vector corresponding to the combination to a setting value comprises:
setting all bits of the bit vector in the key-value pair corresponding to the combination to a first setting value.
3. The method according to claim 2, characterized in that the method further comprises:
if the initial key-value pair set contains the key-value pair corresponding to the combination, obtaining from the array the subscript value of the data item in the data set to be indexed that corresponds to the combination;
wherein the setting of the bit vector corresponding to the combination to a setting value comprises:
setting the bit of the bit vector at the position corresponding to the subscript value to a second setting value.
4. The method according to claim 1, characterized in that the attribute value of at least one of the attributes is empty, and the method further comprises:
traversing the attribute set;
for each attribute in the attribute set, creating a key-value pair of a particular combination in the updated key-value pair set and setting all bits of its bit vector to the first setting value, wherein the particular combination is the combination of an attribute whose attribute value is empty and that attribute value.
5. The method according to claim 1, characterized in that the method further comprises:
traversing the data set to be indexed;
determining that the attribute value of an attribute of a data item in the data set to be indexed is empty;
determining the subscript value of that data item in the array;
setting all bits of the bit vector corresponding to the subscript value in the updated key-value pair set to the second setting value.
6. An apparatus for constructing an inverted index, characterized in that the apparatus comprises:
a first creating unit, configured to create an array, the array comprising m elements;
a correspondence unit, configured to map the m data items in a data set to be indexed to the m elements of the array, wherein the attributes of the m data items in the data set to be indexed form an attribute set, the attribute set comprises n attributes, at least one of the attributes corresponds to s attribute values, m is a positive integer, n is greater than or equal to zero, and s is greater than or equal to zero;
a first traversal unit, configured to traverse, for each data item in the data set to be indexed, the combinations of the attributes of that data item and the corresponding attribute values;
a setting unit, configured to set, for each combination, the bit vector corresponding to that combination to a setting value, so as to obtain an updated key-value pair set.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a judging unit, configured to judge, for each combination, whether an initial key-value pair set contains a key-value pair corresponding to that combination, the key-value pair of the combination consisting of the combination and the bit vector corresponding to the combination;
a second creating unit, configured to create the key-value pair corresponding to the combination when the initial key-value pair set contains no key-value pair corresponding to the combination;
wherein the setting unit is specifically configured to set all bits of the bit vector in the key-value pair corresponding to the combination to a first setting value.
8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
an acquiring unit, configured to obtain from the array, when the initial key-value pair set contains the key-value pair corresponding to the combination, the subscript value of the data item in the data set to be indexed that corresponds to the combination;
wherein the setting unit is specifically configured to set the bit of the bit vector at the position corresponding to the subscript value to a second setting value.
9. The apparatus according to claim 6, characterized in that the attribute value of at least one of the attributes is empty, and the apparatus further comprises:
a second traversal unit, configured to traverse the attribute set;
a third creating unit, configured to create, for each attribute in the attribute set, a key-value pair of a particular combination in the updated key-value pair set;
a second setting unit, configured to set the bit vector to the first setting value, wherein the particular combination is the combination of an attribute whose attribute value is empty and that attribute value.
10. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a third traversal unit, configured to traverse the data set to be indexed;
a first determining unit, configured to determine that the attribute value of an attribute of a data item in the data set to be indexed is empty;
a second determining unit, configured to determine the subscript value of that data item in the array;
a third setting unit, configured to set all bits of the bit vector corresponding to the subscript value in the updated key-value pair set to the second setting value.
11. A retrieval method for an inverted index, characterized in that the inverted index is built with the method according to any one of claims 1 to 5, and the retrieval method comprises:
traversing the combinations of attributes and attribute values input by a user;
looking up, in the key-value pair set, the initial bit vector corresponding to any of the combinations to obtain an initial bit vector group, the initial bit vector group containing p initial bit vectors, p being a positive integer;
performing a bitwise AND over the p initial bit vectors to obtain a new bit vector;
traversing all positions of the new bit vector that hold the second setting value;
fetching the corresponding data from the array according to the subscript values.
12. The method according to claim 11, characterized in that, before the looking up, in the key-value pair set, of the initial bit vector corresponding to any of the combinations to obtain the initial bit vector group, the method further comprises:
determining that the key-value pair set contains no key-value pair corresponding to a combination of attribute and attribute value input by the user;
wherein the looking up, in the key-value pair set, of the bit vector corresponding to any of the combinations to obtain the bit vector group comprises:
looking up, in the key-value pair set, the bit vector corresponding to the particular combination and using it as the bit vector corresponding to that combination.
13. The method according to claim 12, characterized in that, before the looking up, in the key-value pair set, of the bit vector corresponding to any of the combinations to obtain the bit vector group, the method further comprises:
determining that the attribute value of an attribute is missing from the user input;
wherein the looking up, in the key-value pair set, of the bit vector corresponding to any of the combinations to obtain the bit vector group comprises:
looking up, in the key-value pair set, the bit vector corresponding to the particular combination of that attribute and using it as the bit vector of that combination.
14. A retrieval apparatus for an inverted index, characterized in that the inverted index is built with the method according to any one of claims 1 to 5, and the retrieval apparatus comprises:
a first traversal unit, configured to traverse the combinations of attributes and attribute values input by a user;
a lookup unit, configured to look up, in the key-value pair set, the initial bit vector corresponding to any of the combinations to obtain an initial bit vector group, the initial bit vector group containing p initial bit vectors, p being a positive integer;
an arithmetic unit, configured to perform a bitwise AND over the p initial bit vectors to obtain a new bit vector;
a second traversal unit, configured to traverse all positions of the new bit vector that hold the second setting value;
a fetching unit, configured to fetch the corresponding data from the array according to the subscript values.
15. The apparatus according to claim 14, characterized in that the apparatus further comprises:
a first determining unit, configured to determine that the key-value pair set contains no key-value pair corresponding to a combination of attribute and attribute value input by the user;
wherein the lookup unit is specifically configured to look up, in the key-value pair set, the bit vector corresponding to the particular combination and use it as the bit vector corresponding to that combination.
16. The apparatus according to claim 15, characterized in that the apparatus further comprises:
a second determining unit, configured to determine that the attribute value of an attribute is missing from the user input;
wherein the lookup unit is specifically configured to look up, in the key-value pair set, the bit vector corresponding to the particular combination of that attribute and use it as the bit vector of that combination.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610282316.3A CN105956085B (en) 2016-04-29 2016-04-29 A kind of construction method and device, search method and device of inverted index

Publications (2)

Publication Number Publication Date
CN105956085A true CN105956085A (en) 2016-09-21
CN105956085B CN105956085B (en) 2019-08-27

Family

ID=56913362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610282316.3A Active CN105956085B (en) 2016-04-29 2016-04-29 A kind of construction method and device, search method and device of inverted index

Country Status (1)

Country Link
CN (1) CN105956085B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 A 5 C, block A, China International Steel Plaza, 8 Haidian Avenue, Haidian District, Beijing.

Applicant after: Youku network technology (Beijing) Co., Ltd.

Address before: 100080 A 5 C, block A, China International Steel Plaza, 8 Haidian Avenue, Haidian District, Beijing.

Applicant before: 1Verge Inc.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200427

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: Youku network technology (Beijing) Co., Ltd