CN107784110A - Index establishing method and device - Google Patents


Info

Publication number
CN107784110A
Authority
CN
China
Prior art keywords
index
cryptographic hash
target text
hash
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711069369.8A
Other languages
Chinese (zh)
Other versions
CN107784110B (en
Inventor
谢永恒
张侠
火莽
火一莽
万月亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd
Priority to CN201711069369.8A
Publication of CN107784110A
Application granted
Publication of CN107784110B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose an index establishing method and device. The method comprises: extracting feature words of a target text; sorting the feature words to obtain a feature string; applying the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text; searching a mapping cache pool for an index mapping bucket that matches the hash value and, if one exists, establishing the index between the hash value and the target text in that index mapping bucket; and, if no index mapping bucket matching the hash value exists in the mapping cache pool, creating an index mapping bucket that matches the hash value and establishing the index between the hash value and the target text. The index establishing method provided by the embodiments reduces the amount of index storage and, by placing the indexes of similar texts in the same index mapping bucket, classifies similar texts and improves the speed at which similar texts are retrieved.

Description

Index establishing method and device
Technical field
Embodiments of the present invention relate to the field of information indexing and querying, and in particular to an index establishing method and device.
Background art
In recent years, with the rapid development and spread of Internet technology, we often need to find the data we want in massive data sets quickly and correctly; this process is called similarity search.
With the sharp growth of network data, search speed has become a major bottleneck for similarity search, so designing a fast and effective index structure has become an urgent need for similarity search in the big data era. One commonly used index technology is the tree-structured index, typically the KD-tree. A tree-structured index uses subspace partitioning: object data are divided into several subspaces, each containing similar data, and a search scans only a particular subspace, which effectively improves retrieval speed in low-dimensional feature spaces. However, as the feature dimensionality of the searched objects increases, the efficiency of a tree-structured index drops markedly and shows almost no improvement over the time complexity of a linear search. Another index technology is based on traditional hash functions such as md5, whose principle is to map the original content to a signature as uniformly and randomly as possible; consequently, even if two original contents differ by only one byte, the resulting signatures are likely to be very different. If two signatures are equal, the original contents are equal with a certain probability; if they are unequal, this shows only that the original contents are unequal and provides no further information. An index technology based on traditional hash functions therefore cannot determine the similarity between original contents by comparing the similarity of their signatures, which is a limitation.
Summary of the invention
Embodiments of the present invention provide an index establishing method and device that effectively reduce the amount of index data stored and thereby improve retrieval speed.
In a first aspect, an embodiment of the present invention provides an index establishing method, the method comprising:
Extracting feature words of a target text;
Sorting the feature words to obtain a feature string;
Applying the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text;
Searching a mapping cache pool for an index mapping bucket that matches the hash value and, if one exists, establishing the index between the hash value and the target text in the index mapping bucket;
If no index mapping bucket matching the hash value exists in the mapping cache pool, creating an index mapping bucket that matches the hash value and establishing the index between the hash value and the target text.
Further, establishing the index between the hash value and the target text in the index mapping bucket comprises:
If no index hash value identical to the hash value exists in the index mapping bucket, storing the hash value in the index mapping bucket and establishing the index between the hash value and the target text;
If an index hash value identical to the hash value exists in the index mapping bucket, not storing the hash value again, and directly establishing the index between the index hash value and the target text.
Further, the method also comprises:
If an index mapping bucket matching the hash value exists in the mapping cache pool, recommending the text data corresponding to the hash value as text data similar to the target text.
Further, the method also comprises:
Randomly determining N hash functions;
Performing a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values;
Counting, for the N hash values, the number of close hash values that fall in the same index mapping bucket in the mapping cache pool;
Sorting the numbers of close hash values and determining, according to the sorting result, a recommended text data set similar to the target text;
Calculating the similarity between the target text and each recommended text in the recommended text data set;
Recommending the recommended text data whose similarity meets a set threshold; where N is a positive integer.
Further, extracting the feature words of the target text comprises:
Segmenting the target text into words;
Determining the feature words of the target text according to the part of speech and the frequency of occurrence of each segmented word.
Further, segmenting the target text comprises:
Based on a coarse-granularity or fine-granularity mode, selecting the corresponding segmentation unit with reference to word frequency and part of speech, dividing the target text into words, and tagging the part of speech of each word.
Further, before the target text is segmented, the method also comprises:
Filtering out characters in the target text that cannot be recognized.
In a second aspect, an embodiment of the present invention provides an index establishing device, the device comprising:
A feature word extraction module, configured to extract feature words of a target text;
A sorting module, configured to sort the feature words to obtain a feature string;
A first computing module, configured to apply the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text;
A first establishing module, configured to search a mapping cache pool for an index mapping bucket that matches the hash value and, if one exists, establish the index between the hash value and the target text in the index mapping bucket;
A second establishing module, configured to, if no index mapping bucket matching the hash value exists in the mapping cache pool, create an index mapping bucket that matches the hash value and establish the index between the hash value and the target text.
Further, the device also comprises:
A storage unit, configured to, if no index hash value identical to the hash value exists in the index mapping bucket, store the hash value in the index mapping bucket and establish the index between the hash value and the target text;
An establishing unit, configured to, if an index hash value identical to the hash value exists in the index mapping bucket, not store the hash value again and directly establish the index between the index hash value and the target text.
Further, the device also comprises:
A recommending module, configured to, if an index mapping bucket matching the hash value exists in the mapping cache pool, recommend the text data corresponding to the hash value as text data similar to the target text; or configured to randomly determine N hash functions; perform a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values; count, for the N hash values, the number of close hash values that fall in the same index mapping bucket in the mapping cache pool; sort the numbers of close hash values and determine, according to the sorting result, a recommended text data set similar to the target text; calculate the similarity between the target text and each recommended text in the recommended text data set; and recommend the recommended text data whose similarity meets a set threshold; where N is a positive integer.
In the index establishing method provided by the embodiments of the present invention, the MinHash algorithm is applied to the feature word string of the target text to obtain the hash value corresponding to the target text, and the mapping cache pool is then searched for an index mapping bucket that matches the hash value. If one exists, the index between the hash value and the target text is established in that index mapping bucket; if none exists, an index mapping bucket matching the hash value is created and the index between the hash value and the target text is established. The indexes of similar text data are thus stored as hash values in the same index mapping bucket, which reduces the number of indexes stored for similar data and improves retrieval speed.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an index establishing method provided by Embodiment one of the present invention;
Fig. 2 is a schematic flowchart of an index establishing method provided by Embodiment two of the present invention;
Fig. 3 is a schematic flowchart of nearest-neighbor hash classification, index establishment and approximate text lookup provided by Embodiment two of the present invention;
Fig. 4 is a schematic structural diagram of an index establishing device provided by Embodiment three of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Before the exemplary embodiments are discussed in greater detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential processing, many of the steps can be carried out in parallel, concurrently or simultaneously. In addition, the order of the steps can be rearranged. The processing may be terminated when its steps are completed, but may also have additional steps not included in the drawings. The processing may correspond to a method, function, procedure, subroutine, subprogram and so on.
Embodiment one
Fig. 1 is a schematic flowchart of an index establishing method provided by Embodiment one of the present invention. The index establishing method of this embodiment is suitable for building indexes over large batches of text data and can be performed by an index establishing device. Referring to Fig. 1, the method specifically comprises the following steps:
Step 110: extract the feature words of the target text.
Specifically, the feature words of the target text can be extracted by segmenting the text with Chinese word segmentation and ranking the segmented words by word frequency; text semantic analysis and part-of-speech tuning can additionally be used to find the segmented words that accurately reflect the meaning of the text, and these words are taken as the feature words. The feature words are then sorted according to a preset strategy to obtain a feature string, and the MinHash algorithm is applied to the feature string to obtain the hash value corresponding to the target text. This hash value serves as the identifier of the target text, so the target text can be indexed by its hash value, which reduces the amount of index storage compared with storing the feature text directly as the index. Moreover, when similar texts are retrieved through the index, the corresponding hash values can be compared directly, which greatly reduces the complexity of identifying similarity between texts and improves retrieval efficiency. In addition, obtaining the feature words by intelligent word segmentation rather than simple keyword extraction ensures higher recognition accuracy.
By way of example, extracting the feature words of the target text comprises:
Segmenting the target text. Specifically, segmenting the target text comprises: based on a coarse-granularity or fine-granularity mode, selecting the corresponding segmentation unit with reference to word frequency and part of speech, dividing the target text into words, and tagging the part of speech of each word.
Determining the feature words of the target text according to the part of speech and the frequency of occurrence of each segmented word.
In practice, the target text is split into words using either the coarse-granularity or the fine-granularity mode. For example, if the target text is "World Cup football match", segmentation in coarse-granularity mode gives "World Cup/football match", while segmentation in fine-granularity mode gives "world/cup/football/match"; the two modes differ in the range of word lengths of the segmented words. Segmentation can also be combined with intelligent word-frequency statistics: the specific segmentation unit is chosen with reference to word frequency and part of speech, and part-of-speech tagging is performed. For example, for the target text "The 2014 World Cup is held in Brazil", the segmentation result can be: 2014 (numeral), football match (noun), Brazil (place noun), hold (verb). Several words are then chosen from all segmented words as the feature words of the target text; suppose only three feature words are chosen. Place nouns, nouns and verbs generally reflect the meaning of the original text better. In addition, if a word occurs repeatedly in the text, the probability that it is selected as a feature word can be increased. The specific algorithm for selecting feature words can be determined according to actual requirements. Because each segmented word occurs only once in the example above, word frequency need not be considered, and "Brazil", "World Cup" and "hold" can, for example, be chosen as the feature words.
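The selection rule described above (prefer place nouns, nouns and verbs, and favor words that occur more often) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the input is assumed to be already segmented and tagged, and the tag names, the weights in `POS_WEIGHT` and the function `select_feature_words` are all hypothetical.

```python
from collections import Counter

# Hypothetical part-of-speech weights: place nouns, nouns and verbs are
# preferred; any other tag contributes nothing.
POS_WEIGHT = {"place_noun": 3.0, "noun": 2.0, "verb": 1.5}

def select_feature_words(tagged_tokens, k=3):
    """Pick k feature words from (word, pos) pairs by POS weight x frequency."""
    freq = Counter(word for word, _ in tagged_tokens)
    pos_of = {word: pos for word, pos in tagged_tokens}
    scores = {w: POS_WEIGHT.get(pos_of[w], 0.0) * freq[w] for w in freq}
    # Highest score first; ties are broken alphabetically so the result
    # does not depend on input order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]

# Illustrative, pre-tagged input (a real system would obtain this from a
# word segmenter with part-of-speech tagging).
tokens = [("2014", "numeral"), ("World Cup", "noun"),
          ("Brazil", "place_noun"), ("hold", "verb")]
print(select_feature_words(tokens))
```

Multiplying the weight by the frequency is only one way to "increase the probability" of frequent words; the text leaves the exact scoring open.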
Further, before the target text is segmented, the method also comprises:
Filtering out characters in the target text that cannot be recognized.
Specifically, this step mainly pre-processes the target text to filter out noise. Characters that cannot be recognized may be characters not in the specified encoding format, or meaningless characters such as tabs and spaces.
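A possible shape for this noise filter is sketched below, under assumptions: characters outside the expected encoding are taken to have been decoded to the Unicode replacement character U+FFFD, and "meaningless" characters are taken to be control characters and separators. The function name `filter_noise` is hypothetical. Dropping spaces suits unsegmented Chinese text; for space-delimited languages the separator rule would have to be relaxed.

```python
import unicodedata

def filter_noise(text):
    """Remove replacement characters, control characters and separators."""
    kept = []
    for ch in text:
        if ch == "\ufffd":  # produced when bytes could not be decoded
            continue
        category = unicodedata.category(ch)
        # "C*" covers control/format/unassigned characters (tab, newline),
        # "Z*" covers space, line and paragraph separators.
        if category.startswith("C") or category.startswith("Z"):
            continue
        kept.append(ch)
    return "".join(kept)

print(filter_noise("世界杯\t 足球赛\ufffd"))
```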
Step 120: sort the feature words to obtain the feature string.
The selected feature words are sorted according to a predetermined sorting strategy; for example, the feature words may be sorted by their initial letters, yielding the sorted feature string.
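Sorting matters because it makes the feature string independent of the order in which the feature words were found, so two texts with the same feature words produce the same string. A minimal sketch, assuming a case-insensitive alphabetical order and a `|` separator (both are illustrative choices, as is the name `build_feature_string`):

```python
def build_feature_string(feature_words, sep="|"):
    """Join feature words in a fixed, case-insensitive alphabetical order."""
    return sep.join(sorted(feature_words, key=str.casefold))

print(build_feature_string(["hold", "World Cup", "Brazil"]))
```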
Step 130: apply the MinHash algorithm to the feature string to obtain the hash value corresponding to the target text.
The first step in applying the MinHash algorithm to the feature string is hash function generation: design the hash function h(x) = {h1(x), h2(x), ..., hm(x)}, set the number of bits of h(x) (for example 32 bits, i.e. m = 32), and initialize every bit of h(x) to 0. Here h(x) is a hash function that maps x to an integer; h(x) is assumed to be a good hash function with good uniformity, so that different elements are mapped to different integers. The second step is to use the generated hash function h(x) to compute the hash code (for example 32 bits) of each feature word. For each bit of a feature word's hash code, if the bit is 1, the word's weight (typically its frequency of occurrence) is added to the corresponding position of the MinHash; otherwise the weight is subtracted. In the resulting 32-position MinHash, a position whose accumulated value is positive is set to 1, and any other position is set to 0. Finally, the hash value corresponding to the target text is obtained from the resulting hash code.
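The bit-wise add-or-subtract procedure described here can be sketched as below. Note that this weighted bit-voting scheme closely resembles what is commonly called SimHash, even though the text labels it MinHash. The sketch makes several assumptions that are not in the source: the per-word hash is taken from the first 4 bytes of MD5, a position becomes 1 when its accumulated value is positive, and the names `word_hash` and `fingerprint` are hypothetical.

```python
import hashlib

BITS = 32  # number of bits in the fingerprint (m = 32 in the text)

def word_hash(word):
    """Stable 32-bit hash code of one feature word (illustrative choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def fingerprint(weighted_words):
    """Combine per-word hash codes into one 32-bit value.

    weighted_words maps each feature word to its weight (e.g. frequency).
    """
    acc = [0] * BITS
    for word, weight in weighted_words.items():
        code = word_hash(word)
        for i in range(BITS):
            # Add the weight where the word's bit is 1, subtract it otherwise.
            if (code >> i) & 1:
                acc[i] += weight
            else:
                acc[i] -= weight
    value = 0
    for i in range(BITS):
        if acc[i] > 0:  # positive accumulated value -> bit set to 1
            value |= 1 << i
    return value

fp = fingerprint({"Brazil": 1, "World Cup": 1, "hold": 1})
print(f"{fp:032b}")
```

Texts that share most of their heavily weighted feature words agree on most accumulated signs, so their fingerprints differ in few bits, which is what lets nearby hash values be grouped together.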
Step 140: search the mapping cache pool for an index mapping bucket matching the hash value; if one exists, establish the index between the hash value and the target text in the index mapping bucket.
An index mapping bucket in the mapping cache pool is a set of hash values within a preset range. Hash values within the preset range are close to one another, which indicates that the target texts corresponding to them are similar texts; each index mapping bucket is thus a set of indexes of similar texts. Because the indexes of similar texts are placed in the same index mapping bucket, and because, owing to the properties of the MinHash algorithm, several similar texts may have the same hash value and therefore share one index hash value, similar texts are classified, the amount of index storage is reduced, and the retrieval speed for similar texts is improved.
Further, establishing the index between the hash value and the target text in the index mapping bucket comprises:
If no index hash value identical to the hash value exists in the index mapping bucket, storing the hash value in the index mapping bucket and establishing the index between the hash value and the target text;
If an index hash value identical to the hash value exists in the index mapping bucket, not storing the hash value again, and directly establishing the index between the index hash value and the target text.
Because similar texts may yield the same hash value after the MinHash algorithm is applied, if an index hash value identical to the current hash value already exists in the index mapping bucket, the current hash value is not stored again; instead, the index relationship between the existing identical index hash value and the current target text is established directly. In this way a separate index need not be created for every target text, which reduces the storage of similar-text indexes, keeps the backlog of stored indexes small, and further improves the retrieval speed for similar texts.
Step 150: if no index mapping bucket matching the hash value exists in the mapping cache pool, create an index mapping bucket matching the hash value and establish the index between the hash value and the target text.
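Steps 140 and 150, together with the rule of storing each hash value only once, can be sketched with nested dictionaries. The text does not say how the "preset range" of a bucket is defined; the sketch assumes that 32-bit hash values sharing their top 16 bits belong to the same bucket. The class `MappingCachePool`, the constant `PREFIX_BITS` and the text identifiers are hypothetical.

```python
PREFIX_BITS = 16  # assumption: hashes sharing the top 16 of 32 bits share a bucket

class MappingCachePool:
    def __init__(self):
        # bucket key -> {index hash value -> list of text identifiers}
        self.buckets = {}

    @staticmethod
    def bucket_key(hash_value):
        return hash_value >> (32 - PREFIX_BITS)

    def add(self, hash_value, text_id):
        """Index text_id under hash_value and return the bucket key used."""
        key = self.bucket_key(hash_value)
        bucket = self.buckets.get(key)
        if bucket is None:
            # Step 150: no matching bucket exists, so create one.
            bucket = self.buckets[key] = {}
        # Step 140: the hash value is stored only once; further texts with
        # the same hash value are attached to the existing entry.
        bucket.setdefault(hash_value, []).append(text_id)
        return key

pool = MappingCachePool()
pool.add(0x12340001, "text-a")
pool.add(0x12340001, "text-b")  # same hash value: no new entry is stored
pool.add(0x12340002, "text-c")  # close hash value: same bucket, new entry
pool.add(0xABCD0000, "text-d")  # distant hash value: new bucket
print(pool.buckets)
```

With this layout, looking up texts similar to a new text only needs the bucket for its hash value rather than a scan of all stored indexes.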
In the index establishing method provided by this embodiment, the MinHash algorithm is applied to the feature string of the target text to obtain the corresponding hash value. Because several similar texts may have the same hash value and may therefore correspond to the same index hash value, the amount of index storage is reduced; and because the indexes of similar texts are placed in the same index mapping bucket, similar texts are classified and the retrieval speed for similar texts is improved.
Embodiment two
Fig. 2 is a schematic flowchart of an index establishing method provided by Embodiment two of the present invention. On the basis of the technical solution of Embodiment one, this embodiment adds an operation for recommending similar texts; recommending similar texts from the index established by the method disclosed in Embodiment one can achieve higher efficiency and accuracy of similar-text recommendation. Referring to Fig. 2, the method comprises:
Step 210: extract the feature words of the target text.
Step 220: sort the feature words to obtain the feature string.
Step 230: apply the MinHash algorithm to the feature string to obtain the hash value corresponding to the target text.
Step 240: if an index mapping bucket matching the hash value exists in the mapping cache pool, establish the index between the hash value and the target text in the index mapping bucket, and then recommend the text data corresponding to the hash value as text data similar to the target text.
For step 240, the text data corresponding to the index hash value in the matching index mapping bucket that is identical to the hash value may be recommended as text data similar to the target text. In this case an index hash value identical to the hash value evidently exists in the matching index mapping bucket, so the hash value need not be stored again, which reduces index storage, and similar texts can be recommended directly. Alternatively, if no index hash value identical to the hash value exists in the matching index mapping bucket, the following operations may be performed to recommend similar texts.
Randomly determine N hash functions; the N hash functions may be selected directly in sequence from a hash function library;
Perform a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values;
Count, for the N hash values, the number of close hash values that fall in the same index mapping bucket in the mapping cache pool;
Sort the numbers of close hash values and determine, according to the sorting result, a recommended text data set similar to the target text;
Calculate the similarity between the target text and each recommended text in the recommended text data set;
Recommend the recommended text data whose similarity meets a set threshold; where N is a positive integer.
To aid understanding, the above operations are illustrated with an example. Let N be 10, with the 10 corresponding hash values N1, N2, ..., N10. Counting shows that N1, N2, N3 and N4 fall in index mapping bucket No. 1 of the mapping cache pool, N5, N6 and N7 fall in index mapping bucket No. 2, N8 and N9 fall in index mapping bucket No. 3, and N10 falls in index mapping bucket No. 4, so the sorting result is 4, 3, 2, 1. According to the sorting result, the recommended text data set similar to the target text is the text data corresponding to hash values N1, N2, N3 and N4 in index mapping bucket No. 1.
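The counting and sorting in this example can be reproduced in a few lines. The sketch assumes the bucket number of each of the ten hash values is already known (how a hash value is matched to a bucket is outside this snippet); the function name `rank_buckets` is hypothetical.

```python
from collections import Counter

def rank_buckets(bucket_of_hash):
    """Return (bucket id, count) pairs, most populated bucket first."""
    return Counter(bucket_of_hash).most_common()

# Bucket numbers for N1..N10 as given in the example above.
buckets_for_n1_to_n10 = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
ranking = rank_buckets(buckets_for_n1_to_n10)
print(ranking)        # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(ranking[0][0])  # 1: candidates are drawn from bucket No. 1
```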
Specifically, the similarity between the target text and each recommended text in the recommended text data set can be obtained by calculating the similarity between a set A and a set B, where set A is the set of feature word elements of the target text and set B is the set of feature word elements of a recommended text. The similarity between sets A and B can be obtained with hash functions:
Define hmin(S) as the element of set S that has the smallest hash value after hashing with h(x). For sets A and B, hmin(A) = hmin(B) holds if and only if the element of A ∪ B with the smallest hash value also lies in A ∩ B. This rests on the assumption that h(x) is a good hash function with good uniformity that maps different elements to different integers. It follows that Pr[hmin(A) = hmin(B)] = J(A, B); that is, the similarity of sets A and B equals the probability that their minimum hash values are equal after hashing. The similarity of two sets can therefore be calculated with MinHash. There are generally two methods:
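The step from the condition on A ∪ B to the probability statement can be written out as follows, taking J(A, B) to be the Jaccard similarity (the text uses the symbol without defining it) and assuming that h orders the elements of A ∪ B uniformly at random without collisions:

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|}

% Under a uniformly random collision-free h, every element of A \cup B is
% equally likely to receive the smallest hash value:
\Pr\bigl[\arg\min_{x \in A \cup B} h(x) = e\bigr] = \frac{1}{|A \cup B|}
  \quad \text{for each } e \in A \cup B

% h_{\min}(A) = h_{\min}(B) exactly when that minimizing element lies in
% A \cap B, so summing over the |A \cap B| favorable elements gives
\Pr\bigl[h_{\min}(A) = h_{\min}(B)\bigr]
  = \frac{|A \cap B|}{|A \cup B|} = J(A,B)
```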
Method one: use multiple hash functions
To estimate the probability that sets A and B have the same minimum hash value, a number of hash functions can be selected, say K. Each of the K hash functions is applied to sets A and B, giving K minimum values for each set, for example Min(A)k = {a1, a2, ..., ak} and Min(B)k = {b1, b2, ..., bk}. The similarity of sets A and B is then |Min(A)k ∩ Min(B)k| / |Min(A)k ∪ Min(B)k|, i.e. the ratio of the number of identical elements to the total number of elements in Min(A)k and Min(B)k.
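A minimal sketch of the multiple-hash-function approach follows. It uses the common position-wise estimator, counting the hash functions for which the two minima agree and dividing by K, which is one reading of the ratio described above. The K hash functions are simulated by salting SHA-1 with a seed; that construction and the names `h`, `signature` and `estimate_similarity` are assumptions made for illustration.

```python
import hashlib

def h(seed, element):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def signature(items, k=128):
    """Minimum hash value of the set under each of k hash functions."""
    return [min(h(seed, x) for x in items) for seed in range(k)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of hash functions whose minima agree for both sets."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)

A = {"Brazil", "World Cup", "hold"}
B = {"Brazil", "World Cup", "final"}
print(estimate_similarity(signature(A), signature(B)))  # estimate of J(A, B) = 0.5
```

Larger K gives a lower-variance estimate at the cost of K hash evaluations per element.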
Method two: use a single hash function
hmin(S) is the element of set S with the smallest hash value; define hmink(S) as the K elements of set S with the smallest hash values. Each set then needs to be hashed only once, after which the K smallest elements are taken. The similarity of two sets A and B is the ratio of the size of the intersection to the size of the union of the K smallest elements of set A and the K smallest elements of set B.
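The single-hash-function variant, as described, can be sketched like this: hash every element once, keep the K smallest hash values per set, and take intersection size over union size. The use of SHA-1 and the names `h`, `bottom_k` and `similarity` are illustrative assumptions; both sets are assumed to be non-empty.

```python
import hashlib

def h(element):
    """Single stable hash function applied to every element."""
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def bottom_k(items, k):
    """The k smallest hash values of the set (fewer if the set is smaller)."""
    return set(sorted(h(x) for x in items)[:k])

def similarity(a, b, k):
    """Intersection over union of the two bottom-k sketches."""
    ka, kb = bottom_k(a, k), bottom_k(b, k)
    return len(ka & kb) / len(ka | kb)

A = {"a", "b", "c"}
B = {"b", "c", "d"}
print(similarity(A, B, k=10))  # k covers both sets, so this is the exact ratio 0.5
```

When K is at least as large as both sets the result equals the exact intersection-over-union ratio; smaller K trades accuracy for a fixed-size sketch.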
Step 250: if no index mapping bucket matching the hash value exists in the mapping cache pool, create an index mapping bucket matching the hash value and establish the index between the hash value and the target text.
In the index establishing method provided by this embodiment, similar texts are recommended from the index established by the method disclosed in Embodiment one, which can achieve higher efficiency and accuracy of similar-text recommendation.
Further, on the basis of the above embodiments, refer to the schematic flowchart of nearest-neighbor hash classification, index establishment and similar-text query shown in Fig. 3. As shown in Fig. 3, the text data are first segmented to obtain the feature word string; the feature word string is then hashed to obtain the hash value of the text data, i.e. the text feature; the corresponding index is established in the mapping cache pool according to this text feature, and similar texts are recommended. Establishing the index from the hash value of the text data's feature words greatly reduces the amount of index storage and converts similar-data queries into feature nearest-neighbor searches. Because the feature words of similar texts may have the same hash value, the delay caused by a backlog of large numbers of indexes when indexing text data in bulk is avoided. It is also convenient to add segmentation dictionaries for different business content and to refine the index classification according to the corresponding hash results. Converting similar-text search into similar-code search significantly reduces query time and improves search efficiency, and similar content can be recommended according to the degree of code similarity between different texts.
Embodiment three
Fig. 4 is a schematic structural diagram of an index establishing device provided by Embodiment three of the present invention. Referring to Fig. 4, the device comprises: a feature word extraction module 410, a sorting module 420, a first computing module 430, a first establishing module 440 and a second establishing module 450;
The feature word extraction module 410 is configured to extract feature words of a target text; the sorting module 420 is configured to sort the feature words to obtain a feature string; the first computing module 430 is configured to apply the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text; the first establishing module 440 is configured to search a mapping cache pool for an index mapping bucket that matches the hash value and, if one exists, establish the index between the hash value and the target text in the index mapping bucket; the second establishing module 450 is configured to, if no index mapping bucket matching the hash value exists in the mapping cache pool, create an index mapping bucket that matches the hash value and establish the index between the hash value and the target text.
Further, the first establishing module 440 includes:
A storage unit, configured to, if no index hash value identical to the hash value exists in the index mapping bucket, store the hash value in the index mapping bucket and establish the index between the hash value and the target text;
An establishing unit, configured to, if an index hash value identical to the hash value already exists in the index mapping bucket, not store the hash value again and directly establish the index between the index hash value and the target text.
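The division of work between the second establishing module, the storage unit, and the establishing unit can be sketched as below. The text does not specify how a hash value is matched to an index mapping bucket, so this sketch assumes, purely for illustration, that the bucket key is the high-order bits of the hash value; the class name `MappingCachePool` and its parameters are hypothetical.

```python
class MappingCachePool:
    """Hypothetical in-memory mapping cache pool.

    buckets: bucket key -> {index hash value -> [text ids]}
    """

    def __init__(self, bucket_bits: int = 16, hash_bits: int = 64):
        # Assumption: a bucket "matches" a hash value by its high-order bits.
        self.shift = hash_bits - bucket_bits
        self.buckets = {}

    def bucket_key(self, hash_value: int) -> int:
        return hash_value >> self.shift

    def add(self, hash_value: int, text_id: str) -> None:
        key = self.bucket_key(hash_value)
        bucket = self.buckets.get(key)
        if bucket is None:
            # No matching bucket in the pool: create one (second establishing module).
            bucket = {}
            self.buckets[key] = bucket
        postings = bucket.get(hash_value)
        if postings is None:
            # Hash value not yet in the bucket: store it and index the text (storage unit).
            bucket[hash_value] = [text_id]
        else:
            # Hash value already stored: only add the text to its index (establishing unit).
            postings.append(text_id)

    def lookup(self, hash_value: int) -> list:
        bucket = self.buckets.get(self.bucket_key(hash_value), {})
        return list(bucket.get(hash_value, []))


pool = MappingCachePool(bucket_bits=4, hash_bits=8)
pool.add(0xA1, "t1")
pool.add(0xA1, "t2")  # same hash value: stored once, two texts indexed
pool.add(0xA7, "t3")  # same bucket, different hash value
pool.add(0x10, "t4")  # different bucket
```

Because each distinct hash value is stored once per bucket regardless of how many texts map to it, index storage grows with the number of distinct hash values rather than with the number of texts.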
Further, the device also includes:
A recommending module, configured to, if an index mapping bucket matching the hash value exists in the mapping cache pool, recommend the text data corresponding to the hash value as text data similar to the target text; or configured to: randomly determine N hash functions; perform a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values; count the number of close hash values, among the N hash values, that fall in the same index mapping bucket in the mapping cache pool; sort the counts of close hash values and determine, according to the sorting result, a recommended text data set similar to the target text; calculate the similarity between the target text and each recommended text data item in the recommended text data set; and recommend the recommended text data whose similarity meets a set threshold.
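The second recommendation mode can be sketched as follows. This is an illustrative simplification under stated assumptions: each of the N hash functions is a seeded BLAKE2b hash, "falling in the same bucket" is modeled as an exact match of the MinHash value under the same hash function, Jaccard similarity over feature-word sets stands in for the unspecified similarity measure, and all names (`recommend`, `signature`, `top_k`, `threshold`) are hypothetical. Feature-word lists are assumed to be non-empty.

```python
import hashlib
import random
from collections import Counter


def stable_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(words, seeds):
    """One MinHash value per hash function (seed)."""
    return [min(stable_hash(w, s) for w in words) for s in seeds]


def jaccard(a, b) -> float:
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def recommend(target_words, corpus, n=32, top_k=10, threshold=0.5, rng_seed=0):
    # Randomly determine N hash functions (here: N random seeds).
    seeds = random.Random(rng_seed).sample(range(1 << 30), n)

    # Index the corpus: (hash function number, hash value) -> text ids.
    buckets = {}
    for text_id, words in corpus.items():
        for i, h in enumerate(signature(words, seeds)):
            buckets.setdefault((i, h), []).append(text_id)

    # Count, per stored text, how many of the target's N hash values collide.
    hits = Counter()
    for i, h in enumerate(signature(target_words, seeds)):
        hits.update(buckets.get((i, h), []))

    # Rank by collision count to get the candidate set, then verify by similarity.
    candidates = [text_id for text_id, _ in hits.most_common(top_k)]
    scored = [(tid, jaccard(target_words, corpus[tid])) for tid in candidates]
    return [(tid, score) for tid, score in scored if score >= threshold]


corpus = {
    "a": ["index", "hash", "text", "bucket"],
    "b": ["index", "hash", "text", "query"],
    "c": ["frequency", "refrigerator", "control"],
}
results = recommend(["index", "hash", "text", "bucket"], corpus)
```

The collision count only selects candidates cheaply; the final similarity check against the threshold is what decides which texts are actually recommended.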
The index establishing device provided by this embodiment applies the MinHash algorithm to the feature string of the target text to obtain the corresponding hash value. Because the hash values corresponding to multiple similar texts may be identical, multiple similar texts may correspond to the same index hash value, which reduces index storage. Establishing the indexes of similar texts in the same index mapping bucket classifies similar texts and improves the speed of retrieving them.
The index establishing device provided by the embodiments of the present invention can perform the index establishing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to performing the method.
Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the appended claims.

Claims (10)

  1. An index establishing method, characterized by comprising:
    extracting feature words of a target text;
    sorting the feature words to obtain a feature string;
    applying the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text;
    searching a mapping cache pool for an index mapping bucket matching the hash value and, if one exists, establishing an index between the hash value and the target text in the index mapping bucket;
    if no index mapping bucket matching the hash value exists in the mapping cache pool, establishing an index mapping bucket matching the hash value, and establishing an index between the hash value and the target text.
  2. The method according to claim 1, characterized in that establishing the index between the hash value and the target text in the index mapping bucket comprises:
    if no index hash value identical to the hash value exists in the index mapping bucket, storing the hash value in the index mapping bucket, and establishing the index between the hash value and the target text;
    if an index hash value identical to the hash value exists in the index mapping bucket, not storing the hash value again, and directly establishing the index between the index hash value and the target text.
  3. The method according to claim 1 or 2, characterized by further comprising:
    if an index mapping bucket matching the hash value exists in the mapping cache pool, recommending the text data corresponding to the hash value as text data similar to the target text.
  4. The method according to claim 1 or 2, characterized by further comprising:
    randomly determining N hash functions;
    performing a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values;
    counting the number of close hash values, among the N hash values, that fall in the same index mapping bucket in the mapping cache pool;
    sorting the counts of close hash values, and determining, according to the sorting result, a recommended text data set similar to the target text;
    calculating the similarity between the target text and each recommended text data item in the recommended text data set;
    recommending the recommended text data whose similarity meets a set threshold;
    wherein N is a positive integer.
  5. The method according to claim 1 or 2, characterized in that extracting the feature words of the target text comprises:
    performing word segmentation on the target text;
    determining the feature words of the target text according to the part of speech and the frequency of occurrence of each segmented word.
  6. The method according to claim 5, characterized in that performing word segmentation on the target text comprises:
    based on a coarse-granularity or fine-granularity mode, and with reference to word frequency and part of speech, selecting a corresponding word-segmentation unit to divide the target text into words, and tagging the part of speech of each word.
  7. The method according to claim 5, characterized in that, before performing word segmentation on the target text, the method further comprises:
    filtering out characters in the target text that cannot be recognized.
  8. An index establishing device, characterized by comprising:
    a feature word extraction module, configured to extract feature words of a target text;
    a sorting module, configured to sort the feature words to obtain a feature string;
    a first computing module, configured to apply the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text;
    a first establishing module, configured to search a mapping cache pool for an index mapping bucket matching the hash value and, if one exists, to establish an index between the hash value and the target text in the index mapping bucket;
    a second establishing module, configured to, if no index mapping bucket matching the hash value exists in the mapping cache pool, establish an index mapping bucket matching the hash value, and establish an index between the hash value and the target text.
  9. The device according to claim 8, characterized in that the first establishing module comprises:
    a storage unit, configured to, if no index hash value identical to the hash value exists in the index mapping bucket, store the hash value in the index mapping bucket, and establish the index between the hash value and the target text;
    an establishing unit, configured to, if an index hash value identical to the hash value exists in the index mapping bucket, not store the hash value again, and directly establish the index between the index hash value and the target text.
  10. The device according to claim 8, characterized by further comprising:
    a recommending module, configured to, if an index mapping bucket matching the hash value exists in the mapping cache pool, recommend the text data corresponding to the hash value as text data similar to the target text; or configured to: randomly determine N hash functions; perform a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values; count the number of close hash values, among the N hash values, that fall in the same index mapping bucket in the mapping cache pool; sort the counts of close hash values, and determine, according to the sorting result, a recommended text data set similar to the target text; calculate the similarity between the target text and each recommended text data item in the recommended text data set; and recommend the recommended text data whose similarity meets a set threshold; wherein N is a positive integer.
CN201711069369.8A 2017-11-03 2017-11-03 Index establishing method and device Active CN107784110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711069369.8A CN107784110B (en) 2017-11-03 2017-11-03 Index establishing method and device

Publications (2)

Publication Number Publication Date
CN107784110A true CN107784110A (en) 2018-03-09
CN107784110B CN107784110B (en) 2020-07-03

Family

ID=61431627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711069369.8A Active CN107784110B (en) 2017-11-03 2017-11-03 Index establishing method and device

Country Status (1)

Country Link
CN (1) CN107784110B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670148A (en) * 2018-09-26 2019-04-23 平安科技(深圳)有限公司 Collection householder method, device, equipment and storage medium based on speech recognition
CN109710656A (en) * 2018-11-12 2019-05-03 清华大学 Approximate enquiring method and device
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
CN111078821A (en) * 2019-11-27 2020-04-28 泰康保险集团股份有限公司 Dictionary setting method, device, medium and electronic equipment
CN111597309A (en) * 2020-05-25 2020-08-28 深圳市小满科技有限公司 Similar enterprise recommendation method and device, electronic equipment and medium
CN111858607A (en) * 2020-07-24 2020-10-30 北京金山云网络技术有限公司 Data processing method and device, electronic equipment and computer readable medium
WO2021038887A1 (en) * 2019-08-30 2021-03-04 富士通株式会社 Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device
CN113992625A (en) * 2021-10-15 2022-01-28 杭州安恒信息技术股份有限公司 Domain name source station detection method, system, computer and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020073068A1 (en) * 1997-03-07 2002-06-13 Guha Ramanathan V. System and method for rapidly identifying the existence and location of an item in a file
CN102063446A (en) * 2009-11-13 2011-05-18 中国移动通信集团四川有限公司 Method for creating inverted index and inverted indexing device
CN102193995A (en) * 2011-04-26 2011-09-21 深圳市迅雷网络技术有限公司 Method and device for establishing multimedia data index and retrieval
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device
CN106156154A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The search method of Similar Text and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MENGFANRONG: "minhash算法", 《HTTPS://WWW.CNBLOGS.COM/MENGFANRONG/P/5058919.HTML》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670148A (en) * 2018-09-26 2019-04-23 平安科技(深圳)有限公司 Collection householder method, device, equipment and storage medium based on speech recognition
CN109710656A (en) * 2018-11-12 2019-05-03 清华大学 Approximate enquiring method and device
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
WO2021038887A1 (en) * 2019-08-30 2021-03-04 富士通株式会社 Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device
JPWO2021038887A1 (en) * 2019-08-30 2021-03-04
JP7193000B2 (en) 2019-08-30 2022-12-20 富士通株式会社 Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device
CN111078821A (en) * 2019-11-27 2020-04-28 泰康保险集团股份有限公司 Dictionary setting method, device, medium and electronic equipment
CN111078821B (en) * 2019-11-27 2023-12-08 泰康保险集团股份有限公司 Dictionary setting method, dictionary setting device, medium and electronic equipment
CN111597309A (en) * 2020-05-25 2020-08-28 深圳市小满科技有限公司 Similar enterprise recommendation method and device, electronic equipment and medium
CN111858607A (en) * 2020-07-24 2020-10-30 北京金山云网络技术有限公司 Data processing method and device, electronic equipment and computer readable medium
CN113992625A (en) * 2021-10-15 2022-01-28 杭州安恒信息技术股份有限公司 Domain name source station detection method, system, computer and readable storage medium
CN113992625B (en) * 2021-10-15 2024-05-28 杭州安恒信息技术股份有限公司 Domain name source station detection method, system, computer and readable storage medium

Also Published As

Publication number Publication date
CN107784110B (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN107784110A (en) A kind of index establishing method and device
CN109947904B (en) Preference space Skyline query processing method based on Spark environment
Norouzi et al. Fast exact search in hamming space with multi-index hashing
CN102129451B (en) Method for clustering data in image retrieval system
CN111104511B (en) Method, device and storage medium for extracting hot topics
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
US20100313258A1 (en) Identifying synonyms of entities using a document collection
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
Sood et al. Probabilistic near-duplicate detection using simhash
CN103365992B (en) Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN104298715B (en) A kind of more indexed results ordering by merging methods based on TF IDF
CN107291895B (en) Quick hierarchical document query method
US20150199567A1 (en) Document classification assisting apparatus, method and program
CN111797239A (en) Application program classification method and device and terminal equipment
Wick et al. A unified approach for schema matching, coreference and canonicalization
CN109325146A (en) A kind of video recommendation method, device, storage medium and server
KR102371437B1 (en) Method and apparatus for recommending entity, electronic device and computer readable medium
CN110929498A (en) Short text similarity calculation method and device and readable storage medium
Adamu et al. A survey on big data indexing strategies
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN114297415A (en) Multi-source heterogeneous data storage method and retrieval method for full media data space
KR20180129001A (en) Method and System for Entity summarization based on multilingual projected entity space
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN108984711A (en) A kind of personalized APP recommended method based on layering insertion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An index establishment method and device

Effective date of registration: 20220105

Granted publication date: 20200703

Pledgee: China Construction Bank Corp Beijing Zhongguancun branch

Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING

Registration number: Y2022990000005

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20220712

Granted publication date: 20200703

Pledgee: China Construction Bank Corp Beijing Zhongguancun branch

Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING

Registration number: Y2022990000005

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A kind of index establishment method and apparatus

Effective date of registration: 20220907

Granted publication date: 20200703

Pledgee: China Construction Bank Corp Beijing Zhongguancun branch

Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING

Registration number: Y2022110000206

PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20200703

Pledgee: China Construction Bank Corp Beijing Zhongguancun branch

Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING

Registration number: Y2022110000206