CN107122370A - Distributed search method and device - Google Patents

Info

Publication number
CN107122370A
Authority
CN
China
Prior art keywords
segmentation
fingerprint
mark
server
retrieval request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610105198.9A
Other languages
Chinese (zh)
Inventor
林治晖
沈朝阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610105198.9A
Publication of CN107122370A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471: Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed search method and device. The method includes: after receiving a retrieval request, searching the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved that is carried in the retrieval request; for each similar fingerprint found, comparing the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in a predetermined order, and stopping the comparison once the first identical segment is found; comparing the identifier of that identical segment with the identifier carried in the retrieval request and, if they are the same, including the similar fingerprint in the count result, where the division into segments and the identifier of each segment are determined according to a first predetermined rule; and returning the count result. The application addresses the problems that, when similar information is identified in a distributed system, counts are unstable and an extra count deduplication operation is required.

Description

Distributed search method and device
Technical field
The present invention relates to the field of information retrieval, and in particular to a distributed search method and device.
Background
Similar-information identification is widely used today. One typical application is detecting the presence of similar information within massive amounts of information, for example deduplicating web pages in a search engine crawler system. Another typical application is detecting how frequently similar information occurs, for example detecting the number of similar emails in an anti-spam system.
SIMHASH is a relatively common algorithm for identifying duplicate information. It converts information such as a document into a 64-bit value, referred to herein as a fingerprint. If the Hamming distance between the fingerprints computed for two pieces of information is less than n (as a rule of thumb, n is generally 3), the two pieces of information are considered similar. The Hamming distance is the number of bit positions at which two values (such as the fingerprints) differ. For example, if fingerprints FP1 and FP2 differ only at the 27th and 55th bits and agree on the other 62 bits, their Hamming distance is 2, so FP1 is a similar fingerprint of FP2 and FP2 is a similar fingerprint of FP1.
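The similarity test described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the sample fingerprint value, and the choice to number bits from the least significant end are all assumptions made for the example.

```python
MASK_64 = (1 << 64) - 1  # fingerprints are treated as 64-bit unsigned values


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Count the bit positions at which two 64-bit fingerprints differ."""
    return bin((fp_a ^ fp_b) & MASK_64).count("1")


def is_similar(fp_a: int, fp_b: int, n: int = 3) -> bool:
    """Two fingerprints are similar when their Hamming distance is below n."""
    return hamming_distance(fp_a, fp_b) < n


# Hypothetical sample fingerprint.
fp1 = 0x0123456789ABCDEF
# Flip the 27th and 55th bits, counting from the least significant bit
# (the numbering direction is an assumption; the text does not fix it).
fp2 = fp1 ^ (1 << 26) ^ (1 << 54)

print(hamming_distance(fp1, fp2), is_similar(fp1, fp2))
```

XOR leaves a 1 exactly where the two fingerprints disagree, so counting the set bits of the XOR gives the Hamming distance directly.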
At present, SIMHASH is used in distributed systems in the following two ways.
Scheme 1: when storing or retrieving information, SIMHASH is used to compute the 64-bit fingerprint of the information to be stored or retrieved; one server is chosen at random from multiple servers; and the computed fingerprint is sent to the chosen server. On receiving the fingerprint, the server uses the single-machine SIMHASH scheme to store the fingerprint of the information to be stored, or to retrieve and count the similar fingerprints of the fingerprint of the information to be retrieved.
The main drawback of Scheme 1 is that the count of similar fingerprints is inaccurate. At storage time the fingerprint of the information to be stored lands at random on one of N servers, so a retrieval result is only the count of similar fingerprints on a single server. The resulting count is unstable: retrieving on different servers yields different results.
Scheme 2: fingerprints are stored on multiple servers; at retrieval time the fingerprint of the information to be retrieved is sent to multiple servers; and each server separately retrieves and counts the similar fingerprints. Because storage and retrieval are spread across multiple servers, duplicate counting can occur. For example, if two servers both store the fingerprint FP1 above and both receive the fingerprint FP2 of the information to be retrieved, the similar fingerprint FP1 is counted on both. To deduplicate the count, in Scheme 2 each retrieving server must return the similar fingerprints and their count values to the client that initiated the retrieval, and the client deduplicates by collecting and comparing the similar fingerprints.
The main drawback of Scheme 2 is that an extra deduplication operation is needed, which slows processing. If the similar fingerprints found by different servers are duplicates, sending the same similar fingerprints to the client from different servers also wastes network traffic.
Summary of the invention
The application provides a distributed search method and device that address the problems that, when similar information is identified in a distributed system, counts are unstable and an extra count deduplication operation is required.
The application adopts the following technical solutions.
A distributed search method, applied to a server, includes:
after receiving a retrieval request, searching the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved that is carried in the retrieval request;
for each similar fingerprint found, performing the following operations: comparing the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in a predetermined order, and stopping the comparison once the first identical segment is found; comparing the identifier of that identical segment with the identifier carried in the retrieval request and, if they are the same, including the similar fingerprint in the count result; where the division into segments and the identifier of each segment are determined according to a first predetermined rule; and
returning the count result.
Optionally, the method further includes:
if the identifier of the identical segment differs from the identifier carried in the retrieval request, not including the corresponding similar fingerprint in the count result.
Optionally, the identifier of a segment is the sequence number of the segment.
Optionally, searching the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved that is carried in the retrieval request includes:
according to the first predetermined rule, obtaining, from the fingerprint of the information to be retrieved that is carried in the retrieval request, the segment corresponding to the identifier carried in the retrieval request;
searching the stored segments for segments identical to the obtained segment, where each stored segment belongs to at least one of the stored fingerprints; and
selecting, from the stored fingerprints, the fingerprints corresponding to the segments found, and comparing each selected fingerprint with the fingerprint of the information to be retrieved to find the similar fingerprints.
Optionally, selecting, from the stored fingerprints, the fingerprints corresponding to the segments found includes:
using the value of the found segment as a key to look up the corresponding entry, where the entry stored under a segment's value consists of all stored fingerprints that contain that segment.
Optionally, searching the stored segments for segments identical to the obtained segment includes:
searching, among the stored segments that are indexed by the identifier carried in the retrieval request, for segments identical to the obtained segment.
A distributed search method, applied to a client, includes:
determining the server corresponding to each segment according to a second predetermined rule;
sending a retrieval request to the server corresponding to each segment, where the retrieval request carries the fingerprint of the information to be retrieved and the identifier of the segment corresponding to that server, and the division into segments and the identifier of each segment are determined according to a first predetermined rule; and
adding up the count results returned by the servers for the retrieval requests to obtain the retrieval result.
Optionally, the identifier of a segment is the sequence number of the segment.
Optionally, determining the server corresponding to each segment according to the second predetermined rule includes:
dividing the fingerprint of the information to be retrieved into K segments according to the first predetermined rule; and
performing a hash operation on the value of each segment with respect to the number of servers, and determining the server corresponding to the segment according to the result.
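The client-side steps above can be sketched as follows. This is only an illustration under stated assumptions: the fingerprint is split into K = 4 equal segments numbered 1 to 4 from left to right, the hash operation is taken to be the segment value modulo the number of servers (the text does not specify the hash), and `send_request` is a hypothetical transport callback that returns one server's count result.

```python
from typing import Callable, List

FINGERPRINT_BITS = 64
K = 4  # assumed number of equal-length segments


def split_fingerprint(fp: int, k: int = K, bits: int = FINGERPRINT_BITS) -> List[int]:
    """Split a fingerprint into k equal segments, most significant first."""
    seg_bits = bits // k
    mask = (1 << seg_bits) - 1
    return [(fp >> (bits - seg_bits * (i + 1))) & mask for i in range(k)]


def server_for_segment(segment_value: int, num_servers: int) -> int:
    """Assumed second predetermined rule: segment value modulo server count."""
    return segment_value % num_servers


def retrieve(fp: int, num_servers: int,
             send_request: Callable[[int, int, int], int]) -> int:
    """Send one request per segment and add up the returned count results."""
    total = 0
    for ident, seg in enumerate(split_fingerprint(fp), start=1):
        server = server_for_segment(seg, num_servers)
        # Each request carries the whole fingerprint plus the segment identifier.
        total += send_request(server, fp, ident)
    return total
```

Because each server is expected to count a similar fingerprint for at most one identifier, the client only needs to sum the returned numbers; it never has to see or compare the similar fingerprints themselves.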
A distributed search device, arranged at a server, includes:
a similar fingerprint search module, configured to, after a retrieval request is received, search the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved that is carried in the retrieval request;
a counting module, configured to perform the following operations for each similar fingerprint found: comparing the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in a predetermined order, and stopping the comparison once the first identical segment is found; comparing the identifier of that identical segment with the identifier carried in the retrieval request and, if they are the same, including the similar fingerprint in the count result; where the division into segments and the identifier of each segment are determined according to a first predetermined rule; and
a response module, configured to return the count result.
Optionally, the counting module is further configured not to include the corresponding similar fingerprint in the count result when the identifier of the identical segment differs from the identifier carried in the retrieval request.
Optionally, the identifier of a segment is the sequence number of the segment.
Optionally, the similar fingerprint search module includes:
an obtaining unit, configured to obtain, according to the first predetermined rule and from the fingerprint of the information to be retrieved that is carried in the retrieval request, the segment corresponding to the identifier carried in the retrieval request;
a segment comparing unit, configured to search the stored segments for segments identical to the obtained segment, where each stored segment belongs to at least one of the stored fingerprints; and
a fingerprint comparing unit, configured to select, from the stored fingerprints, the fingerprints corresponding to the segments found, and to compare each selected fingerprint with the fingerprint of the information to be retrieved to find the similar fingerprints.
Optionally, the fingerprint comparing unit selecting, from the stored fingerprints, the fingerprints corresponding to the segments found includes:
the fingerprint comparing unit using the value of the found segment as a key to look up the corresponding entry, where the entry stored under a segment's value consists of all stored fingerprints that contain that segment.
Optionally, the segment comparing unit searching the stored segments for segments identical to the obtained segment includes:
the segment comparing unit searching, among the stored segments that are indexed by the identifier carried in the retrieval request, for segments identical to the obtained segment.
A distributed search device, arranged at a client, includes:
a determining module, configured to determine the server corresponding to each segment according to a second predetermined rule;
a request module, configured to send a retrieval request to the server corresponding to each segment, where the retrieval request carries the fingerprint of the information to be retrieved and the identifier of the segment corresponding to that server, and the division into segments and the identifier of each segment are determined according to a first predetermined rule; and
a computing module, configured to add up the count results returned by the servers for the retrieval requests to obtain the retrieval result.
Optionally, the identifier of a segment is the sequence number of the segment.
Optionally, the determining module includes:
a dividing unit, configured to divide the fingerprint of the information to be retrieved into K segments according to the first predetermined rule; and
a hash operation unit, configured to perform a hash operation on the value of each segment with respect to the number of servers, and to determine the server corresponding to the segment according to the result.
The application has the following advantages.
In at least one alternative of the application, retrieval can be performed across multiple servers, so the retrieval result is more comprehensive and, compared with retrieval on a single server, more stable and accurate.
In at least one alternative of the application, the server has already performed deduplication when counting the similar fingerprints. The server can therefore return only a count result to the client, and the client obtains the retrieval result by summing the count results returned by all servers, which is fast and saves network traffic. Even if the client needs the specific similar fingerprints themselves, deduplication has already been done on the servers, so the same similar fingerprint is not sent repeatedly and unnecessary network traffic is reduced.
In at least one alternative of the application, a fingerprint is divided into multiple segments that are then mapped to different servers for storage. At retrieval time, a preliminary screening can first be done by comparing segments, and similar fingerprints are then retrieved from the screening result, which speeds up processing. In one implementation of this alternative, the server stores segments indexed by identifier, so at retrieval time it only compares against the segments indexed by the identifier carried in the retrieval request, which reduces the number of comparisons and further speeds up processing.
Of course, a product implementing the application does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the distributed search method of Embodiment One;
Fig. 2 is a schematic flowchart of the distributed search method of Embodiment Two;
Fig. 3 is a schematic diagram of the distributed search device of Embodiment Three;
Fig. 4 is a schematic diagram of the determining module in Embodiment Three;
Fig. 5 is a schematic diagram of the distributed search device of Embodiment Four;
Fig. 6 is a schematic diagram of a module in Embodiment Four.
Detailed description
The technical solutions of the application are described in detail below with reference to the drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of the application and the features within them may be combined with one another, and all such combinations fall within the protection scope of the application. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
In a typical configuration, the computing device of the client or of the authentication system may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. The memory may include module 1, module 2, ..., module N (N being an integer greater than 2).
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Embodiment One: a distributed search method, applied to a server, which, as shown in Fig. 1, includes steps S110 to S130.
S110: after receiving a retrieval request, search the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved that is carried in the retrieval request.
S120: for each similar fingerprint found, perform the following operations: compare the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in a predetermined order, and stop the comparison once the first identical segment is found; compare the identifier of that identical segment with the identifier carried in the retrieval request and, if they are the same, include the similar fingerprint in the count result; where the division into segments and the identifier of each segment are determined according to a first predetermined rule.
S130: return the count result.
In this embodiment, two segments are identical when every bit of one matches the corresponding bit of the other, for example the first bit of both is "1", the second bit of both is "0", and so on; that is, the two segments have the same value.
In this embodiment, step S120 may further include: if the identifier of the identical segment differs from the identifier carried in the retrieval request, the corresponding similar fingerprint is not included in the count result, for example the similar fingerprint is simply ignored.
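The decision made in step S120 can be sketched as below. The sketch assumes a 64-bit fingerprint split into 4 equal segments, identifiers equal to sequence numbers 1 to 4, and a left-to-right comparison order; the function names are illustrative. Note that with this split, two fingerprints whose Hamming distance is below 3 differ in at most two segments, so at least one identical segment always exists.

```python
from typing import List, Optional


def split_fingerprint(fp: int, k: int = 4, bits: int = 64) -> List[int]:
    """Split a fingerprint into k equal segments, most significant first."""
    seg_bits = bits // k
    mask = (1 << seg_bits) - 1
    return [(fp >> (bits - seg_bits * (i + 1))) & mask for i in range(k)]


def first_identical_segment_id(fp_a: int, fp_b: int) -> Optional[int]:
    """Return the identifier of the first segment with equal values, if any."""
    pairs = zip(split_fingerprint(fp_a), split_fingerprint(fp_b))
    for ident, (seg_a, seg_b) in enumerate(pairs, start=1):
        if seg_a == seg_b:
            return ident  # stop at the first identical segment
    return None


def should_count(similar_fp: int, query_fp: int, request_ident: int) -> bool:
    """Count the similar fingerprint only for the matching identifier."""
    return first_identical_segment_id(similar_fp, query_fp) == request_ident


query = 0xAAAABBBBCCCCDDDD
similar = 0xAAABBBBBCCCCDDDD  # differs from the query by one bit in segment 1
print(first_identical_segment_id(similar, query))
```

In this example the third and fourth segments are also identical, yet only a request carrying identifier 2 counts the fingerprint. That is what keeps a fingerprint stored on several servers from being counted more than once.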
In this embodiment, the returned count result may be given different meanings depending on conditions such as the scenario and requirements, for example the following two.
First meaning: the count result is the sum of the count values, on this server, of all similar fingerprints included in the count result.
The same fingerprint may occur many times. In the spam scenario, for example, if an email with identical content is sent repeatedly, the fingerprint of that email may be stored to the same server many times (that is, there are multiple instances of one kind of fingerprint). To save storage space, the server does not store identical fingerprints repeatedly; each kind of fingerprint is stored only once, and storing that kind of fingerprint again only increments its count value. For every kind of fingerprint stored on the server, a corresponding count value is also stored (equivalent to the number of instances of that fingerprint on this server, or the number of times it has been stored). Fingerprints and their count values may be, but are not limited to being, stored in the form of a fingerprint table.
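A fingerprint table of this kind could look like the following sketch. The class and method names are hypothetical, and an in-memory `Counter` stands in for whatever storage a real server would use.

```python
from collections import Counter


class FingerprintTable:
    """Stores each kind of fingerprint once, together with a count value."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def store(self, fp: int) -> None:
        # Storing a fingerprint that is already present only bumps its count.
        self._counts[fp] += 1

    def count_of(self, fp: int) -> int:
        # Counter returns 0 for a missing key without inserting it.
        return self._counts[fp]

    def kinds(self) -> int:
        """Number of distinct fingerprints held by this server."""
        return len(self._counts)


table = FingerprintTable()
for fp in (0xABCD, 0xABCD, 0xABCD, 0x1234):
    table.store(fp)
print(table.count_of(0xABCD), table.kinds())
```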
When the count result has the first meaning, including a similar fingerprint in the count result means adding the count value of that similar fingerprint (which may be, but is not limited to being, looked up in the fingerprint table) to the count result. Not including a similar fingerprint in the count result means leaving the count result unchanged. Correspondingly, counting a kind of fingerprint means adding that kind's count value to the count result.
Second meaning: the count result is the number of kinds of similar fingerprints included in the count result. For example, suppose the server being searched holds five similar fingerprints of fingerprint FP1: FP2, FP3, FP4, FP5 and FP6. If the identifier carried in the retrieval request is 1, and for each of FP2, FP3, FP4, FP5 and FP6 the first segment identical to FP1 has identifier 1, the count result is 5. If the identifier carried in the retrieval request is 2, and only for FP3 does the first segment identical to FP1 have identifier 2, the count result is 1; and so on.
When the count result has the second meaning, including a similar fingerprint in the count result means incrementing the count result by 1. Not including a similar fingerprint in the count result means leaving the count result unchanged. Correspondingly, counting a kind of fingerprint means incrementing the count result by 1.
In both meanings, "the same fingerprint" means fingerprints whose every bit matches; fingerprints computed with the same algorithm from information with identical content are the same fingerprint. Thus, under the first meaning the count result is the deduplicated number of similar fingerprints on the server, and under the second meaning it is the deduplicated number of kinds of similar fingerprints on the server.
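The difference between the two meanings can be shown with a small sketch. The fingerprint values, count values, and function names below are made up for illustration; `included` stands for the similar fingerprints that passed the identifier check.

```python
from typing import Dict, Iterable


def count_result_total(included: Iterable[int],
                       fingerprint_table: Dict[int, int]) -> int:
    """First meaning: sum of the stored count values of included fingerprints."""
    return sum(fingerprint_table[fp] for fp in included)


def count_result_kinds(included: Iterable[int]) -> int:
    """Second meaning: number of distinct included fingerprints."""
    return len(set(included))


# Hypothetical fingerprint table: fingerprint -> count value on this server.
fingerprint_table = {0xA1: 3, 0xB2: 1, 0xC3: 5}
# Hypothetical similar fingerprints that passed the identifier check.
included = [0xA1, 0xC3]

print(count_result_total(included, fingerprint_table))
print(count_result_kinds(included))
```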
In this embodiment, if the server receives two or more retrieval requests that carry the same fingerprint (with different identifiers), or receives one retrieval request that carries two or more identifiers, it may process them together. For example, if one retrieval request carries fingerprint FP2 and two identifiers, the merged processing is as follows: in step S110, similar fingerprints are searched for according to fingerprint FP2; in step S120, when the segments are compared one by one, a similar fingerprint is included in the count result if the identifier of its first segment identical to FP2 equals either identifier carried in the retrieval request, and is not included if it equals neither. The case where two retrieval requests both carry fingerprint FP2 but carry different identifiers is handled similarly.
Of course, steps S110 to S130 may instead be performed separately for each of the retrieval requests that carry the same fingerprint, or a retrieval request carrying two or more identifiers may be treated as two or more retrieval requests, with steps S110 to S130 performed for each.
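Merged processing of a request that carries several identifiers might be sketched as follows. The sketch assumes 4 equal segments with sequence-number identifiers, uses the "number of kinds" meaning of the count result, and takes the similar fingerprints as already found; all names are illustrative.

```python
from typing import Iterable, List, Optional, Set


def split_fingerprint(fp: int, k: int = 4, bits: int = 64) -> List[int]:
    """Split a fingerprint into k equal segments, most significant first."""
    seg_bits = bits // k
    mask = (1 << seg_bits) - 1
    return [(fp >> (bits - seg_bits * (i + 1))) & mask for i in range(k)]


def first_identical_segment_id(fp_a: int, fp_b: int) -> Optional[int]:
    """Identifier of the first segment whose value is equal in both."""
    pairs = zip(split_fingerprint(fp_a), split_fingerprint(fp_b))
    for ident, (seg_a, seg_b) in enumerate(pairs, start=1):
        if seg_a == seg_b:
            return ident
    return None


def merged_count(similar_fps: Iterable[int], query_fp: int,
                 request_idents: Set[int]) -> int:
    """Count each similar fingerprint whose first identical segment has
    any one of the identifiers carried in the request."""
    return sum(
        1 for fp in similar_fps
        if first_identical_segment_id(fp, query_fp) in request_idents
    )
```

Since each similar fingerprint has exactly one first identical segment, the merged count equals the sum of the counts that separate requests for each identifier would return.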
In one alternative of this embodiment, the first predetermined rule may indicate the number of segments K of the fingerprint and the identifier of each segment, where K is an integer greater than or equal to 2 and less than or equal to the number of bits in the fingerprint; for a 64-bit fingerprint, K is preferably 4. The first predetermined rule may be pre-stored in the client, and the first predetermined rule used by every client and server is the same.
In other alternatives, the first predetermined rule may take other forms, for example indicating the number of segments K indirectly by indicating the start bit of each segment, or indicating the identifier of each segment by giving a naming rule for identifiers.
In one alternative of this embodiment, the first predetermined rule may also indicate how the fingerprint is divided, for example into equal parts or with a specified length for each segment. In one example, the fingerprint has 64 bits and the first predetermined rule divides it into 4 segments of 16 bits each. In another example, the fingerprint may be divided into segments of unequal length. In other alternatives, equal division may be the default (in which case the number of bits in the fingerprint divided by the number of segments K must be an integer), so that the first predetermined rule need not indicate how the fingerprint is divided.
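One possible way to represent such a rule is a small immutable object that lists the bit length and identifier of each segment from left to right. This is a sketch under that assumption; the `SegmentRule` class and the unequal lengths (8, 24, 16, 16) are invented for the example and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Tuple


@dataclass(frozen=True)
class SegmentRule:
    """Hypothetical encoding of a first predetermined rule."""
    lengths: Tuple[int, ...]           # bits per segment, left to right
    identifiers: Tuple[Hashable, ...]  # one identifier per segment

    def split(self, fp: int) -> Dict[Hashable, int]:
        """Map each identifier to the value of its segment in fp."""
        shift = sum(self.lengths)
        segments = {}
        for ident, length in zip(self.identifiers, self.lengths):
            shift -= length
            segments[ident] = (fp >> shift) & ((1 << length) - 1)
        return segments


# Equal division of a 64-bit fingerprint into 4 segments of 16 bits.
EQUAL_RULE = SegmentRule(lengths=(16, 16, 16, 16), identifiers=(1, 2, 3, 4))
# Unequal division with letter identifiers (lengths are illustrative).
UNEQUAL_RULE = SegmentRule(lengths=(8, 24, 16, 16), identifiers=("a", "b", "c", "d"))

print(EQUAL_RULE.split(0x1111222233334444))
```

Clients and servers would need to share the same rule object (or an equivalent configuration) so that a given identifier always refers to the same bit range.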
In this embodiment, the servers that belong to the same similar-information retrieval system, or that provide the same similar-information retrieval service, all use the same first predetermined rule.
In one alternative of this embodiment, the identifier of a segment may be, but is not limited to, the sequence number of the segment. For example, if a 64-bit fingerprint is divided into 4 segments, their identifiers from left to right may be 1, 2, 3 and 4. The identifier of a segment may also be set to some other number or sequence according to the first predetermined rule, as long as segments and identifiers correspond one to one within a fingerprint; for example, the identifiers of the 4 segments above may be a, b, c and d from left to right.
In one alternative of this embodiment, similar fingerprints may be, but are not limited to being, searched for using the SIMHASH algorithm, with similarity judged by Hamming distance; in other alternatives, other similar-information identification algorithms may be used.
In one alternative of this embodiment, searching the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved that is carried in the retrieval request includes:
according to the first predetermined rule, obtaining, from the fingerprint of the information to be retrieved that is carried in the retrieval request, the segment corresponding to the identifier carried in the retrieval request;
searching the stored segments for segments identical to the obtained segment, where each stored segment belongs to at least one of the stored fingerprints; and
selecting, from the stored fingerprints, the fingerprints corresponding to the segments found, and comparing each selected fingerprint with the fingerprint of the information to be retrieved to find the similar fingerprints.
In this alternative, in addition to fingerprints, the server also stores at least one segment of a fingerprint. The segment may be sent to the server by the client together with the fingerprint to be stored when the client sends a storage request, or the server may extract it from the fingerprint to be stored according to the segment identifier carried in the storage request. Correspondingly, when sending the fingerprint of the information to be stored to the server, the client may send the identifier of a segment obtained according to the first predetermined rule, or send the segment itself. The client may determine the server corresponding to each segment according to the second predetermined rule, and send the segment or its identifier to that server.
In this alternative, the server may store at least one segment for every stored fingerprint, or may store segments for only some of the stored fingerprints. In other alternatives, the segments of fingerprints need not be stored at all and only the fingerprints themselves are stored.
This alternative amounts to a two-step lookup: first the fingerprints that share a segment with the fingerprint of the information to be retrieved are selected, and then similar fingerprints are searched for among them. This reduces the number of fingerprints that must be compared when searching for similar fingerprints, and so improves lookup efficiency.
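The two-step lookup can be sketched with a plain list scan. The sketch assumes 4 equal segments with sequence-number identifiers and a similarity threshold of n = 3, and it treats a stored fingerprint as sharing the segment if any of its segments has the same value; the function names are illustrative.

```python
from typing import Iterable, List


def split_fingerprint(fp: int, k: int = 4, bits: int = 64) -> List[int]:
    """Split a fingerprint into k equal segments, most significant first."""
    seg_bits = bits // k
    mask = (1 << seg_bits) - 1
    return [(fp >> (bits - seg_bits * (i + 1))) & mask for i in range(k)]


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of differing bit positions."""
    return bin(fp_a ^ fp_b).count("1")


def find_similar(stored_fps: Iterable[int], query_fp: int,
                 request_ident: int, n: int = 3) -> List[int]:
    # Step 1: take the query segment named by the request identifier and
    # keep only stored fingerprints that contain a segment with that value.
    target = split_fingerprint(query_fp)[request_ident - 1]
    candidates = [fp for fp in stored_fps if target in split_fingerprint(fp)]
    # Step 2: run the full Hamming comparison on the reduced candidate set.
    return [fp for fp in candidates if hamming_distance(fp, query_fp) < n]
```

The cheap segment equality test discards most fingerprints before the bitwise comparison runs. A real server would replace the list scan in step 1 with an index lookup.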
In this alternative, one way of storing fingerprints in correspondence with segments is to use the value of a segment as the key and all fingerprints containing that segment as the corresponding value. Thus the value corresponding to a segment value consists of one or more fingerprints, and those fingerprints are the fingerprints corresponding to the segment that are filtered out.
In this embodiment, filtering out, among the stored fingerprints, the fingerprints corresponding to the found segment may include:
looking up the corresponding value using the value of the found segment as the key; the value corresponding to a segment value is all of the stored fingerprints that contain that segment.
In this embodiment, segments with the same value are stored only once on a server, and the one or more fingerprints containing that segment are all stored against it. Suppose a 64-bit fingerprint is divided into 4 segments and the value of one 16-bit segment stored on the server is "1010101010101010"; then every fingerprint stored on that server that contains the segment "1010101010101010" can be part of the value for that segment.
Preferably, segments may also be stored grouped by identifier; for example, all segments whose identifier is "1" are stored under the index "1", and so on. A fingerprint in the value corresponding to a segment value must then not only contain the segment, but the identifier of that segment within the fingerprint must also equal the index of the segment. For example, if the identifier of a segment is its sequence number, and a 16-bit segment under index "1" has the value "1010101010101010", then the first segment of every fingerprint in the corresponding value is "1010101010101010"; a fingerprint in which this value appears only as the second, third or fourth segment does not belong to the value of that segment. In this way, when screening for fingerprints that share a segment with the fingerprint of the information to be retrieved, fewer fingerprints are filtered out, which further improves lookup efficiency.
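The key/value storage indexed by segment identifier can be sketched as follows. This is only an illustrative sketch, assuming 64-bit fingerprints split into four equal 16-bit segments numbered 1 to 4 from left to right; the names `segment` and `SegmentIndex` are hypothetical and do not come from the application.

```python
from collections import defaultdict

SEGMENT_BITS = 16  # assumption: 64-bit fingerprint, 4 equal segments
NUM_SEGMENTS = 4


def segment(fp: int, ident: int) -> int:
    """Return the segment with identifier `ident` (1 = leftmost 16 bits)."""
    shift = (NUM_SEGMENTS - ident) * SEGMENT_BITS
    return (fp >> shift) & 0xFFFF


class SegmentIndex:
    """Hypothetical store: index[identifier][segment value] -> fingerprints."""

    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(set))

    def store(self, fp: int, ident: int) -> None:
        # The segment value is the key; fingerprints sharing it form the value.
        self.index[ident][segment(fp, ident)].add(fp)

    def candidates(self, query_fp: int, ident: int) -> set:
        # Only fingerprints whose segment `ident` equals the query's are returned.
        return set(self.index[ident].get(segment(query_fp, ident), set()))
```

Because the lookup is restricted to one identifier, a stored fingerprint that contains the same 16-bit value in a different position is not returned as a candidate.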
Other embodiments may establish and store the correspondence between segments and fingerprints in other ways; for example, each segment may be stored in one-to-one correspondence with the fingerprint it belongs to.
In one embodiment of this alternative, the server may also store segments classified by segment identifier, that is, the identifier of a segment is used as the index when the segment is stored. For example, segments whose identifier is "1" are stored together under the index "1", segments whose identifier is "2" are stored together under the index "2", and so on.
In this embodiment, searching the stored segments for a segment identical to the acquired segment may include:
searching, among the stored segments indexed by the identifier carried in the retrieval request, for a segment identical to the acquired segment.
This embodiment narrows the scope of the segment lookup: only fingerprints whose corresponding segment is identical to that of the fingerprint of the information to be retrieved are further checked for similarity, which further speeds up the search for similar fingerprints.
In an alternative of this embodiment, when the segments of a similar fingerprint and the segments of the fingerprint of the information to be retrieved are compared one by one in the predetermined order, both fingerprints may first be divided into segments and then compared in turn. For example, two 64-bit fingerprints FP1 and FP2 are each divided into 4 segments, giving FP1-1, FP1-2, FP1-3, FP1-4 for FP1 and FP2-1, FP2-2, FP2-3, FP2-4 for FP2. Assuming the predetermined order is left to right, FP1-1 and FP2-1 are compared first; if they are identical, the comparison stops, that is, no further segment comparison is performed for this similar fingerprint; if they are not identical, FP1-2 and FP2-2 are compared, and so on. Alternatively, division and comparison may proceed together: assuming the predetermined order is left to right and each segment is 16 bits, bits 1 to 16 of FP1 and FP2 are taken and compared; if they are identical the comparison stops, and if not, bits 17 to 32 of FP1 and FP2 are taken and compared, and so on.
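The left-to-right comparison that stops at the first identical segment can be illustrated with a short sketch. It assumes equal-width segments and sequence numbers starting at 1; the function names `split` and `first_identical_segment` are hypothetical.

```python
from typing import List, Optional


def split(fp: int, k: int = 4, bits: int = 64) -> List[int]:
    """Split a fingerprint into k equal segments, leftmost segment first."""
    width = bits // k
    mask = (1 << width) - 1
    return [(fp >> (bits - width * (i + 1))) & mask for i in range(k)]


def first_identical_segment(fp1: int, fp2: int, k: int = 4, bits: int = 64) -> Optional[int]:
    """Return the identifier (1-based) of the first identical segment, or None."""
    pairs = zip(split(fp1, k, bits), split(fp2, k, bits))
    for ident, (a, b) in enumerate(pairs, start=1):
        if a == b:
            return ident  # stop comparing as soon as a match is found
    return None
```

For instance, if only the leftmost segments of two fingerprints differ, the function returns 2 and the third and fourth segments are never compared.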
In an alternative of this embodiment, if the client also requires the similar fingerprints themselves to be returned, only the similar fingerprints that were counted (that is, those included in the count result) are sent; similar fingerprints that were not counted (that is, those not included in the count result) are not sent.
Embodiment two: a distributed search method applied to a client, as shown in Fig. 2, including steps S210 to S230.
S210: determine the server corresponding to each segment according to the second predetermined rule;
S220: send a retrieval request to the server corresponding to each segment; the retrieval request carries the fingerprint of the information to be retrieved and the identifier of the segment corresponding to that server; the division into segments and the identifier of each segment are determined according to the first predetermined rule;
S230: add up the count results returned by the servers for the retrieval requests to obtain the retrieval result.
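Steps S210 to S230 can be sketched on the client side as follows. This is a minimal sketch under stated assumptions: a "server" is modelled as a plain callable that takes a fingerprint and a segment identifier and returns a count, and `pick_server` stands in for the second predetermined rule; both are hypothetical stand-ins rather than an interface defined by the application.

```python
from typing import Callable, Dict

# Assumption: a server is any callable (fingerprint, identifier) -> count.
Server = Callable[[int, int], int]


def retrieve(fp: int,
             servers: Dict[str, Server],
             pick_server: Callable[[int, int], str],
             k: int = 4) -> int:
    """Send one retrieval request per segment identifier and add up the counts."""
    total = 0
    for ident in range(1, k + 1):
        name = pick_server(fp, ident)       # S210: server for this segment
        total += servers[name](fp, ident)   # S220: request carries fp and identifier
    return total                            # S230: sum of the count results
```

Since each server counts a similar fingerprint only for the identifier of its first identical segment, adding the returned counts does not count the same fingerprint twice.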
In this embodiment, the same fingerprint may be stored on multiple servers. When the same fingerprint is stored on multiple servers, the servers corresponding to the different segments of the fingerprint may likewise be determined according to the second predetermined rule, and when storing a fingerprint the server also stores the corresponding segment or identifier. If a fingerprint is stored on only one server, the method of this embodiment is still applicable, but in that case the fingerprint can be counted at most once, so the problem of count deduplication does not arise. In an alternative of this embodiment, the fingerprint of the information to be retrieved may be, but is not limited to being, computed by the SIMHASH algorithm; other duplicate-information identification algorithms capable of computing a fingerprint (also called a feature word) are also applicable. The computation may be performed by the client, or performed by another device and the result sent to the client.
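For readers unfamiliar with SIMHASH, the following is a generic sketch of how a 64-bit fingerprint may be computed from a list of tokens. The application does not specify tokenization, weighting or the underlying hash; the use of SHA-256 truncated to 64 bits and equal token weights are assumptions made only for this illustration.

```python
import hashlib
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Generic SimHash sketch: per-bit vote over token hashes (equal weights)."""
    votes = [0] * bits
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")  # truncate to `bits` bits
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, v in enumerate(votes):
        if v > 0:  # bit is set when more tokens vote 1 than 0
            fp |= 1 << i
    return fp
```

Texts that share most of their tokens tend to produce fingerprints that differ in only a few bits, which is what makes the segment-based lookup described here useful.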
In this embodiment, for details of the first predetermined rule see embodiment one; the client uses the same first predetermined rule as the server.
In an alternative of this embodiment, the identifier of a segment may be, but is not limited to, the sequence number of the segment. For example, if a 64-bit fingerprint is divided into 4 segments, the identifiers of the 4 segments from left to right may be 1, 2, 3, 4. The identifier may also be set to another number or sequence according to the first predetermined rule, provided that within one fingerprint segments and identifiers are in one-to-one correspondence; for example, the identifiers of the above 4 segments from left to right may also be a, b, c, d.
In an alternative of this embodiment, determining the server corresponding to each segment according to the second predetermined rule may include:
dividing the fingerprint of the information to be retrieved into K segments according to the first predetermined rule;
performing a hash (HASH) operation on the number of servers using the value of each segment, and determining the server corresponding to that segment according to the result.
In this alternative, the K segments yield J distinct hash results, where 1 ≤ J ≤ K. That is, segments and servers may be in one-to-one correspondence, or two or more segments may correspond to the same server. Because segment identifiers and segments are in one-to-one correspondence, the correspondence between segments and servers is equivalent to the correspondence between segment identifiers and servers.
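One possible reading of the hash operation on the number of servers is "hash the segment value, then take the result modulo the server count". The sketch below follows that reading; the choice of SHA-256 and the function name `server_for_segment` are assumptions, since the application does not name a particular hash function.

```python
import hashlib


def server_for_segment(seg_value: int, num_servers: int) -> int:
    """Map a segment value to a server index in [0, num_servers)."""
    digest = hashlib.sha256(str(seg_value).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


# Four segments of one fingerprint mapped onto four servers.
segments = [0x0123, 0x4567, 0x89AB, 0xCDEF]
assignment = [server_for_segment(s, 4) for s in segments]
distinct_servers = len(set(assignment))  # this is J, with 1 <= J <= K
```

Because the mapping depends only on the segment value, a storage request and a later retrieval request that share a segment value are routed to the same server.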
In other alternatives, the second predetermined rule may take other forms: for example, the server corresponding to each segment may be obtained by another computation method or rule; or determined according to a correspondence between segments and server identifiers stored in advance on the client (the correspondences stored on different clients may differ, so that retrieval load is shared across different servers); or determined according to the segment identifier. If determining the server corresponding to a segment does not require the segment itself, the client need not divide the fingerprint into segments and only needs to obtain the segment identifiers.
In an alternative of this embodiment, if the number of servers determined is less than K, that is, two or more segments correspond to the same server, then in step S220, when sending retrieval requests to a server that corresponds to two or more segments, the client may send the identifiers in separate retrieval requests (for example, two requests for two segments), or may place the identifiers of those segments together in a single retrieval request.
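The option of placing several identifiers in one retrieval request can be sketched by grouping identifiers by their assigned server. The helper name `group_by_server` and the dictionary-based representation are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_server(assignments: Dict[int, str]) -> Dict[str, List[int]]:
    """Group segment identifiers by server so each server gets one request."""
    grouped = defaultdict(list)
    for ident, server in sorted(assignments.items()):
        grouped[server].append(ident)
    return dict(grouped)


# Identifiers 1 and 3 map to the same server, so K = 4 but only 3 servers are used.
requests = group_by_server({1: "A", 2: "B", 3: "A", 4: "D"})
```

Here server A would receive a single retrieval request carrying the fingerprint together with identifiers 1 and 3.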
In an alternative of this embodiment, if the client needs the actual content of the similar fingerprints, it may, when sending the identifier and the fingerprint of the information to be retrieved to the server, also request the server to return the similar fingerprints that were counted.
The above embodiments are illustrated below with a specific example. In this example, the number of segments K is 4 and the fingerprint is divided into equal segments; 64-bit fingerprints are obtained using SIMHASH. Assume fingerprints FP1 and FP2 are similar fingerprints that differ only in bit 27 and bit 55. The identifier of a segment is its sequence number, so the identifiers of the four segments from left to right are 1, 2, 3, 4. The predetermined order is left to right.
Storage phase:
The client divides the fingerprint FP1 of the information to be stored into 4 segments FP1-1, FP1-2, FP1-3, FP1-4, whose identifiers are 1, 2, 3, 4 respectively. A HASH operation is performed on each segment, and according to the results the four segments are mapped to server A, server B, server C and server D respectively. The client sends a storage request carrying fingerprint FP1 and identifier 1 to server A, a storage request carrying fingerprint FP1 and identifier 2 to server B, a storage request carrying fingerprint FP1 and identifier 3 to server C, and a storage request carrying fingerprint FP1 and identifier 4 to server D.
Taking server A as an example: it first uses the first segment FP1-1 of fingerprint FP1 to check whether an identical segment exists in the set indexed by identifier 1; if not, it stores segment FP1-1 in the set indexed by identifier 1. The set may take the form of, but is not limited to, a segment table; server A stores all segments whose identifier is 1 in this segment table.
Server A may also store fingerprint FP1 in the value corresponding to the key given by the value of FP1-1; the value may take the form of, but is not limited to, a fingerprint list.
If server A later receives a storage request carrying fingerprint FP3 with identifier 1, and the first segment FP3-1 of FP3 is identical to FP1-1, then when server A uses FP3-1 to look for an identical segment in the segment table indexed by identifier 1, it finds FP1-1; server A therefore also adds FP3 to the value whose key is the value of FP1-1, that is, to the fingerprint list keyed by the value of FP1-1.
The other servers operate similarly, and the details are not repeated.
It should be noted that count deduplication at retrieval time can still be achieved even if the storage-phase approach of this example is not used, for example if the fingerprints themselves are stored directly on the servers. The storage method given in this example improves the efficiency of retrieving similar fingerprints, but has no effect on count deduplication.
Retrieval phase:
The client divides the fingerprint FP2 of the information to be retrieved into 4 segments FP2-1, FP2-2, FP2-3, FP2-4, whose identifiers are 1, 2, 3, 4 respectively. A HASH operation is performed on each segment, and according to the results the four segments are mapped to four servers, assumed here to be server A, server B, server C and server D again. The client sends a retrieval request carrying fingerprint FP2 and identifier 1 to server A, a retrieval request carrying fingerprint FP2 and identifier 2 to server B, a retrieval request carrying fingerprint FP2 and identifier 3 to server C, and a retrieval request carrying fingerprint FP2 and identifier 4 to server D.
Taking server C as an example: in the segment table indexed by identifier 3, it searches for a segment identical to FP2-3 and finds FP1-3 (or a segment identical to FP1-3; assumed here to be FP1-3). It then searches for fingerprints similar to FP2 in the fingerprint list keyed by the value of FP1-3, obtaining one or more fingerprints including FP1. FP1-1 and FP2-1 are compared first and found to be identical, but the identifier 1 of that segment differs from the identifier 3 in the retrieval request, so server C does not include fingerprint FP1 in its count result.
Server A performs similar operations, except that the identifier 1 of the first identical segment equals the identifier 1 in the retrieval request, so server A includes fingerprint FP1 in its count result.
Because FP1 and FP2 differ in bits 27 and 55, FP1-2 differs from FP2-2 and FP1-4 differs from FP2-4; server B and server D therefore find no identical segment in their corresponding segment tables, and the similar fingerprints they obtain do not include FP1.
It can be seen that with the above scheme, count deduplication is achieved on the server side.
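The retrieval-phase example can be checked numerically with a short sketch. The concrete value chosen for FP1 is arbitrary and hypothetical; bit positions are counted from the left starting at 1, as in the example, and FP1 and FP2 are simply assumed to be similar, so no similarity test is performed here.

```python
def split(fp: int) -> list:
    """Four 16-bit segments of a 64-bit fingerprint, leftmost first."""
    return [(fp >> s) & 0xFFFF for s in (48, 32, 16, 0)]


def flip(fp: int, pos: int) -> int:
    """Flip the bit at position `pos`, counted from the left starting at 1."""
    return fp ^ (1 << (64 - pos))


FP1 = 0x0123456789ABCDEF            # arbitrary stored fingerprint
FP2 = flip(flip(FP1, 27), 55)       # query differing only in bits 27 and 55


def server_count(stored_fp: int, query_fp: int, ident: int) -> int:
    """Count contributed by the server that handles identifier `ident`."""
    s, q = split(stored_fp), split(query_fp)
    if s[ident - 1] != q[ident - 1]:
        return 0  # segment lookup finds nothing (servers B and D)
    first = next(i for i, (a, b) in enumerate(zip(s, q), start=1) if a == b)
    return 1 if first == ident else 0  # server C finds FP1 but does not count it


differing = [i for i, (a, b) in enumerate(zip(split(FP1), split(FP2)), start=1) if a != b]
counts = [server_count(FP1, FP2, ident) for ident in (1, 2, 3, 4)]
```

With these values, `differing` is `[2, 4]` and `counts` is `[1, 0, 0, 0]`, so the client's sum counts FP1 exactly once.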
In this example, suppose the storage request and the retrieval request with identifier 3 are instead sent to server A. Server A then filters out FP1 twice when retrieving fingerprints similar to FP2, but because the identifier of the first identical segment is 1, FP1 is not included in the count result for the retrieval request with identifier 3, so FP1 is still counted only once. Thus, even if multiple retrieval requests carrying the same fingerprint of the information to be retrieved but different identifiers are sent to the same server, count deduplication is still performed at the server.
In this example, the segmentation rule followed during storage and retrieval is the same, and the rule for determining the corresponding server is also the same (a hash operation on the segment value, with the server determined from the hash result), so the storage request carrying FP1 and identifier 1 (or 3) and the retrieval request carrying FP2 and identifier 1 (or 3) are sent to the same server, which avoids missed matches.
Embodiment three: a distributed search device, arranged at a server, as shown in Fig. 3, including:
a similar fingerprint lookup module 31, configured to, after a retrieval request is received, search the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved carried in the retrieval request;
a counting module 32, configured to perform the following operations for each similar fingerprint found: compare the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in the predetermined order, and stop the comparison once the first identical segment is found; compare the identifier of that identical segment with the identifier carried in the retrieval request, and, if they are the same, include the similar fingerprint in the count result; wherein the division into segments and the identifier of each segment are determined according to the first predetermined rule;
a response module 33, configured to return the count result.
The similar fingerprint lookup module 31 is the part of the device responsible for retrieving similar fingerprints, and may be software, hardware, or a combination of both.
The counting module 32 is the part of the device responsible for counting and count deduplication, and may be software, hardware, or a combination of both.
The response module 33 is the part of the device responsible for returning the result, and may be software, hardware, or a combination of both.
In an alternative of this embodiment, the counting module may also be configured not to include the corresponding similar fingerprint in the count result when the identifier of the identical segment differs from the identifier carried in the retrieval request, for example by simply ignoring that similar fingerprint.
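The cooperation of the lookup, counting and response modules can be sketched as a single request handler. This is a minimal sketch that scans stored fingerprints directly instead of using a segment index, and it assumes that "similar" means a Hamming distance of at most 3 bits; that threshold, like the function names, is an assumption for illustration and is not stated in this section.

```python
from typing import Iterable, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def split(fp: int, k: int = 4, bits: int = 64) -> List[int]:
    width = bits // k
    mask = (1 << width) - 1
    return [(fp >> (bits - width * (i + 1))) & mask for i in range(k)]


def handle_retrieval(stored: Iterable[int], query_fp: int, ident: int,
                     max_distance: int = 3) -> int:
    """Return the count result for one retrieval request (fingerprint + identifier)."""
    count = 0
    for fp in stored:
        if hamming(fp, query_fp) > max_distance:
            continue  # lookup module: not a similar fingerprint
        pairs = zip(split(fp), split(query_fp))
        first = next((i for i, (a, b) in enumerate(pairs, start=1) if a == b), None)
        if first == ident:
            count += 1  # counting module: identifiers match, include it
        # otherwise the similar fingerprint is ignored for this request
    return count  # response module: return the count result
```

If one server receives requests for identifiers 1 and 3 of the same query, a similar fingerprint whose first identical segment is segment 1 contributes 1 to the first request and 0 to the second.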
In an alternative of this embodiment, the identifier of a segment may be, but is not limited to, the sequence number of the segment.
In an alternative of this embodiment, the similar fingerprint lookup module 31, as shown in Fig. 4, may include:
an acquiring unit 311, configured to obtain, according to the first predetermined rule, the segment corresponding to the identifier carried in the retrieval request from the fingerprint of the information to be retrieved carried in the retrieval request;
a segment comparing unit 312, configured to search the stored segments for a segment identical to the acquired segment; wherein each stored segment belongs to at least one of the stored fingerprints;
a fingerprint comparing unit 313, configured to filter out, among the stored fingerprints, the fingerprints corresponding to the found segment, and compare each filtered fingerprint with the fingerprint of the information to be retrieved to find the similar fingerprints.
The acquiring unit 311 is the part of the similar fingerprint lookup module 31 responsible for obtaining the segment to be compared, and may be software, hardware, or a combination of both.
The segment comparing unit 312 is the part of the similar fingerprint lookup module 31 responsible for finding an identical segment, and may be software, hardware, or a combination of both.
The fingerprint comparing unit 313 is the part of the similar fingerprint lookup module 31 responsible for filtering out similar fingerprints, and may be software, hardware, or a combination of both.
In an alternative of this embodiment, the fingerprint comparing unit 313 filtering out, among the stored fingerprints, the fingerprints corresponding to the found segment includes:
the fingerprint comparing unit 313 looking up the corresponding value using the value of the found segment as the key; the value corresponding to a segment value is all of the stored fingerprints that contain that segment.
In an alternative of this embodiment, the segment comparing unit 312 searching the stored segments for a segment identical to the acquired segment includes:
the segment comparing unit 312 searching, among the stored segments indexed by the identifier carried in the retrieval request, for a segment identical to the acquired segment.
Other implementation details can be found in embodiment one.
Embodiment four: a distributed search device, arranged at a client, as shown in Fig. 5, including:
a determining module 41, configured to determine the server corresponding to each segment according to the second predetermined rule;
a request module 42, configured to send a retrieval request to the server corresponding to each segment; the retrieval request carries the fingerprint of the information to be retrieved and the identifier of the segment corresponding to that server; wherein the division into segments and the identifier of each segment are determined according to the first predetermined rule;
a computing module 43, configured to add up the count results returned by the servers for the retrieval requests to obtain the retrieval result.
The determining module 41 is the part of the device responsible for determining the segment identifiers and the corresponding servers, and may be software, hardware, or a combination of both.
The request module 42 is the part of the device responsible for sending retrieval requests, and may be software, hardware, or a combination of both.
The computing module 43 is the part of the device responsible for computing the retrieval result, and may be software, hardware, or a combination of both.
In an alternative of this embodiment, the identifier of a segment may be, but is not limited to, the sequence number of the segment.
In an alternative of this embodiment, the determining module 41, as shown in Fig. 6, may include:
a division unit 411, configured to divide the fingerprint of the information to be retrieved into K segments according to the first predetermined rule;
a hash operation unit 412, configured to perform a hash operation on the number of servers using the value of each segment, and determine the server corresponding to that segment according to the result.
The division unit 411 is the part of the determining module 41 responsible for dividing the fingerprint, and may be software, hardware, or a combination of both. With the division unit 411 arranged in the determining module 41, the hash operation unit 412 performs the hash operation on the segments produced by the division unit 411.
The hash operation unit 412 is the part of the determining module 41 responsible for performing the hash operation, and may be software, hardware, or a combination of both.
Other implementation details can be found in embodiment two.
Embodiment five: a distributed search method, including the method applied to the server in embodiment one and the method applied to the client in embodiment two.
Embodiment six: a distributed search system, including the device arranged at the server in embodiment three and the device arranged at the client in embodiment four.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above methods may be carried out by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, magnetic disk or optical disc. Alternatively, all or some of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software functional module. The application is not limited to any particular combination of hardware and software.
Of course, the application may have various other embodiments. Without departing from the spirit and essence of the application, those skilled in the art may make various corresponding changes and modifications according to the application, but all such changes and modifications shall fall within the scope of protection of the claims of the application.

Claims (18)

1. A distributed search method, applied to a server, comprising:
after receiving a retrieval request, searching the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved carried in the retrieval request;
for each similar fingerprint found, performing the following operations: comparing the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in a predetermined order, and stopping the comparison once the first identical segment is found; comparing the identifier of that identical segment with the identifier carried in the retrieval request, and, if they are the same, including the similar fingerprint in a count result; wherein the division into segments and the identifier of each segment are determined according to a first predetermined rule;
returning the count result.
2. The method of claim 1, further comprising:
if the identifier of the identical segment differs from the identifier carried in the retrieval request, not including the corresponding similar fingerprint in the count result.
3. The method of claim 1, wherein:
the identifier of a segment is the sequence number of the segment.
4. The method of any one of claims 1 to 3, wherein searching the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved carried in the retrieval request comprises:
obtaining, according to the first predetermined rule, the segment corresponding to the identifier carried in the retrieval request from the fingerprint of the information to be retrieved carried in the retrieval request;
searching the stored segments for a segment identical to the acquired segment; wherein each stored segment belongs to at least one of the stored fingerprints;
filtering out, among the stored fingerprints, the fingerprints corresponding to the found segment, and comparing each filtered fingerprint with the fingerprint of the information to be retrieved to find the similar fingerprints.
5. The method of claim 4, wherein filtering out, among the stored fingerprints, the fingerprints corresponding to the found segment comprises:
looking up the corresponding value using the value of the found segment as the key; the value corresponding to a segment value is all of the stored fingerprints that contain that segment.
6. The method of claim 4, wherein searching the stored segments for a segment identical to the acquired segment comprises:
searching, among the stored segments indexed by the identifier carried in the retrieval request, for a segment identical to the acquired segment.
7. A distributed search method, applied to a client, comprising:
determining the server corresponding to each segment according to a second predetermined rule;
sending a retrieval request to the server corresponding to each segment; the retrieval request carries the fingerprint of the information to be retrieved and the identifier of the segment corresponding to that server; wherein the division into segments and the identifier of each segment are determined according to a first predetermined rule;
adding up the count results returned by the servers for the retrieval requests to obtain the retrieval result.
8. The method of claim 7, wherein:
the identifier of a segment is the sequence number of the segment.
9. The method of claim 7 or 8, wherein determining the server corresponding to each segment according to the second predetermined rule comprises:
dividing the fingerprint of the information to be retrieved into K segments according to the first predetermined rule;
performing a hash operation on the number of servers using the value of each segment, and determining the server corresponding to that segment according to the result.
10. A distributed search device, arranged at a server, comprising:
a similar fingerprint lookup module, configured to, after a retrieval request is received, search the stored fingerprints for similar fingerprints according to the fingerprint of the information to be retrieved carried in the retrieval request;
a counting module, configured to perform the following operations for each similar fingerprint found: compare the segments of the similar fingerprint with the corresponding segments of the fingerprint of the information to be retrieved one by one in a predetermined order, and stop the comparison once the first identical segment is found; compare the identifier of that identical segment with the identifier carried in the retrieval request, and, if they are the same, include the similar fingerprint in a count result; wherein the division into segments and the identifier of each segment are determined according to a first predetermined rule;
a response module, configured to return the count result.
11. The device of claim 10, wherein:
the counting module is further configured not to include the corresponding similar fingerprint in the count result when the identifier of the identical segment differs from the identifier carried in the retrieval request.
12. The device of claim 10, wherein:
the identifier of a segment is the sequence number of the segment.
13. The device of any one of claims 10 to 12, wherein the similar fingerprint lookup module comprises:
an acquiring unit, configured to obtain, according to the first predetermined rule, the segment corresponding to the identifier carried in the retrieval request from the fingerprint of the information to be retrieved carried in the retrieval request;
a segment comparing unit, configured to search the stored segments for a segment identical to the acquired segment; wherein each stored segment belongs to at least one of the stored fingerprints;
a fingerprint comparing unit, configured to filter out, among the stored fingerprints, the fingerprints corresponding to the found segment, and compare each filtered fingerprint with the fingerprint of the information to be retrieved to find the similar fingerprints.
14. The device of claim 13, wherein the fingerprint comparing unit filtering out, among the stored fingerprints, the fingerprints corresponding to the found segment comprises:
the fingerprint comparing unit looking up the corresponding value using the value of the found segment as the key; the value corresponding to a segment value is all of the stored fingerprints that contain that segment.
15. The device of claim 13, wherein the segment comparing unit searching the stored segments for a segment identical to the acquired segment comprises:
the segment comparing unit searching, among the stored segments indexed by the identifier carried in the retrieval request, for a segment identical to the acquired segment.
16. A distributed search device, arranged at a client, characterized by comprising:
A determining module, configured to determine the server corresponding to each segment according to a second predetermined rule;
A request module, configured to send a retrieval request to the server corresponding to each segment, wherein the retrieval request carries the fingerprint of the information to be retrieved and the identifier of the segment corresponding to that server, and wherein the division into segments and the identifier of each segment are determined according to a first predetermined rule;
A computing module, configured to add up the count results returned by the servers for the retrieval requests to obtain the retrieval result.
17. The device as claimed in claim 16, characterized in that:
The identifier of a segment is the sequence number of that segment.
18. The device as claimed in claim 16 or 17, characterized in that the determining module comprises:
A division unit, configured to divide the fingerprint of the information to be retrieved into K segments according to the first predetermined rule;
A hash operation unit, configured to perform a hash operation on the number of servers using the value of each segment, and to determine the server corresponding to that segment according to the operation result.
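The interaction between the client-side device (claims 16 to 18) and the server-side device (claims 13 to 15) can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: 64-bit fingerprints, K = 4 equal bit slices as the first predetermined rule, "segment value modulo server count" as the second predetermined rule, Hamming distance of at most 3 bits as the similarity test, and counting a similar fingerprint only at the lowest-numbered segment it shares with the query so that the summed result has no double counts. All class and function names are hypothetical.

```python
K = 4                     # number of segments per fingerprint (assumed)
FP_BITS = 64              # fingerprint width in bits (assumed)
SEG_BITS = FP_BITS // K
MAX_DISTANCE = 3          # similarity threshold in differing bits (assumed)


def split(fingerprint):
    """First predetermined rule (assumed): K equal bit slices.
    The identifier of a segment is its sequence number (list index)."""
    mask = (1 << SEG_BITS) - 1
    return [(fingerprint >> (SEG_BITS * i)) & mask for i in range(K)]


def server_for(segment_value, num_servers):
    """Second predetermined rule (assumed): segment value modulo server count."""
    return segment_value % num_servers


class Server:
    def __init__(self):
        # (segment identifier, segment value) -> stored fingerprints containing it
        self.index = {}

    def store(self, fingerprint, seg_id):
        key = (seg_id, split(fingerprint)[seg_id])
        self.index.setdefault(key, set()).add(fingerprint)

    def search(self, query, seg_id):
        """Count similar fingerprints reachable through segment seg_id."""
        q_segs = split(query)
        count = 0
        # Key lookup: segment value -> every stored fingerprint containing it.
        for candidate in self.index.get((seg_id, q_segs[seg_id]), ()):
            if bin(candidate ^ query).count("1") > MAX_DISTANCE:
                continue  # shares a segment but is not similar overall
            c_segs = split(candidate)
            first_equal = next(i for i in range(K) if c_segs[i] == q_segs[i])
            # Assumed de-duplication rule: count only where the identifier of
            # the first identical segment matches the identifier in the request.
            if first_equal == seg_id:
                count += 1
        return count


def store(servers, fingerprint):
    """Place a fingerprint on the server selected by each of its segments."""
    for seg_id, value in enumerate(split(fingerprint)):
        servers[server_for(value, len(servers))].store(fingerprint, seg_id)


def retrieve(servers, query):
    """Client side: one request per segment, then add up the returned counts."""
    return sum(
        servers[server_for(value, len(servers))].search(query, seg_id)
        for seg_id, value in enumerate(split(query))
    )
```

With K segments and a threshold below K, any fingerprint within the threshold must share at least one whole segment with the query, which is why looking up candidates by identical segment does not miss similar fingerprints in this sketch.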
CN201610105198.9A 2016-02-25 2016-02-25 A kind of distributed search method and device Pending CN107122370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610105198.9A CN107122370A (en) 2016-02-25 2016-02-25 A kind of distributed search method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610105198.9A CN107122370A (en) 2016-02-25 2016-02-25 A kind of distributed search method and device

Publications (1)

Publication Number Publication Date
CN107122370A true CN107122370A (en) 2017-09-01

Family

ID=59717519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610105198.9A Pending CN107122370A (en) 2016-02-25 2016-02-25 A kind of distributed search method and device

Country Status (1)

Country Link
CN (1) CN107122370A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471921A (en) * 2018-11-23 2019-03-15 深圳市元征科技股份有限公司 A kind of text duplicate checking method, device and equipment
CN109582674A (en) * 2018-11-28 2019-04-05 亚信科技(南京)有限公司 A kind of date storage method and system
CN110135353A (en) * 2019-05-17 2019-08-16 北京海鑫高科指纹技术有限公司 A kind of method and system excluding victim and relevant people scene fingers and palms line
CN110149529A (en) * 2018-11-01 2019-08-20 腾讯科技(深圳)有限公司 Processing method, server and the storage medium of media information
CN116467481A (en) * 2022-12-14 2023-07-21 喜鹊科技(广州)有限公司 Information processing method and system based on cloud computing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102495894A (en) * 2011-12-12 2012-06-13 成都市华为赛门铁克科技有限公司 Method, device and system for searching repeated data
CN103248609A (en) * 2012-02-06 2013-08-14 同方股份有限公司 System, device and method for detecting data from end to end
CN103646080A (en) * 2013-12-12 2014-03-19 北京京东尚科信息技术有限公司 Microblog duplication-eliminating method and system based on reverse-order index
CN110399464A (en) * 2019-07-30 2019-11-01 广州吉信网络科技开发有限公司 A kind of similar news method of discrimination, system and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
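王源 (Wang Yuan): "A fast text deduplication algorithm based on Simhash" (一种基于Simhash的文本快速去重算法), China Master's Theses Full-text Database, Information Science and Technology series *
观澜而索源: "Similarity computation on massive data: simhash short-text lookup" (海量数据相似度计算之simhash短文本查找), CSDN *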

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110149529A (en) * 2018-11-01 2019-08-20 腾讯科技(深圳)有限公司 Processing method, server and the storage medium of media information
CN109471921A (en) * 2018-11-23 2019-03-15 深圳市元征科技股份有限公司 A kind of text duplicate checking method, device and equipment
CN109582674A (en) * 2018-11-28 2019-04-05 亚信科技(南京)有限公司 A kind of date storage method and system
CN109582674B (en) * 2018-11-28 2023-12-22 亚信科技(南京)有限公司 Data storage method and system
CN110135353A (en) * 2019-05-17 2019-08-16 北京海鑫高科指纹技术有限公司 A kind of method and system excluding victim and relevant people scene fingers and palms line
CN116467481A (en) * 2022-12-14 2023-07-21 喜鹊科技(广州)有限公司 Information processing method and system based on cloud computing
CN116467481B (en) * 2022-12-14 2023-12-01 要务(深圳)科技有限公司 Information processing method and system based on cloud computing

Similar Documents

Publication Publication Date Title
CN107122370A (en) A kind of distributed search method and device
CN106033416B (en) Character string processing method and device
JP5328808B2 (en) Data clustering method, system, apparatus, and computer program for applying the method
CN114389834B (en) Method, device, equipment and product for identifying abnormal call of API gateway
EP2095277B1 (en) Fuzzy database matching
CN104636349B (en) A kind of index data compression and the method and apparatus of index data search
CN109062936B (en) Data query method, computer readable storage medium and terminal equipment
US20160147867A1 (en) Information matching apparatus, information matching method, and computer readable storage medium having stored information matching program
CN104142946A (en) Method and system for aggregating and searching service objects of same type
CN110728526A (en) Address recognition method, apparatus and computer readable medium
US7584173B2 (en) Edit distance string search
CN116631561B (en) Patient identity information matching method and device based on feature division and electronic equipment
CN105138912A (en) Method and device for generating phishing website detection rules automatically
CN112035621A (en) Enterprise name similarity detection method based on statistics
CN109286622B (en) Network intrusion detection method based on learning rule set
CN115189914A (en) Application Programming Interface (API) identification method and device for network traffic
US20190130034A1 (en) Fingerprint clustering for content-based audio recognition
US8370390B1 (en) Method and apparatus for identifying near-duplicate documents
CN108319626B (en) Object classification method and device based on name information
CN114124484A (en) Network attack identification method, system, device, terminal equipment and storage medium
CN101414299B (en) Method and apparatus for repairing composite document
CN113821630A (en) Data clustering method and device
CN114943285B (en) Intelligent auditing system for internet news content data
CN109460407A (en) A kind of information storage means and system
CN111428482B (en) Information identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170901
