CN108614827A - Data segmentation method, duplicate detection method and electronic device


Info

Publication number
CN108614827A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611139984.7A
Other languages
Chinese (zh)
Inventor
薛亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd

Abstract

This application provides a data segmentation method, a duplicate detection method and an electronic device. The duplicate detection method includes: obtaining the signature of the data to be processed; performing at least one level of segmentation on the signature to obtain at least two signature prefixes; and performing duplicate detection on the data to be processed according to the at least two signature prefixes. The technical solution of this application can improve the time efficiency of duplicate detection.

Description

Data segmentation method, duplicate detection method and electronic device
Technical field
This application relates to the field of Internet technology, and in particular to a data segmentation method, a duplicate detection method and an electronic device.
Background
When processing massive data (such as documents or web pages), duplicate detection is usually performed on the data in order to save storage space. The current mainstream practice in the industry is to perform duplicate detection based on the SimHash algorithm. SimHash is the most commonly used hash method for data deduplication. Its principle is: select the number of bits of the SimHash value; initialize every bit of the SimHash value to 0; extract the features of the data to be signed; compute the hash value of each feature with a traditional hash function; for each bit of the hash value of each feature, add 1 to the corresponding position of the SimHash value if that bit is 1, and otherwise subtract 1; finally, for each position of the resulting SimHash value, set it to 1 if the value at that position is greater than 1, and otherwise set it to 0, which yields the SimHash signature. The SimHash algorithm is fast.
The duplicate detection process based on SimHash is as follows: compute SimHash signatures for the historical data and store them; for new data, first compute its SimHash signature, then compare whether it is similar to the SimHash signatures of the historical data, in order to judge whether the new data already exists in the historical data.
In the worst case, the above scheme has to traverse the SimHash signatures of all historical data for every comparison. Although the time complexity is O(n), the volume of historical data is large (on a crawler platform, for example, it is typically in the hundreds of millions), so the time efficiency is still low.
Summary of the invention
This application provides a data segmentation method, a duplicate detection method and an electronic device, in order to improve the time efficiency of duplicate detection.
To achieve the above objective, the embodiments of this application adopt the following technical solutions:
In a first aspect, a data segmentation method is provided, including:
Obtaining the signature of the data to be processed;
Performing at least one level of segmentation on the signature to obtain at least two signature prefixes;
Storing each of the at least two signature prefixes in correspondence with the signature.
In a second aspect, a duplicate detection method is provided, including:
Obtaining the signature of the data to be processed;
Performing at least one level of segmentation on the signature to obtain at least two signature prefixes;
Performing duplicate detection on the data to be processed according to the at least two signature prefixes.
In a third aspect, an electronic device is provided, including:
A memory, for storing a program;
A processor, for executing the program in order to:
Obtain the signature of the data to be processed;
Perform at least one level of segmentation on the signature to obtain at least two signature prefixes;
Store each of the at least two signature prefixes in correspondence with the signature.
In a fourth aspect, an electronic device is provided, including:
A memory, for storing a program;
A processor, for executing the program in order to:
Obtain the signature of the data to be processed;
Perform at least one level of segmentation on the signature to obtain at least two signature prefixes;
Perform duplicate detection on the data to be processed according to the at least two signature prefixes.
In the embodiments of this application, at least one level of segmentation is performed on the signature of the data to obtain at least two signature prefixes, and the signature prefixes are stored in correspondence with the signature. Duplicate detection is then no longer based on the signature itself but on the signature prefixes. Compared with the signature, a signature prefix has fewer bits and effectively serves as an index of the signature, which helps narrow the data query range and improves the time efficiency of duplicate detection.
The above is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objectives, features and advantages of this application are easier to understand, specific embodiments of this application are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the detailed description of the preferred embodiments below. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of this application. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 is a structural schematic diagram of the business system provided by one embodiment of this application;
Fig. 2a is a flow diagram of the data segmentation method provided by one embodiment of this application;
Fig. 2b is a schematic diagram of the signature segmentation result provided by another embodiment of this application;
Fig. 3 is a flow diagram of the duplicate detection method provided by another embodiment of this application;
Fig. 4 is a structural schematic diagram of the data segmentation device provided by another embodiment of this application;
Fig. 5 is a structural schematic diagram of the electronic device provided by another embodiment of this application;
Fig. 6 is a structural schematic diagram of the duplicate detection device provided by another embodiment of this application;
Fig. 7 is a structural schematic diagram of the electronic device provided by another embodiment of this application.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
To address the low time efficiency of duplicate detection in the prior art, this application provides a solution whose main principle is: segment the signature of the data to obtain at least two signature prefixes, and perform duplicate detection based on the signature prefixes. Compared with the signature, a signature prefix has fewer bits and effectively serves as an index of the signature, which helps narrow the data query range and improves the time efficiency of duplicate detection.
It is worth noting that the method provided by the embodiments of this application can be applied to any business system that has duplicate detection logic, for example a crawler system. A crawler system crawls a large number of web pages every day, with newly crawled pages on the order of millions per day. Storing all of them would occupy a large amount of storage space, so it is necessary to judge whether the currently crawled page is similar to historical pages; if the similarity is high, the current page can be discarded to save storage space. With the existing scheme, the efficiency of this duplicate detection is low; the method provided by the embodiments of this application can greatly improve it.
Some business systems may already have stored historical data before the method provided by the embodiments of this application is put into use. Such historical data was deduplicated with the prior-art method, that is, its signatures have not been segmented. To apply the prefix-based duplicate detection method provided by the embodiments of this application, it is necessary not only to implement the duplicate detection logic for new data, but also to segment the signatures of the historical data and store the results.
The data segmentation method and the duplicate detection method provided in this embodiment can be executed by a business system. The business system referred to here may be a business platform that processes massive numbers of web pages, such as a crawler platform, or a document processing platform that processes documents in bulk. Fig. 1 is a structural schematic diagram of the business system of one embodiment of this application; the structure shown in Fig. 1 is only one example of a business system to which the technical solution of the present invention is applicable. The business system includes a data segmentation device and a duplicate detection device. The external call service may be any service that can provide or generate data; it mainly consists of business accesses or service calls made to the business system by other systems or by clients, and it is the main source of new data. The database corresponding to the business system stores the data of the business system, the signatures of the data and the signature prefixes produced by segmenting the signatures. The data, the signatures and the signature prefixes may be stored in separate databases or in the same database. The database also contains some historical data whose signatures have not been segmented.
The data segmentation device mainly executes the process flow shown in Fig. 2a below. It processes the historical data in the database whose signatures have not been segmented, generates the signature prefixes for that historical data and writes them to the database. Through the processing of the data segmentation device, all data in the database gradually acquires signature prefixes, which makes it possible to compare incoming new data against them.
The duplicate detection device mainly executes the process flow shown in Fig. 3 below and mainly processes new data. After new data enters the business system, a signature generation module (which also exists in the prior art) first generates the signature of the new data. The signature is then segmented, and duplicate detection is performed based on the signature prefixes produced by the segmentation. Non-duplicate data is stored in the database together with its signature and signature prefixes; duplicate data is discarded.
It should be noted that some modules of the data segmentation device and of the duplicate detection device may coincide. For example, the first acquisition module and the second acquisition module may be a single module: both obtain signatures and differ only in where the signature comes from. Likewise, the first segmentation module and the second segmentation module may be the same module, since both segment signatures.
The above explains the technical principle and an illustrative application architecture of the embodiments of the present invention. The specific technical solutions of the embodiments are described in detail below.
Based on the above, this application provides a data segmentation method, driven by the needs of historical data, to solve the problem of segmenting and storing data signatures; and a duplicate detection method, driven by the needs of new data, to solve the problem of duplicate detection. Both are described in detail below through different embodiments.
Fig. 2a is a flow diagram of the data segmentation method provided by one embodiment of this application. As shown in Fig. 2a, the method includes:
11. Obtain the signature of the data to be processed. This step can be executed by the data segmentation device in Fig. 1. The signature of the data to be processed may have been generated by a prior-art scheme and stored in the database of the business system. In this step, the data segmentation device can read the signature directly from the database that stores signatures; the signatures obtained in this step are mainly those of stored historical data.
12. Perform at least one level of segmentation on the signature to obtain at least two signature prefixes. This step can be executed by the data segmentation device in Fig. 1.
13. Store each of the at least two signature prefixes in correspondence with the signature. This step can be executed by the data segmentation device in Fig. 1. Specifically, the data segmentation device can store the signature prefixes and the signature in the existing database used for storing signatures.
In this embodiment, the data whose signature needs to be segmented is called the data to be processed. Optionally, the data to be processed may be historical data in the business system whose signature has not yet been segmented, but it is not limited to this. Optionally, the data to be processed may be data of any type, such as a document, a sentence or a web page.
When the signature of the data to be processed needs to be segmented, the signature is obtained first. If the signature already exists, the data segmentation device can access the corresponding storage device directly and read it; if the signature does not yet exist, the data segmentation device can generate it according to the signature generation logic.
Optionally, the signature of the data to be processed may be a SimHash signature, but it is not limited to this.
The process by which the data segmentation device obtains the SimHash signature of the data to be processed is: select the number of bits of the SimHash value; initialize every bit of the SimHash value to 0; extract the features of the data to be processed; compute the hash value of each feature with a traditional hash function; for each bit of the hash value of each feature, add 1 to the corresponding position of the SimHash value if that bit is 1, and otherwise subtract 1; for each position of the resulting SimHash value, set it to 1 if the value at that position is greater than 1, and otherwise set it to 0, which yields the SimHash signature of the data to be processed.
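The signing steps above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the use of MD5 as the "traditional hash function", the treatment of each feature as a string and the function name `simhash` are all assumptions. The description says a position is set to 1 when its counter is greater than 1, whereas the commonly published SimHash formulation compares against 0; the sketch exposes this as the `threshold` parameter, with 0 as an assumed default.

```python
import hashlib

def simhash(features, bits=64, threshold=0):
    """Return a `bits`-wide SimHash signature as an int (bits must be <= 128)."""
    counts = [0] * bits  # one counter per signature position, all initialized to 0
    mask = (1 << bits) - 1
    for feature in features:
        # Assumption: MD5 stands in for the "traditional hash function".
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            # Add 1 where the feature hash has a 1 bit, otherwise subtract 1.
            counts[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i in range(bits):
        # Assumption: threshold=0 follows the common formulation;
        # pass threshold=1 to follow the wording of the description.
        if counts[i] > threshold:
            signature |= 1 << i
    return signature

sig = simhash("the quick brown fox jumps over the lazy dog".split())
print(format(sig, "064b"))
```

Feature extraction is left to the caller; here whitespace-separated words are used purely for illustration.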
After the signature of the data to be processed has been obtained, it can be segmented. In this embodiment, the data segmentation device may perform one level of segmentation on the signature, or multiple levels. Segmenting the signature yields at least two signature prefixes. A signature prefix has fewer bits than the signature; it is a part of the signature and is equivalent to an index of the signature.
After the signature prefixes have been obtained, the data segmentation device can store each signature prefix in correspondence with the signature, so that duplicate detection can later be performed based on the signature prefixes and the signature.
In the embodiments of this application, the signature is equivalent to a fingerprint of the data to be processed. The signature is not random; it is generated from the content of the data itself. Data whose content differs greatly therefore has signatures that differ greatly, and data with similar content has similar signatures. Whether two signatures are similar can be determined from the Hamming distance between them. The Hamming distance is the number of positions at which two strings of equal length differ. On this basis, two signatures can be compared position by position to determine the number of positions at which they differ, and that number is then compared with a preset similarity threshold. If it is less than or equal to the threshold, the two signatures are identical or similar; if it is greater than the threshold, the two signatures are not similar.
The similarity threshold is the maximum number of differing bits allowed for two signatures to be considered similar. Generally, the similarity threshold for long documents is 7, meaning that two documents are considered similar when the Hamming distance between their signatures (the number of differing positions) is less than or equal to 7. The similarity threshold for sentences is 3, meaning that two sentences are considered similar when the Hamming distance between their signatures is less than or equal to 3. For example, suppose the signature of document A is 100010010 and the signature of document B is 110010011. The two signatures differ in two positions, so the Hamming distance is 2. If the similarity threshold is 7, the Hamming distance between the signatures of document A and document B is less than 7, so the two are similar documents.
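The comparison can be sketched as follows, using the document A and document B signatures from the example above. The function names `hamming_distance` and `is_similar` are illustrative assumptions, and signatures are represented as bit strings for readability.

```python
def hamming_distance(sig_a, sig_b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a != b for a, b in zip(sig_a, sig_b))

def is_similar(sig_a, sig_b, threshold):
    """Two signatures are similar when their distance is <= threshold."""
    return hamming_distance(sig_a, sig_b) <= threshold

doc_a = "100010010"
doc_b = "110010011"
print(hamming_distance(doc_a, doc_b))         # 2
print(is_similar(doc_a, doc_b, threshold=7))  # True
```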
Optionally, if two signatures differ in at most N bits and both are cut sequentially into N+1 segments, then by the pigeonhole principle (also called the drawer principle) at least one of the N+1 segments is identical in the two signatures. The pigeonhole principle is an important principle in combinatorics. Suppose there are ten apples on a table and they are put into nine drawers; however they are distributed, at least one drawer must contain at least two apples. In general terms, each drawer represents a set and each apple represents an element: if n+1 elements are placed into n sets, at least one set must contain at least two elements. By analogy with signature segmentation, the N+1 segments correspond to the n+1 elements and the similarity threshold N corresponds to the n sets. Put differently, at most N differing bits can fall into at most N of the N+1 segments, so at least one segment contains no differing bit.
Based on this analysis, the number of segments into which each segmentation must cut the signature can be determined from the similarity threshold of the Hamming distance. At least one level of segmentation is then performed on the signature according to that number of segments to obtain at least two signature prefixes. Denoting the similarity threshold of the Hamming distance as N and the number of segments as m, the condition m >= N+1 must be met. Preferably, m = N+1. For example, if N=3 then m=4, and the signature is cut into 4 segments; if N=5 then m=6, and the signature is cut into 6 segments; if N=6 then m=7, and the signature is cut into 7 segments; if N=7 then m=8, and the signature is cut into 8 segments. It is worth noting that the segments are not required to be of equal length; it is sufficient that the signature is cut sequentially. Optionally, sequential cutting means cutting in order from left to right.
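A minimal sketch of sequential segmentation and of the pigeonhole guarantee follows. The helper names `split_signature` and `shares_segment` are assumptions, as is the rule for distributing leftover bits when the length is not divisible by m (the description only says that equal lengths are not required). The two 16-bit example signatures are made up for illustration and differ in exactly 3 bits.

```python
def split_signature(signature, segments):
    """Cut a bit string left to right into `segments` consecutive pieces."""
    base, extra = divmod(len(signature), segments)
    pieces, start = [], 0
    for i in range(segments):
        # Assumption: the first `extra` pieces take one additional bit each.
        size = base + (1 if i < extra else 0)
        pieces.append(signature[start:start + size])
        start += size
    return pieces

def shares_segment(sig_a, sig_b, segments):
    """True when at least one aligned segment is identical in both signatures."""
    return any(a == b for a, b in zip(split_signature(sig_a, segments),
                                      split_signature(sig_b, segments)))

N = 3                     # similarity threshold
m = N + 1                 # number of segments
a = "1000100101100111"
b = "0000110101000111"    # differs from a in exactly 3 positions
print(split_signature(a, m))    # ['1000', '1001', '0110', '0111']
print(shares_segment(a, b, m))  # True
```

With N differing bits and N+1 segments, `shares_segment` always returns `True`, which is what lets an identical segment serve as a lookup key.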
Optionally, the data segmentation device can perform two-level segmentation on the signature. One way of performing two-level segmentation is as follows:
According to the number of segments, perform first-level segmentation on the signature to obtain at least two prefix heads. For each prefix head, perform second-level segmentation, according to the number of segments, on the bits of the signature that remain after that prefix head is removed, to obtain at least two prefix tails corresponding to that prefix head. Each prefix head thus corresponds to at least two prefix tails, and combining a prefix head with each of its prefix tails forms one signature prefix. After two-level segmentation, the number of signature prefixes of a signature equals the square of the number of segments. Each signature prefix is then stored in correspondence with the signature.
Optionally, both the first-level and the second-level segmentation cut sequentially from left to right.
The two-level segmentation process is illustrated below with a concrete example. In this example, assume N=3 and m=4, that is, the similarity threshold is 3 and each segmentation cuts into 4 segments. Taking signature segmentation of a historical document as the example, the process is as follows:
Generate the 64-bit SimHash signature of the historical document, then perform two-level segmentation on it. Taking the SimHash signature shown in Fig. 2b: first, the signature is cut from left to right into 4 segments of 16 bits each, namely parts A, B, C and D in Fig. 2b. Then, for each of the four segments A, B, C and D, the remaining part of the signature is cut a second time from left to right; the remaining part has 48 bits, so cutting it sequentially into 4 segments gives 12 bits per segment. The results are shown in the lower parts corresponding to A, B, C and D in Fig. 2b. With the segmentation shown in Fig. 2b, the SimHash signature is split into 16 signature prefixes.
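The two-level example can be sketched as below. Fig. 2b is not reproduced in this text, so the sketch assumes that the "remaining part" for a prefix head is the other three segments concatenated in their original order, and that a signature prefix is simply the head followed by one tail. The function names and the sample 64-bit value are illustrative assumptions.

```python
def split_signature(signature, segments):
    """Cut a bit string left to right into `segments` consecutive pieces."""
    base, extra = divmod(len(signature), segments)
    pieces, start = [], 0
    for i in range(segments):
        size = base + (1 if i < extra else 0)
        pieces.append(signature[start:start + size])
        start += size
    return pieces

def two_level_prefixes(signature, segments):
    """Return segments * segments signature prefixes (head + tail)."""
    prefixes, start = [], 0
    for head in split_signature(signature, segments):
        # Assumption: the remaining bits keep their original left-to-right order.
        rest = signature[:start] + signature[start + len(head):]
        for tail in split_signature(rest, segments):
            prefixes.append(head + tail)
        start += len(head)
    return prefixes

signature = format(0x0123456789ABCDEF, "064b")  # made-up 64-bit signature
prefixes = two_level_prefixes(signature, 4)
print(len(prefixes))               # 16
print({len(p) for p in prefixes})  # {28}
```

Each prefix has 16 + 12 = 28 bits, which matches the 28-bit prefix length used later in the description.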
In an optional embodiment, the data segmentation device can store the signature and the signature prefixes obtained by segmenting it in a traditional database; in this case the database in Fig. 1 is a traditional database. Each signature prefix and the signature can then be stored in correspondence as follows: with the signature as the primary key, the signature prefixes are joined with a separator and stored in a single field, forming one data record. The field can be named arbitrarily, for example extend, and an index is set up for the field. Optionally, the index may be the signature field, but it is not limited to this.
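As a rough illustration of this record layout, the sketch below uses SQLite from the Python standard library. The table name `signatures`, the index name, the comma separator and the helper `store_signature` are assumptions; only the field name `extend` and the use of the signature as primary key come from the description. The description does not say how such a record is queried by prefix, so the sketch covers storage only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The signature is the primary key; all prefixes share one text field.
conn.execute(
    "CREATE TABLE signatures (signature TEXT PRIMARY KEY, extend TEXT NOT NULL)"
)
# Assumption: a plain index on the prefix field.
conn.execute("CREATE INDEX idx_signatures_extend ON signatures (extend)")

def store_signature(conn, signature, prefixes, separator=","):
    """Insert one record: the signature plus its separator-joined prefixes."""
    conn.execute(
        "INSERT OR IGNORE INTO signatures (signature, extend) VALUES (?, ?)",
        (signature, separator.join(prefixes)),
    )
    conn.commit()

store_signature(conn, "10101100", ["1010", "1100"])
row = conn.execute(
    "SELECT extend FROM signatures WHERE signature = ?", ("10101100",)
).fetchone()
print(row[0].split(","))  # ['1010', '1100']
```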
In an optional embodiment, the data segmentation device can store the signature and the signature prefixes obtained by segmenting it in a KV (key-value) database that supports lists; such a database is in effect a KV cache. In this case the database in Fig. 1 is a KV database that supports lists. Each signature prefix and the signature can then be stored in correspondence as follows: each signature prefix is used as a key in the KV database and the signature is used as the value, appended in list form to the value of that key. In this embodiment each signature is stored redundantly, and the number of copies equals the number of signature prefixes. Denoting the number of signature prefixes as M, and considering that KV databases support parallel retrieval, the retrieval time can be reduced to 1/M of the original.
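An in-memory stand-in for such a list-valued KV store might look like the sketch below. The class name `PrefixIndex` and its methods are hypothetical, and a real deployment would use an actual KV database rather than a Python dictionary. The short 8-bit signatures and 4-bit prefixes are made up for illustration.

```python
from collections import defaultdict

class PrefixIndex:
    """Maps each signature prefix (key) to a list of signatures (value)."""

    def __init__(self):
        self._store = defaultdict(list)

    def add(self, signature, prefixes):
        # The signature is stored once per prefix, i.e. redundantly.
        for prefix in prefixes:
            if signature not in self._store[prefix]:
                self._store[prefix].append(signature)

    def candidates(self, prefixes):
        """Collect the signatures stored under any of the given prefixes."""
        found = []
        for prefix in prefixes:
            for signature in self._store.get(prefix, []):
                if signature not in found:
                    found.append(signature)
        return found

index = PrefixIndex()
index.add("10101100", ["1010", "1100"])
index.add("10100011", ["1010", "0011"])
print(index.candidates(["1010"]))  # ['10101100', '10100011']
print(index.candidates(["0000"]))  # []
```

The loop over prefixes in `candidates` is sequential here; the 1/M figure in the description relies on the KV database serving the M lookups in parallel.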
In an optional embodiment, the data segmentation device can store the signature and the signature prefixes obtained by segmenting it in a KV database that does not support lists. In this case the database in Fig. 1 is a KV database that does not support lists. Each signature prefix and the signature can then be stored in correspondence as follows: each signature prefix is used as a key, and the value of that key is updated to the signature.
Through the above processing, the signatures of the historical data are segmented and stored, which provides the conditions for the subsequent prefix-based duplicate detection process.
Fig. 3 is a flow diagram of the duplicate detection method provided by one embodiment of this application. As shown in Fig. 3, the method includes:
31. Obtain the signature of the data to be processed. This step can be executed by the duplicate detection device in Fig. 1. The data to be processed may be new data from outside the business system, for example new data generated by the external call service. After the new data enters the system, its signature is generated by a dedicated signature generation module.
32. Perform at least one level of segmentation on the signature to obtain at least two signature prefixes. This step can be executed by the duplicate detection device in Fig. 1.
33. Perform duplicate detection on the data to be processed according to the at least two signature prefixes. This step can be executed by the duplicate detection device in Fig. 1.
It is worth noting that if the duplicate detection method of this embodiment and the data segmentation method of the preceding embodiment are applied to the same business system, the data segmentation device and the duplicate detection device may be the same device or separate devices. If the signatures of the historical data in the business system have not been segmented, the data segmentation device segments them to obtain the signature prefixes of the historical data, which is a precondition for the duplicate detection device to perform prefix-based duplicate detection. If the signatures of the historical data have already been segmented, the duplicate detection device can perform prefix-based duplicate detection directly.
In this embodiment, the data on which duplicate detection needs to be performed is called the data to be processed. Optionally, the data to be processed may be new data in the business system, but it is not limited to this. Optionally, the data to be processed may be data of any type, such as a document, a sentence or a web page.
When duplicate detection needs to be performed on the data to be processed, the duplicate detection device obtains the signature of that data. If the signature already exists, the duplicate detection device can access the corresponding storage device directly and read it; if the signature does not yet exist, the duplicate detection device can generate it according to the signature generation logic.
Optionally, the signature of the data to be processed may be a SimHash signature, but it is not limited to this.
The process by which the duplicate detection device obtains the SimHash signature of the data to be processed is: select the number of bits of the SimHash value; initialize every bit of the SimHash value to 0; extract the features of the data to be processed; compute the hash value of each feature with a traditional hash function; for each bit of the hash value of each feature, add 1 to the corresponding position of the SimHash value if that bit is 1, and otherwise subtract 1; for each position of the resulting SimHash value, set it to 1 if the value at that position is greater than 1, and otherwise set it to 0, which yields the SimHash signature of the data to be processed.
After the signature of the data to be processed has been obtained, it can be segmented. In this embodiment, the duplicate detection device may perform one level of segmentation on the signature, or multiple levels. Segmenting the signature yields at least two signature prefixes. A signature prefix has fewer bits than the signature; it is a part of the signature and is equivalent to an index of the signature.
After the signature prefixes have been obtained, the duplicate detection device can perform duplicate detection based on them. Compared with the signature, a signature prefix has fewer bits and effectively serves as an index of the signature, which helps narrow the data query range and improves the time efficiency of duplicate detection.
It is worth noting that the signature is segmented in this embodiment in the same way as in the preceding data segmentation method embodiment; refer to that embodiment for details, which are not repeated here.
In an optional embodiment, step 33 above, that is, performing duplicate detection on the data to be processed according to the at least two signature prefixes, includes:
Using the at least two signature prefixes of the data to be processed as query conditions and querying the historical signature prefixes. If no historical signature prefix identical to any of the at least two signature prefixes is found, the data to be processed is determined to be non-duplicate data. In this case only the signature prefixes of the data to be processed need to be queried, and no query on the signature itself is needed. Because a signature prefix is equivalent to an index of the signature, querying prefixes is far more efficient than querying signatures, so the time efficiency of duplicate detection is improved.
Further, if a historical signature prefix identical to any of the at least two signature prefixes is found, the signature of the data to be processed is compared for similarity with the historical signature corresponding to the historical signature prefix that was found. If the signature of the data to be processed is not similar to the historical signature, the data to be processed is determined to be non-duplicate data; if it is similar to the historical signature, the data to be processed is determined to be duplicate data.
Comparing the signature of the data to be processed with the historical signature corresponding to the historical signature prefix that was found mainly means checking whether the Hamming distance between the two signatures is less than or equal to the similarity threshold. For example, the two signatures can be compared position by position to determine the number of positions at which they differ, that is, the Hamming distance. That number is then compared with the preset similarity threshold: if it is less than or equal to the threshold, the two signatures are identical or similar; if it is greater than the threshold, the two signatures are not similar.
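The query flow described above can be sketched as follows. The function names and the dictionary-based index are assumptions made for illustration; the example uses 8-bit signatures with one level of segmentation into two 4-bit prefixes and a similarity threshold of 1, which keeps the data small but is not taken from the description.

```python
def hamming_distance(sig_a, sig_b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def is_duplicate(signature, prefixes, index, threshold):
    """Return True when a stored signature sharing a prefix is similar enough.

    `index` maps a historical signature prefix to the list of historical
    signatures stored under it.
    """
    for prefix in prefixes:
        # No matching historical prefix: nothing to compare for this prefix.
        for candidate in index.get(prefix, ()):
            # Matching prefix found: confirm with a full Hamming comparison.
            if hamming_distance(signature, candidate) <= threshold:
                return True
    return False

# Historical signature 10101100, indexed under its two 4-bit prefixes.
index = {"1010": ["10101100"], "1100": ["10101100"]}

print(is_duplicate("10101101", ["1010", "1101"], index, threshold=1))  # True
print(is_duplicate("01011101", ["0101", "1101"], index, threshold=1))  # False
print(is_duplicate("10100011", ["1010", "0011"], index, threshold=1))  # False
```

The third call shows why the full comparison is still needed: the prefix `1010` matches, but the signatures differ in 4 bits, so the data is not a duplicate.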
When a historical signature prefix identical to any of the at least two signature prefixes is found, the number of bits in a signature prefix is smaller than the number of bits in a signature, so, when the data is evenly distributed, the number of historical signatures corresponding to each signature prefix is far smaller than the total number of historical signatures. This is equivalent to narrowing the range of data to be screened. For example, assume there are 10 billion signatures and each signature prefix is 28 bits long; with evenly distributed data, the average number of signatures per signature prefix is 10,000,000,000 / 2^28, about 37. Thus, even when the signature of the pending data must be compared with the historical signatures corresponding to the historical signature prefix that was found, the narrowed screening range still improves the time efficiency of duplicate judgment.
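The arithmetic in this example can be checked directly. The sketch below uses only the two figures given in the text (10 billion signatures, 28-bit prefixes); the variable names are illustrative.

```python
total_signatures = 10_000_000_000  # 10 billion historical signatures (figure from the text)
prefix_bits = 28                   # width of one signature prefix (figure from the text)

# A 28-bit prefix can take 2**28 distinct values, i.e. 2**28 buckets.
buckets = 2 ** prefix_bits

# With evenly distributed data, each bucket holds the same share of signatures.
average_per_prefix = total_signatures / buckets

print(buckets)                       # 268435456
print(round(average_per_prefix, 2))  # 37.25
```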
In addition, when the pending data is determined to be non-duplicate data, the pending data, each of its signature prefixes, and its signature can be stored in association with one another, and the corresponding service processing can be executed. This step can be performed by the duplicate judging device in Fig. 1, and the storage location can be the database corresponding to the service system in Fig. 1.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a traditional database. In that case, each signature prefix and the signature can be stored as follows: with the signature as the primary key, the signature prefixes are joined with a separator and stored in a single field to form one data record. The field can be named arbitrarily, for example extend, and an index is set for the field. Optionally, the index can be the signature field, but is not limited to this.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a KV-type database that supports lists. In that case, each signature prefix and the signature can be stored as follows: each signature prefix is used as a key in the KV-type database, and the signature, as the value, is appended in list form to the value corresponding to that key.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a KV-type database that does not support lists. In that case, each signature prefix and the signature can be stored as follows: each signature prefix is used as a key, and the value corresponding to the key is updated to the signature.
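The two KV layouts described above can be sketched as follows. This is only an illustration: plain Python dictionaries stand in for the KV-type database, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def store_in_list_kv(kv: Dict[str, List[int]], signature: int,
                     prefixes: Iterable[str]) -> None:
    """KV store that supports lists: append the signature to the list under each prefix."""
    for prefix in prefixes:
        kv.setdefault(prefix, []).append(signature)


def store_in_plain_kv(kv: Dict[str, int], signature: int,
                      prefixes: Iterable[str]) -> None:
    """KV store without list support: the value under each prefix is updated to the signature."""
    for prefix in prefixes:
        kv[prefix] = signature
```

One consequence worth noting, on this reading of the text: in the list layout every signature that shares a prefix stays reachable through that prefix, whereas in the layout without lists a later signature replaces an earlier one stored under the same prefix.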
In addition, when the pending data is determined to be duplicate data, the pending data can be discarded.
Viewed as a whole, the duplicate judgment process segments the signature and judges duplicates based on the signature prefixes. On the one hand, because the signature prefixes serve as an index of the signature, the number of queries against the massive set of signatures in the database can be reduced; on the other hand, because a signature prefix has fewer bits than a signature, each prefix corresponds to relatively few signatures, which narrows the range of data to be screened. The time efficiency of duplicate judgment is therefore improved.
Fig. 4 is a schematic structural diagram of a data segmentation device provided by another embodiment of the application. As shown in Fig. 4, the device includes a first acquisition module 41, a first segmentation module 42, and a first storage module 43.
The first acquisition module 41 is configured to obtain the signature of pending data.
The first segmentation module 42 is configured to perform at least one level of segmentation on the signature to obtain at least two signature prefixes.
The first storage module 43 is configured to store each of the at least two signature prefixes in association with the signature.
In an optional embodiment, the first segmentation module 42 is specifically configured to: determine the number of segments according to the similarity threshold of the Hamming distance; and, according to the number of segments, perform at least one level of segmentation on the signature to obtain the at least two signature prefixes.
Still optionally, the first segmentation module 42 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head to obtain at least two prefix tails corresponding to the prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails to form one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal or sequential. Optionally, sequential segmentation can proceed from left to right.
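The two-level segmentation performed by the first segmentation module 42 can be sketched as below. This is a minimal illustration under stated assumptions, not the definitive implementation: the signature is treated as a fixed-width bit string, cuts are made left to right into near-equal pieces, and the 64-bit width, the default of 4 segments, and all function names are hypothetical. How the number of segments follows from the similarity threshold is not stated in this section, so it is passed in as a parameter.

```python
from typing import List, Tuple


def cut_ranges(length: int, segments: int) -> List[Tuple[int, int]]:
    """Cut a span of `length` bits, left to right, into `segments` near-equal (start, end) ranges."""
    base, extra = divmod(length, segments)
    ranges, start = [], 0
    for i in range(segments):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def signature_prefixes(signature: int, width: int = 64, segments: int = 4) -> List[str]:
    """Return every prefix head combined with each of its prefix tails."""
    bits = format(signature, "0{}b".format(width))
    prefixes = []
    # First-level segmentation: each piece of the signature is one prefix head.
    for h_start, h_end in cut_ranges(width, segments):
        head = bits[h_start:h_end]
        rest = bits[:h_start] + bits[h_end:]  # the bits other than this head
        # Second-level segmentation of the remaining bits yields the prefix tails.
        for t_start, t_end in cut_ranges(len(rest), segments):
            prefixes.append(head + rest[t_start:t_end])
    return prefixes
```

With these assumed defaults, each head has 16 bits and each tail 12 bits, so a signature yields 16 signature prefixes of 28 bits each, which agrees with the 28-bit prefix length used in the numerical example earlier in this description. A real index would probably also need to record which head and tail positions a prefix came from; that detail is left out here.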
In an optional embodiment, the first storage module 43 is specifically configured to:
with the signature as the primary key, join the signature prefixes with a separator, store them in a single field, and set an index for the field, where the index can optionally be the signature field but is not limited to this; or
use each signature prefix as a key and append the signature to the list in the value corresponding to the key; or
use each signature prefix as a key and update the value corresponding to the key to the signature. The data segmentation device provided by this embodiment can segment the signature of data to obtain signature prefixes and store the signature prefixes in association with the signature, which provides the basis for subsequent duplicate judgment based on signature prefixes.
The foregoing describes the internal functions and structure of the data segmentation device. As shown in Fig. 5, in practice the data segmentation device can be implemented as an electronic device that includes a memory 51 and a processor 52.
The memory 51 is configured to store a program.
In addition to the program above, the memory 51 can also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operated on the electronic device, contact data, phone book data, messages, pictures, videos, and so on.
The memory 51 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 52 is coupled to the memory 51 and executes the program stored in the memory 51 in order to:
obtain the signature of pending data; perform at least one level of segmentation on the signature to obtain at least two signature prefixes; and store each of the at least two signature prefixes in association with the signature.
In an optional embodiment, when performing at least one level of segmentation on the signature, the processor 52 is specifically configured to: determine the number of segments according to the similarity threshold of the Hamming distance; and, according to the number of segments, perform at least one level of segmentation on the signature to obtain the at least two signature prefixes.
Still optionally, when performing at least one level of segmentation on the signature according to the number of segments, the processor 52 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head to obtain at least two prefix tails corresponding to the prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails to form one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal or sequential. Optionally, sequential segmentation can proceed from left to right.
In an optional embodiment, the processor 52 can store each signature prefix in association with the signature in the memory 51. Accordingly, the memory 51 is also configured to store each signature prefix in association with the signature.
In an optional embodiment, the processor 52 can store each signature prefix in association with the signature in an external database. For example, the processor 52 can store them in a traditional database; in that case, with the signature as the primary key, the processor 52 joins the signature prefixes with a separator, stores them in a single field, and sets an index for the field. As another example, the processor 52 can store them in a KV-type database that supports lists; in that case, the processor 52 uses each signature prefix as a key and appends the signature to the list in the value corresponding to the key. As yet another example, the processor 52 can store them in a KV-type database that does not support lists; in that case, the processor 52 uses each signature prefix as a key and updates the value corresponding to the key to the signature.
Further, as shown in Fig. 5, the electronic device also includes other components such as a communication component 53, a power supply component 54, an audio component 55, and a display 56. Fig. 5 shows only some of the components schematically, which does not mean that the electronic device includes only the components shown in Fig. 5.
The communication component 53 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or a combination thereof. In an exemplary embodiment, the communication component 53 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 53 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
By means of the communication component 53, the processor 52 can store each signature prefix in association with the signature in the external database.
The power supply component 54 provides power for the various components of the electronic device. The power supply component 54 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 55 is configured to output and/or input audio signals. For example, the audio component 55 includes a microphone (MIC); when the electronic device is in an operating mode such as a call mode, a recording mode, or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals can be further stored in the memory 51 or sent via the communication component 53. In some embodiments, the audio component 55 also includes a speaker for outputting audio signals.
The display 56 includes a screen, which can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
Fig. 6 is a schematic structural diagram of a duplicate judging device provided by another embodiment of the application. As shown in Fig. 6, the device includes a second acquisition module 61, a second segmentation module 62, and a duplicate judging module 63.
The second acquisition module 61 is configured to obtain the signature of pending data.
The second segmentation module 62 is configured to perform at least one level of segmentation on the signature to obtain at least two signature prefixes.
The duplicate judging module 63 is configured to judge whether the pending data is a duplicate according to the at least two signature prefixes.
In an optional embodiment, the second acquisition module 61 is specifically configured as follows: if the signature of the pending data already exists, the corresponding storage device can be accessed directly to obtain the signature; if the signature of the pending data does not yet exist, the signature can be generated according to the signature generation logic.
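The acquisition logic above amounts to a lookup with a fallback. The sketch below assumes that existing signatures are held in a dictionary keyed by a data identifier and that a generator function stands in for the signature generation logic; both are hypothetical, since this section specifies neither the storage device nor the generation logic.

```python
import hashlib
from typing import Callable, Dict


def toy_signature(data: str) -> int:
    """Hypothetical stand-in generator: a 64-bit digest of the data.

    A plain hash is NOT similarity-preserving; it only keeps the sketch runnable.
    """
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def obtain_signature(data_id: str, data: str, stored: Dict[str, int],
                     generate: Callable[[str], int] = toy_signature) -> int:
    """Return the stored signature if one exists, otherwise generate one."""
    if data_id in stored:
        return stored[data_id]  # signature already exists: read it from storage
    return generate(data)       # otherwise apply the signature generation logic
```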
In an optional embodiment, the second segmentation module 62 is specifically configured to: determine the number of segments according to the similarity threshold of the Hamming distance; and, according to the number of segments, perform at least one level of segmentation on the signature to obtain the at least two signature prefixes.
Still optionally, the second segmentation module 62 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head to obtain at least two prefix tails corresponding to the prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails to form one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal or sequential. Optionally, sequential segmentation can proceed from left to right.
In an optional embodiment, the duplicate judging module 63 is specifically configured to: use the at least two signature prefixes as query conditions to query the historical signature prefixes; and, if no historical signature prefix identical to any of the at least two signature prefixes is found, determine that the pending data is non-duplicate data.
Further, the duplicate judging module 63 is also configured to: if a historical signature prefix identical to any of the at least two signature prefixes is found, compare the signature for similarity with the historical signature corresponding to the historical signature prefix that was found; and, if the signature is not similar to the historical signature, determine that the pending data is non-duplicate data.
Further, the duplicate judging module 63 is also configured to: if the signature is similar to the historical signature, determine that the pending data is duplicate data.
Further, the duplicate judging device also includes a second storage module 64, configured to: when the pending data is determined to be non-duplicate data, store the pending data, each signature prefix, and the signature in association with one another.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a traditional database. In that case, with the signature as the primary key, the second storage module 64 joins the signature prefixes with a separator and stores them in a single field, which can be named extend, and sets an index for the field. Optionally, the index can be the signature field, but is not limited to this.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a KV-type database that supports lists. In that case, the second storage module 64 uses each signature prefix as a key in the KV-type database and appends the signature, as the value, in list form to the value corresponding to that key.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a KV-type database that does not support lists. In that case, the second storage module 64 uses each signature prefix as a key and updates the value corresponding to the key to the signature.
Further, the duplicate judging module 63 is also configured to: when the pending data is determined to be duplicate data, discard the pending data, each signature prefix, and the signature.
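The decision flow of the duplicate judging module 63 and the second storage module 64 can be sketched end to end as follows. The sketch assumes an in-memory dictionary that maps each historical signature prefix to a list of historical signatures, a default similarity threshold of 3, and prefixes supplied by the caller; these choices and the function names are hypothetical. It also reads the text as treating the pending data as a duplicate when any historical signature found through a matching prefix is similar.

```python
from typing import Dict, List, Sequence


def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Count the bit positions at which two signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def judge_and_store(signature: int, prefixes: Sequence[str],
                    index: Dict[str, List[int]], threshold: int = 3) -> bool:
    """Return True for duplicate data; otherwise store the prefixes and return False."""
    for prefix in prefixes:
        # A prefix with no historical match contributes no candidates.
        for historical in index.get(prefix, []):
            if hamming_distance(signature, historical) <= threshold:
                return True  # similar signature found: duplicate, nothing is stored
    # No matching prefix, or no similar signature: non-duplicate, so index it.
    for prefix in prefixes:
        index.setdefault(prefix, []).append(signature)
    return False


index: Dict[str, List[int]] = {}
print(judge_and_store(0b1010000011110000, ["10100000", "11110000"], index))  # False
print(judge_and_store(0b1010000011110001, ["10100000", "11110001"], index))  # True
print(judge_and_store(0b1010000000001111, ["10100000", "00001111"], index))  # False
```

The second call shares the prefix 10100000 with the first signature and differs from it in one bit, so it is judged a duplicate. The third call shares the same prefix but differs in eight bits, so it is judged non-duplicate and stored.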
The duplicate judging device provided by this embodiment segments the signature and judges duplicates based on the signature prefixes. On the one hand, because the signature prefixes serve as an index of the signature, the number of queries against the massive set of signatures in the database can be reduced; on the other hand, because a signature prefix has fewer bits than a signature, each prefix corresponds to relatively few signatures, which narrows the range of data to be screened. The time efficiency of duplicate judgment is therefore improved.
The foregoing describes the internal functions and structure of the duplicate judging device. As shown in Fig. 7, in practice the duplicate judging device can be implemented as an electronic device that includes a memory 71 and a processor 72.
The memory 71 is configured to store a program.
In addition to the program above, the memory 71 can also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operated on the electronic device, contact data, phone book data, messages, pictures, videos, and so on.
The memory 71 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 72 is coupled to the memory 71 and executes the program in order to:
obtain the signature of pending data; perform at least one level of segmentation on the signature to obtain at least two signature prefixes; and judge whether the pending data is a duplicate according to the at least two signature prefixes.
In an optional embodiment, when obtaining the signature of the pending data, the processor 72 is specifically configured as follows: if the signature of the pending data already exists, the corresponding storage device can be accessed directly to obtain the signature; if the signature of the pending data does not yet exist, the signature can be generated according to the signature generation logic.
In an optional embodiment, when performing at least one level of segmentation on the signature, the processor 72 is specifically configured to: determine the number of segments according to the similarity threshold of the Hamming distance; and, according to the number of segments, perform at least one level of segmentation on the signature to obtain the at least two signature prefixes.
Further, when performing at least one level of segmentation on the signature according to the number of segments, the processor 72 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head to obtain at least two prefix tails corresponding to the prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails to form one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal or sequential. Optionally, sequential segmentation can proceed from left to right.
In an optional embodiment, when judging whether the pending data is a duplicate, the processor 72 is specifically configured to: use the at least two signature prefixes as query conditions to query the historical signature prefixes; and, if no historical signature prefix identical to any of the at least two signature prefixes is found, determine that the pending data is non-duplicate data.
Further, the processor 72 is also configured to: if a historical signature prefix identical to any of the at least two signature prefixes is found, compare the signature for similarity with the historical signature corresponding to the historical signature prefix that was found; and, if the signature is not similar to the historical signature, determine that the pending data is non-duplicate data.
Further, the processor 72 is also configured to: if the signature is similar to the historical signature, determine that the pending data is duplicate data.
Further, the processor 72 is also configured to: when the pending data is determined to be non-duplicate data, store the pending data, each signature prefix, and the signature in association with one another.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a traditional database. In that case, with the signature as the primary key, the processor 72 joins the signature prefixes with a separator and stores them in a single field, which can be named extend, and sets an index for the field. Optionally, the index can be the signature field, but is not limited to this.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a KV-type database that supports lists. In that case, the processor 72 uses each signature prefix as a key in the KV-type database and appends the signature, as the value, in list form to the value corresponding to that key.
Optionally, each signature prefix of the pending data can be stored in association with the signature in a KV-type database that does not support lists. In that case, the processor 72 uses each signature prefix as a key and updates the value corresponding to the key to the signature.
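For the traditional-database option, the record layout can be illustrated with SQLite as a stand-in relational database. Only the field name extend and the use of the signature as primary key come from the text; the table name, the comma separator, the helper functions, and the LIKE-based lookup are assumptions made for illustration. The text leaves the kind of index open; an ordinary B-tree index such as the one created here does not by itself speed up an infix LIKE match, so a production system would likely need a different index type.

```python
import sqlite3
from typing import List

SEPARATOR = ","  # hypothetical separator between signature prefixes

conn = sqlite3.connect(":memory:")
# One record per signature: the signature is the primary key,
# and all of its prefixes are joined into the single field `extend`.
conn.execute("CREATE TABLE signatures (signature TEXT PRIMARY KEY, extend TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_extend ON signatures (extend)")


def store(signature: str, prefixes: List[str]) -> None:
    """Join the prefixes with the separator and store them with the signature."""
    # Leading and trailing separators make every prefix matchable as ",prefix,".
    extend = SEPARATOR + SEPARATOR.join(prefixes) + SEPARATOR
    conn.execute("INSERT INTO signatures (signature, extend) VALUES (?, ?)",
                 (signature, extend))


def signatures_with_prefix(prefix: str) -> List[str]:
    """Return the signatures whose `extend` field contains exactly this prefix."""
    pattern = "%" + SEPARATOR + prefix + SEPARATOR + "%"
    rows = conn.execute("SELECT signature FROM signatures WHERE extend LIKE ?",
                        (pattern,))
    return [row[0] for row in rows]
```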
Further, the processor 72 is also configured to: when the pending data is determined to be duplicate data, discard the pending data, each signature prefix, and the signature.
Further, as shown in Fig. 7, the electronic device also includes other components such as a communication component 73, a power supply component 74, an audio component 75, and a display 76. Fig. 7 shows only some of the components schematically, which does not mean that the electronic device includes only the components shown in Fig. 7.
The communication component 73 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or a combination thereof. In an exemplary embodiment, the communication component 73 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 73 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component 74 provides power for the various components of the electronic device. The power supply component 74 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 75 is configured to output and/or input audio signals. For example, the audio component 75 includes a microphone (MIC); when the electronic device is in an operating mode such as a call mode, a recording mode, or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals can be further stored in the memory 71 or sent via the communication component 73. In some embodiments, the audio component 75 also includes a speaker for outputting audio signals.
The display 76 includes a screen, which can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
A person of ordinary skill in the art will understand that all or part of the steps of the method embodiments above can be completed by hardware controlled by program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments above; the storage medium includes various media that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the application.

Claims (14)

1. A data segmentation method, characterized by comprising:
obtaining the signature of pending data;
performing at least one level of segmentation on the signature to obtain at least two signature prefixes; and
storing each of the at least two signature prefixes in association with the signature.
2. The method according to claim 1, characterized in that performing at least one level of segmentation on the signature to obtain at least two signature prefixes comprises:
determining the number of segments according to the similarity threshold of the Hamming distance; and
performing, according to the number of segments, at least one level of segmentation on the signature to obtain the at least two signature prefixes.
3. The method according to claim 2, characterized in that performing, according to the number of segments, at least one level of segmentation on the signature to obtain the at least two signature prefixes comprises:
performing first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads;
for each of the at least two prefix heads, performing second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head to obtain at least two prefix tails corresponding to the prefix head; and
for each prefix head, combining the prefix head with each of its at least two corresponding prefix tails to form one of the at least two signature prefixes.
4. The method according to any one of claims 1-3, characterized in that storing each of the at least two signature prefixes in association with the signature comprises:
with the signature as the primary key, joining the signature prefixes with a separator, storing them in a single field, and setting an index for the field; or
using each signature prefix as a key and appending the signature to the list in the value corresponding to the key; or
using each signature prefix as a key and updating the value corresponding to the key to the signature.
5. A data judging method, characterized by comprising:
obtaining the signature of pending data;
performing at least one level of segmentation on the signature to obtain at least two signature prefixes; and
judging whether the pending data is a duplicate according to the at least two signature prefixes.
6. The method according to claim 5, characterized in that judging whether the pending data is a duplicate according to the at least two signature prefixes comprises:
using the at least two signature prefixes as query conditions to query the historical signature prefixes; and
if no historical signature prefix identical to any of the at least two signature prefixes is found, determining that the pending data is non-duplicate data.
7. The method according to claim 6, characterized by further comprising:
if a historical signature prefix identical to any of the at least two signature prefixes is found, comparing the signature for similarity with the historical signature corresponding to the historical signature prefix that was found; and
if the signature is not similar to the historical signature, determining that the pending data is non-duplicate data.
8. The method according to claim 7, characterized by further comprising:
if the signature is similar to the historical signature, determining that the pending data is duplicate data.
9. The method according to claim 6 or 7, characterized by further comprising:
storing the pending data, each signature prefix, and the signature in association with one another.
10. a kind of electronic equipment, which is characterized in that including:
Memory, for storing program;
Processor, for executing described program, for:
Obtain the signature of pending data;
At least level-one cutting is carried out to the signature, to obtain at least two signature prefixes;
Each of the corresponding storage at least two signatures prefix signature prefix and the signature.
11. electronic equipment according to claim 10, which is characterized in that the processor carries out at least to the signature When level-one cutting, it is specifically used for:
According to the similar threshold value of Hamming distance, cutting hop count is determined;
According to the cutting hop count, at least level-one cutting is carried out to the signature, to obtain at least two signatures prefix.
12. a kind of electronic equipment, which is characterized in that including:
Memory, for storing program;
Processor, for executing described program, for:
Obtain the signature of pending data;
At least level-one cutting is carried out to the signature, to obtain at least two signature prefixes;
According at least two signatures prefix, the pending data is carried out to sentence weight.
13. The electronic device according to claim 12, characterized in that, when performing duplicate judgment on the pending data, the processor is specifically configured to:
Query the historical signature prefixes, using the at least two signature prefixes as query conditions;
If no historical signature prefix identical to any signature prefix among the at least two signature prefixes is found, determine that the pending data is non-duplicate data.
14. The electronic device according to claim 13, characterized in that the processor is further configured to:
If the pending data is determined to be non-duplicate data, store the pending data, each signature prefix, and the signature in association with one another.
CN201611139984.7A 2016-12-12 2016-12-12 Data segmentation method, judging method and electronic equipment Pending CN108614827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611139984.7A CN108614827A (en) 2016-12-12 2016-12-12 Data segmentation method, judging method and electronic equipment

Publications (1)

Publication Number Publication Date
CN108614827A true CN108614827A (en) 2018-10-02

Family

ID=63643314

Country Status (1)

Country Link
CN (1) CN108614827A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156689A (en) * 2011-03-31 2011-08-17 百度在线网络技术(北京)有限公司 Method and device for detecting document
CN102609442A (en) * 2010-12-28 2012-07-25 微软公司 Adaptive Index for Data Deduplication
CN103345449A (en) * 2013-06-19 2013-10-09 暨南大学 Method and system for prefetching fingerprints oriented to data de-duplication technology
CN103793522A (en) * 2008-10-20 2014-05-14 王强 Method and system for rapidly scanning feature codes
CN104391915A (en) * 2014-11-19 2015-03-04 湖南国科微电子有限公司 Duplicated data delete method
US9135674B1 (en) * 2007-06-19 2015-09-15 Google Inc. Endpoint based video fingerprinting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈春玲 等: "基于Simhash算法的重复数据删除技术的研究与改进", 《南京邮电大学学报(自然科学版)》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143393A (en) * 2018-11-03 2020-05-12 广州市明领信息科技有限公司 Big data processing system
CN110162752A (en) * 2019-05-13 2019-08-23 百度在线网络技术(北京)有限公司 Article sentences weight processing method, device and electronic equipment
CN110162752B (en) * 2019-05-13 2023-06-27 百度在线网络技术(北京)有限公司 Article judging and re-processing method and device and electronic equipment
CN113194030A (en) * 2021-05-06 2021-07-30 中国人民解放军国防科技大学 Multipath message forwarding method based on network prefix segmentation
CN113194030B (en) * 2021-05-06 2021-12-24 中国人民解放军国防科技大学 Multipath message forwarding method based on network prefix segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination