CN108614827A - Data segmentation method, judging method and electronic equipment - Google Patents
- Publication number
- CN108614827A CN201611139984.7A
- Authority
- CN
- China
- Prior art keywords
- signature
- prefix
- cutting
- data
- pending data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
This application provides a data segmentation method, a duplicate-judgment method, and an electronic device. The data duplicate-judgment method includes: obtaining a signature of data to be processed; performing at least one level of segmentation on the signature to obtain at least two signature prefixes; and performing duplicate judgment on the data to be processed according to the at least two signature prefixes. The technical solution of this application can improve the time efficiency of data duplicate judgment.
Description
Technical field
This application relates to the field of Internet technology, and in particular to a data segmentation method, a duplicate-judgment method, and an electronic device.
Background
When massive data (such as documents or web pages) is processed, duplicate judgment is usually performed on the data in order to save storage space. The mainstream practice in the industry at present is to perform duplicate judgment based on the SimHash algorithm. SimHash is the most commonly used hash method for data deduplication. Its principle is: select the number of bits of the SimHash value; initialize every position of the SimHash value to 0; extract the features of the data to be signed; compute the hash value of each feature using a traditional hash function; for each bit of each feature's hash value, add 1 to the value at the corresponding position of the SimHash value if the bit is 1, and subtract 1 otherwise; for each position of the resulting SimHash value, set it to 1 if its value is greater than 0 and to 0 otherwise, which yields the SimHash signature. The SimHash algorithm is very fast.
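The signing steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name simhash, the use of MD5 as the "traditional hash function", and the final threshold (a counter greater than 0 becomes a 1 bit, as in the conventional SimHash formulation) are all assumptions made for the example.

```python
import hashlib


def simhash(features, bits=64):
    """Return a SimHash signature (as an int) for an iterable of string features.

    Assumption: `bits` is at most 128, because MD5 yields 128 bits.
    """
    counts = [0] * bits  # one counter per signature position, initialised to 0
    for feature in features:
        # Hypothetical choice of "traditional hash": the top `bits` bits of MD5.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") >> (128 - bits)
        for i in range(bits):
            # A 1 bit votes +1 for this position, a 0 bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    signature = 0
    for i, count in enumerate(counts):
        if count > 0:  # conventional threshold; positive counters become 1 bits
            signature |= 1 << i
    return signature
```

Because the counters are simple sums, the signature does not depend on the order in which features are supplied.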
The duplicate-judgment process based on the SimHash algorithm is as follows: SimHash signatures are computed for the historical data and stored; for new data, its SimHash signature is computed first and then compared with the SimHash signatures of the historical data to check whether they are similar, thereby judging whether the new data already exists in the historical data.
In the worst case, the above scheme must traverse the SimHash signatures of all historical data for every comparison. Although the time complexity is O(n), the volume of historical data is large (on a crawler platform, for example, the data volume is basically in the hundreds of millions), so the time efficiency is still low.
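For contrast with the prefix-based scheme, the baseline just described amounts to a linear scan. The sketch below is illustrative only; the function name and the representation of signatures as Python integers are assumptions.

```python
def is_duplicate_linear(new_signature, history_signatures, threshold):
    """Baseline O(n) check: compare the new signature with every stored one.

    Signatures are ints; XOR marks the differing bits, and counting the 1s
    in the XOR result gives the Hamming distance.
    """
    for old_signature in history_signatures:
        if bin(new_signature ^ old_signature).count("1") <= threshold:
            return True
    return False
```

Every new item costs one pass over the whole history, which is the cost the signature prefixes are meant to avoid.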
Summary of the invention
This application provides a data segmentation method, a duplicate-judgment method, and an electronic device, in order to improve the time efficiency of data duplicate judgment.
To achieve the above objective, the embodiments of this application adopt the following technical solutions:
In a first aspect, a data segmentation method is provided, including:
obtaining a signature of data to be processed;
performing at least one level of segmentation on the signature to obtain at least two signature prefixes;
storing each of the at least two signature prefixes in correspondence with the signature.
In a second aspect, a data duplicate-judgment method is provided, including:
obtaining a signature of data to be processed;
performing at least one level of segmentation on the signature to obtain at least two signature prefixes;
performing duplicate judgment on the data to be processed according to the at least two signature prefixes.
In a third aspect, an electronic device is provided, including:
a memory for storing a program;
a processor for executing the program in order to:
obtain a signature of data to be processed;
perform at least one level of segmentation on the signature to obtain at least two signature prefixes;
store each of the at least two signature prefixes in correspondence with the signature.
In a fourth aspect, an electronic device is provided, including:
a memory for storing a program;
a processor for executing the program in order to:
obtain a signature of data to be processed;
perform at least one level of segmentation on the signature to obtain at least two signature prefixes;
perform duplicate judgment on the data to be processed according to the at least two signature prefixes.
In the embodiments of this application, at least one level of segmentation is performed on the signature of the data to obtain at least two signature prefixes, and the signature prefixes are stored in correspondence with the signature. Duplicate judgment is then based on the signature prefixes rather than on the full signature. Compared with the signature, a signature prefix has fewer bits and acts as an index of the signature; querying by signature prefix narrows the data query range and improves the time efficiency of duplicate judgment.
The above is only an overview of the technical solution of this application. To make the technical means of this application clearer so that they can be implemented according to the content of the specification, and to make the above and other objects, features, and advantages of this application more apparent, specific embodiments of this application are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of this application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a structural schematic diagram of the business system provided by an embodiment of this application;
Fig. 2a is a flow diagram of the data segmentation method provided by an embodiment of this application;
Fig. 2b is a schematic diagram of the signature segmentation result provided by another embodiment of this application;
Fig. 3 is a flow diagram of the data duplicate-judgment method provided by yet another embodiment of this application;
Fig. 4 is a structural schematic diagram of the data segmentation device provided by yet another embodiment of this application;
Fig. 5 is a structural schematic diagram of the electronic device provided by yet another embodiment of this application;
Fig. 6 is a structural schematic diagram of the data duplicate-judgment device provided by yet another embodiment of this application;
Fig. 7 is a structural schematic diagram of the electronic device provided by yet another embodiment of this application.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To address problems of the prior art such as the low time efficiency of data duplicate judgment, this application provides a solution whose main principle is: segment the signature of the data to obtain at least two signature prefixes, and perform duplicate judgment based on the signature prefixes. Compared with the signature, a signature prefix has fewer bits and acts as an index of the signature; querying by signature prefix narrows the data query range and improves the time efficiency of duplicate judgment.
It is worth noting that the method provided by the embodiments of this application can be applied to any business system that has duplicate-judgment logic, for example a crawler system. A crawler system crawls a large number of web pages every day, and the number of newly crawled pages per day is in the millions. Storing all of them would occupy a large amount of storage space, so it is necessary to judge whether a currently crawled page is similar to the historical pages; if the similarity is high, the current page can be discarded to save storage space. With the existing scheme, this duplicate judgment is inefficient; the method provided by the embodiments of this application can greatly improve its time efficiency.
Some business systems may already hold historical data before the method provided by the embodiments of this application is deployed. Duplicate judgment for this historical data was performed according to the prior-art method, so its signatures have not been segmented. To apply the prefix-based duplicate judgment provided by the embodiments of this application, it is therefore necessary not only to implement the duplicate-judgment logic for new data but also to segment the signatures of the historical data and store the results.
The data segmentation method and the data duplicate-judgment method provided in this embodiment can be executed by a business system. The business system referred to here may be a business platform that processes massive numbers of web pages, such as a crawler platform, or a document processing platform that performs integrated document processing, and so on. Fig. 1 is a structural schematic diagram of the business system of an embodiment of this application; the structure shown in Fig. 1 is only one example of a business system to which the technical solution of the present invention is applicable. The business system includes a data segmentation device and a data duplicate-judgment device. The external call service can be any service capable of providing or generating data; it mainly comes from business accesses or service calls made to the business system by other systems or clients, and it is the main source of new data. The database corresponding to the business system stores the data of the business system, the signatures of that data, and the signature prefixes generated by segmenting the signatures. The data, the signatures, and the signature prefixes may be stored in separate databases or in the same database. The database contains some historical data whose signatures have not been segmented.
The data segmentation device mainly executes the process flow shown in Fig. 2a below. It processes the historical data in the database whose signatures have not been segmented, generates the signature prefixes for that historical data, and writes them into the database. Through the processing of the data segmentation device, all data in the database gradually acquires signature prefixes, which facilitates duplicate judgment when new data subsequently enters.
The data duplicate-judgment device mainly executes the process flow shown in Fig. 3 below and mainly processes new data. After new data enters the business system, a signature generation module (which also exists in the prior art) first generates the signature of the new data; the signature is then segmented, and duplicate judgment is performed based on the resulting signature prefixes. Non-duplicate data is stored in the database together with its signature and signature prefixes, while duplicate data is discarded.
It should be noted that some modules of the data segmentation device and the data duplicate-judgment device may overlap. For example, the first acquisition module and the second acquisition module may be one module, since both obtain signatures and differ only in the source of the signature; likewise, the first segmentation module and the second segmentation module may be the same module, since both segment signatures.
The above explains the technical principle of the embodiments of the present invention and an exemplary application framework. The specific technical solutions of the embodiments of the present invention are described in detail below.
Based on the above, this application provides a data segmentation method to meet the needs of historical data, solving the problem of segmenting and storing data signatures, and provides a data duplicate-judgment method to meet the needs of new data, solving the problem of duplicate judgment. These are described in detail below through different embodiments.
Fig. 2a is a flow diagram of the data segmentation method provided by an embodiment of this application. As shown in Fig. 2a, the method includes:
11. Obtain the signature of the data to be processed. This step can be executed by the data segmentation device in Fig. 1. The signature of the data to be processed may have been generated by the prior-art scheme and stored in the database of the business system. In this step, the data segmentation device can obtain the signature directly from the database that stores signatures; the signatures obtained in this step are mainly those of stored historical data.
12. Perform at least one level of segmentation on the signature to obtain at least two signature prefixes. This step can be executed by the data segmentation device in Fig. 1.
13. Store each of the at least two signature prefixes in correspondence with the signature. This step can be executed by the data segmentation device in Fig. 1. Specifically, the data segmentation device can store the signature prefixes and the signature in the existing database used for storing signatures.
In this embodiment, the data whose signature needs to be segmented is called the data to be processed. Optionally, the data to be processed may be historical data in the business system whose signature has not yet been segmented, but it is not limited to this. Optionally, the data to be processed may be data of any type, such as a document, a sentence, or a web page.
When the signature of the data to be processed needs to be segmented, the signature can first be obtained. If the signature already exists, the data segmentation device can access the corresponding storage device directly and read it; if the signature does not yet exist, the data segmentation device can generate it according to the signature generation logic.
Optionally, the signature of the data to be processed may be a SimHash signature, but it is not limited to this.
The data segmentation device obtains the SimHash signature of the data to be processed as follows: select the number of bits of the SimHash value; initialize every position of the SimHash value to 0; extract the features of the data to be processed; compute the hash value of each feature using a traditional hash function; for each bit of each feature's hash value, add 1 to the value at the corresponding position of the SimHash value if the bit is 1, and subtract 1 otherwise; for each position of the resulting SimHash value, set it to 1 if its value is greater than 0 and to 0 otherwise, which yields the SimHash signature of the data to be processed.
After the signature of the data to be processed is obtained, it can be segmented. In this embodiment, the data segmentation device may perform one level of segmentation or multiple levels of segmentation on the signature. Segmenting the signature yields at least two signature prefixes. A signature prefix has fewer bits than the signature, is a part of the signature, and acts as an index of the signature.
After the signature prefixes are obtained, the data segmentation device can store each signature prefix in correspondence with the signature, so that duplicate judgment can later be performed based on the signature prefixes and the signature.
In the embodiments of this application, the signature acts as a fingerprint of the data to be processed. The signature is not generated at random; it depends on the content of the data itself. Data whose content differs greatly therefore has very different signatures, and data with similar content has similar signatures. Whether two signatures are similar can be determined from the Hamming distance between them. The Hamming distance is the number of positions at which two strings of equal length differ. Accordingly, two signatures can be compared position by position to count the positions at which they differ; that count is then compared with a preset similarity threshold. If the count is less than or equal to the threshold, the two signatures are identical or similar; if it is greater, the two signatures are dissimilar.
The similarity threshold is the maximum number of differing bits allowed between two signatures that are still considered similar. In general, the Hamming-distance similarity threshold for long documents is 7, meaning that two documents are considered similar when the Hamming distance between their signatures (the number of differing positions) is less than or equal to 7. For sentences the threshold is 3, meaning that two sentences are considered similar when the Hamming distance between their signatures is less than or equal to 3. For example, suppose the signature of document A is 100010010 and the signature of document B is 110010011. The two signatures differ in two positions (the second and the ninth), so the Hamming distance is 2. If the similarity threshold is 7, the Hamming distance between the signatures of document A and document B is below the threshold, so the two are similar documents.
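The comparison described above can be sketched in a few lines. The function names are hypothetical, and signatures are represented as strings of 0 and 1 purely for readability.

```python
def hamming_distance(sig_a, sig_b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a != b for a, b in zip(sig_a, sig_b))


def is_similar(sig_a, sig_b, threshold):
    """Two signatures are similar when their distance does not exceed the threshold."""
    return hamming_distance(sig_a, sig_b) <= threshold
```

With the signatures of documents A and B from the example, hamming_distance("100010010", "110010011") evaluates to 2, so the pair counts as similar under a threshold of 7.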
Optionally, for two signatures that differ in at most N bits, if both signatures are segmented sequentially into N+1 segments, then by the drawer principle (also called the pigeonhole principle) at least one of the N+1 segments is identical in the two signatures. The drawer principle is an important principle in combinatorics. Suppose there are ten apples on a table and they are put into nine drawers; however they are distributed, at least one drawer contains at least two apples. In general terms, each drawer represents a set and each apple represents an element: if n+1 elements are put into n sets, at least one set must contain at least two elements. Applied to signature segmentation, the N+1 segments correspond to the n+1 elements and the similarity threshold N corresponds to the n sets: at most N differing bits cannot fall into all N+1 segments, so at least one segment contains no differing bit.
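The guarantee can also be written as a short counting argument. The symbols d_i and d(a, b) are notation introduced here for illustration and do not appear in the original text.

```latex
Let signatures $a$ and $b$ be split at the same positions into $N+1$ segments,
and let $d_i \ge 0$ be the number of differing bits in segment $i$. Then
\[
  \sum_{i=1}^{N+1} d_i \;=\; d(a,b) \;\le\; N .
\]
If every segment differed, i.e. $d_i \ge 1$ for all $i$, we would have
$\sum_{i=1}^{N+1} d_i \ge N+1 > N$, a contradiction.
Hence $d_i = 0$ for at least one segment $i$.
```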
Based on the above analysis, the number of segments into which the signature must be split in each segmentation can be determined from the Hamming-distance similarity threshold; the signature is then segmented at least once according to that number of segments to obtain at least two signature prefixes. If the similarity threshold is denoted N and the number of segments is denoted m, then m >= N+1 must hold. Preferably, m = N+1. For example, if N=3 then m=4, so the signature is split into 4 segments; if N=5 then m=6, so the signature is split into 6 segments; if N=6 then m=7, so the signature is split into 7 segments; if N=7 then m=8, so the signature is split into 8 segments. It is worth noting that the segments need not be of equal length; the only requirement is that the signature is split sequentially. Optionally, sequential splitting may proceed from left to right.
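A sketch of one segmentation pass under these rules follows. The helper names and the policy for uneven lengths (the leftmost segments take the leftover bits) are assumptions; the text only requires sequential splitting into m >= N+1 segments.

```python
def segment_count(similarity_threshold):
    """Preferred number of segments: m = N + 1."""
    return similarity_threshold + 1


def split_left_to_right(signature, m):
    """Split a bit string into m consecutive segments, left to right.

    Segments need not be equal; here the first (len % m) segments are
    one bit longer than the rest (an illustrative policy).
    """
    base, extra = divmod(len(signature), m)
    segments, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        segments.append(signature[start:end])
        start = end
    return segments
```

For a 9-bit signature and N = 3, split_left_to_right("100010010", segment_count(3)) gives four segments of lengths 3, 2, 2 and 2.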
Optionally, the data segmentation device can perform two levels of segmentation on the signature. One way of performing two-level segmentation is as follows:
According to the number of segments, perform first-level segmentation on the signature to obtain at least two prefix heads. For each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature that remain after removing that prefix head, to obtain at least two prefix tails corresponding to that prefix head. Each prefix head thus corresponds to at least two prefix tails, and combining a prefix head with each of its prefix tails forms one signature prefix. After two-level segmentation, the number of signature prefixes of a signature is the square of the number of segments. Each signature prefix is then stored in correspondence with the signature.
Optionally, both the first-level and the second-level segmentation may split sequentially from left to right.
The two-level segmentation process is illustrated below with a specific example. In this example, N=3 and m=4; that is, the similarity threshold is 3 and each segmentation splits the signature into 4 segments. Taking the segmentation of the signature of a historical document as an example, the process is as follows:
Generate the 64-bit SimHash signature of the historical document, and perform two-level segmentation on it. For the SimHash signature shown in Fig. 2b, first split the signature from left to right into 4 segments of 16 bits each; these 4 segments are parts A, B, C, and D in Fig. 2b. For each of the four segments A, B, C, and D, perform second-level segmentation from left to right on the remaining part of the signature. The remaining part has 48 bits, so splitting it sequentially into 4 segments gives 12 bits per segment; the results are shown as the sub-parts under A, B, C, and D in Fig. 2b. With the segmentation shown in Fig. 2b, the SimHash signature is split into 16 signature prefixes.
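The two-level example can be reproduced with the sketch below. It assumes equal-length segments (as in the 64-bit example), represents the signature as a bit string, and forms each prefix by concatenating a head with one of its tails; the function names are hypothetical.

```python
def split_equal(bits, m):
    """Split a bit string into m equal consecutive segments (len must divide by m)."""
    size = len(bits) // m
    return [bits[i * size:(i + 1) * size] for i in range(m)]


def signature_prefixes(signature, m):
    """Two-level segmentation: each head is paired with m tails, giving m * m prefixes."""
    prefixes = []
    head_size = len(signature) // m
    for k, head in enumerate(split_equal(signature, m)):
        start = k * head_size
        # Bits of the signature that remain once this head is removed.
        rest = signature[:start] + signature[start + head_size:]
        for tail in split_equal(rest, m):
            prefixes.append(head + tail)
    return prefixes
```

For a 64-bit signature and m = 4 this yields 16 prefixes of 16 + 12 = 28 bits each.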
In an optional embodiment, the data segmentation device can store the signature and the signature prefixes obtained by segmenting it in a traditional database; in this case the database in Fig. 1 is a traditional database. Each signature prefix and the signature can then be stored correspondingly as follows: with the signature as the primary key, the signature prefixes are concatenated with a separator and stored in a single field, forming one data record. The field can be named arbitrarily, for example extend, and an index is set for the field. Optionally, the index may be the signature field, but it is not limited to this.
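As a rough illustration of this record layout, the sketch below uses SQLite from the Python standard library. The table name, the comma separator, and the index name are assumptions chosen for the example; only the primary key on the signature and the extend field come from the description.

```python
import sqlite3

SEPARATOR = ","  # assumed separator; the text does not name one


def open_store():
    """Create an in-memory table keyed by signature, with an indexed `extend` field."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE signatures (signature TEXT PRIMARY KEY, extend TEXT)"
    )
    conn.execute("CREATE INDEX idx_extend ON signatures (extend)")
    return conn


def save_signature(conn, signature, prefixes):
    """Store one record: the signature plus its prefixes joined by the separator."""
    conn.execute(
        "INSERT OR REPLACE INTO signatures (signature, extend) VALUES (?, ?)",
        (signature, SEPARATOR.join(prefixes)),
    )


def load_prefixes(conn, signature):
    """Return the list of prefixes stored for a signature, or None if absent."""
    row = conn.execute(
        "SELECT extend FROM signatures WHERE signature = ?", (signature,)
    ).fetchone()
    return row[0].split(SEPARATOR) if row else None
```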
In an optional embodiment, the data segmentation device can store the signature and the signature prefixes obtained by segmenting it in a KV-type database that supports lists; such a KV-type database is in effect a KV cache. In this case the database in Fig. 1 is a KV-type database that supports lists. Each signature prefix and the signature can then be stored correspondingly as follows: each signature prefix is used as a key in the KV-type database, and the signature is used as a value that is appended, in list form, to the value of that key. In this embodiment each signature is stored redundantly, and the number of copies equals the number of signature prefixes. Denoting the number of signature prefixes as M for convenience, and considering that KV-type databases support parallel retrieval, the retrieval time can be reduced to 1/M of the original.
In an optional embodiment, the data segmentation device can store the signature and the signature prefixes obtained by segmenting it in a KV-type database that does not support lists. In this case the database in Fig. 1 is a KV-type database that does not support lists. Each signature prefix and the signature can then be stored correspondingly as follows: each signature prefix is used as a key, and the signature is written as the value of that key.
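The two KV layouts described above can be contrasted with plain Python dictionaries standing in for the KV-type database. This is only a sketch: the function names are hypothetical, and a real KV store would be accessed through its own client API.

```python
from collections import defaultdict


def store_in_list_kv(kv, signature, prefixes):
    """List-capable KV store: append the signature to the list under each prefix key."""
    for prefix in prefixes:
        kv[prefix].append(signature)


def store_in_plain_kv(kv, signature, prefixes):
    """KV store without lists: each prefix key holds one signature, updated in place."""
    for prefix in prefixes:
        kv[prefix] = signature


list_kv = defaultdict(list)  # prefix -> list of signatures
plain_kv = {}                # prefix -> single signature
```

In the list layout every signature is stored once per prefix, so signatures that share a prefix accumulate under the same key; in the plain layout a later signature with the same prefix overwrites the earlier one.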
Through the above processing, the signatures of the historical data are segmented and stored, which prepares for the subsequent prefix-based duplicate-judgment process.
Fig. 3 is a flow diagram of the data duplicate-judgment method provided by an embodiment of this application. As shown in Fig. 3, the method includes:
31. Obtain the signature of the data to be processed. This step can be executed by the data duplicate-judgment device in Fig. 1. The data to be processed may be new data from outside the business system, for example new data generated by an external call service; after the new data enters the system, its signature is generated by a dedicated signature generation module.
32. Perform at least one level of segmentation on the signature to obtain at least two signature prefixes. This step can be executed by the data duplicate-judgment device in Fig. 1.
33. Perform duplicate judgment on the data to be processed according to the at least two signature prefixes. This step can be executed by the data duplicate-judgment device in Fig. 1.
It is worth noting that if the data duplicate-judgment method of this embodiment and the data segmentation method of the foregoing embodiment are applied to the same business system, the data segmentation device and the data duplicate-judgment device may be the same device or separate devices. If the signatures of the historical data in the business system have not been segmented, the data segmentation device must first segment them to obtain the signature prefixes of the historical data; this is a precondition for the data duplicate-judgment device to perform prefix-based duplicate judgment. If the signatures of the historical data have already been segmented, the data duplicate-judgment device can perform prefix-based duplicate judgment directly.
In this embodiment, the data on which duplicate judgment needs to be performed is called the data to be processed. Optionally, the data to be processed may be new data in the business system, but it is not limited to this. Optionally, the data to be processed may be data of any type, such as a document, a sentence, or a web page.
When duplicate judgment needs to be performed on the data to be processed, the data duplicate-judgment device can first obtain its signature. If the signature already exists, the data duplicate-judgment device can access the corresponding storage device directly and read it; if the signature does not yet exist, the data duplicate-judgment device can generate it according to the signature generation logic.
Optionally, the signature of the data to be processed may be a SimHash signature, but it is not limited to this.
The data duplicate-judgment device obtains the SimHash signature of the data to be processed as follows: select the number of bits of the SimHash value; initialize every position of the SimHash value to 0; extract the features of the data to be processed; compute the hash value of each feature using a traditional hash function; for each bit of each feature's hash value, add 1 to the value at the corresponding position of the SimHash value if the bit is 1, and subtract 1 otherwise; for each position of the resulting SimHash value, set it to 1 if its value is greater than 0 and to 0 otherwise, which yields the SimHash signature of the data to be processed.
After the signature of the data to be processed is obtained, it can be segmented. In this embodiment, the data duplicate-judgment device may perform one level of segmentation or multiple levels of segmentation on the signature. Segmenting the signature yields at least two signature prefixes. A signature prefix has fewer bits than the signature, is a part of the signature, and acts as an index of the signature.
After the signature prefixes are obtained, the data duplicate-judgment device can perform duplicate judgment based on them. Compared with the signature, a signature prefix has fewer bits and acts as an index of the signature; querying by signature prefix narrows the data query range and improves the time efficiency of duplicate judgment.
It is worth noting that the way the signature is segmented in this embodiment is the same as in the foregoing data segmentation method embodiment; see that embodiment for details, which are not repeated here.
In an optional embodiment, step 33 above, that is, performing duplicate judgment on the data to be processed according to the at least two signature prefixes, includes:
Using the at least two signature prefixes of the data to be processed as query conditions, query the historical signature prefixes. If no historical signature prefix identical to any of the at least two signature prefixes is found, the data to be processed is determined to be non-duplicate data. In this case only the signature prefixes of the data to be processed are queried, and no query on the full signature is needed. Because a signature prefix acts as an index of the signature, querying it is far more efficient than querying the signature, so the time efficiency of duplicate judgment is improved.
Further, if a historical signature prefix identical to any of the at least two signature prefixes is found, the signature of the data to be processed is compared for similarity with the historical signature corresponding to the matched historical signature prefix. If the signature of the data to be processed is dissimilar to the historical signature, the data to be processed is determined to be non-duplicate data; if it is similar to the historical signature, the data to be processed is determined to be duplicate data.
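Putting the pieces together, the lookup-then-verify flow described above might look like the sketch below. It is self-contained and assumes bit-string signatures whose length divides evenly at both segmentation levels, two-level segmentation with m = N + 1, and an in-memory dictionary in place of the database; the class and method names are hypothetical.

```python
from collections import defaultdict


def split_equal(bits, m):
    size = len(bits) // m
    return [bits[i * size:(i + 1) * size] for i in range(m)]


def signature_prefixes(signature, m):
    """Two-level segmentation: every head combined with each of its m tails."""
    prefixes, head_size = [], len(signature) // m
    for k, head in enumerate(split_equal(signature, m)):
        start = k * head_size
        rest = signature[:start] + signature[start + head_size:]
        prefixes.extend(head + tail for tail in split_equal(rest, m))
    return prefixes


def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))


class PrefixIndex:
    def __init__(self, threshold):
        self.threshold = threshold           # similarity threshold N
        self.m = threshold + 1               # number of segments per level
        self.by_prefix = defaultdict(list)   # prefix -> historical signatures

    def is_duplicate(self, signature):
        checked = set()
        for prefix in signature_prefixes(signature, self.m):
            # No matching historical prefix means no candidates to verify.
            for candidate in self.by_prefix.get(prefix, ()):
                if candidate in checked:
                    continue
                checked.add(candidate)
                if hamming_distance(signature, candidate) <= self.threshold:
                    return True
        return False

    def judge_and_store(self, signature):
        """Return True for duplicates; store non-duplicates under every prefix."""
        if self.is_duplicate(signature):
            return True
        for prefix in signature_prefixes(signature, self.m):
            self.by_prefix[prefix].append(signature)
        return False
```

Only signatures that share at least one prefix with the query are ever compared in full, which is where the saving over a complete scan comes from.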
Comparing the signature of the data to be processed with the historical signature corresponding to the matched historical signature prefix mainly means checking whether the Hamming distance between the two signatures is less than or equal to the similarity threshold. For example, the two signatures can be compared position by position to count the positions at which they differ, which is the Hamming distance; that count is then compared with the preset similarity threshold. If the count is less than or equal to the threshold, the two signatures are identical or similar; if it is greater, the two signatures are dissimilar.
When a historical signature prefix identical to any of the at least two signature prefixes is found, the signature prefix has fewer bits than the signature, so with evenly distributed data the number of historical signatures under each signature prefix is far smaller than the total number of historical signatures; this narrows the range of data to be screened. For example, assume there are 10 billion signatures and each signature prefix is 28 bits long. With evenly distributed data, the average number of signatures per signature prefix is 10000000000/2^28, about 37. Thus, even though the signature of the pending data must still be compared with the historical signatures corresponding to the matched historical signature prefix, the narrowed screening range still improves the time efficiency of duplicate detection.
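The arithmetic in this example can be checked with a few lines of code. The helper name below is hypothetical, and the calculation assumes, as the text does, that signatures are spread evenly over all possible prefix values.

```python
def average_signatures_per_prefix(total_signatures: int, prefix_bits: int) -> float:
    """Average bucket size when signatures are spread evenly over all prefixes."""
    # A prefix of prefix_bits bits can take 2 ** prefix_bits distinct values.
    return total_signatures / (2 ** prefix_bits)


average = average_signatures_per_prefix(10_000_000_000, 28)
print(round(average, 2))  # 37.25
```

So each similarity comparison would touch a few dozen candidate signatures on average rather than the full set of 10 billion.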
In addition, when the pending data is determined to be non-duplicate data, the pending data, each of its signature prefixes, and its signature can be stored in association with one another, and the corresponding business processing can then be carried out. This step can be performed by the data duplicate-detection device in Fig. 1, and the storage location can be the database of the business system in Fig. 1.
Optionally, each signature prefix of the pending data can be stored together with the signature in a traditional database. In that case, the signature prefixes and the signature can be stored as follows: with the signature as the primary key, the signature prefixes are joined with a separator and stored in a single field, forming one data record. The field can be named arbitrarily, for example extend, and an index is set for the field. Optionally, the index can be the signature field, but it is not limited to this.
Optionally, each signature prefix of the pending data can be stored together with the signature in a KV (key-value) database that supports lists. In that case, each signature prefix is used as a key in the KV database and the signature is used as the value, appended to the list held as the value of that key.
Optionally, each signature prefix of the pending data can be stored together with the signature in a KV database that does not support lists. In that case, each signature prefix is used as a key, and the signature is updated into the value corresponding to that key.
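The three storage layouts above can be sketched with in-memory structures standing in for real databases. Everything here is an assumption made for illustration: plain Python dicts play the role of the KV databases, the separator is a comma, and for the KV database without list support the sketch reads "updated into the value" as a read-modify-write of a separator-joined string, which is only one possible interpretation.

```python
SEP = ","  # hypothetical separator; the text does not prescribe one


def relational_record(signature: str, prefixes: list) -> dict:
    """Traditional database: signature as primary key, prefixes joined in one field."""
    return {"signature": signature, "extend": SEP.join(prefixes)}


def store_in_list_kv(kv: dict, prefixes: list, signature: str) -> None:
    """KV store with list support: append the signature to each prefix's list."""
    for prefix in prefixes:
        kv.setdefault(prefix, []).append(signature)


def store_in_plain_kv(kv: dict, prefixes: list, signature: str) -> None:
    """KV store without list support: rewrite each prefix's string value."""
    for prefix in prefixes:
        old = kv.get(prefix)
        # Assumed behaviour: keep earlier signatures and add the new one.
        kv[prefix] = signature if old is None else old + SEP + signature
```

In the two KV layouts a lookup by signature prefix returns all candidate historical signatures directly, which is what the prefix query during duplicate detection needs.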
In addition, when the pending data is determined to be duplicate data, the pending data can be discarded.
Viewed over the whole duplicate-detection process, segmenting the signature and detecting duplicates on the basis of signature prefixes improves time efficiency in two ways. First, because the signature prefixes serve as an index to the signatures, fewer queries have to be made against the massive set of signatures in the database. Second, because a signature prefix has fewer bits than the signature, relatively few signatures correspond to each prefix, which narrows the range of data to be screened.
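The overall flow (query by prefix, compare candidates by Hamming distance, then store or discard) might look roughly like the sketch below. It assumes the signature prefixes have already been computed, uses a dict from prefix to a list of integer signatures as a stand-in for the prefix index, and uses hypothetical names and a hypothetical default threshold throughout.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two signatures."""
    return bin(a ^ b).count("1")


def judge_and_store(signature: int, prefixes: list, index: dict, threshold: int = 3) -> bool:
    """Return True for duplicate data; otherwise store the signature and return False."""
    # Step 1: query the historical prefixes; only matching buckets yield candidates.
    candidates = set()
    for prefix in prefixes:
        candidates.update(index.get(prefix, ()))
    # Step 2: no matching prefix means no candidates, so the data is non-duplicate.
    # Otherwise compare against each candidate historical signature.
    if any(hamming(signature, old) <= threshold for old in candidates):
        return True  # duplicate: the caller can discard the pending data
    # Step 3: non-duplicate, so record the signature under every one of its prefixes.
    for prefix in prefixes:
        index.setdefault(prefix, []).append(signature)
    return False
```

Note that the expensive similarity comparison only runs over the signatures that share at least one prefix with the pending data, which is the narrowing effect described above.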
Fig. 4 is a schematic structural diagram of a data segmentation device provided by another embodiment of the present application. As shown in Fig. 4, the device includes a first acquisition module 41, a first segmentation module 42, and a first storage module 43.
The first acquisition module 41 is configured to obtain the signature of pending data.
The first segmentation module 42 is configured to perform at least one level of segmentation on the signature to obtain at least two signature prefixes.
The first storage module 43 is configured to store each of the at least two signature prefixes in association with the signature.
In an optional embodiment, the first segmentation module 42 is specifically configured to determine the number of segments according to the Hamming-distance similarity threshold, and to perform at least one level of segmentation on the signature according to that number of segments, so as to obtain the at least two signature prefixes.
Further optionally, the first segmentation module 42 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head, to obtain at least two prefix tails corresponding to that prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails, each combination forming one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal segmentation or sequential segmentation. Optionally, sequential segmentation can proceed in order from left to right.
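A two-level segmentation of this kind can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the signature is a 64-bit integer, the number of segments is passed in directly (4 here) rather than derived from the threshold, both levels use near-equal left-to-right cuts, and each prefix is tagged with the positions of its head and tail so that prefixes taken from different positions cannot collide. All names are hypothetical.

```python
def split_bits(bits: str, n: int) -> list:
    """Cut a bit string into n consecutive segments, left to right, as evenly as possible."""
    base, extra = divmod(len(bits), n)
    segments, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        segments.append(bits[pos:pos + size])
        pos += size
    return segments


def signature_prefixes(signature: int, width: int = 64, segments: int = 4) -> list:
    """Build head+tail signature prefixes from a two-level segmentation."""
    if signature < 0 or signature >= 2 ** width:
        raise ValueError("signature does not fit in the given width")
    bits = format(signature, f"0{width}b")
    heads = split_bits(bits, segments)  # first-level segmentation
    prefixes = []
    for i, head in enumerate(heads):
        # Remaining bits: everything in the signature except this prefix head.
        rest = "".join(h for j, h in enumerate(heads) if j != i)
        for k, tail in enumerate(split_bits(rest, segments)):  # second level
            prefixes.append(f"{i}.{k}:{head}{tail}")
    return prefixes
```

Under these assumptions a 64-bit signature yields 16 prefixes of 28 bits each (a 16-bit head plus a 12-bit tail), which happens to match the 28-bit prefix length used in the earlier numerical example.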
In an optional embodiment, the first storage module 43 is specifically configured to:
with the signature as the primary key, join the signature prefixes with a separator, store them in a single field, and set an index for the field (optionally, the index can be the signature field, but it is not limited to this); or
with each signature prefix as a key, append the signature to the list corresponding to the key; or
with each signature prefix as a key, update the signature into the value corresponding to the key.
The data segmentation device provided by this embodiment can segment the signature of data to obtain signature prefixes and store the signature prefixes in association with the signature, which provides the basis for subsequent duplicate detection based on signature prefixes.
The internal functions and structure of the data segmentation device are described above. As shown in Fig. 5, in practice the data segmentation device can be implemented as an electronic device including a memory 51 and a processor 52.
The memory 51 is configured to store a program.
In addition to the program, the memory 51 can also be configured to store various other data to support operation on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and so on.
The memory 51 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 52 is coupled to the memory 51 and executes the program stored in the memory 51 in order to:
obtain the signature of pending data; perform at least one level of segmentation on the signature to obtain at least two signature prefixes; and store each of the at least two signature prefixes in association with the signature.
In an optional embodiment, when performing at least one level of segmentation on the signature, the processor 52 is specifically configured to determine the number of segments according to the Hamming-distance similarity threshold, and to perform at least one level of segmentation on the signature according to the number of segments, so as to obtain the at least two signature prefixes.
Further optionally, when performing at least one level of segmentation on the signature according to the number of segments, the processor 52 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head, to obtain at least two prefix tails corresponding to that prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails, each combination forming one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal segmentation or sequential segmentation. Optionally, sequential segmentation can proceed in order from left to right.
In an optional embodiment, the processor 52 can store each signature prefix in association with the signature in the memory 51. Correspondingly, the memory 51 is also configured to store each signature prefix in association with the signature.
In an optional embodiment, the processor 52 can store each signature prefix in association with the signature in an external database. For example, the processor 52 can store them in a traditional database; in that case the processor 52 can, with the signature as the primary key, join the signature prefixes with a separator, store them in a single field, and set an index for the field. As another example, the processor 52 can store them in a KV database that supports lists; in that case the processor 52 can use each signature prefix as a key and append the signature to the list held as the value of that key. As a further example, the processor 52 can store them in a KV database that does not support lists; in that case the processor 52 can use each signature prefix as a key and update the signature into the value corresponding to that key.
Further, as shown in Fig. 5, the electronic device also includes other components such as a communication component 53, a power supply component 54, an audio component 55, and a display 56. Fig. 5 shows only some components schematically; this does not mean that the electronic device includes only the components shown in Fig. 5.
The communication component 53 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or a combination thereof. In an exemplary embodiment, the communication component 53 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 53 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
Based on the communication component 53, the processor 52 can store each signature prefix in association with the signature in an external database through the communication component 53.
The power supply component 54 supplies power to the various components of the electronic device. The power supply component 54 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 55 is configured to output and/or input audio signals. For example, the audio component 55 includes a microphone (MIC); when the electronic device is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals can be further stored in the memory 51 or sent via the communication component 53. In some embodiments, the audio component 55 further includes a speaker for outputting audio signals.
The display 56 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.
Fig. 6 is a schematic structural diagram of a data duplicate-detection device provided by another embodiment of the present application. As shown in Fig. 6, the device includes a second acquisition module 61, a second segmentation module 62, and a duplicate-detection module 63.
The second acquisition module 61 is configured to obtain the signature of pending data.
The second segmentation module 62 is configured to perform at least one level of segmentation on the signature to obtain at least two signature prefixes.
The duplicate-detection module 63 is configured to perform duplicate detection on the pending data according to the at least two signature prefixes.
In an optional embodiment, the second acquisition module 61 is specifically configured as follows: if the signature of the pending data already exists, it accesses the corresponding storage device directly to obtain the signature; if the signature of the pending data does not yet exist, it generates the signature according to the signature-generation logic.
In an optional embodiment, the second segmentation module 62 is specifically configured to determine the number of segments according to the Hamming-distance similarity threshold, and to perform at least one level of segmentation on the signature according to the number of segments, so as to obtain the at least two signature prefixes.
Further optionally, the second segmentation module 62 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head, to obtain at least two prefix tails corresponding to that prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails, each combination forming one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal segmentation or sequential segmentation. Optionally, sequential segmentation can proceed in order from left to right.
In an optional embodiment, the duplicate-detection module 63 is specifically configured to: query the historical signature prefixes using the at least two signature prefixes as query conditions; and, if no historical signature prefix identical to any of the at least two signature prefixes is found, determine that the pending data is non-duplicate data.
Further, the duplicate-detection module 63 is also configured to: if a historical signature prefix identical to any of the at least two signature prefixes is found, compare the signature for similarity with the historical signature corresponding to the matched historical signature prefix; and, if the signature is not similar to the historical signature, determine that the pending data is non-duplicate data.
Further, the duplicate-detection module 63 is also configured to: if the signature is similar to the historical signature, determine that the pending data is duplicate data.
Further, the data duplicate-detection device also includes a second storage module 64, configured to store the pending data, each signature prefix, and the signature in association with one another when the pending data is determined to be non-duplicate data.
Optionally, each signature prefix of the pending data can be stored together with the signature in a traditional database. In that case the second storage module 64 can, with the signature as the primary key, join the signature prefixes with a separator and store them in a single field; the field can be named extend, and an index is set for the field. Optionally, the index can be the signature field, but it is not limited to this.
Optionally, each signature prefix of the pending data can be stored together with the signature in a KV database that supports lists. In that case the second storage module 64 can use each signature prefix as a key in the KV database and the signature as the value, appended to the list held as the value of that key.
Optionally, each signature prefix of the pending data can be stored together with the signature in a KV database that does not support lists. In that case the second storage module 64 can use each signature prefix as a key and update the signature into the value corresponding to that key.
Further, the duplicate-detection module 63 is also configured to discard the pending data, each signature prefix, and the signature when the pending data is determined to be duplicate data.
The data duplicate-detection device provided by this embodiment segments the signature and performs duplicate detection on the basis of signature prefixes. On the one hand, because the signature prefixes serve as an index to the signatures, fewer queries have to be made against the massive set of signatures in the database; on the other hand, because a signature prefix has fewer bits than the signature, relatively few signatures correspond to each prefix, which narrows the range of data to be screened. The time efficiency of duplicate detection is therefore improved.
The internal functions and structure of the data duplicate-detection device are described above. As shown in Fig. 7, in practice the data duplicate-detection device can be implemented as an electronic device including a memory 71 and a processor 72.
The memory 71 is configured to store a program.
In addition to the program, the memory 71 can also be configured to store various other data to support operation on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and so on.
The memory 71 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 72 is coupled to the memory 71 and executes the program in order to:
obtain the signature of pending data; perform at least one level of segmentation on the signature to obtain at least two signature prefixes; and perform duplicate detection on the pending data according to the at least two signature prefixes.
In an optional embodiment, when obtaining the signature of the pending data, the processor 72 is specifically configured as follows: if the signature of the pending data already exists, it accesses the corresponding storage device directly to obtain the signature; if the signature of the pending data does not yet exist, it generates the signature according to the signature-generation logic.
In an optional embodiment, when performing at least one level of segmentation on the signature, the processor 72 is specifically configured to determine the number of segments according to the Hamming-distance similarity threshold, and to perform at least one level of segmentation on the signature according to the number of segments, so as to obtain the at least two signature prefixes.
Further, when performing at least one level of segmentation on the signature according to the number of segments, the processor 72 is specifically configured to: perform first-level segmentation on the signature according to the number of segments to obtain at least two prefix heads; for each of the at least two prefix heads, perform second-level segmentation, according to the number of segments, on the bits of the signature other than that prefix head, to obtain at least two prefix tails corresponding to that prefix head; and, for each prefix head, combine the prefix head with each of its at least two corresponding prefix tails, each combination forming one of the at least two signature prefixes.
It is worth noting that the segmentation above is not required to be equal segmentation or sequential segmentation. Optionally, sequential segmentation can proceed in order from left to right.
In an optional embodiment, when performing duplicate detection on the pending data, the processor 72 is specifically configured to: query the historical signature prefixes using the at least two signature prefixes as query conditions; and, if no historical signature prefix identical to any of the at least two signature prefixes is found, determine that the pending data is non-duplicate data.
Further, the processor 72 is also configured to: if a historical signature prefix identical to any of the at least two signature prefixes is found, compare the signature for similarity with the historical signature corresponding to the matched historical signature prefix; and, if the signature is not similar to the historical signature, determine that the pending data is non-duplicate data.
Further, the processor 72 is also configured to: if the signature is similar to the historical signature, determine that the pending data is duplicate data.
Further, the processor 72 is also configured to store the pending data, each signature prefix, and the signature in association with one another when the pending data is determined to be non-duplicate data.
Optionally, each signature prefix of the pending data can be stored together with the signature in a traditional database. In that case the processor 72 can, with the signature as the primary key, join the signature prefixes with a separator and store them in a single field; the field can be named extend, and an index is set for the field. Optionally, the index can be the signature field, but it is not limited to this.
Optionally, each signature prefix of the pending data can be stored together with the signature in a KV database that supports lists. In that case the processor 72 can use each signature prefix as a key in the KV database and the signature as the value, appended to the list held as the value of that key.
Optionally, each signature prefix of the pending data can be stored together with the signature in a KV database that does not support lists. In that case the processor 72 can use each signature prefix as a key and update the signature into the value corresponding to that key.
Further, the processor 72 is also configured to discard the pending data, each signature prefix, and the signature when the pending data is determined to be duplicate data.
Further, as shown in Fig. 7, the electronic device also includes other components such as a communication component 73, a power supply component 74, an audio component 75, and a display 76. Fig. 7 shows only some components schematically; this does not mean that the electronic device includes only the components shown in Fig. 7.
The communication component 73 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or a combination thereof. In an exemplary embodiment, the communication component 73 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 73 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component 74 supplies power to the various components of the electronic device. The power supply component 74 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 75 is configured to output and/or input audio signals. For example, the audio component 75 includes a microphone (MIC); when the electronic device is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals can be further stored in the memory 71 or sent via the communication component 73. In some embodiments, the audio component 75 further includes a speaker for outputting audio signals.
The display 76 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.
A person of ordinary skill in the art will understand that all or some of the steps of the method embodiments above can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments above; the storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, and optical discs.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
Claims (14)
1. A data segmentation method, characterized by comprising:
obtaining the signature of pending data;
performing at least one level of segmentation on the signature to obtain at least two signature prefixes;
storing each of the at least two signature prefixes in association with the signature.
2. The method according to claim 1, characterized in that performing at least one level of segmentation on the signature to obtain at least two signature prefixes comprises:
determining the number of segments according to the Hamming-distance similarity threshold;
performing at least one level of segmentation on the signature according to the number of segments, to obtain the at least two signature prefixes.
3. The method according to claim 2, characterized in that performing at least one level of segmentation on the signature according to the number of segments, to obtain the at least two signature prefixes, comprises:
performing first-level segmentation on the signature according to the number of segments, to obtain at least two prefix heads;
for each of the at least two prefix heads, performing second-level segmentation, according to the number of segments, on the bits of the signature other than the prefix head, to obtain at least two prefix tails corresponding to the prefix head;
for each prefix head, combining the prefix head with each of the at least two prefix tails corresponding to the prefix head, each combination forming one of the at least two signature prefixes.
4. The method according to any one of claims 1-3, characterized in that storing each of the at least two signature prefixes in association with the signature comprises:
with the signature as the primary key, joining the signature prefixes with a separator, storing them in a single field, and setting an index for the field; or
with each signature prefix as a key, appending the signature to the list corresponding to the key; or
with each signature prefix as a key, updating the signature into the value corresponding to the key.
5. A data duplicate-detection method, characterized by comprising:
obtaining the signature of pending data;
performing at least one level of segmentation on the signature to obtain at least two signature prefixes;
performing duplicate detection on the pending data according to the at least two signature prefixes.
6. The method according to claim 5, characterized in that performing duplicate detection on the pending data according to the at least two signature prefixes comprises:
querying the historical signature prefixes using the at least two signature prefixes as query conditions;
if no historical signature prefix identical to any of the at least two signature prefixes is found, determining that the pending data is non-duplicate data.
7. The method according to claim 6, characterized by further comprising:
if a historical signature prefix identical to any of the at least two signature prefixes is found, comparing the signature for similarity with the historical signature corresponding to the matched historical signature prefix;
if the signature is not similar to the historical signature, determining that the pending data is non-duplicate data.
8. The method according to claim 7, characterized by further comprising:
if the signature is similar to the historical signature, determining that the pending data is duplicate data.
9. The method according to claim 6 or 7, characterized by further comprising:
storing the pending data, each signature prefix, and the signature in association with one another.
10. a kind of electronic equipment, which is characterized in that including:
Memory, for storing program;
Processor, for executing described program, for:
Obtain the signature of pending data;
At least level-one cutting is carried out to the signature, to obtain at least two signature prefixes;
Each of the corresponding storage at least two signatures prefix signature prefix and the signature.
11. electronic equipment according to claim 10, which is characterized in that the processor carries out at least to the signature
When level-one cutting, it is specifically used for:
According to the similar threshold value of Hamming distance, cutting hop count is determined;
According to the cutting hop count, at least level-one cutting is carried out to the signature, to obtain at least two signatures prefix.
12. a kind of electronic equipment, which is characterized in that including:
Memory, for storing program;
Processor, for executing described program, for:
Obtain the signature of pending data;
At least level-one cutting is carried out to the signature, to obtain at least two signature prefixes;
According at least two signatures prefix, the pending data is carried out to sentence weight.
13. The electronic device according to claim 12, characterized in that, when performing duplicate detection on the pending data, the processor is specifically configured to:
Query the history signature prefixes, using the at least two signature prefixes as query conditions;
If no history signature prefix identical to any one of the at least two signature prefixes is found, determine that the pending data is non-duplicate data.
14. The electronic device according to claim 13, characterized in that the processor is further configured to:
If the pending data is determined to be non-duplicate data, store the pending data, each signature prefix, and the signature in association with one another.
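The flow recited in the claims above (segment the signature into prefixes, look the prefixes up among the history signature prefixes, compare signatures on a hit, store on a miss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: the 64-bit signature width, the choice of threshold + 1 segments (a pigeonhole argument), the representation of a signature prefix as a (segment position, segment bits) pair, and all names (`segment_count`, `signature_prefixes`, `DedupIndex`) are hypothetical. Only single-level segmentation is shown, and only signatures are stored, not the pending data itself.

```python
from collections import defaultdict

SIG_BITS = 64  # assumed signature width, e.g. a 64-bit SimHash value


def segment_count(threshold):
    """Assumed rule: with at most `threshold` differing bits, at least one of
    threshold + 1 segments must be identical (pigeonhole principle)."""
    return threshold + 1


def signature_prefixes(signature, segments, bits=SIG_BITS):
    """Cut the signature into `segments` contiguous pieces, most significant
    bits first, and tag each piece with its position."""
    base, extra = divmod(bits, segments)
    prefixes = []
    shift = bits
    for i in range(segments):
        width = base + (1 if i < extra else 0)  # spread leftover bits over the first pieces
        shift -= width
        piece = (signature >> shift) & ((1 << width) - 1)
        prefixes.append((i, piece))
    return prefixes


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


class DedupIndex:
    """In-memory stand-in for the store of history signature prefixes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.segments = segment_count(threshold)
        self.by_prefix = defaultdict(set)  # prefix -> history signatures

    def is_duplicate(self, signature):
        # Query every prefix; only signatures sharing a prefix are compared in full.
        for prefix in signature_prefixes(signature, self.segments):
            for old in self.by_prefix.get(prefix, ()):
                if hamming(signature, old) <= self.threshold:
                    return True
        return False

    def check_and_store(self, signature):
        """Return True for a duplicate; otherwise store the signature under
        each of its prefixes and return False."""
        if self.is_duplicate(signature):
            return True
        for prefix in signature_prefixes(signature, self.segments):
            self.by_prefix[prefix].add(signature)
        return False


if __name__ == "__main__":
    index = DedupIndex(threshold=3)
    first = 0x0123456789ABCDEF
    print(index.check_and_store(first))          # first sighting
    print(index.check_and_store(first ^ 0b101))  # differs in 2 bits
```

In this sketch a miss on every prefix settles the question without any full signature comparison, which is where the time saving over comparing against every history signature would come from.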
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611139984.7A CN108614827A (en) | 2016-12-12 | 2016-12-12 | Data segmentation method, judging method and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611139984.7A CN108614827A (en) | 2016-12-12 | 2016-12-12 | Data segmentation method, judging method and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108614827A (en) | 2018-10-02 |
Family
ID=63643314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611139984.7A Pending CN108614827A (en) | 2016-12-12 | 2016-12-12 | Data segmentation method, judging method and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108614827A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162752A (en) * | 2019-05-13 | 2019-08-23 | 百度在线网络技术(北京)有限公司 | Article duplicate-detection processing method, device and electronic equipment |
CN111143393A (en) * | 2018-11-03 | 2020-05-12 | 广州市明领信息科技有限公司 | Big data processing system |
CN113194030A (en) * | 2021-05-06 | 2021-07-30 | 中国人民解放军国防科技大学 | Multipath message forwarding method based on network prefix segmentation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156689A (en) * | 2011-03-31 | 2011-08-17 | 百度在线网络技术(北京)有限公司 | Method and device for detecting document |
CN102609442A (en) * | 2010-12-28 | 2012-07-25 | 微软公司 | Adaptive Index for Data Deduplication |
CN103345449A (en) * | 2013-06-19 | 2013-10-09 | 暨南大学 | Method and system for prefetching fingerprints oriented to data de-duplication technology |
CN103793522A (en) * | 2008-10-20 | 2014-05-14 | 王强 | Method and system for rapidly scanning feature codes |
CN104391915A (en) * | 2014-11-19 | 2015-03-04 | 湖南国科微电子有限公司 | Duplicated data delete method |
US9135674B1 (en) * | 2007-06-19 | 2015-09-15 | Google Inc. | Endpoint based video fingerprinting |
- 2016-12-12: application CN201611139984.7A filed in CN (publication CN108614827A; status: active, Pending)
Non-Patent Citations (1)
Title |
---|
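Chen Chunling et al.: "Research and improvement of data deduplication technology based on the Simhash algorithm", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) * |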
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143393A (en) * | 2018-11-03 | 2020-05-12 | 广州市明领信息科技有限公司 | Big data processing system |
CN110162752A (en) * | 2019-05-13 | 2019-08-23 | 百度在线网络技术(北京)有限公司 | Article duplicate-detection processing method, device and electronic equipment |
CN110162752B (en) * | 2019-05-13 | 2023-06-27 | 百度在线网络技术(北京)有限公司 | Article duplicate-detection processing method and device and electronic equipment |
CN113194030A (en) * | 2021-05-06 | 2021-07-30 | 中国人民解放军国防科技大学 | Multipath message forwarding method based on network prefix segmentation |
CN113194030B (en) * | 2021-05-06 | 2021-12-24 | 中国人民解放军国防科技大学 | Multipath message forwarding method based on network prefix segmentation |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US10565244B2 (en) | System and method for text categorization and sentiment analysis | |
CN105095195B (en) | Man-machine question answering method and system based on knowledge graph | |
US9904694B2 (en) | NoSQL relational database (RDB) data movement | |
CN109670163B (en) | Information identification method, information recommendation method, template construction method and computing device | |
US9104979B2 (en) | Entity recognition using probabilities for out-of-collection data | |
WO2019074975A2 (en) | Data processing method, apparatus and electronic device | |
US20160196277A1 (en) | Data record compression with progressive and/or selective decompression | |
CN105630847B (en) | Data storage method, data query method, apparatus and system | |
US20170322930A1 (en) | Document based query and information retrieval systems and methods | |
CN107885874A (en) | Data query method and apparatus, computer equipment and computer-readable recording medium | |
CN110929125B (en) | Search recall method, device, equipment and storage medium thereof | |
US11023452B2 (en) | Data dictionary with a reduced need for rebuilding | |
Dreßler et al. | On the efficient execution of bounded jaro-winkler distances | |
CN108614827A (en) | Data segmentation method, judging method and electronic equipment | |
CN105678129B (en) | A kind of method and apparatus of determining subscriber identity information | |
CN108572789A (en) | Disk storage method and apparatus, information push method and device and electronic equipment | |
WO2023024975A1 (en) | Text processing method and apparatus, and electronic device | |
CN109446337B (en) | Knowledge graph construction method and device | |
WO2020172649A1 (en) | System and method for text categorization and sentiment analysis | |
CN113032420A (en) | Data query method and device and server | |
US20190012310A1 (en) | Method and device for providing notes by using artificial intelligence-based correlation calculation | |
Varol et al. | Detecting near-duplicate text documents with a hybrid approach | |
CN101374307A (en) | Method and apparatus for updating digital content information of mobile equipment | |
US11669555B2 (en) | System and method of creating index | |
KR101589626B1 (en) | Method for establishing start-up data or management data from big data based on lexico semantic pattern analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||