Summary of the invention
To solve the technical problems described above, the object of the present invention is to provide a method for mining outliers (isolated points) in RFID data that offers both high accuracy and high data-processing efficiency.
The technical solution adopted by the present invention is a method for mining outliers in RFID data, the method comprising the following steps:
A. obtaining original RFID data from middleware;
B. clustering the obtained original RFID data and then compressing it;
C. using the RFID read features, in triple form, as the read-feature vector space of the RFID data points, so that outlier mining is performed on the compressed original RFID data;
D. processing the RFID data points with a weight-based reverse nearest-neighbor algorithm, and then outputting the RFID data outliers.
Further, in step B, the obtained original RFID data is clustered using a hash table.
Further, step B comprises:
B1. clustering the obtained original RFID data using a hash table;
B2. determining, from the original RFID data, whether the tag already exists in the hash table; if it does, incrementing the read count of the tag, updating its most recent read time, and calculating its current read signal strength; if it does not, inserting the tag into the hash table and recording its first read record together with the read signal strength and read time of the tag.
Further, the RFID read features in step C comprise the read time interval of the tag, the number of reads of the tag within the read time interval, and the average read signal strength of the tag.
Further, step D comprises:
D1. calculating the pairwise distances between the RFID data points, and then generating a distance matrix;
D2. calculating the K-nearest-neighbor set of each RFID data point by the K-nearest-neighbor method, according to the distance matrix and a preset value of K;
D3. calculating the average distance of each RFID data point;
D4. selecting the RFID data point with the largest average distance, and then calculating the density weight of each RFID data point from that largest average distance;
D5. calculating the reverse K-nearest-neighbor set of each RFID data point by the reverse K-nearest-neighbor method, the reverse K-nearest-neighbor set of each RFID data point consisting of the K-nearest-neighbor sets that contain that RFID data point;
D6. calculating the isolation coefficient of each RFID data point, sorting the isolation coefficients, and then outputting the RFID data outliers according to a preset percentage.
Further, step D2, in which the K-nearest-neighbor set of each RFID data point is calculated by the K-nearest-neighbor method according to the distance matrix and the preset value of K, is specifically as follows:
According to the distance matrix and the preset value of K, the K RFID data points nearest to each RFID data point are determined, and these K RFID data points form its K-nearest-neighbor set.
Further, step D3, in which the average distance of each RFID data point is calculated, is specifically: for each RFID data point, calculating the mean of the distances between that point and all RFID data points in its K-nearest-neighbor set.
Further, the density weight of an RFID data point in step D4 is calculated by the following formula:
[Formula not reproduced in the source text.]
where $KNN_{maxdist}$ denotes the largest of the average distances of the RFID data points, and $KNN_{dist}$ denotes the average distance of the RFID data point.
Further, the isolation coefficient of each RFID data point in step D6 is calculated by the following formula:
[Formula not reproduced in the source text.]
The denominator of this formula is the sum of the density weights of all elements in the reverse K-nearest-neighbor set of the RFID data point, where m denotes the number of elements in that reverse K-nearest-neighbor set.
Further, the number of outliers output in step D6 is the n*pct% RFID data points with the highest isolation coefficients, where n denotes the total number of RFID data points and pct% denotes the preset percentage.
The beneficial effects of the invention are as follows: clustering and compressing the RFID data before mining for outliers greatly reduces the data scale and improves data-processing efficiency; and the weight-based reverse nearest-neighbor algorithm assigns each RFID data point a weight that characterizes the density of the region in which it lies, which greatly improves the accuracy of identifying RFID data outliers, with particularly good detection of boundary points.
Embodiment
As shown in Figure 1, a method for mining outliers (isolated points) in RFID data comprises the following steps:
A. obtaining original RFID data from middleware;
B. clustering the obtained original RFID data and then compressing it;
C. using the RFID read features, in triple form, as the read-feature vector space of the RFID data points, so that outlier mining is performed on the compressed original RFID data;
D. processing the RFID data points with a weight-based reverse nearest-neighbor algorithm, and then outputting the RFID data outliers.
As a further preferred embodiment, in step B the obtained original RFID data is clustered using a hash table.
As shown in Figure 2, a method for mining outliers in RFID data comprises the following steps:
A. obtaining original RFID data from middleware, the original RFID data being in the four-tuple form <epc, location, time, ss>, where epc denotes the tag ID, location denotes the position at which the tag was read (that is, the reader position information), time denotes the time at which the tag was read, and ss denotes the read signal strength of the tag;
B1. clustering the obtained original RFID data using hash tables. There are multiple hash tables, and each hash table represents a different class in the four-tuple form <epc, location, t_star, t_end>, where t_star denotes the start of the tag read time and t_end denotes its end. In this embodiment, whenever the reader position information location of an original RFID record matches the location of a hash table, and the read time of the record satisfies t_star ≤ time ≤ t_end, the record is assigned to that hash table. Clustering all original RFID data in this way yields multiple data-set clustering results, each of which is the result of one hash table clustering the original RFID data;
B2. determining, from the original RFID data, whether the tag already exists in the hash table; if it does, incrementing the read count of the tag, updating its most recent read time, and calculating its current read signal strength; if it does not, inserting the tag into the hash table and recording its first read record together with the read signal strength and read time of the tag;
C. using the RFID read features, in triple form, as the read-feature vector space of the RFID data points, so that outlier mining is performed on the compressed original RFID data;
D1. calculating the pairwise distances between the RFID data points, and then generating a distance matrix;
D2. calculating the K-nearest-neighbor set of each RFID data point by the K-nearest-neighbor method, according to the distance matrix and a preset value of K;
D3. calculating the average distance of each RFID data point;
D4. selecting the RFID data point with the largest average distance, and then calculating the density weight of each RFID data point from that largest average distance;
D5. calculating the reverse K-nearest-neighbor set of each RFID data point by the reverse K-nearest-neighbor method, the reverse K-nearest-neighbor set of each RFID data point consisting of the K-nearest-neighbor sets that contain that RFID data point;
D6. calculating the isolation coefficient of each RFID data point, sorting the isolation coefficients, and then outputting the RFID data outliers according to a preset percentage.
The RFID read features in step C comprise the read time interval of the tag, the number of reads of the tag within the read time interval, and the average read signal strength of the tag. The compressed original RFID data is in the four-tuple form <epc, time_duration, tagcnt, ss_avg>, where time_duration denotes the read time interval of the tag, tagcnt denotes the number of reads of the tag within the read time interval, and ss_avg denotes the average read signal strength of the tag. The triple <time_duration, tagcnt, ss_avg> represents the RFID read features of the original RFID data and serves as the read-feature vector space in which the RFID data points are mined for outliers; the read-feature vector space is formed from the RFID read features.
However, the RFID read features differ in both value range and unit, so processing the RFID data points directly would distort the output of RFID data outliers. To obtain a better mining result and eliminate the influence of the measurement units, the RFID data points must therefore be standardized before outlier mining is carried out.
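The clustering and compression of steps B1 and B2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Read`, `compress`, and `windows` are hypothetical, and it assumes that time_duration is the span between the first and last read of a tag, that ss_avg is the arithmetic mean of the read signal strengths, and that each read is assigned to at most one hash table.

```python
from collections import namedtuple

# One raw read from the middleware: <epc, location, time, ss>.
Read = namedtuple("Read", "epc location time ss")


def compress(reads, windows):
    """Cluster raw reads into hash tables and compress each tag to one record.

    windows: list of (location, t_star, t_end), one per hash table.
    Returns, per hash table, a list of (epc, time_duration, tagcnt, ss_avg).
    """
    tables = [{} for _ in windows]
    for r in reads:
        for table, (location, t_star, t_end) in zip(tables, windows):
            # B1: same reader location and read time inside the window.
            if r.location == location and t_star <= r.time <= t_end:
                entry = table.get(r.epc)
                if entry is None:
                    # B2, tag not yet present: insert it with its first read.
                    table[r.epc] = {"first": r.time, "last": r.time,
                                    "cnt": 1, "ss_sum": r.ss}
                else:
                    # B2, tag already present: update count, times, signal.
                    entry["cnt"] += 1
                    entry["first"] = min(entry["first"], r.time)
                    entry["last"] = max(entry["last"], r.time)
                    entry["ss_sum"] += r.ss
                break  # assumption: a read belongs to at most one hash table
    return [
        [(epc, e["last"] - e["first"], e["cnt"], e["ss_sum"] / e["cnt"])
         for epc, e in table.items()]
        for table in tables
    ]
```

Because each tag collapses to a single record per hash table, the compression gain grows with the number of repeated reads of the same tag.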
The triple <time_duration, tagcnt, ss_avg> that represents the RFID read features of an RFID data point is abbreviated as $x_i = (x_{i1}, x_{i2}, x_{i3})$, where $x_i$ denotes the i-th RFID data point, $x_{i1}$ denotes the read time interval of the tag, $x_{i2}$ denotes the number of reads of the tag within the read time interval, and $x_{i3}$ denotes the average read signal strength of the tag. The set of RFID data points can therefore be represented by an n × 3 matrix:
$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & x_{n3} \end{pmatrix}$$
The RFID data points are standardized by the following formula:
[Formula not reproduced in the source text.]
where $y_{ij}$ denotes the standardized value, j denotes the j-th read feature, and $m_j$ is the mean of the j-th read feature, $m_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$. Standardizing the RFID data points therefore yields a matrix of standardized RFID data points, $Y = (y_{ij})_{n \times 3}$.
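As an illustration of the standardization step, the sketch below assumes a z-score: each feature is centered on its mean and divided by its population standard deviation. Only the mean of each feature is defined in the text; the choice of scale term is an assumption (a mean absolute deviation would be an equally plausible choice), and the function name `standardize` is hypothetical.

```python
import math


def standardize(points):
    """Return the standardized n x 3 matrix Y for a list of feature triples.

    Assumption: y_ij = (x_ij - m_j) / s_j, with s_j the population standard
    deviation of feature j. A constant feature is mapped to 0.0.
    """
    n = len(points)
    dims = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dims)]
    stds = [math.sqrt(sum((p[j] - means[j]) ** 2 for p in points) / n)
            for j in range(dims)]
    return [
        tuple((p[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
              for j in range(dims))
        for p in points
    ]
```

After this step every feature has mean 0 and comparable spread, so no single feature (for example, the read count) dominates the Euclidean distance.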
Step D1 calculates the pairwise distances between the RFID data points and then generates the distance matrix. After standardization, the set of RFID data points is represented by $Y = (y_{ij})_{n \times 3}$, and the pairwise distance between RFID data points is the Euclidean distance, so the distance between two RFID data points is calculated as follows:
$$dist(i, j) = \sqrt{\sum_{k=1}^{m} (y_{ik} - y_{jk})^2}$$
where i and j denote any two RFID data points, k denotes the k-th RFID read feature, and m is 3. In addition, the distance dist(i, j) from RFID data point i to RFID data point j is zero for a point and itself and is symmetric, that is, dist(i, i) = 0 and dist(i, j) = dist(j, i), so only half of the entries need to be computed when building the distance matrix. The resulting distance matrix is:
$$D = \begin{pmatrix} 0 & dist(1,2) & \cdots & dist(1,n) \\ dist(2,1) & 0 & \cdots & dist(2,n) \\ \vdots & \vdots & \ddots & \vdots \\ dist(n,1) & dist(n,2) & \cdots & 0 \end{pmatrix}$$
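A minimal sketch of step D1, using the symmetry dist(i, j) = dist(j, i) and dist(i, i) = 0 so that only the upper triangle is actually computed. The function name `distance_matrix` is hypothetical.

```python
import math


def distance_matrix(Y):
    """Build the full n x n Euclidean distance matrix from standardized points."""
    n = len(Y)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays 0: dist(i, i) = 0
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
            D[i][j] = D[j][i] = d  # mirror: dist(i, j) = dist(j, i)
    return D
```

The matrix still takes O(n²) memory, which is one reason the compression in step B matters for large data sets.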
The K-nearest-neighbor method used in step D2 is defined as follows: for a given positive integer K and a data point set DataSet, and for any point p ∈ DataSet, the distances from p to the other points in DataSet are calculated, and the K points nearest to p (excluding p itself) are selected to form the K-nearest-neighbor set of p. Step D2, in which the K-nearest-neighbor set of each RFID data point is calculated by the K-nearest-neighbor method according to the distance matrix and the preset value of K, is therefore specifically as follows:
According to the distance matrix and the preset value of K, the K RFID data points nearest to each RFID data point are determined, and these K RFID data points form its K-nearest-neighbor set. In other words, each RFID data point has one K-nearest-neighbor set consisting of the K RFID data points nearest to it, and this set does not contain the RFID data point itself.
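The K-nearest-neighbor sets of step D2 can be read directly off the distance matrix. In the sketch below, the function name `knn_sets` is hypothetical, and ties between equally distant points are broken by the lower index, which the text does not specify.

```python
def knn_sets(D, k):
    """For each point p, return the indices of its k nearest points (p excluded)."""
    n = len(D)
    result = []
    for p in range(n):
        # Sort all other points by distance to p; sorted() is stable, so
        # equally distant points keep index order (an assumed tie-break).
        others = sorted((q for q in range(n) if q != p), key=lambda q: D[p][q])
        result.append(others[:k])
    return result
```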
In step D3, the average distance of each RFID data point is calculated; specifically, for each RFID data point, the mean of the distances between that point and all RFID data points in its K-nearest-neighbor set is calculated, using the following formula:
$$KNN_{dist}(p) = \frac{1}{k}\sum_{i=1}^{k} dist(p, q_i)$$
where p denotes any RFID data point, $q_i$ denotes an RFID data point in the K-nearest-neighbor set of p, and k is the number of RFID data points in the K-nearest-neighbor set of p.
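Step D3 then reduces to one average per point. A small sketch, with the hypothetical function name `knn_mean_dist`, taking the distance matrix and the K-nearest-neighbor index lists as inputs:

```python
def knn_mean_dist(D, neighbors):
    """Average distance from each point p to the points in its KNN set."""
    return [
        sum(D[p][q] for q in nbrs) / len(nbrs)
        for p, nbrs in enumerate(neighbors)
    ]


# Four one-dimensional points at 0, 1, 2 and 10, with K = 2.
xs = [0, 1, 2, 10]
D = [[abs(a - b) for b in xs] for a in xs]
neighbors = [[1, 2], [0, 2], [1, 0], [2, 1]]
mean_dist = knn_mean_dist(D, neighbors)
```

In this toy data set the point at 10 has by far the largest average distance, which is the quantity step D4 uses as its reference value.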
In step D4, the density weight of an RFID data point is calculated by the following formula:
[Formula not reproduced in the source text.]
where $KNN_{maxdist}$ denotes the largest of the average distances of the RFID data points, and $KNN_{dist}$ denotes the average distance of the RFID data point.
In step D5, the reverse K-nearest-neighbor set of each RFID data point is calculated by the reverse K-nearest-neighbor method, the reverse K-nearest-neighbor set of each RFID data point consisting of the K-nearest-neighbor sets that contain that RFID data point. The reverse K-nearest-neighbor method is defined as follows: for a given positive integer K and a data point set DataSet, and for any point p ∈ DataSet, the reverse K-nearest-neighbor set of p consists of those K-nearest-neighbor sets that include p. For example, to find the reverse K-nearest-neighbor set of RFID data point p, suppose that the RFID data points have N K-nearest-neighbor sets in total and that M of these N sets contain p. The reverse K-nearest-neighbor set of p then consists of these M K-nearest-neighbor sets; that is, each K-nearest-neighbor set containing p is one element of the reverse K-nearest-neighbor set of p.
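The reverse sets of step D5 can be obtained by inverting the K-nearest-neighbor lists. In the sketch below, each K-nearest-neighbor set is identified by the index of the point that owns it, so the reverse set of p is stored as the list of points q whose K-nearest-neighbor set contains p. The function name `rknn_sets` is hypothetical.

```python
def rknn_sets(neighbors):
    """Invert KNN lists: reverse[p] lists every q whose KNN set contains p."""
    reverse = [[] for _ in neighbors]
    for q, nbrs in enumerate(neighbors):
        for p in nbrs:
            reverse[p].append(q)
    return reverse
```

A point that appears in nobody's K-nearest-neighbor set ends up with an empty reverse set, which is typical of an outlier.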
In step D6, the isolation coefficient of each RFID data point is calculated. The isolation coefficient of an RFID data point is determined by the density weights of the elements in its reverse K-nearest-neighbor set, so it is calculated by the following formula:
[Formula not reproduced in the source text.]
The denominator of this formula is the sum of the density weights of all elements in the reverse K-nearest-neighbor set of the RFID data point, where m denotes the number of elements in that reverse K-nearest-neighbor set. Once the isolation coefficient of every RFID data point has been calculated, the coefficients are sorted and the n*pct% RFID data points with the highest isolation coefficients are output as the RFID data outliers, where n denotes the total number of RFID data points and pct% denotes the preset percentage, which can be adjusted to suit the actual situation.
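Steps D4 and D6 can be sketched together, with the caveat that the two formulas are not legible in the text, so the forms used here are assumptions made only for illustration: the density weight is taken as KNN_maxdist / KNN_dist(p), and the isolation coefficient as 1 divided by the sum of the density weights over the reverse K-nearest-neighbor set (the denominator described above). The function names, the treatment of an empty reverse set as an infinite coefficient, and the rounding of n*pct% down to an integer are likewise hypothetical.

```python
def isolation_coefficients(mean_dist, reverse):
    """Isolation coefficient per point, under the assumed formulas.

    mean_dist[p]: average distance of p to its KNN set (step D3).
    reverse[p]:   points whose KNN set contains p (step D5).
    """
    max_dist = max(mean_dist)
    # Assumed density weight: large where the local mean distance is small.
    weights = [max_dist / d if d > 0 else float("inf") for d in mean_dist]
    coeffs = []
    for rknn in reverse:
        total = sum(weights[q] for q in rknn)  # the described denominator
        # Assumed numerator of 1; an empty reverse set is maximally isolated.
        coeffs.append(1.0 / total if total > 0 else float("inf"))
    return coeffs


def top_outliers(coeffs, pct):
    """Indices of the n*pct% points with the highest isolation coefficients."""
    count = int(len(coeffs) * pct / 100)  # assumption: round down
    ranked = sorted(range(len(coeffs)), key=lambda p: coeffs[p], reverse=True)
    return ranked[:count]
```

With these assumptions, a point scores high when few K-nearest-neighbor sets contain it or when the points that do contain it lie in sparse regions.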
Accordingly, when RFID data outliers are mined by the present invention, the read time interval of the tag, the number of reads of the tag within the read time interval, and the average read signal strength of the tag are used as the RFID read features, so the outliers can be mined more accurately. In addition, clustering and compressing the tag data greatly reduces the data scale and improves data-processing efficiency. Furthermore, the weight-based reverse nearest-neighbor algorithm assigns each RFID data point a weight that characterizes the density of the region in which it lies, which greatly improves the accuracy of identifying RFID data outliers, with particularly good detection of boundary points.
Fig. 3 is a schematic diagram of the detection results obtained by simulating the RFID data outlier mining of the method of the present invention in MATLAB. The simulation parameters are as follows: K is 50; three groups of data, each containing 200 tuples, are generated from three Gaussian distributions with μ = 0.25, σ = 0.05; μ = 0.5, σ = 0.05; and μ = 0.75, σ = 0.05, where μ denotes the mean and σ denotes the standard deviation; and the preset percentage is 2%, so outliers are output at a ratio of 2%. In the detection results shown in Fig. 3, the X axis of the read-feature vector space is the read time interval of the tag, the Y axis is the number of reads of the tag within the read time interval, and the Z axis is the average read signal strength of the tag. The method of the present invention finds all of the outliers, 13 in total, which verifies that the method is effective.
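Test data of the kind described for Fig. 3 could be generated as follows. This is a sketch in Python rather than the MATLAB code used for the simulation, and it assumes that σ is the standard deviation and that each of the three features of a point is drawn independently from the same Gaussian; the text does not state how μ and σ map onto the three axes. The function name `generate_groups` and the fixed seed are hypothetical.

```python
import random


def generate_groups(per_group=200, sigma=0.05, means=(0.25, 0.5, 0.75), seed=0):
    """Generate 3-feature points in Gaussian groups centered on each mean."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [
        tuple(rng.gauss(mu, sigma) for _ in range(3))
        for mu in means
        for _ in range(per_group)
    ]


data = generate_groups()
```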
Fig. 4 is a schematic diagram of the compression effect obtained by simulating, in MATLAB, the data compression of the present invention at different data scales. Six groups of data are generated, containing 500, 1000, 5000, 20000, 50000, and 100000 tuples, and the data before and after compression are compared. As shown in Fig. 4, the black vertical bars represent the number of tuples in the original data and the white vertical bars represent the number of tuples after compression. When the original data scale is small, the compression effect is not pronounced; as the data scale increases, the compression effect becomes progressively more apparent, and when the data scale reaches 100000 tuples a compression ratio of roughly 5 times is obtained. The simulation results therefore show that the present invention compresses massive RFID data very effectively.
Fig. 5 is a schematic diagram comparing the accuracy of the reverse K-nearest-neighbor algorithm with that of the weight-based reverse nearest-neighbor algorithm of the present invention at different data scales. Both algorithms are simulated in MATLAB to measure their accuracy in RFID data outlier mining. The simulation parameters are as follows: K is 50; data are generated from three Gaussian distributions with μ = 0.25, σ = 0.05; μ = 0.5, σ = 0.05; and μ = 0.75, σ = 0.05, where μ denotes the mean and σ denotes the standard deviation, with 100, 500, 1000, 2000, and 4000 tuples per group, so that the total data scales are 300, 1500, 3000, 6000, and 12000; and the preset percentage is 2%, so outliers are output at a ratio of 2%. In Fig. 5, WRKNN denotes the weight-based reverse nearest-neighbor algorithm of the present invention and ODRKNN denotes the reverse K-nearest-neighbor algorithm. The accuracy comparison in Fig. 5 shows that the weight-based reverse nearest-neighbor algorithm of the present invention is clearly better than the reverse K-nearest-neighbor algorithm, and its performance is also more stable.
The above describes preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those of ordinary skill in the art may make various equivalent variations or substitutions without departing from the spirit of the present invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of the present application.