Detailed Description
The solutions provided in this specification are described below with reference to the accompanying drawings.
The method for updating samples in a sample library provided by one embodiment of this specification can be applied to the scenario shown in Fig. 1. In Fig. 1, the samples in the sample library may be collected manually in advance from the back-end database of a server. The samples in the sample library may have corresponding sample labels (e.g., 1 and 0). Specifically, if the sample label of a sample is "1", the sample is a positive sample; if the sample label of a sample is "0", the sample is a negative sample.
In Fig. 1, a similarity search system can search the sample library for the samples that are most similar to, or most different from, a current sample.
Of course, in practical applications, the method for updating samples in a sample library provided by the embodiments of this specification may also be applied to other scenarios. This specification does not limit this.
Fig. 2 is a flowchart of the method for updating samples in a sample library provided by one embodiment of this specification. The method may be executed by a device with processing capability, such as a server, a system, or an apparatus. As shown in Fig. 2, the method may specifically include:
Step 210: obtain a first sample to be updated into the sample library.
The first sample here may be any sample that is to be added to the sample library.
Step 220: calculate similarity values between the first sample and multiple preset samples to which sample labels have been added.
In one implementation, calculating the similarity values between the first sample and the multiple labeled preset samples in step 220 may be implemented through the following steps:
Step A: determine the sample feature vector of the first sample, and determine the sample feature vector of each preset sample.
Taking a first sample B as an example, it can be expressed as $B = (b_1, b_2, \ldots, b_n)$, where $b_1, b_2, \ldots, b_n$ are the components of the sample feature vector of B and n denotes the number of those components. In addition, suppose there are m preset samples $X_1, X_2, \ldots, X_m$. The m preset samples can be expressed as:
$$X_1 = (x_{1,1}, x_{1,2}, \ldots, x_{1,n}), \quad X_2 = (x_{2,1}, x_{2,2}, \ldots, x_{2,n}), \quad \ldots, \quad X_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,n})$$
where $x_{1,1}, x_{1,2}, \ldots, x_{1,n}$ are the n feature components of preset sample $X_1$, $x_{2,1}, x_{2,2}, \ldots, x_{2,n}$ are the n feature components of preset sample $X_2$, and so on.
Step B: calculate the distance value between the first sample and each preset sample from the sample feature vector of the first sample and the sample feature vector of each preset sample.
Taking the i-th preset sample $X_i$ as an example, the distance value between the first sample B and $X_i$ can be expressed as:
$$d_i = \|X_i - B\|_2, \quad i = 1, 2, 3, \ldots, m \qquad \text{(Formula 1)}$$
where $d_i$ is the distance value between the first sample and the i-th preset sample, and m is the number of preset samples.
Step C: normalize the distance values.
Step D: determine the similarity values from the normalized distance values.
Taking the i-th distance value as an example, the normalization can be expressed as:
$$sim_i = 1 - d_i / 100, \quad i = 1, 2, \ldots, m \qquad \text{(Formula 2)}$$
where $sim_i$ is the similarity value between the first sample and the i-th preset sample, and $d_i$ is the distance value between the first sample and the i-th preset sample.
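Steps A to D can be illustrated with a minimal sketch. It assumes that Formula 1 denotes the Euclidean (L2) norm and that feature vectors are plain lists of numbers; the function names and the `scale` parameter (defaulting to the constant 100 from Formula 2) are hypothetical and introduced only for illustration.

```python
import math


def euclidean_distance(x, b):
    """Formula 1 (assumed to be the L2 norm): d_i = ||X_i - B||_2."""
    return math.sqrt(sum((xi - bi) ** 2 for xi, bi in zip(x, b)))


def similarity_values(first_sample, preset_samples, scale=100.0):
    """Formula 2: sim_i = 1 - d_i / scale for every preset sample X_i."""
    return [
        1.0 - euclidean_distance(x, first_sample) / scale
        for x in preset_samples
    ]


# Hypothetical data: a first sample B and m = 2 preset samples, n = 3 features.
B = [1.0, 2.0, 2.0]
X = [
    [1.0, 2.0, 2.0],  # identical to B, so the distance is 0
    [4.0, 6.0, 2.0],  # distance sqrt(3^2 + 4^2 + 0^2) = 5
]
sims = similarity_values(B, X)
```

Note that Formula 2 only yields values in [0, 1] when every distance is at most 100; the source does not say how larger distances are handled, so the sketch leaves them unclamped.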
Of course, in practical applications the similarity value between the first sample and each preset sample may also be calculated in other ways. For example, a dissimilarity value between the first sample and a preset sample may be calculated, and the similarity value may then be determined from that dissimilarity value. This specification does not elaborate on this.
Step 230: determine, from the similarity values and multiple thresholds, the prediction labels corresponding to each threshold.
The multiple thresholds here may be set empirically, and they may take different values in different application scenarios.
In addition, the prediction labels are defined in the same way as the sample labels, that is, they may include "0" and "1". This is not described again here.
In one implementation, the similarity value of each preset sample may be compared with a threshold. If the similarity value of a preset sample exceeds the threshold, the prediction label of that preset sample may be determined as "1" (or "0"); otherwise, the prediction label of that preset sample may be determined as "0" (or "1"). It can be understood that the prediction label of every preset sample can be determined in the same way.
It should be noted that which value ("1" or "0") is assigned as the prediction label of a preset sample may be decided according to how the sample labels are assigned.
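The thresholding described above can be sketched as follows. The sketch assumes that a similarity value strictly greater than the threshold maps to the positive label; the function name, the `positive_label` parameter, and the example values are hypothetical.

```python
def predict_labels(similarities, threshold, positive_label=1):
    """Assign a prediction label to each preset sample for one threshold.

    positive_label selects which label ("1" or "0") is given to samples
    whose similarity value exceeds the threshold; the other label is
    given to the rest.
    """
    negative_label = 1 - positive_label
    return [
        positive_label if s > threshold else negative_label
        for s in similarities
    ]


# Hypothetical similarity values of four preset samples and three thresholds.
sims = [0.95, 0.80, 0.60, 0.30]
thresholds = [0.5, 0.7, 0.9]
predictions = {t: predict_labels(sims, t) for t in thresholds}
```

Each threshold produces its own list of prediction labels, which is why the later precision and recall values are tied to a specific threshold.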
Step 240: for the prediction labels corresponding to each threshold, determine the corresponding precision and recall from the prediction labels and the sample labels, thereby determining multiple precision and recall values.
Taking the determination of the precision and recall corresponding to one threshold as an example, the sample label and the prediction label of each preset sample may be used to count the number of preset samples whose prediction label is "1" and whose sample label is "1" (denoted TP), the number of preset samples whose prediction label is "1" and whose sample label is "0" (denoted FP), and the number of preset samples whose prediction label is "0" and whose sample label is "1" (denoted FN). The precision can then be determined by the formula TP / (TP + FP), and the recall by the formula TP / (TP + FN).
It should be noted that when the threshold differs, the prediction labels of the preset samples differ, so the counted TP, FP, and FN differ, and the resulting precision and recall differ as well. Therefore, in this specification each precision and recall value corresponds to a threshold.
The above steps are repeated until the precision and recall corresponding to each threshold have been determined.
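The counting of TP, FP, and FN and the two formulas above can be sketched as follows. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch; the source does not say how that case is handled.

```python
def precision_recall(sample_labels, predicted_labels):
    """Compute precision TP/(TP+FP) and recall TP/(TP+FN) for one threshold."""
    pairs = list(zip(sample_labels, predicted_labels))
    tp = sum(1 for y, p in pairs if p == 1 and y == 1)
    fp = sum(1 for y, p in pairs if p == 1 and y == 0)
    fn = sum(1 for y, p in pairs if p == 0 and y == 1)
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical sample labels and the prediction labels for one threshold:
# TP = 2, FP = 1, FN = 1.
labels = [1, 0, 1, 1]
predicted = [1, 1, 1, 0]
precision, recall = precision_recall(labels, predicted)
```

Calling the function once per threshold, with that threshold's prediction labels, yields the multiple precision and recall values used in step 250.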
Step 250: calculate the average precision (AP) of the first sample from the determined precision and recall values.
In one implementation, after the precision and recall corresponding to the multiple thresholds have been determined, a precision-recall (P-R) curve is drawn in a plane rectangular coordinate system, with the recall corresponding to each threshold as the x-coordinate and the precision as the y-coordinate. The area enclosed by the precision-recall curve, the x-axis, and the y-axis in the first quadrant of the coordinate system is determined, and the average precision (AP) is determined from that area.
Of course, in practical applications the AP may also be determined in other ways, for example by taking the average of all the precision values. This is not described again here.
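One way to approximate the area under the P-R curve is the trapezoidal rule over the (recall, precision) points, sketched below. The choice of the trapezoidal rule is an assumption of this sketch, not something the source specifies, and the function name and example points are hypothetical.

```python
def average_precision(pr_points):
    """Approximate the area under a P-R curve with the trapezoidal rule.

    pr_points is a list of (recall, precision) pairs, one per threshold.
    Only the recall range covered by the points is integrated.
    """
    points = sorted(pr_points)  # order by recall (x-coordinate)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Hypothetical P-R points for three thresholds.
pr_points = [(1.0, 0.5), (0.0, 1.0), (0.5, 1.0)]
ap = average_precision(pr_points)  # 0.5 * 1.0 + 0.5 * 0.75 = 0.875
```

If the points do not reach recall 0, the region next to the y-axis is not covered by this sketch; how that region is closed is left open in the source.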
Step 260: when the AP satisfies a preset condition, update the first sample into the sample library.
In one implementation, a threshold may be preset, and it may be set empirically. When the AP is greater than this threshold, the first sample is updated into the sample library. In this way, the sample can be evaluated at a finer granularity.
In other implementations, a first threshold whose corresponding precision equals a preset value may also be selected from the multiple thresholds. In one example, the preset value may be 90%; that is, the threshold whose corresponding precision is 90% is selected from the multiple thresholds. The correspondence among the first sample, the first threshold, and the AP is then updated into the sample library. For example, the content of the sample library may be as shown in Table 1.
Table 1
Sample | First threshold | AP
Sample 1 | 0.8 | 0.9
Sample 2 | 0.6 | 0.9
... | ... | ...
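Selecting the first threshold and storing a row like those in Table 1 can be sketched as follows. Because a threshold with a precision of exactly the preset value may not exist, the sketch picks the threshold whose precision is closest to it; that tie-breaking rule, the dictionary layout, and all names are assumptions made for illustration.

```python
def choose_first_threshold(precision_by_threshold, preset_precision=0.9):
    """Pick the threshold whose precision is closest to the preset value."""
    return min(
        precision_by_threshold,
        key=lambda t: abs(precision_by_threshold[t] - preset_precision),
    )


def update_library(library, sample_id, first_threshold, ap, ap_threshold):
    """Store (sample, first threshold, AP) only when AP exceeds ap_threshold."""
    if ap > ap_threshold:
        library[sample_id] = {"first_threshold": first_threshold, "AP": ap}
        return True
    return False


# Hypothetical precision values measured at three thresholds.
precision_by_threshold = {0.6: 0.75, 0.8: 0.9, 0.9: 1.0}
first_threshold = choose_first_threshold(precision_by_threshold)

sample_library = {}
added = update_library(sample_library, "Sample 1", first_threshold,
                       ap=0.9, ap_threshold=0.85)
```

With these inputs the stored entry matches the first row of Table 1: Sample 1 with a first threshold of 0.8 and an AP of 0.9.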
The first threshold in Table 1 can also serve as the similarity threshold of the sample, and similar or dissimilar samples of that sample can be searched for according to the first threshold. Taking the search for similar samples as an example, the process may be as follows: a matching degree value between the sample and a candidate sample is calculated. If the matching degree value exceeds the first threshold corresponding to the sample, the candidate sample may be selected as a similar sample of that sample. Taking the first threshold corresponding to a precision of 90% as an example, the rationale for selecting candidate samples whose matching degree value exceeds the first threshold is as follows: when the matching degree value between a candidate sample and the sample exceeds the first threshold, there is 90% confidence that the candidate sample falls into the sample library of that sample.
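The similar-sample search can be sketched as a simple filter over candidate samples. How the matching degree value is computed is not specified at this point in the source, so the sketch takes precomputed values as input; the function name, candidate identifiers, and scores are hypothetical.

```python
def find_similar_samples(first_threshold, matching_degrees):
    """Return the candidates whose matching degree exceeds the first threshold.

    matching_degrees maps a candidate identifier to its matching degree
    value with respect to the library sample.
    """
    return [
        candidate
        for candidate, degree in matching_degrees.items()
        if degree > first_threshold
    ]


# Hypothetical matching degree values for three candidate samples.
matching_degrees = {"candidate_a": 0.85, "candidate_b": 0.75, "candidate_c": 0.80}
similar = find_similar_samples(0.8, matching_degrees)
```

A candidate whose value equals the threshold is not selected here, because the text says the value must exceed the first threshold.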
It should be noted that, after a similar sample of the first sample has been selected, that similar sample may also be updated into the sample library.
It should also be noted that Table 1 is only an illustrative example provided to aid understanding of this embodiment and does not limit this embodiment. Table 1 may also include other content, such as sample labels.
In summary, the method provided by the above embodiment of this specification can clean and update the sample library more accurately and quickly. In addition, because the method is performed automatically, it can meet the application demands of big data. Furthermore, because the threshold of each sample is also updated into the sample library, this provides a basis for recalling specific types of samples and for comprehensive scoring.
Fig. 3 is a schematic diagram of a method for updating samples in a sample library provided by another embodiment of this specification. In Fig. 3, the distance between the sample to be updated and each preset sample may be calculated based on the sample feature vectors, and the normalized distance is used as the similarity value. A precision-recall curve is generated from the similarity values, the sample labels of the preset samples (expressed as $Y = (y_1, y_2, \ldots, y_m)$, $y_i \in \{0, 1\}$, $i = 1, 2, \ldots, m$), and multiple thresholds. The threshold corresponding to a precision of 0.9 ($sim_t$) may then be selected, the area enclosed by the precision-recall curve and the x-axis is calculated, and that area is used as the AP value. Finally, if the AP value is greater than a threshold, the sample, $sim_t$, and the AP are updated into the sample library. In this way, the quality and generalization capability of the samples in the sample library can be improved more effectively and quickly.
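The flow of Fig. 3 can be put together in one compact sketch. It assumes Euclidean distance normalized by 100, a trapezoidal approximation of the area under the P-R curve, and selection of the threshold whose precision is closest to 0.9; these choices, the function name, and the example data are hypothetical and are not taken from Fig. 3 itself.

```python
import math


def evaluate_sample(sample, preset_samples, labels, thresholds,
                    ap_threshold, preset_precision=0.9):
    """Return a library entry (sample, sim_t, AP) or None if AP is too low."""
    # Normalized distance as similarity (math.dist needs Python 3.8+).
    sims = [1.0 - math.dist(x, sample) / 100.0 for x in preset_samples]
    points = []  # (recall, precision, threshold), one per threshold
    for t in thresholds:
        preds = [1 if s > t else 0 for s in sims]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumption: 0 if undefined
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((recall, precision, t))
    points.sort()
    # Trapezoidal area under the P-R curve, used as the AP value.
    ap = sum((r1 - r0) * (p0 + p1) / 2.0
             for (r0, p0, _), (r1, p1, _) in zip(points, points[1:]))
    # Threshold whose precision is closest to the preset value.
    sim_t = min(points, key=lambda pt: abs(pt[1] - preset_precision))[2]
    if ap > ap_threshold:
        return {"sample": sample, "sim_t": sim_t, "AP": ap}
    return None


# Hypothetical data: distances to the sample are 5, 10, 50, and 80.
presets = [[3.0, 4.0], [6.0, 8.0], [30.0, 40.0], [0.0, 80.0]]
Y = [1, 1, 0, 1]
entry = evaluate_sample([0.0, 0.0], presets, Y,
                        thresholds=[0.1, 0.4, 0.92], ap_threshold=0.5)
```

With so few thresholds the curve is coarse; a denser set of thresholds would give a smoother P-R curve and a more stable AP value.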
Corresponding to the above method for updating samples in a sample library, one embodiment of this specification further provides an apparatus for updating samples in a sample library. As shown in Fig. 4, the apparatus includes:
An obtaining unit 401, configured to obtain a first sample to be updated into the sample library.
A calculation unit 402, configured to calculate similarity values between the first sample obtained by the obtaining unit 401 and multiple preset samples to which sample labels have been added.
Optionally, the calculation unit 402 may be specifically configured to:
Determine the sample feature vector of the first sample, and determine the sample feature vector of each preset sample.
Calculate the distance value between the first sample and each preset sample from the sample feature vector of the first sample and the sample feature vector of each preset sample.
Normalize the distance values.
Determine the similarity values from the normalized distance values.
A determination unit 403, configured to determine, from the similarity values calculated by the calculation unit 402 and multiple thresholds, the prediction labels corresponding to each threshold.
The determination unit 403 is further configured to, for the prediction labels corresponding to each threshold, determine the corresponding precision and recall from the prediction labels and the sample labels, thereby determining multiple precision and recall values.
The calculation unit 402 is further configured to calculate the average precision (AP) of the first sample from the multiple precision and recall values determined by the determination unit 403.
Optionally, the calculation unit 402 may be specifically configured to:
Draw a precision-recall curve in a plane rectangular coordinate system, with the recall corresponding to each threshold as the x-coordinate and the precision as the y-coordinate.
Determine the area enclosed by the precision-recall curve, the x-axis, and the y-axis in the first quadrant of the plane rectangular coordinate system.
Determine the AP from the area.
An updating unit 404, configured to update the first sample into the sample library when the AP calculated by the calculation unit 402 satisfies a preset condition.
Optionally, the updating unit 404 may be specifically configured to:
Select, from the multiple thresholds, a first threshold whose corresponding precision equals a preset value.
Update the correspondence among the first sample, the first threshold, and the AP into the sample library.
Optionally, the apparatus may further include a selection unit 405.
The calculation unit 402 is further configured to calculate a matching degree value between the first sample and a candidate sample.
The selection unit 405 is configured to select the candidate sample as a similar sample of the first sample if the matching degree value calculated by the calculation unit 402 exceeds the first threshold.
Optionally, the updating unit 404 is further configured to update the similar sample of the first sample into the sample library.
The functions of the functional modules of the apparatus in the above embodiment of this specification can be implemented through the steps of the above method embodiment. Therefore, the specific working process of the apparatus provided by one embodiment of this specification is not described again here.
In the apparatus for updating samples in a sample library provided by one embodiment of this specification, the obtaining unit 401 obtains a first sample to be updated into the sample library. The calculation unit 402 calculates similarity values between the first sample and multiple preset samples to which sample labels have been added. The determination unit 403 determines, from the similarity values and multiple thresholds, the prediction labels corresponding to each threshold. For the prediction labels corresponding to each threshold, the determination unit 403 determines the corresponding precision and recall from the prediction labels and the sample labels, thereby determining multiple precision and recall values. The calculation unit 402 calculates the average precision (AP) of the first sample from the multiple precision and recall values. When the AP satisfies a preset condition, the updating unit 404 updates the first sample into the sample library. In this way, the quality of the samples in the sample library can be controlled more reliably.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely specific embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.