Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
The method for updating samples in a sample library provided by one embodiment of the present specification may be applied to the scenario shown in fig. 1. In fig. 1, the samples in the sample library may be collected manually in advance from a background database of a server. Each sample in the sample library may have a corresponding sample label (e.g., "1" or "0"). Specifically, if the sample label of a sample is "1", the sample is a positive sample; if the sample label is "0", the sample is a negative sample.
In fig. 1, the similarity search system may search for a sample that is most similar or most dissimilar to the current sample from the sample library.
Of course, in practical applications, the method for updating the samples in the sample library provided in the embodiment of the present specification may also be applied to other scenarios. This is not a limitation of the present specification.
Fig. 2 is a flowchart of a method for updating samples in a sample library according to an embodiment of the present specification. The method may be executed by any device with processing capabilities, such as a server, a system, or an apparatus. As shown in fig. 2, the method may specifically include:
Step 210, obtaining a first sample to be updated to a sample library.
The first sample here may be any sample that is to be updated to the sample library.
Step 220, calculating similarity values between the first sample and a plurality of preset samples with sample labels added.
In one implementation, the calculating of the similarity value between the first sample and the plurality of preset samples to which the sample labels are added in step 220 may be implemented by:
Step A, determining a sample feature vector of the first sample and a sample feature vector of each preset sample.
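Taking the first sample B as an example, B can be expressed as $B = (b_1, b_2, \ldots, b_n)$, where $b_1, b_2, \ldots, b_n$ are the components of the sample feature vector of the first sample B and $n$ is the number of these components. Furthermore, it may be assumed that there are $m$ preset samples $X_1, X_2, \ldots, X_m$. The $m$ preset samples may be expressed as:
$$
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}
=
\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m,1} & x_{m,2} & \cdots & x_{m,n}
\end{pmatrix}
$$
where $x_{1,1}, x_{1,2}, \ldots, x_{1,n}$ are the $n$ components of the sample feature vector of the preset sample $X_1$, $x_{2,1}, x_{2,2}, \ldots, x_{2,n}$ are the $n$ components of the sample feature vector of the preset sample $X_2$, and so on.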
Step B, calculating a distance value between the first sample and each preset sample according to the sample feature vector of the first sample and the sample feature vector of each preset sample.
Taking the i-th preset sample $X_i$ as an example, the distance value between the first sample B and $X_i$ can be expressed by the following equation:
$$d_i = \lVert X_i - B \rVert_2, \quad i = 1, 2, 3, \ldots, m \qquad \text{(Equation 1)}$$
where $d_i$ is the distance value between the first sample and the i-th preset sample, and $m$ is the number of preset samples.
Step C, normalizing the distance values.
Step D, determining the similarity values according to the normalized distance values.
For example, when the i-th distance value is normalized, the normalization formula can be as follows:
$$sim_i = 1 - \frac{d_i}{100}, \quad i = 1, 2, \ldots, m \qquad \text{(Equation 2)}$$
where $sim_i$ is the similarity value between the first sample and the i-th preset sample, and $d_i$ is the distance value between the first sample and the i-th preset sample.
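Steps A to D can be illustrated with a minimal sketch. It assumes that Equation 1 denotes the Euclidean (L2) distance and that Equation 2 divides the distance by the constant 100. The function names `euclidean_distance` and `similarity_values` and the parameter `scale` are hypothetical and do not appear in the specification.

```python
import math


def euclidean_distance(x, b):
    """L2 distance between two equal-length feature vectors (assumed reading of Equation 1)."""
    if len(x) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((xi - bi) ** 2 for xi, bi in zip(x, b)))


def similarity_values(first_sample, preset_samples, scale=100.0):
    """Return sim_i = 1 - d_i / scale for every preset sample (assumed reading of Equation 2)."""
    return [
        1.0 - euclidean_distance(x, first_sample) / scale
        for x in preset_samples
    ]


# Illustrative data only: one first sample B and m = 3 preset samples with n = 2 features.
B = [0.0, 0.0]
X = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
sims = similarity_values(B, X)
print(sims)
```

With this normalization a preset sample identical to the first sample receives a similarity of 1, and the similarity drops below 0 once the distance exceeds `scale`, so the constant only suits feature spaces whose distances stay within that range.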
Of course, in practical applications, the similarity value between the first sample and each preset sample may also be calculated in other ways. For example, a difference value between the first sample and each preset sample may be calculated, and the similarity value may then be determined from the difference value; details are not repeated here.
Step 230, determining a prediction label corresponding to each threshold according to the similarity values and a plurality of thresholds.
The multiple thresholds may be set empirically, and in different application scenarios, the multiple thresholds may have different values.
In addition, the prediction label is defined in the same way as the sample label and may take the value "0" or "1"; details are not repeated here.
In one implementation, the similarity value of each preset sample may be compared with a threshold. If the similarity value of a preset sample exceeds the threshold, the prediction label of that preset sample may be determined as "1" (or "0"); otherwise, it may be determined as "0" (or "1"). It can be understood that the prediction label of every preset sample can be determined in this manner.
Whether the prediction label of a preset sample is set to "1" or "0" may be determined according to the way the sample labels are set.
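The comparison against several thresholds can be sketched as follows. The names `predict_labels`, `labels_per_threshold` and `positive_label` are hypothetical; the parameter `positive_label` models the choice, left open above, of whether exceeding the threshold maps to "1" or to "0".

```python
def predict_labels(similarities, threshold, positive_label=1):
    """Assign positive_label to samples whose similarity exceeds the threshold."""
    negative_label = 1 - positive_label
    return [
        positive_label if s > threshold else negative_label
        for s in similarities
    ]


def labels_per_threshold(similarities, thresholds, positive_label=1):
    """Map each threshold to the prediction labels of all preset samples."""
    return {
        t: predict_labels(similarities, t, positive_label)
        for t in thresholds
    }


# Illustrative similarity values for four preset samples.
labels = labels_per_threshold([0.95, 1.0, 0.9, 0.4], [0.5, 0.92])
for threshold, predicted in labels.items():
    print(threshold, predicted)
```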
Step 240, for the prediction labels corresponding to each threshold, determining the corresponding accuracy and recall rate according to the prediction labels and the sample labels, thereby determining a plurality of accuracies and recall rates.
Taking the determination of the accuracy and recall rate corresponding to one threshold as an example, the following may be counted according to the sample label and the prediction label of each preset sample: the number of preset samples whose prediction label is "1" and whose sample label is "1" (denoted TP); the number of preset samples whose prediction label is "1" and whose sample label is "0" (denoted FP); and the number of preset samples whose prediction label is "0" and whose sample label is "1" (denoted FN). The accuracy, which corresponds to precision in the usual terminology, is then determined as TP/(TP + FP), and the recall rate as TP/(TP + FN).
It should be noted that different thresholds yield different prediction labels for the preset samples, so the counted TP, FP and FN differ, and so do the resulting accuracy and recall rate. Therefore, each accuracy and recall rate in this specification corresponds to a threshold.
The above determination is repeated for each threshold until the accuracy and recall rate corresponding to every threshold have been determined.
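The counting of TP, FP and FN and the repetition over all thresholds can be sketched as below. The names `precision_recall` and `sweep_thresholds` are hypothetical. Returning 0.0 when a denominator is zero is an assumption of this sketch, since the specification does not say how that case is handled, and the sketch also assumes that exceeding the threshold maps to the label "1".

```python
def precision_recall(sample_labels, predicted_labels):
    """Return (accuracy, recall) for the prediction labels of one threshold."""
    pairs = list(zip(sample_labels, predicted_labels))
    tp = sum(1 for y, p in pairs if p == 1 and y == 1)
    fp = sum(1 for y, p in pairs if p == 1 and y == 0)
    fn = sum(1 for y, p in pairs if p == 0 and y == 1)
    # Assumption: report 0.0 when nothing is predicted positive
    # or when no positive samples exist.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def sweep_thresholds(similarities, sample_labels, thresholds):
    """Map each threshold to its (accuracy, recall) pair."""
    results = {}
    for t in thresholds:
        predicted = [1 if s > t else 0 for s in similarities]
        results[t] = precision_recall(sample_labels, predicted)
    return results


# Illustrative data: four preset samples, two positive and two negative.
similarities = [0.95, 0.90, 0.50, 0.10]
sample_labels = [1, 1, 0, 0]
results = sweep_thresholds(similarities, sample_labels, [0.3, 0.7, 0.92])
for threshold, (precision, recall) in results.items():
    print(threshold, precision, recall)
```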
Step 250, calculating the average accuracy (AP) of the first sample according to the plurality of determined accuracies and recall rates.
In one implementation, after the accuracy and recall rate corresponding to each of the plurality of thresholds are determined, a precision-recall (P-R) curve is drawn in a planar rectangular coordinate system, with the recall rate corresponding to each threshold as the x-coordinate and the accuracy as the y-coordinate. The area enclosed in the first quadrant by the curve, the x-axis and the y-axis is then determined, and the average accuracy (AP) is determined from this area.
Of course, in practical applications, the AP may also be determined in other manners, for example, taking an average value of all accuracy rates, which is not described herein again.
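Both ways of obtaining the AP can be sketched as follows. The specification does not state how the area is integrated, so the trapezoidal rule used here is an assumption, as are the names `area_under_pr_curve` and `mean_precision`.

```python
def area_under_pr_curve(points):
    """Approximate the area under the P-R curve with the trapezoidal rule.

    points is an iterable of (recall, accuracy) pairs, one per threshold.
    Only the recall range covered by the points contributes to the area.
    """
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_precision(points):
    """Alternative mentioned above: the plain average of all accuracies."""
    pts = list(points)
    return sum(p for _, p in pts) / len(pts) if pts else 0.0


# Illustrative (recall, accuracy) pairs for three thresholds.
pr_points = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
ap_area = area_under_pr_curve(pr_points)
ap_mean = mean_precision(pr_points)
print(ap_area, ap_mean)
```

The two variants generally give different values, so a preset condition tuned for one of them should not be reused unchanged for the other.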
Step 260, updating the first sample to the sample library when the AP meets a preset condition.
In one implementation, a threshold may be preset, which may be set based on empirical values. When the AP exceeds the threshold, the first sample is updated to the sample library. Thus, a more refined evaluation of the sample can be performed.
In other implementations, a first threshold whose corresponding accuracy equals a preset value may further be selected from the plurality of thresholds. In one example, the preset value may be 90%; that is, the threshold whose corresponding accuracy is 90% is selected from the plurality of thresholds. The correspondence among the first sample, the first threshold and the AP is then updated to the sample library. For example, the contents of the sample library may be as shown in table 1.
TABLE 1
| Sample | First threshold | AP |
|---|---|---|
| Sample 1 | 0.8 | 0.9 |
| Sample 2 | 0.6 | 0.9 |
| ... | ... | ... |
The first threshold in table 1 may be used as the similarity threshold of the sample; that is, similar samples or dissimilar samples of the sample may be searched for according to the first threshold. Taking the search for similar samples as an example, the specific process may be as follows: a matching degree value between the sample and a candidate sample is calculated. If the matching degree value exceeds the first threshold corresponding to the sample, the candidate sample may be selected as a similar sample of the sample. Taking a first threshold corresponding to an accuracy of 90% as an example, the rationale for this selection is as follows: when the matching degree value between a candidate sample and the sample is greater than the first threshold, there is 90% confidence that the candidate sample falls into the sample library of the sample.
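Selecting the first threshold and using it to search for similar samples can be sketched as below. An accuracy of exactly 90% may not occur among a finite set of thresholds, so this sketch assumes that the smallest threshold whose accuracy is at least the preset value is taken; this rule and the names `select_first_threshold` and `find_similar` are hypothetical.

```python
def select_first_threshold(precision_by_threshold, target=0.9):
    """Return the smallest threshold whose accuracy reaches the target, or None.

    Assumption: "accuracy equal to the preset value" is relaxed to
    "accuracy of at least the preset value".
    """
    eligible = [t for t, p in precision_by_threshold.items() if p >= target]
    return min(eligible) if eligible else None


def find_similar(matching_values, first_threshold):
    """Return the candidates whose matching degree value exceeds the first threshold.

    matching_values maps a candidate identifier to its matching degree value
    with respect to the sample stored in the library.
    """
    return [c for c, m in matching_values.items() if m > first_threshold]


# Illustrative accuracies per threshold and matching degree values per candidate.
precision_by_threshold = {0.5: 0.7, 0.6: 0.9, 0.8: 1.0}
first_threshold = select_first_threshold(precision_by_threshold)
similar = find_similar({"c1": 0.65, "c2": 0.55, "c3": 0.6}, first_threshold)
print(first_threshold, similar)
```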
It should be noted that, after the similar sample to the first sample is selected, the similar sample may also be updated to the sample library.
It should be further noted that table 1 is only an exemplary illustration given for the convenience of understanding the present embodiment, and is not a limitation of the present embodiment. Other content, such as sample labels, etc., may also be included in table 1.
In summary, the method provided by the above embodiments of the present specification can more accurately and quickly implement the cleaning and updating of the sample library. In addition, the method is automatically completed, so that the application requirement of big data can be met. Furthermore, since the threshold values of the samples are updated in the sample library at the same time, a basis can be provided for recalling and comprehensively scoring the samples of specific types.
Fig. 3 is a schematic diagram of a method for updating samples in a sample library according to another embodiment of the present specification. In fig. 3, the distance between a sample to be updated and each preset sample may be calculated based on the sample feature vectors, and the normalized distance may be used as the similarity value. An accuracy-recall rate curve is generated according to the similarity values, the sample labels of the preset samples (expressed as $Y = (y_1, y_2, \ldots, y_m)$, $y_i \in \{0, 1\}$, $i = 1, 2, \ldots, m$) and a plurality of thresholds. Then, the threshold ($sim_t$) at which the accuracy is 0.9 may be selected, and the area enclosed by the accuracy-recall rate curve and the x-axis may be calculated and used as the AP value. Finally, if the AP value exceeds a preset threshold, the sample, $sim_t$ and the AP are updated to the sample library. In this way, the quality and generalization capability of the samples in the sample library can be improved more effectively and rapidly.
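The flow of fig. 3 can be put together in one end-to-end sketch. It rests on several assumptions not fixed by the specification: the distance is Euclidean and is normalized by dividing by 100, the area is integrated with the trapezoidal rule, and the threshold is chosen as the smallest one whose accuracy is at least the target. The names `evaluate_sample`, `update_library` and `ap_threshold`, as well as the dictionary layout of the library, are hypothetical.

```python
import math


def evaluate_sample(first_sample, preset_samples, sample_labels, thresholds,
                    target_precision=0.9, scale=100.0):
    """Return (sim_t, AP) for a sample to be updated."""
    # Similarity values from normalized Euclidean distances (assumed form).
    sims = [1.0 - math.dist(x, first_sample) / scale for x in preset_samples]
    points, sim_t = [], None
    for t in sorted(thresholds):
        preds = [1 if s > t else 0 for s in sims]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, sample_labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, sample_labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, sample_labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((recall, precision))
        # Assumption: keep the smallest threshold that reaches the target accuracy.
        if sim_t is None and precision >= target_precision:
            sim_t = t
    points.sort()
    # Trapezoidal area under the accuracy-recall curve (assumed integration rule).
    ap = sum((r1 - r0) * (p0 + p1) / 2.0
             for (r0, p0), (r1, p1) in zip(points, points[1:]))
    return sim_t, ap


def update_library(library, sample_id, first_sample, sim_t, ap, ap_threshold):
    """Store the sample together with sim_t and AP when AP exceeds ap_threshold."""
    if ap > ap_threshold:
        library[sample_id] = {"sample": first_sample,
                              "first_threshold": sim_t,
                              "AP": ap}
        return True
    return False


# Illustrative data: four preset samples with labels Y = (1, 1, 0, 0).
B = [0.0, 0.0]
X = [[3.0, 4.0], [6.0, 8.0], [30.0, 40.0], [60.0, 80.0]]
Y = [1, 1, 0, 0]
sim_t, ap = evaluate_sample(B, X, Y, thresholds=[0.25, 0.7, 0.92])
library = {}
updated = update_library(library, "sample_1", B, sim_t, ap, ap_threshold=0.4)
print(sim_t, ap, updated)
```

Because the area is taken only over the recall range reached by the chosen thresholds, the resulting AP depends on how densely the thresholds are spaced; a finer threshold grid gives a more faithful curve.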
Corresponding to the method for updating samples in a sample library, an embodiment of the present disclosure further provides an apparatus for updating samples in a sample library, as shown in fig. 4, the apparatus includes:
An obtaining unit 401, configured to obtain a first sample to be updated to a sample library.
A calculating unit 402, configured to calculate similarity values between the first sample acquired by the acquiring unit 401 and a plurality of preset samples to which sample labels have been added.
Optionally, the computing unit 402 may specifically be configured to:
determine a sample feature vector of the first sample and a sample feature vector of each preset sample;
calculate a distance value between the first sample and each preset sample according to the sample feature vector of the first sample and the sample feature vector of each preset sample;
normalize the distance values; and
determine the similarity values according to the normalized distance values.
A determining unit 403, configured to determine, according to the similarity value calculated by the calculating unit 402 and the plurality of threshold values, a predicted label corresponding to each threshold value.
The determining unit 403 is further configured to determine, for the prediction label corresponding to each threshold, a corresponding accuracy and a recall ratio according to the prediction label and the sample label, so as to determine a plurality of accuracies and recall ratios.
The calculating unit 402 is further configured to calculate an average accuracy AP of the first sample according to the plurality of accuracies and recalls determined by the determining unit 403.
Optionally, the computing unit 402 may specifically be configured to:
draw an accuracy-recall rate curve in a planar rectangular coordinate system, with the recall rate corresponding to each threshold as the x-coordinate and the accuracy as the y-coordinate;
determine the area enclosed in the first quadrant of the planar rectangular coordinate system by the accuracy-recall rate curve, the x-axis and the y-axis; and
determine the AP from the area.
An updating unit 404, configured to update the first sample to the sample library when the AP calculated by the calculating unit 402 satisfies a preset condition.
Optionally, the updating unit 404 may specifically be configured to:
select, from the plurality of thresholds, a first threshold whose corresponding accuracy equals a preset value; and
update the correspondence among the first sample, the first threshold and the AP to the sample library.
Optionally, the apparatus may further include: a selection unit 405.
The calculating unit 402 is further configured to calculate a matching value of the first sample and the candidate sample.
A selecting unit 405, configured to select the candidate sample as a similar sample of the first sample if the matching degree value calculated by the calculating unit 402 exceeds a first threshold.
Optionally, the updating unit 404 is further configured to update the similar sample of the first sample into the sample library.
The functions of each functional module of the device in the above embodiments of the present description may be implemented through each step of the above method embodiments, and therefore, a specific working process of the device provided in one embodiment of the present description is not repeated herein.
In the apparatus for updating samples in a sample library provided in one embodiment of the present specification, the obtaining unit 401 obtains a first sample to be updated to the sample library. The calculating unit 402 calculates similarity values between the first sample and a plurality of preset samples to which sample labels have been added. The determining unit 403 determines a prediction label corresponding to each threshold according to the similarity values and the plurality of thresholds. For the prediction labels corresponding to each threshold, the determining unit 403 determines the corresponding accuracy and recall rate according to the prediction labels and the sample labels, thereby determining a plurality of accuracies and recall rates. The calculating unit 402 calculates the average accuracy (AP) of the first sample according to the plurality of accuracies and recall rates. When the AP satisfies a preset condition, the updating unit 404 updates the first sample to the sample library. In this way, the quality of the samples in the sample library can be controlled more reliably.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above specific embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.