CN113177596B - Block chain address classification method and device - Google Patents


Info

Publication number: CN113177596B
Application number: CN202110480968.9A
Authority: CN (China)
Prior art keywords: sample, classifier, iteration, error rate, training
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113177596A
Inventors: 穆长春, 吕远, 卿苏德, 王艳辉, 吴浩, 刘睿
Assignee (current and original): Digital Currency Institute of the People's Bank of China
Application filed by the Digital Currency Institute of the People's Bank of China; priority to CN202110480968.9A
Publication of application CN113177596A; application granted and published as CN113177596B

Classifications

    • G06F18/00 Pattern recognition (GPHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses a blockchain address classification method and device, and relates to the field of computer technology. One embodiment of the method comprises the following steps: iteratively training selected classifiers using a set of first blockchain address samples, where each iteration selects one classifier, trains it with the goal of minimizing the weighted average error rate, computes that classifier's weight from the weighted average error rate of the current iteration, and computes the sample weight of each first blockchain address sample in the current iteration from its initial sample weight or its sample weight in the previous iteration; and determining a blockchain address classification model from the iteratively trained classifiers and their classifier weights, the model being used to determine the class of a blockchain address to be classified. This embodiment improves the accuracy and reliability of blockchain address classification and overcomes shortcomings of the prior art such as error propagation, insufficient algorithm generalization, and high demands on hardware resources and time.

Description

Block chain address classification method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for classifying blockchain addresses.
Background
Currently, among common blockchain address classification methods, the common-input heuristic rests on a key assumption: if a transaction has multiple input addresses at the same time, those input addresses are considered to belong to the same entity; in short, different addresses spending into the same transaction are attributed to one entity. When this assumption fails — for example, when a transaction contains inputs from other entities, or has only a single input — the address labeling result is no longer reliable. Moreover, the initial address labels must be accurate, otherwise all results propagated outward from them become unreliable. The change-address heuristic is a classical classification method for the UTXO (unspent transaction output) data model. Once created, a UTXO cannot be split, so "change" frequently occurs in blockchain transactions: the amount paid by the payer (typically in cryptocurrency) exceeds the amount due to the recipient, and the excess is returned to a change address. This algorithm must first decide whether a change-making action exists and then decide which address is the change address, so it suffers from error propagation; its heavy dependence on rules and assumptions limits its generalization; and it requires traversing the data many times, imposing high demands on hardware resources and time. Existing linear decision boundary classifiers, meanwhile, perform adequately in only a few application scenarios and work poorly for blockchain address classification, with low accuracy caused by factors such as model misspecification, constant sample weights, and uneven distribution of samples across labels.
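As a concrete illustration of the common-input heuristic discussed above (the prior art, not the patent's own method), address clustering can be sketched with a union-find structure; the transaction data and address names below are hypothetical:

```python
# Sketch of the common-input heuristic: all input addresses of one
# transaction are assumed to belong to the same entity, so addresses
# that ever co-occur as inputs are merged into one cluster.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def cluster_by_common_input(transactions):
    """transactions: list of input-address lists, one per transaction."""
    uf = UnionFind()
    for tx_inputs in transactions:
        for addr in tx_inputs:
            uf.find(addr)                 # register every input address
        for addr in tx_inputs[1:]:
            uf.union(tx_inputs[0], addr)  # merge co-occurring inputs
    clusters = {}
    for addr in uf.parent:
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())
```

As the Background notes, a single mislabeled cluster seed propagates to every address merged with it, which is exactly the error-transfer weakness the patent targets.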
In the process of implementing the present invention, the inventors found that the prior art has at least the following problems:
blockchain address classification accuracy is low and reliability is poor, with shortcomings such as error propagation, insufficient algorithm generalization, and high demands on hardware resources and time.
Disclosure of Invention
In view of the above, embodiments of the invention provide a method and a device for classifying blockchain addresses, which improve the accuracy and reliability of blockchain address classification and overcome shortcomings of the prior art such as error propagation, insufficient algorithm generalization, and high demands on hardware resources and time.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a blockchain address classification method.
A blockchain address classification method, comprising: determining a set of first blockchain address samples, each first blockchain address sample comprising a first number of features characterizing a blockchain address; iteratively training selected classifiers using the set of first blockchain address samples, wherein in each iteration: a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration depending on the sample weight of each first blockchain address sample in the current iteration; the classifier weight of that classifier is computed from the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is computed from its initial sample weight or from its sample weight in the previous iteration; and determining a blockchain address classification model based on the iteratively trained classifiers and their corresponding classifier weights, the blockchain address classification model being used to determine the class of a blockchain address to be classified.
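The iterative training loop described above follows the shape of adaptive boosting (AdaBoost). A minimal sketch, assuming decision stumps as a degenerate linear decision boundary classifier and the standard AdaBoost weight formulas — the patent's exact formulas may differ:

```python
import math

def train_stump(X, y, w):
    """Pick the feature, threshold and polarity with minimum weighted error."""
    best = (0, 0.0, 1, float("inf"))   # (feature, threshold, polarity, weighted error)
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (1 if pol * (xi[j] - t) >= 0 else -1) != yi)
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def stump_predict(stump, x):
    j, t, pol, _ = stump
    return 1 if pol * (x[j] - t) >= 0 else -1

def adaboost_train(X, y, M):
    """M boosting iterations; returns a list of (stump, classifier_weight) pairs."""
    n = len(X)
    w = [1.0 / n] * n                            # initial sample weights
    model = []
    for _ in range(M):
        stump = train_stump(X, y, w)             # minimize weighted error
        err = stump[3] / sum(w)                  # weighted average error rate
        err = min(max(err, 1e-10), 1 - 1e-10)    # clamp away from 0 and 1
        alpha = 0.5 * math.log((1 - err) / err)  # classifier weight
        model.append((stump, alpha))
        # Up-weight misclassified samples, down-weight correct ones
        # (one common AdaBoost form; the patent's exact update may differ).
        w = [wi * math.exp(alpha if stump_predict(stump, xi) != yi else -alpha)
             for wi, xi, yi in zip(w, X, y)]
    return model

def adaboost_predict(model, x):
    """Weighted vote of all trained classifiers."""
    s = sum(alpha * stump_predict(stump, x) for stump, alpha in model)
    return 1 if s >= 0 else -1
```

The final model is the classifier-weighted vote, matching the claim's "each classifier and the corresponding classifier weight after iterative training".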
Optionally, determining the set of first blockchain address samples includes: obtaining a set of second blockchain address samples, each second blockchain address sample comprising a second number of features characterizing a blockchain address, the second number being greater than or equal to the first number; and performing multiple rounds of feature screening to select the first number of features from the second number of features, and constructing corresponding first blockchain address samples from the first number of features selected from each second blockchain address sample, so as to determine the set of first blockchain address samples.
Optionally, performing multiple rounds of feature screening to select the first number of features from the second number of features includes performing the following steps in each round of screening: training a simplified version classifier on each single feature in a feature set to be screened, the simplified version classifiers corresponding one-to-one to the features in that set, where the feature set to be screened consists of those of the second number of features not yet selected; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is less than or equal to a preset threshold, selecting the feature used by the simplified version classifier with the minimum error rate and updating the feature set to be screened; and if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening to obtain the first number of features.
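The screening rounds above can be sketched as a greedy filter; the `error_fn` callback, which stands in for training and scoring a single-feature simplified classifier, is a placeholder assumption:

```python
def screen_features(X, y, w, error_fn, threshold):
    """Greedy multi-round screening: each round trains one single-feature
    simplified classifier per unselected feature, keeps the feature with
    the lowest error rate, and stops once even the best feature's error
    exceeds the threshold."""
    remaining = set(range(len(X[0])))
    selected = []
    while remaining:
        # Error rate of the simplified classifier for each candidate feature.
        errors = {j: error_fn([x[j] for x in X], y, w) for j in remaining}
        j_best = min(errors, key=errors.get)
        if errors[j_best] > threshold:
            break                        # no remaining feature is good enough
        selected.append(j_best)          # keep the best feature this round
        remaining.discard(j_best)        # update the set to be screened
    return selected
```

The stopping rule mirrors the claim: selection continues only while the minimum error rate stays at or below the preset threshold.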
Optionally, the error rate of a simplified version classifier is calculated as follows: computing the predicted value and the label value of the simplified version classifier for each training sample, where each training sample contains the single feature used to train that classifier; processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values; and weighting and summing the first sign function values according to the weights of the corresponding training samples in the current round of screening to obtain the error rate of the simplified version classifier.
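The sign-function construction above reduces to a weighted misclassification count. A small sketch (the normalization by total weight follows the weighted-average-error description elsewhere in the document and is an interpretation here):

```python
def sign(v):
    """Sign function: 1 for positive, -1 for negative, 0 for zero."""
    return (v > 0) - (v < 0)

def weighted_error_rate(preds, labels, weights):
    """sign(|prediction - label|) is 1 for a miss and 0 for a hit; the
    weighted sum of these indicators, normalized by the total weight,
    gives the weighted average error rate."""
    num = sum(w * sign(abs(p - t)) for p, t, w in zip(preds, labels, weights))
    return num / sum(weights)
```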
Optionally, after selecting the feature used by the simplified version classifier with the minimum error rate when the minimum error rate is less than or equal to the preset threshold, the method further includes: processing the absolute value of the difference between each training sample's label value and the predicted value produced by the simplified version classifier with the minimum error rate through a sign function to obtain a second sign function value, and computing a coefficient for updating the training sample's weight from the minimum error rate calculated in the current round of screening and the second sign function value; and multiplying the training sample's weight in the current round of screening by this coefficient to update it to the sample's weight in the next round of screening.
Optionally, before iteratively training the selected classifiers using the set of first blockchain address samples, an optimal number of iterations for the iterative training is determined as follows: dividing the set of first blockchain address samples into K subsets, training a classifier on K-1 of the subsets each time, and computing a cross-validation error rate estimate after each training; using the cross-validation error rate estimates from all trainings to compute their variance and to select the target iteration number corresponding to the minimum cross-validation error rate estimate; and computing the sum of the minimum cross-validation error rate estimate and the standard deviation of the estimates, then selecting, from a preset set of iteration numbers, the smallest iteration number whose cross-validation error rate estimate does not exceed that sum as the optimal number of iterations.
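The selection rule above reads like the one-standard-error rule familiar from cross-validation. A sketch, assuming the standard deviation is taken over the candidate estimates (one possible interpretation of the text):

```python
import statistics

def pick_optimal_m(cv_errors):
    """cv_errors maps a candidate iteration count M to its cross-validated
    error-rate estimate. Find the minimum estimate, add one standard
    deviation of the estimates, and return the smallest M whose estimate
    stays within that bound."""
    estimates = list(cv_errors.values())
    bound = min(estimates) + statistics.pstdev(estimates)
    return min(m for m, e in cv_errors.items() if e <= bound)
```

Preferring the smallest qualifying M favors the cheaper, less overfit model among those statistically indistinguishable from the best.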
Optionally, at each iteration, the weighted average error rate is calculated as follows: processing the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain corresponding third sign function values; weighting and summing the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum; and taking the ratio of the weighted sum to the sum of all sample weights in the current iteration as the weighted average error rate.
Optionally, when the current iteration is the first iteration, the sample weight of a first blockchain address sample in the current iteration is its initial sample weight; when the current iteration is not the first iteration, the sample weight of a first blockchain address sample in the current iteration is computed as the product of the following three terms: the sample's weight in the previous iteration, a value computed from the classifier weight of the current iteration through a preset function, and the third sign function value corresponding to the sample, where the preset function is an exponential function with base e.
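One reading of the three-term update above is sketched below; how the three terms combine is an assumption, since a literal product w · e^alpha · s would zero out correctly classified samples, so the sketch places the sign value inside the exponent, as in standard boosting:

```python
import math

def update_weight(w_prev, alpha, s):
    """Next-iteration sample weight from: previous weight w_prev,
    classifier weight alpha, and sign value s = sign(|pred - label|)
    (1 for a miss, 0 for a hit). Misses are scaled up by e^alpha;
    hits keep their previous weight."""
    return w_prev * math.exp(alpha * s)
```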
Optionally, the classifiers selected for each iteration are the same or different linear decision boundary classifiers.
According to another aspect of an embodiment of the present invention, a blockchain address classification device is provided.
A blockchain address classification device, comprising: a first blockchain address sample set determination module for determining a set of first blockchain address samples, each first blockchain address sample comprising a first number of features characterizing a blockchain address; a classifier iterative training module for iteratively training selected classifiers using the set of first blockchain address samples, wherein in each iteration: a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration depending on the sample weight of each first blockchain address sample in the current iteration; the classifier weight of that classifier is computed from the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is computed from its initial sample weight or from its sample weight in the previous iteration; and a blockchain address classification model determination module for determining a blockchain address classification model based on the iteratively trained classifiers and their corresponding classifier weights, the blockchain address classification model being used to determine the class of a blockchain address to be classified.
Optionally, the first blockchain address sample set determination module is further configured to: obtain a set of second blockchain address samples, each second blockchain address sample comprising a second number of features characterizing a blockchain address, the second number being greater than or equal to the first number; and perform multiple rounds of feature screening to select the first number of features from the second number of features, and construct corresponding first blockchain address samples from the first number of features selected from each second blockchain address sample, so as to determine the set of first blockchain address samples.
Optionally, the first blockchain address sample set determination module includes a feature screening sub-module configured to perform the following steps in each round of feature screening: training a simplified version classifier on each single feature in a feature set to be screened, the simplified version classifiers corresponding one-to-one to the features in that set, where the feature set to be screened consists of those of the second number of features not yet selected; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is less than or equal to a preset threshold, selecting the feature used by the simplified version classifier with the minimum error rate and updating the feature set to be screened; and if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening to obtain the first number of features.
Optionally, the feature screening sub-module calculates the error rate of a simplified version classifier as follows: computing the predicted value and the label value of the simplified version classifier for each training sample, where each training sample contains the single feature used to train that classifier; processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values; and weighting and summing the first sign function values according to the weights of the corresponding training samples in the current round of screening to obtain the error rate of the simplified version classifier.
Optionally, the first blockchain address sample set determination module further includes a weight update sub-module configured to: process the absolute value of the difference between each training sample's label value and the predicted value produced by the simplified version classifier with the minimum error rate through a sign function to obtain a second sign function value, and compute a coefficient for updating the training sample's weight from the minimum error rate calculated in the current round of screening and the second sign function value; and multiply the training sample's weight in the current round of screening by this coefficient to update it to the sample's weight in the next round of screening.
Optionally, the device further comprises an optimal iteration number determination module configured to determine the optimal number of iterations for the iterative training as follows: dividing the set of first blockchain address samples into K subsets, training a classifier on K-1 of the subsets each time, and computing a cross-validation error rate estimate after each training; using the cross-validation error rate estimates from all trainings to compute their variance and to select the target iteration number corresponding to the minimum cross-validation error rate estimate; and computing the sum of the minimum cross-validation error rate estimate and the standard deviation of the estimates, then selecting, from a preset set of iteration numbers, the smallest iteration number whose cross-validation error rate estimate does not exceed that sum as the optimal number of iterations.
Optionally, the classifier iterative training module calculates the weighted average error rate at each iteration as follows: processing the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain corresponding third sign function values; weighting and summing the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum; and taking the ratio of the weighted sum to the sum of all sample weights in the current iteration as the weighted average error rate.
Optionally, when the current iteration is the first iteration, the sample weight of a first blockchain address sample in the current iteration is its initial sample weight; when the current iteration is not the first iteration, the sample weight of a first blockchain address sample in the current iteration is computed as the product of the following three terms: the sample's weight in the previous iteration, a value computed from the classifier weight of the current iteration through a preset function, and the third sign function value corresponding to the sample, where the preset function is an exponential function with base e.
Optionally, the classifier iterative training module selects the same or different linear decision boundary classifier at each iteration.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the blockchain address classification method provided by the embodiments of the invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the blockchain address classification method provided by the embodiments of the invention.
One embodiment of the above invention has the following advantages or beneficial effects: selected classifiers are iteratively trained using a set of first blockchain address samples, wherein in each iteration a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration depending on the sample weight of each first blockchain address sample in the current iteration; the classifier weight of that classifier is computed from the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is computed from its initial sample weight or from its sample weight in the previous iteration. A blockchain address classification model is then determined from the iteratively trained classifiers and their corresponding classifier weights to classify blockchain addresses to be classified. The method improves the accuracy and reliability of blockchain address classification and overcomes shortcomings of the prior art such as error propagation, insufficient algorithm generalization, and high demands on hardware resources and time.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram illustrating the main steps of a blockchain address classification method according to an embodiment of the invention;
FIG. 2 is a blockchain address classification flow diagram according to one embodiment of the invention;
FIG. 3 is a schematic diagram of the main modules of a blockchain address classification device according to one embodiment of the invention;
FIG. 4 is an exemplary system architecture diagram to which embodiments of the invention may be applied;
FIG. 5 is a schematic diagram of a computer system suitable for a server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram illustrating the main steps of a blockchain address classification method according to an embodiment of the invention.
As shown in FIG. 1, the blockchain address classification method according to an embodiment of the invention mainly includes the following steps S101 to S103.
Step S101: a set of first blockchain address samples is determined, each first blockchain address sample comprising a first number of features characterizing a blockchain address.
Step S102: the selected classifiers are iteratively trained using the set of first blockchain address samples, wherein in each iteration: a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration depending on the sample weight of each first blockchain address sample in the current iteration; the classifier weight of that classifier is computed from the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is computed from its initial sample weight or from its sample weight in the previous iteration.
Step S103: a blockchain address classification model is determined based on the iteratively trained classifiers and their corresponding classifier weights, the blockchain address classification model being used to determine the class of a blockchain address to be classified.
The step of determining the set of first blockchain address samples may specifically include: obtaining a set of second blockchain address samples, each second blockchain address sample comprising a second number of features characterizing a blockchain address, the second number being greater than or equal to the first number; performing multiple rounds of feature screening to select the first number of features from the second number of features; and constructing corresponding first blockchain address samples from the first number of features selected from each second blockchain address sample to determine the set of first blockchain address samples.
The step of performing multiple rounds of feature screening to select the first number of features from the second number of features may specifically include performing the following steps in each round of screening: training a simplified version classifier on each single feature in the feature set to be screened, the simplified version classifiers corresponding one-to-one to the features in that set, where the feature set to be screened consists of those of the second number of features not yet selected; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is less than or equal to a preset threshold, selecting the feature used by the simplified version classifier with the minimum error rate and updating the feature set to be screened; and if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening to obtain the first number of features.
The error rate of a simplified version classifier can be calculated as follows: compute the predicted value and the label value of the simplified version classifier for each training sample, where each training sample contains the single feature used to train that classifier; process the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values; and weight and sum the first sign function values according to the weights of the corresponding training samples in the current round of screening to obtain the error rate of the simplified version classifier.
If the minimum error rate is less than or equal to the preset threshold, then after the step of selecting the feature used by the simplified version classifier with the minimum error rate, the method may further include: processing the absolute value of the difference between each training sample's label value and the predicted value produced by the simplified version classifier with the minimum error rate through a sign function to obtain a second sign function value, and computing a coefficient for updating the training sample's weight from the minimum error rate calculated in the current round of screening and the second sign function value; and multiplying the training sample's weight in the current round of screening by this coefficient to update it to the sample's weight in the next round of screening.
Before the step of iteratively training the selected classifiers using the set of first blockchain address samples, determining the optimal number of iterations for the iterative training may include: dividing the set of first blockchain address samples into K subsets, training a classifier on K-1 of the subsets each time, and computing a cross-validation error rate estimate after each training; using the cross-validation error rate estimates from all trainings to compute their variance and to select the target iteration number corresponding to the minimum cross-validation error rate estimate; and computing the sum of the minimum cross-validation error rate estimate and the standard deviation of the estimates, then selecting, from a preset set of iteration numbers, the smallest iteration number whose cross-validation error rate estimate does not exceed that sum as the optimal number of iterations.
At each iteration, the weighted average error rate may be calculated as follows: process the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain corresponding third sign function values; weight and sum the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum; and take the ratio of the weighted sum to the sum of all sample weights in the current iteration as the weighted average error rate.
For each iteration, when the current iteration is the first iteration, the sample weight of a first blockchain address sample in the current iteration is its initial sample weight; when the current iteration is not the first iteration, the sample weight of a first blockchain address sample in the current iteration is computed as the product of the following three terms: the sample's weight in the previous iteration, a value computed from the classifier weight of the current iteration through a preset function, and the third sign function value corresponding to the sample, where the preset function is an exponential function with base e.
The classifiers selected for each iteration may be the same or different linear decision boundary classifiers.
FIG. 2 is a block chain address classification flow diagram according to one embodiment of the invention.
The blockchain address classification flow of the embodiment of the invention classifies the blockchain addresses to be classified by an adaptive boosting blockchain address classification algorithm based on linear decision boundary classifiers, and mainly includes the following steps: extracting blockchain addresses on a blockchain and performing feature engineering to select hundreds of features, including cross-sectional and time-series features, and randomly dividing the sample data into a training set and a testing set according to a self-determined proportion; screening the features by using the training set to obtain the set of first blockchain address samples; determining the optimal iteration number M* by using the set of first blockchain address samples and K-fold cross-validation; training a blockchain address classification model by using the set of first blockchain address samples according to the optimal iteration number M*; predicting the test set by using the trained blockchain address classification model and checking the accuracy of the model; and classifying the blockchain addresses to be classified by using the blockchain address classification model.
The detailed flow of the blockchain address classification in accordance with embodiments of the present invention is described below.
In the data preparation step, the relevant addresses (both the blockchain addresses that have already been classified and those that need to be classified) are extracted on the blockchain, and feature engineering is performed. The whole blockchain is traversed, and as many features as possible, including cross-sectional and time-series information, are extracted from the historical transaction information stored on the chain to obtain the blockchain address sample data.
The blockchain address sample data is split into a training set and a testing set, and feature screening is performed; the training set used in the feature screening stage is the set of second blockchain address samples.
The following describes a specific process of feature screening according to an embodiment of the present invention.
Taking a binary classification scenario as an example, assume that the blockchain address samples (i.e., the second blockchain address samples) are (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_j ∈ R^P, j = 1, ..., n, is the P-dimensional covariate of each blockchain address sample; y_j ∈ {1, −1} is the label of the second blockchain address sample (in embodiments of the present invention, the label may be 1 or −1); P (i.e., the second number) represents the number of features of the second blockchain address sample (i.e., the dimension of the covariate); n represents the sample size of the training set; n_+ is the number of positive samples in the training set and n_− is the number of negative samples in the training set; W_{i,j} represents the weight of the j-th sample at the i-th iteration (i.e., the i-th round of screening); β represents the acceptable maximum error rate; and h_p represents the linear classifier trained when the p-th feature is used alone, p = 1, ..., P.
Samples with positive label y_j = 1 and negative label y_j = −1 are given initial weights W_{1,j} = 1/(2n_+) and W_{1,j} = 1/(2n_−), respectively. Namely: the initial weight of a second blockchain address sample with y_j = 1 is 1/(2n_+), and the initial weight of a second blockchain address sample with y_j = −1 is 1/(2n_−).
The weights are normalized as follows: W_{i,j} ← W_{i,j} / Σ_{k=1}^{n} W_{i,k}.
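The class-balanced weight initialization and normalization can be sketched as follows; this is an illustrative sketch, assuming the initial weights are 1/(2n_+) for positive samples and 1/(2n_−) for negative samples (the function and variable names are not part of the patent text):

```python
import numpy as np

def init_weights(y):
    """Give each sample an initial weight of 1/(2*n_pos) or 1/(2*n_neg),
    so the positive and negative classes carry equal total weight."""
    y = np.asarray(y)
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    return np.where(y == 1, 1.0 / (2 * n_pos), 1.0 / (2 * n_neg))

def normalize(w):
    """Normalize the weights so they sum to 1."""
    return w / w.sum()

y = [1, 1, -1, -1, -1, -1]          # toy labels: 2 positive, 4 negative
w = normalize(init_weights(y))
# each positive sample gets 0.25, each negative sample gets 0.125
```

With this initialization the two classes contribute equal total weight regardless of class imbalance, which matters here because, e.g., exchange addresses are typically far rarer than non-exchange addresses.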
For each feature p (a single feature), a simplified version classifier is trained using that feature; that is, a simplified version linear classifier h_p using only the p-th feature is computed. The features of the training samples used to train the simplified version classifier include only a single feature of the second blockchain address sample, the label is y_j as described above, and the weight is W_{i,j}; in the feature screening stage, this weight is called the training sample weight.
The error rate is then calculated as follows:

ε_p = Σ_{j=1}^{n} W_{i,j} · sign(|h_p(x_j) − y_j|)

ε_p is called the error rate of the simplified version classifier h_p, and sign(x) is the sign function, where sign(x) = 1 if x > 0, sign(x) = 0 if x = 0, and sign(x) = −1 if x < 0. h_p(x_j) is the predicted value of the j-th training sample, y_j is the label value of the j-th training sample, and sign(|h_p(x_j) − y_j|) is the first sign function value.
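A minimal sketch of the weighted error rate ε_p of a single-feature (simplified version) classifier, using the sign-function formulation described above; the function name is illustrative:

```python
import numpy as np

def reduced_classifier_error(preds, labels, weights):
    """epsilon_p = sum_j W_{i,j} * sign(|h_p(x_j) - y_j|).
    With labels in {1, -1}, sign(|pred - label|) is 1 when the
    prediction is wrong and 0 when it is correct, so this is simply
    the weighted misclassification rate."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    miscls = np.sign(np.abs(preds - labels))  # the first sign function values
    return float(np.sum(weights * miscls))

eps = reduced_classifier_error([1, -1, 1, -1], [1, 1, 1, -1],
                               [0.25, 0.25, 0.25, 0.25])
# one wrong prediction with weight 0.25 -> eps == 0.25
```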
In each round of feature screening, the number of simplified version classifiers trained is the same as the number of features remaining in the feature set to be screened: for each candidate feature, a simplified version classifier is trained using training samples containing only that feature, so the simplified version classifiers correspond one-to-one with the features in the feature set to be screened. Each simplified version classifier thus corresponds to one error rate, from which the minimum error rate can be selected.
In the first round of feature screening (i = 1), P error rates ε_1, ..., ε_P are calculated for p = 1, ..., P (in the i-th round, the remaining P − i + 1 features give the error rates ε_1, ..., ε_{P−i+1}). The minimum error rate min{ε_1, ..., ε_{P−i+1}} is then determined, and it is judged whether the minimum error rate is less than or equal to β (i.e., the preset threshold, representing the acceptable maximum error rate). If so, the feature p_min achieving the minimum error rate is selected and added to the feature list. Namely: the feature used by the simplified version classifier h_{p_min} corresponding to the minimum error rate is selected, and the simplified version classifier corresponding to the minimum error rate is the optimal linear decision boundary classifier.
After this round of feature screening, for j = 1, ..., n, the training sample weights are updated as follows:

W_{i+1,j} = W_{i,j} · (ε_{p_min} / (1 − ε_{p_min}))^{1 − sign(|h_{p_min}(x_j) − y_j|)}

where h_{p_min}(x_j) denotes the predicted value of the classifier h_{p_min}; (ε_{p_min} / (1 − ε_{p_min}))^{1 − sign(|h_{p_min}(x_j) − y_j|)} is the coefficient for updating the training sample weight, calculated from the minimum error rate of this round of feature screening and the second sign function value; sign(|h_{p_min}(x_j) − y_j|) is the second sign function value; and y_j is the label value of the j-th training sample.
The next round of feature screening is then entered. In that round, simplified version classifiers are again trained using the remaining single features respectively, and features are screened in the same way as above, i.e., by calculating the minimum error rate; the training sample weights are updated once per iteration of the algorithm.
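The per-round weight update can be sketched as follows. The exact coefficient formula is not fixed by the text beyond being built from the minimum error rate and the second sign function value; the (ε/(1 − ε))^(1 − s) form used here is an assumption in the style of Viola-Jones boosting, not the patent's definitive formula:

```python
import numpy as np

def update_screening_weights(w, preds, labels, eps_min):
    """Multiply each training-sample weight by a coefficient built from
    the minimum error rate eps_min and the second sign function value s_j.
    Assumed form: coeff = (eps_min / (1 - eps_min)) ** (1 - s_j), so
    correctly classified samples (s_j = 0) are down-weighted while
    misclassified samples (s_j = 1) keep their weight."""
    s = np.sign(np.abs(np.asarray(preds, dtype=float)
                       - np.asarray(labels, dtype=float)))
    coeff = (eps_min / (1.0 - eps_min)) ** (1 - s)
    return np.asarray(w, dtype=float) * coeff

w2 = update_screening_weights([0.25] * 4, [1, -1, 1, -1],
                              [1, 1, 1, -1], 0.25)
# correct samples shrink to 0.25/3; the misclassified one keeps 0.25
```

After this update (and renormalization), later rounds concentrate on the samples the already-selected features classify poorly, which is what drives the greedy selection of complementary features.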
For each round of feature screening, if the minimum error rate min{ε_1, ..., ε_{P−i+1}} > β, the calculation is terminated and the next round of feature screening is not performed; that is, the whole multi-round feature screening process ends.
After the above multi-round feature screening process ends, the number of features in the finally obtained feature list is denoted P*, i.e., the first number. Feature screening thus selects P* features from the original P features, and ε_1, ..., ε_{P*} are the error rates corresponding to the screened features.
The following describes a specific process of determining the optimal iteration number of iterative training according to an embodiment of the present invention.
According to the requirements on computing resources, data size, and model accuracy, a set of candidate optimal solutions M' = (M_1, M_2, ..., M_U) is selected; this set is the preset iteration number set. Using the training set data and a K-fold cross-validation method, the iteration number set is traversed, and the optimal iteration number M* is selected, with the classifiers trained on the feature set selected by the feature screening flow.
The training set is divided into K blocks of equal size, and the procedure loops K times; each time, the model f̂_{M_u}^{(−k)} is trained using the K − 1 blocks other than the k-th block, where f̂_{M_u}^{(−k)} is the model built for a single classifier. In the set (M_1, M_2, ..., M_U), each M corresponds to a single classifier, i.e., M_1 corresponds to the first single classifier, M_2 to the second single classifier, ..., and M_U to the U-th single classifier. The samples used to train f̂_{M_u}^{(−k)} are characterized by the P* screened features, with labels y_j; that is, embodiments of the present invention train the model f̂_{M_u}^{(−k)} using the set of first blockchain address samples.
For the u-th single classifier (u = 1, 2, ..., U), the cross-validation error rate estimate is calculated as:

CV(M_u) = (1/K) Σ_{k=1}^{K} (1/|block k|) Σ_{j ∈ block k} L(y_j, f̂_{M_u}^{(−k)}(x_j))

where L(·, ·) is the loss function, self-determined according to the data characteristics and analysis requirements; it can take any common form, such as the least squares loss. block k represents the samples in the k-th block, and |block k| represents the number of samples in the k-th block, k = 1, ..., K.
For each single classifier, the above method can be used to calculate a cross-validation error rate estimate CV(M_u). Since each M in the set (M_1, M_2, ..., M_U) corresponds to a single classifier, there are U single classifiers in total, and hence U estimates CV(M_1), ..., CV(M_U); from these U estimates, the variance Var(CV) of the cross-validation error rate estimates can be calculated.
The optimal iteration number M* is then selected. Specifically, the M with the lowest cross-validation error rate estimate, denoted M_err_min (i.e., the target number of iterations), is found first:

M_err_min = argmin_{M_u ∈ M'} CV(M_u)
Then the minimum M in the set M' whose cross-validation error rate estimate is not higher than the sum of CV(M_err_min) and sqrt(Var(CV)) is taken as M*, where CV(M_err_min) is the cross-validation error rate estimate corresponding to M_err_min, and sqrt(Var(CV)) is the standard deviation of the cross-validation error rate estimates. M* is:

M* = min{M_1, ..., M_U}  s.t.  CV(M) ≤ CV(M_err_min) + sqrt(Var(CV))

That is, under the constraint condition (s.t.) CV(M) ≤ CV(M_err_min) + sqrt(Var(CV)), the minimum of M_1, ..., M_U is found. In other words, from the preset iteration number set, the minimum iteration number whose corresponding cross-validation error rate estimate is not greater than the sum (i.e., the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates) is selected as the optimal iteration number.
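The one-standard-error selection of the optimal iteration number can be sketched as follows, assuming the cross-validation error rate estimates for each candidate iteration count have already been computed (the candidate values and error estimates below are illustrative):

```python
import numpy as np

def select_optimal_iterations(candidates, cv_errors):
    """Pick the smallest candidate iteration count whose CV error
    estimate does not exceed min(cv_errors) + std(cv_errors)."""
    candidates = np.asarray(candidates)
    cv_errors = np.asarray(cv_errors, dtype=float)
    threshold = cv_errors.min() + cv_errors.std()
    feasible = candidates[cv_errors <= threshold]
    return int(feasible.min())

M_star = select_optimal_iterations([50, 100, 200, 400],
                                   [0.30, 0.21, 0.20, 0.20])
# threshold ~= 0.20 + 0.042 = 0.242, so 100 is the smallest feasible count
```

With these illustrative inputs the smallest candidate within one standard deviation of the best error (100) is chosen rather than the absolute minimizer (200), trading a negligible amount of accuracy for fewer boosting iterations and hence a cheaper model.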
The following describes a specific process for determining a blockchain address classification model in accordance with an embodiment of the present invention.
n represents the sample size of the training set; x*_j ∈ R^{P*} represents the P* features selected from the initial P features of the data (i.e., the features of the first number of the first blockchain address samples, selected from the features of the second number of the second blockchain address samples); the blockchain address samples obtained based on x*_j are the first blockchain address samples, each including the P* features obtained through feature screening; y_j ∈ {1, −1} is the label of the first blockchain address sample (corresponding identically to the label of the second blockchain address sample in the feature screening stage above), where j = 1, ..., n; W_{i,j} represents the weight of the j-th sample at the i-th iteration (corresponding identically to the weight of the second blockchain address sample in the feature screening stage above), referred to in embodiments of the present invention as the sample weight; M* represents the optimal number of iterations; g_i represents the linear classifier selected at the i-th iteration (i = 1, ..., M*), where the classifiers selected at different iterations may be of the same type or of different types; for example, a particular iteration may select a logistic-regression-based classifier, and the next iteration may likewise select a logistic-regression-based classifier, or may select a classifier based on a support vector machine (the classifier type is only an example); α_i represents the classifier weight of classifier g_i.
For the j = 1, ..., n samples, an initial weight (i.e., the sample weight initial value) is defined in the same way as in the feature screening stage: W_{1,j} = 1/(2n_+) if y_j = 1, and W_{1,j} = 1/(2n_−) if y_j = −1.
For the i = 1, ..., M* iterations, the linear classifier g_i is trained by minimizing the weighted average error rate, as follows:

err_i = (Σ_{j=1}^{n} W_{i,j} · sign{|y_j − g_i(x_j)|}) / (Σ_{j=1}^{n} W_{i,j})
From this, the weighted average error rate err_i of this iteration is related to the sample weight W_{i,j} of each first blockchain address sample in the iteration. Here sign{|y_j − g_i(x_j)|} is the third sign function value, and g_i(x_j) is the predicted value for the j-th first blockchain address sample at the i-th iteration; Σ_{j=1}^{n} W_{i,j} is the sum of all sample weights in the i-th iteration.
That is, the third sign function values are weighted and summed according to the sample weights of the corresponding first blockchain address samples in the i-th iteration to obtain the weighted sum.
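The weighted average error rate of one iteration, in the ratio form described above, can be sketched as:

```python
import numpy as np

def weighted_average_error(preds, labels, sample_weights):
    """err_i = (sum_j W_{i,j} * sign(|y_j - g_i(x_j)|)) / (sum_j W_{i,j}).
    The numerator is the weighted sum of the third sign function values;
    the denominator is the sum of all sample weights in the iteration."""
    s = np.sign(np.abs(np.asarray(labels, dtype=float)
                       - np.asarray(preds, dtype=float)))
    w = np.asarray(sample_weights, dtype=float)
    return float(np.sum(w * s) / np.sum(w))

err = weighted_average_error([1, 1, -1, -1], [1, -1, -1, -1],
                             [0.4, 0.2, 0.2, 0.2])
# one misclassified sample of weight 0.2, total weight 1.0 -> err == 0.2
```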
Each iteration of the algorithm, a linear decision boundary classifier (i.e., linear classifier) is trained with the goal of minimizing the error rate, i.e., the weighted average error rate described above.
The classifier weight α_i of classifier g_i is then computed from the weighted average error rate of the iteration, for example in the usual adaptive boosting form α_i = ln((1 − err_i) / err_i).
For j=1..n, calculate sample weights:
W_{i+1,j} = W_{i,j} × exp(α_i) × sign{|y_j − g_i(x_j)|}, i = 1, ..., M*
That is, the sample weight W_{i+1,j} of the first blockchain address sample in a given iteration is calculated based on the sample weight initial value W_{1,j} of the first blockchain address sample or on the sample weight W_{i,j} of the first blockchain address sample in the last iteration. The value of exp(α_i) is obtained from the classifier weight in the i-th iteration through a preset function operation, namely an exponential function with base e. The sample weights are updated once per iteration of the algorithm.
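The per-iteration updates can be sketched as follows. The classifier-weight formula α_i = ln((1 − err_i)/err_i) is an assumption (the usual adaptive boosting choice), while the sample-weight update implements the three-factor product stated above:

```python
import numpy as np

def classifier_weight(err):
    """Assumed AdaBoost-style classifier weight from the weighted error."""
    return float(np.log((1.0 - err) / err))

def update_sample_weights(w, preds, labels, alpha):
    """W_{i+1,j} = W_{i,j} * exp(alpha_i) * sign(|y_j - g_i(x_j)|),
    the three-factor product stated in the text: the old weight, the
    exponential of the classifier weight, and the third sign function
    value (1 for misclassified samples, 0 for correct ones)."""
    s = np.sign(np.abs(np.asarray(labels, dtype=float)
                       - np.asarray(preds, dtype=float)))
    return np.asarray(w, dtype=float) * np.exp(alpha) * s

alpha = classifier_weight(0.2)                 # ln(0.8/0.2) = ln 4
w_next = update_sample_weights([0.25] * 4, [1, 1, -1, -1],
                               [1, -1, -1, -1], alpha)
# only the misclassified sample retains (and grows) its weight
```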
After M* iterations are completed, the blockchain address classification model is determined based on each classifier after iterative training and the corresponding classifier weights, as follows:

f(x) = sign(Σ_{i=1}^{M*} α_i · g_i(x))
and testing the model accuracy of the trained block chain address classification model on a test set, comparing the error rate of the training set with the error rate of the test set, and confirming that the model has no fitting problem.
In the above blockchain address classification model, f(x) is the output of the blockchain address classification model, and the output indicates the class of the blockchain address to be classified; for example, in binary classification, the output may indicate that the blockchain address to be classified is an exchange blockchain address or a non-exchange blockchain address.
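Assuming the combined model takes the weighted-vote form f(x) = sign(Σ_i α_i · g_i(x)) with labels in {1, −1}, the final prediction can be sketched as follows (the toy single-feature classifiers are illustrative only):

```python
def ensemble_predict(classifiers, alphas, x):
    """Weighted vote over the trained linear classifiers; the sign of
    the weighted sum gives the predicted class (+1 or -1), e.g.
    exchange vs. non-exchange address."""
    score = sum(a * g(x) for g, a in zip(classifiers, alphas))
    return 1 if score >= 0 else -1

# toy classifiers thresholding a single feature, for illustration only
g1 = lambda x: 1 if x[0] > 0.5 else -1
g2 = lambda x: 1 if x[0] > 0.2 else -1

pred = ensemble_predict([g1, g2], [1.0, 0.5], [0.3])
# g1 votes -1 (weight 1.0), g2 votes +1 (weight 0.5): score -0.5 -> pred == -1
```

The higher-weight classifier dominates the vote here, which is the intended behavior: classifiers with lower weighted error receive larger α_i and so contribute more to the final decision.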
The classifier used in the above stages (such as feature screening, determining the optimal iteration number, determining the blockchain address classification model, etc.) of the embodiments of the present invention may be various linear decision boundary classifiers, including but not limited to logistic regression, decision trees, random forests, support vector machines, etc.
In the prior art, many linear decision boundary classifiers used in machine learning, such as the linear-kernel support vector machine, logistic regression, and naive Bayes, do not perform well on blockchain address classification, although these methods perform well in many other application scenarios. The main reasons are as follows: 1. many linear decision boundary classifiers require the assumption of a linear relationship between address labels and covariates; 2. the exclusive-OR problem (i.e., the classification boundary being a nonlinear hypersurface) is pronounced in high-dimensional spaces, and a linear decision boundary cannot distinguish the classes well; 3. many classifiers need to assume that the covariates are mutually independent and that each covariate is equally important to label division; 4. such classifiers usually assume that the weight of each sample is constant; 5. most classifiers generally require a relatively balanced distribution of sample numbers across classes to produce good classification results. The embodiment of the invention provides an adaptive boosting blockchain address classification method based on linear decision boundary classification models, which overcomes the above problems in the prior art. In addition, since the sample weights of the embodiment of the invention are dynamically updated, it is no longer necessary, as in the prior art, to assume mutual independence between covariates or that each covariate is equally important to label division.
FIG. 3 is a schematic diagram of the main modules of a blockchain address classification device according to an embodiment of the present invention.
As shown in FIG. 3, the blockchain address classification device 300 of one embodiment of the present invention mainly includes: a first blockchain address sample set determination module 301, a classifier iteration training module 302, and a blockchain address classification model determination module 303.
A first set of blockchain address samples determination module 301 for determining a set of first blockchain address samples, each first blockchain address sample including a first number of features representing blockchain addresses;
a classifier iteration training module 302 for iteratively training a selected classifier using the set of first blockchain address samples, wherein for each iteration: selecting a classifier to train with the minimum weighted average error rate as a target, wherein the weighted average error rate of the current iteration is related to the sample weight of each first block chain address sample in the current iteration, calculating the classifier weight of the classifier based on the weighted average error rate of the current iteration, and the sample weight of the first block chain address sample in the current iteration is calculated based on the sample weight initial value of the first block chain address sample or the sample weight of the first block chain address sample in the last iteration;
The blockchain address classification model determining module 303 is configured to determine a blockchain address classification model based on each classifier and the corresponding classifier weight after iterative training, where the blockchain address classification model is used to determine a class of the blockchain address to be classified.
The first blockchain address sample set determination module 301 may be specifically configured to: obtain a set of second blockchain address samples, each second blockchain address sample including a second number of features representing blockchain addresses, the second number being greater than or equal to the first number; and perform multiple rounds of feature screening to select a first number of features from the second number of features, and construct corresponding first blockchain address samples from the first number of features selected from each second blockchain address sample, to determine the set of first blockchain address samples.
The first blockchain address sample set determination module 301 may include a feature screening sub-module to: in each round of feature screening, the following steps are performed: training a simplified version classifier by using a single feature in the feature set to be screened, wherein the simplified version classifier corresponds to the features in the feature set to be screened one by one, and the feature set to be screened is a set formed by features which are not selected in the second number of features; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is smaller than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened; and if the minimum error rate is greater than the preset threshold, ending the process of multi-round feature screening to obtain a first number of features.
The feature screening submodule may calculate the error rate of the reduced version classifier as follows: calculating a predicted value and a label value of the simplified version classifier corresponding to each training sample, wherein the training samples comprise single characteristics used for training the simplified version classifier; and processing the absolute value of the difference between the predicted value and the label value of each training sample through a symbol function to obtain corresponding first symbol function values, and carrying out weighted summation on each first symbol function value according to the corresponding weight of the corresponding training sample in the round of screening to obtain the error rate of the simplified version classifier.
The first blockchain address sample set determination module may further include a weight update sub-module to: processing the absolute value of the difference between the label value and the predicted value obtained by the simplified version classifier of the training sample corresponding to the minimum error rate through a symbol function to obtain a second symbol function value, and calculating a coefficient for updating the weight of the training sample by utilizing the minimum error rate and the second symbol function value calculated by the feature screening of the round; and multiplying the corresponding weight of the training sample in the round of screening by the coefficient to update the current weight of the training sample to the corresponding weight of the training sample in the next round of screening.
The blockchain address classification device 300 may further include an optimal iteration number determination module for determining an optimal iteration number for iterative training by: dividing a set of first blockchain address samples into K subsets, training a classifier by using the K-1 subsets each time, and calculating a cross-validation error rate estimation value after each training; calculating the variance of the cross-validation error rate estimation value by using the cross-validation error rate estimation value obtained by each training to select the target iteration times corresponding to the minimum cross-validation error rate estimation value; and calculating the sum of the minimum cross-validation error rate estimation value and the standard deviation of the cross-validation error rate estimation value, and selecting the minimum value of the iteration times of which the corresponding cross-validation error rate estimation value is not more than the sum from a preset iteration time set as the optimal iteration time.
The classifier iteration training module 302 calculates a weighted average error rate at each iteration by: and processing the absolute value of the difference between the label value and the predicted value of each first block chain address sample through a symbol function to obtain corresponding third symbol function values, carrying out weighted summation on each third symbol function value according to the sample weight of the corresponding first block chain address sample in the iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all the sample weights in the iteration as a weighted average error rate.
Under the condition that the iteration is the first iteration, the sample weight of the first block chain address sample in the iteration is the sample weight initial value of the first block chain address sample; under the condition that the iteration is not the first iteration, the sample weight of the first blockchain address sample in the iteration is obtained by calculating the product of the following three items: the first block chain address sample is subjected to sample weight in the last iteration, a value obtained by utilizing classifier weight in the current iteration through preset function operation, and a third symbol function value corresponding to the first block chain address sample, wherein the preset function is an exponential function based on e.
The classifier iteration training module 302 selects the same or different linear decision boundary classifiers at each iteration.
In addition, the implementation of the blockchain address classification device in the embodiment of the present invention is described in detail in the above blockchain address classification method, so the description thereof will not be repeated here.
FIG. 4 illustrates an exemplary system architecture 400 to which the blockchain address classification method or blockchain address classification device of embodiments of the invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 is used as a medium to provide communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 405 via the network 404 using the terminal devices 401, 402, 403 to receive or send messages or the like. Various communication client applications, such as a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 401, 402, 403, as just examples.
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (by way of example only) providing support for shopping-type websites browsed by users using the terminal devices 401, 402, 403. The background management server may analyze and process the received data, such as the blockchain address classification request, and feedback the processing result (e.g., address class—only by way of example) to the terminal device.
It should be noted that, the method for classifying blockchain addresses according to the embodiment of the present invention is generally executed by the server 405, and accordingly, the blockchain address classifying device is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, there is illustrated a schematic diagram of a computer system 500 suitable for use in implementing a server of an embodiment of the present application. The server illustrated in fig. 5 is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments herein.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output portion 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as needed so that a computer program read therefrom is mounted into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the system of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 501.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example described as: a processor including a first blockchain address sample set determination module, a classifier iterative training module, and a blockchain address classification model determination module. In some cases, the names of these modules do not limit the modules themselves; for example, the first blockchain address sample set determination module may also be described as "a module for determining a set of first blockchain address samples".
As another aspect, the present invention also provides a computer readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to: determine a set of first blockchain address samples, each of the first blockchain address samples including a first number of features representing blockchain addresses; iteratively train a selected classifier using the set of first blockchain address samples, wherein for each iteration: a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; a classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of a first blockchain address sample in the current iteration is calculated based on the initial sample weight of that sample or its sample weight in the previous iteration; and determine a blockchain address classification model based on each classifier and the corresponding classifier weight after the iterative training, the blockchain address classification model being used to determine the class of a blockchain address to be classified.
According to the technical scheme of the embodiments of the present invention, the selected classifier is trained iteratively using the set of first blockchain address samples, wherein for each iteration: a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; the classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of a first blockchain address sample in the current iteration is calculated based on the initial sample weight of that sample or its sample weight in the previous iteration. Based on each classifier and the corresponding classifier weight after the iterative training, a blockchain address classification model is determined to classify the blockchain address to be classified. This method improves the accuracy and reliability of blockchain address classification, and overcomes drawbacks of the prior art such as error propagation, insufficient generalization of the algorithm, and high demands on hardware resources and time.
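The iterative scheme summarized above has the shape of an AdaBoost-style ensemble. As a rough illustration only — the decision-stump base classifier, the thresholds, and all names below are assumptions for the sketch, not taken from the patent — the training loop and the resulting weighted-vote model could look like this:

```python
# Hedged sketch of the described boosting procedure: each iteration picks the
# classifier minimizing the weighted error, assigns it a classifier weight,
# and re-weights the samples. A one-feature threshold "stump" stands in for
# the selected classifier; labels are +1/-1.
import math

def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with minimum weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi) / sum(w)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def boost(X, y, M=10):
    n = len(X)
    w = [1.0 / n] * n                     # initial sample weights
    ensemble = []
    for _ in range(M):
        err, j, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-10)             # avoid log(0) for a perfect stump
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)     # classifier weight
        pred = [pol if row[j] <= thr else -pol for row in X]
        # misclassified samples gain weight for the next iteration
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def classify(ensemble, x):
    """Sign of the classifier-weighted vote, i.e. the final ensemble output."""
    s = sum(a * (pol if x[j] <= thr else -pol) for a, j, thr, pol in ensemble)
    return 1 if s >= 0 else -1
```

On a tiny linearly separable toy set, `boost` converges to stumps that split at the class boundary and `classify` returns the majority-weighted vote.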
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method of blockchain address classification, comprising:
determining a set of first blockchain address samples, each of the first blockchain address samples including a first number of features representing blockchain addresses;
iteratively training a selected classifier using the set of first blockchain address samples, wherein for each iteration: a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; a classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of the first blockchain address sample in the current iteration is calculated based on the initial sample weight of the first blockchain address sample or the sample weight of the first blockchain address sample in the previous iteration;
determining a blockchain address classification model based on each classifier and the corresponding classifier weight after the iterative training, wherein the blockchain address classification model is used for determining the class of a blockchain address to be classified;
at each iteration, the weighted average error rate is calculated by: processing the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain corresponding third sign function values; performing a weighted summation of the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum; and taking the ratio of the weighted sum to the sum of all the sample weights in the current iteration as the weighted average error rate;
the blockchain address classification model is shown as follows:

f(x) = sign( Σ_{i=1}^{M} α_i · g_i(x) )

wherein f(x) is the output of the blockchain address classification model, the output indicating the class of the blockchain address to be classified; x represents a blockchain address; g_i represents the classifier selected in the i-th iteration; α_i represents the classifier weight of classifier g_i; g_i(x) represents the predicted value of the blockchain address in the i-th iteration; M represents the number of iterations; and sign() represents the sign function.
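Claim 1's weighted average error rate can be illustrated with a short sketch: sign(|label − prediction|) is 1 for a misclassified sample and 0 for a correct one, so the weighted sum of these indicator values divided by the total sample weight is the error rate. Function names here are illustrative, not from the patent:

```python
# Minimal sketch of the claimed weighted average error rate, assuming
# +1/-1 label and prediction values.
def sign(v):
    """Sign function: -1, 0, or 1."""
    return (v > 0) - (v < 0)

def weighted_error_rate(labels, preds, weights):
    # third sign function values: 1 where label != prediction, else 0
    indicators = [sign(abs(l - p)) for l, p in zip(labels, preds)]
    weighted_sum = sum(w * s for w, s in zip(weights, indicators))
    return weighted_sum / sum(weights)
```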
2. The method of claim 1, wherein the determining the set of first blockchain address samples comprises:
obtaining a set of second blockchain address samples, each of the second blockchain address samples including a second number of features representing blockchain addresses, the second number being greater than or equal to the first number;
and performing multiple rounds of feature screening to select the first number of features from the second number of features, and constructing corresponding first blockchain address samples from the first number of features selected from each second blockchain address sample, so as to determine the set of first blockchain address samples.
3. The method of claim 2, wherein performing the multiple rounds of feature screening to select the first number of features from the second number of features comprises, in each round of feature screening, performing the following steps:
training simplified version classifiers using single features from a feature set to be screened, wherein the simplified version classifiers correspond one-to-one to the features in the feature set to be screened, and the feature set to be screened is the set of features not yet selected from the second number of features;
calculating the error rate of each simplified version classifier to obtain the minimum error rate;
if the minimum error rate is smaller than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened;
and if the minimum error rate is greater than the preset threshold, ending the process of the multi-round feature screening to obtain the first number of features.
4. A method according to claim 3, wherein the error rate of the reduced-version classifier is calculated by:
calculating, for each training sample, the predicted value of the corresponding simplified version classifier and the label value, wherein the training sample comprises the single feature used to train the simplified version classifier;
and processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values, and performing a weighted summation of the first sign function values according to the weights of the corresponding training samples in the current round of screening to obtain the error rate of the simplified version classifier.
5. The method of claim 4, wherein, if the minimum error rate is less than or equal to the preset threshold, after selecting the features used by the simplified version classifier corresponding to the minimum error rate, the method further comprises:
processing the absolute value of the difference between the label value of the training sample and the predicted value obtained by the simplified version classifier corresponding to the minimum error rate through a sign function to obtain a second sign function value, and calculating a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value;
multiplying the weight of the training sample in the current round of screening by the coefficient, so as to update the current weight of the training sample to the weight of the training sample in the next round of screening.
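The screening procedure of claims 3–5 can be sketched roughly as follows. The one-feature threshold classifier, the fixed 0.5 threshold, and the e-based weight-update coefficient are illustrative assumptions for the sketch, not the patented construction:

```python
# Hedged sketch of multi-round feature screening: per round, train a trivial
# one-feature ("simplified version") classifier for each unselected feature,
# keep the feature with the lowest weighted error if that error is at or
# below a threshold eps, and up-weight misclassified samples; stop otherwise.
import math

def stump_error(col, y, w, thr=0.5):
    """Weighted error of a single-feature threshold classifier (labels +1/-1)."""
    preds = [1 if v > thr else -1 for v in col]
    miss = [1 if p != yi else 0 for p, yi in zip(preds, y)]  # first sign values
    return sum(wi * m for wi, m in zip(w, miss)) / sum(w), miss

def screen_features(X, y, eps=0.3, max_rounds=None):
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n
    remaining, selected = set(range(d)), []
    for _ in range(max_rounds or d):
        errors = {j: stump_error([row[j] for row in X], y, w) for j in remaining}
        j_best = min(errors, key=lambda j: errors[j][0])
        err, miss = errors[j_best]
        if err > eps:
            break                    # no remaining feature is good enough: stop
        selected.append(j_best)
        remaining.discard(j_best)
        if not remaining:
            break
        # scale up weights of samples the winning classifier got wrong
        # (illustrative e-based coefficient, stands in for the claimed update)
        coef = math.exp(max(err, 1e-10))
        w = [wi * (coef if m else 1.0) for wi, m in zip(w, miss)]
    return selected
```

On a toy set where feature 0 is perfectly predictive and feature 1 is noise, only feature 0 survives the screening.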
6. The method of claim 1, wherein, prior to iteratively training the selected classifier using the set of first blockchain address samples, the method comprises determining an optimal number of iterations for the iterative training by:
dividing the set of first blockchain address samples into K subsets, training the classifier using K-1 of the subsets each time, and calculating a cross-validation error rate estimate after each training;
calculating the variance of the cross-validation error rate estimates obtained from the trainings, and selecting the target number of iterations corresponding to the minimum cross-validation error rate estimate;
and calculating the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates, and selecting, from a preset set of iteration counts, the smallest iteration count whose corresponding cross-validation error rate estimate does not exceed the sum, as the optimal number of iterations.
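Claim 6's selection rule resembles the "one standard error" heuristic familiar from cross-validation practice: take the minimum error estimate, add the standard deviation of the estimates, and pick the smallest iteration count whose estimate does not exceed that sum. A minimal sketch, assuming one cross-validation error estimate per candidate iteration count (all names and data illustrative):

```python
# Hedged sketch of the claimed optimal-iteration-count selection.
import statistics

def optimal_iterations(candidates, cv_errors):
    """candidates: iteration counts; cv_errors: CV error estimate per count."""
    best_err = min(cv_errors)
    bound = best_err + statistics.stdev(cv_errors)   # min estimate + std dev
    eligible = [m for m, e in zip(candidates, cv_errors) if e <= bound]
    return min(eligible)    # smallest count within one std dev of the minimum
```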
7. The method of claim 6, wherein:
in the case where the current iteration is the first iteration, the sample weight of the first blockchain address sample in the current iteration is the initial sample weight of the first blockchain address sample;
in the case where the current iteration is not the first iteration, the sample weight of the first blockchain address sample in the current iteration is obtained by calculating the product of the following three items: the sample weight of the first blockchain address sample in the previous iteration; a value obtained from the classifier weight of the current iteration through a preset function operation; and the third sign function value corresponding to the first blockchain address sample, wherein the preset function is an exponential function with base e.
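Claim 7's combination of the previous weight, an e-based exponential of the classifier weight, and the third sign function value can be read as the familiar exponential weight update, with the sign function value acting as a misclassification indicator in the exponent. This reading, and every name below, is an interpretive assumption for the sketch:

```python
# Hedged sketch of the per-iteration sample weight update: a misclassified
# sample (indicator s = 1) has its weight multiplied by e^alpha, while a
# correctly classified sample (s = 0) keeps its previous weight.
import math

def update_sample_weights(weights, labels, preds, alpha):
    new_w = []
    for w, y, p in zip(weights, labels, preds):
        s = 1 if abs(y - p) > 0 else 0         # third sign function value
        new_w.append(w * math.exp(alpha * s))  # previous weight x e^(alpha*s)
    return new_w
```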
8. The method of claim 1, wherein the classifiers selected for each iteration are the same or different linear decision boundary classifiers.
9. A blockchain address classification device, comprising:
a first blockchain address sample set determination module for determining a set of first blockchain address samples, each of the first blockchain address samples including a first number of features representing blockchain addresses;
a classifier iterative training module for iteratively training a selected classifier using the set of first blockchain address samples, wherein for each iteration: a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; a classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of the first blockchain address sample in the current iteration is calculated based on the initial sample weight of the first blockchain address sample or the sample weight of the first blockchain address sample in the previous iteration;
a blockchain address classification model determination module for determining a blockchain address classification model based on each classifier and the corresponding classifier weight after the iterative training, the blockchain address classification model being used for determining the class of a blockchain address to be classified;
the classifier iterative training module calculates the weighted average error rate at each iteration by: processing the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain corresponding third sign function values; performing a weighted summation of the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum; and taking the ratio of the weighted sum to the sum of all the sample weights in the current iteration as the weighted average error rate;
the blockchain address classification model is shown as follows:

f(x) = sign( Σ_{i=1}^{M} α_i · g_i(x) )

wherein f(x) is the output of the blockchain address classification model, the output indicating the class of the blockchain address to be classified; x represents a blockchain address; g_i represents the classifier selected in the i-th iteration; α_i represents the classifier weight of classifier g_i; g_i(x) represents the predicted value of the blockchain address in the i-th iteration; M represents the number of iterations; and sign() represents the sign function.
10. The apparatus of claim 9, wherein the first blockchain address sample set determination module is further to:
obtaining a set of second blockchain address samples, each of the second blockchain address samples including a second number of features representing blockchain addresses, the second number being greater than or equal to the first number;
and performing multiple rounds of feature screening to select the first number of features from the second number of features, and constructing corresponding first blockchain address samples from the first number of features selected from each second blockchain address sample, so as to determine the set of first blockchain address samples.
11. The apparatus of claim 10, wherein the first blockchain address sample set determination module includes a feature screening sub-module configured to perform, in each round of feature screening, the following steps:
training simplified version classifiers using single features from a feature set to be screened, wherein the simplified version classifiers correspond one-to-one to the features in the feature set to be screened, and the feature set to be screened is the set of features not yet selected from the second number of features;
Calculating the error rate of each simplified version classifier to obtain the minimum error rate;
if the minimum error rate is smaller than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened;
and if the minimum error rate is greater than the preset threshold, ending the process of the multi-round feature screening to obtain the first number of features.
12. The apparatus of claim 11, wherein the feature screening sub-module calculates the error rate of the simplified version classifier by:
calculating, for each training sample, the predicted value of the corresponding simplified version classifier and the label value, wherein the training sample comprises the single feature used to train the simplified version classifier;
and processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values, and performing a weighted summation of the first sign function values according to the weights of the corresponding training samples in the current round of screening to obtain the error rate of the simplified version classifier.
13. The apparatus of claim 12, wherein the first blockchain address sample set determination module further includes a weight update sub-module to:
processing the absolute value of the difference between the label value of the training sample and the predicted value obtained by the simplified version classifier corresponding to the minimum error rate through a sign function to obtain a second sign function value, and calculating a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value;
multiplying the weight of the training sample in the current round of screening by the coefficient, so as to update the current weight of the training sample to the weight of the training sample in the next round of screening.
14. The apparatus of claim 9, further comprising an optimal iteration number determination module configured to determine an optimal iteration number for the iterative training by:
dividing the set of first blockchain address samples into K subsets, training the classifier using K-1 of the subsets each time, and calculating a cross-validation error rate estimate after each training;
calculating the variance of the cross-validation error rate estimates obtained from the trainings, and selecting the target number of iterations corresponding to the minimum cross-validation error rate estimate;
and calculating the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates, and selecting, from a preset set of iteration counts, the smallest iteration count whose corresponding cross-validation error rate estimate does not exceed the sum, as the optimal number of iterations.
15. The apparatus of claim 14, wherein:
in the case where the current iteration is the first iteration, the sample weight of the first blockchain address sample in the current iteration is the initial sample weight of the first blockchain address sample;
in the case where the current iteration is not the first iteration, the sample weight of the first blockchain address sample in the current iteration is obtained by calculating the product of the following three items: the sample weight of the first blockchain address sample in the previous iteration; a value obtained from the classifier weight of the current iteration through a preset function operation; and the third sign function value corresponding to the first blockchain address sample, wherein the preset function is an exponential function with base e.
16. The apparatus of claim 9, wherein the classifier iterative training module selects the classifier at each iteration to be the same or different linear decision boundary classifier.
17. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-8.
18. A computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method according to any of claims 1-8.
CN202110480968.9A 2021-04-30 2021-04-30 Block chain address classification method and device Active CN113177596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110480968.9A CN113177596B (en) 2021-04-30 2021-04-30 Block chain address classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110480968.9A CN113177596B (en) 2021-04-30 2021-04-30 Block chain address classification method and device

Publications (2)

Publication Number Publication Date
CN113177596A CN113177596A (en) 2021-07-27
CN113177596B true CN113177596B (en) 2024-03-22

Family

ID=76925718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110480968.9A Active CN113177596B (en) 2021-04-30 2021-04-30 Block chain address classification method and device

Country Status (1)

Country Link
CN (1) CN113177596B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368911A (en) * 2020-03-03 2020-07-03 腾讯科技(深圳)有限公司 Image classification method and device and computer readable storage medium
CN111444232A (en) * 2020-01-03 2020-07-24 上海宓猿信息技术有限公司 Method for mining digital currency exchange address and storage medium
CN111754345A (en) * 2020-06-18 2020-10-09 天津理工大学 Bit currency address classification method based on improved random forest
CN111797942A (en) * 2020-07-23 2020-10-20 深圳壹账通智能科技有限公司 User information classification method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11562228B2 (en) * 2019-06-12 2023-01-24 International Business Machines Corporation Efficient verification of machine learning applications


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Analysis on block chain financial transaction under artificial neural network of deep learning; Wenyou Gao et al.; Journal of Computational and Applied Mathematics; full text *
Regulating Cryptocurrencies: A Supervised Machine Learning Approach to De-Anonymizing the Bitcoin Blockchain; Hao Hua Sun Yin et al.; Journal of Management Information Systems; full text *
Research on security challenges and solutions of the blockchain ecosystem; Yang Xia; Cyberspace Security; Vol. 11, No. 3; full text *
A heuristic-based Bitcoin address clustering method; Mao Hongliang et al.; Journal of Beijing University of Posts and Telecommunications, No. 02; full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant