CN113177596A - Block chain address classification method and device - Google Patents

Block chain address classification method and device

Info

Publication number
CN113177596A
CN113177596A (application CN202110480968.9A)
Authority
CN
China
Prior art keywords
sample
classifier
error rate
block chain
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110480968.9A
Other languages
Chinese (zh)
Other versions
CN113177596B (en)
Inventor
穆长春 (Mu Changchun)
吕远 (Lü Yuan)
卿苏德 (Qing Sude)
王艳辉 (Wang Yanhui)
吴浩 (Wu Hao)
刘睿 (Liu Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Currency Institute of the People's Bank of China
Original Assignee
Digital Currency Institute of the People's Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Currency Institute of the People's Bank of China
Priority to CN202110480968.9A
Publication of CN113177596A
Application granted
Publication of CN113177596B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a block chain address classification method and device, relating to the field of computer technology. One embodiment of the method comprises: iteratively training selected classifiers using a set of first block chain address samples, wherein in each iteration a classifier is selected and trained with the goal of minimizing the weighted average error rate, the classifier weight of that classifier is calculated from the weighted average error rate of the current iteration, and the sample weight of each first block chain address sample in the current iteration is calculated from either its initial sample weight or its sample weight in the previous iteration; and determining a block chain address classification model from the iteratively trained classifiers and their classifier weights, which is then used to determine the category of a block chain address to be classified. This embodiment improves the accuracy and reliability of block chain address classification and overcomes defects of the prior art such as error propagation, insufficient algorithm generalization, and high hardware-resource and time costs.

Description

Block chain address classification method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for block chain address classification.
Background
Among the common methods for classifying blockchain addresses, the common-input heuristic rests on a key assumption: if a transaction has multiple input addresses, those input addresses are assumed to belong to the same entity; in short, different addresses paying into the same transaction are treated as one entity. When this assumption fails, for example when a transaction mixes in inputs from other entities or has only a single input, the address labeling results are no longer reliable. Moreover, the initial address labels must themselves be accurate, otherwise every result derived from them by subsequent propagation is unreliable. The change-address heuristic is a classical classification method for the UTXO (unspent transaction output) data model. A UTXO cannot be split once created, so blockchain transactions frequently produce "change": when the amount the payer spends (typically in cryptocurrency) exceeds the amount the recipient is owed, the excess is transferred to a change address. This algorithm must first judge whether a change action exists and then judge which address is the change address, so an error in the first decision propagates into the second; it also depends heavily on rules and assumptions, generalizes poorly, and requires multiple traversals of the data, making its hardware-resource and time costs high. Existing linear decision boundary classifiers perform well in many application scenarios but are unsatisfactory for blockchain address classification, suffering low accuracy due to factors such as model misspecification, fixed sample weights, and uneven distribution of samples across labels.
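For concreteness, the common-input heuristic criticized above can be sketched as a simple union-find clustering over transaction inputs. This is an illustrative sketch of the baseline the invention improves on, not the claimed method, and the `transactions` input format is an assumption for the example:

```python
class UnionFind:
    """Minimal union-find structure for grouping addresses into entities."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_by_common_input(transactions):
    """All input addresses of one transaction are assumed to be one entity."""
    uf = UnionFind()
    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            uf.find(addr)  # register address (covers single-input txs)
        for addr in inputs[1:]:
            uf.union(inputs[0], addr)
    clusters = {}
    for addr in uf.parent:
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())
```

As the background notes, a single foreign input in one transaction is enough to merge two unrelated entities here, which is exactly the fragility the invention targets.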
In the process of implementing the present invention, the inventors found that the prior art has at least the following problems:
existing block chain address classification methods suffer from low accuracy and poor reliability, error propagation, insufficient algorithm generalization, and high hardware-resource and time costs.
Disclosure of Invention
In view of this, embodiments of the present invention provide a block chain address classification method and apparatus, which can improve the accuracy and reliability of block chain address classification and overcome defects of the prior art such as error propagation, insufficient algorithm generalization, and high hardware-resource and time costs.
To achieve the above object, according to one aspect of the embodiments of the present invention, a block chain address classification method is provided.
A block chain address classification method, comprising: determining a set of first blockchain address samples, each first blockchain address sample including a first number of features representing a blockchain address; iteratively training selected classifiers using the set of first blockchain address samples, wherein, in each iteration: a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; the classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is calculated based on either the initial value of its sample weight or its sample weight in the previous iteration; and determining a blockchain address classification model based on each iteratively trained classifier and its corresponding classifier weight, the blockchain address classification model being used to determine the category of a blockchain address to be classified.
Optionally, the determining of a set of first blockchain address samples comprises: obtaining a set of second blockchain address samples, each second blockchain address sample comprising a second number of features representing a blockchain address, the second number being greater than or equal to the first number; and performing multiple rounds of feature screening to select the first number of features from the second number of features, and constructing the corresponding first blockchain address sample from the selected first number of features in each second blockchain address sample, so as to determine the set of first blockchain address samples.
Optionally, the performing multiple rounds of feature screening to select the first number of features from the second number of features includes performing the following steps in each round of feature screening: training a simplified version classifier by using single features in a feature set to be screened, wherein the simplified version classifier corresponds to the features in the feature set to be screened one by one, and the feature set to be screened is a set formed by features which are not selected in the second number of features; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is smaller than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened; if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening process to obtain the first number of features.
Optionally, the error rate of the simplified version classifier is calculated by: calculating a predicted value and a label value for each training sample corresponding to the simplified version classifier, the training samples comprising the single feature used to train the simplified version classifier; processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values; and performing a weighted summation of the first sign function values according to the weights of the corresponding training samples in the current screening round to obtain the error rate of the simplified version classifier.
Optionally, if the minimum error rate is less than or equal to the preset threshold, after selecting the features used by the simplified version classifier corresponding to the minimum error rate, the method further includes: processing the absolute value of the difference between the label value and the predicted value of each training sample, as obtained by the simplified version classifier corresponding to the minimum error rate, through a sign function to obtain a second sign function value, and calculating a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value; and multiplying the weight of the training sample in the current round of screening by the coefficient, so as to update the current weight of the training sample to its weight in the next round of screening.
Optionally, before iteratively training the selected classifiers using the set of first blockchain address samples, the optimal number of iterations of the iterative training is determined by: dividing the set of first blockchain address samples into K subsets, training the classifier each time on K-1 of the subsets, and calculating a cross-validation error rate estimate after each training; calculating the variance of the cross-validation error rate estimates obtained from the trainings, and selecting the target iteration number corresponding to the minimum cross-validation error rate estimate; and calculating the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates, and selecting, from a preset set of iteration numbers, the smallest iteration number whose corresponding cross-validation error rate estimate does not exceed that sum as the optimal iteration number.
Optionally, at each iteration, the weighted average error rate is calculated by: and processing the absolute value of the difference between the label value and the predicted value of each first block chain address sample through a sign function to obtain a corresponding third sign function value, performing weighted summation on each third sign function value according to the sample weight of the corresponding first block chain address sample in the current iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all the sample weights in the current iteration as the weighted average error rate.
Optionally, when the current iteration is a first iteration, a sample weight of the first blockchain address sample in the current iteration is a sample weight initial value of the first blockchain address sample; under the condition that the current iteration is not the first iteration, the sample weight of the first block chain address sample in the current iteration is obtained by calculating the product of the following three terms: the sample weight of the first block chain address sample in the previous iteration, a value obtained by a classifier weight in the current iteration through a preset function operation, and the third sign function value corresponding to the first block chain address sample, wherein the preset function is an exponential function with e as a base.
Optionally, the classifiers selected for each iteration are the same or different linear decision boundary classifiers.
According to another aspect of the embodiments of the present invention, a block chain address classification apparatus is provided.
A block chain address classification apparatus, comprising: a first set of blockchain address samples determining module for determining a set of first blockchain address samples, each of the first blockchain address samples comprising a first number of features representing blockchain addresses; a classifier iterative training module to iteratively train a selected classifier using the set of first blockchain address samples, wherein for each iteration: selecting a classifier to train with the goal of minimizing weighted average error rate, wherein the weighted average error rate of the current iteration is related to the sample weight of each first block chain address sample in the current iteration, calculating the classifier weight of the classifier based on the weighted average error rate of the current iteration, and the sample weight of the first block chain address sample in the current iteration is calculated based on the initial value of the sample weight of the first block chain address sample or the sample weight of the first block chain address sample in the previous iteration; and the block chain address classification model determining module is used for determining a block chain address classification model based on each iteratively trained classifier and the corresponding classifier weight, and the block chain address classification model is used for determining the category of the block chain address to be classified.
Optionally, the first block chain address sample set determining module is further configured to: obtaining a set of second blockchain address samples, each of the second blockchain address samples comprising a characteristic representing a second number of blockchain addresses, the second number being greater than or equal to the first number; performing a plurality of rounds of feature screening to select the first number of features from the second number of features, and constructing the corresponding first blockchain address sample according to the selected first number of features in each second blockchain address sample to determine the set of first blockchain address samples.
Optionally, the first blockchain address sample set determining module includes a feature filtering sub-module, configured to: in each round of feature screening, the following steps are performed: training a simplified version classifier by using single features in a feature set to be screened, wherein the simplified version classifier corresponds to the features in the feature set to be screened one by one, and the feature set to be screened is a set formed by features which are not selected in the second number of features; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is smaller than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened; if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening process to obtain the first number of features.
Optionally, the feature filtering sub-module calculates the error rate of the simplified version classifier by: calculating a predicted value and a label value for each training sample corresponding to the simplified version classifier, the training samples comprising the single feature used to train the simplified version classifier; processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values; and performing a weighted summation of the first sign function values according to the weights of the corresponding training samples in the current screening round to obtain the error rate of the simplified version classifier.
Optionally, the first blockchain address sample set determining module further includes a weight updating sub-module, configured to: process the absolute value of the difference between the label value and the predicted value of each training sample, as obtained by the simplified version classifier corresponding to the minimum error rate, through a sign function to obtain a second sign function value, and calculate a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value; and multiply the weight of the training sample in the current round of screening by the coefficient, so as to update the current weight of the training sample to its weight in the next round of screening.
Optionally, the method further includes an optimal iteration number determining module, configured to determine an optimal iteration number of the iterative training by: dividing the set of the first block chain address samples into K subsets, training a classifier by using the K-1 subsets each time, and calculating a cross validation error rate estimation value after each training; calculating the variance of the cross validation error rate estimated value by using the cross validation error rate estimated value obtained by each training so as to select the target iteration times corresponding to the minimum cross validation error rate estimated value; and calculating the sum of the minimum cross validation error rate estimation value and the standard deviation of the cross validation error rate estimation value, and selecting the minimum value of each iteration number of which the corresponding cross validation error rate estimation value is not more than the sum from a preset iteration number set as the optimal iteration number.
Optionally, the classifier iterative training module calculates the weighted average error rate at each iteration by: and processing the absolute value of the difference between the label value and the predicted value of each first block chain address sample through a sign function to obtain a corresponding third sign function value, performing weighted summation on each third sign function value according to the sample weight of the corresponding first block chain address sample in the current iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all the sample weights in the current iteration as the weighted average error rate.
Optionally, when the current iteration is a first iteration, a sample weight of the first blockchain address sample in the current iteration is a sample weight initial value of the first blockchain address sample; under the condition that the current iteration is not the first iteration, the sample weight of the first block chain address sample in the current iteration is obtained by calculating the product of the following three terms: the sample weight of the first block chain address sample in the previous iteration, a value obtained by a classifier weight in the current iteration through a preset function operation, and the third sign function value corresponding to the first block chain address sample, wherein the preset function is an exponential function with e as a base.
Optionally, the classifier selected by the classifier iterative training module at each iteration is the same or a different linear decision boundary classifier.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for block chain address classification provided by embodiments of the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer readable medium, on which a computer program is stored, which when executed by a processor implements the method for block chain address classification provided by an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: selected classifiers are iteratively trained using a set of first blockchain address samples, wherein, in each iteration, a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; the classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is calculated based on either the initial value of its sample weight or its sample weight in the previous iteration. A blockchain address classification model is then determined based on each iteratively trained classifier and its corresponding classifier weight, and is used to classify the blockchain address to be classified. The method improves the accuracy and reliability of blockchain address classification and overcomes defects of the prior art such as error propagation, insufficient algorithm generalization, and high hardware-resource and time costs.
Further effects of the above non-conventional implementation will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a block chain address classification method according to an embodiment of the present invention;
FIG. 2 is a block chain address classification flow diagram according to an embodiment of the invention;
FIG. 3 is a block diagram of a device for sorting blockchain addresses according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 5 is a schematic block diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram illustrating the main steps of a block chain address classification method according to an embodiment of the present invention.
As shown in fig. 1, the block chain address classification method according to an embodiment of the present invention mainly includes the following steps S101 to S103.
Step S101: a set of first blockchain address samples is determined, each first blockchain address sample including a first number of features representing a blockchain address.
Step S102: the selected classifiers are iteratively trained using the set of first blockchain address samples, wherein, in each iteration: a classifier is selected and trained with the goal of minimizing the weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; the classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is calculated based on either the initial value of its sample weight or its sample weight in the previous iteration;
step S103: and determining a block chain address classification model based on each classifier after iterative training and the corresponding classifier weight, wherein the block chain address classification model is used for determining the category of the block chain address to be classified.
The step of determining the set of first blockchain address samples may specifically include: obtaining a set of second blockchain address samples, each second blockchain address sample comprising a second number of features representing a blockchain address, the second number being greater than or equal to the first number; and performing multiple rounds of feature screening to select a first number of features from the second number of features, and constructing the corresponding first blockchain address sample from the selected first number of features in each second blockchain address sample, so as to determine the set of first blockchain address samples.
The step of performing a plurality of rounds of feature screening to select a first number of features from a second number of features may specifically include, in each round of feature screening, performing the steps of: training a simplified version classifier by using single features in the feature set to be screened, wherein the simplified version classifier corresponds to the features in the feature set to be screened one by one, and the feature set to be screened is a set formed by features which are not selected in the second number of features; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is less than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating a feature set to be screened; if the minimum error rate is greater than the preset threshold value, ending the multi-round feature screening process to obtain a first number of features.
The error rate of the simplified version classifier can be calculated as follows: calculate a predicted value and a label value for each training sample corresponding to the simplified version classifier, where the training samples contain the single feature used to train that simplified version classifier; process the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values; and perform a weighted summation of the first sign function values according to the weights of the corresponding training samples in the current screening round to obtain the error rate of the simplified version classifier.
If the obtained minimum error rate is less than or equal to the preset threshold, after selecting the features used by the simplified version classifier corresponding to the minimum error rate, the method may further include: processing the absolute value of the difference between the label value and the predicted value of each training sample, as obtained by the simplified version classifier corresponding to the minimum error rate, through a sign function to obtain a second sign function value; calculating a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value; and multiplying the weight of the training sample in the current round of screening by this coefficient, thereby updating the current weight of the training sample to its weight in the next round of screening.
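The steps above — a single-feature simplified classifier per remaining feature, minimum-error selection with a stopping threshold, and sample re-weighting — can be sketched as follows. This is a hedged illustration, not the patent's exact formulas: the `stump_predict` median-threshold learner and the exponential form of the weight-update coefficient are assumptions standing in for the unspecified simplified version classifier and update rule, and labels are taken as 0/1:

```python
import numpy as np


def stump_predict(x, y):
    """Assumed stand-in for the 'simplified version classifier': a
    one-feature threshold stump fit at the median of the feature."""
    t = np.median(x)
    above = y[x > t].mean() >= 0.5 if (x > t).any() else False
    pred_above = 1 if above else 0
    return np.where(x > t, pred_above, 1 - pred_above)


def screen_features(X, y, threshold=0.3):
    """Greedy multi-round feature screening (labels assumed 0/1).

    Each round trains one single-feature classifier per remaining feature,
    keeps the feature with the lowest weighted error rate, updates the
    sample weights, and stops once even the best error exceeds `threshold`.
    """
    n_samples, n_features = X.shape
    weights = np.ones(n_samples) / n_samples        # initial sample weights
    remaining = set(range(n_features))
    selected = []
    while remaining:
        errors, misses = {}, {}
        for j in remaining:
            pred = stump_predict(X[:, j], y)
            # sign(|prediction - label|): 1 where misclassified, else 0
            miss = np.sign(np.abs(pred - y))
            errors[j] = np.dot(weights, miss) / weights.sum()
            misses[j] = miss
        best = min(errors, key=errors.get)
        if errors[best] > threshold:
            break                                   # best feature too weak
        selected.append(best)
        remaining.remove(best)
        # up-weight the samples the chosen classifier misclassified
        # (assumed exponential form of the weight-update coefficient)
        eps = max(errors[best], 1e-10)
        weights = weights * np.exp(np.log((1 - eps) / eps) * misses[best])
    return selected
```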
Before the step of iteratively training the selected classifiers using the set of first blockchain address samples, the optimal number of iterations of the iterative training may be determined as follows: divide the set of first blockchain address samples into K subsets, train the classifier each time on K-1 of the subsets, and calculate a cross-validation error rate estimate after each training; calculate the variance of the cross-validation error rate estimates obtained from the trainings, and select the target iteration number corresponding to the minimum cross-validation error rate estimate; then calculate the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates, and select, from a preset set of iteration numbers, the smallest iteration number whose corresponding cross-validation error rate estimate does not exceed that sum, as the optimal iteration number.
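The selection rule described above resembles the classic one-standard-error rule for cross-validation. A minimal sketch, assuming the K per-fold error estimates for each candidate iteration number have already been computed:

```python
import numpy as np


def pick_optimal_iterations(cv_errors):
    """One-standard-error-style selection over K-fold estimates.

    `cv_errors` maps each candidate iteration count m to the K per-fold
    error-rate estimates (assumed already computed by training on K-1
    folds and validating on the held-out fold).
    """
    means = {m: float(np.mean(errs)) for m, errs in cv_errors.items()}
    m_best = min(means, key=means.get)       # minimum CV error estimate
    std = float(np.std(cv_errors[m_best]))   # spread at the minimum
    bound = means[m_best] + std
    # smallest iteration count whose CV error does not exceed min + std
    return min(m for m in means if means[m] <= bound)
```

Preferring the smallest qualifying iteration count trades a statistically insignificant error increase for a cheaper, less overfit ensemble.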
At each iteration, the weighted average error rate may be calculated as follows: and processing the absolute value of the difference between the label value of each first block chain address sample and the predicted value through a sign function to obtain a corresponding third sign function value, performing weighted summation on each third sign function value according to the sample weight of the corresponding first block chain address sample in the current iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all sample weights in the current iteration as a weighted average error rate.
For each iteration, under the condition that the current iteration is the first iteration, the sample weight of the first block chain address sample in the current iteration is the initial value of the sample weight of the first block chain address sample; under the condition that the current iteration is not the first iteration, the sample weight of the first block chain address sample in the current iteration is obtained by calculating the product of the following three terms: the sample weight of the first block chain address sample in the previous iteration, a value obtained by the classifier weight in the current iteration through the operation of a preset function and a third symbol function value corresponding to the first block chain address sample, wherein the preset function is an exponential function taking e as a base.
The classifiers selected for each iteration may be the same or different linear decision boundary classifiers.
Fig. 2 is a block chain address classification flow diagram according to an embodiment of the invention.
The block chain address classification process of the embodiment of the invention realizes the classification of the block chain address to be classified through the self-adaptive enhanced block chain address classification algorithm based on the linear decision boundary classifier provided by the embodiment of the invention, and mainly comprises the following steps: extracting a block chain address on a block chain, performing feature engineering to select more than one hundred features including cross section and time sequence features, and randomly dividing sample data into a training set and a test set according to a self-defined proportion; screening the features by using a training set to obtain a set of first block chain address samples; determining an optimal iteration number M by using K-fold cross validation by using the set of first block chain address samples; training a block chain address classification model by using the set of the first block chain address samples according to the optimal iteration times M; predicting the test set by using the trained block chain address classification model, and verifying the accuracy of the model; and classifying the blockchain address to be classified by using a blockchain address classification model.
The specific process of the block chain address classification according to the embodiment of the present invention is described in detail below.
In the data preparation step, relevant addresses (block chain addresses that already carry classification labels and those that need to be labeled) are extracted from the block chain, and feature engineering is performed: the whole block chain is traversed, and as many features as possible, including cross-sectional and time-series information, are extracted from the historical transaction information stored on the chain to obtain block chain address sample data.
And splitting the sample data of the block chain address into a training set and a test set, and performing feature screening, wherein the training set used in the feature screening stage is a set of second block chain address samples.
The following describes a specific process of feature screening according to an embodiment of the present invention.
For example, in a binary classification scenario, assume the blockchain address samples (i.e., the second blockchain address samples) are (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_j ∈ R^P, j = 1, ..., n, is the P-dimensional covariate of each blockchain address sample; y_j ∈ {1, −1} is the label of the second blockchain address sample, which in embodiments of the present invention may be 1 or −1; P (i.e., the second number) is the number of features of a second blockchain address sample (i.e., the dimension of the covariate); n is the sample size of the training set; n+ is the number of positive samples in the training set and n− is the number of negative samples in the training set; W_{i,j} is the weight of the j-th sample at the i-th iteration (i.e., the i-th round of screening); β is the maximum acceptable error rate; h_p is the linear classifier obtained by training with the p-th feature alone, p = 1, ..., P.
Initial weights are assigned to the positive and negative samples (labels y_j = 1 and y_j = −1) respectively: a second blockchain address sample with y_j = 1 receives the initial weight W_{1,j} = 1/(2·n+), and a second blockchain address sample with y_j = −1 receives the initial weight W_{1,j} = 1/(2·n−). The weights are normalized as follows:

W_{i,j} ← W_{i,j} / Σ_{j=1}^{n} W_{i,j}
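The initialization and normalization just described can be sketched in a few lines of NumPy (the function name is ours, not from the embodiment):

```python
import numpy as np

def init_sample_weights(y):
    """Assign initial weights 1/(2*n_pos) to positive samples (y_j = 1)
    and 1/(2*n_neg) to negative samples (y_j = -1), then normalize so
    the weights sum to one."""
    y = np.asarray(y)
    n_pos = int(np.sum(y == 1))
    n_neg = int(np.sum(y == -1))
    w = np.where(y == 1, 1.0 / (2 * n_pos), 1.0 / (2 * n_neg))
    return w / w.sum()   # W_1j <- W_1j / sum_j W_1j
```

When n+ = n− this reduces to uniform weights; otherwise the two classes start with equal total mass, which compensates for class imbalance in the training set.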
For each feature p (a single feature), a simplified version classifier is trained using that feature alone: the simplified version linear classifier h_p is trained on training samples whose features include only that single feature of the second blockchain address sample, with the label y_j and the weight W_{i,j} described above. In the feature screening stage, this weight is called the training sample weight.
The following error rate is then calculated:

ε_p = Σ_{j=1}^{n} W_{i,j} · sign(|h_p(x_j) − y_j|)

ε_p is called the error rate of the simplified version classifier h_p, and sign(x) is a sign function:

sign(x) = 1 if x > 0; sign(x) = 0 if x = 0.

h_p(x_j) is the predicted value of the j-th training sample, y_j is the label value of the j-th training sample, and sign(|h_p(x_j) − y_j|) is the first sign function value.
In each round of feature screening, the number of simplified version classifiers trained equals the number of features remaining in the feature set to be screened: for each candidate feature, one simplified version classifier is trained on training samples containing that feature, so the simplified version classifiers correspond one to one with the features in the feature set to be screened. Each simplified version classifier trained with a feature yields an error rate, from which the minimum error rate can be selected.
In the first round of feature screening (i = 1), the P error rates ε_p, p = 1, ..., P, are calculated, and the minimum error rate min{ε_1, ..., ε_{P−i+1}} among them is determined. It is then judged whether the minimum error rate is less than or equal to β (i.e., the preset threshold representing the maximum acceptable error rate). If so, the feature p_min = argmin_p ε_p that attains the minimum error rate is selected and added to the feature list. That is, the feature used by the simplified version classifier h_{p_min} corresponding to the minimum error rate is selected, and h_{p_min} is the simplified version classifier corresponding to the minimum error rate, i.e., the optimal linear decision boundary classifier.
After this round's feature is selected, the training sample weights are updated for j = 1, ..., n:

W_{i+1,j} = W_{i,j} × c_{i,j}

where the coefficient c_{i,j} used to update the weight of the training sample is calculated from the minimum error rate ε_{p_min} obtained in this round of feature screening and the second sign function value sign(|h_{p_min}(x_j) − y_j|); h_{p_min}(x_j) is the predicted value of the classifier h_{p_min}, and y_j is the label value of the j-th training sample.
The next round of feature screening then begins: simplified version classifiers are again trained with single features, and features are screened in the same way, i.e., by computing the minimum error rate. The training sample weights are updated once per iteration of the algorithm.
In any round of feature screening, if the minimum error rate min{ε_1, ..., ε_{P−i+1}} > β, the calculation stops and the next round of feature screening is not performed, i.e., the whole multi-round feature screening process ends.

After the multi-round feature screening process is completed, the number of features in the resulting feature list is denoted P* (i.e., the first number): feature screening selects P* features from the original P features, and ε_1, ..., ε_{P*} are the error rates corresponding to the screened features.
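The multi-round screening loop described above can be sketched as follows. The sketch uses a weighted one-dimensional threshold rule (a decision stump) as the single-feature simplified version classifier, and a Viola-Jones-style coefficient ε/(1 − ε) that shrinks the weights of correctly classified samples; both choices are assumptions on our part, since the embodiment leaves the concrete classifier and the exact coefficient open:

```python
import numpy as np

def stump_fit(x, y, w):
    """Best weighted threshold rule on a single feature: predict s when
    x > t and -s otherwise, choosing t and s to minimize the weighted
    error (a 1-D linear decision boundary)."""
    best_err, best_pred = np.inf, None
    for t in np.unique(x):
        for s in (1, -1):
            pred = np.where(x > t, s, -s)
            err = np.sum(w * (pred != y))
            if err < best_err:
                best_err, best_pred = err, pred
    return best_err, best_pred

def screen_features(X, y, w, beta):
    """Multi-round feature screening: each round trains one single-feature
    classifier per remaining feature, keeps the feature attaining the
    minimum weighted error rate if that rate is <= beta, updates the
    training-sample weights, and repeats; otherwise screening stops."""
    remaining = list(range(X.shape[1]))
    selected = []
    w = np.asarray(w, dtype=float).copy()
    while remaining:
        w = w / w.sum()                        # normalize weights each round
        results = {p: stump_fit(X[:, p], y, w) for p in remaining}
        p_min = min(results, key=lambda p: results[p][0])
        eps, pred = results[p_min]
        if eps > beta:                         # minimum error rate above threshold
            break
        selected.append(p_min)
        remaining.remove(p_min)
        eps = float(np.clip(eps, 1e-6, 1 - 1e-6))  # guard degenerate rates
        miss = (pred != y).astype(float)       # second sign function value
        # assumed coefficient: keep misclassified weights, shrink the rest
        w = w * np.where(miss == 1.0, 1.0, eps / (1.0 - eps))
    return selected
```

On a toy set where both features separate the classes, both survive screening; in practice the loop keeps only features whose single-feature classifiers beat the β threshold on the reweighted samples.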
The following describes a specific process for determining the optimal iteration number of the iterative training according to the embodiment of the present invention.
A set of possible optimal solutions M′ = (M_1, M_2, ..., M_U) is selected according to the requirements on computing resources, data size, and model accuracy; this set is the preset iteration number set. The iteration number set is traversed using the training set data, and a classifier is trained with the feature set selected by the feature screening process, using K-fold cross validation to select the optimal iteration number M*.
The training set is split into K blocks of equal size and the procedure loops K times; each time, the model f̂_{−k}^{(u)} is trained using the K−1 blocks other than the k-th block. The model f̂^{(u)} is the model built for a single classifier: in the set (M_1, M_2, ..., M_U), each M corresponds to a single classifier, i.e., M_1 corresponds to the first single classifier, M_2 to the second single classifier, ..., and M_U to the U-th single classifier. The samples used to train the model f̂_{−k}^{(u)} have the P* screened features and the label y_j; that is, the embodiment of the present invention trains the model using the set of first blockchain address samples.
For the u-th single classifier (u = 1, 2, ..., U), a cross validation error rate estimate err_cv(M_u) is calculated:

err_cv(M_u) = (1/K) Σ_{k=1}^{K} (1/|block k|) Σ_{j ∈ block k} L(y_j, f̂_{−k}^{(u)}(x_j))

where L(·,·) is a loss function determined according to the data characteristics and analysis requirements, and may take various common forms, such as the least squares loss function; block k denotes the samples in the k-th block, |block k| denotes the number of samples in the k-th block, and k = 1, ..., K.
For each single classifier, the loop is repeated K times, and finally one cross validation error rate estimate err_cv(M_u) can be calculated according to the above method. Since each M in the set (M_1, M_2, ..., M_U) corresponds to a single classifier, U estimates err_cv(M_1), ..., err_cv(M_U) are obtained in total. From these U estimates, the variance of the cross validation error rate estimate, Var(err_cv), can be calculated.
The optimal iteration number M* is then selected. Specifically, the iteration number M_err_min with the lowest cross validation error rate estimate (i.e., the target iteration number) is selected first:

M_err_min = argmin_{M ∈ M′} err_cv(M)

Then the smallest M in the set M′ whose cross validation error rate estimate is not higher than the sum of err_cv(M_err_min) and SE(err_cv) is taken as M*, where err_cv(M_err_min) is the cross validation error rate estimate corresponding to M_err_min, and SE(err_cv) = sqrt(Var(err_cv)) is the standard deviation of the cross validation error rate estimates. M* is given by:

M* = min{M_1, ..., M_U}  s.t.  err_cv(M) ≤ err_cv(M_err_min) + SE(err_cv)

That is, under the constraint (s.t.), the smallest of M_1, ..., M_U is found; in other words, from the preset iteration number set, the smallest iteration number whose corresponding cross validation error rate estimate is not greater than the sum (of the minimum cross validation error rate estimate and the standard deviation of the cross validation error rate estimates) is selected as the optimal iteration number.
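This selection is the familiar one-standard-error rule, which can be sketched as follows (the function name and the use of the sample standard deviation across the U estimates are our reading of the text):

```python
import numpy as np

def choose_optimal_iterations(candidate_m, cv_err):
    """One-standard-error rule: find the lowest cross-validation error
    estimate among the candidate iteration counts, then return the
    smallest M whose estimate does not exceed that minimum plus the
    standard deviation of the U estimates."""
    candidate_m = list(candidate_m)
    cv_err = np.asarray(cv_err, dtype=float)
    err_min = cv_err.min()                 # err_cv at M_err_min
    se = cv_err.std(ddof=1)                # std dev across the U estimates
    feasible = [m for m, e in zip(candidate_m, cv_err) if e <= err_min + se]
    return min(feasible)
```

Favoring the smallest feasible M trades a little accuracy (within one standard error) for a cheaper, less overfit model, which matches the stated concern for computing resources.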
The following describes a specific process of determining a block chain address classification model according to an embodiment of the present invention.
n denotes the sample size of the training set; x_j ∈ R^{P*} denotes the P* features selected from the original P features of the data (i.e., the first number of features of the first blockchain address sample, selected from the second number of features of the second blockchain address sample). The blockchain address samples with covariates x_j ∈ R^{P*} are the first blockchain address samples, each of which contains the P* features obtained through feature screening. y_j ∈ {1, −1} denotes the label of a first blockchain address sample (the same label as that of the corresponding second blockchain address sample in the feature screening stage above), j = 1, ..., n. W_{i,j} denotes the weight of the j-th sample at the i-th iteration (the same as the weight of the second blockchain address sample in the feature screening stage above), referred to as the sample weight in the embodiment of the present invention. M* denotes the optimal iteration number. g_i denotes the linear classifier (classifier for short) selected at the i-th iteration, i = 1, ..., M*; the classifiers selected in different iterations may be of the same type or of different types: for example, one iteration may select a classifier based on logistic regression, and the next iteration may again select a logistic regression classifier, or may instead select a classifier based on a support vector machine (the classifier types are merely examples). α_i denotes the classifier weight of the classifier g_i.

For the j-th sample, j = 1, ..., n, the initial weight (i.e., the initial value of the sample weight) is defined as:

W_{1,j} = 1/n
For the i-th iteration, i = 1, ..., M*:

The linear classifier g_i is trained by minimizing the weighted average error rate:

err_i = [ Σ_{j=1}^{n} W_{i,j} · sign{|y_j − g_i(x_j)|} ] / [ Σ_{j=1}^{n} W_{i,j} ]

From this equation, the weighted average error rate err_i of the current iteration is related to the sample weight W_{i,j} of each first blockchain address sample in the current iteration. Here sign{|y_j − g_i(x_j)|} is the third sign function value, and g_i(x_j) is the predicted value of the j-th first blockchain address sample at the i-th iteration. The denominator Σ_{j=1}^{n} W_{i,j} is the sum of all sample weights in the i-th iteration; the numerator is the weighted sum obtained by weighting the third sign function values with the sample weights of the corresponding first blockchain address samples in the i-th iteration.
Each iteration of the algorithm trains a linear decision boundary classifier (i.e., linear classifier) with the goal of minimizing the error rate, i.e., the weighted average error rate described above.
The classifier weight α_i of the classifier g_i is calculated from the weighted average error rate as follows:

α_i = log[(1 − err_i) / err_i]
For j = 1, ..., n, the sample weight is calculated as:

W_{i+1,j} = W_{i,j} × exp(α_i) × sign{|y_j − g_i(x_j)|},  i = 1, ..., M*

That is, the sample weight W_{i+1,j} of a first blockchain address sample in a certain iteration is calculated based on the initial value W_{1,j} of the sample weight of the first blockchain address sample or on the sample weight W_{i,j} of the first blockchain address sample in the previous iteration. The value exp(α_i) is obtained by applying the preset function, an exponential function with base e, to the classifier weight of the i-th iteration. The sample weights are updated once per iteration of the algorithm.
After M* iterations are completed, the block chain address classification model is determined, based on each iteratively trained classifier and the corresponding classifier weight, as:

f(x) = sign( Σ_{i=1}^{M*} α_i · g_i(x) )
and testing the accuracy of the model on the test set by using the trained block chain address classification model, and comparing the error rate of the test set with the error rate of the training set to confirm that the model has no overfitting problem.
In the above block chain address classification model, f(x) is the output of the block chain address classification model and indicates the category of the block chain address to be classified; for example, in binary classification it may indicate whether the block chain address to be classified is an exchange block chain address or a non-exchange block chain address.
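The iterative training loop and the final weighted combination described above can be sketched as follows. Decision stumps stand in for the linear classifiers g_i, and the classifier weight α_i = log((1 − err_i)/err_i) together with the update exp(α_i · 1{misclassified}) follow the standard AdaBoost.M1 forms; these are our assumptions where the translated formulas are ambiguous:

```python
import numpy as np

def stump_fit(X, y, w):
    """Weighted decision stump over all features, used here as the base
    classifier g_i (a 1-D linear decision boundary)."""
    best = None
    for p in range(X.shape[1]):
        for t in np.unique(X[:, p]):
            for s in (1, -1):
                pred = np.where(X[:, p] > t, s, -s)
                err = np.sum(w * (pred != y)) / np.sum(w)  # weighted average error rate
                if best is None or err < best[0]:
                    best = (err, p, t, s)
    return best

def stump_predict(X, p, t, s):
    return np.where(X[:, p] > t, s, -s)

def adaboost_train(X, y, m_star):
    """Each of the m_star iterations fits a base classifier minimizing
    the weighted average error rate, computes its classifier weight
    alpha_i, and updates the sample weights."""
    y = np.asarray(y)
    n = len(y)
    w = np.ones(n) / n                         # W_1j = 1/n
    ensemble = []
    for _ in range(m_star):
        err, p, t, s = stump_fit(X, y, w)
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = np.log((1 - err) / err)        # classifier weight
        miss = (stump_predict(X, p, t, s) != y).astype(float)
        w = w * np.exp(alpha * miss)           # sample-weight update
        ensemble.append((alpha, p, t, s))
    return ensemble

def classify(X, ensemble):
    """Final model: sign of the classifier-weighted sum of base outputs."""
    score = sum(a * stump_predict(X, p, t, s) for a, p, t, s in ensemble)
    return np.sign(score)
```

Because misclassified samples gain weight at each iteration, later base classifiers concentrate on the hard addresses, which is what lets the boosted combination carve a nonlinear decision boundary out of linear pieces.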
The classifiers used in the above stages (the stages of feature screening, determining the optimal iteration number, determining the block chain address classification model, etc.) of the embodiment of the present invention may be various linear decision boundary classifiers, including but not limited to logistic regression, decision tree, random forest, support vector machine, etc.
In machine learning, a large number of linear decision boundary classifiers such as linear kernel support vector machines, logistic regression, naive bayes and the like exist, although these methods have good performance in many application scenarios, the effect of applying these methods in block chain address classification in the prior art is not ideal, and the main reasons are as follows: firstly, many linear decision boundary classifiers need to assume a linear relationship between an address tag and a covariate; secondly, the XOR problem (namely the classification boundary is a nonlinear hyperplane) is obvious in a high-dimensional space, and the linear decision boundary cannot well distinguish different classes; thirdly, a plurality of classifiers need to assume that covariates are independent from each other, and each covariate is equally important for label division; fourthly, the classifier usually assumes that the weight of each sample is constant; and fifthly, most classifiers usually have better classification results when the number distribution of samples of each class is relatively balanced. The embodiment of the invention provides a self-adaptive enhanced block chain address classification method based on a linear decision boundary classification model, which overcomes the problems in the prior art. In addition, the sample weight of the embodiment of the invention is dynamically updated, so that mutual independence between covariates is not required to be assumed and each covariate is equally important for label division as in the prior art, the weight of each sample of the embodiment of the invention is not constant, and the embodiment of the invention determines the optimal iteration times through a K-fold cross validation method and then carries out iterative training, so that an accurate classification result can be obtained under the condition of no need of balanced quantity distribution of samples of various classes.
Fig. 3 is a schematic diagram of the main blocks of the device for sorting the blockchain address according to an embodiment of the present invention.
As shown in fig. 3, the apparatus 300 for sorting block chain addresses according to an embodiment of the present invention mainly includes: a first blockchain address sample set determining module 301, a classifier iterative training module 302, and a blockchain address classification model determining module 303.
A first set of blockchain address samples determining module 301 for determining a set of first blockchain address samples, each first blockchain address sample comprising a first number of features representing blockchain addresses;
a classifier iterative training module 302 to iteratively train a selected classifier using a set of first blockchain address samples, wherein for each iteration: selecting a classifier to train with the objective of minimizing the weighted average error rate, wherein the weighted average error rate of the current iteration is related to the sample weight of each first block chain address sample in the current iteration, calculating the classifier weight of the classifier based on the weighted average error rate of the current iteration, and the sample weight of the first block chain address sample in the current iteration is calculated based on the initial value of the sample weight of the first block chain address sample or the sample weight of the first block chain address sample in the last iteration;
and the block chain address classification model determining module 303 is configured to determine a block chain address classification model based on each iteratively trained classifier and a corresponding classifier weight, where the block chain address classification model is used to determine a category of a block chain address to be classified.
The first blockchain address sample set determining module 301 may be specifically configured to: obtaining a set of second blockchain address samples, each second blockchain address sample comprising a characteristic representing a second number of blockchain addresses, the second number being greater than or equal to the first number; and performing multiple rounds of feature screening to select a first number of features from the second number of features, and constructing corresponding first blockchain address samples according to the selected first number of features in each second blockchain address sample to determine a set of first blockchain address samples.
The first blockchain address sample set determination module 301 may include a feature screening submodule for: in each round of feature screening, the following steps are performed: training a simplified version classifier by using single features in the feature set to be screened, wherein the simplified version classifier corresponds to the features in the feature set to be screened one by one, and the feature set to be screened is a set formed by features which are not selected in the second number of features; calculating the error rate of each simplified version classifier to obtain the minimum error rate; if the minimum error rate is less than or equal to a preset threshold value, selecting the features used by the simplified version classifier corresponding to the minimum error rate, and updating a feature set to be screened; if the minimum error rate is greater than the preset threshold value, ending the multi-round feature screening process to obtain a first number of features.
The feature filtering sub-module may calculate the error rate of the simplified version classifier by: calculating a predicted value and a label value of each training sample corresponding to the simplified version classifier, wherein the training samples comprise single characteristics used for training the simplified version classifier; and processing the absolute value of the difference between the predicted value and the label value of each training sample through a symbol function to obtain corresponding first symbol function values, and performing weighted summation on the first symbol function values according to corresponding weights of the corresponding training samples in the round of screening to obtain the error rate of the simplified version classifier.
The first blockchain address sample set determination module may further include a weight update submodule for: processing the absolute value of the difference between the label value and the predicted value of the training sample obtained by the simplified version classifier corresponding to the minimum error rate through a symbol function to obtain a second symbol function value, and calculating a coefficient for updating the weight of the training sample by using the minimum error rate and the second symbol function value calculated in the feature screening of the round; and multiplying the corresponding weight of the training sample in the current round of screening by the coefficient so as to update the current weight of the training sample to the corresponding weight of the training sample in the next round of screening.
The block chain address classification apparatus 300 may further include an optimal iteration number determining module, configured to determine an optimal iteration number of the iterative training by: dividing a set of first block chain address samples into K subsets, training a classifier by using the K-1 subsets each time, and calculating a cross validation error rate estimation value after each training; calculating the variance of the cross validation error rate estimated value by using the cross validation error rate estimated value obtained by each training so as to select the target iteration times corresponding to the minimum cross validation error rate estimated value; and calculating the sum of the minimum cross validation error rate estimation value and the standard deviation of the cross validation error rate estimation value, and selecting the minimum value of each iteration number of which the corresponding cross validation error rate estimation value is not more than the sum from a preset iteration number set as the optimal iteration number.
Classifier iterative training module 302 calculates a weighted average error rate at each iteration by: and processing the absolute value of the difference between the label value of each first block chain address sample and the predicted value through a sign function to obtain a corresponding third sign function value, performing weighted summation on each third sign function value according to the sample weight of the corresponding first block chain address sample in the current iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all sample weights in the current iteration as a weighted average error rate.
Under the condition that the current iteration is the first iteration, the sample weight of the first block chain address sample in the current iteration is the initial value of the sample weight of the first block chain address sample; under the condition that the current iteration is not the first iteration, the sample weight of the first block chain address sample in the current iteration is obtained by calculating the product of the following three terms: the method comprises the steps of obtaining a sample weight of a first block chain address sample in the previous iteration, a value obtained by the operation of a preset function by utilizing a classifier weight in the current iteration, and a third symbol function value corresponding to the first block chain address sample, wherein the preset function is an exponential function taking e as a base.
The classifier selected by the classifier iterative training module 302 at each iteration is the same or a different linear decision boundary classifier.
In addition, the detailed implementation of the block chain address classification apparatus in the embodiment of the present invention has been described in detail in the above block chain address classification method, and therefore, the repeated content is not described again.
Fig. 4 shows an exemplary system architecture 400 of a blockchain address classification method or a blockchain address classification apparatus to which embodiments of the present invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as a web browser application, a search-type application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and perform other processing on the received data such as the block chain address classification request, and feed back a processing result (for example, an address category — just an example) to the terminal device.
It should be noted that the method for classifying a blockchain address provided by the embodiment of the present invention is generally performed by the server 405, and accordingly, the device for classifying a blockchain address is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a server according to embodiments of the present application is shown. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer-readable medium described in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium, other than a computer-readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, and the like, or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or by hardware. The described modules may also be provided in a processor, which may, for example, be described as: a processor including a first blockchain address sample set determining module, a classifier iterative training module, and a blockchain address classification model determining module. The names of these modules do not, in some cases, limit the modules themselves; for example, the first blockchain address sample set determining module may also be described as a "module for determining a set of first blockchain address samples".
As another aspect, the present invention also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: determine a set of first blockchain address samples, each first blockchain address sample including a first number of features representing a blockchain address; iteratively train a selected classifier using the set of first blockchain address samples, wherein, for each iteration, a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration, the classifier weight of the classifier being calculated based on the weighted average error rate of the current iteration, and the sample weight of each first blockchain address sample in the current iteration being calculated based on the initial value of that sample weight or on the sample weight of that sample in the previous iteration; and determine a blockchain address classification model based on each iteratively trained classifier and the corresponding classifier weight, the blockchain address classification model being used to determine the category of a blockchain address to be classified.
According to the technical solution of the embodiments of the present invention, a selected classifier is iteratively trained using a set of first blockchain address samples, wherein, for each iteration, a classifier is selected and trained with the goal of minimizing a weighted average error rate; the weighted average error rate of the current iteration is related to the sample weight of each first blockchain address sample in the current iteration; the classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of each first blockchain address sample in the current iteration is calculated based on the initial value of that sample weight or on the sample weight of that sample in the previous iteration. A blockchain address classification model is then determined based on each iteratively trained classifier and the corresponding classifier weight, so as to classify the blockchain address to be classified. The method can improve the accuracy and reliability of blockchain address classification, and overcome defects of the prior art such as error propagation, insufficient algorithm generalization, and high demands on hardware resources and time.
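The iterative training summarized above follows the familiar AdaBoost pattern: train a weak classifier on weighted samples, compute its weighted error, derive a classifier weight from that error, and re-weight the samples. The sketch below illustrates this under stated assumptions — labels are 0/1 (so sign(|y − ŷ|) is 1 exactly when a sample is misclassified), the base learner is an exhaustive one-feature threshold stump, and the particular formula for the classifier weight `alpha` is one common choice; none of these details are fixed by the text above, and all function names are illustrative.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the one-feature threshold stump (a simple
    linear decision boundary) with the lowest weighted error rate."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - thr) > 0).astype(int)
                err = np.dot(w, pred != y) / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]  # (feature index, threshold, direction)

def stump_predict(stump, X):
    j, thr, sign = stump
    return (sign * (X[:, j] - thr) > 0).astype(int)

def train_boosted(X, y, n_rounds=5):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initial sample weights
    model = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)             # minimize the weighted average error rate
        pred = stump_predict(stump, X)
        miss = np.sign(np.abs(y - pred))       # 1 if misclassified, 0 otherwise
        eps = np.clip(np.dot(w, miss) / w.sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)  # classifier weight from the error rate
        w = w * np.exp(alpha * miss)           # raise weights of misclassified samples
        model.append((alpha, stump))
    return model

def classify(model, X):
    """Weighted vote of all trained classifiers, mapped back to 0/1."""
    votes = sum(a * (2 * stump_predict(s, X) - 1) for a, s in model)
    return (votes > 0).astype(int)
```

On linearly separable toy data such as `X = [[0], [1], [2], [3]]` with `y = [0, 0, 1, 1]`, the weighted vote of the trained stumps reproduces the labels.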
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
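Besides the training loop itself, the claims below also recite a multi-round feature-screening step (claims 3–5): each round trains a one-feature "simplified version" classifier per remaining feature, keeps the feature whose classifier has the lowest weighted error rate if that error does not exceed a threshold, then re-weights the training samples before the next round. The following is a hypothetical sketch of that procedure; the 0/1 labels, the mean-threshold single-feature classifier, and the specific weight-update coefficient are all illustrative assumptions, not the patent's exact construction.

```python
import numpy as np

def screen_features(X, y, threshold=0.3):
    """Multi-round feature screening: select features whose
    one-feature classifier achieves a low weighted error rate,
    re-weighting samples between rounds; stop when no remaining
    feature is good enough."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # per-sample screening weights
    remaining = list(range(d))
    selected = []
    while remaining:
        errs, misses = [], []
        for j in remaining:
            thr = X[:, j].mean()               # illustrative one-feature classifier
            p = (X[:, j] > thr).astype(int)
            miss = np.sign(np.abs(y - p))      # 1 if misclassified, 0 otherwise
            errs.append(np.dot(w, miss) / w.sum())
            misses.append(miss)
        best = int(np.argmin(errs))
        if errs[best] > threshold:             # no feature good enough: stop
            break
        selected.append(remaining.pop(best))
        eps = np.clip(errs[best], 1e-10, 1 - 1e-10)
        # illustrative coefficient: raise weights of samples the chosen
        # feature misclassified, so later rounds focus on them
        w = w * ((1 - eps) / eps) ** misses[best]
    return selected
```

For instance, with one perfectly predictive feature and one noisy feature, the procedure selects the predictive one and stops when the noisy feature's error exceeds the threshold.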

Claims (20)

1. A method for blockchain address classification, comprising:
determining a set of first blockchain address samples, each first blockchain address sample including a first number of features representing a blockchain address;
iteratively training a selected classifier using the set of first blockchain address samples, wherein, for each iteration: a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; a classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of a first blockchain address sample in the current iteration is calculated based on the initial value of that sample weight or on the sample weight of that sample in the previous iteration; and
determining a blockchain address classification model based on each iteratively trained classifier and the corresponding classifier weight, wherein the blockchain address classification model is used for determining the category of a blockchain address to be classified.
2. The method of claim 1, wherein determining the set of first blockchain address samples comprises:
obtaining a set of second blockchain address samples, each second blockchain address sample comprising a second number of features representing a blockchain address, the second number being greater than or equal to the first number;
performing a plurality of rounds of feature screening to select the first number of features from the second number of features, and constructing the corresponding first blockchain address sample according to the selected first number of features in each second blockchain address sample to determine the set of first blockchain address samples.
3. The method of claim 2, wherein performing multiple rounds of feature screening to select the first number of features from the second number of features comprises performing the following steps in each round of feature screening:
training a simplified version classifier using each single feature in a feature set to be screened, wherein the simplified version classifiers correspond one-to-one to the features in the feature set to be screened, and the feature set to be screened is the set of features not yet selected from the second number of features;
calculating the error rate of each simplified version classifier to obtain the minimum error rate;
if the minimum error rate is less than or equal to a preset threshold, selecting the feature used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened;
if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening process to obtain the first number of features.
4. The method of claim 3, wherein the error rate of the simplified version classifier is calculated by:
calculating a predicted value and a label value of each training sample corresponding to the simplified version classifier, wherein the training samples comprise the single feature used for training the simplified version classifier;
processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values, and performing a weighted summation over the first sign function values according to the weights of the corresponding training samples in the current round of screening, to obtain the error rate of the simplified version classifier.
5. The method of claim 4, wherein, if the minimum error rate is less than or equal to the preset threshold, selecting the feature used by the simplified version classifier corresponding to the minimum error rate further comprises:
processing the absolute value of the difference between the label value and the predicted value of a training sample, as obtained by the simplified version classifier corresponding to the minimum error rate, through a sign function to obtain a second sign function value, and calculating a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value;
multiplying the weight of the training sample in the current round of screening by the coefficient, so as to update the current weight of the training sample to its weight in the next round of screening.
6. The method of claim 1, wherein, prior to iteratively training the selected classifier using the set of first blockchain address samples, the method comprises determining an optimal number of iterations for the iterative training by:
dividing the set of first blockchain address samples into K subsets, training a classifier using K-1 of the subsets each time, and calculating a cross-validation error rate estimate after each training;
calculating the variance of the cross-validation error rate estimates obtained from the trainings, so as to select the target iteration number corresponding to the minimum cross-validation error rate estimate;
calculating the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates, and selecting, from a preset set of iteration numbers, the smallest iteration number whose corresponding cross-validation error rate estimate does not exceed that sum, as the optimal number of iterations.
7. The method of claim 1, wherein, at each iteration, the weighted average error rate is calculated by:
processing the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain a corresponding third sign function value, performing a weighted summation over the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all sample weights in the current iteration as the weighted average error rate.
8. The method of claim 7, wherein:
in the case that the current iteration is the first iteration, the sample weight of a first blockchain address sample in the current iteration is the initial value of that sample weight;
in the case that the current iteration is not the first iteration, the sample weight of a first blockchain address sample in the current iteration is calculated as the product of the following three terms: the sample weight of the first blockchain address sample in the previous iteration, the value obtained by applying a preset function to the classifier weight in the current iteration, and the third sign function value corresponding to the first blockchain address sample, wherein the preset function is a base-e exponential function.
9. The method of claim 1, wherein the classifiers selected in each iteration are the same or different linear decision boundary classifiers.
10. An apparatus for blockchain address classification, comprising:
a first blockchain address sample set determining module, for determining a set of first blockchain address samples, each first blockchain address sample comprising a first number of features representing a blockchain address;
a classifier iterative training module, to iteratively train a selected classifier using the set of first blockchain address samples, wherein, for each iteration: a classifier is selected and trained with the goal of minimizing a weighted average error rate, the weighted average error rate of the current iteration being related to the sample weight of each first blockchain address sample in the current iteration; a classifier weight of the classifier is calculated based on the weighted average error rate of the current iteration; and the sample weight of a first blockchain address sample in the current iteration is calculated based on the initial value of that sample weight or on the sample weight of that sample in the previous iteration; and
a blockchain address classification model determining module, for determining a blockchain address classification model based on each iteratively trained classifier and the corresponding classifier weight, the blockchain address classification model being used for determining the category of a blockchain address to be classified.
11. The apparatus of claim 10, wherein the first blockchain address sample set determining module is further configured to:
obtain a set of second blockchain address samples, each second blockchain address sample comprising a second number of features representing a blockchain address, the second number being greater than or equal to the first number;
perform multiple rounds of feature screening to select the first number of features from the second number of features, and construct the corresponding first blockchain address sample from the selected first number of features in each second blockchain address sample, to determine the set of first blockchain address samples.
12. The apparatus of claim 11, wherein the first blockchain address sample set determining module comprises a feature filtering sub-module configured to perform the following steps in each round of feature screening:
training a simplified version classifier using each single feature in a feature set to be screened, wherein the simplified version classifiers correspond one-to-one to the features in the feature set to be screened, and the feature set to be screened is the set of features not yet selected from the second number of features;
calculating the error rate of each simplified version classifier to obtain the minimum error rate;
if the minimum error rate is less than or equal to a preset threshold, selecting the feature used by the simplified version classifier corresponding to the minimum error rate, and updating the feature set to be screened;
if the minimum error rate is greater than the preset threshold, ending the multi-round feature screening process to obtain the first number of features.
13. The apparatus of claim 12, wherein the feature filtering sub-module calculates the error rate of the simplified version classifier by:
calculating a predicted value and a label value of each training sample corresponding to the simplified version classifier, wherein the training samples comprise the single feature used for training the simplified version classifier;
processing the absolute value of the difference between the predicted value and the label value of each training sample through a sign function to obtain corresponding first sign function values, and performing a weighted summation over the first sign function values according to the weights of the corresponding training samples in the current round of screening, to obtain the error rate of the simplified version classifier.
14. The apparatus of claim 13, wherein the first blockchain address sample set determining module further comprises a weight update sub-module configured to:
process the absolute value of the difference between the label value and the predicted value of a training sample, as obtained by the simplified version classifier corresponding to the minimum error rate, through a sign function to obtain a second sign function value, and calculate a coefficient for updating the weight of the training sample using the minimum error rate calculated in the current round of feature screening and the second sign function value;
multiply the weight of the training sample in the current round of screening by the coefficient, so as to update the current weight of the training sample to its weight in the next round of screening.
15. The apparatus of claim 10, further comprising an optimal iteration number determining module configured to determine an optimal number of iterations for the iterative training by:
dividing the set of first blockchain address samples into K subsets, training a classifier using K-1 of the subsets each time, and calculating a cross-validation error rate estimate after each training;
calculating the variance of the cross-validation error rate estimates obtained from the trainings, so as to select the target iteration number corresponding to the minimum cross-validation error rate estimate;
calculating the sum of the minimum cross-validation error rate estimate and the standard deviation of the cross-validation error rate estimates, and selecting, from a preset set of iteration numbers, the smallest iteration number whose corresponding cross-validation error rate estimate does not exceed that sum, as the optimal number of iterations.
16. The apparatus of claim 10, wherein the classifier iterative training module calculates the weighted average error rate at each iteration by:
processing the absolute value of the difference between the label value and the predicted value of each first blockchain address sample through a sign function to obtain a corresponding third sign function value, performing a weighted summation over the third sign function values according to the sample weights of the corresponding first blockchain address samples in the current iteration to obtain a weighted sum, and taking the ratio of the weighted sum to the sum of all sample weights in the current iteration as the weighted average error rate.
17. The apparatus of claim 16, wherein:
in the case that the current iteration is the first iteration, the sample weight of a first blockchain address sample in the current iteration is the initial value of that sample weight;
in the case that the current iteration is not the first iteration, the sample weight of a first blockchain address sample in the current iteration is calculated as the product of the following three terms: the sample weight of the first blockchain address sample in the previous iteration, the value obtained by applying a preset function to the classifier weight in the current iteration, and the third sign function value corresponding to the first blockchain address sample, wherein the preset function is a base-e exponential function.
18. The apparatus of claim 10, wherein the classifier selected by the classifier iterative training module at each iteration is the same or a different linear decision boundary classifier.
19. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-9.
20. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
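The optimal-iteration-count selection recited in claims 6 and 15 amounts to K-fold cross-validation combined with a one-standard-error-style rule. The sketch below illustrates one plausible reading of those claims; `train_fn` and `predict_fn` are assumed interfaces standing in for the boosted classifier of claim 1, and the exact use of the variance/standard deviation follows a common convention rather than a verbatim reproduction of the claim language.

```python
import numpy as np

def optimal_rounds(X, y, train_fn, predict_fn, candidates, K=5, seed=0):
    """Choose the boosting round count T from `candidates` by K-fold
    cross-validation: compute a CV error-rate estimate per T, take the
    minimum estimate plus the standard deviation of the estimates, and
    return the smallest T whose estimate does not exceed that sum."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    cv_err = []
    for T in candidates:
        fold_errs = []
        for k in range(K):                 # train on K-1 subsets, test on the K-th
            test_idx = folds[k]
            train_idx = np.hstack([folds[i] for i in range(K) if i != k])
            model = train_fn(X[train_idx], y[train_idx], T)
            fold_errs.append(np.mean(predict_fn(model, X[test_idx]) != y[test_idx]))
        cv_err.append(np.mean(fold_errs))  # CV error-rate estimate for this T
    cv_err = np.array(cv_err)
    bound = cv_err.min() + cv_err.std()    # minimum estimate + standard deviation
    return min(T for T, e in zip(candidates, cv_err) if e <= bound)
```

For example, with a dummy model that thresholds at the training-set mean and ignores `T`, every candidate produces the same cross-validation error, so the rule degenerates to picking the smallest candidate.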
CN202110480968.9A 2021-04-30 2021-04-30 Block chain address classification method and device Active CN113177596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110480968.9A CN113177596B (en) 2021-04-30 2021-04-30 Block chain address classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110480968.9A CN113177596B (en) 2021-04-30 2021-04-30 Block chain address classification method and device

Publications (2)

Publication Number Publication Date
CN113177596A true CN113177596A (en) 2021-07-27
CN113177596B CN113177596B (en) 2024-03-22

Family

ID=76925718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110480968.9A Active CN113177596B (en) 2021-04-30 2021-04-30 Block chain address classification method and device

Country Status (1)

Country Link
CN (1) CN113177596B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368911A (en) * 2020-03-03 2020-07-03 腾讯科技(深圳)有限公司 Image classification method and device and computer readable storage medium
CN111444232A (en) * 2020-01-03 2020-07-24 上海宓猿信息技术有限公司 Method for mining digital currency exchange address and storage medium
CN111754345A (en) * 2020-06-18 2020-10-09 天津理工大学 Bit currency address classification method based on improved random forest
CN111797942A (en) * 2020-07-23 2020-10-20 深圳壹账通智能科技有限公司 User information classification method and device, computer equipment and storage medium
US20200394471A1 (en) * 2019-06-12 2020-12-17 International Business Machines Corporation Efficient database maching learning verification


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAO HUA SUN YIN ET AL.: "Regulating Cryptocurrencies: A Supervised Machine Learning Approach to De-Anonymizing the Bitcoin Blockchain", Journal of Management Information Systems *
WENYOU GAO ET AL.: "Analysis on block chain financial transaction under artificial neural network of deep learning", Journal of Computational and Applied Mathematics *
YANG Xia: "Research on Security Challenges and Solutions of the Blockchain Ecosystem", Cyberspace Security *
MAO Hongliang et al.: "A heuristic-based Bitcoin address clustering method", Journal of Beijing University of Posts and Telecommunications *

Also Published As

Publication number Publication date
CN113177596B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN108520470B (en) Method and apparatus for generating user attribute information
CN108197652B (en) Method and apparatus for generating information
CN105761102B (en) Method and device for predicting commodity purchasing behavior of user
US11281999B2 (en) Predictive accuracy of classifiers using balanced training sets
CN110995459B (en) Abnormal object identification method, device, medium and electronic equipment
CN110555451A (en) information identification method and device
CN112966701A (en) Method and device for classifying objects
US20180349476A1 (en) Evaluating theses using tree structures
CN113743971A (en) Data processing method and device
US11954910B2 (en) Dynamic multi-resolution processing for video classification
CN112231299B (en) Method and device for dynamically adjusting feature library
CN110555747A (en) method and device for determining target user
CN110619253A (en) Identity recognition method and device
CN110084255A (en) The detection method and device of abnormal data
CN111930858A (en) Representation learning method and device of heterogeneous information network and electronic equipment
CN113177596B (en) Block chain address classification method and device
US11556935B1 (en) Financial risk management based on transactions portrait
US20220245460A1 (en) Adaptive self-adversarial negative sampling for graph neural network training
CN113239259A (en) Method and device for determining similar stores
CN114066603A (en) Post-loan risk early warning method and device, electronic equipment and computer readable medium
CN113792952A (en) Method and apparatus for generating a model
CN113590754A (en) Big data analysis method and big data analysis server applied to community interaction
CN113190730A (en) Method and device for classifying block chain addresses
CN110895564A (en) Potential customer data processing method and device
CN112528103A (en) Method and device for recommending objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant