CN107679734A

CN107679734A - Method and system for classification prediction on unlabeled data

Info

Publication number: CN107679734A
Application number: CN201710890305.8A
Authority: CN
Inventors: 田斌; 王纯斌; 赵红军; 覃进学
Original assignee: Chengdu Sefon Software Co Ltd
Current assignee: Chengdu Sefon Software Co Ltd
Priority date: 2017-09-27
Filing date: 2017-09-27
Publication date: 2018-02-09

Abstract

The invention discloses a method and system for classification prediction on unlabeled data. The method makes use of historical data without imposing any hard requirement that the historical data be labeled, incorporates the empirical information database accumulated in the course of business, and can subsequently be optimized continuously with real data, thereby achieving the aim of automatically improving prediction accuracy. The method includes: inputting business process data and obtaining a plurality of business scenario data in the business process; inputting business content data and grouping the business content data according to the business scenario data; constructing feature indexes of the business content data according to the business scenario data and the corresponding business content database; cleaning the feature indexes of the business content data; clustering the business content data that has undergone feature index cleaning and determining the class center of each class; calculating the sampling weight of each class; and sampling the business content data according to the sampling weights and marking the sampled result data with prediction labels.

Description

A method and system for classification prediction on unlabeled data

Technical field

The present invention relates to the technical field of data classification prediction, and more particularly to a method and system for classification prediction on unlabeled data.

Background art

Risk detection is a commonly used quality inspection method. It is widely used in business analysis across industries to detect potential risks in a business so that they can be discovered and controlled in advance. For ordinary enterprises or regulatory departments, risk detection mainly takes three forms. In the first, quality inspectors check the inspected objects one by one to find risks in the inspected products. In the second, the inspected objects are spot-checked by sampling to find risks in the inspected products. In the third, the probability that each inspected object is risky is predicted from the basic information data and historical data of the products, and an actual spot check is then carried out on the high-risk inspected products.

Of the three risk detection modes described above, the first checks the full data set. It suits products with few inspection items and low technical difficulty, and is often used by an enterprise to inspect the products it produces itself, which tend to be of a single kind and technically simple. The usage scenario of the second mode is similar to that of the first; it is not suitable for cases with many product categories and technically complex products. This mode can estimate the proportion of qualified (normal) inspected products, but it lets a certain proportion of risky inspected products slip through. The third mode mainly uses existing information systems: it models the historical data (in essence, it constructs a classifier) and finds the risk probability from the feature data of the inspected product. As long as the historical data is labeled, it can be applied to many kinds of products; it discovers rules entirely from the data, involves few technical details of the products, and has a wide range of application.

In government regulatory departments, the supervised entities span numerous industries and a rich variety of products. For example, when customs detects false trade in imports and exports, all industries and products that take part in trade are involved. The first two detection modes therefore require a great deal of manpower and time and are not well suited. The third mode detects the risk of each inspected product from data and needs historical label data, but for various reasons many systems do not store label data, so this mode suffers from the technical problem of low prediction accuracy. Moreover, because it relies heavily on labels already marked in the historical data, it cannot be applied to environments where predictions must be made on unlabeled data, nor can it be used in business scenarios of anomaly detection.

Summary of the invention

At least one object of the present invention is to overcome the above problems of the prior art by providing a method and system for classification prediction on unlabeled data. The method makes use of historical data without imposing any hard requirement that the historical data be labeled, incorporates the empirical information database accumulated in the course of business, and can subsequently be optimized continuously with real data, thereby achieving the aim of automatically improving prediction accuracy.

To achieve the above object, the technical solution adopted by the present invention includes the following aspects.

A method for classification prediction on unlabeled data, comprising the following steps:

inputting business process data and obtaining a plurality of business scenario data in the business process; inputting business content data and grouping the business content data according to the business scenario data; constructing feature indexes of the business content data according to the business scenario data and the corresponding business content database; cleaning the feature indexes of the business content data; clustering the business content data that has undergone feature index cleaning and determining the class center of each class; calculating the sampling weight of each class; and sampling the business content data according to the sampling weights and marking the sampled result data with prediction labels.

Preferably, constructing the feature indexes of the business content data includes: comparing the input business content data values with the historical business content data values stored in the business content database for the same business scenario, and constructing risk-positive indexes according to how close the input values are to the business content data values in the business content database that produced false trade.

Preferably, cleaning the feature indexes of the business content data includes: handling missing feature indexes by assigning neutral feature index values to the business content data; and normalizing the feature index values of the business content data to [0,1] to eliminate the influence of dimension (differing units and scales), and establishing a feature index database to record the weight values of the feature indexes of each item of business content data.

Preferably, the method performs clustering using one or more of KMeans, DBSCAN, hierarchical clustering, partition clustering, and spectral clustering;

the distance metric used for clustering includes one or more of Euclidean distance, Minkowski distance, and cosine similarity.

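Preferably, calculating the sampling weight of each class includes: using the following formula (the formula image is not reproduced in this text) to calculate the sampling weight w_i of each class, where j is the number of classes and the accompanying symbol (likewise not reproduced) represents the square of the distance from the center of the i-th class to the coordinate origin.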

Preferably, the method further includes: dividing the labeled sampled data into a training set and a test set according to a preset ratio; and expanding the training set according to newly input business content data together with its corresponding prediction results and actual result data.

Preferably, the method further includes: constructing a Bayes classifier using the labeled sampled data and checking the prediction results on the test set; if the accuracy is lower than a preset threshold, modifying the feature indexes of the business content data and the sampling weights of each class according to the data in the training set, so as to obtain prediction results with higher accuracy.

Preferably, the method further includes: constructing a Bayes classifier using the labeled sampled data and checking the prediction results on the test set; if the prediction accuracy is greater than or equal to a preset threshold, inputting new business content data, obtaining prediction results, and issuing risk prompts according to the prediction results.

Preferably, the method further includes: repeating one or more of the above steps, where some or all of the following differ between repetitions: the construction of the business content data, the number of feature indexes, the choice of clustering algorithm, the choice of distance function in clustering, the choice of the number of classes, the formula for calculating class centers, and the proportion of sampled data.

A system for classification prediction on unlabeled data, comprising at least one electronic device and one database server connected through a network;

wherein the electronic device includes at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can carry out the foregoing method;

the database server is used to store the business content database and the feature index database.

In summary, by adopting the above technical solution, the present invention has at least the following beneficial effects:

(1) The present invention is applicable to most environments in which predictions are made on unlabeled data, and is particularly suitable for business scenarios of anomaly (risk) detection. It solves the problem that, in such environments, new data cannot be predicted by direct modeling.

(2) The present invention does not require manual labeling of unlabeled historical data, saving the great effort and time of human annotation; modeling and analysis can be carried out using only the historical data and the prior information of business experts.

(3) The present invention uses the classical Bayesian idea of estimating the posterior from the prior and the likelihood, so data identified after the system goes live can be added directly to the model to improve its prediction accuracy.

Brief description of the drawings

Fig. 1 is a flow chart of a method for classification prediction on unlabeled data according to an embodiment of the present invention.

Fig. 2 is a flow chart of a method for classification prediction on unlabeled data according to another embodiment of the present invention.

Fig. 3 is a structural schematic diagram of a system for classification prediction on unlabeled data according to an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments, so that its objects, technical solutions, and advantages become clearer. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

Fig. 1 shows a method for classification prediction on unlabeled data according to an embodiment of the present invention. The method is described in detail below, taking its application to the detection of false trade in customs business as an example.

Step 101: Input business process data and obtain a plurality of business scenario data in the business process

For example, the whole process of customs clearance is divided into multiple business scenarios, such as: receipt of the customs declaration by on-site customs, centralized document review by the head customs office, review of the customs declaration by on-site customs, review of the inspection result sheet, determination of the duties and fees to be collected, and issuance of customs documents.

Step 102: Input business content data and group the business content data according to the business scenario data

Specifically, information corresponding to the packing list, invoice, contract, verification and write-off form, and customs clearance form or seal, such as the goods name, commodity code, freight, insurance premium, number of packages, weight, dimensions, and value of goods, can be classified according to business scenario, so that each item of customs clearance business content data corresponds to one or more business scenarios.
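The grouping in step 102 can be sketched as follows. This is a minimal illustration only: the field names, the scenario names, and the `FIELD_TO_SCENARIOS` mapping are hypothetical and do not come from the patent, which does not specify how the correspondence is stored.

```python
from collections import defaultdict

# Hypothetical mapping from a business content field to the scenario(s) it belongs to.
FIELD_TO_SCENARIOS = {
    "commodity_code": ["receive_declaration", "document_review"],
    "freight": ["determine_duties"],
    "weight": ["on_site_inspection", "determine_duties"],
}

def group_by_scenario(record):
    """Return {scenario: {field: value}} for one business content record."""
    grouped = defaultdict(dict)
    for field, value in record.items():
        # A field may belong to several scenarios; unknown fields are skipped.
        for scenario in FIELD_TO_SCENARIOS.get(field, []):
            grouped[scenario][field] = value
    return dict(grouped)
```

A field such as `weight` appears under two scenarios here, which mirrors the statement that one item of business content data may correspond to more than one business scenario.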

Step 103: Construct the feature indexes of the business content data according to the business scenario data and the corresponding business content database

The currently input business content data values can be compared with the historical business content data values stored in the business content database for the same business scenario. The closer a currently input value is to the business content data values in the business content database that produced false trade, the more likely it is to produce false trade; risk-positive indexes are constructed accordingly, that is, the larger the index value, the greater the likelihood of false trade. In addition, when an index value exceeds a preset threshold, the business content data is tagged with a feature index label such as false trade.
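One way such a risk-positive index could be computed is sketched below. The patent does not give a formula for "closeness", so the ratio of nearest-neighbor distances used here, the function name, and the default threshold are all assumptions made for illustration.

```python
def risk_positive_index(value, false_values, normal_values):
    """Index in [0, 1]; the closer `value` is to historical false-trade values, the larger it is.

    Assumed measure: nearest distance to the normal values divided by the sum of the
    nearest distances to the false-trade and normal values (not specified in the patent).
    """
    d_false = min(abs(value - v) for v in false_values)
    d_normal = min(abs(value - v) for v in normal_values)
    if d_false + d_normal == 0:
        return 0.5  # equally close to both groups
    return d_normal / (d_false + d_normal)

def tag_if_risky(index, threshold=0.8):
    """Tag the record when the index exceeds a preset threshold (threshold value is hypothetical)."""
    return "false_trade" if index > threshold else None
```

For example, a declared value of 10 compared against historical false-trade values `[10, 12]` and normal values `[2, 3]` coincides with a false-trade value and so receives the maximum index.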

In a preferred embodiment, the method of the present invention can equally be applied to various other special environments in which a large amount of business data exists but, for various reasons, the result labels of feature index analysis of the business content data are not stored in the business system, and in which intelligent assisted decision-making is subsequently desired. The feature index labels of the business content data preferably pose a binary classification problem with a clear business meaning, such as normal versus abnormal, qualified versus unqualified, or normal versus false. The direction in which each index influences the prediction result (positive or negative) is determined from the business content database. All feature data may be data with dimension removed or data before dimension removal, but the feature indexes need to be discretized.

Step 104: Clean the feature indexes of the business content data

This step comprises two parts: first, handling missing feature indexes; second, eliminating the influence of dimension among the feature indexes. Handling missing feature indexes includes assigning neutral feature index values to the business content data. The influence of dimension can be eliminated by a normalization operation, for example normalizing the feature index values of the business content data to [0,1], and the indexes are weighted according to their importance. Specifically, a feature index database can be established to record the weight values of the feature indexes of each item of business content data.
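The cleaning step can be sketched as min-max normalization with a neutral fill value and per-index weights. The choice of 0.5 as the "neutral" value (the midpoint of the normalized range) and the function name are assumptions; the patent does not define the neutral value.

```python
def clean_features(rows, weights, neutral=0.5):
    """Normalize each feature column to [0, 1], fill missing values, then apply weights.

    rows: list of equal-length lists, with None marking a missing feature index.
    weights: one importance weight per feature column.
    neutral: assumed neutral value used for missing entries and constant columns.
    """
    cleaned_cols = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        lo = min(present, default=0.0)
        span = max(present, default=0.0) - lo
        cleaned_cols.append([
            neutral if v is None or span == 0 else (v - lo) / span
            for v in col
        ])
    # Transpose back to rows and weight each feature index.
    return [[w * v for w, v in zip(weights, row)] for row in zip(*cleaned_cols)]
```

With weights no larger than 1, the weighted values stay inside [0,1], so the origin remains the lowest-risk corner of the feature space.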

Step 105: Cluster the business content data that has undergone feature index cleaning and determine the class center of each class

For example, a clustering algorithm such as KMeans, DBSCAN, hierarchical clustering, partition clustering, or spectral clustering can be used to determine the class centers; the class center can also be determined by taking the mean or the median of the coordinates of the business content data in the class. The distance metric used for clustering can be any of various distance measures, such as Euclidean distance, Minkowski distance, or cosine similarity.

In the same embodiment, the class centers can be determined separately with several clustering algorithms to improve accuracy. The quality of the clustering result can be assessed in two ways: one is a distance-based evaluation index, such as the silhouette coefficient; the other is to sample data as in step 210 below and rely on manual experience to judge the risk of the sampled data.
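As an illustration of one of the listed options, the sketch below is a plain k-means with Euclidean distance whose class centers are the coordinate means of their members. The deterministic initialization (the first `k` points) and the fixed iteration count are simplifications assumed here, not details from the patent.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance; returns (centers, labels)."""
    centers = [list(p) for p in points[:k]]  # assumed deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

Replacing the mean in the update step with a per-coordinate median would give the median-based class center that the text also mentions.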

Step 106: Calculate the sampling weight of each class

Specifically, the features constructed in step 103 are risk-positive indexes, that is, the larger the index, the greater the risk. The probability that false-trade risk occurs may therefore differ from class to class. Because all index data is normalized with a minimum value of 0, the greater the distance from a class center to the origin, the greater the risk of false trade should be, and the larger the sampling weight should be. The following formula (the formula image is not reproduced in this text) can be used here to calculate the sampling weight of each class, where j is the number of classes and the accompanying symbol (likewise not reproduced) represents the square of the distance from the center of the i-th class to the coordinate origin.
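Because the formula itself does not survive in this text, the sketch below uses one plausible form that is consistent with the stated definitions: each class's squared distance to the origin divided by the sum of the squared distances over all j classes. This normalization is an assumption, not the patent's confirmed formula, and the function name is hypothetical.

```python
def sampling_weights(centers):
    """Assumed weights: squared distance of each class center to the origin,
    normalized by the sum over all classes (the patent's own formula is not reproduced)."""
    squared = [sum(x * x for x in center) for center in centers]
    total = sum(squared)
    if total == 0:
        # All centers sit at the origin: fall back to equal weights.
        return [1.0 / len(centers)] * len(centers)
    return [s / total for s in squared]
```

Under this assumption the weights sum to 1 and grow with the distance of the class center from the origin, which matches the reasoning that classes farther from the origin carry more risk.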

Step 107: Sample the business content data according to the sampling weights and mark the sampled result data with prediction labels

Specifically, the proportion of the total number of samples that needs to be sampled can be determined from the historical false trade ratio data in the business content database. Each item of input business content data is then sampled according to the sampling weights from step 106 (for example, data whose sampling weight value is greater than 0.5 is sampled). The sampled data is labeled as false trade and the data not sampled is labeled as normal trade, which yields a preliminary prediction of false trade that is stored in the business content database. In this step, the number of items to be sampled from a class may exceed the total number of samples in that class; in that case all data of that class is simply sampled.
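A minimal sketch of weighted sampling and labeling follows. It assumes that the overall sample size is split across classes in proportion to their weights and that records inside a class are drawn at random; the patent does not fix either detail, and the label strings are hypothetical.

```python
import random

def sample_and_label(labels, weights, fraction, seed=0):
    """Label sampled records "false_trade" and the rest "normal".

    labels: cluster id of each record; weights: sampling weight per cluster;
    fraction: share of all records to sample (e.g. the historical false-trade ratio).
    """
    rng = random.Random(seed)
    total = round(fraction * len(labels))
    predicted = ["normal"] * len(labels)
    for cluster, weight in enumerate(weights):
        members = [i for i, label in enumerate(labels) if label == cluster]
        # If the quota exceeds the class size, take the whole class.
        quota = min(round(total * weight), len(members))
        for i in rng.sample(members, quota):
            predicted[i] = "false_trade"
    return predicted
```

With ten records, a fraction of 0.4, and weights `[0.25, 0.75]` over classes of size 8 and 2, the second class has a quota of 3 but only 2 members, so both of its records are sampled.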

Fig. 2 shows a method for classification prediction on unlabeled data according to a further embodiment of the present invention, which comprises the following additional steps on the basis of the above embodiment. Steps 201 to 207 correspond to steps 101 to 107 respectively, with these differences: step 203 also includes modifying the feature indexes of the business content data after they have been constructed according to the business scenario data and the corresponding business content database; and step 206 also includes modifying the sampling weights of each class after they have been calculated.

Step 208: Build/expand the training set and the test set

Specifically, the labeled sampled data can be divided into a training set and a test set according to a preset ratio (for example, 8:2). Further, the training set can also be expanded according to newly input business content data together with its corresponding prediction results and actual result data.
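The split and the later expansion can be sketched as below. The shuffle before splitting and the function names are assumptions; the 8:2 ratio is the example given in the text.

```python
import random

def split_train_test(records, train_ratio=0.8, seed=0):
    """Shuffle labeled records and split them by a preset ratio (8:2 by default)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def expand_training_set(train, confirmed_records):
    """Append newly confirmed records (features plus actual result) to the training set."""
    return list(train) + list(confirmed_records)
```

Keeping the expansion as a separate function reflects that it happens later, once actual inspection results for new data are available.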

Step 209: Construct a Bayes classifier and evaluate its prediction performance on the test set

A Bayes classifier can be constructed using the labeled sampled data, and the prediction results checked on the test set. If the prediction accuracy is greater than or equal to a preset threshold (for example, 75%), step 210 is performed. If the accuracy is lower than the preset threshold, step 203 and step 206 are performed, and the feature indexes of the business content data and the sampling weights of each class are modified according to the data in the training set, so as to obtain prediction results with higher accuracy. The feature data in the training set may be data that has undergone feature cleaning or data without feature processing (although missing values must be filled); the Bayes classifier is built after the feature data has been discretized, and the prediction performance is checked on the test set.
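The classifier step can be illustrated with a small naive Bayes over discretized features. The patent says only "Bayes classifier" and "discretization", so the naive independence assumption, the equal-width binning of [0,1] values, and the Laplace smoothing are all choices made for this sketch.

```python
import math
from collections import Counter, defaultdict

def discretize(row, bins=4):
    """Map each [0, 1] feature value to an equal-width bin index (assumed scheme)."""
    return tuple(min(int(v * bins), bins - 1) for v in row)

def train_naive_bayes(rows, labels, bins=4):
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature position) -> bin counts
    for row, label in zip(rows, labels):
        for pos, b in enumerate(discretize(row, bins)):
            feature_counts[(label, pos)][b] += 1
    return class_counts, feature_counts, bins

def predict(model, row):
    class_counts, feature_counts, bins = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for pos, b in enumerate(discretize(row, bins)):
            # Laplace smoothing avoids zero likelihood for unseen bins.
            score += math.log((feature_counts[(label, pos)][b] + 1) / (count + bins))
        if score > best_score:
            best, best_score = label, score
    return best

def accuracy(model, rows, labels):
    hits = sum(predict(model, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)
```

The value returned by `accuracy` on the test set would then be compared with the preset threshold (75% in the example) to decide between proceeding to step 210 and returning to steps 203 and 206.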

Step 210: Input new business content data, obtain prediction results, and issue risk prompts according to the prediction results

Specifically, when the amount of data marked as false trade in the prediction results obtained for the new business content data exceeds a threshold under a specific business scenario, a false-trade risk warning is issued through the human-computer interaction interface to remind customs staff to carry out manual inspection. The inspection results are stored in the business content database, and the training set is expanded through step 208 to further improve prediction accuracy. One or more of the above steps may be performed repeatedly, and in each step different choices can be made on each repetition for the construction of business data or the number of extracted features, the choice of clustering algorithm, the choice of distance function in clustering, the choice of the number of classes, the formula for calculating class centers, the proportion of sampled data, and so on, in order to find a hyperparameter combination with higher prediction accuracy. The method can also use the detected abnormal data to correct the prior information, thereby improving prediction accuracy.
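The warning condition can be sketched as a per-scenario count compared against a threshold. The reading that the threshold applies to the number of records predicted as false trade within each scenario is an interpretation of the text, and the label string and function name are hypothetical.

```python
from collections import Counter

def scenarios_to_alert(predictions, threshold):
    """Return the scenarios whose count of predicted false-trade records exceeds the threshold.

    predictions: iterable of (scenario, predicted_label) pairs.
    """
    counts = Counter(
        scenario for scenario, label in predictions if label == "false_trade"
    )
    return sorted(s for s, n in counts.items() if n > threshold)
```

Each returned scenario would trigger a warning on the human-computer interaction interface, and the results of the subsequent manual inspection would feed the training-set expansion of step 208.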

Fig. 3 shows a system for classification prediction on unlabeled data according to an embodiment of the present invention, which includes at least one electronic device 310 and one database server 330 connected through a network 320.

The electronic device 310 includes at least one processor 311 and a memory 312 communicatively connected to the at least one processor. The memory 312 stores instructions executable by the at least one processor 311, and the instructions are executed by the at least one processor 311 so that the at least one processor 311 can carry out the method disclosed in any of the foregoing embodiments. The database server 330 is used to store the business content database 331 and the feature index database 332.

The above embodiments first extract or construct, according to the business, feature indexes that are meaningful for the classification categories and clean the feature data. Clustering is then performed on the cleaned feature indexes to obtain the cluster center of each class, and the sampling weight for marking data is determined by the magnitude of the coordinate values of each class center. Data of each class is then sampled by weight and treated as abnormal data, which serves as prior knowledge. Finally, a Bayes classifier is built from the now-labeled data and used to classify new data and detect business risk. In an actual business system, true labels can be determined by inspecting the high-risk data and stored in the database to correct the prior probability, so that the model adjusts its accuracy automatically. The present invention is applicable to most environments in which predictions are made on unlabeled data, solves the problem that new data cannot be predicted by direct modeling in such environments, and makes good use of the prior information of business experts.

The above is only a detailed description of specific embodiments of the present invention and does not limit the present invention. Any replacements, modifications, and improvements made by those skilled in the relevant art without departing from the principle and scope of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for classification prediction on unlabeled data, characterized in that the method comprises the following steps:

inputting business process data and obtaining a plurality of business scenario data in the business process; inputting business content data and grouping the business content data according to the business scenario data; constructing feature indexes of the business content data according to the business scenario data and the corresponding business content database; cleaning the feature indexes of the business content data; clustering the business content data that has undergone feature index cleaning and determining the class center of each class; calculating the sampling weight of each class; and sampling the business content data according to the sampling weights and marking the sampled result data with prediction labels.

2. The method according to claim 1, characterized in that constructing the feature indexes of the business content data includes: comparing the input business content data values with the historical business content data values stored in the business content database for the same business scenario, and constructing risk-positive indexes according to how close the input values are to the business content data values in the business content database that produced false trade.

3. The method according to claim 1, characterized in that cleaning the feature indexes of the business content data includes: handling missing feature indexes by assigning neutral feature index values to the business content data; and normalizing the feature index values of the business content data to [0,1] to eliminate the influence of dimension, and establishing a feature index database to record the weight values of the feature indexes of each item of business content data.

4. The method according to claim 1, characterized in that the method performs clustering using one or more of KMeans, DBSCAN, hierarchical clustering, partition clustering, and spectral clustering;

the distance metric used for clustering includes one or more of Euclidean distance, Minkowski distance, and cosine similarity.

5. The method according to claim 1, characterized in that calculating the sampling weight of each class includes: using the following formula (the formula image is not reproduced in this text) to calculate the sampling weight w_i of each class, where j is the number of classes and the accompanying symbol (likewise not reproduced) represents the square of the distance from the center of the i-th class to the coordinate origin.

6. The method according to claim 1, characterized in that the method further includes: dividing the labeled sampled data into a training set and a test set according to a preset ratio; and expanding the training set according to newly input business content data together with its corresponding prediction results and actual result data.

7. The method according to claim 6, characterized in that the method further includes: constructing a Bayes classifier using the labeled sampled data and checking the prediction results on the test set; if the accuracy is lower than a preset threshold, modifying the feature indexes of the business content data and the sampling weights of each class according to the data in the training set, so as to obtain prediction results with higher accuracy.

8. The method according to claim 6, characterized in that the method further includes: constructing a Bayes classifier using the labeled sampled data and checking the prediction results on the test set; if the prediction accuracy is greater than or equal to a preset threshold, inputting new business content data, obtaining prediction results, and issuing risk prompts according to the prediction results.

9. The method according to claim 1, characterized in that the method further includes: repeating one or more of the steps, where some or all of the following differ between repetitions: the construction of the business content data, the number of feature indexes, the choice of clustering algorithm, the choice of distance function in clustering, the choice of the number of classes, the formula for calculating class centers, and the proportion of sampled data.

10. A system for classification prediction on unlabeled data, characterized in that the system comprises at least one electronic device and one database server connected through a network;

wherein the electronic device includes at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can carry out the method according to any one of claims 1 to 9;