CN108268873A

CN108268873A - An SVM-based group data classification method and device

Info

Publication number: CN108268873A
Application number: CN201611254023.0A
Authority: CN
Inventors: 黄超; 李青海; 潘宇翔; 王平; 张晓亭; 杨婉
Original assignee: Guangdong Fine Point Data Polytron Technologies Inc
Current assignee: Guangdong Fine Point Data Polytron Technologies Inc
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2018-07-10

Abstract

The present invention discloses an SVM-based group data classification method and device. The method comprises: step S1, extracting historical group data and determining the group and the feature data of the group; step S2, building a quadratic feature matrix of the group from the feature data; step S3, training a corresponding SVM classifier on the quadratic feature matrix; and step S4, classifying the group data to be classified with the SVM classifier. The device comprises the corresponding historical data processing unit, feature matrix construction unit, classifier training unit and classifier classification unit. Group data can thus be classified by computer, conveniently and quickly, saving considerable manpower and material resources. In addition, compared with other classifiers, the SVM offers considerably better classifier performance and high classification accuracy, which improves the accuracy of group composition analysis.

Description

An SVM-based group data classification method and device

Technical field

The present invention relates to the field of data classification, and in particular to an SVM-based group data classification method and device.

Background technology

Market research is a long-established discipline, and many research methods have emerged over its years of development. Since the start of the 21st century, with the development of computer technology, the computing platform for market research has gradually moved onto computers. Analysing market data by computer makes it possible to quickly generate reports and all kinds of visual data models, greatly reducing manual calculation and survey time and improving accuracy. In an era dominated by information, more and more importance is attached to information. Likewise, when studying a group, it is essential to understand the composition of that group.

Analysing the composition of a group essentially means classifying sample group data according to historical data. Current classification, however, is mainly carried out manually, which involves a heavy workload and is time-consuming and laborious. A method and device that can classify group data by computer are therefore needed.

In view of the above drawbacks, the inventors arrived at the present invention after extensive research and practice.

Summary of the invention

To solve the above technical defects, the technical solution adopted by the present invention first provides an SVM-based group data classification method, comprising:

Step S1: extracting historical group data and determining the group and the feature data of the group;

Step S2: building a quadratic feature matrix of the group from the feature data;

Step S3: training a corresponding SVM classifier on the quadratic feature matrix;

Step S4: classifying the group data to be classified with the SVM classifier.

Preferably, step S2 comprises:

Step S21: analysing the feature data of the group and extracting from it the basic features corresponding to each category of the group;

Step S22: converting the data in the historical group data into feature vectors;

Step S24: building the quadratic feature matrix of the group from the feature vectors.

Preferably, step S2 further comprises: step S23, assigning different weights to the feature data according to their importance, and correcting the feature vectors accordingly.

Preferably, step S3 comprises:

Step S31: adding the category information of each category of the group to the quadratic feature matrix;

Step S32: learning from the quadratic feature matrix with the category information, establishing a correspondence between the feature vectors and the category information of the group, and training the SVM classifier to obtain its discriminant function.

Preferably, in step S2, the row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the quadratic feature matrix is the degree of association between the corresponding individual and the corresponding feature data.

Preferably, in step S4, the number of SVM classifiers equals the number of categories of the group.

Preferably, in step S4, the number of SVM classifiers equals the number of categories of the group, with one classifier per category. During classification, the group data to be classified are passed through all the SVM classifiers. If exactly one SVM classifier outputs a positive number, the data belong to the category corresponding to that classifier; if none or more than one SVM classifier outputs a positive number, the data belong to the category corresponding to the SVM classifier whose discriminant function value is largest.

Secondly, an SVM-based group data classification device corresponding to the SVM-based group data classification method described above is provided, comprising:

a historical data processing unit, which extracts historical group data and determines the group and the feature data of the group;

a feature matrix construction unit, which builds the quadratic feature matrix of the group from the feature data;

a classifier training unit, which trains the corresponding SVM classifier on the quadratic feature matrix;

a classifier classification unit, which classifies the group data to be classified with the SVM classifier.

Preferably, the feature matrix construction unit comprises:

a basic feature extraction subunit, which analyses the feature data of the group and extracts from it the basic features corresponding to each category of the group;

a feature vector conversion subunit, which converts the data in the historical group data into feature vectors;

a vector-to-matrix construction subunit, which builds the quadratic feature matrix of the group from the feature vectors.

Preferably, the feature matrix construction unit further comprises a weight assignment subunit, which assigns different weights to the feature data according to their importance and corrects the feature vectors accordingly.

Compared with the prior art, the beneficial effects of the present invention are as follows. Group data can be classified by computer, conveniently and quickly, saving considerable manpower and material resources. In addition, compared with other classifiers such as neural networks, decision trees and naive Bayes, the SVM offers considerably better classifier performance and high classification accuracy, which improves the accuracy of group composition analysis. Analysing and extracting the group features greatly increases the degree of association between the features and their categories, making the classification results more reliable.

Description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.

Fig. 1 is a flowchart of the SVM-based group data classification method of the present invention;

Fig. 2 is a first flowchart of step S2 of the SVM-based group data classification method of the present invention;

Fig. 3 is a second flowchart of step S2 of the SVM-based group data classification method of the present invention;

Fig. 4 is a flowchart of step S3 of the SVM-based group data classification method of the present invention;

Fig. 5 is a structural diagram of the SVM-based group data classification device of the present invention;

Fig. 6 is a first structural diagram of the feature matrix construction unit of the SVM-based group data classification device of the present invention;

Fig. 7 is a second structural diagram of the feature matrix construction unit of the SVM-based group data classification device of the present invention;

Fig. 8 is a structural diagram of the classifier training unit of the SVM-based group data classification device of the present invention.

Specific embodiment

The above and other technical features and advantages of the present invention are described in more detail below with reference to the drawings.

Embodiment 1

As shown in Fig. 1, which is a flowchart of the SVM-based group data classification method of the present invention, the method comprises steps S1 to S4.

The historical group data include at least the categories of the group and the feature data corresponding to those categories.

To analyse the historical group data, the data are first examined to determine the categories of the group and the feature data corresponding to each category.

Take shopping in a mall as an example. The group is the mall's shoppers, and labels such as student, white-collar worker, teacher, elderly person, young person and child can be used as the categories of the group. Overlapping categories can be adjusted according to the actual situation, but each category should be clearly distinguishable from the others; otherwise the accuracy of the subsequent group classification is greatly reduced. The feature data of the group depend on its categories. For example, the types of goods bought by the student category are the feature data of that category and may include books, stationery, erasers, fruit and milk; the feature data of the elderly category may include walnut milk, isatis root (banlangen) and fruit. The historical group data can come from daily shopping records compiled manually or by computer, depending on the actual situation.

Once the categories of the group and the feature data corresponding to each category have been determined from the historical group data, the data can be sorted out and obviously erroneous records rejected. This improves the accuracy and speed of the subsequent analysis, and hence the speed and efficiency of the whole SVM-based group data classification method.

From the group and the corresponding feature data determined in the above step, the quadratic feature matrix of the group is built. The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the matrix is the degree of association between the corresponding individual and the corresponding feature data.

In this way the group and its feature data are converted into matrix form, that is, digitized, so that a computer can readily identify and classify them, which improves the efficiency and accuracy of the whole SVM-based group data classification method.

Step S3: training a corresponding SVM classifier on the quadratic feature matrix.

The SVM classifier is trained on the quadratic feature matrix built from the historical group data, yielding a trained SVM classifier that can subsequently classify new group data.

The SVM approach to two-class classification is based mainly on structural risk minimization: it seeks an optimal separating hyperplane that separates the two classes of data with the largest margin. Let the linearly separable sample set be $S = \{(x_i, y_i) \mid i = 1, \dots, n\}$, where $x_i \in R^d$ ($R^d$ being the $d$-dimensional feature space) and $y_i \in \{+1, -1\}$ is the class label of $x_i$. The general form of a linear discriminant function in $d$-dimensional space is $g(x) = w \cdot x + b$, and the corresponding separating-hyperplane equation is $w \cdot x + b = 0$. The discriminant function $g(x)$ is normalized so that all samples of both classes satisfy $|g(x)| \ge 1$; the margin is then equal to $2 / \|w\|$. Maximizing the margin is therefore equivalent to minimizing $\|w\|$. Requiring the hyperplane to classify all samples correctly means satisfying

$$y_i[(w \cdot x_i) + b] - 1 \ge 0, \quad i = 1, 2, \dots, n$$

A separating hyperplane that meets both conditions is the optimal separating hyperplane. H1 and H2 are the hyperplanes that pass through the points of the two classes nearest to the separating hyperplane and are parallel to it; the training samples lying on H1 and H2 are exactly those for which equality holds in the formula above, and they are called support vectors. The optimal-hyperplane problem can be expressed as finding, under the constraint above, the minimum of the objective function

$$\Phi(w) = \frac{1}{2}\|w\|^2$$

For samples that are not linearly separable, slack variables $\xi_i$ and a penalty factor $C$ are introduced, and the objective function is rewritten as

$$\Phi(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

To this end, Lagrange multipliers $(\alpha_1, \alpha_2, \dots, \alpha_n)$ are introduced, so that solving for the optimal separating hyperplane becomes a constrained quadratic extremum problem. The corresponding solution is $w = \sum_i \alpha_i y_i x_i$, where $\alpha_i$ is non-zero only for the support vectors $x_i$, and the optimal classification function can be rewritten as

$$f(x) = \operatorname{sign}\{(w \cdot x) + b\} = \operatorname{sign}\Big\{\sum_i \alpha_i y_i (x_i \cdot x) + b\Big\}$$
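The relations above can be checked on a very small example. The following is a minimal sketch, assuming a hypothetical two-point training set in R^2; the multiplier values `alpha` and the offset `b` are chosen by hand for this toy problem and are not the output of any solver described in the patent.

```python
import math

# Hypothetical toy data: one training point per class.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [+1, -1]
# Multipliers and offset picked by hand for this toy problem (assumption).
alpha = [0.5, 0.5]
b = 0.0

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X)) for j in range(2)]

def g(x):
    """Discriminant value g(x) = sum_i alpha_i y_i (x_i . x) + b."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def f(x):
    """Decision function f(x) = sign(g(x)); a value of exactly 0 is mapped to -1 here."""
    return 1 if g(x) > 0 else -1

# Margin between the two supporting hyperplanes H1 and H2: 2 / ||w||.
margin = 2.0 / math.sqrt(dot(w, w))
```

Both training points satisfy the constraint with equality, so in this toy case both are support vectors, and the margin of 2 equals the distance between them.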

In this way, group data can be classified by computer, conveniently and quickly, saving considerable manpower and material resources. In addition, compared with other classifiers such as neural networks, decision trees and naive Bayes, the SVM offers considerably better classifier performance and high classification accuracy, which improves the accuracy of group composition analysis.
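As a rough illustration of the training in step S3, the sketch below fits a linear two-class SVM by minimizing the standard soft-margin objective ½‖w‖² + C Σ ξ_i, with ξ_i = max(0, 1 − y_i(w·x_i + b)). Note that it uses plain full-batch sub-gradient descent, which is a substitute chosen here for brevity and not the Lagrange-multiplier solution described in the text; the function names, the hyperparameters `C`, `lr`, `epochs` and the toy data are all assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def train_linear_svm(X, y, C=1.0, lr=0.1, epochs=200):
    """Minimise 0.5*||w||^2 + C*sum(hinge losses) by full-batch sub-gradient descent."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0          # gradient of the 0.5*||w||^2 term
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:      # margin violated: slack is active
                for j in range(d):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 or -1 according to the sign of g(x) = w.x + b."""
    return 1 if dot(w, x) + b > 0 else -1

# Hypothetical, linearly separable toy data.
X = [(2.0, 0.0), (3.0, 0.0), (-2.0, 0.0), (-3.0, 0.0)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

With a fixed step size the weights oscillate around the optimum rather than converging exactly, so this is suitable only as an illustration; a production system would more likely rely on an established SVM solver.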

Embodiment 2

In the SVM-based group data classification method described above, this embodiment differs in that, as shown in Fig. 2, step S2 comprises steps S21, S22 and S24.

A group has several categories, and each category has several items of feature data. These items are not equally associated with the category, so a selection has to be made. For example, the feature data of the student category include books, stationery and erasers, but a student may for some particular reason also buy seafood, soy milk or milk powder, so these items may appear in the feature data as well. Seafood, soy milk and milk powder are not products that all or most students buy; they are quite possibly one-off purchases by a few students. Treating them as feature data of the student category would greatly reduce the accuracy of the classification results, which is why extraction is needed.

Extracting features that express the category information is the primary prerequisite for machine learning. The better the feature data express the characteristics of the group, the higher their discriminative power and the better the machine learning result. Selecting effective classification features is therefore the key to classifying the group. Extracting from the feature data the basic features that describe each category greatly improves the accuracy of group classification.

For example, books, stationery, erasers, fruit and milk are extracted from the feature data of the student category as the basic features of that category. This greatly improves the accuracy of subsequent judgments.

Each individual record in the historical group data has a category to which it belongs, and an individual necessarily belongs to one category. If the individual has a given basic feature of its category, the corresponding position is set to 1, otherwise to 0; each individual record is thus converted into a basic feature vector. For example, if a student buys books, stationery, erasers, fruit and soy milk (which is not a basic feature of students) in a single purchase, the feature vector may be (0, 1, 0, 1, 1, 1, 0, 0, 0), where the elements correspond respectively to the feature data seafood, books, walnut milk, stationery, erasers, fruit, flapjacks, salmon and soy milk (this is only a demonstration and does not imply any inclusion relation between the feature data). Because soy milk is not a basic feature of students, its position is still set to 0.
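The conversion of one purchase record into a 0/1 feature vector can be sketched as follows. This is a minimal illustration that reuses the feature order and the student example from the text; the item names are written as plain English labels, and the names `FEATURES`, `BASIC_FEATURES` and `to_feature_vector` are hypothetical.

```python
# Feature order as listed in the example above.
FEATURES = ["seafood", "books", "walnut milk", "stationery", "erasers",
            "fruit", "flapjacks", "salmon", "soy milk"]

# Basic features per category; only the student category is spelled out in the text.
BASIC_FEATURES = {
    "student": {"books", "stationery", "erasers", "fruit", "milk"},
}

def to_feature_vector(category, purchases):
    """Set a position to 1 only if the item was bought AND is a basic feature of the category."""
    basic = BASIC_FEATURES[category]
    return tuple(1 if (item in purchases and item in basic) else 0
                 for item in FEATURES)

vector = to_feature_vector(
    "student", {"books", "stationery", "erasers", "fruit", "soy milk"})
```

Soy milk was bought but is not a basic feature of the student category, so the last position stays 0.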

Step S24: building the quadratic feature matrix of the group from the feature vectors.

The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the matrix is the degree of association between the corresponding individual and the corresponding feature data.
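Stacking the per-individual feature vectors into a matrix can be sketched as below, with one row per individual and one column per feature. The helper name `build_feature_matrix` and the second sample vector are hypothetical.

```python
def build_feature_matrix(vectors):
    """Stack per-individual feature vectors as rows; all rows must have equal length."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    width = len(vectors[0])
    if any(len(v) != width for v in vectors):
        raise ValueError("all feature vectors must have the same length")
    return [list(v) for v in vectors]

individual_vectors = [
    (0, 1, 0, 1, 1, 1, 0, 0, 0),   # the student from the example above
    (0, 0, 1, 0, 0, 1, 0, 0, 0),   # hypothetical second individual
]
matrix = build_feature_matrix(individual_vectors)

# Element [i][j] is the association between individual i and feature j.
fruit_column = [row[5] for row in matrix]
```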

Analysing and extracting the group features in this way greatly increases the degree of association between the features and their categories, making the classification results more reliable.

Embodiment 3

In the SVM-based group data classification method described above, this embodiment differs in that, as shown in Fig. 3, step S2 further comprises:

Step S23: assigning different weights to the feature data according to their importance, and correcting the feature vectors accordingly.

Classification features usually need to be weighted. Among the basic features of a category, the features differ in their degree of association with the category. For students, for example, books and stationery are more strongly associated with the category than erasers and fruit. If these differences are ignored, the subsequent classification becomes inaccurate. For each category, the feature data are therefore given different weights, which are used to correct the feature vector. For example, if for the student category the weights of books, stationery, erasers and fruit are set to 5, 4, 3 and 2 respectively, the feature vector is modified to (0, 5, 0, 4, 3, 2, 0, 0, 0).

The weights are set as follows. For each category of the group, the p features with the highest degree of association are identified (p is chosen according to the actual situation; the larger p is, the more accurate the analysis, but the greater the workload). The remaining features are assumed to have the same discriminative power, with weight 1. The top-p features are given weights, generally greater than 1, and the feature vectors are corrected accordingly.
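The weight correction can be sketched as an element-wise multiplication of the 0/1 feature vector by per-feature weights, with every feature outside the top p defaulting to weight 1. The weights 5, 4, 3 and 2 come from the student example in the text; the names `STUDENT_WEIGHTS` and `apply_weights` are hypothetical.

```python
# Feature order as listed in the feature-vector example.
FEATURES = ["seafood", "books", "walnut milk", "stationery", "erasers",
            "fruit", "flapjacks", "salmon", "soy milk"]

# Top-p weights for the student category (p = 4), taken from the example in the text.
STUDENT_WEIGHTS = {"books": 5, "stationery": 4, "erasers": 3, "fruit": 2}

def apply_weights(vector, weights, default=1):
    """Multiply each position by its feature weight; unlisted features keep weight 1."""
    return tuple(value * weights.get(name, default)
                 for name, value in zip(FEATURES, vector))

weighted = apply_weights((0, 1, 0, 1, 1, 1, 0, 0, 0), STUDENT_WEIGHTS)
```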

Weighting the extracted features greatly increases the degree of association between the features and their categories, making the classification results more reliable.

Embodiment 4

In the SVM-based group data classification method described above, this embodiment differs in that, as shown in Fig. 4, step S3 comprises steps S31 and S32.

Step S31 amounts to appending a column of data to the quadratic feature matrix. This column holds the category of each individual (the category information of the group), so that the category of every individual is included in the quadratic feature matrix, which makes it convenient to train the SVM classifier.
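Appending the category column, and splitting it off again before training, can be sketched as follows. The helper names and the sample rows are hypothetical.

```python
def add_class_column(matrix, classes):
    """Return a new matrix with each individual's category appended as the last column."""
    if len(matrix) != len(classes):
        raise ValueError("one category is required per individual (row)")
    return [list(row) + [label] for row, label in zip(matrix, classes)]

def split_features_and_classes(labelled):
    """Inverse operation: separate the feature columns from the category column."""
    features = [row[:-1] for row in labelled]
    classes = [row[-1] for row in labelled]
    return features, classes

matrix = [[0, 5, 0, 4, 3, 2, 0, 0, 0],
          [0, 0, 1, 0, 0, 1, 0, 0, 0]]
labelled = add_class_column(matrix, ["student", "elderly"])
features, classes = split_features_and_classes(labelled)
```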

Step S32: learning from the quadratic feature matrix with the category information, establishing a correspondence between the feature vectors and the category information of the group, and training the SVM classifier to obtain its discriminant function.

The number of SVM classifiers equals the number of categories of the group, so several SVM classifiers are trained, each corresponding to one category. During training, the category corresponding to a classifier is separated from all the remaining categories: that category is taken as the positive class and all the others as the negative class, and the classifier is trained to obtain its discriminant function, which is the $g(x)$ described above.

Only a small number of SVM classifiers therefore need to be trained, which significantly reduces the amount of computation and improves classification speed.

Suppose the customers are divided into k classes. Computing the support vector machine (SVM) model on the resulting feature-plus-category matrix yields k two-class support vector machines. The i-th machine separates class i from all the remaining classes: during training, class i is taken as the positive class and all other classes as the negative class. A k-class problem thus requires training only k two-class support vector machines, so the number of resulting classification functions (k) is small and classification is relatively fast.
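The one-versus-rest relabelling that precedes the training of the k two-class machines can be sketched as below: for each category, the individuals of that category are labelled +1 and all others -1. The function name `one_vs_rest_labels` and the sample categories are hypothetical; each resulting label list would then be handed to whatever two-class SVM trainer is used.

```python
def one_vs_rest_labels(classes):
    """For each distinct category, build a +1/-1 label list (that category vs. the rest)."""
    categories = sorted(set(classes))
    return {c: [1 if label == c else -1 for label in classes] for c in categories}

classes = ["student", "elderly", "student", "teacher"]
label_sets = one_vs_rest_labels(classes)
```

With k = 3 categories this produces exactly three label lists, one per classifier to be trained.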

Embodiment 5

In the SVM-based group data classification method described above, this embodiment differs in that, in step S4, the number of SVM classifiers equals the number of categories of the group, with one classifier per category. During classification, the group data to be classified are passed through all the SVM classifiers. If exactly one SVM classifier outputs a positive number, the data belong to the category corresponding to that classifier; if none or more than one SVM classifier outputs a positive number, the data belong to the category corresponding to the SVM classifier whose discriminant function value is largest.

At discrimination time, the input sample is passed through the k classifiers in turn, giving k output values $f_i(x) = \operatorname{sign}(g_i(x))$. If exactly one of them is +1, the corresponding category is the category of the input. In practice the constructed decision functions always carry some error: if more than one output is +1 (more than one class claims the sample) or no output is +1 (no class claims it), the values of $g_i(x)$ are compared, and the category with the largest value is taken as the category of the input sample.
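The decision rule of step S4 can be sketched as follows, assuming the k discriminant values g_i(x) for one sample have already been computed and are passed in as a dictionary keyed by category. The function name `classify` and the sample values are hypothetical.

```python
def classify(g_values):
    """Pick a category from the k discriminant values g_i(x) of one sample.

    g_values maps each category to the output of its one-vs-rest discriminant function.
    """
    positive = [c for c, value in g_values.items() if value > 0]
    if len(positive) == 1:
        # Exactly one classifier claims the sample.
        return positive[0]
    # Zero or several classifiers claim the sample: fall back to the largest g_i(x).
    return max(g_values, key=g_values.get)

one_claim = classify({"student": 0.8, "elderly": -1.2, "teacher": -0.3})
two_claims = classify({"student": 0.4, "elderly": 1.1, "teacher": -0.5})
no_claim = classify({"student": -0.9, "elderly": -0.2, "teacher": -0.5})
```

When exactly one value is positive it is also the largest value, so the rule as a whole is equivalent to choosing the category with the maximum discriminant value; the two branches are kept separate here only to mirror the wording of the text.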

In this way the group data to be classified only need to pass through a small number of SVM classifiers in turn, which significantly reduces the amount of computation and improves classification speed. Moreover, compared with other classifiers such as neural networks, decision trees and naive Bayes, the SVM method offers considerably better classifier performance and high classification accuracy, which improves the accuracy of group composition analysis.

Embodiment 6

This embodiment is an SVM-based group data classification device corresponding to the SVM-based group data classification method described above. As shown in Fig. 5, it comprises:

a historical data processing unit 1, which extracts historical group data and determines the group and the feature data of the group;

a feature matrix construction unit 2, which builds the quadratic feature matrix of the group from the feature data;

a classifier training unit 3, which trains the corresponding SVM classifier on the quadratic feature matrix;

a classifier classification unit 4, which classifies the group data to be classified with the SVM classifier.

In the historical data processing unit 1, once the categories of the group and the feature data corresponding to each category have been determined from the historical group data, the data can be sorted out and obviously erroneous records rejected. This improves the accuracy and speed of the subsequent analysis, and hence the speed and efficiency of the whole SVM-based group data classification device.

In the feature matrix construction unit 2, the quadratic feature matrix of the group is built from the group and the corresponding feature data determined by the above unit. The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the matrix is the degree of association between the corresponding individual and the corresponding feature data.

In this way the group and its feature data are converted into matrix form, that is, digitized, so that a computer can readily identify and classify them, which improves the efficiency and accuracy of the whole SVM-based group data classification device.

In classifier training unit 3,

Embodiment 7

In the SVM-based group data classification device described above, this embodiment differs in that, as shown in Fig. 6, the feature matrix construction unit 2 comprises:

a basic feature extraction subunit 21, which analyses the feature data of the group and extracts from it the basic features corresponding to each category of the group;

a feature vector conversion subunit 22, which converts the data in the historical group data into feature vectors;

a vector-to-matrix construction subunit 24, which builds the quadratic feature matrix of the group from the feature vectors.

In the basic feature extraction subunit 21,

In the feature vector conversion subunit 22,

In the vector-to-matrix construction subunit 24,

the row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the quadratic feature matrix is the degree of association between the corresponding individual and the corresponding feature data.

Embodiment 8

In the SVM-based group data classification device described above, this embodiment differs in that, as shown in Fig. 7, the feature matrix construction unit 2 further comprises:

a weight assignment subunit 23, which assigns different weights to the feature data according to their importance and corrects the feature vectors accordingly.

Embodiment 9

In the SVM-based group data classification device described above, this embodiment differs in that, as shown in Fig. 8, the classifier training unit 3 comprises:

a category information addition subunit 31, which adds the category information of each category of the group to the quadratic feature matrix;

a matrix learning and training subunit 32, which learns from the quadratic feature matrix with the category information, establishes a correspondence between the feature vectors and the category information of the group, and trains the SVM classifier to obtain its discriminant function.

In the category information addition subunit 31, a column of data is appended to the quadratic feature matrix. This column holds the category of each individual (the category information of the group), so that the category of every individual is included in the quadratic feature matrix, which makes it convenient to train the SVM classifier.

In the matrix learning and training subunit 32, the number of SVM classifiers equals the number of categories of the group, so several SVM classifiers are trained, each corresponding to one category. During training, the category corresponding to a classifier is separated from all the remaining categories: that category is taken as the positive class and all the others as the negative class, and the classifier is trained to obtain its discriminant function, which is the $g(x)$ described above.

Embodiment 10

In the SVM-based group data classification device described above, this embodiment differs in that, in the classifier classification unit 4, the number of SVM classifiers equals the number of categories of the group, with one classifier per category. During classification, the group data to be classified are passed through all the SVM classifiers. If exactly one SVM classifier outputs a positive number, the data belong to the category corresponding to that classifier; if none or more than one SVM classifier outputs a positive number, the data belong to the category corresponding to the SVM classifier whose discriminant function value is largest.

The foregoing describes only preferred embodiments of the present invention and is illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications and even equivalents may be made within the spirit and scope defined by the claims of the present invention, all of which fall within the protection scope of the present invention.

Claims

1. An SVM-based group data classification method, characterized by comprising:

2. The SVM-based group data classification method of claim 1, characterized in that step S2 comprises:

3. The SVM-based group data classification method of claim 2, characterized in that step S2 further comprises: step S23, assigning different weights to the feature data according to their importance, and correcting the feature vectors accordingly.

4. The SVM-based group data classification method of any one of claims 1-3, characterized in that step S3 comprises:

Step S32: learning from the quadratic feature matrix with the category information, establishing a correspondence between the feature vectors and the category information of the group, and training the SVM classifier to obtain its discriminant function.

5. The SVM-based group data classification method of any one of claims 1-3, characterized in that, in step S2, the row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the quadratic feature matrix is the degree of association between the corresponding individual and the corresponding feature data.

6. The SVM-based group data classification method of any one of claims 1-3, characterized in that, in step S4, the number of SVM classifiers equals the number of categories of the group.

7. The SVM-based group data classification method of claim 4, characterized in that, in step S4, the number of SVM classifiers equals the number of categories of the group, with one classifier per category; during classification, the group data to be classified are passed through all the SVM classifiers; if exactly one SVM classifier outputs a positive number, the group data to be classified belong to the category corresponding to that SVM classifier; if none or more than one SVM classifier outputs a positive number, the group data to be classified belong to the category corresponding to the SVM classifier whose discriminant function value is largest.

8. An SVM-based group data classification device corresponding to the SVM-based group data classification method of any one of the preceding claims, characterized by comprising:

9. The SVM-based group data classification device of claim 8, characterized in that the feature matrix construction unit comprises:

a basic feature extraction subunit, which analyses the feature data of the group and extracts from it the basic features corresponding to each category of the group;

10. The SVM-based group data classification device of claim 9, characterized in that the feature matrix construction unit further comprises: a weight assignment subunit, which assigns different weights to the feature data according to their importance and corrects the feature vectors accordingly.