CN108268873A - A kind of population data sorting technique and device based on SVM - Google Patents

A kind of population data sorting technique and device based on SVM

Info

Publication number
CN108268873A
Authority
CN
China
Prior art keywords
group
svm
classification
population data
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611254023.0A
Other languages
Chinese (zh)
Inventor
黄超
李青海
潘宇翔
王平
张晓亭
杨婉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Fine Point Data Polytron Technologies Inc
Original Assignee
Guangdong Fine Point Data Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Fine Point Data Polytron Technologies Inc filed Critical Guangdong Fine Point Data Polytron Technologies Inc
Priority to CN201611254023.0A priority Critical patent/CN108268873A/en
Publication of CN108268873A publication Critical patent/CN108268873A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an SVM-based population data classification method and device. The method comprises: step S1, extracting historical population data and determining the group and the feature data of the group; step S2, building a quadratic feature matrix of the group from the feature data; step S3, training a corresponding SVM classifier from the quadratic feature matrix; step S4, classifying the population data to be classified with the SVM classifier. The device comprises a corresponding historical data processing unit, feature matrix construction unit, classifier training unit and classifier classification unit. In this way, population data can be classified by computer, conveniently and quickly, which greatly saves manpower and material resources. In addition, compared with other classifiers, SVM offers considerably better classifier performance and high classification accuracy, thereby improving the accuracy of group composition analysis.

Description

A population data classification method and device based on SVM
Technical field
The present invention relates to the field of data classification, and in particular to an SVM-based population data classification method and device.
Background
Market research is a long-established discipline, and many research methods have emerged over its years of development. Since the start of the 21st century, with the development of computer technology, the computing platform for market research has gradually moved to the computer. Analyzing market data by computer makes it possible to quickly generate reports and all kinds of visual data models, which greatly reduces manual calculation and survey time and improves accuracy. In this information-dominated era, people attach ever greater importance to information. Likewise, when studying a group, it is essential to understand the composition of that group.
Analyzing the composition of a group essentially means classifying sample population data according to historical data. However, current classification is mainly done manually, which involves a heavy workload and is time-consuming and laborious. A method and device that can classify population data by computer are therefore needed.
In view of the above drawbacks, the inventors arrived at the present invention after long research and practice.
Summary of the invention
To remedy the above technical deficiency, the present invention first provides an SVM-based population data classification method, comprising:
Step S1: extract historical population data, and determine the group and the feature data of the group;
Step S2: build a quadratic feature matrix of the group from the feature data;
Step S3: train a corresponding SVM classifier from the quadratic feature matrix;
Step S4: classify the population data to be classified with the SVM classifier.
Preferably, step S2 comprises:
Step S21: analyze the feature data of the group and extract the basic features corresponding to each category of the group;
Step S22: convert the data in the historical population data into feature vectors;
Step S24: build the quadratic feature matrix of the group from the feature vectors.
Preferably, step S2 further comprises: step S23, assigning different weights to the feature data according to their importance, and correcting the feature vectors accordingly.
Preferably, step S3 comprises:
Step S31: add the category information of each category of the group to the quadratic feature matrix;
Step S32: learn from the quadratic feature matrix with the category information, establish a correspondence between the feature vectors and the category information of the group, and train the SVM classifier to obtain its discriminant function.
Preferably, in step S2, the row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the quadratic feature matrix is the degree of association between the corresponding individual and the corresponding feature data.
Preferably, in step S4, the number of SVM classifiers equals the number of categories of the group.
Preferably, in step S4, the number of SVM classifiers equals the number of categories of the group, with a one-to-one correspondence. During classification, the population data to be classified is passed through all the SVM classifiers. If exactly one SVM classifier outputs a positive number, the data belongs to the category corresponding to that classifier; if zero or more than one SVM classifier outputs a positive number, the data belongs to the category corresponding to the SVM classifier whose discriminant function value is the largest.
The present invention further provides an SVM-based population data classification device corresponding to the SVM-based population data classification method described above, comprising:
a historical data processing unit, which extracts historical population data and determines the group and the feature data of the group;
a feature matrix construction unit, which builds a quadratic feature matrix of the group from the feature data;
a classifier training unit, which trains a corresponding SVM classifier from the quadratic feature matrix;
a classifier classification unit, which classifies the population data to be classified with the SVM classifier.
Preferably, the feature matrix construction unit comprises:
a basic feature extraction subunit, which analyzes the feature data of the group and extracts the basic features corresponding to each category of the group;
a feature vector conversion subunit, which converts the data in the historical population data into feature vectors;
a vector-to-matrix construction subunit, which builds the quadratic feature matrix of the group from the feature vectors.
Preferably, the feature matrix construction unit further comprises a weight assignment subunit, which assigns different weights to the feature data according to their importance and corrects the feature vectors accordingly.
Compared with the prior art, the beneficial effects of the present invention are as follows. Population data can be classified by computer, conveniently and quickly, which greatly saves manpower and material resources. In addition, compared with other classifiers such as neural networks, decision trees and naive Bayes, SVM offers considerably better classifier performance and high classification accuracy, thereby improving the accuracy of group composition analysis. Analyzing and extracting the group's features can greatly increase the degree of association between the features and the corresponding category, making the classification results more reliable.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below.
Fig. 1 is a flowchart of the SVM-based population data classification method of the present invention;
Fig. 2 is a first flowchart of step S2 of the SVM-based population data classification method of the present invention;
Fig. 3 is a second flowchart of step S2 of the SVM-based population data classification method of the present invention;
Fig. 4 is a flowchart of step S3 of the SVM-based population data classification method of the present invention;
Fig. 5 is a structural diagram of the SVM-based population data classification device of the present invention;
Fig. 6 is a first structural diagram of the feature matrix construction unit of the SVM-based population data classification device of the present invention;
Fig. 7 is a second structural diagram of the feature matrix construction unit of the SVM-based population data classification device of the present invention;
Fig. 8 is a structural diagram of the classifier training unit of the SVM-based population data classification device of the present invention.
Detailed description
The above and other technical features and advantages of the present invention are described in more detail below with reference to the drawings.
Embodiment 1
Fig. 1 shows the flowchart of the SVM-based population data classification method of the present invention. The method comprises:
Step S1: extract historical population data, and determine the group and the feature data of the group;
The historical population data includes at least the categories of the group and the feature data corresponding to the group;
To analyze the historical population data, first determine from it each category of the group and the feature data corresponding to each category.
Taking shopping in a mall as an example, the group is the mall's shoppers. We can label them as students, white-collar workers, teachers, the elderly, young people, children and so on, and use these labels as the categories of the group. Where labels conflict, they can be adjusted according to the actual situation, but each category should be clearly distinguishable from the others; otherwise the accuracy of the subsequent classification drops considerably. The feature data of the group is tied to its categories. For example, the kinds of goods that the student category buys are the feature data of that category, which may include books, stationery, erasers, fruit, milk and so on; the feature data of the elderly category may include walnut milk, isatis root (banlangen), fruit and so on. The historical population data may come from daily shopping records compiled manually or by computer, depending on the actual situation.
Determining the categories of the group and the feature data of each category from the historical population data organizes the data and allows obviously erroneous records to be rejected, which improves the accuracy of the subsequent analysis. It also speeds up the subsequent analysis and thus improves the speed and efficiency of the whole SVM-based population data classification method.
Step S2: build a quadratic feature matrix of the group from the feature data;
The quadratic feature matrix of the group is built from the group and the corresponding feature data determined in the previous step. The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the matrix is the degree of association between the corresponding individual and the corresponding feature data.
In this way, the group and its feature data are converted into matrix form and digitized, which makes them easy for a computer to recognize and classify quickly, and in turn improves the efficiency and accuracy of the whole SVM-based population data classification method.
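The matrix layout described in step S2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_feature_matrix`, the feature list and the sample purchases are hypothetical, and the "degree of association" is simplified here to 1 (feature present) or 0 (feature absent), which the patent does not prescribe.

```python
def build_feature_matrix(purchases, features):
    """Return one row per individual and one column per feature.

    Assumption: the degree of association is reduced to 1 (the individual
    bought the item) or 0 (the individual did not).
    """
    return [[1 if f in items else 0 for f in features] for items in purchases]


# Hypothetical column order and sample data, for illustration only.
FEATURES = ["books", "stationery", "eraser", "fruit", "walnut milk"]
purchases = [
    {"books", "stationery", "fruit"},  # individual 0
    {"walnut milk", "fruit"},          # individual 1
]

matrix = build_feature_matrix(purchases, FEATURES)
for row in matrix:
    print(row)
```

Each row is one individual and each column one item of feature data, so the matrix can be handed to a classifier row by row.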
Step S3: train a corresponding SVM classifier from the quadratic feature matrix.
The SVM classifier is trained on the quadratic feature matrix built from the historical population data, yielding a trained SVM classifier that is later used to classify new population data.
An SVM solves the two-class classification problem mainly on the basis of structural risk minimization: it seeks an optimal separating hyperplane that separates the two classes of data with the maximum margin. Let the linearly separable sample set be $S = \{(x_i, y_i) \mid i = 1, \dots, n\}$, where $x_i \in R^d$ ($R^d$ is the d-dimensional feature space) and $y_i \in \{+1, -1\}$ is the class label of $x_i$. The general form of a linear discriminant function in d-dimensional space is $g(x) = w \cdot x + b$, and the corresponding separating hyperplane is $w \cdot x + b = 0$. Normalize $g(x)$ so that all samples of both classes satisfy $|g(x)| \ge 1$; the margin then equals $2/\|w\|$. Maximizing the margin is therefore equivalent to minimizing $\|w\|$, and requiring the hyperplane to classify all samples correctly means satisfying
$y_i[(w \cdot x_i) + b] - 1 \ge 0, \quad i = 1, 2, \dots, n$
The separating hyperplane that satisfies both conditions is the optimal separating hyperplane. The training samples that lie on the hyperplanes $H_1$ and $H_2$, which pass through the points of the two classes nearest to the separating hyperplane and are parallel to it, are exactly the samples for which equality holds in the formula above; they are called support vectors. The optimal separating hyperplane problem can be expressed as finding, under the constraint above, the minimum of the objective function
$\phi(w) = \frac{1}{2}\|w\|^2$
For linearly non-separable samples, slack variables $\xi_i$ and a penalty factor $C$ are introduced, and the objective function is rewritten as
$\phi(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
To solve this, Lagrange multipliers $(\alpha_1, \alpha_2, \dots, \alpha_n)$ are introduced, which converts the search for the optimal separating hyperplane into a constrained quadratic extremum problem. The corresponding solution is $w = \sum_i \alpha_i y_i x_i$, where $\alpha_i$ is non-zero only for the support vectors $x_i$. The optimal classification function can then be rewritten as
$f(x) = \mathrm{sign}\{(w \cdot x) + b\} = \mathrm{sign}\{\sum_i \alpha_i y_i (x_i \cdot x) + b\}$
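To make the classification function concrete, the sketch below evaluates the weight vector and the decision value from a given set of support vectors. It assumes the multipliers and the bias are already known; the function names and the two-point toy problem (support vectors (1, 1) and (-1, -1), each with multiplier 0.25 and bias 0) are hypothetical and chosen so that the values can be checked by hand.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, support_vectors):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(support_vectors[0])
    w = [0.0] * dim
    for a, y, x in zip(alphas, labels, support_vectors):
        for j in range(dim):
            w[j] += a * y * x[j]
    return w


def decision_value(x, alphas, labels, support_vectors, b):
    """g(x) = sum_i alpha_i * y_i * (x_i . x) + b."""
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b


def classify(x, alphas, labels, support_vectors, b):
    """f(x) = sign(g(x)); a value of exactly 0 is treated as +1 here (an assumption)."""
    return 1 if decision_value(x, alphas, labels, support_vectors, b) >= 0 else -1


# Hand-checkable toy problem (hypothetical values).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

w = weight_vector(alphas, labels, support_vectors)
print(w, classify((2.0, 0.0), alphas, labels, support_vectors, b))
```

For this toy problem both support vectors sit exactly on the margin, that is, $y_i\,g(x_i) = 1$, which is the equality condition described above.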
Step S4: classify the population data to be classified with the SVM classifier.
In this way, population data can be classified by computer, conveniently and quickly, which greatly saves manpower and material resources. In addition, compared with other classifiers such as neural networks, decision trees and naive Bayes, SVM offers considerably better classifier performance and high classification accuracy, thereby improving the accuracy of group composition analysis.
Embodiment 2
This embodiment is the SVM-based population data classification method described above, with the difference that, as shown in Fig. 2, step S2 comprises:
Step S21: analyze the feature data of the group and extract the basic features corresponding to each category of the group;
A group has several categories, and each category in turn has several items of feature data. These items are not all equally associated with the category, so a further extraction is needed. For example, the feature data of the student category includes books, stationery, erasers and so on, but because a student may occasionally buy seafood, soymilk or milk powder for some particular reason, the feature data may also contain seafood, soymilk and milk powder. These are not goods that all or most students buy; they are very likely one-off purchases by a few students. If seafood, soymilk and milk powder were also treated as feature data of the student category, the accuracy of the classification results would drop considerably, so extraction is necessary.
Extracting features that express the category information is the first prerequisite for machine learning. The better the feature data expresses the characteristics of the group, the higher its discriminative power and the better the machine learning result. Selecting effective classification features is therefore the key to classifying the group. Extracting from the feature data the basic features that describe each category can greatly improve the accuracy of group classification.
For example, books, stationery, erasers, fruit and milk are extracted from the feature data of the student category as the basic features of that category. This can greatly improve the accuracy of subsequent judgments.
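The patent does not say how the basic features are selected. One plausible reading of "goods that all or most students buy" is a frequency threshold, sketched below. The function name `extract_basic_features`, the `min_share` parameter and the sample baskets are assumptions made purely for illustration.

```python
from collections import Counter


def extract_basic_features(baskets, min_share=0.5):
    """Keep the items that appear in at least `min_share` of a category's baskets.

    Assumption: purchase frequency within the category stands in for the
    degree of association between an item and the category.
    """
    counts = Counter(item for basket in baskets for item in set(basket))
    total = len(baskets)
    return sorted(item for item, c in counts.items() if c / total >= min_share)


# Hypothetical shopping baskets of four students.
student_baskets = [
    ["books", "stationery", "eraser"],
    ["books", "fruit", "milk"],
    ["stationery", "fruit", "seafood"],
    ["books", "stationery", "milk", "fruit"],
]

basic = extract_basic_features(student_baskets)
print(basic)
```

With this sample data, seafood appears in only one of four baskets and is filtered out, mirroring the one-off purchase case discussed above; the same threshold also drops the eraser here, so in practice the threshold would need tuning against real data.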
Step S22: convert the data in the historical population data into feature vectors;
Each individual record in the historical population data carries the category it belongs to, and an individual always belongs to exactly one category. If the individual has a given basic feature of that category, the corresponding position is set to 1, otherwise to 0; each individual record is thus converted into a basic feature vector. For example, if a student buys books, stationery, an eraser, fruit and soymilk (soymilk is not a basic feature of students) in one trip, the feature vector may be (0,1,0,1,1,1,0,0,0), where the elements correspond in order to the feature data seafood, books, walnut milk, stationery, eraser, fruit, pancake, salmon and soymilk (this is only an illustration and disregards inclusion relations between the feature data). Because soymilk is not a basic feature of students, its position remains 0.
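The conversion in step S22 can be reproduced with a few lines of code. The sketch below uses the feature order and the student example from the text; the function name `to_feature_vector` and the English item names are assumptions introduced for illustration.

```python
# Feature order taken from the worked example in the text.
FEATURES = ["seafood", "books", "walnut milk", "stationery", "eraser",
            "fruit", "pancake", "salmon", "soymilk"]

# Basic features of the student category, as listed for step S21.
STUDENT_BASIC = {"books", "stationery", "eraser", "fruit", "milk"}


def to_feature_vector(purchased, basic_features, feature_order):
    """Set a position to 1 only if the item was bought AND is a basic feature."""
    return tuple(1 if (f in purchased and f in basic_features) else 0
                 for f in feature_order)


purchase = {"books", "stationery", "eraser", "fruit", "soymilk"}
vector = to_feature_vector(purchase, STUDENT_BASIC, FEATURES)
print(vector)  # (0, 1, 0, 1, 1, 1, 0, 0, 0)
```

Soymilk was bought but is not a basic feature of students, so the last position stays 0, as in the text.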
Step S24: build the quadratic feature matrix of the group from the feature vectors;
The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the matrix is the degree of association between the corresponding individual and the corresponding feature data.
Analyzing and extracting the group's features in this way can greatly increase the degree of association between the features and the corresponding category, making the classification results more reliable.
Embodiment 3
This embodiment is the SVM-based population data classification method described above, with the difference that, as shown in Fig. 3, step S2 further comprises:
Step S23: assign different weights to the feature data according to their importance, and correct the feature vectors accordingly;
Classification features usually need weights. Among the basic features of a category, the degree of association between each feature and the category differs; for students, for example, books and stationery are more strongly associated with the category than erasers and fruit. If these degrees of association are not distinguished, the subsequent classification becomes inaccurate. The feature data of a given category therefore needs to be given different weights, which are used to correct the feature vector. For example, if for the student category we assign books, stationery, erasers and fruit the weights 5, 4, 3 and 2 respectively, the feature vector becomes (0,5,0,4,3,2,0,0,0).
The weights are set as follows: for each category of the group, identify the p features with the highest degree of association (p is chosen according to the actual situation; a larger p gives a more accurate analysis but a correspondingly larger workload). The remaining features can be assumed to have the same discriminative power, with weight 1. The top-p features are given weights (generally greater than 1), and the feature vectors are then corrected accordingly.
Weighting the extracted features can greatly increase the degree of association between the features and the corresponding category, making the classification results more reliable.
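The weighting step can be illustrated as follows. The patent fixes neither the weight values nor how the top-p features are ranked, so the scheme below (weights p+1, p, ..., 2 in rank order, default weight 1) and the association scores are assumptions; they are chosen only because they reproduce the weights 5, 4, 3, 2 from the example with p = 4.

```python
FEATURES = ["seafood", "books", "walnut milk", "stationery", "eraser",
            "fruit", "pancake", "salmon", "soymilk"]


def top_p_weights(association, p):
    """Give the p most strongly associated features the weights p+1, p, ..., 2.

    Assumption: this linear scheme is illustrative; the text only says the
    weights are generally greater than 1.
    """
    ranked = sorted(association, key=association.get, reverse=True)[:p]
    return {feature: p + 1 - rank for rank, feature in enumerate(ranked)}


def apply_weights(vector, feature_order, weights, default=1):
    """Multiply each element by its feature's weight (default weight 1)."""
    return tuple(v * weights.get(f, default)
                 for v, f in zip(vector, feature_order))


# Hypothetical association scores for the student category.
association = {"books": 0.9, "stationery": 0.8, "eraser": 0.6,
               "fruit": 0.5, "milk": 0.3}

weights = top_p_weights(association, p=4)
weighted = apply_weights((0, 1, 0, 1, 1, 1, 0, 0, 0), FEATURES, weights)
print(weights)
print(weighted)  # (0, 5, 0, 4, 3, 2, 0, 0, 0)
```

Features outside the top p keep the default weight 1, so a 1 in the binary vector stays 1 for them.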
Embodiment 4
This embodiment is the SVM-based population data classification method described above, with the difference that, as shown in Fig. 4, step S3 comprises:
Step S31: add the category information of each category of the group to the quadratic feature matrix;
That is, a column is appended to the quadratic feature matrix; it holds the category of each individual (the category information of the group). Adding each individual's category to the quadratic feature matrix in this way makes it convenient to train the SVM classifiers.
Step S32: learn from the quadratic feature matrix with the category information, establish a correspondence between the feature vectors and the category information of the group, and train the SVM classifier to obtain its discriminant function;
The number of SVM classifiers equals the number of categories of the group, so several SVM classifiers are trained, each corresponding to one category. In training, the category that corresponds to a classifier is separated from all the other categories: that category is taken as the positive class and all the others as the negative class, and the classifier is trained to obtain its discriminant function, which is the g(x) above.
In this way only a small number of SVM classifiers need to be trained, which greatly reduces the amount of computation and increases classification speed.
Suppose the customers are divided into k classes. Computing support vector machine (SVM) models from the resulting feature-plus-category matrix yields k two-class support vector machines. The i-th machine separates class i from all the remaining classes: in training, class i is taken as the positive class and all the other classes as the negative class. A k-class problem thus needs only k two-class support vector machines; the number of classification functions obtained (k) is small, so classification is relatively fast.
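Appending the category column of step S31 is a one-line operation. In the sketch below the function name `add_category_column` and the sample rows are hypothetical.

```python
def add_category_column(matrix, categories):
    """Append each individual's category as the last column of its row."""
    if len(matrix) != len(categories):
        raise ValueError("one category is needed per individual (row)")
    return [list(row) + [category] for row, category in zip(matrix, categories)]


# Hypothetical weighted feature rows and their known categories.
feature_matrix = [
    [0, 5, 0, 4, 3, 2, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0],
]
categories = ["student", "elderly"]

labelled = add_category_column(feature_matrix, categories)
for row in labelled:
    print(row)
```

The feature columns and the label column can later be split apart again with `row[:-1]` and `row[-1]` when the classifiers are trained.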
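The one-against-the-rest training scheme can be sketched as below. Note the substitution: instead of solving the constrained quadratic problem with Lagrange multipliers as described earlier, this sketch trains each linear classifier by batch subgradient descent on the regularized hinge loss, a simpler stand-in that needs no solver. All function names, hyperparameters and the toy data (including the "toys" column) are assumptions.

```python
def one_vs_rest_labels(categories, positive):
    """+1 for the classifier's own category, -1 for every other category."""
    return [1 if c == positive else -1 for c in categories]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    Stand-in for the quadratic-programming solution; returns (w, b) of
    g(x) = w . x + b.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_rest(X, categories):
    """Train k two-class classifiers, one per category."""
    return {c: train_linear_svm(X, one_vs_rest_labels(categories, c))
            for c in sorted(set(categories))}


# Toy data: columns are books, walnut milk, toys (hypothetical).
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
categories = ["student", "student", "elderly", "elderly", "child", "child"]

models = train_one_vs_rest(X, categories)
print(sorted(models))
```

With k = 3 categories, exactly three (w, b) pairs are produced, one per category, which is the "k two-class machines for a k-class problem" property described above.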
Embodiment 5
This embodiment is the SVM-based population data classification method described above, with the difference that, in step S4, the number of SVM classifiers equals the number of categories of the group, with a one-to-one correspondence. During classification, the population data to be classified is passed through all the SVM classifiers. If exactly one SVM classifier outputs a positive number, the data belongs to the category corresponding to that classifier; if zero or more than one SVM classifier outputs a positive number, the data belongs to the category corresponding to the SVM classifier whose discriminant function value is the largest.
In classification, the sample is passed through the k classifiers to obtain k output values $f_i(x) = \mathrm{sign}(g_i(x))$. If exactly one +1 appears, its category is the category of the input sample. In practice, the constructed decision functions always have some error. If more than one output is +1 (more than one class claims the sample) or no output is +1 (no class claims it), the values $g_i(x)$ are compared, and the category with the largest value is the category of the input sample.
In this way, the population data to be classified only needs to pass through a small number of SVM classifiers, which greatly reduces the amount of computation and increases classification speed. Moreover, compared with other classifiers such as neural networks, decision trees and naive Bayes, the SVM method offers considerably better classifier performance and high classification accuracy, thereby improving the accuracy of group composition analysis.
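The decision rule above can be written down directly. In this sketch the function name `classify_one_vs_rest` and the sample discriminant values are hypothetical; the input is assumed to be a mapping from each category to its value $g_i(x)$.

```python
def classify_one_vs_rest(g_values):
    """Pick a category from the k discriminant values g_i(x).

    If exactly one classifier is positive, its category wins; if none or
    several are positive, the category with the largest g_i(x) wins.
    """
    positives = [c for c, g in g_values.items() if g > 0]
    if len(positives) == 1:
        return positives[0]
    return max(g_values, key=g_values.get)


# Hypothetical discriminant values for three samples.
print(classify_one_vs_rest({"student": 0.8, "elderly": -0.3, "child": -1.2}))
print(classify_one_vs_rest({"student": 0.4, "elderly": 1.1, "child": -0.2}))
print(classify_one_vs_rest({"student": -0.9, "elderly": -0.1, "child": -0.5}))
```

When exactly one value is positive it is necessarily also the largest, so the two branches never disagree; the explicit first branch is kept only to mirror the wording of the text.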
Embodiment 6
This embodiment is an SVM-based population data classification device corresponding to the SVM-based population data classification method described above. As shown in Fig. 5, it comprises:
a historical data processing unit 1, which extracts historical population data and determines the group and the feature data of the group;
a feature matrix construction unit 2, which builds a quadratic feature matrix of the group from the feature data;
a classifier training unit 3, which trains a corresponding SVM classifier from the quadratic feature matrix;
a classifier classification unit 4, which classifies the population data to be classified with the SVM classifier.
In this way, population data can be classified by computer, conveniently and quickly, which greatly saves manpower and material resources. In addition, compared with other classifiers such as neural networks, decision trees and naive Bayes, SVM offers considerably better classifier performance and high classification accuracy, thereby improving the accuracy of group composition analysis.
In the historical data processing unit 1:
The historical population data includes at least the categories of the group and the feature data corresponding to the group;
To analyze the historical population data, first determine from it each category of the group and the feature data corresponding to each category.
Taking shopping in a mall as an example, the group is the mall's shoppers. We can label them as students, white-collar workers, teachers, the elderly, young people, children and so on, and use these labels as the categories of the group. Where labels conflict, they can be adjusted according to the actual situation, but each category should be clearly distinguishable from the others; otherwise the accuracy of the subsequent classification drops considerably. The feature data of the group is tied to its categories. For example, the kinds of goods that the student category buys are the feature data of that category, which may include books, stationery, erasers, fruit, milk and so on; the feature data of the elderly category may include walnut milk, isatis root (banlangen), fruit and so on. The historical population data may come from daily shopping records compiled manually or by computer, depending on the actual situation.
Determining the categories of the group and the feature data of each category from the historical population data organizes the data and allows obviously erroneous records to be rejected, which improves the accuracy of the subsequent analysis. It also speeds up the subsequent analysis and thus improves the speed and efficiency of the whole SVM-based population data classification device.
In the feature matrix construction unit 2:
The quadratic feature matrix of the group is built from the group and the corresponding feature data determined by the preceding unit. The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the feature data of the group respectively, and each element of the matrix is the degree of association between the corresponding individual and the corresponding feature data.
In this way, the group and its feature data are converted into matrix form and digitized, which makes them easy for a computer to recognize and classify quickly, and in turn improves the efficiency and accuracy of the whole SVM-based population data classification device.
In the classifier training unit 3:
The SVM classifier is trained on the quadratic feature matrix built from the historical population data, yielding a trained SVM classifier that is later used to classify new population data.
In this way, population data can be classified by computer, conveniently and quickly, which greatly saves manpower and material resources. In addition, compared with other classifiers such as neural networks, decision trees and naive Bayes, SVM offers considerably better classifier performance and high classification accuracy, thereby improving the accuracy of group composition analysis.
Embodiment 7
The SVM-based population data classification device is as described above; this embodiment differs in that, as shown in Fig. 6, the feature matrix construction unit 2 includes:
A basic feature extraction subunit 21, which analyzes the characteristic data of the group and extracts from it the basic features corresponding to each category of the group;
A feature vector conversion subunit 22, which converts the data in the history population data into feature vectors;
A vector-to-matrix construction subunit 24, which builds the quadratic feature matrix of the group from the feature vectors.
In this way the characteristics of the group are analyzed and extracted, which greatly increases the degree of association between the features and their categories and makes the classification results more reliable.
In the basic feature extraction subunit 21,
A group has multiple categories, and each category in turn has multiple items of characteristic data. The degrees of association between these items and the category are not all the same, so extraction is needed. For example, the characteristic data of the student category includes books, stationery, erasers, and so on, but because a student may for some particular reason buy seafood, soymilk, or milk powder once, the characteristic data may also include seafood, soymilk, and milk powder. These are not products that all or most students buy and are very likely one-off purchases by a few students. If seafood, soymilk, and milk powder were also regarded as characteristic data of the student category, the accuracy of the classification results would drop substantially, so extraction is needed.
Extracting features that can express category information is the primary prerequisite for machine learning. The better the characteristic data expresses the traits of the group, the higher its discriminating power and the better the machine learning result. Selecting effective classification features for the group is therefore the key to classifying the group. Extracting from the characteristic data of the group the basic features that describe each category can greatly improve the accuracy of group classification.
For example, books, stationery, erasers, fruit, milk, and so on are extracted from the characteristic data of the student category as the basic features of that category. This greatly improves the accuracy of subsequent judgments.
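The extraction step can be sketched as follows. The text does not state how a basic feature is selected, so this minimal sketch assumes one plausible rule: an item counts as a basic feature of a category when at least a given share of that category's individuals bought it. The function name `extract_basic_features`, the `min_share` threshold of 0.5, and the sample baskets are all hypothetical; `eraser` denotes the rubber (eraser) item of the example.

```python
from collections import Counter


def extract_basic_features(purchases, min_share=0.5):
    """Return the items bought by at least `min_share` of the individuals.

    `purchases` holds one list of items per individual of a single category.
    The share threshold is an assumed selection rule, not taken from the text.
    """
    counts = Counter()
    for basket in purchases:
        counts.update(set(basket))  # count each item once per individual
    total = len(purchases)
    return sorted(item for item, n in counts.items() if n / total >= min_share)


# Hypothetical purchase records for the student category.
student_purchases = [
    ["books", "stationery", "eraser", "fruit"],
    ["books", "stationery", "milk"],
    ["books", "eraser", "fruit", "seafood"],
    ["stationery", "fruit", "milk", "soymilk"],
]
basic = extract_basic_features(student_purchases)
```

With this sample, seafood and soymilk are each bought by only one of four students, so they fall below the assumed threshold and are left out of the basic features.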
In the feature vector conversion subunit 22,
For each individual record in the history population data, the category it belongs to is known, and an individual necessarily belongs to one category. If the individual has a given basic feature of that category, the corresponding position is marked 1, otherwise 0; each individual record is thereby converted into a basic feature vector. For example, if a student buys books, stationery, erasers, fruit, and soymilk (not a basic feature of students) in a single purchase, the feature vector may be (0,1,0,1,1,1,0,0,0), where the elements correspond in order to the characteristic data seafood, books, walnut milk, stationery, erasers, fruit, flapjack, salmon, and soymilk (this is only a demonstration and does not consider inclusion relations between items of characteristic data). Because soymilk is not a basic feature of students, its position is still marked 0.
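A minimal sketch of this conversion, reproducing the example vector from the paragraph above. The names `FEATURES`, `STUDENT_BASIC`, and `to_feature_vector` are hypothetical, and `eraser` denotes the rubber (eraser) item of the example.

```python
# Feature order taken from the example in the text.
FEATURES = ["seafood", "books", "walnut milk", "stationery", "eraser",
            "fruit", "flapjack", "salmon", "soymilk"]

# Basic features of the student category, as listed in the text.
STUDENT_BASIC = {"books", "stationery", "eraser", "fruit", "milk"}


def to_feature_vector(basket, basic_features, features=FEATURES):
    """Mark 1 where the individual bought an item that is a basic feature
    of its category, and 0 everywhere else."""
    bought = set(basket)
    return tuple(1 if f in bought and f in basic_features else 0
                 for f in features)


vec = to_feature_vector(
    ["books", "stationery", "eraser", "fruit", "soymilk"], STUDENT_BASIC)
```

Soymilk was bought but is not a basic feature of students, so its position stays 0, as in the text.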
In the vector-to-matrix construction subunit 24,
The row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the characteristic data of the group, respectively, and each element of the quadratic feature matrix is the degree of association between the corresponding individual and the corresponding characteristic data.
In this way the characteristics of the group are analyzed and extracted, which greatly increases the degree of association between the features and their categories and makes the classification results more reliable.
Embodiment 8
The SVM-based population data classification device is as described above; this embodiment differs in that, as shown in Fig. 7, the feature matrix construction unit 2 further includes:
A weight assignment subunit 23, which assigns different weights to the characteristic data according to their importance and corrects the feature vectors accordingly.
Classification features usually need to be weighted. Among the basic features of a category, the degree of association between each feature and the category differs; for students, for example, books and stationery are more strongly associated with the category than erasers and fruit. If these degrees of association are not distinguished, the subsequent classification becomes inaccurate. The characteristic data therefore needs to be given different weights for each category so as to correct the feature vectors. For example, if for the student category the weights of books, stationery, erasers, and fruit are set to 5, 4, 3, and 2 respectively, the feature vector is corrected to (0,5,0,4,3,2,0,0,0).
The process of setting weights for the characteristic data is as follows. For each category in the group, the features ranked in the top p by degree of association are determined (the value of p is chosen according to actual conditions; the larger p is, the more accurate the analysis result, but the greater the workload). The remaining features can be assumed to have the same discriminating power, with a weight of 1. The top p features are given weights (generally greater than 1), and the feature vectors are corrected accordingly.
Weighting the extracted features greatly increases the degree of association between the features and their categories and makes the classification results more reliable.
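The weighting step can be sketched as an element-wise multiplication of the 0/1 feature vector by per-feature weights, using the weights 5, 4, 3, 2 from the example and a default weight of 1 for every feature outside the top p. The names `apply_weights` and `STUDENT_WEIGHTS` are hypothetical; how the degree of association is ranked is not shown here.

```python
# Feature order taken from the example in the text.
FEATURES = ["seafood", "books", "walnut milk", "stationery", "eraser",
            "fruit", "flapjack", "salmon", "soymilk"]

# Weights of the top-ranked student features, as given in the text.
STUDENT_WEIGHTS = {"books": 5, "stationery": 4, "eraser": 3, "fruit": 2}


def apply_weights(vector, weights, features=FEATURES, default=1):
    """Multiply each 0/1 entry by its feature weight; features without an
    explicit weight keep the default weight of 1."""
    return tuple(v * weights.get(f, default) for v, f in zip(vector, features))


weighted = apply_weights((0, 1, 0, 1, 1, 1, 0, 0, 0), STUDENT_WEIGHTS)
```

Positions that were 0 stay 0 regardless of weight, so only features the individual actually has are scaled.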
Embodiment 9
The SVM-based population data classification device is as described above; this embodiment differs in that, as shown in Fig. 8, the classifier training unit 3 includes:
A class information adding subunit 31, which adds the class information of each category of the group to the quadratic feature matrix;
A matrix learning and training subunit 32, which learns the quadratic feature matrix with the class information, establishes a correspondence between the feature vectors and the class information of the group, and trains the SVM classifier to obtain its discriminant function.
In this way only a small number of SVM classifiers need to be trained, which greatly reduces the computational workload and improves the classification speed.
In the class information adding subunit 31, a column of data is added to the quadratic feature matrix. This column holds the category (the class information of the group) of each corresponding individual, so that the category of every individual is added to the quadratic feature matrix, which makes it convenient to train the SVM classifier.
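A minimal sketch of adding the class column, assuming the matrix is held as a list of rows with one row per individual. The helper names `add_class_column` and `split_features_and_classes` and the two sample rows are hypothetical; the second helper shows how a training routine might separate the features from the labels again.

```python
def add_class_column(feature_rows, categories):
    """Append each individual's category as the last column of its row."""
    if len(feature_rows) != len(categories):
        raise ValueError("one category per individual is required")
    return [list(row) + [category]
            for row, category in zip(feature_rows, categories)]


def split_features_and_classes(labeled_matrix):
    """Separate the feature columns from the trailing class column."""
    X = [row[:-1] for row in labeled_matrix]
    y = [row[-1] for row in labeled_matrix]
    return X, y


# Illustrative rows: one weighted student vector and one elderly vector.
rows = [(0, 5, 0, 4, 3, 2, 0, 0, 0), (0, 0, 1, 0, 0, 1, 0, 0, 0)]
labeled = add_class_column(rows, ["student", "elderly"])
X, y = split_features_and_classes(labeled)
```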
In the matrix learning and training subunit 32,
The number of SVM classifiers equals the number of categories of the group, so multiple SVM classifiers are trained, each corresponding to one category of the group. During training, the category corresponding to a classifier is separated from all remaining categories: that category is taken as the positive class and the remaining categories as the negative class, and the classifier is trained to obtain its discriminant function, which is the function g(x).
In this way only a small number of SVM classifiers need to be trained, which greatly reduces the computational workload and improves the classification speed.
Suppose the customers are divided into k classes. Support vector machine (SVM) models are computed from the resulting matrix of features plus class information, which yields k two-class support vector machines. The i-th machine separates the i-th class from all remaining classes: during training the i-th class is taken as the positive class and all other classes as the negative class. A k-class classification problem therefore requires training only k two-class support vector machines, so the number of resulting classification functions (k) is small and classification is relatively fast.
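The one-against-rest training scheme can be sketched as follows. The passage does not state the kernel or the solver, so this sketch assumes a linear SVM fitted by plain subgradient descent on the regularized hinge loss, purely to stay self-contained; a production system would more likely use an established SVM library. The function names, the hyperparameters `lam`, `lr`, and `epochs`, and the six-row data set (including the `toys` column) are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit g(x) = w.x + b by subgradient descent on the L2-regularized
    hinge loss. y holds +1 (positive class) or -1 (negative class)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]  # regularization shrink
            if margin < 1.0:  # inside the margin: step toward this sample
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def g(model, x):
    """Discriminant value g(x) = w.x + b of one binary classifier."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_one_vs_rest(X, labels):
    """Train k binary SVMs: for each one, its category is +1, the rest -1."""
    models = {}
    for category in sorted(set(labels)):
        y = [1 if label == category else -1 for label in labels]
        models[category] = train_linear_svm(X, y)
    return models


# Columns: books, stationery, walnut milk, fruit, toys (illustrative only).
X = [(1, 1, 0, 1, 0), (1, 1, 0, 0, 0),
     (0, 0, 1, 1, 0), (0, 0, 1, 0, 0),
     (0, 0, 0, 1, 1), (0, 0, 0, 0, 1)]
labels = ["student", "student", "elderly", "elderly", "child", "child"]
models = train_one_vs_rest(X, labels)
```

Three categories give three binary classifiers; on this small separable data set each classifier's g(x) is positive only for the rows of its own category.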
Embodiment 10
The SVM-based population data classification device is as described above; this embodiment differs in that, in the classifier classification unit 4, the SVM classifiers are equal in number to the categories of the group and correspond to them one to one. During classification, the population data to be classified passes through all of the SVM classifiers. If exactly one SVM classifier outputs a positive number, the population data to be classified belongs to the category corresponding to that SVM classifier. If zero or more than one SVM classifier outputs a positive number, the population data to be classified belongs to the category corresponding to the SVM classifier whose discriminant function value is the largest among all the SVM classifiers.
When discriminating, the sample passes through each of the k classifiers to obtain k output values f_i(x) = sign(g_i(x)). If only one +1 appears, its corresponding class is the class of the input. In practice the constructed decision functions always have errors. If more than one output is +1 (more than one class claims the sample) or no output is +1 (no class claims the sample), the values g_i(x) are compared, and the class with the largest value is the class of the input sample.
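The decision rule can be sketched directly from this description. The function names `sign` and `classify` and the sample g values are hypothetical; a value of exactly 0 is treated as not positive, which is an assumption, since the text only speaks of positive outputs.

```python
def sign(value):
    """Return +1 for a positive discriminant value and -1 otherwise."""
    return 1 if value > 0 else -1


def classify(g_values):
    """Pick a category from the k discriminant values g_i(x).

    `g_values` maps each category to the g_i(x) of its classifier.
    """
    claimed = [c for c, value in g_values.items() if sign(value) == 1]
    if len(claimed) == 1:  # exactly one classifier claims the sample
        return claimed[0]
    # Zero or several claims: fall back to the largest discriminant value.
    return max(g_values, key=g_values.get)


one_claim = classify({"student": 0.8, "elderly": -1.2, "child": -0.3})
two_claims = classify({"student": 0.4, "elderly": 0.9, "child": -1.0})
no_claim = classify({"student": -0.2, "elderly": -0.7, "child": -0.1})
```

When exactly one g_i(x) is positive it is necessarily also the largest value, so the two branches agree and the rule amounts to always choosing the category with the largest g_i(x).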
In this way the population data to be classified only needs to pass through a small number of SVM classifiers in turn, which greatly reduces the computational workload and improves the classification speed. Moreover, compared with other classifiers such as neural networks, decision trees, and naive Bayes, the SVM method offers a considerable improvement in classifier performance and the advantage of high classification accuracy, which improves the accuracy of the group composition analysis.
The foregoing describes only preferred embodiments of the present invention, which are illustrative of the invention rather than restrictive. Those skilled in the art will understand that many changes, modifications, and even equivalents may be made within the spirit and scope defined by the claims of the present invention, all of which fall within the protection scope of the present invention.

Claims (10)

1. An SVM-based population data classification method, characterized by comprising:
Step S1: extracting history population data, and determining the group and the characteristic data of the group;
Step S2: building the quadratic feature matrix of the group according to the characteristic data;
Step S3: training the corresponding SVM classifier according to the quadratic feature matrix;
Step S4: classifying the population data to be classified using the SVM classifier.
2. The SVM-based population data classification method as claimed in claim 1, characterized in that the step S2 comprises:
Step S21: analyzing the characteristic data of the group, and extracting from it the basic features corresponding to each category of the group;
Step S22: converting the data in the history population data into feature vectors;
Step S24: building the quadratic feature matrix of the group from the feature vectors.
3. The SVM-based population data classification method as claimed in claim 2, characterized in that the step S2 further comprises: Step S23: assigning different weights to the characteristic data according to their importance, and correcting the feature vectors.
4. The SVM-based population data classification method as claimed in any one of claims 1-3, characterized in that the step S3 comprises:
Step S31: adding the class information of each category of the group to the quadratic feature matrix;
Step S32: learning the quadratic feature matrix with the class information, establishing a correspondence between the feature vectors and the class information of the group, and training the SVM classifier to obtain its discriminant function.
5. The SVM-based population data classification method as claimed in any one of claims 1-3, characterized in that, in the step S2, the row vectors and column vectors of the quadratic feature matrix represent the individuals of the group and the characteristic data of the group, respectively, and each element of the quadratic feature matrix is the degree of association between the corresponding individual and the corresponding characteristic data.
6. The SVM-based population data classification method as claimed in any one of claims 1-3, characterized in that, in the step S4, the number of SVM classifiers is the same as the number of categories of the group.
7. The SVM-based population data classification method as claimed in claim 4, characterized in that, in the step S4, the SVM classifiers are equal in number to the categories of the group and correspond to them one to one; during classification, the population data to be classified passes through all of the SVM classifiers; if exactly one SVM classifier outputs a positive number, the population data to be classified belongs to the category corresponding to that SVM classifier; if zero or more than one SVM classifier outputs a positive number, the population data to be classified belongs to the category corresponding to the SVM classifier whose discriminant function value is the largest among all the SVM classifiers.
8. An SVM-based population data classification device corresponding to the SVM-based population data classification method of any one of the preceding claims, characterized by comprising:
A historical data processing unit, which extracts history population data and determines the group and the characteristic data of the group;
A feature matrix construction unit, which builds the quadratic feature matrix of the group according to the characteristic data;
A classifier training unit, which trains the corresponding SVM classifier according to the quadratic feature matrix;
A classifier classification unit, which classifies the population data to be classified using the SVM classifier.
9. The SVM-based population data classification device as claimed in claim 8, characterized in that the feature matrix construction unit comprises:
A basic feature extraction subunit, which analyzes the characteristic data of the group and extracts from it the basic features corresponding to each category of the group;
A feature vector conversion subunit, which converts the data in the history population data into feature vectors;
A vector-to-matrix construction subunit, which builds the quadratic feature matrix of the group from the feature vectors.
10. The SVM-based population data classification device as claimed in claim 9, characterized in that the feature matrix construction unit further comprises: a weight assignment subunit, which assigns different weights to the characteristic data according to their importance and corrects the feature vectors.
CN201611254023.0A 2016-12-30 2016-12-30 A kind of population data sorting technique and device based on SVM Pending CN108268873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611254023.0A CN108268873A (en) 2016-12-30 2016-12-30 A kind of population data sorting technique and device based on SVM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611254023.0A CN108268873A (en) 2016-12-30 2016-12-30 A kind of population data sorting technique and device based on SVM

Publications (1)

Publication Number Publication Date
CN108268873A true CN108268873A (en) 2018-07-10

Family

ID=62754314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611254023.0A Pending CN108268873A (en) 2016-12-30 2016-12-30 A kind of population data sorting technique and device based on SVM

Country Status (1)

Country Link
CN (1) CN108268873A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197207A (en) * 2019-05-13 2019-09-03 腾讯科技(深圳)有限公司 To not sorting out the method and relevant apparatus that user group is sorted out

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180710