CN107729926A

CN107729926A - Data augmentation method based on high-dimensional space transformation, and machine recognition system

Info

Publication number: CN107729926A
Application number: CN201710899032.3A
Authority: CN
Inventors: 赵凤军; 吴斌; 贺小伟; 侯榆青; 易黄建; 曹欣; 王宾
Original assignee: Northwest University
Current assignee: Northwest University
Priority date: 2017-09-28
Filing date: 2017-09-28
Publication date: 2018-02-23
Anticipated expiration: 2037-09-28
Also published as: CN107729926B

Abstract

The invention belongs to the technical field of image processing and machine learning, and discloses a data augmentation method based on high-dimensional space transformation and a machine recognition system. Background sample data are transformed from the original space into a high-dimensional space; the distribution of the target samples in the high-dimensional space is obtained from the distribution histogram of the background samples, and high-dimensional target sample data are generated; a system of equations built from a distance function is then used to transform the augmented data from the high-dimensional space back to the original space. By learning the distribution histogram of the negative samples, the invention expands the corresponding positive sample data set, resolves the imbalance between positive and negative sample data in machine learning models, and improves classification performance, in particular the classification accuracy for positive samples. Statistical analysis of the background samples yields the distribution of the target sample data to be generated, from which the target samples are generated; this improves the validity of the augmented data and avoids the sample overlap and model overfitting caused by traditional methods that synthesize new target samples from a small number of samples.

Description

Data augmentation method based on high-dimensional space transformation, and machine recognition system

Technical field

The invention belongs to the technical field of image processing and machine learning, and in particular relates to a data augmentation method based on high-dimensional space transformation and to a machine recognition system.

Background art

Machine learning studies how machines can use existing knowledge to acquire new knowledge and new skills, and it is widely applied in fields such as image recognition, data mining and fault diagnosis. Machine learning techniques first require the sample data to be processed and used for training. In practical applications, sample data sets are often imbalanced: the number of negative samples usually far exceeds the number of positive samples, and training on such a data set degrades the classification performance of the classifier. In vascular plaque recognition, for example, plaque accounts for only a small share of vascular system samples and most samples are healthy vessels. A classifier trained on such samples has low accuracy: it may identify a normal vessel as one containing plaque and so misjudge the patient's condition, or it may identify a vessel with plaque as normal and so delay the patient's treatment. Classifying such imbalanced data correctly and improving classification accuracy is therefore of great significance to the research fields concerned.
At present, imbalanced data sets are handled in two main ways. The first works at the data level, balancing the data set by sampling or augmenting the samples under study; the second works at the algorithm level, improving the algorithm to raise classifier performance. Traditional data-level methods fall into two kinds. The first is the sampling algorithm, which samples the negative samples so that the number of sampled negative samples equals the number of original positive samples. This loses the information carried by the samples that are not selected; when the negative samples far outnumber the positive samples, most of the information in the samples under study is lost and the number of samples taking part in training is seriously insufficient. The second is data augmentation, which increases the number of positive samples: the target samples are analysed and new samples are synthesized from them to balance the data set, for example by simply copying positive samples, adding noise to positive samples, or rotating or flipping positive samples. Simple data augmentation, however, easily causes sample overlap and model overfitting and makes the model harder to train. To improve on it, some scholars have proposed new augmentation algorithms. The SMOTE algorithm balances the data set by synthesizing new samples through linear interpolation between positive samples that lie close together. It generates new samples for every positive sample, which alleviates model overfitting but easily causes sample overlap; it also ignores the influence that samples near the classification boundary and isolated points have on the classification of target samples, so the synthesis of new samples is somewhat blind. The BSMOTE algorithm builds on SMOTE: it uses a nearest-neighbour algorithm to divide the target samples into noise samples, interior samples (samples far from the classification boundary) and boundary samples, and synthesizes new samples from the target samples on the classification boundary. This algorithm ignores the background samples and isolated points, and is not suitable for data with few target samples.
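The linear-interpolation step on which SMOTE-style methods rely can be illustrated with a minimal Python sketch. It is a simplification for illustration only: the function name `smote_like_sample` is hypothetical, the neighbour is supplied by the caller rather than selected among nearest neighbours, and it is not the method of the invention.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and a nearby positive sample."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
new_point = smote_like_sample([1.0, 2.0], [3.0, 6.0], rng)
```

Every synthetic point lies on the segment joining two existing positive samples, which is why such methods tend to produce overlapping samples when the positive class is small.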

In summary, the problems of the prior art are as follows: synthesizing new samples from an analysis of the target samples easily causes sample overlap and ignores boundary points and isolated points. Because the training samples are limited, the classifier is inaccurate and the improvement in target-sample classification performance is restricted: sample overlap may lead to model overfitting, and ignoring boundary points and isolated points may lead to those sample points being misclassified.

Summary of the invention

In view of the problems of the prior art, the invention provides a data augmentation method based on high-dimensional space transformation, and a machine recognition system.

The invention is realized as follows. A data augmentation method based on high-dimensional space transformation transforms background sample data from the original space into a high-dimensional space; obtains the distribution of the target samples in the high-dimensional space from the distribution histogram of the background samples and generates high-dimensional target sample data; and uses a system of equations built from a distance function to transform the augmented data from the high-dimensional space back to the original space.

Further, the data augmentation method based on high-dimensional space transformation comprises the following steps:

Step 1: the data samples are divided into positive samples and negative samples, the positive samples being the target samples and the negative samples being the background samples. The squared Euclidean distance between each background sample and every background sample is calculated, giving the high-dimensional space transformation of the background samples and thereby transforming the background sample data from the original space into the high-dimensional space;

Step 2: the histogram of the high-dimensional background samples is computed for each dimension, and the sample distribution of each dimension is normalized. The complement of the normalized background histogram is taken to obtain the histogram distribution of the target samples in each dimension, which is standardized to give the probability distribution of the target samples. From the probability distribution in each dimension, the number of sample points to be generated in that dimension and their value ranges are obtained. Preliminary target sample data are generated from the probability distribution of each dimension, and the order of the values within each dimension is randomly shuffled to produce the target sample data in the high-dimensional space;

Step 3: the distance between a background sample point and a generated target sample point is taken as the distance function, which gives a system of distance-function equations between the background sample points and a given data point of the augmented data. Adjacent equations of the system are subtracted, terms are transposed and coefficients combined, giving a system of linear equations in that point of the data to be generated. The solution for one point is generalized to all points of the data to be generated, giving a matrix equation in the low-dimensional augmented data; solving the matrix equation transforms the augmented data from the high-dimensional space to the original space and yields the augmented target sample data.

Further, transforming the background sample data from the original space into the high-dimensional space in Step 1 specifically comprises:

(1) the original data are divided into research samples and background samples. The number of background samples is N and the background sample points are $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$, where each sample point contains Q-dimensional data and the i-th sample is a row vector $x_{0i} = [x_{0i1}, x_{0i2}, \dots, x_{0iq}, \dots, x_{0iQ}]$;

(2) for each background sample data point $x_{0i}$, its squared Euclidean distance to every background sample data point is calculated, giving $d_{i,1}, d_{i,2}, \dots, d_{i,n}, \dots, d_{i,N}$, where $d_{i,n} = \|x_{0i} - x_{0n}\|_2^2 = (x_{0i1} - x_{0n1})^2 + (x_{0i2} - x_{0n2})^2 + \dots + (x_{0iq} - x_{0nq})^2 + \dots + (x_{0iQ} - x_{0nQ})^2$ $(1 \le i \le N,\ 1 \le n \le N)$ and $\|x_{0i} - x_{0n}\|_2$ denotes the L2 norm of $(x_{0i} - x_{0n})$. This finally gives the N-dimensional space sample data of the background samples, in which the i-th background sample is represented by the row vector $[d_{i,1}, d_{i,2}, \dots, d_{i,N}]$.
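The transformation of Step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the invention; the function name `squared_distance_embedding` and the toy data are assumptions.

```python
def squared_distance_embedding(samples):
    """Map each Q-dimensional sample to the N-dimensional vector of its
    squared Euclidean distances to all N samples (row i holds d_i,1..d_i,N)."""
    return [
        [sum((a - b) ** 2 for a, b in zip(xi, xn)) for xn in samples]
        for xi in samples
    ]


# Three toy background samples with Q = 2 features each.
background = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = squared_distance_embedding(background)
```

The result is an N × N symmetric array with a zero diagonal, so the dimension of the new space equals the number of background samples rather than the number of features.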

Further, generating the target sample data in the high-dimensional space in Step 2 specifically comprises:

(1) the histogram of the N data values in the high-dimensional space transformation of the background samples is computed for each dimension, the data of each dimension being divided into h bins;

(2) the number of samples in each bin is counted and recorded as $y_t$, a row vector holding the per-bin sample counts of the t-th dimension of the high-dimensional background data. The bin counts $y_t$ of that dimension are normalized by dividing them by the maximum count over all bins, $y_t' = y_t / \max(y_t)$;

(3) the normalized bin counts $y_t'$ are complemented and standardized, giving the probability distribution $p_t$ of the target samples;

(4) the number of target sample data points to be generated in each bin of that dimension is calculated as $k_t = M \times p_t$, where $k_t$ is a row vector holding the number of points to generate in each bin of the t-th dimension and M is the number of data points to be generated. In each bin, the number of data points given by the corresponding entry of $k_t$ is generated at random from a uniform distribution, and the generated target sample data are recorded as $l_{1,t}, l_{2,t}, \dots, l_{m,t}, \dots, l_{M,t}$;

(5) the above operations are applied to every dimension of the high-dimensional space transformation of the background samples, generating the sample data in each dimension of the high-dimensional space for the M data points to be augmented. The values are randomly shuffled within each dimension, giving the high-dimensional space sample data of the augmented data, in which the m-th generated point is the row vector $[l_{m,1}, l_{m,2}, \dots, l_{m,N}]$.
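Sub-steps (1) to (5) can be sketched in Python as follows. Several details are assumptions, because the formulas are not reproduced in this text: the bins are taken to be equal-width between the minimum and maximum of each dimension, the complement is taken as $1 - y_t'$, standardization divides by the sum, the counts $k_t = M \times p_t$ are rounded by the largest-remainder rule so that they total M, and a uniform distribution is used as a fallback when every bin is equally full. All function names are hypothetical.

```python
import random


def target_distribution(values, h):
    """Return the bin edges and target-sample probabilities for one dimension."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / h or 1.0  # guard against a constant dimension
    counts = [0] * h
    for v in values:
        counts[min(int((v - lo) / width), h - 1)] += 1  # max value goes in last bin
    peak = max(counts)
    complement = [1.0 - c / peak for c in counts]  # assumed form of the complement
    total = sum(complement)
    probs = [c / total for c in complement] if total else [1.0 / h] * h
    edges = [lo + i * width for i in range(h + 1)]
    return edges, probs


def generate_dimension(values, h, m, rng):
    """Generate m shuffled values for one dimension of the high-dimensional space."""
    edges, probs = target_distribution(values, h)
    raw = [m * p for p in probs]  # k_t = M * p_t before rounding
    k = [int(r) for r in raw]
    by_remainder = sorted(range(h), key=lambda i: raw[i] - k[i], reverse=True)
    for i in by_remainder[: m - sum(k)]:  # hand out the points lost to rounding
        k[i] += 1
    points = [rng.uniform(edges[i], edges[i + 1]) for i in range(h) for _ in range(k[i])]
    rng.shuffle(points)  # random shuffle within the dimension
    return points


def generate_targets(D, h, m, seed=0):
    """Return m generated points, each with one value per dimension of D."""
    rng = random.Random(seed)
    n_dims = len(D[0])
    columns = [generate_dimension([row[t] for row in D], h, m, rng) for t in range(n_dims)]
    return [[columns[t][j] for t in range(n_dims)] for j in range(m)]


D = [[0.0, 25.0, 100.0], [25.0, 0.0, 25.0], [100.0, 25.0, 0.0]]
L = generate_targets(D, h=4, m=10)
```

Because each dimension is sampled and shuffled independently, bins that hold many background samples receive few generated values, and the generated rows are not tied to any single background sample.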

Further, transforming the augmented data from the high-dimensional space to the original space in Step 3 specifically comprises:

(1) the M augmented sample points are denoted $x_1, x_2, \dots, x_m, \dots, x_M$, where each sample point contains Q-dimensional data and the i-th sample is a row vector $x_i = [x_{i1}, x_{i2}, \dots, x_{iq}, \dots, x_{iQ}]$. With the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the m-th sample point of the generated target samples and $x_{0n}$ is the n-th sample point of the background samples, the system of distance-function equations between the background sample points and the augmented data is obtained:

$$l_{m,n} = \sum_{q=1}^{Q} (x_{mq} - x_{0nq})^2, \quad n = 1, 2, \dots, N$$

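(2) the quadratic terms of the system of distance-function equations are expanded, and each pair of adjacent equations, the n-th and the (n+1)-th with both indices in the range 1 to N, is subtracted, giving the linear equation for the m-th data point of the generated augmented data:

$$l_{m,n} - l_{m,n+1} = \sum_{q=1}^{Q} \left[ 2\,(x_{0(n+1)q} - x_{0nq})\,x_{mq} + x_{0nq}^2 - x_{0(n+1)q}^2 \right]$$

After transposing terms and combining coefficients, this becomes:

$$\sum_{q=1}^{Q} 2\,(x_{0(n+1)q} - x_{0nq})\,x_{mq} = l_{m,n} - l_{m,n+1} + \sum_{q=1}^{Q} \left( x_{0(n+1)q}^2 - x_{0nq}^2 \right)$$

The system of equations is written as a matrix equation:

$$A x_m^{T} = b_m + c$$

where the n-th row of $A$ is $2\,(x_{0(n+1)} - x_{0n})$, the n-th entry of $b_m$ is $l_{m,n} - l_{m,n+1}$, and the n-th entry of $c$ is $\|x_{0(n+1)}\|_2^2 - \|x_{0n}\|_2^2$.

By solving the matrix equation, the point $x_m$ of the augmented data is calculated;

(3) the procedure for calculating the point $x_m$ of the augmented data is generalized to all M points, giving the matrix equation for the data points to be generated:

$$AX = B + C$$

where $X = [x_1^{T}, x_2^{T}, \dots, x_M^{T}]$, $B = [b_1, b_2, \dots, b_M]$ and $C = [c, c, \dots, c]$.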

Solving the above system of equations gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix A. The resulting data are the augmented data points, which completes the transformation of the augmented data from the high-dimensional space to the original space.
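The inverse transformation of Step 3 can be sketched in Python as below. The sketch rests on stated assumptions: row n of A is taken as $2(x_{0(n+1)} - x_{0n})$ with the matching right-hand side $l_{m,n} - l_{m,n+1} + \|x_{0(n+1)}\|^2 - \|x_{0n}\|^2$, which is one consistent sign convention, and the pseudo-inverse solution is computed through the normal equations, which coincides with it only when A has full column rank. The function names `solve_linear` and `recover_points` are hypothetical.

```python
def solve_linear(G, rhs):
    """Solve G z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    aug = [row[:] + [r] for row, r in zip(G, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("background points do not span the original space")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def recover_points(background, L):
    """Recover Q-dimensional points from rows of squared distances L[m][n]
    to the N background points, in the least-squares sense."""
    N, Q = len(background), len(background[0])
    sq = [sum(v * v for v in x) for x in background]
    # Assumed convention: row n of A is 2 * (x_0(n+1) - x_0n).
    A = [[2 * (background[n + 1][q] - background[n][q]) for q in range(Q)]
         for n in range(N - 1)]
    c = [sq[n + 1] - sq[n] for n in range(N - 1)]
    # Normal equations (A^T A) x = A^T (b + c).
    AtA = [[sum(A[n][i] * A[n][j] for n in range(N - 1)) for j in range(Q)]
           for i in range(Q)]
    points = []
    for l in L:
        rhs = [l[n] - l[n + 1] + c[n] for n in range(N - 1)]
        Atb = [sum(A[n][i] * rhs[n] for n in range(N - 1)) for i in range(Q)]
        points.append(solve_linear(AtA, Atb))
    return points


background = [[0.0, 0.0], [3.0, 4.0], [6.0, 0.0], [1.0, 5.0]]
# Squared distances from the point (2, 1) to the four background points.
L = [[5.0, 10.0, 17.0, 17.0]]
recovered = recover_points(background, L)
```

In this example the distance vector is exactly realizable, so the point (2, 1) is recovered up to rounding. Distance vectors generated dimension by dimension are in general not exactly realizable, in which case the result is a least-squares approximation.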

Another object of the invention is to provide a machine recognition system that uses the described data augmentation method based on high-dimensional space transformation.

Another object of the invention is to provide an image recognition system that uses the described data augmentation method based on high-dimensional space transformation.

The advantages and positive effects of the invention are as follows. In a machine learning model, a classifier trained on the original samples has poor classification performance because the number of positive samples is insufficient. By learning the distribution histogram of the negative samples, the invention expands the corresponding positive sample data set, resolves the imbalance between positive and negative sample data in the machine learning model, and improves classification performance, in particular substantially improving the classification accuracy for positive samples. The invention performs statistical analysis on the background samples (negative samples) to obtain the distribution of the target sample (positive sample) data to be generated and then generates the target samples. This solves the problem that conventional methods ignore boundary points and isolated points when generating target samples, improves the validity of the augmented data, and avoids problems such as the sample overlap and model overfitting caused by traditional methods that synthesize new target samples from a small number of samples.

Brief description of the drawings

Fig. 1 is a flow chart of the high-dimensional-space-based data augmentation method provided by an embodiment of the invention.

Fig. 2 is a diagram of the regions selected for sample spatial feature extraction, provided by an embodiment of the invention.

Fig. 3 is a flow chart of a specific implementation of generating augmented data in the data augmentation method based on high-dimensional space transformation provided by an embodiment of the invention.

Detailed description of the embodiments

In order to make the purpose, technical solution and advantages of the invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it.

By learning the distribution histogram of the negative samples, the invention expands the corresponding positive sample data set and resolves the imbalance between positive and negative sample data in machine learning models. Statistical analysis of the background samples (negative samples) yields the distribution of the target sample (positive sample) data to be generated, from which the target samples are generated, improving the validity of the augmented data.

The application principle of the invention is explained in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the data augmentation method based on high-dimensional space transformation provided by an embodiment of the invention comprises the following steps:

S101: the samples are pre-processed, and the background sample data are transformed from the original space into the high-dimensional space;

S102: the high-dimensional background sample data are counted and analysed by histogram to obtain the distribution of the target samples in the high-dimensional space, and high-dimensional target sample data are generated;

S103: a system of equations built from the distance function is used to transform the augmented data to the original space.

The application principle of the invention is further described below with reference to the accompanying drawings.

As shown in Fig. 3, the data augmentation method based on high-dimensional space transformation provided by an embodiment of the invention specifically comprises the following steps:

(1) the samples are pre-processed and the background sample data are transformed from the original space into the high-dimensional space, as follows:

(1a) the data used in this example are vascular cross-section images taken along the central axis direction in the human vascular system;

(1b) normal vessel cross-section images are chosen as background samples and vascular plaque cross-section images as target samples. The number of background samples obtained is denoted N, and the background sample points are $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$;

(1c) as shown in Fig. 2, with the centre point of the current background sample as the centre of the circles, samples are taken on circles at distances of 1, 3 and 5 voxels from the sample centre point. Starting from the innermost circle, the sampling angles are 90°, 45° and 30° in turn, giving 24 sampling regions;

(1d) features are extracted from the background sample. The average gray value of each region, that is, the mean gray level of all voxels in the region, gives 24 features $[x_{0i1}, x_{0i2}, \dots, x_{0i24}]$, where i denotes the i-th background sample. The mean curvature of each region is calculated and recorded as the curvature feature of the region, giving 24 features $[x_{0i25}, x_{0i26}, \dots, x_{0i48}]$. Texture features are obtained from the texture map produced by two-dimensional Gabor filtering at 90°, giving the feature vector $[x_{0i49}, x_{0i50}, \dots, x_{0i72}]$. The Hessian matrix of each point is calculated and its three eigenvalues, which represent the directions, are obtained, giving the feature vector $[x_{0i73}, x_{0i74}, \dots, x_{0i144}]$;

(1e) every background sample is sampled in this way and its feature vector is calculated, so that each background sample point has Q = 144 dimensional data made up of the four types of feature. The i-th sample is a row vector $x_{0i} = [x_{0i1}, x_{0i2}, \dots, x_{0iq}, \dots, x_{0iQ}]$;

(1f) for each background sample data point $x_{0i}$, its squared Euclidean distance to every background sample data point is calculated, giving $d_{i,1}, d_{i,2}, \dots, d_{i,n}, \dots, d_{i,N}$, where $d_{i,n} = \|x_{0i} - x_{0n}\|_2^2 = (x_{0i1} - x_{0n1})^2 + (x_{0i2} - x_{0n2})^2 + \dots + (x_{0iq} - x_{0nq})^2 + \dots + (x_{0iQ} - x_{0nQ})^2$ $(1 \le i \le N,\ 1 \le n \le N)$ and $\|x_{0i} - x_{0n}\|_2$ denotes the L2 norm of $(x_{0i} - x_{0n})$. This finally gives the N-dimensional space sample data of the background samples, in which the i-th background sample is represented by the row vector $[d_{i,1}, d_{i,2}, \dots, d_{i,N}]$.
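The region and feature counts stated in steps (1c) to (1e) can be checked with a short Python sketch. It assumes that each circle is divided into equal angular sectors of the stated size, which is one reading of Fig. 2; the names `RADII_VOXELS`, `ANGLE_STEPS_DEG`, `FEATURES_PER_REGION` and `sampling_regions` are hypothetical.

```python
RADII_VOXELS = (1, 3, 5)        # circle radii around the sample centre point
ANGLE_STEPS_DEG = (90, 45, 30)  # sampling angle on each circle, innermost first

# One gray-level, one curvature and one Gabor texture value per region,
# plus three Hessian eigenvalues per region.
FEATURES_PER_REGION = {"gray": 1, "curvature": 1, "gabor": 1, "hessian": 3}


def sampling_regions():
    """List (radius, start_angle, end_angle) for every sampling region."""
    regions = []
    for radius, step in zip(RADII_VOXELS, ANGLE_STEPS_DEG):
        for start in range(0, 360, step):
            regions.append((radius, start, start + step))
    return regions


regions = sampling_regions()
feature_count = len(regions) * sum(FEATURES_PER_REGION.values())
```

The three circles contribute 4, 8 and 12 regions respectively, which gives the 24 regions of step (1c), and six values per region give the Q = 144 features of step (1e).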

(2) the high-dimensional background sample data are analysed and the target sample data in the high-dimensional space are generated, as follows:

(2a) the histogram of the N data values in the high-dimensional space transformation of the background samples is computed for each dimension, the data of each dimension being divided into h bins;

(2b) the number of samples in each bin is counted and recorded as $y_t$, a row vector holding the per-bin sample counts of the t-th dimension of the high-dimensional background data. The bin counts $y_t$ of that dimension are normalized by dividing them by the maximum count over all bins, $y_t' = y_t / \max(y_t)$;

(2c) the normalized bin counts $y_t'$ are complemented and standardized, giving the probability distribution $p_t$ of the target samples;

(2d) the number of target sample data points to be generated in each bin of that dimension is calculated as $k_t = M \times p_t$, where $k_t$ is a row vector holding the number of points to generate in each bin of the t-th dimension and M is the number of data points to be generated. In each bin, the number of data points given by the corresponding entry of $k_t$ is generated at random from a uniform distribution, and the generated target sample data are recorded as $l_{1,t}, l_{2,t}, \dots, l_{m,t}, \dots, l_{M,t}$;

(2e) the above operations are applied to every dimension of the high-dimensional space transformation of the background samples, generating the sample data in each dimension of the high-dimensional space for the M data points to be augmented. The values are randomly shuffled within each dimension, giving the high-dimensional space sample data of the augmented data, in which the m-th generated point is the row vector $[l_{m,1}, l_{m,2}, \dots, l_{m,N}]$.

(3) the augmented data are transformed from the high-dimensional space to the original space, as follows:

(3a) the M augmented sample points are denoted $x_1, x_2, \dots, x_m, \dots, x_M$, where each sample point contains Q-dimensional data and the i-th sample is a row vector $x_i = [x_{i1}, x_{i2}, \dots, x_{iq}, \dots, x_{iQ}]$. With the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the m-th sample point of the generated target samples and $x_{0n}$ is the n-th sample point of the background samples, the system of distance-function equations between the background sample points and the augmented data is obtained:

$$l_{m,n} = \sum_{q=1}^{Q} (x_{mq} - x_{0nq})^2, \quad n = 1, 2, \dots, N$$

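(3b) the quadratic terms of the system of distance-function equations are expanded, and each pair of adjacent equations, the n-th and the (n+1)-th with both indices in the range 1 to N, is subtracted, giving the linear equation for the m-th data point of the generated augmented data:

$$l_{m,n} - l_{m,n+1} = \sum_{q=1}^{Q} \left[ 2\,(x_{0(n+1)q} - x_{0nq})\,x_{mq} + x_{0nq}^2 - x_{0(n+1)q}^2 \right]$$

After transposing terms and combining coefficients, this becomes:

$$\sum_{q=1}^{Q} 2\,(x_{0(n+1)q} - x_{0nq})\,x_{mq} = l_{m,n} - l_{m,n+1} + \sum_{q=1}^{Q} \left( x_{0(n+1)q}^2 - x_{0nq}^2 \right)$$

The system of equations can be written as a matrix equation:

$$A x_m^{T} = b_m + c$$

where the n-th row of $A$ is $2\,(x_{0(n+1)} - x_{0n})$, the n-th entry of $b_m$ is $l_{m,n} - l_{m,n+1}$, and the n-th entry of $c$ is $\|x_{0(n+1)}\|_2^2 - \|x_{0n}\|_2^2$.

By solving the matrix equation, the point $x_m$ of the augmented data can be calculated;

(3c) the procedure for calculating the point $x_m$ of the augmented data is generalized to all M points, giving the matrix equation for the data points to be generated:

$$AX = B + C$$

where $X = [x_1^{T}, x_2^{T}, \dots, x_M^{T}]$, $B = [b_1, b_2, \dots, b_M]$ and $C = [c, c, \dots, c]$.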

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A data augmentation method based on high-dimensional space transformation, characterized in that the method transforms background sample data from the original space into a high-dimensional space; obtains the distribution of the target samples in the high-dimensional space from the distribution histogram of the background samples and generates high-dimensional target sample data; and uses a system of equations built from a distance function to transform the augmented data from the high-dimensional space to the original space.
2. The data augmentation method based on high-dimensional space transformation as claimed in claim 1, characterized in that the method comprises the following steps:

Step 1: the data samples are divided into positive samples and negative samples, the positive samples being the target samples and the negative samples being the background samples; the squared Euclidean distance between each background sample and every background sample is calculated, giving the high-dimensional space transformation of the background samples and thereby transforming the background sample data from the original space into the high-dimensional space;

Step 2: the histogram of the high-dimensional background samples is computed for each dimension, and the sample distribution of each dimension is normalized; the complement of the normalized background histogram is taken to obtain the histogram distribution of the target samples in each dimension, which is standardized to give the probability distribution of the target samples; from the probability distribution in each dimension, the number of sample points to be generated in that dimension and their value ranges are obtained; preliminary target sample data are generated from the probability distribution of each dimension, and the order of the values within each dimension is randomly shuffled to produce the target sample data in the high-dimensional space;

Step 3: the distance between a background sample point and a generated target sample point is taken as the distance function, which gives a system of distance-function equations between the background sample points and a given data point of the augmented data; adjacent equations of the system are subtracted, terms are transposed and coefficients combined, giving a system of linear equations in that point of the data to be generated; the solution for one point is generalized to all points of the data to be generated, giving a matrix equation in the low-dimensional augmented data; solving the matrix equation transforms the augmented data from the high-dimensional space to the original space and yields the augmented target sample data.
3. The data augmentation method based on high-dimensional space transformation as claimed in claim 2, characterized in that transforming the background sample data from the original space into the high-dimensional space in Step 1 specifically comprises:

(1) the original data are divided into research samples and background samples; the number of background samples is N and the background sample points are $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$, where each sample point contains Q-dimensional data and the i-th sample is a row vector $x_{0i} = [x_{0i1}, x_{0i2}, \dots, x_{0iq}, \dots, x_{0iQ}]$;

(2) for each background sample data point $x_{0i}$, its squared Euclidean distance to every background sample data point is calculated, giving $d_{i,1}, d_{i,2}, \dots, d_{i,n}, \dots, d_{i,N}$, where $d_{i,n} = \|x_{0i} - x_{0n}\|_2^2 = (x_{0i1} - x_{0n1})^2 + (x_{0i2} - x_{0n2})^2 + \dots + (x_{0iq} - x_{0nq})^2 + \dots + (x_{0iQ} - x_{0nQ})^2$ $(1 \le i \le N,\ 1 \le n \le N)$ and $\|x_{0i} - x_{0n}\|_2$ denotes the L2 norm of $(x_{0i} - x_{0n})$; this finally gives the N-dimensional space sample data of the background samples, in which the i-th background sample is represented by the row vector $[d_{i,1}, d_{i,2}, \dots, d_{i,N}]$.
4. The data augmentation method based on high-dimensional space transformation as claimed in claim 2, characterized in that generating the target sample data in the high-dimensional space in Step 2 specifically comprises:

(1) the histogram of the N data values in the high-dimensional space transformation of the background samples is computed for each dimension, the data of each dimension being divided into h bins;

(2) the number of samples in each bin is counted and recorded as $y_t$, a row vector holding the per-bin sample counts of the t-th dimension of the high-dimensional background data; the bin counts $y_t$ of that dimension are normalized by dividing them by the maximum count over all bins, $y_t' = y_t / \max(y_t)$;

(3) the normalized bin counts $y_t'$ are complemented and standardized, giving the probability distribution $p_t$ of the target samples;

(4) the number of target sample data points to be generated in each bin of that dimension is calculated as $k_t = M \times p_t$, where $k_t$ is a row vector holding the number of points to generate in each bin of the t-th dimension and M is the number of data points to be generated; in each bin, the number of data points given by the corresponding entry of $k_t$ is generated at random from a uniform distribution, and the generated target sample data are recorded as $l_{1,t}, l_{2,t}, \dots, l_{m,t}, \dots, l_{M,t}$;

(5) the above operations are applied to every dimension of the high-dimensional space transformation of the background samples, generating the sample data in each dimension of the high-dimensional space for the M data points to be augmented; the values are randomly shuffled within each dimension, giving the high-dimensional space sample data of the augmented data, in which the m-th generated point is the row vector $[l_{m,1}, l_{m,2}, \dots, l_{m,N}]$.
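5. The data augmentation method based on high-dimensional space transformation as claimed in claim 2, characterized in that transforming the augmented data from the high-dimensional space to the original space in Step 3 specifically comprises:

(1) the M augmented sample points are denoted $x_1, x_2, \dots, x_m, \dots, x_M$, where each sample point contains Q-dimensional data and the i-th sample is a row vector $x_i = [x_{i1}, x_{i2}, \dots, x_{iq}, \dots, x_{iQ}]$; with the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the m-th sample point of the generated target samples and $x_{0n}$ is the n-th sample point of the background samples, the system of distance-function equations between the background sample points and the augmented data is obtained:

$$l_{m,n} = \sum_{q=1}^{Q} (x_{mq} - x_{0nq})^2, \quad n = 1, 2, \dots, N$$

(2) the quadratic terms of the system of distance-function equations are expanded, and each pair of adjacent equations, the n-th and the (n+1)-th with both indices in the range 1 to N, is subtracted, giving the linear equation for the m-th data point of the generated augmented data:

$$l_{m,n} - l_{m,n+1} = \sum_{q=1}^{Q} \left[ 2\,(x_{0(n+1)q} - x_{0nq})\,x_{mq} + x_{0nq}^2 - x_{0(n+1)q}^2 \right]$$

After transposing terms and combining coefficients, this becomes:

$$\sum_{q=1}^{Q} 2\,(x_{0(n+1)q} - x_{0nq})\,x_{mq} = l_{m,n} - l_{m,n+1} + \sum_{q=1}^{Q} \left( x_{0(n+1)q}^2 - x_{0nq}^2 \right)$$

The system of equations is written as a matrix equation:

$$A x_m^{T} = b_m + c$$

where the n-th row of $A$ is $2\,(x_{0(n+1)} - x_{0n})$, the n-th entry of $b_m$ is $l_{m,n} - l_{m,n+1}$, and the n-th entry of $c$ is $\|x_{0(n+1)}\|_2^2 - \|x_{0n}\|_2^2$.

By solving the matrix equation, the point $x_m$ of the augmented data is calculated;

(3) the procedure for calculating the point $x_m$ of the augmented data is generalized to all M points, giving the matrix equation for the data points to be generated:

$$AX = B + C$$

where $X = [x_1^{T}, x_2^{T}, \dots, x_M^{T}]$, $B = [b_1, b_2, \dots, b_M]$ and $C = [c, c, \dots, c]$.

Solving the above system of equations gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix A; the resulting data are the augmented data points, which completes the transformation of the augmented data from the high-dimensional space to the original space.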
6. A machine recognition system using the data augmentation method based on high-dimensional space transformation as claimed in any one of claims 1 to 5.
7. An image recognition system using the data augmentation method based on high-dimensional space transformation as claimed in any one of claims 1 to 5.