CN115169470A - High-dimensional small sample data expansion method based on acceptable region

Info

Publication number: CN115169470A
Application number: CN202210840445.5A
Authority: CN
Inventors: 陈志文; 邸若海; 吕志刚; 王鹏; 李晓艳; 贺楚超; 张玉芳; 陈晨
Original assignee: Xian Technological University
Current assignee: Xian Technological University
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2022-10-11

Abstract

The invention relates to a high-dimensional small-sample data expansion method based on an acceptable region. It addresses the following problems of the prior art: reasonable virtual data are difficult to generate, the feature combinations of virtual data can be wrong, and it is difficult to define an effective range that limits virtual data generation. The method comprises the following steps. Step 1: analyze the data and determine the overall trend of the data. Step 2: determine the distribution trend among the input features. Step 3: define an acceptable region: an acceptable region Q is defined for each feature and consists of two parts, a generalized allowable range Q_a and an interaction existence range Q_β. Step 4: generate virtual data: first, sampling is performed within the acceptable region Q_X of the input feature space from a multivariate joint probability distribution based on the small sample and the acceptable region; the samples are then mapped to the output feature space through the relationship between y^q and X, finally forming virtual data within the acceptable region Q_Y of the output feature space.

Description

High-dimensional small sample data expansion method based on acceptable region

Technical field:

The invention belongs to the technical field of virtual data expansion and relates to a high-dimensional small-sample data expansion method based on an acceptable region. Given a small sample dataset, the method can be used to generate virtual samples for time series in practical problems and to build models using the generated samples.

Background art:

With small samples, the amount of data is too scarce to build an effective machine learning model. Two technical approaches are mainly used to address this problem: data expansion and model optimization. The invention belongs to the data expansion approach.

The mainstream data expansion methods at present are distribution-based virtual sample expansion and prior-knowledge-based virtual sample expansion. Distribution-based techniques include Bootstrap, mega-trend diffusion (MTD), and the like. Bootstrap is a resampling technique. Its advantage is that the true distribution can be simulated by repeated sampling; its disadvantage is that no new samples are generated, since the method only resamples the original sample set. MTD is a well-established virtual sample expansion technique in some scenarios, but because it generates each input feature separately, the generated features can be combined incorrectly, which lowers the validity of the virtual samples. Prior-knowledge-based virtual sample expansion generates virtual data either from existing prior knowledge or from knowledge extracted from the limited samples. When prior knowledge is used, its accuracy directly determines the quality of the virtual samples, so whether accurate prior knowledge can be obtained becomes the deciding factor in using such methods. When knowledge is extracted from the limited samples, the open problems are how to extract effective knowledge and whether the extracted knowledge suits the class of objects at hand.
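The limitation of Bootstrap noted above can be seen in a few lines of Python. The sketch uses only the standard library; the sample values and the function name `bootstrap_resample` are hypothetical and serve only as illustration.

```python
import random

def bootstrap_resample(sample, seed=0):
    """Draw len(sample) values from `sample` with replacement."""
    rng = random.Random(seed)
    return rng.choices(sample, k=len(sample))

# Hypothetical small sample (illustrative values only).
original = [1.21, 1.35, 1.48, 1.52, 1.77]
resampled = bootstrap_resample(original)

# Every resampled value already exists in the original set:
# Bootstrap reweights the data but creates no new points.
only_existing_values = set(resampled) <= set(original)
```

Whatever the seed, `only_existing_values` is true, which is why resampling alone cannot fill the gaps between the observed samples.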

Summary of the invention:

The invention aims to provide a high-dimensional small-sample data expansion method based on an acceptable region. It addresses the following problems of the prior art: reasonable virtual data are difficult to generate, the feature combinations of virtual data can be wrong, and it is difficult to define an effective range that limits virtual data generation. The method avoids combination errors when the virtual input features are generated, avoids the uncertainty introduced by various intermediate models, and can effectively generate, in a high-dimensional space, virtual data that conform to the data characteristics of the small sample.

To achieve this purpose, the invention adopts the following technical solution:

A high-dimensional small-sample data expansion method based on an acceptable region, characterized in that the method comprises the following steps:

Step 1: analyze the data and determine the overall trend of the data;

Step 2: determine the distribution trend among the input features;

Step 3: define an acceptable region:

An acceptable region Q is defined for each feature. The acceptable region consists of two parts: a generalized allowable range Q_a and an interaction existence range Q_β;

The relationship between the acceptable region Q, the generalized allowable range Q_a, and the interaction existence range Q_β is:

Q = Q_a ∩ Q_β

Step 4: generate virtual data:

First, sampling is performed within the acceptable region Q_X of the input feature space from a multivariate joint probability distribution based on the small sample and the acceptable region; the samples are then mapped to the output feature space through the relationship between y^q and X, finally forming virtual data within the acceptable region Q_Y of the output feature space.

Step 3 comprises the following steps:

3.1 setting a generalized allowable range;

3.2 setting the existence range of the mutual influence;

3.3 determining an acceptable area of the input feature;

3.4 determining an acceptable region of the output characteristic.

Step 4 comprises the following steps:

4.1 Generate virtual input features within the acceptable region Q_X:

Within the acceptable region Q_X obtained in step 3, a skewed distribution is constructed for each input feature, with the input feature value as the mode, the distribution trend as the median, and the range limited to the acceptable region of the input feature. Combining all input features gives a multivariate joint probability distribution U. Finally, a smooth transition is made between the discrete values, so that for any y^q in the acceptable region Q_Y of the output feature there is a corresponding joint probability distribution within the acceptable region Q_X of the input features, from which samples are drawn;

4.2 According to the relationship between y and x^1, x^2 obtained in step 1, calculate the output feature y'_i corresponding to each group of virtual input features,

Y'_i = {y'_i | 1 ≤ i ≤ n'};

4.3 Finally obtain the virtual sample set D',

D' = {(X'_i, Y'_i) | 1 ≤ i ≤ n', i ∈ N}.

In step 1, the overall trend of each output feature with respect to all input feature data is described by least-squares fitting.

In step 2, the distribution trend of the input features in the data space is described by least-squares linear fitting.

Compared with the prior art, the invention has the following advantages and effects:

(1) Compared with existing methods, the virtual data expansion method for high-dimensional small samples determines the effective range of virtual data generation by defining the acceptable region, so that the virtual data generated within this region are closer to the real samples;

(2) By sampling in the high-dimensional space, the invention avoids combination errors when generating virtual input features, which greatly increases the validity of the generated virtual data and prevents virtual samples from lying too far from the real samples;

(3) The invention describes the trend of the data by the least-squares method and, on this basis, generates virtual samples from the joint probability distribution, thereby avoiding the uncertainty introduced by various intermediate models and effectively generating, in a high-dimensional space, virtual data that conform to the data characteristics of the small sample.

Description of the drawings:

FIG. 1 is an overall flow diagram of the present invention;

FIG. 2 is a diagram of an original full sample set and a small sample set extracted from an experimental subject;

FIG. 3 is an overall trend for a small sample dataset;

FIG. 4 is a distribution trend (in different coordinate systems) of the input features F1 and F2;

FIG. 5 is a generalized allowable range (in different coordinate systems) of the input features F1, F2;

FIG. 6 shows the interaction existence ranges (in different coordinate systems) of the input features F1 and F2;

FIG. 7 is an acceptable area (in different coordinate systems) for the input features F1, F2;

FIG. 8 shows the acceptable regions (in the same coordinate system) of the input features F1 and F2;

FIG. 9 is a depiction of a virtual input feature generated within an input feature acceptable area;

FIG. 10 is a comparison of virtual input features with original full sample input features;

FIG. 11 is a virtual sample generated from a small sample set;

FIG. 12 is a comparison of the virtual samples with the original full samples;

FIG. 13 is a comparison of the overall trends for different data sets.

Detailed description:

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.

The invention relates to a high-dimensional small-sample data expansion method based on an acceptable region. It determines the effective range of virtual data generation by defining the acceptable region, and it resolves combination errors among the input features of the virtual data by sampling in the high-dimensional space. First, the overall trend of the data and the distribution trend of the input features are described by the least-squares method. Next, the generalized allowable range of the data is determined from prior knowledge, and the interaction existence range of the data is determined by either a deviation-limiting method or a deviation-expanding method, chosen according to how the data are distributed; together these define the acceptable region and limit the range of virtual data generation. Finally, virtual data are generated within the acceptable region by a suitable sampling scheme that follows the overall trend of the data. The invention mainly addresses the following problems: (1) reasonable virtual data are difficult to generate when little prior knowledge is available; (2) virtual data generated by single-dimensional sampling contain feature combination errors; (3) it is difficult to define an effective range that limits virtual data generation.

The invention specifically comprises the following steps:

Step 1: Analyze the data and determine the overall trend of the data.

The overall trend of each output feature with respect to all input feature data is described by least-squares fitting.

Step 2: Determine the distribution trend among the input features.

The distribution trend of the input features in the data space is described by least-squares fitting.

Step 3: Define an acceptable region.

An acceptable region Q is defined for each feature. The acceptable region consists of two parts: a generalized allowable range Q_a and an interaction existence range Q_β.

The generalized allowable range only defines the theoretical upper and lower limits of each feature. It is usually determined by physical factors such as material and form, and is an objective limit. In particular, for some production-line products each parameter has a rated range and a maximum fluctuation range; the maximum fluctuation range is the generalized allowable range referred to here.

The interaction existence range must be determined from the distribution among the features. The reason is that correlations may exist among several input features, so that one input feature changes as other input features change, and the output feature changes with the input features as well.

Therefore, the relationship between the acceptable region Q, the generalized allowable range Q_a, and the interaction existence range Q_β is:

Q = Q_a ∩ Q_β

Step 4: Generate virtual data.

Virtual data can be generated in two ways. One is to sample directly within the acceptable region Q. The other is to first sample, within the acceptable region Q_X of the input feature space, from a multivariate joint probability distribution based on the small sample and the acceptable region, and then map the samples to the output feature space through the relationship between y^q and X, finally forming virtual data within the acceptable region Q_Y of the output feature space.

Example:

Referring to FIG. 1, the basic idea of the method is to determine the overall trend of the data and the distribution trend of the input features through data analysis, then define an acceptable region in the data space to limit the range of virtual data generation, and finally generate virtual data according to the overall trend of the data and the acceptable region.

Based on this basic idea, the invention provides a high-dimensional small-sample data expansion method based on an acceptable region, comprising the following steps:

Step 1: Analyze the data and determine the overall trend of the data

Consider an existing original full sample set with n^# = 326 samples and a small sample dataset D = {(X_i, Y_i) | 1 ≤ i ≤ n, i ∈ N}, n = 34, extracted from it. The data are visualized in FIG. 2, from which the overall trend of the data can be judged: capacity shows a clear overall upward trend with increasing F1 and decreasing F2.

The least-squares method is used to obtain the overall trend of the output feature y with respect to all input feature data X, that is, to describe the relationship between y and x^1, x^2, as shown in FIG. 3.

y = f(X) = f(x^1, x^2) = 0.9452 + 0.0003*(x^1) - 0.0001*(x^2)
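The fit in Step 1 can be sketched as an ordinary least-squares plane fit. The sketch below uses NumPy; the function name `fit_overall_trend` is hypothetical, and the data are synthetic and noise-free, generated from the coefficients quoted above rather than taken from the patent's samples.

```python
import numpy as np

def fit_overall_trend(X, y):
    """Ordinary least-squares fit of y = b0 + b1*x1 + ... + bk*xk.

    X: (n, k) array of input features; y: (n,) array of outputs.
    Returns the coefficient vector [b0, b1, ..., bk].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the intercept b0 is estimated as well.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Synthetic illustration only (hypothetical data, n = 34 as in the example).
rng = np.random.default_rng(0)
x1 = rng.uniform(1400, 3600, size=34)
x2 = rng.uniform(1100, 1600, size=34)
y = 0.9452 + 0.0003 * x1 - 0.0001 * x2
coef = fit_overall_trend(np.column_stack([x1, x2]), y)
```

Because the synthetic outputs lie exactly on the plane, `coef` recovers the intercept and the two slopes; with real measurements the same call returns the best-fitting plane in the least-squares sense.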

Step 2: Determine the distribution trend among the input features

A univariate linear relationship is used to describe the overall distribution trend between x^1 and x^2, as shown in FIG. 4.

F(x^1, x^2)

Then for x^1 we have

Then for x^2 we have

Step 3: Define an acceptable region

3.1 Set the generalized allowable range

The object in this example is a lithium-ion battery, whose features have a definite meaning. From prior knowledge, the generalized allowable range of x^1 is [1400, 3600], that of x^2 is [1100, 1600], and that of y is [0, 2], as shown in FIG. 5.
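As a minimal sketch, the generalized allowable ranges quoted above amount to a per-feature box check. The dictionary layout and the function name `in_generalized_range` are illustrative assumptions.

```python
# Generalized allowable ranges from the example (x1, x2: inputs; y: output).
GENERALIZED_RANGE = {
    "x1": (1400.0, 3600.0),
    "x2": (1100.0, 1600.0),
    "y": (0.0, 2.0),
}

def in_generalized_range(sample):
    """Return True if every feature of `sample` lies within its allowable range."""
    return all(
        low <= sample[name] <= high
        for name, (low, high) in GENERALIZED_RANGE.items()
    )

# Hypothetical points: the first is inside the box, the second has x1 too large.
inside = in_generalized_range({"x1": 2000.0, "x2": 1500.0, "y": 1.4})
outside = in_generalized_range({"x1": 3800.0, "x2": 1500.0, "y": 1.4})
```

This check covers only the objective limits; the interaction existence range narrows the region further.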

3.2 Set the interaction existence range

In this example there are two input features, so the interaction existence range of each input feature must be calculated, as well as the interaction existence range of the output feature y.

3.2.1 Calculate the interaction existence range of x^1 using the deviation-limiting method.

For each sample, calculate the distance between its x^1 value and the corresponding value on the distribution trend, record this deviation, and take the maximum and minimum deviations. In this example, a margin of 30% of the maximum and minimum deviations is added. The interaction existence range of x^1 is then bounded by the distribution trend shifted by the resulting lower and upper deviation limits.

3.2.2 Calculate the interaction existence range of x^2 using the deviation-limiting method.

For each sample, calculate the distance between its x^2 value and the corresponding value on the distribution trend, record this deviation, and take the maximum and minimum deviations. The interaction existence range of x^2 is then bounded by the distribution trend shifted by these lower and upper deviation limits.

The interaction existence ranges of x^1 and x^2 are shown in FIG. 6, and their acceptable regions are shown in FIG. 7.

3.2.3 Calculate the intersection of the interaction existence ranges of the two input features. To make the intersection easy to calculate, the two interaction existence ranges are converted to the same coordinate system.

Since the relationship between x^1 and x^2 is known, both ranges are expressed in the coordinate system with x^1 as the abscissa and x^2 as the ordinate, and the interaction existence range of x^1 is converted accordingly. The interaction existence range Q_Xβ of the input features is the intersection of the two ranges in this coordinate system.

3.2.4 Calculate the interaction existence range of the output feature y using the deviation-limiting method.

For each y_i, calculate its distance from f(X_i), denoted e_yi, and take the maximum and minimum values of e_yi.

The interaction existence range Q_Yβ of y is then:

[0.9452 + 0.0003*(x^1) - 0.0001*(x^2) - 0.02, 0.9452 + 0.0003*(x^1) - 0.0001*(x^2) + 0.03]

that is,

[0.9252 + 0.0003*(x^1) - 0.0001*(x^2), 0.9752 + 0.0003*(x^1) - 0.0001*(x^2)]
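The deviation-limit calculation for y can be sketched as follows. The function names are hypothetical, and the samples in the usage lines are made up so that the resulting limits match the -0.02 and +0.03 quoted above; no additional margin is applied in this sketch.

```python
def f(x1, x2):
    # Overall trend fitted in Step 1 of the example.
    return 0.9452 + 0.0003 * x1 - 0.0001 * x2

def deviation_limits(X, y):
    """Smallest and largest signed deviation e_i = y_i - f(X_i)."""
    deviations = [yi - f(x1, x2) for (x1, x2), yi in zip(X, y)]
    return min(deviations), max(deviations)

def interaction_range_y(x1, x2, e_min, e_max):
    """Interval [f(X) + e_min, f(X) + e_max] of plausible y at the point X."""
    centre = f(x1, x2)
    return centre + e_min, centre + e_max

# Hypothetical samples whose deviations are -0.02, 0.0 and +0.03.
X = [(2000.0, 1500.0), (2500.0, 1400.0), (3000.0, 1250.0)]
y = [f(*X[0]) - 0.02, f(*X[1]), f(*X[2]) + 0.03]
e_min, e_max = deviation_limits(X, y)
low, high = interaction_range_y(2000.0, 1500.0, e_min, e_max)
```

At x^1 = 2000 and x^2 = 1500 the trend value is 1.3952, so the interval is approximately [1.3752, 1.4252], in agreement with the closed-form bounds above.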

3.3 Determine the acceptable region of the input features

The intersection of the interaction existence range Q_Xβ of the input features with the generalized allowable range [1400, 3600] of x^1 and the generalized allowable range [1100, 1600] of x^2 gives the final acceptable region Q_X of the input features, as shown in FIG. 8.

3.4 Determine the acceptable region of the output feature

The intersection of the interaction existence range Q_Yβ of the output feature with the generalized allowable range [0, 2] gives the final acceptable region Q_Y of the output feature.

Step 4: Generate virtual data

In this example, sampling is first performed within the acceptable region Q_X of the input feature space; the samples are then mapped to the output feature space through the relationship between y^q and X, and virtual data are generated within the acceptable region Q_Y of the output feature space.

4.1 Generate virtual input features within the acceptable region Q_X

Within the acceptable region Q_X obtained in step 3, a skewed distribution is constructed for each input feature, with the input feature value as the mode, the distribution trend as the median, and the range limited to the acceptable region of the input feature. Combining all input features gives a multivariate joint probability distribution U. Finally, a smooth transition is made between the discrete values, so that for any y^q in the acceptable region Q_Y of the output feature there is a corresponding joint probability distribution within the acceptable region Q_X of the input features, from which samples are drawn. The resulting virtual input features X' are shown in FIG. 9, and their comparison with the input features of the original full sample set is shown in FIG. 10.
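The text does not pin down the exact form of the skewed distribution, so the following is only a rough sketch of sampling inside Q_X. It swaps in a triangular distribution for the deviation from the trend and uses rejection sampling to stay inside the acceptable region. The trend line `x2_trend` and the deviation limits `E_MIN`, `E_MAX` are hypothetical; only the two generalized allowable ranges come from the example.

```python
import random

X1_RANGE = (1400.0, 3600.0)  # generalized allowable range of x1 (from the example)
X2_RANGE = (1100.0, 1600.0)  # generalized allowable range of x2 (from the example)
E_MIN, E_MAX = -60.0, 80.0   # hypothetical deviation limits around the trend

def x2_trend(x1):
    # Hypothetical distribution trend between the two input features.
    return 2000.0 - 0.25 * x1

def in_acceptable_region(x1, x2):
    """Q_X = generalized allowable box intersected with the band around the trend."""
    in_box = (X1_RANGE[0] <= x1 <= X1_RANGE[1]
              and X2_RANGE[0] <= x2 <= X2_RANGE[1])
    deviation = x2 - x2_trend(x1)
    return in_box and E_MIN <= deviation <= E_MAX

def sample_virtual_inputs(n, seed=0):
    """Draw n pairs (x1, x2) jointly, rejecting any pair outside Q_X."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x1 = rng.uniform(*X1_RANGE)
        # Asymmetric triangular deviation with its peak on the trend line.
        deviation = rng.triangular(E_MIN, E_MAX, 0.0)
        x2 = x2_trend(x1) + deviation
        if in_acceptable_region(x1, x2):
            samples.append((x1, x2))
    return samples

virtual_X = sample_virtual_inputs(100)
```

Drawing x^2 conditionally on x^1 keeps the two features consistent with each other, which is the point of sampling jointly rather than generating each feature on its own.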

4.2 According to the relationship between y and x^1, x^2 obtained in step 1, calculate the output feature y'_i corresponding to each group of virtual input features:

y'_i = f(X'_i) + μ

where μ ~ U(-0.02, 0.03), which gives Y'_i.

Y'_i = {y'_i | 1 ≤ i ≤ n'}

4.3 Finally, the virtual sample set D' is obtained.

D' = {(X'_i, Y'_i) | 1 ≤ i ≤ n', i ∈ N}

The virtual samples are shown in FIG. 11, and their comparison with the original full samples is shown in FIG. 12.
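Steps 4.2 and 4.3 can be sketched together, assuming each output is the Step 1 trend value plus a uniform perturbation μ ~ U(-0.02, 0.03). The function name `generate_virtual_samples` and the three input pairs are hypothetical; the trend coefficients and the range [0, 2] come from the example.

```python
import random

Y_RANGE = (0.0, 2.0)           # generalized allowable range of y
MU_LOW, MU_HIGH = -0.02, 0.03  # bounds of the uniform perturbation

def f(x1, x2):
    # Overall trend fitted in Step 1 of the example.
    return 0.9452 + 0.0003 * x1 - 0.0001 * x2

def generate_virtual_samples(virtual_X, seed=0):
    """Return D' as a list of ((x1, x2), y') with y' = f(x1, x2) + mu."""
    rng = random.Random(seed)
    samples = []
    for x1, x2 in virtual_X:
        y = f(x1, x2) + rng.uniform(MU_LOW, MU_HIGH)
        # Keep only outputs inside the allowable range, so y' stays in Q_Y.
        if Y_RANGE[0] <= y <= Y_RANGE[1]:
            samples.append(((x1, x2), y))
    return samples

# Hypothetical virtual inputs, assumed to lie inside Q_X already.
virtual_X = [(2000.0, 1500.0), (2500.0, 1400.0), (3000.0, 1250.0)]
D_virtual = generate_virtual_samples(virtual_X)
```

For inputs inside the generalized allowable ranges of the example, the trend value plus the perturbation stays below 2, so the range check never discards a sample there; it is kept to mirror the intersection that defines Q_Y.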

To verify whether the generated virtual sample set is valid, the small sample set, the virtual sample set, and the original full sample set are compared on statistical indicators, and the overall trend of each sample set is estimated by the least-squares method for comparison. Table 1 compares the mean and standard deviation of the features.

Table 1. Mean and standard deviation of the features

Table 2 compares the overall trends of the different sample sets and corresponds to FIG. 13.

Table 2. Overall trends of the different sample sets
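The kind of comparison summarized in Table 1 can be sketched with the standard library. The rows below are hypothetical and are not the patent's data; the function name `feature_summary` is illustrative, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def feature_summary(samples):
    """Per-feature (mean, sample standard deviation) for rows like (x1, x2, y)."""
    return [(mean(column), stdev(column)) for column in zip(*samples)]

# Hypothetical rows (x1, x2, y).
small_set = [(2000.0, 1500.0, 1.40), (2400.0, 1400.0, 1.52), (2800.0, 1300.0, 1.66)]
virtual_set = [(2100.0, 1480.0, 1.43), (2500.0, 1380.0, 1.55), (2700.0, 1320.0, 1.62)]

small_stats = feature_summary(small_set)
virtual_stats = feature_summary(virtual_set)

# Absolute gap between the two sets' means, feature by feature.
mean_gaps = [abs(s[0] - v[0]) for s, v in zip(small_stats, virtual_stats)]
```

Small gaps in mean and standard deviation, together with similar fitted trend coefficients, would indicate that the virtual set follows the same distribution as the reference set.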

The above describes only preferred embodiments of the invention and is not intended to limit its scope; all equivalent structural changes made using the contents of the specification and drawings of the invention fall within the scope of protection of the invention.

Claims

1. A high-dimensional small-sample data expansion method based on an acceptable region, characterized in that the method comprises the following steps:

Step 1: analyze the data and determine the overall trend of the data;

Step 2: determine the distribution trend among the input features;

Step 3: define an acceptable region:

an acceptable region Q is defined for each feature, the acceptable region consisting of two parts: a generalized allowable range Q_a and an interaction existence range Q_β;

the relationship between the acceptable region Q, the generalized allowable range Q_a, and the interaction existence range Q_β is:

Q = Q_a ∩ Q_β

Step 4: generate virtual data:

first, sampling is performed within the acceptable region Q_X of the input feature space from a multivariate joint probability distribution based on the small sample and the acceptable region; the samples are then mapped to the output feature space through the relationship between y^q and X, finally forming virtual data within the acceptable region Q_Y of the output feature space.

2. The high-dimensional small-sample data expansion method based on an acceptable region according to claim 1, characterized in that step 3 comprises the following steps:

3.1 setting a generalized allowable range;

3.2 setting the existence range of the mutual influence;

3.3 determining an acceptable area of the input feature;

3.4 determining an acceptable region of the output characteristic.

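3. The high-dimensional small-sample data expansion method based on an acceptable region according to claim 2, characterized in that step 4 comprises the following steps:

4.1 generate virtual input features within the acceptable region Q_X:

within the acceptable region Q_X obtained in step 3, a skewed distribution is constructed for each input feature, with the input feature value as the mode, the distribution trend as the median, and the range limited to the acceptable region of the input feature; combining all input features gives a multivariate joint probability distribution U; finally, a smooth transition is made between the discrete values, so that for any y^q in the acceptable region Q_Y of the output feature there is a corresponding joint probability distribution within the acceptable region Q_X of the input features, from which samples are drawn;

4.2 according to the relationship between y and x^1, x^2 obtained in step 1, calculate the output feature y'_i corresponding to each group of virtual input features,

Y'_i = {y'_i | 1 ≤ i ≤ n'};

4.3 finally obtain the virtual sample set D',

D' = {(X'_i, Y'_i) | 1 ≤ i ≤ n', i ∈ N}.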

4. The high-dimensional small-sample data expansion method based on an acceptable region according to claim 3, characterized in that in step 1, the overall trend of each output feature with respect to all input feature data is described by least-squares fitting.

5. The high-dimensional small-sample data expansion method based on an acceptable region according to claim 4, characterized in that in step 2, the distribution trend of the input features in the data space is described by least-squares linear fitting.