CN115169470A - High-dimensional small sample data expansion method based on acceptable region - Google Patents
High-dimensional small sample data expansion method based on acceptable region
- Publication number
- CN115169470A
- Authority
- CN
- China
- Prior art keywords
- acceptable
- data
- acceptable area
- range
- area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Complex Calculations (AREA)
Abstract
The invention relates to a high-dimensional small sample data expansion method based on an acceptable region. It addresses the following problems in the prior art: reasonable virtual data is difficult to generate, the feature combinations of virtual data can be wrong, and it is difficult to define an effective range that limits the generation of virtual data. The method comprises the following steps. Step 1: analyzing the data and determining its overall trend. Step 2: determining the distribution trend among the input features. Step 3: defining the acceptable region: an acceptable region $Q$ is defined for each feature and consists of two parts, a generalized allowable range $Q_a$ and an interaction existence range $Q_\beta$. Step 4: generating virtual data: first, within the acceptable region $Q_X$ of the input feature space, samples are drawn from a multivariate joint probability distribution based on the small sample and the acceptable region; the samples are then mapped to the output feature space through the relationship between $y_q$ and $X$, finally forming virtual data within the acceptable region $Q_Y$ of the output feature space.
Description
Technical field:
The invention belongs to the technical field of virtual data expansion and relates to a high-dimensional small sample data expansion method based on an acceptable region. When only a small sample data set is available, the method can generate virtual samples for time series arising in practical problems, and the generated samples can then be used for modeling.
Background art:
With small samples, the scarcity of data makes it impossible to build an effective machine learning model. Two technical approaches are mainly adopted: data expansion and model optimization. The invention belongs to the data expansion approach.
The mainstream data expansion methods at present are distribution-based virtual sample expansion and prior-knowledge-based virtual sample expansion. Distribution-based techniques include Bootstrap, mega-trend diffusion (MTD), and the like. Bootstrap is a resampling technique. Its advantage is that the true distribution can be approximated by resampling; its disadvantage is that no new samples are generated, because the method only redistributes the original sample set. MTD is a well-established virtual sample expansion technique in some scenarios, but because it generates each input feature separately, the generated feature values can be combined incorrectly, which reduces the validity of the virtual samples. Prior-knowledge-based techniques generate virtual data either from existing prior knowledge or from knowledge extracted from the limited sample. With existing prior knowledge, the accuracy of that knowledge directly determines the quality of the virtual samples, so whether accurate prior knowledge can be obtained becomes the determining factor. With knowledge extracted from a limited sample, the open problems are how to extract effective knowledge and whether the extracted knowledge is suitable for the class of objects concerned.
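The limitation attributed to Bootstrap above can be illustrated with a minimal sketch. The sample values are made up for illustration, and NumPy's `Generator.choice` is used here as one common way to resample with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sample of a single feature (values are made up).
small_sample = np.array([2.1, 2.4, 2.9, 3.3, 3.8])

# Bootstrap: draw with replacement from the observed values only.
bootstrap_sample = rng.choice(small_sample, size=20, replace=True)

# Any value in the resample that was not already in the original sample.
new_values = set(bootstrap_sample.tolist()) - set(small_sample.tolist())
print("values not present in the original sample:", new_values)
```

The resample can change how often each value occurs, but it cannot contain a value that was never observed, which is why the text describes it as only redistributing the original sample set.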
Summary of the invention:
The invention aims to provide a high-dimensional small sample data expansion method based on an acceptable region, which solves the prior-art problems that reasonable virtual data is difficult to generate, that the feature combinations of virtual data can be wrong, and that it is difficult to define an effective range to limit the generation of virtual data. The method avoids combination errors when generating virtual input features, avoids the uncertainty introduced by various intermediate models, and can effectively generate, in a high-dimensional space, virtual data that conforms to the characteristics of the small sample.
To achieve this purpose, the invention adopts the following technical scheme:
A high-dimensional small sample data expansion method based on an acceptable region, comprising the following steps:
Step 1: analyzing the data and determining the overall trend of the data;
Step 2: determining the distribution trend among the input features;
Step 3: defining the acceptable region:
an acceptable region $Q$ is defined for each feature; it consists of two parts, a generalized allowable range $Q_a$ and an interaction existence range $Q_\beta$;
the relationship between the acceptable region $Q$, the generalized allowable range $Q_a$ and the interaction existence range $Q_\beta$ is:
$Q = Q_a \cap Q_\beta$
Step 4: generating virtual data:
first, within the acceptable region $Q_X$ of the input feature space, samples are drawn from a multivariate joint probability distribution based on the small sample and the acceptable region; the samples are then mapped to the output feature space through the relationship between $y_q$ and $X$, finally forming virtual data within the acceptable region $Q_Y$ of the output feature space.
Step 3 comprises the following steps:
3.1 setting a generalized allowable range;
3.2 setting the existence range of the mutual influence;
3.3 determining an acceptable area of the input feature;
3.4 determining an acceptable region of the output characteristic.
Step 4 comprises the following steps:
4.1 generating virtual input features within the acceptable region $Q_X$:
within the acceptable region $Q_X$ obtained in step 3, a skewed distribution is constructed for each input feature, taking the input feature as the mode and the distribution trend as the median, with its range limited to the acceptable region of the input feature. The distributions of all input features are combined into a multivariate joint probability distribution $U$. Finally, a smooth transition is made between the discrete distributions, so that for any $y_q$ in the output-feature acceptable region $Q_Y$ there is a corresponding joint probability distribution within the input-feature acceptable region $Q_X$, from which samples are drawn;
4.2 according to the relationship between $y$ and $x_1, x_2$ obtained in step 1, calculating the output feature $y'_i$ for each group of virtual input features $X'_i$,
$Y'_i = \{\, y'_i \mid 1 \le i \le n' \,\}$;
4.3 finally obtaining the virtual sample set $D'$,
$D' = \{\, (X'_i, Y'_i) \mid 1 \le i \le n',\ i \in \mathbb{N} \,\}$.
In step 1, the overall variation trend of each output feature with respect to all input feature data is described by least-squares fitting.
In step 2, the distribution trend of the input features in the data space is described by least-squares linear fitting.
Compared with the prior art, the invention has the following advantages and effects:
(1) Compared with the existing method, the virtual data expansion method for the high-dimensional small sample determines the effective range of virtual data generation by defining the acceptable area, so that the virtual data generated in the area is closer to the real sample;
(2) The invention avoids the problem of combination error when generating virtual input characteristics through high-dimensional sampling, thereby greatly increasing the effectiveness of the generated virtual data and avoiding the situation that the virtual sample is too far away from the real sample;
(3) The invention describes the trend of the data by means of the least square method and generates the virtual sample by utilizing the joint probability distribution on the basis, thereby avoiding the uncertainty caused by various intermediate models and effectively generating the virtual data which accords with the data characteristics of the small sample in a high-dimensional space.
Description of the drawings:
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a diagram of an original full sample set and a small sample set extracted from an experimental subject;
FIG. 3 is an overall trend for a small sample dataset;
FIG. 4 is a distribution trend (in different coordinate systems) of the input features F1 and F2;
FIG. 5 is a generalized allowable range (in different coordinate systems) of the input features F1, F2;
FIG. 6 shows the interaction existence ranges (in different coordinate systems) of the input features F1 and F2;
FIG. 7 is an acceptable area (in different coordinate systems) for the input features F1, F2;
FIG. 8 shows acceptable regions (in the same coordinate system) for F1 and F2 of the input features;
FIG. 9 is a depiction of a virtual input feature generated within an input feature acceptable area;
FIG. 10 is a comparison of virtual input features with original full sample input features;
FIG. 11 is a virtual sample generated from a small sample set;
FIG. 12 is a comparison of the virtual samples with the original full sample;
FIG. 13 is a comparison of the overall trends for different data sets.
Detailed description:
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The invention relates to a high-dimensional small sample data expansion method based on an acceptable region. It determines the effective range of virtual data generation by defining an acceptable region, and it solves the problem of combination errors among the input features of virtual data through high-dimensional sampling. First, the overall trend of the data and the distribution trend of the input features are described by the least-squares method. Then, the generalized allowable range of the data is determined from prior knowledge, and the interaction existence range of the data is determined by selecting either a deviation limiting method or a deviation expanding method according to how the data are distributed; together these define the acceptable region and limit the range of virtual data generation. Finally, virtual data is generated within the acceptable region by sampling in accordance with the overall trend of the data. The invention mainly solves the following problems: (1) reasonable virtual data is difficult to generate when little prior knowledge is available; (2) virtual data generated by sampling each dimension separately contains feature-combination errors; (3) it is difficult to define an effective range that limits the generation of virtual data.
The invention specifically comprises the following steps:
Step 1: analyzing the data and determining the overall trend of the data.
The overall variation trend of each output feature with respect to all input feature data is described by least-squares fitting.
Step 2: determining the distribution trend between the input features.
The distribution trend of the input features in the data space is described by least-squares fitting.
Step 3: defining the acceptable region.
An acceptable region $Q$ is defined for each feature. It consists of two parts: a generalized allowable range $Q_a$ and an interaction existence range $Q_\beta$.
The generalized allowable range defines only the theoretical upper and lower limits of each feature. It is usually determined by physical factors such as material and form, and is an objective limit. For production-line products in particular, each parameter has a rated range and a maximum fluctuation range; the maximum fluctuation range is the generalized allowable range here.
The interaction existence range must be determined from the distribution among the features. The reason is that correlations may exist among several input features, so that one input feature changes as other input features change, and the output feature in turn changes with the input features.
Therefore, the relationship between the acceptable region $Q$, the generalized allowable range $Q_a$ and the interaction existence range $Q_\beta$ is:
$Q = Q_a \cap Q_\beta$
Step 4: generating virtual data.
Virtual data can be generated in two ways. One is to sample directly within the acceptable region $Q$. The other is to first sample, within the acceptable region $Q_X$ of the input feature space, from a multivariate joint probability distribution based on the small sample and the acceptable region, and then map the samples to the output feature space through the relationship between $y_q$ and $X$, finally forming virtual data within the acceptable region $Q_Y$ of the output feature space.
Example:
Referring to FIG. 1, the basic idea of the method is to determine the overall trend of the data and the distribution trend of the input features through data analysis, then define an acceptable region in the data space to limit the range of virtual data generation, and finally generate virtual data according to the overall trend of the data and the acceptable region.
Based on this basic idea, the invention provides a high-dimensional small sample data expansion method based on an acceptable region, which comprises the following steps:
Step 1: analyzing the data and determining the overall trend of the data
For an existing original full sample set with $n^\# = 326$ samples and a small sample data set $D = \{\,(X_i, Y_i) \mid 1 \le i \le n,\ i \in \mathbb{N}\,\}$, $n = 34$, extracted from it, the data are visualized in FIG. 2. The overall variation trend can be read from the figure: capacity shows a clear overall upward trend with increasing F1 and decreasing F2.
The least-squares method is used to obtain the overall variation trend of the output feature $y$ with respect to all input feature data $X$, that is, to describe the relationship between $y$ and $x_1, x_2$, as shown in FIG. 3:
$y = f(X) = f(x_1, x_2) = 0.9452 + 0.0003\,x_1 - 0.0001\,x_2$
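A least-squares fit of this form can be sketched as follows. The 34 data points are synthetic stand-ins drawn around the stated trend, because the actual small sample set is not listed in the text; the noise range and the random seed are assumptions made only so that the example runs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the 34-sample data set (the real data is not given).
n = 34
x1 = rng.uniform(1400, 3600, size=n)
x2 = rng.uniform(1100, 1600, size=n)
y = 0.9452 + 0.0003 * x1 - 0.0001 * x2 + rng.uniform(-0.015, 0.023, size=n)

# Design matrix [1, x1, x2]; solve min ||A b - y||^2 for b = (b0, b1, b2).
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def f(x1, x2):
    """Fitted overall trend y = b0 + b1*x1 + b2*x2."""
    return coef[0] + coef[1] * x1 + coef[2] * x2

print("fitted coefficients:", coef)
```

With real data, `x1`, `x2` and `y` would be replaced by the observed columns; the fitted `coef` then plays the role of the three constants in the trend equation.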
Step 2: determining the distribution trend between the input features
The overall distribution trend between $x_1$ and $x_2$ is described by a univariate linear relationship $F(x_1, x_2)$, as shown in FIG. 4.
From this relationship, one expression is obtained for $x_1$ and, likewise, one for $x_2$.
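A univariate linear trend between two input features could be fitted along the following lines. The data and the slope and intercept used to generate it are hypothetical, and expressing $x_1$ by inverting the fitted line is a simplifying assumption of this sketch rather than something stated in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical input features with a made-up linear relation plus noise.
x1 = rng.uniform(1400, 3600, size=34)
x2 = 1700.0 - 0.15 * x1 + rng.normal(0.0, 20.0, size=34)

# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x1, x2, deg=1)

def x2_trend(x1_value):
    """Trend value of x2 for a given x1."""
    return intercept + slope * x1_value

def x1_trend(x2_value):
    """Trend value of x1 for a given x2 (same line, solved for x1)."""
    return (x2_value - intercept) / slope

print("slope:", slope, "intercept:", intercept)
```

Having the trend available in both directions is what allows a deviation to be measured for each input feature separately in the next step.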
Step 3: defining the acceptable region
3.1 Setting the generalized allowable range
The object in this example is a lithium-ion battery, whose features have a definite physical meaning. From prior knowledge, the generalized allowable range of $x_1$ is [1400, 3600], that of $x_2$ is [1100, 1600], and that of $y$ is [0, 2], as shown in FIG. 5.
3.2 Setting the interaction existence range
In this example there are two input features; the interaction existence range must be calculated for each input feature and for the output feature $y$.
3.2.1 Calculating the interaction existence range of $x_1$ using the deviation limiting method.
For each sample, the deviation of the observed $x_1$ value from the value given by the distribution trend is calculated, and the maximum and minimum of these deviations are taken.
In this example, the added margin is 30% of the maximum and minimum deviation.
The interaction existence range of $x_1$ is then the band around the distribution trend bounded by the widened minimum and maximum deviations.
3.2.2 Calculating the interaction existence range of $x_2$ using the deviation limiting method.
For each sample, the deviation of the observed $x_2$ value from the value given by the distribution trend is calculated, and the maximum and minimum of these deviations are taken.
In this example, the added margin is 30% of the maximum and minimum deviation.
The interaction existence range of $x_2$ is then the band around the distribution trend bounded by the widened minimum and maximum deviations.
The interaction existence ranges of $x_1$ and $x_2$ are shown in FIG. 6, and the acceptable regions are shown in FIG. 7.
3.2.3 Calculating the intersection of the interaction existence ranges of the two input features. To make the intersection easy to compute, the two ranges are normalized to the same coordinate system.
Since the relationship between $x_1$ and $x_2$ is known, both ranges are normalized to the coordinate system with $x_1$ as the abscissa and $x_2$ as the ordinate, and the interaction existence range of $x_1$ is converted accordingly.
The interaction existence range of the input features, $Q_{X\beta}$, is then the intersection of the two ranges in this coordinate system.
3.2.4 Calculating the interaction existence range of the output feature $y$ using the deviation limiting method.
For each $y_i$, its deviation from $f(X_i)$ is calculated and denoted $e_{y_i}$, and the maximum and minimum of these deviations are taken.
In this example, the added margin is 30% of the maximum and minimum deviation.
The interaction existence range $Q_{Y\beta}$ of $y$ is then:
$[\,0.9452 + 0.0003\,x_1 - 0.0001\,x_2 - 0.02,\ \ 0.9452 + 0.0003\,x_1 - 0.0001\,x_2 + 0.03\,]$
that is,
$[\,0.9252 + 0.0003\,x_1 - 0.0001\,x_2,\ \ 0.9752 + 0.0003\,x_1 - 0.0001\,x_2\,]$
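The deviation limiting step can be sketched as a small helper. The text does not spell out exactly how the 30% margin is applied, so this sketch assumes that each extreme deviation is widened by 30% of its own magnitude; the function name `deviation_band` and the sample deviations are hypothetical.

```python
import numpy as np

def deviation_band(observed, predicted, margin=0.30):
    """Return (lower, upper) offsets around the trend.

    Assumption: the minimum and maximum deviations are each widened
    by `margin` times their own absolute value.
    """
    e = np.asarray(observed) - np.asarray(predicted)
    e_min, e_max = e.min(), e.max()
    return e_min - margin * abs(e_min), e_max + margin * abs(e_max)

# Made-up trend values and observations, for illustration only.
predicted = np.array([1.00, 1.10, 1.20, 1.30])
observed = predicted + np.array([-0.015, 0.005, 0.023, -0.002])

lower, upper = deviation_band(observed, predicted)
print("band offsets around the trend:", lower, upper)
```

The resulting pair of offsets defines a band `[trend + lower, trend + upper]`, which has the same shape as the interval given for $Q_{Y\beta}$ above.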
3.3 Determining the acceptable region of the input features
The interaction existence range $Q_{X\beta}$ of the input features is intersected with the generalized allowable range [1400, 3600] of $x_1$ and the generalized allowable range [1100, 1600] of $x_2$ to obtain the final acceptable region $Q_X$ of the input features, as shown in FIG. 8.
3.4 Determining the acceptable region of the output feature
The interaction existence range $Q_{Y\beta}$ of the output feature is intersected with its generalized allowable range [0, 2] to obtain the final acceptable region $Q_Y$ of the output feature.
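For a given input point, the intersection that yields the output-feature acceptable region can be sketched as an interval computation. The trend coefficients, the band offsets of -0.02 and +0.03, and the allowable range [0, 2] are taken from the example; the function name `q_y_interval` is hypothetical.

```python
def f(x1, x2):
    """Overall trend from step 1 of the example."""
    return 0.9452 + 0.0003 * x1 - 0.0001 * x2

def q_y_interval(x1, x2, band=(-0.02, 0.03), allowed=(0.0, 2.0)):
    """Intersect the band around the trend with the allowable range.

    Returns (low, high), or None if the two intervals do not overlap.
    """
    low = max(f(x1, x2) + band[0], allowed[0])
    high = min(f(x1, x2) + band[1], allowed[1])
    return (low, high) if low <= high else None

print(q_y_interval(2000, 1300))
```

Inside the allowable input ranges of this example the band stays below 2, so the allowable range only cuts the interval for inputs outside those ranges, for instance `q_y_interval(3800, 1100)`.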
Step 4: generating virtual data
In this example, samples are first drawn within the acceptable region $Q_X$ of the input feature space and then mapped to the output feature space through the relationship between $y_q$ and $X$, so that virtual data is generated within the acceptable region $Q_Y$ of the output feature space.
4.1 Generating virtual input features within the acceptable region $Q_X$
Within the acceptable region $Q_X$ obtained in step 3, a skewed distribution is constructed for each input feature, taking the input feature as the mode and the distribution trend as the median, with its range limited to the acceptable region of the input feature. The distributions of all input features are combined into a multivariate joint probability distribution $U$. Finally, a smooth transition is made between the discrete distributions, so that for any $y_q$ in the output-feature acceptable region $Q_Y$ there is a corresponding joint probability distribution within the input-feature acceptable region $Q_X$, from which samples are drawn. The resulting virtual input features $X'$ are shown in FIG. 9, and their comparison with the original full-sample input features is shown in FIG. 10.
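The idea of sampling input features jointly, so that their combinations stay inside the acceptable region, can be illustrated with a simplified sketch. It swaps in a triangular distribution as a stand-in for the skewed distribution described in the text, and the trend line, band offsets and mode are hypothetical values, since the example's actual figures for $Q_{X\beta}$ are not reproduced here. Only the two allowable ranges come from the example.

```python
import numpy as np

rng = np.random.default_rng(3)

X1_RANGE = (1400.0, 3600.0)  # generalized allowable range of x1 (from the example)
X2_RANGE = (1100.0, 1600.0)  # generalized allowable range of x2 (from the example)

# Hypothetical values, for illustration only.
X1_MODE = 2400.0             # assumed mode of x1
X2_BAND = (-60.0, 80.0)      # assumed band offsets around the x2 trend

def x2_trend(x1):
    """Assumed linear distribution trend of x2 given x1."""
    return 1700.0 - 0.15 * x1

def sample_inputs(n):
    """Draw n (x1, x2) pairs that lie inside the band and both ranges."""
    x1 = rng.triangular(X1_RANGE[0], X1_MODE, X1_RANGE[1], size=n)
    centre = x2_trend(x1)
    low = np.maximum(centre + X2_BAND[0], X2_RANGE[0])
    high = np.minimum(centre + X2_BAND[1], X2_RANGE[1])
    mode = np.clip(centre, low, high)
    x2 = rng.triangular(low, mode, high)  # x2 is drawn conditionally on x1
    return np.column_stack([x1, x2])

X_virtual = sample_inputs(500)
print(X_virtual[:3])
```

Because `x2` is drawn conditionally on `x1`, a pair such as a very large `x1` with a very large `x2` cannot occur, which is the combination error that independent per-feature sampling would allow.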
4.2 According to the relationship between $y$ and $x_1, x_2$ obtained in step 1, the output feature $y'_i$ is calculated for each group of virtual input features $X'_i$:
$y'_i = f(X'_i) + \mu$
where $\mu \sim U(-0.02, 0.03)$, which gives $Y'_i$:
$Y'_i = \{\, y'_i \mid 1 \le i \le n' \,\}$
4.3 Finally, the virtual sample set $D'$ is obtained:
$D' = \{\, (X'_i, Y'_i) \mid 1 \le i \le n',\ i \in \mathbb{N} \,\}$
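Mapping virtual inputs to outputs can be sketched as follows, under the assumption that each output is the trend value plus a uniform deviation $\mu \sim U(-0.02, 0.03)$ and that outputs outside the allowable range [0, 2] are discarded. The three input points and the function name `generate_outputs` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x1, x2):
    """Overall trend from step 1 of the example."""
    return 0.9452 + 0.0003 * x1 - 0.0001 * x2

def generate_outputs(X_virtual, band=(-0.02, 0.03), allowed=(0.0, 2.0)):
    """Assumed mapping: y' = f(X') + mu, mu ~ U(band), kept only inside `allowed`."""
    mu = rng.uniform(band[0], band[1], size=len(X_virtual))
    y = f(X_virtual[:, 0], X_virtual[:, 1]) + mu
    keep = (y >= allowed[0]) & (y <= allowed[1])
    return X_virtual[keep], y[keep]

# Made-up virtual inputs inside the allowable input ranges.
X_virtual = np.array([[2000.0, 1300.0], [3000.0, 1200.0], [1500.0, 1550.0]])
X_kept, y_virtual = generate_outputs(X_virtual)

# Virtual sample set as a list of ((x1, x2), y) pairs.
D_virtual = list(zip(map(tuple, X_kept), y_virtual))
print(D_virtual)
```

In practice `X_virtual` would come from the input-feature sampling step, so every retained pair lies in $Q_X$ on the input side and in $Q_Y$ on the output side.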
The generated virtual samples are shown in FIG. 11, and their comparison with the original full sample is shown in FIG. 12.
To verify whether the generated virtual sample set is effective, the small sample set, the virtual sample set and the original full sample set are compared on statistical indexes, and the overall trend of each sample set is estimated by the least-squares method for comparison. Table 1 compares the mean and standard deviation of the features.
Table 1. Mean and standard deviation of the features
Table 2 compares the overall trends of the different sample sets and corresponds to FIG. 13.
Table 2. Overall trends for different sample sets
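A comparison of this kind could be computed along the following lines. The three data sets are synthetic stand-ins with the sizes mentioned in the example for the small and full sets (34 and 326) and an arbitrary size of 200 for the virtual set, so the printed numbers do not reproduce the tables of the source; the helper names are hypothetical.

```python
import numpy as np

def summarize(X, y):
    """Feature means, sample standard deviations, and least-squares trend."""
    data = np.column_stack([X, y])
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return data.mean(axis=0), data.std(axis=0, ddof=1), coef

def make_set(n, seed):
    """Hypothetical data generator standing in for the real sample sets."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([rng.uniform(1400, 3600, n), rng.uniform(1100, 1600, n)])
    y = 0.9452 + 0.0003 * X[:, 0] - 0.0001 * X[:, 1] + rng.uniform(-0.02, 0.03, n)
    return X, y

sets = {"small": make_set(34, 0), "virtual": make_set(200, 1), "full": make_set(326, 2)}
summary = {name: summarize(X, y) for name, (X, y) in sets.items()}

for name, (mean, std, coef) in summary.items():
    print(name, np.round(mean, 3), np.round(std, 3), np.round(coef, 5))
```

If the virtual set is effective, its means, standard deviations and trend coefficients should sit closer to those of the full set than the small set's do.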
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structural changes made by using the contents of the specification and the drawings of the present invention should be included in the scope of the present invention.
Claims (5)
1. A high-dimensional small sample data expansion method based on an acceptable region, characterized in that the method comprises the following steps:
step 1: analyzing the data and determining the overall trend of the data;
step 2: determining the distribution trend among the input features;
step 3: defining the acceptable region:
an acceptable region $Q$ is defined for each feature; it consists of two parts, a generalized allowable range $Q_a$ and an interaction existence range $Q_\beta$;
the relationship between the acceptable region $Q$, the generalized allowable range $Q_a$ and the interaction existence range $Q_\beta$ is:
$Q = Q_a \cap Q_\beta$
step 4: generating virtual data:
first, within the acceptable region $Q_X$ of the input feature space, samples are drawn from a multivariate joint probability distribution based on the small sample and the acceptable region; the samples are then mapped to the output feature space through the relationship between $y_q$ and $X$, finally forming virtual data within the acceptable region $Q_Y$ of the output feature space.
2. The method for expanding high-dimensional small sample data based on the acceptable region as claimed in claim 1, wherein step 3 comprises the following steps:
3.1 setting a generalized allowable range;
3.2 setting the existence range of the mutual influence;
3.3 determining an acceptable area of the input feature;
3.4 determining an acceptable region of the output characteristic.
3. The method for expanding high-dimensional small sample data based on the acceptable region as claimed in claim 2, wherein step 4 comprises the following steps:
4.1 generating virtual input features within the acceptable region $Q_X$:
within the acceptable region $Q_X$ obtained in step 3, a skewed distribution is constructed for each input feature, taking the input feature as the mode and the distribution trend as the median, with its range limited to the acceptable region of the input feature; the distributions of all input features are combined into a multivariate joint probability distribution $U$; finally, a smooth transition is made between the discrete distributions, so that for any $y_q$ in the output-feature acceptable region $Q_Y$ there is a corresponding joint probability distribution within the input-feature acceptable region $Q_X$, from which samples are drawn;
4.2 according to the relationship between $y$ and $x_1, x_2$ obtained in step 1, calculating the output feature $y'_i$ for each group of virtual input features $X'_i$,
$Y'_i = \{\, y'_i \mid 1 \le i \le n' \,\}$;
4.3 finally obtaining the virtual sample set $D'$,
$D' = \{\, (X'_i, Y'_i) \mid 1 \le i \le n',\ i \in \mathbb{N} \,\}$.
4. The method for expanding high-dimensional small sample data based on the acceptable region as claimed in claim 3, wherein in step 1 the overall variation trend of each output feature with respect to all input feature data is described by least-squares fitting.
5. The method for expanding high-dimensional small sample data based on the acceptable region as claimed in claim 4, wherein in step 2 the distribution trend of the input features in the data space is described by least-squares linear fitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210840445.5A CN115169470A (en) | 2022-07-18 | 2022-07-18 | High-dimensional small sample data expansion method based on acceptable region |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210840445.5A CN115169470A (en) | 2022-07-18 | 2022-07-18 | High-dimensional small sample data expansion method based on acceptable region |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115169470A true CN115169470A (en) | 2022-10-11 |
Family
ID=83494446
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210840445.5A Pending CN115169470A (en) | 2022-07-18 | 2022-07-18 | High-dimensional small sample data expansion method based on acceptable region |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115169470A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115310212A (en) * | 2022-10-12 | 2022-11-08 | 中汽研(天津)汽车工程研究院有限公司 | Method for sampling characteristic data of automobile shock absorber |
CN115310337A (en) * | 2022-10-12 | 2022-11-08 | 中汽研(天津)汽车工程研究院有限公司 | Vehicle dynamic performance prediction method based on artificial intelligence |
-
2022
- 2022-07-18 CN CN202210840445.5A patent/CN115169470A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115169470A (en) | High-dimensional small sample data expansion method based on acceptable region | |
CN112417573B (en) | GA-LSSVM and NSGA-II shield tunneling multi-objective optimization method based on existing tunnel construction | |
Du et al. | Environmental technical efficiency, technology gap and shadow price of coal-fuelled power plants in China: A parametric meta-frontier analysis | |
CN106384282A (en) | Method and device for building decision-making model | |
CN104573050A (en) | Continuous attribute discretization method based on Canopy clustering and BIRCH hierarchical clustering | |
CN105786584B (en) | A kind of adaptation Suresh Kumar BIM modeling software interface analytic method | |
CN104166731A (en) | Discovering system for social network overlapped community and method thereof | |
CN108376183B (en) | City CA model construction method based on maximum entropy principle | |
CN111444094B (en) | Test data generation method and system | |
CN113554063A (en) | Industrial digital twin virtual and real data fusion method, system, equipment and terminal | |
Yang et al. | A map‐algebra‐based method for automatic change detection and spatial data updating across multiple scales | |
CN105447844A (en) | New method for characteristic selection of complex multivariable data | |
CN116701661A (en) | Building engineering BIM design calculation method based on coding | |
CN103679484A (en) | Novel method for analyzing E-commerce consistency based on behavior Petri network | |
CN108763611A (en) | A kind of wing structure random eigenvalue analysis method based on probabilistic density evolution | |
CN114626886A (en) | Questionnaire data analysis method and system | |
CN106815320B (en) | Investigation big data visual modeling method and system based on expanded three-dimensional histogram | |
CN114490836B (en) | Data mining processing method suitable for electric vehicle charging fault | |
CN115577424A (en) | Method, device, equipment and storage medium for calculating construction engineering quantity | |
CN106709598B (en) | Voltage stability prediction and judgment method based on single-class samples | |
CN113095012B (en) | Splicing and fusing method for numerical simulation calculation results of wind power plant flow field partitions | |
CN113656852B (en) | Method for rapidly generating fine river terrain | |
CN106997462A (en) | A kind of quantum wire image-recognizing method | |
CN113822564A (en) | Flight plan minimum sample size confirmation method and device for airspace simulation analysis | |
CN111241221A (en) | Automatic matching and high-precision repairing method for damaged terrain coordinate data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||