CN111753920B - Feature construction method and device, computer equipment and storage medium
- Publication number
- CN111753920B (application CN202010621785.XA)
- Authority
- CN
- China
- Prior art keywords
- index distribution
- sets
- feature
- category
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
Abstract
The application relates to the technical field of machine learning and provides a feature construction method, an apparatus, computer equipment and a storage medium. First, a plurality of first sets of a feature construction set and the feature value of each first set are constructed through a first feature construction unit and a second feature construction unit, and the feature construction information produced during that process is recorded; next, a plurality of second sets of the training set and a plurality of third sets of the test set are constructed through the first feature construction unit and the recorded feature construction information; finally, a binary classification model is trained and tested with the feature values of the second sets and of the third sets so as to iteratively modify the hyper-parameters of the first and second feature construction units. The expressive ability of the features can thus be adjusted for different application scenarios, realizing supervised and efficient feature construction.
Description
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a feature construction method and apparatus, a computer device, and a storage medium.
Background
Feature construction is an important component in a structured data modeling process and is also an important factor for determining whether a data mining or machine learning project is successful or not.
Generally, the feature construction process of a binary classification model starts from business experience: first, according to the business experience of a business expert, data items generated in the business that play an important role in the patterns the algorithm learns are selected; then, univariate or multivariate operations are applied to these features by various means to construct new features, for example univariate operations such as feature aggregation, mapping, extraction, binning and calculation, or multivariate operations such as feature crossing (combination), polynomial calculation and grouped aggregation.
However, the above feature construction methods are all unsupervised: before the model is built, it is unknown how well the features express the patterns the algorithm learns, and extensive modeling experience shows that features constructed in an unsupervised way are mostly invalid or redundant, i.e. the feature information concentration is low. Moreover, the above feature construction process cannot be effectively adjusted.
Disclosure of Invention
The present application aims to provide a feature construction method, an apparatus, a computer device, and a storage medium, which solve the problems that features constructed by existing feature construction methods have a low information concentration and that the feature construction process cannot be effectively adjusted.
In order to achieve the above object, the embodiments of the present application adopt the following technical solutions:
in a first aspect, the present application provides a feature construction method, including:
obtaining a plurality of samples, and dividing the plurality of samples into a feature construction set, a training set and a test set;
performing feature construction on the samples in the feature construction set and recording feature construction information by using a first feature construction unit to obtain a plurality of first sets;
calculating a feature value of each first set by using a second feature construction unit;
generating a feature mapping table, wherein the feature mapping table includes a plurality of preset categories, a plurality of first sets, and feature values of each first set, and one preset category and one first set determine one feature value;
respectively performing feature construction on the samples in the training set and the test set by using the first feature construction unit and the feature construction information to obtain a plurality of second sets and a plurality of third sets, wherein the plurality of second sets correspond to the plurality of first sets one by one, and the plurality of third sets correspond to the plurality of first sets one by one;
searching the feature mapping table according to the preset categories to obtain a feature value of each second set and a feature value of each third set;
and training and testing a pre-selected classification model by using the feature value of each second set and the feature value of each third set, so as to iteratively modify the hyper-parameters of the first feature construction unit and the second feature construction unit until the first feature construction unit and the second feature construction unit reach the optimal condition.
In a second aspect, the present application also provides a feature construction apparatus, comprising:
a sample acquisition module, configured to acquire a plurality of samples and divide the plurality of samples into a feature construction set, a training set and a test set;
the first execution module is used for performing feature construction on the samples in the feature construction set by using a first feature construction unit and recording feature construction information to obtain a plurality of first sets;
a second execution module for calculating a feature value of each of the first sets using a second feature construction unit;
a generating module, configured to generate a feature mapping table, where the feature mapping table includes a plurality of preset categories, a plurality of first sets, and a feature value of each first set, and one preset category and one first set determine one feature value;
a first processing module, configured to perform feature construction on the samples in the training set and the test set respectively by using the first feature construction unit and the feature construction information to obtain a plurality of second sets and a plurality of third sets, where the plurality of second sets correspond to the plurality of first sets one to one, and the plurality of third sets correspond to the plurality of first sets one to one;
the second processing module is used for searching the characteristic mapping table according to the plurality of preset categories to obtain a characteristic value of each second set and a characteristic value of each third set;
and a third processing module, configured to train and test the pre-selected classification model by using the feature value of each second set and the feature value of each third set, so as to iteratively modify the hyper-parameters of the first feature construction unit and the second feature construction unit until the first feature construction unit and the second feature construction unit reach the optimal condition.
In a third aspect, the present application further provides a computer device, including: one or more processors; a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the feature construction method described above.
In a fourth aspect, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the above-described feature construction method.
Compared with the prior art, in the feature construction method, apparatus, computer device and storage medium provided by the application, first, a plurality of first sets of a feature construction set and the feature value of each first set are constructed through a first feature construction unit and a second feature construction unit, and the feature construction information produced in the process is recorded; next, a plurality of second sets of the training set and a plurality of third sets of the test set are constructed through the first feature construction unit and the recorded feature construction information; finally, a binary classification model is trained and tested with the feature values of the second sets and of the third sets so as to iteratively modify the hyper-parameters of the first and second feature construction units. The expressive ability of the features can thus be adjusted for different application scenarios, realizing supervised and efficient feature construction.
Drawings
Fig. 1 shows a schematic flow chart of a feature construction method provided in an embodiment of the present application.
Fig. 2 is a schematic flowchart of step S12 in the feature construction method shown in fig. 1.
Fig. 3 is a flowchart illustrating step S13 in the feature construction method illustrated in fig. 1.
Fig. 4 is a flowchart illustrating step S15 in the feature construction method illustrated in fig. 1.
Fig. 5 is a flowchart illustrating step S16 in the feature construction method illustrated in fig. 1.
Fig. 6 shows a block diagram of a feature construction apparatus provided in an embodiment of the present application.
Fig. 7 shows a block schematic diagram of a computer device provided by an embodiment of the present application.
Reference numerals: 10-computer device; 11-processor; 12-memory; 13-bus; 100-feature construction apparatus; 101-sample acquisition module; 102-first execution module; 103-second execution module; 104-generation module; 105-first processing module; 106-second processing module; 107-third processing module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Referring to fig. 1, fig. 1 shows a schematic flow chart of a feature construction method provided in an embodiment of the present application, where the feature construction method is applied to a computer device, and may include the following steps:
and S11, obtaining a plurality of samples, and dividing the plurality of samples into a feature construction set, a training set and a test set.
The plurality of samples are historical data used for feature construction of the classification model, that is, historical data of a certain service in the past period of time. For example, the binary model is a traffic state prediction model, and if features of the traffic state prediction model are to be constructed, historical vehicle passing data of a road in a past period of time needs to be acquired as a sample.
Since the traffic condition of the entire road network changes constantly over time, when predicting the road traffic condition it is necessary to count the indexes of the road network in time slices (for example, 5 min), that is, to count the indexes of each road segment every 5 min. The indexes may include a link number, an upstream average speed, an upstream traffic volume, a downstream average speed, a downstream traffic volume, a link lane number, a link length, a road type, a label, and the like, the label being the traffic state of the next time slice (5 min), for example, congestion or smooth (free-flowing).
Taking the traffic state prediction as an example, the traffic state in a future period of time (e.g., 5min) is predicted. Historical vehicle passing data of each road section is obtained as a sample before prediction, and the composition of a single sample is shown in the following table 1:
TABLE 1 (composition of a single sample): link number, upstream average speed, upstream traffic volume, downstream average speed, downstream traffic volume, link lane number, link length, road type, label
After the historical data is obtained, all samples can be divided into a feature construction set, a training set and a test set according to a ratio of 4:4:2, and the feature construction set, the training set and the test set are respectively used for feature construction, model training and model testing.
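A minimal sketch of this split (the shuffling and the use of numpy are implementation assumptions; the text only specifies the 4:4:2 ratio):

```python
import numpy as np

def split_samples(samples, ratios=(0.4, 0.4, 0.2), seed=0):
    """Divide samples into feature construction, training and test sets."""
    idx = np.random.default_rng(seed).permutation(len(samples))
    n_fc = int(len(samples) * ratios[0])      # 40% feature construction set
    n_tr = int(len(samples) * ratios[1])      # 40% training set
    fc_set = [samples[i] for i in idx[:n_fc]]
    train_set = [samples[i] for i in idx[n_fc:n_fc + n_tr]]
    test_set = [samples[i] for i in idx[n_fc + n_tr:]]  # remaining 20%
    return fc_set, train_set, test_set
```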
And S12, performing feature construction on the samples in the feature construction set by using the first feature construction unit and recording feature construction information to obtain a plurality of first sets.
The first feature construction unit may be configured to: firstly, constructing a new index aiming at each sample in a feature construction set; then, carrying out abnormal value processing on the constructed new index of each sample, and recording abnormal information; and finally, performing diversity processing on each new index after the abnormal value processing and recording diversity information to obtain a plurality of first sets. The feature construction information includes anomaly information and diversity information.
Taking traffic state prediction as an example, according to business experience, the future congestion condition of a certain road section may be related to the upstream average speed, the upstream flow, the downstream average speed and the downstream flow of the road section, and therefore, the four indexes can be selected to construct a new feature.
Generally, business experience suggests that the average speed difference between the upstream and downstream of a road section is related to the near-term congestion condition, but the upstream and downstream traffic volumes must also be considered: the same speed difference reflects a different traffic condition under heavy flow than under light flow. Therefore, a new index can be constructed from the four indexes of upstream average speed, upstream flow, downstream average speed and downstream flow so that the influence of the speed difference caused by the difference between upstream and downstream flow is eliminated as much as possible; the new index can be understood as the influence of the joint variation of speed and flow on the traffic condition of the next 5 min.
After the new index of each road section is constructed, abnormal value processing and diversity processing need to be carried out on the new index of each road section, so that a plurality of first sets can be obtained, and each first set comprises at least one new index of the road section.
And S13, calculating the characteristic value of each first set by using the second characteristic construction unit.
The second feature construction unit may be configured to: first, select a certain category feature G in the feature construction set, and calculate the posterior probability of each first set under each value of the category feature G from the ratio of positive to negative samples in that first set; then, disregarding the category feature G, calculate the prior probability of each first set from its overall ratio of positive to negative samples; next, calculate the posterior probability acceptance rate of each first set under each value of the category feature G from the posterior probabilities and prior probabilities obtained in the previous two steps; and finally, calculate the feature value of each first set from the posterior probability of each first set under each value of the category feature G, the prior probability of each first set, and the posterior probability acceptance rate of each first set under each value of the category feature G.
According to the business experience, a certain category feature G with a small number of values can be selected, so that the new index constructed in step S12 has a significant difference under different values of the category feature G.
Taking traffic state prediction as an example, according to business experience, the new indexes corresponding to different road types differ significantly. For example, the flow and speed indexes of an urban expressway and a community road are obviously different, and correspondingly a new index constructed from those indexes should also differ obviously, so the road type can be used as the category feature G. In this case the category feature G has three values, namely urban expressway, branch road, and community road.
S14, generating a feature mapping table, wherein the feature mapping table comprises a plurality of preset categories, a plurality of first sets and feature values of each first set, and one preset category and one first set determine one feature value.
The plurality of preset categories are the values of the category feature G in step S13; taking traffic state prediction as an example, the plurality of preset categories are urban expressways, branch roads, and community roads.
The feature mapping table reflects the feature values of each first set under each preset category, which can be shown in table 2 below:
TABLE 2

                 preset category 1    preset category 2    preset category 3
first set 1      feature value 11     feature value 12     feature value 13
first set 2      feature value 21     feature value 22     feature value 23
...              ...                  ...                  ...
And S15, respectively performing feature construction on the samples in the training set and the test set by using the first feature construction unit and the feature construction information to obtain a plurality of second sets and a plurality of third sets, wherein the plurality of second sets correspond to the plurality of first sets one by one, and the plurality of third sets correspond to the plurality of first sets one by one.
The manner of performing feature construction on the samples in the training set and the test set is similar to the manner of performing feature construction on the samples in the feature construction set, that is, the feature construction is performed in the manner in step S12.
Taking the training set as an example, the process of performing feature construction on the samples in the training set may include:
firstly, constructing a new index for each sample in a training set, wherein the construction mode of the new index is consistent with the mode in the step S12; then, according to the abnormal information recorded in the step S12, performing abnormal value processing on the new index of each constructed sample; finally, diversity processing is performed on each new index after the abnormal value processing according to the diversity information recorded in step S12, and a plurality of second sets are obtained.
S16, searching the feature mapping table according to a plurality of preset categories, and obtaining the feature value of each second set and the feature value of each third set.
Within the same service the values of the category feature are fixed, that is, the plurality of preset categories are determined; taking traffic state prediction as an example, the preset categories include urban expressways, branch roads and community roads.
Because the preset categories are determined, the second sets correspond to the first sets one to one, and the third sets correspond to the first sets one to one, the feature value of each second set and of each third set under each preset category can be obtained by searching the feature mapping table. For example, if the second set 1 corresponds to the first set 1, the feature values of the second set 1 under the preset category 1, the preset category 2 and the preset category 3 can be obtained by looking up Table 2, and they are feature value 11, feature value 12 and feature value 13, respectively.
And S17, training and testing the pre-selected classification model by using the characteristic value of each second set and the characteristic value of each third set so as to iteratively modify the hyper-parameters of the first characteristic construction unit and the second characteristic construction unit until the first characteristic construction unit and the second characteristic construction unit reach the optimal state.
The feature values of each second set can be added to a training data set and the feature values of each third set to a testing data set; the training data set is used to train the binary classification model, the testing data set is used to test the trained model, and the hyper-parameters of the first and second feature construction units are modified iteratively until both units reach the optimal condition, i.e. until they can be used to construct the optimal feature values.
The training data set and the test data set may be the training set and the test set in step S11, or may be training sets and test sets constructed separately, and are not limited herein.
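The text does not prescribe a particular search strategy for the hyper-parameters; as an illustrative sketch only, a random search scored on the test data set could look like the following, where build_features and train_and_score are hypothetical callbacks standing in for steps S12 to S16 and the binary classification model, and the parameter ranges are assumptions:

```python
import random

def tune_hyperparameters(build_features, train_and_score, n_trials=50):
    """Iteratively modify the hyper-parameters of both feature construction
    units (step S17), keeping the best-scoring configuration."""
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {
            "alpha": random.uniform(0.01, 1.0),  # alpha in (0, 1]
            "bins": random.randint(7, 15),       # preset bin count
            "K": random.uniform(1.0, 100.0),     # acceptance-rate scale
            "f": random.uniform(1.0, 200.0),     # acceptance-rate shift
            "theta": random.uniform(2.0, 10.0),  # odds upper limit (> 1)
        }
        train_values, test_values = build_features(params)
        score = train_and_score(train_values, test_values)  # e.g. test AUC
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```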
On the basis of fig. 1, please refer to fig. 2, step S12 may include the following sub-steps:
and S121, respectively carrying out index construction on each sample in the feature construction set according to preset indexes to obtain first index distribution, wherein the first index distribution comprises a first intermediate index corresponding to each sample.
The preset index may be an index with a large business impact, for example, in a customs ocean garbage detection scenario, the preset index may be price and unit mass; as another example, in a traffic state prediction scenario, the preset indicators may be an upstream average speed, an upstream flow rate, a downstream average speed, and a downstream flow rate.
The first intermediate index may be a new index constructed from the preset indexes and may be denoted F_i, where i is the sample identifier. The first index distribution may include the new index corresponding to each sample and may be represented as the distribution F = (F_1, ..., F_i, ..., F_n), where n represents the total number of samples in the feature construction set.
As an implementation manner, in a traffic state prediction scenario, the process of performing index construction on each sample in the feature construction set according to a preset index to obtain a first index distribution may include:
1. obtaining any one target sample in the feature construction set;
2. according to the upstream average speed, the upstream flow, the downstream average speed and the downstream flow, using a preset formula to generate the first intermediate index corresponding to the target sample, where i represents the road section identifier, V_i1 represents the upstream average speed of road section i, V_i0 represents the downstream average speed of road section i, Q_i1 represents the upstream flow of road section i, Q_i0 represents the downstream flow of road section i, and α represents a hyper-parameter of the first feature construction unit with α ∈ (0, 1]; in step S17, α is iteratively modified to learn its optimal value.
3. repeatedly executing the above steps until a first intermediate index corresponding to each sample in the feature construction set is generated, thereby obtaining the first index distribution.
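The preset formula itself appears only as an image in the source and is not recoverable from the text; the sketch below therefore uses an assumed, illustrative combination of the four indexes (an α-weighted flow correction of the speed ratio) purely to show the shape of the computation:

```python
# The exact preset formula is not recoverable from the text; the combination
# below is an assumed, illustrative form. alpha in (0, 1] is the
# hyper-parameter of the first feature construction unit tuned in step S17.
def first_intermediate_index(v_up, v_down, q_up, q_down, alpha):
    assert 0.0 < alpha <= 1.0
    # assumed form: speed ratio corrected by an alpha-weighted flow ratio
    return (v_up / v_down) * (q_down / q_up) ** alpha

def build_first_index_distribution(samples, alpha):
    # samples: iterable of dicts holding the four preset indexes per section
    return [first_intermediate_index(s["v_up"], s["v_down"],
                                     s["q_up"], s["q_down"], alpha)
            for s in samples]
```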
And S122, carrying out abnormal value processing on the first index distribution and recording abnormal information to obtain a first target index distribution, wherein the first target index distribution comprises first target index data corresponding to each sample.
In practical applications, the first index distribution may be a continuous variable or a category variable. If it is a continuous variable, binning is needed later, and abnormal values affect the binning stability, so abnormal value processing must be performed first. If it is a category variable, category merging is needed later, and obviously abnormal categories affect the accuracy of category merging, so abnormal value processing is likewise needed first.
As an embodiment, when the first index distribution is a continuous variable, the abnormality information includes a maximum value and a minimum value of the first target index distribution;
the process of performing abnormal value processing on the first index distribution and recording abnormal information to obtain the first target index distribution may include:
and performing truncation processing on the first index distribution to obtain a first target index distribution, and recording the maximum value and the minimum value of the first target index distribution.
There are many ways to process abnormal values; in practical application, different processing methods may be selected according to the specific situation of the first index distribution, which is not limited here. The first target index distribution may likewise be represented as F = (F_1, ..., F_i, ..., F_n).
In this embodiment, a simple IQR-based method is selected: in F = (F_1, ..., F_i, ..., F_n), values more than 1.5 IQR above the 75th percentile and more than 1.5 IQR below the 25th percentile are truncated, and the maximum and minimum value information is retained. First, the first index distribution is sorted from small to large; then, the 25th percentile (first quartile) and the 75th percentile (third quartile) are found, and values exceeding 1.5 IQR above the 75th percentile or falling 1.5 IQR below the 25th percentile are truncated, which completes the truncation of the first index distribution. The recorded maximum value information is the maximum cutoff value, and the minimum value information is the minimum cutoff value. For example, assume the distribution contains the 100 values 1-100, so the 25th percentile (first quartile) is 25 and the 75th percentile (third quartile) is 75. The IQR is the difference between the third and first quartiles, i.e. 75 − 25 = 50; the maximum cutoff value is then 75 + 1.5 × 50 = 150 and the minimum cutoff value is 25 − 1.5 × 50 = −50, giving the value range [−50, 150]. Numbers below −50 are truncated to −50 and numbers above 150 are truncated to 150, i.e. numbers below −50 are replaced by −50 and numbers above 150 are replaced by 150.
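A minimal sketch of this truncation step (numpy percentile handling is an implementation choice, not from the text):

```python
import numpy as np

def truncate_by_iqr(values):
    """Truncate at 1.5*IQR beyond the quartiles; record the cutoffs."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # min/max cutoff values
    truncated = np.clip(values, lo, hi)        # replace outliers by cutoffs
    anomaly_info = {"min": lo, "max": hi}      # recorded anomaly information
    return truncated, anomaly_info
```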
As another embodiment, when the first index distribution is a category variable, the abnormality information includes a pre-replacement category and a post-replacement category of the first target index distribution;
the process of performing abnormal value processing on the first index distribution and recording abnormal information to obtain the first target index distribution may include:
and performing category replacement processing on the first index distribution to obtain a first target index distribution, and recording the category before replacement and the category after replacement of the first target index distribution.
Performing category replacement processing on the first index distribution means that obviously abnormal categories in the first index distribution are reasonably replaced, and the categories before and after replacement are recorded. For example, if F = (male, female, male, unknown, male), then "unknown" is obviously an abnormal category and needs to be replaced by "male" or "female"; assuming the result is F = (male, female, male, female, male), the category before replacement is "unknown" and the category after replacement is "female".
And S123, performing diversity processing on the first target index distribution and recording diversity information to obtain a plurality of first sets, wherein each first set comprises at least one first target index datum.
As an embodiment, when the first index distribution is a continuous variable, the diversity information includes binning point information of the first target index distribution;
the process of performing diversity processing on the first target indicator distribution and recording diversity information to obtain a plurality of first sets may include:
and performing binning processing on the first target index distribution according to a preset bin count to obtain a plurality of first sets, and recording the binning-point information of the first target index distribution, wherein the preset bin count is a hyper-parameter of the first feature construction unit.
There are many binning methods, such as equal-frequency binning, equidistant binning, chi-square binning, and optimal binning. In this embodiment, equidistant binning is selected, and the hyper-parameter bins is preset as the bin count; based on experience, bins can be limited to a range, generally 7 to 15, and it is then iteratively modified in step S17 to learn its optimal value. For example, assume the minimum value in the first target distribution is −50, the maximum value is 150, and bins = 4, i.e. binning at equal intervals into 4 bins: the first division point is −50 + (150 − (−50))/4 × 1 = 0, the second division point is −50 + (150 − (−50))/4 × 2 = 50, and the third division point is −50 + (150 − (−50))/4 × 3 = 100, so the bin ranges after division are [−50, 0], (0, 50], (50, 100], (100, 150]. The binning-point information includes each division point and the bin ranges after division.
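A minimal sketch of the equidistant binning, reproducing the worked example above (bins = 4 over [−50, 150] yields the division points 0, 50 and 100):

```python
import numpy as np

def equidistant_binning(values, bins):
    """Assign each value to one of `bins` equal-width bins; return the
    bin indices and the recorded binning-point information."""
    lo, hi = float(np.min(values)), float(np.max(values))
    edges = np.array([lo + (hi - lo) / bins * t for t in range(bins + 1)])
    # first bin is closed on both sides, later bins are left-open,
    # matching [-50, 0], (0, 50], (50, 100], (100, 150]
    idx = np.clip(np.searchsorted(edges, values, side="left") - 1,
                  0, bins - 1)
    return idx, edges
```

For values = [-50, 0, 25, 150] and bins = 4 this returns the bin indices [0, 0, 1, 3], consistent with the ranges above.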
As another embodiment, when the first index distribution is a category variable, the diversity information includes category merging information of the first target index distribution;
the process of performing diversity processing on the first target index distribution and recording diversity information to obtain a plurality of first sets may include:
and carrying out category merging processing on the first target index distribution to obtain a plurality of first sets, and recording category merging information of the first target index distribution.
If the first index distribution is the category variable, the frequency condition of each value in the first index distribution can be observed, the values with lower frequency are combined according to experience, and category combination information is recorded.
Referring to fig. 3, step S13 may include the following sub-steps:
S131, respectively calculating the posterior probability of each first set under each preset category.
In a traffic state prediction scenario, the plurality of preset categories include urban expressways, branch roads and community roads, that is, the posterior probability of each first set is calculated under urban expressways, branch roads and community roads.
As an embodiment, the sub-step S131 may include the following sub-steps:
S131A, respectively obtaining the positive sample size and the negative sample size of each first set in each preset category.
In a traffic state prediction scene, taking a single first set as an example, namely, respectively obtaining the positive sample size and the negative sample size of the road types of urban expressways, branches and community roads in the first set. The positive sample indicates that the label in table 1 is "congestion", and the negative sample indicates that the label in table 1 is "smooth".
S131B, under each preset category, respectively using the first formula pos_rate_ij = pos_ij / (Σ_{j=1}^{k} pos_ij) to calculate the positive sample proportion of each first set, where i represents the label of the preset category, j represents the label of the first set, k represents the total number of first sets, pos_ij represents the positive sample size of the jth first set under the ith preset category, Σ_{j=1}^{k} pos_ij represents the total number of positive samples of all k first sets under the ith preset category, and pos_rate_ij represents the positive sample proportion of the jth first set under the ith preset category.
S131C, under each preset category, respectively using the second formula neg_rate_ij = neg_ij / (Σ_{j=1}^{k} neg_ij) to calculate the negative sample proportion of each first set, where neg_ij represents the negative sample size of the jth first set under the ith preset category, Σ_{j=1}^{k} neg_ij represents the total number of negative samples of all k first sets under the ith preset category, and neg_rate_ij represents the negative sample proportion of the jth first set under the ith preset category.
S131D, under each preset category, respectively using the posterior probability formula post_odds_ij = pos_rate_ij / neg_rate_ij to calculate the posterior probability of each first set, where post_odds_ij represents the posterior probability of the jth first set under the ith preset category.
From the posterior probability post_odds_ij it can be seen that, under a certain preset category, if the post_odds_ij of a certain first set equals 1, the positive and negative sample proportions in that first set are equivalent, and the posterior probability cannot effectively divide its positive and negative samples; if the post_odds_ij of a first set is far greater than 1, that first set very likely carries risk, i.e. its positive sample proportion is high; if the post_odds_ij of a first set is close to 0, that first set carries low risk, i.e. its positive sample proportion is small. Obviously, whether the risk is large or small, the posterior probability can effectively divide the positive and negative samples in the first set.
As another embodiment, during the calculation there may be a case where the negative sample proportion of a certain first set is 0, i.e. neg_rate_ij = 0; the calculated post_odds_ij is then infinite or 0/0, which is clearly unreasonable, so smoothing is needed when this situation occurs. The present embodiment employs Laplace smoothing, i.e., when calculating neg_rate_ij, the negative sample size of each first set is incremented by 1. Therefore, the sub-step S131 may further include the sub-step S131E:
S131E, when the negative sample proportion of any one first set is 0, under each preset category, respectively using the third formula neg_rate_ij = (neg_ij + 1) / (Σ_{j=1}^{k} (neg_ij + 1)) to recalculate the negative sample proportion of each first set.
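A minimal sketch of sub-steps S131A to S131E for one fixed preset category (the list-based data layout is an illustrative assumption):

```python
# pos_counts[j] / neg_counts[j]: positive/negative sample sizes of the j-th
# first set under a fixed preset category i.
def posterior_odds(pos_counts, neg_counts):
    if min(neg_counts) == 0:                     # S131E: Laplace smoothing
        neg_counts = [n + 1 for n in neg_counts]
    pos_total = sum(pos_counts)                  # positives over all k sets
    neg_total = sum(neg_counts)                  # negatives over all k sets
    return [(p / pos_total) / (n / neg_total)    # post_odds_ij
            for p, n in zip(pos_counts, neg_counts)]
```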
S132, calculating the prior probability of each first set.
In many cases, due to unbalanced data distribution, the sample size of some first sets under a preset category is small, so the confidence of the posterior probability calculated in sub-step S131 is very low; using it directly easily leads to overfitting. To avoid this, the concept of prior probability is introduced.
The prior probability is the probability of each first set calculated without regard to the preset category. When the sample size of a certain first set is very small, the posterior probability of that first set has low confidence and should be accepted with a small probability; in other words, the prior probability of that first set should be accepted with a large probability.
In the present embodiment, the substep S132 may include the following substeps:
S132A, acquiring a positive sample amount and a negative sample amount of each first set;
S132B, according to the positive sample size and the negative sample size of each first set, using the prior probability formula prior_odds_j = (pos_j / Σ_{j=1}^{k} pos_j) / (neg_j / Σ_{j=1}^{k} neg_j) to calculate the prior probability of each first set, where pos_j represents the positive sample size of the jth first set, Σ_{j=1}^{k} pos_j represents the total number of positive samples of all k first sets, neg_j represents the negative sample size of the jth first set, Σ_{j=1}^{k} neg_j represents the total number of negative samples of all k first sets, and prior_odds_j represents the prior probability of the jth first set.
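The prior odds simply ignore the preset category; a minimal sketch:

```python
# pos_counts[j] / neg_counts[j]: sample sizes of the j-th first set,
# here aggregated over all preset categories (S132).
def prior_odds(pos_counts, neg_counts):
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    return [(p / pos_total) / (n / neg_total)    # prior_odds_j
            for p, n in zip(pos_counts, neg_counts)]
```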
And S133, respectively calculating the posterior probability acceptance rate of each first set under each preset category.
The posterior probability acceptance rate is used for balancing the acceptance degree of the prior probability and the posterior probability, and aims to obtain a larger acceptance rate for the posterior probability when the first set has a larger sample size and obtain a smaller acceptance rate when the first set has a smaller sample size. Therefore, the posterior probability acceptance rate is a monotone increasing function related to the sample size of the first set, and the value range thereof should be [0,1], and the posterior probability acceptance rate is established by using the Sigmoid function in the embodiment.
In the present embodiment, the substep S133 may include the following substeps:
S133A, respectively acquiring the sample size of each first set under each preset category;
S133B, under each preset category, respectively using the posterior probability acceptance rate formula accept_rate_ij = 1 / (1 + e^{−(N_ij − f)/K}) to calculate the posterior probability acceptance rate of each first set, where N_ij represents the sample size of the jth first set under the ith preset category, f and K are hyper-parameters of the second feature construction unit, and accept_rate_ij represents the posterior probability acceptance rate of the jth first set under the ith preset category.
In this embodiment, the sample sizes of all first sets may be sorted, a quartile of them selected as the initial value of K, and twice K selected as the initial value of f; iterative modification in step S17 then learns the optimal values of f and K.
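As a sketch, the Sigmoid-based acceptance rate can be written as below; the exact parametrization (shift f, scale K) is an assumed reconstruction, since the text only states that it is a monotone increasing Sigmoid of the sample size with hyper-parameters f and K and range [0, 1]:

```python
import math

# Assumed parametrization: monotone increasing in the sample size n_ij,
# with values in (0, 1); f shifts and K scales the Sigmoid.
def acceptance_rate(n_ij, f, K):
    return 1.0 / (1.0 + math.exp(-(n_ij - f) / K))
```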
And S134, calculating the characteristic value of each first set according to the posterior probability, the prior probability and the acceptance rate of the posterior probability under each preset category.
In this embodiment, the substep S134 may include the following substeps:
S134A, truncating the posterior probability and the prior probability according to a preset upper limit value so that neither exceeds the preset upper limit value, the preset upper limit value being a hyper-parameter of the second feature construction unit.
The posterior and prior probabilities calculated in sub-steps S131 to S132 both take values in [0, +∞); to avoid excessively large values, their upper limit can be capped at a specific θ according to business experience, where θ is a number greater than 1, i.e. the value range of the posterior and prior probabilities is set to [0, θ]. For example, the initial value of θ may be 5; θ is then iteratively modified in step S17 to learn its optimal value.
S134B, under each preset category, respectively using the feature value formula bins_odds_ij = accept_rate_ij × post_odds_ij + (1 − accept_rate_ij) × prior_odds_j to calculate the feature value of each first set, where bins_odds_ij represents the feature value of the jth first set under the ith preset category.
That is, the posterior probability acceptance rate accept_rate_ij is used to weight the posterior probability post_odds_ij and the prior probability prior_odds_j, yielding the feature value bins_odds_ij.
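Putting sub-steps S134A and S134B together, a minimal sketch (the default θ of 5 follows the initial value mentioned in S134A):

```python
# Truncate both odds at the preset upper limit theta (a hyper-parameter of
# the second feature construction unit), then blend them with the posterior
# probability acceptance rate.
def feature_value(post_odds_ij, prior_odds_j, accept_rate_ij, theta=5.0):
    post = min(post_odds_ij, theta)    # truncation of the posterior odds
    prior = min(prior_odds_j, theta)   # truncation of the prior odds
    return accept_rate_ij * post + (1 - accept_rate_ij) * prior
```

A first set with a large sample size (acceptance rate close to 1) keeps mostly its posterior odds, while a sparse first set falls back to the prior odds.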
Based on fig. 1, taking the training set as an example, referring to fig. 4, step S15 may include the following sub-steps:
and S151, respectively constructing indexes of each sample in the training set according to preset indexes to obtain second index distribution, wherein the second index distribution comprises second middle indexes corresponding to each sample.
And S152, carrying out abnormal value processing on the second index distribution according to the abnormal information to obtain a second target index distribution, wherein the second target index distribution comprises second target index data corresponding to each sample.
As an embodiment, when the second index distribution is a continuous variable, the abnormality information includes a maximum value and a minimum value of the first target index distribution;
the process of obtaining the second target index distribution by performing abnormal value processing on the second index distribution according to the abnormal information may include:
and performing truncation processing on the second index distribution according to the maximum value and the minimum value of the first target index distribution to obtain the second target index distribution.
As another embodiment, when the second index distribution is a category variable, the abnormality information includes a pre-replacement category and a post-replacement category of the first target index distribution;
the process of obtaining the second target index distribution by performing abnormal value processing on the second index distribution according to the abnormal information may include:
and performing category replacement processing on the second index distribution according to the category before replacement and the category after replacement of the first target index distribution to obtain the second target index distribution.
S153, performing diversity processing on the second target index distribution according to the diversity information to obtain a plurality of second sets, wherein each second set comprises at least one second target index data, and the plurality of second sets correspond to the plurality of first sets one to one.
As an embodiment, taking the training set as an example, when the second index distribution is a continuous variable, the diversity information includes the binning point information of the first target index distribution;
the process of performing diversity processing on the second target index distribution according to the diversity information to obtain a plurality of second sets may include:
and performing binning processing on the second target index distribution according to the binning-point information of the first target index distribution to obtain a plurality of second sets.
As another embodiment, when the second index distribution is a category variable, the diversity information includes category merging information of the first target index distribution;
the process of performing diversity processing on the second target index distribution according to the diversity information to obtain a plurality of second sets may include:
and carrying out category merging processing on the second target index distribution according to the category merging information of the first target index distribution to obtain a plurality of second sets.
It should be noted that the test set undergoes feature construction in the same manner as the training set, that is, the plurality of third sets are obtained through the same process as sub-steps S151 to S153, which is not repeated here.
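A minimal sketch of how the recorded feature construction information is reused on training or test samples in the continuous-variable case (the dictionary layout of the anomaly information is an assumption matching the truncation sketch above):

```python
import numpy as np

# Reuse the recorded anomaly information (min/max cutoffs) and binning
# points from the feature construction set, instead of recomputing them.
def apply_recorded_construction(values, anomaly_info, bin_edges):
    clipped = np.clip(values, anomaly_info["min"], anomaly_info["max"])
    idx = np.clip(np.searchsorted(bin_edges, clipped, side="left") - 1,
                  0, len(bin_edges) - 2)
    return idx  # second-set (or third-set) membership of each sample
```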
Referring to fig. 5, based on fig. 1, step S16 may include the following sub-steps:
S161, acquiring any one target preset category and any one target second set.
And S162, when the target preset category and the target characteristic value determined by the target second set exist in the characteristic mapping table, taking the target characteristic value as the characteristic value of the target second set.
And S163, when the target preset category and the target characteristic value determined by the target second set do not exist in the characteristic mapping table, setting the characteristic value of the target second set to 1.
When the feature value of a certain second set is 1, it indicates that the proportions of positive and negative samples in that second set do not differ much, and the second set cannot divide them. Therefore, if no target feature value determined by the target preset category and the target second set exists in the feature mapping table, the feature value of the target second set is set to 1; in this case no useful information is introduced, and the division of positive and negative samples in the other second sets is not affected.
Since the above sub-steps S161 to S163 obtain the feature value of any one target second set among the plurality of second sets, step S16 further includes, after executing sub-step S163:
repeatedly executing sub-steps S161 to S163 until the feature value of each second set is obtained.
It should be noted that, for the test set, the process of obtaining the feature value of each third set is the same as that of sub-steps S151 to S153 and sub-steps S161 to S163, and is not repeated here.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
firstly, the application range is wide: the method is suitable for almost all binary classification modeling processes, and is particularly suitable for severely imbalanced problems with many category variables, such as anomaly detection, risk prevention and control, and the like;
secondly, the service experience is well utilized, the characteristics are constructed by utilizing the service experience, for example, in a traffic state prediction scene, a new index is constructed by utilizing the upstream average speed, the upstream flow, the downstream average speed and the downstream flow, and the characteristic value is determined by utilizing the road type; meanwhile, in the characteristic construction process, business experience is converted into adjustable data function mapping, so that the meaning is clear and the interpretability is strong;
thirdly, in the feature construction process, the label of the historical data is combined in a supervision mode, so that the constructed features have higher pattern recognition information density;
fourthly, in the feature construction, the concept of data distribution interval probability (odds) is adopted to calculate the feature value, the higher the posterior probability is, the higher the risk is, and the lower the posterior probability is, the lower the risk is;
fifthly, Laplace smoothing is borrowed to handle abnormal situations in the calculation, ensuring that the index construction process proceeds stably;
sixthly, the Sigmoid function is borrowed to convert the sample size into a posterior probability acceptance rate in [0, 1], which is used to weight the prior probability and the posterior probability, making the index construction more reasonable.
In order to perform the corresponding steps in the above-described embodiment of the feature construction method and various possible embodiments, an implementation of the feature construction apparatus is given below. Referring to fig. 6, fig. 6 is a block diagram illustrating a feature building apparatus 100 according to an embodiment of the present application. The feature construction apparatus 100 is applied to a computer device, and the feature construction apparatus 100 includes: a sample obtaining module 101, a first executing module 102, a second executing module 103, a generating module 104, a first processing module 105, a second processing module 106 and a third processing module 107.
The sample obtaining module 101 is configured to obtain a plurality of samples, and divide the plurality of samples into a feature construction set, a training set, and a test set.
The first executing module 102 is configured to perform feature construction on the samples in the feature construction set by using the first feature constructing unit and record feature construction information to obtain a plurality of first sets.
A second execution module 103, configured to calculate feature values of each first set by using the second feature construction unit.
A generating module 104, configured to generate a feature mapping table, where the feature mapping table includes a plurality of preset categories, a plurality of first sets, and a feature value of each first set, and one preset category and one first set determine one feature value.
The first processing module 105 is configured to perform feature construction on the samples in the training set and the test set respectively by using the first feature construction unit and the feature construction information to obtain a plurality of second sets and a plurality of third sets, where the plurality of second sets correspond to the plurality of first sets one to one, and the plurality of third sets correspond to the plurality of first sets one to one.
The second processing module 106 is configured to search the feature mapping table according to a plurality of preset categories, and obtain a feature value of each second set and a feature value of each third set.
And the third processing module 107 is configured to train and test the pre-selected two-class model by using the feature value of each second set and the feature value of each third set, so as to iteratively modify the hyper-parameters of the first feature building unit and the second feature building unit until the first feature building unit and the second feature building unit reach the optimal values.
Optionally, the first feature construction unit includes a preset index, and the feature construction information includes abnormal information and diversity information; the first execution module 102 is specifically configured to:
and respectively carrying out index construction on each sample in the feature construction set according to preset indexes to obtain first index distribution, wherein the first index distribution comprises a first intermediate index corresponding to each sample.
And processing abnormal values of the first index distribution and recording abnormal information to obtain a first target index distribution, wherein the first target index distribution comprises first target index data corresponding to each sample.
And performing diversity processing on the first target index distribution and recording diversity information to obtain a plurality of first sets, wherein each first set comprises at least one first target index datum.
Optionally, when the first index distribution is a continuous variable, the abnormal information includes a maximum value and a minimum value of the first target index distribution, and the diversity information includes split-bin point information of the first target index distribution;
the first execution module 102 executes a mode of performing abnormal value processing on the first index distribution and recording abnormal information to obtain a first target index distribution, including: cutting the first index distribution to obtain a first target index distribution, and recording the maximum value and the minimum value of the first target index distribution;
the manner in which the first execution module 102 performs diversity processing on the first target index distribution and records diversity information to obtain a plurality of first sets includes: performing binning processing on the first target index distribution according to a preset bin count to obtain a plurality of first sets, and recording the binning-point information of the first target index distribution, wherein the preset bin count is a hyper-parameter of the first feature construction unit.
Optionally, when the first index distribution is a category variable, the abnormal information includes a pre-replacement category and a post-replacement category of the first target index distribution, and the diversity information includes category merging information of the first target index distribution;
the first execution module 102 executes a mode of performing abnormal value processing on the first index distribution and recording abnormal information to obtain a first target index distribution, including: performing category replacement processing on the first index distribution to obtain a first target index distribution, and recording a pre-replacement category and a post-replacement category of the first target index distribution;
the first executing module 102 executes a manner of performing diversity processing on the first target indicator distribution and recording diversity information to obtain a plurality of first sets, including: and carrying out category merging processing on the first target index distribution to obtain a plurality of first sets, and recording category merging information of the first target index distribution.
Optionally, the feature construction method is applied to traffic jam condition prediction, and the preset indexes include an upstream average speed, an upstream flow, a downstream average speed and a downstream flow; the first executing module 102 executes index construction on each sample in the feature construction set according to preset indexes to obtain a first index distribution mode, including:
obtaining any one target sample in the feature construction set;
according to the upstream average speed, the upstream flow, the downstream average speed and the downstream flow, using a preset formula to generate the first intermediate index corresponding to the target sample, where i represents the road section identifier, V_i1 represents the upstream average speed of road section i, V_i0 represents the downstream average speed of road section i, Q_i1 represents the upstream flow of road section i, Q_i0 represents the downstream flow of road section i, and α represents a hyper-parameter of the first feature construction unit with α ∈ (0, 1];
And repeatedly executing the steps until a first intermediate index corresponding to each sample in the feature construction set is generated, and obtaining first index distribution.
Optionally, the second executing module 103 is specifically configured to: respectively calculating the posterior probability of each first set under each preset category; calculating the prior probability of each first set; respectively calculating the posterior probability acceptance rate of each first set under each preset category; and under each preset category, calculating the characteristic value of each first set according to the posterior probability, the prior probability and the acceptance rate of the posterior probability.
Optionally, the second executing module 103 executes a manner of calculating a posterior probability of each first set under each preset category, respectively, including:
respectively acquiring the positive sample quantity and the negative sample quantity of each first set under each preset category;
under each preset category, respectively using the first formula pos_rate_ij = pos_ij / (Σ_{j=1}^{k} pos_ij) to calculate the positive sample proportion of each first set, where i represents the label of the preset category, j represents the label of the first set, k represents the total number of first sets, pos_ij represents the positive sample size of the jth first set under the ith preset category, Σ_{j=1}^{k} pos_ij represents the total number of positive samples of all k first sets under the ith preset category, and pos_rate_ij represents the positive sample proportion of the jth first set under the ith preset category;
under each preset category, respectively using the second formula neg_rate_ij = neg_ij / (Σ_{j=1}^{k} neg_ij) to calculate the negative sample proportion of each first set, where neg_ij represents the negative sample size of the jth first set under the ith preset category, Σ_{j=1}^{k} neg_ij represents the total number of negative samples of all k first sets under the ith preset category, and neg_rate_ij represents the negative sample proportion of the jth first set under the ith preset category;
under each preset category, respectively using the posterior probability formula post_odds_ij = pos_rate_ij / neg_rate_ij to calculate the posterior probability of each first set, where post_odds_ij represents the posterior probability of the jth first set under the ith preset category.
Optionally, when calculating the posterior probability of each first set under each preset category, the second execution module 103 is further configured to:
when the negative sample proportion of any one first set is 0, recalculate the negative sample proportion of each first set under each preset category by using a third formula.
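A minimal sketch of the posterior probability computation for one preset category follows; the additive smoothing used when a negative sample proportion would be 0 stands in for the third formula, which is not reproduced in this text, and the constant 0.5 is an assumption.

```python
def posterior_odds(pos_num, neg_num):
    """Posterior probability (odds) of each first set under one preset category.
    pos_num / neg_num: positive / negative sample sizes per first set; both
    class totals are assumed to be nonzero."""
    total_pos, total_neg = sum(pos_num), sum(neg_num)
    k = len(pos_num)
    odds = []
    for p, n in zip(pos_num, neg_num):
        pos_ratio = p / total_pos
        neg_ratio = n / total_neg
        if neg_ratio == 0:
            # Stand-in for the third formula: additive smoothing (assumed form).
            neg_ratio = (n + 0.5) / (total_neg + 0.5 * k)
        odds.append(pos_ratio / neg_ratio)
    return odds
```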
Optionally, the second execution module 103 calculates the prior probability of each first set as follows:
acquiring a positive sample size and a negative sample size of each first set;
according to the positive sample amount and the negative sample amount of each first set, calculating the prior probability of each first set by using a prior probability formula prior_odds_j = (pos_num_j / Σ_{j=1}^{k} pos_num_j) / (neg_num_j / Σ_{j=1}^{k} neg_num_j), wherein pos_num_j represents the positive sample size of the jth first set, Σ_{j=1}^{k} pos_num_j represents the total number of positive samples of all k first sets, neg_num_j represents the negative sample size of the jth first set, Σ_{j=1}^{k} neg_num_j represents the total number of negative samples of all k first sets, and prior_odds_j represents the prior probability of the jth first set.
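Under the notation reconstructed above, the prior probability pools the positive and negative counts of each first set over all preset categories; a short sketch, assuming every first set contains at least one negative sample:

```python
def prior_odds(pos_num, neg_num):
    """Prior probability (odds) of each first set, computed over all preset
    categories together rather than within a single category."""
    total_pos, total_neg = sum(pos_num), sum(neg_num)
    return [(p / total_pos) / (n / total_neg) for p, n in zip(pos_num, neg_num)]
```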
Optionally, the second execution module 103 calculates the posterior probability acceptance rate of each first set under each preset category as follows:
respectively acquiring the sample size of each first set under each preset category;
under each preset category, respectively calculating the posterior probability acceptance rate of each first set by using a posterior probability acceptance rate formula, wherein N_ij represents the sample size of the jth first set in the ith preset category, f and K are hyper-parameters of the second feature construction unit, and accept_rate_ij represents the posterior probability acceptance rate of the jth first set in the ith preset category.
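The functional form of the acceptance rate is not reproduced in this text. A common choice built from exactly these ingredients, a sample size N_ij and two hyper-parameters f and K, is the sigmoid shrinkage weight used in classical target encoding; the sketch below assumes that form.

```python
import math

def acceptance_rate(n_ij, K=20.0, f=10.0):
    """Assumed sigmoid form: sparsely populated first sets lean on the prior,
    well-populated ones on the posterior. K shifts the midpoint of the
    transition and f controls its smoothness."""
    return 1.0 / (1.0 + math.exp(-(n_ij - K) / f))
```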
Optionally, the second execution module 103 calculates the characteristic value of each first set according to the posterior probability, the prior probability and the posterior probability acceptance rate under each preset category as follows:
according to a preset upper limit value, carrying out truncation processing on the posterior probability and the prior probability so that the posterior probability and the prior probability do not exceed the preset upper limit value, wherein the preset upper limit value is a hyperparameter of the second feature construction unit;
under each preset category, respectively calculating the characteristic value of each first set by using a characteristic value formula bins_odds_ij = accept_rate_ij * post_odds_ij + (1 - accept_rate_ij) * prior_odds_j, wherein bins_odds_ij represents the characteristic value of the jth first set in the ith preset category.
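Putting the pieces together, the characteristic value of one first set under one preset category can be sketched as below; the default upper limit of 20 is an arbitrary illustration of the preset upper limit hyper-parameter.

```python
def feature_value(post_odds_ij, prior_odds_j, accept_rate_ij, upper=20.0):
    """bins_odds_ij per the characteristic value formula above."""
    post = min(post_odds_ij, upper)    # truncate the posterior probability
    prior = min(prior_odds_j, upper)   # truncate the prior probability
    return accept_rate_ij * post + (1.0 - accept_rate_ij) * prior
```

With accept_rate_ij close to 1 the characteristic value tracks the per-category posterior probability; close to 0 it falls back to the pooled prior probability.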
Optionally, the first feature construction unit includes a preset index, and the feature construction information includes abnormal information and diversity information; the first processing module 105 is specifically configured to:
and respectively constructing indexes of each sample in the training set according to preset indexes to obtain second index distribution, wherein the second index distribution comprises second middle indexes corresponding to each sample.
And carrying out abnormal value processing on the second index distribution according to the abnormal information to obtain a second target index distribution, wherein the second target index distribution comprises second target index data corresponding to each sample.
And performing diversity processing on the second target index distribution according to the diversity information to obtain a plurality of second sets, wherein each second set comprises at least one second target index data, and the plurality of second sets correspond to the plurality of first sets one to one.
Optionally, when the second index distribution is a continuous variable, the abnormal information includes a maximum value and a minimum value of the first target index distribution, and the diversity information includes binning point information of the first target index distribution;
the first processing module 105 performs abnormal value processing on the second index distribution according to the abnormal information to obtain the second target index distribution as follows: truncating the second index distribution according to the maximum value and the minimum value of the first target index distribution to obtain the second target index distribution;
the first processing module 105 performs diversity processing on the second target index distribution according to the diversity information to obtain the plurality of second sets as follows: performing binning processing on the second target index distribution according to the binning point information of the first target index distribution to obtain the plurality of second sets.
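For the continuous branch, applying the recorded abnormal information and diversity information to the training set can be sketched as follows; the dictionary keys holding the recorded minimum, maximum and binning points are names assumed for this sketch.

```python
import numpy as np

def apply_continuous_construction(second_index_distribution, recorded_info):
    """Map the training set's second index distribution to second sets using
    the information recorded on the feature construction set."""
    # Abnormal value processing: truncate to the recorded min/max of the
    # first target index distribution.
    x = np.clip(np.asarray(second_index_distribution, dtype=float),
                recorded_info["min"], recorded_info["max"])
    # Diversity processing: bin with the recorded binning points so that the
    # resulting second sets correspond one-to-one with the first sets.
    return np.digitize(x, bins=recorded_info["bin_points"])
```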
Optionally, when the second index distribution is a category variable, the abnormal information includes a pre-replacement category and a post-replacement category of the first target index distribution, and the diversity information includes category merging information of the first target index distribution;
the first processing module 105 performs abnormal value processing on the second index distribution according to the abnormal information to obtain the second target index distribution as follows: performing category replacement processing on the second index distribution according to the pre-replacement category and the post-replacement category of the first target index distribution to obtain the second target index distribution;
the first processing module 105 performs diversity processing on the second target index distribution according to the diversity information to obtain the plurality of second sets as follows: performing category merging processing on the second target index distribution according to the category merging information of the first target index distribution to obtain the plurality of second sets.
Optionally, the second processing module 106 is specifically configured to: acquire any one target preset category and any one target second set; when a target characteristic value determined by the target preset category and the target second set exists in the feature mapping table, take the target characteristic value as the characteristic value of the target second set; when no such target characteristic value exists in the feature mapping table, set the characteristic value of the target second set to 1; and repeat the above steps until the characteristic value of each second set is obtained.
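The lookup described above amounts to a keyed table with a default; a minimal sketch, assuming the table is keyed by (preset category, second set) pairs:

```python
def lookup_feature_values(feature_mapping_table, preset_category, second_sets):
    """Return the characteristic value of each second set, falling back to 1
    when the (category, set) pair is absent from the feature mapping table."""
    return [feature_mapping_table.get((preset_category, s), 1.0)
            for s in second_sets]
```

For example, with feature_mapping_table = {("urban expressway", 2): 3.4}, looking up second sets [2, 7] under "urban expressway" would yield [3.4, 1.0].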
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the feature constructing apparatus 100 described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Referring to fig. 7, fig. 7 is a block diagram illustrating a computer device 10 according to an embodiment of the present disclosure. The computer device 10 may be any one of a smart phone, a tablet computer, a laptop computer, a desktop computer, a server, and the like, the computer device 10 includes a processor 11, a memory 12, and a bus 13, and the processor 11 is connected to the memory 12 through the bus 13.
The memory 12 is used for storing a program, such as the feature construction apparatus 100 shown in fig. 6. The feature construction apparatus 100 includes at least one software functional module that can be stored in the memory 12 in the form of software or firmware, and the processor 11 executes the program after receiving an execution instruction to implement the feature construction method disclosed in the above embodiments.
The memory 12 may include a Random Access Memory (RAM) and may also include a non-volatile memory (NVM).
The processor 11 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by hardware integrated logic circuits or software instructions in the processor 11. The processor 11 may be a general-purpose processor, including a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), or an embedded ARM processor.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by the processor 11, the computer program implements the feature construction method disclosed in the above embodiment.
In summary, according to the feature construction method and apparatus, the computer device, and the storage medium provided by the present application, a plurality of first sets of the feature construction set and the feature value of each first set are first constructed by the first feature construction unit and the second feature construction unit, and the feature construction information generated in the process is recorded; next, a plurality of second sets of the training set and a plurality of third sets of the test set are respectively constructed by the first feature construction unit together with the recorded feature construction information; the binary classification model is then trained and tested with the feature values of the second sets and of the third sets, so as to iteratively modify the hyper-parameters of the first and second feature construction units; in this way, the expressive power of the features can be adjusted to different application scenarios, realizing supervised and efficient feature construction.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (16)
1. A method of feature construction, the method comprising:
obtaining a plurality of samples, and dividing the plurality of samples into a feature construction set, a training set and a test set, wherein the plurality of samples are historical vehicle passing data used for carrying out feature construction of a traffic state prediction model;
performing feature construction on the samples in the feature construction set and recording feature construction information by using a first feature construction unit to obtain a plurality of first sets;
calculating a feature value of each first set by using a second feature construction unit;
generating a feature mapping table, wherein the feature mapping table comprises a plurality of preset categories, the plurality of first sets and the feature value of each first set, one preset category and one first set jointly determine one feature value, and the preset categories comprise urban expressways, branch roads and community roads;
respectively performing feature construction on the samples in the training set and the test set by using the first feature construction unit and the feature construction information to obtain a plurality of second sets and a plurality of third sets, wherein the plurality of second sets correspond to the plurality of first sets one by one, and the plurality of third sets correspond to the plurality of first sets one by one;
searching the feature mapping table according to the preset categories to obtain a feature value of each second set and a feature value of each third set;
training and testing a preselected traffic state prediction model by using the characteristic value of each second set and the characteristic value of each third set so as to iteratively modify the hyper-parameters of the first characteristic construction unit and the second characteristic construction unit until the first characteristic construction unit and the second characteristic construction unit reach the optimal state;
the first feature construction unit comprises preset indexes, and the feature construction information comprises abnormal information and diversity information; when the method is applied to the prediction of the traffic jam condition, the preset indexes comprise an upstream average speed, an upstream flow, a downstream average speed and a downstream flow;
the step of performing feature construction on the samples in the feature construction set and recording feature construction information by using a first feature construction unit to obtain a plurality of first sets includes:
obtaining any one target sample in the feature construction set;
according to the upstream average speed, the upstream flow, the downstream average speed and the downstream flow, generating a first intermediate index corresponding to the target sample by using a preset formula, wherein m represents a road section identifier, V_m1 represents the average speed upstream of the section m, V_m0 represents the average speed downstream of the section m, Q_m1 represents the upstream flow of the section m, Q_m0 represents the downstream flow of the section m, α represents the hyper-parameter of said first feature construction unit, and α ∈ (0, 1];
Repeatedly executing the steps until a first intermediate index corresponding to each sample in the feature construction set is generated to obtain a first index distribution, wherein the first index distribution comprises the first intermediate index corresponding to each sample;
processing abnormal values of the first index distribution and recording abnormal information to obtain a first target index distribution, wherein the first target index distribution comprises first target index data corresponding to each sample;
and performing diversity processing on the first target index distribution and recording the diversity information to obtain a plurality of first sets, wherein each first set comprises at least one first target index datum.
2. The method of claim 1, wherein when the first index distribution is a continuous variable, the abnormality information includes a maximum value and a minimum value of the first target index distribution, and the diversity information includes binning point information of the first target index distribution;
the step of processing the abnormal value of the first index distribution and recording the abnormal information to obtain a first target index distribution includes:
truncating the first index distribution to obtain the first target index distribution, and recording the maximum value and the minimum value of the first target index distribution;
the step of performing diversity processing on the first target index distribution and recording the diversity information to obtain the plurality of first sets includes:
and performing binning processing on the first target index distribution according to a preset binning number to obtain the plurality of first sets, and recording binning point information of the first target index distribution, wherein the preset binning number is a hyper-parameter of the first feature construction unit.
3. The method of claim 1, wherein when the first index distribution is a category variable, the anomaly information includes a pre-replacement category and a post-replacement category of the first target index distribution, and the diversity information includes category merging information of the first target index distribution;
the step of performing abnormal value processing on the first index distribution and recording the abnormal information to obtain a first target index distribution includes:
carrying out category replacement processing on the first index distribution to obtain a first target index distribution, and recording a pre-replacement category and a post-replacement category of the first target index distribution;
the step of performing diversity processing on the first target index distribution and recording the diversity information to obtain the plurality of first sets includes:
and performing category merging processing on the first target index distribution to obtain a plurality of first sets, and recording category merging information of the first target index distribution.
4. The method of claim 1, wherein said step of computing feature values for each of said first sets using a second feature construction unit comprises:
respectively calculating the posterior probability of each first set under each preset category;
calculating a priori probability for each of the first sets;
respectively calculating the posterior probability acceptance rate of each first set under each preset category;
and under each preset category, calculating the characteristic value of each first set according to the posterior probability, the prior probability and the acceptance rate of the posterior probability.
5. The method of claim 4, wherein said step of calculating a posterior probability for each of said first sets, respectively, under each of said predetermined categories, comprises:
respectively acquiring the positive sample quantity and the negative sample quantity of each first set under each preset category;
under each preset category, respectively calculating a positive sample proportion of each of said first sets by using a first formula pos_ratio_ij = pos_num_ij / Σ_{j=1}^{k} pos_num_ij, wherein i represents a label of the preset category, j represents a label of the first set, k represents the total number of first sets, pos_num_ij represents the positive sample size of the jth first set in the ith preset category, Σ_{j=1}^{k} pos_num_ij represents the total number of positive samples of all k first sets in the ith preset category, and pos_ratio_ij represents the positive sample proportion of the jth first set in the ith preset category;
under each preset category, respectively calculating a negative sample proportion of each of said first sets by using a second formula neg_ratio_ij = neg_num_ij / Σ_{j=1}^{k} neg_num_ij, wherein neg_num_ij represents the negative sample size of the jth first set in the ith preset category, Σ_{j=1}^{k} neg_num_ij represents the total number of negative samples of all k first sets in the ith preset category, and neg_ratio_ij represents the negative sample proportion of the jth first set in the ith preset category;
under each preset category, respectively calculating the posterior probability of each of said first sets by using a posterior probability formula post_odds_ij = pos_ratio_ij / neg_ratio_ij, wherein post_odds_ij represents the posterior probability of the jth first set under the ith preset category.
6. The method of claim 5, wherein said step of calculating a posterior probability for each of said first sets, respectively, under each of said preset categories, further comprises: when the negative sample proportion of any one of said first sets is 0, recalculating the negative sample proportion of each of said first sets under each of said preset categories by using a third formula.
7. The method of claim 5 or 6, wherein said step of calculating a priori probabilities for each of said first sets comprises:
acquiring a positive sample size and a negative sample size of each first set;
according to the positive sample amount and the negative sample amount of each of said first sets, calculating the prior probability of each of said first sets by using a prior probability formula prior_odds_j = (pos_num_j / Σ_{j=1}^{k} pos_num_j) / (neg_num_j / Σ_{j=1}^{k} neg_num_j), wherein pos_num_j represents the positive sample size of the jth first set, Σ_{j=1}^{k} pos_num_j represents the total number of positive samples of all k first sets, neg_num_j represents the negative sample size of the jth first set, Σ_{j=1}^{k} neg_num_j represents the total number of negative samples of all k first sets, and prior_odds_j represents the prior probability of the jth first set.
8. The method of claim 7, wherein said step of calculating a posterior probability acceptance rate for each of said first sets, respectively, under each of said predetermined categories, comprises:
respectively acquiring the sample size of each first set under each preset category;
under each preset category, respectively calculating the posterior probability acceptance rate of each of said first sets by using a posterior probability acceptance rate formula, wherein N_ij represents the sample size of the jth first set under the ith preset category, f and K are hyper-parameters of said second feature construction unit, and accept_rate_ij represents the posterior probability acceptance rate of the jth first set in the ith preset category.
9. The method of claim 8, wherein said step of calculating the eigenvalues of each said first set according to said posterior probability, said prior probability and said posterior probability acceptance rate under each said predetermined category comprises:
according to a preset upper limit value, carrying out truncation processing on the posterior probability and the prior probability so that the posterior probability and the prior probability do not exceed the preset upper limit value, wherein the preset upper limit value is a hyper-parameter of the second feature construction unit;
under each preset category, respectively calculating the characteristic value of each of said first sets by using a characteristic value formula bins_odds_ij = accept_rate_ij * post_odds_ij + (1 - accept_rate_ij) * prior_odds_j, wherein bins_odds_ij represents the characteristic value of the jth first set under the ith preset category.
10. The method of claim 1, wherein said step of performing feature construction on said samples of said training set using said first feature construction unit and said feature construction information to obtain a plurality of second sets comprises:
respectively carrying out index construction on each sample in the training set according to the preset indexes to obtain second index distribution, wherein the second index distribution comprises a second intermediate index corresponding to each sample;
processing abnormal values of the second index distribution according to the abnormal information to obtain a second target index distribution, wherein the second target index distribution comprises second target index data corresponding to each sample;
and performing diversity processing on the second target index distribution according to the diversity information to obtain a plurality of second sets, wherein each second set comprises at least one second target index data, and the plurality of second sets correspond to the plurality of first sets one to one.
11. The method of claim 10, wherein when the second index distribution is a continuous variable, the abnormality information includes a maximum value and a minimum value of the first target index distribution, and the diversity information includes binning point information of the first target index distribution;
the step of processing the abnormal value of the second index distribution according to the abnormal information to obtain a second target index distribution includes:
truncating the second index distribution according to the maximum value and the minimum value of the first target index distribution to obtain the second target index distribution;
the step of performing diversity processing on the second target index distribution according to the diversity information to obtain the plurality of second sets includes:
and performing binning processing on the second target index distribution according to the binning point information of the first target index distribution to obtain the plurality of second sets.
12. The method of claim 10, wherein when the second index distribution is a category variable, the anomaly information includes a pre-replacement category and a post-replacement category of the first target index distribution, and the diversity information includes category merging information of the first target index distribution;
the step of processing the abnormal value of the second index distribution according to the abnormal information to obtain a second target index distribution includes:
performing category replacement processing on the second index distribution according to the pre-replacement category and the post-replacement category of the first target index distribution to obtain the second target index distribution;
the step of performing diversity processing on the second target index distribution according to the diversity information to obtain the plurality of second sets includes:
and performing category merging processing on the second target index distribution according to the category merging information of the first target index distribution to obtain a plurality of second sets.
13. The method as claimed in claim 1, wherein said step of searching said feature mapping table according to said plurality of preset categories to obtain a feature value of each of said second sets comprises:
acquiring any one target preset category and any one target second set;
when a target characteristic value determined by the target preset category and the target second set exists in the characteristic mapping table, taking the target characteristic value as a characteristic value of the target second set;
when no target characteristic value determined by the target preset category and the target second set exists in the characteristic mapping table, setting the characteristic value of the target second set to 1;
and repeating the steps until the characteristic value of each second set is obtained.
14. A feature construction apparatus, characterized in that the apparatus comprises:
a sample acquisition module, configured to acquire a plurality of samples and divide the plurality of samples into a feature construction set, a training set and a test set, wherein the plurality of samples are historical vehicle passing data used for feature construction of a traffic state prediction model;
the first execution module is used for performing feature construction on the samples in the feature construction set by using a first feature construction unit and recording feature construction information to obtain a plurality of first sets;
a second execution module for calculating a feature value of each of the first sets using a second feature construction unit;
a generating module, configured to generate a feature mapping table, wherein the feature mapping table comprises a plurality of preset categories, the plurality of first sets and the feature value of each first set, one preset category and one first set jointly determine one feature value, and the preset categories comprise urban expressways, branch roads and community roads;
a first processing module, configured to perform feature construction on the samples in the training set and the test set respectively by using the first feature construction unit and the feature construction information to obtain a plurality of second sets and a plurality of third sets, where the plurality of second sets correspond to the plurality of first sets one to one, and the plurality of third sets correspond to the plurality of first sets one to one;
the second processing module is used for searching the characteristic mapping table according to the plurality of preset categories to obtain a characteristic value of each second set and a characteristic value of each third set;
the third processing module is used for training and testing a preselected traffic state prediction model by using the characteristic value of each second set and the characteristic value of each third set so as to iteratively modify the hyper-parameters of the first characteristic construction unit and the second characteristic construction unit until the first characteristic construction unit and the second characteristic construction unit are optimal;
the first characteristic construction unit comprises preset indexes, and the characteristic construction information comprises abnormal information and diversity information; when the method is applied to the prediction of the traffic jam condition, the preset indexes comprise an upstream average speed, an upstream flow, a downstream average speed and a downstream flow;
the first execution module performs feature construction on the samples in the feature construction set by using the first feature construction unit and records the feature construction information to obtain the plurality of first sets as follows:
obtaining any one target sample in the feature construction set;
according to the upstream average speed, the upstream flow, the downstream average speed and the downstream flow, generating a first intermediate index corresponding to the target sample by using a preset formula, wherein m represents a road section identifier, V_m1 represents the average speed upstream of the section m, V_m0 represents the average speed downstream of the section m, Q_m1 represents the upstream flow of the section m, Q_m0 represents the downstream flow of the section m, α represents the hyper-parameter of said first feature construction unit, and α ∈ (0, 1];
Repeatedly executing the steps until a first intermediate index corresponding to each sample in the feature construction set is generated to obtain a first index distribution, wherein the first index distribution comprises the first intermediate index corresponding to each sample;
processing abnormal values of the first index distribution and recording abnormal information to obtain a first target index distribution, wherein the first target index distribution comprises first target index data corresponding to each sample;
and performing diversity processing on the first target index distribution and recording the diversity information to obtain a plurality of first sets, wherein each first set comprises at least one first target index datum.
15. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the feature construction method of any of claims 1-13.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the feature construction method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010621785.XA CN111753920B (en) | 2020-06-30 | 2020-06-30 | Feature construction method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---
CN111753920A (en) | 2020-10-09
CN111753920B (en) | 2022-06-21
Family
ID=72680260
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |