CN110796492A - Method, device and equipment for determining important features and storage medium - Google Patents

Info

Publication number
CN110796492A
Authority
CN
China
Prior art keywords
features
feature
comparison
index data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911040435.8A
Other languages
Chinese (zh)
Inventor
汤益嘉
彭涛
唐黄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN201911040435.8A priority Critical patent/CN110796492A/en
Publication of CN110796492A publication Critical patent/CN110796492A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06Asset management; Financial planning or analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Operations Research (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a method, an apparatus, a device and a storage medium for determining important features. The method comprises: obtaining at least one selected feature from the candidate features of a target cluster, where the selected feature comprises a feature parameter of the selected feature and index data of each parameter segment; obtaining, according to the feature parameter of the selected feature, at least one comparison feature from the candidate comparison features of a comparison cluster, and determining index data of each parameter segment in the comparison feature, where the feature parameter of the comparison feature is the same as that of the selected feature; and determining the important features of the target cluster according to the distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter segment in the comparison feature. By obtaining the selected feature and the comparison feature and comparing the index data of each parameter segment in the two, the efficiency of determining important features is improved, the workload is reduced and time is saved.

Description

Method, device and equipment for determining important features and storage medium
Technical Field
The present invention relates to computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining an important feature.
Background
With the progress of the information age and the explosive growth in the volume of information, there is a strong demand for efficient processing of massive data. Determining important features for feature extraction is an effective way to cope with the information explosion, and it has been widely applied and studied.
Current methods for determining important features either score the features according to divergence or relevance and determine the importance of each feature from its score; or define an objective function for judging feature importance, divide the feature set into feature subsets, randomly select feature subsets, test the score or error of each subset on the objective function, and determine the importance of the features from those scores or errors.
However, prior-art methods that determine feature importance from relevance or divergence can only find the relationship between pairs of features and perform poorly when many features are involved. If multiple features are instead randomly selected and scored with an objective function, the accuracy of determining important features is low, a large amount of time is consumed, the resource cost is high, and the efficiency of determining important features is low.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for feature extraction, which achieve an effect of improving the efficiency of determining important features by comparing difference values between selected features and comparison features.
In a first aspect, an embodiment of the present invention provides a method for determining an important feature, where the method includes:
obtaining at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
according to the characteristic parameters of the selected characteristics, at least one comparison characteristic is obtained from candidate comparison characteristics of a comparison cluster, and index data of each parameter segment in the comparison characteristics is determined; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
and determining the important characteristics of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristics and the index data of each parameter segment in the comparison characteristics.
Optionally, the index data of each parameter segment in the selected feature includes:
the index data of each parameter segment in the selected feature is represented as $(N_1, N_2, \ldots, N_n)$, where n is the number of segments into which the feature parameter of the selected feature is divided;
the index data of each parameter segment in the comparison feature includes:
the index data of each parameter segment in the comparison feature is represented as $(N'_1, N'_2, \ldots, N'_n)$, where n is the number of segments into which the feature parameter of the comparison feature is divided;
correspondingly, the determining the important feature of the target cluster according to the distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter segment in the comparison feature includes:
and determining the important features of the target clusters according to the difference of the index data of each corresponding parameter segment of the selected features and the comparison features.
Optionally, the determining the important features of the target cluster according to the difference of the index data of each corresponding parameter segment of the selected feature and the comparison feature includes:
determining the density distribution of the index data of each parameter segment of the selected feature as $P_i$, and determining the density distribution of the index data of each parameter segment of the comparison feature as $P'_i$;
and determining the important features of the target cluster according to the difference in the density distributions of the corresponding parameter segments of the selected feature and the comparison feature.
Optionally, the following formula is adopted to calculate the difference of the density distribution of each corresponding parameter segment of the selected feature and the contrast feature:
Figure BDA0002252679400000031
where $\mathrm{Diff} \in [0, 1]$;
Diff represents the difference value between the selected feature and the comparison feature, obtained by accumulating the differences of the density distributions of the corresponding parameter segments;
$$P_i = \frac{N_i}{\mathrm{SumN}}$$
where SumN is the sum of $(N_1, N_2, \ldots, N_n)$;
$$P'_i = \frac{N'_i}{\mathrm{SumN}'}$$
where SumN' is the sum of $(N'_1, N'_2, \ldots, N'_n)$.
Optionally, the determining the important feature of the target cluster according to a distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter segment in the comparison feature further includes:
if the distribution difference meets a preset standard, determining the currently selected feature as an important feature; and if the distribution difference does not meet the preset standard, determining the currently selected feature as the non-important feature.
In a second aspect, an embodiment of the present invention further provides an apparatus for determining an important feature, where the apparatus includes:
a selected feature obtaining module, configured to obtain at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
a comparison feature obtaining module, configured to obtain at least one comparison feature from candidate comparison features of a comparison cluster according to a feature parameter of the selected feature, and determine index data of each parameter segment in the comparison features; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
and the important characteristic determining module is used for determining the important characteristic of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristic and the index data of each parameter segment in the comparison characteristic.
Optionally, the index data of each parameter segment in the selected feature includes:
the index data of each parameter segment in the selected feature is represented as $(N_1, N_2, \ldots, N_n)$, where n is the number of segments into which the feature parameter of the selected feature is divided;
the index data of each parameter segment in the comparison feature includes:
the index data of each parameter segment in the comparison feature is represented as $(N'_1, N'_2, \ldots, N'_n)$, where n is the number of segments into which the feature parameter of the comparison feature is divided;
accordingly, the significant feature determination module includes:
and the parameter segment corresponding unit is used for determining the important features of the target cluster according to the difference of the index data of each corresponding parameter segment of the selected features and the comparison features.
Optionally, the parameter segment corresponding unit is specifically configured to:
determining the density distribution of the index data of each parameter segment of the selected feature as $P_i$, and determining the density distribution of the index data of each parameter segment of the comparison feature as $P'_i$;
and determining the important features of the target cluster according to the difference in the density distributions of the corresponding parameter segments of the selected feature and the comparison feature.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement a method for determining an important feature according to any embodiment of the present invention.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are used for performing a method for determining an important feature according to any one of the embodiments of the present invention.
According to the embodiment of the invention, the selected features are obtained from the target cluster, the comparison features are obtained from the comparison cluster, the density distribution of index data in the selected features and the density distribution of index data in the comparison features are subjected to difference, and the important features are selected according to the difference, so that the problem that a plurality of features cannot be compared simultaneously in the prior art is solved, the workload is reduced, the time is saved, and the efficiency of determining the important features is improved.
Drawings
FIG. 1 is a flow chart illustrating a method for determining an important feature according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for determining an important feature according to a second embodiment of the present invention;
FIG. 3 is a histogram of the client time-point AUM asset composition of the first customer group in the second embodiment of the present invention;
FIG. 4 is a histogram of the client time-point AUM asset composition of the second customer group in the second embodiment of the present invention;
FIG. 5 is a histogram of the current-month client star level of the first customer group in the second embodiment of the present invention;
FIG. 6 is a histogram of the current-month client star level of the second customer group in the second embodiment of the present invention;
FIG. 7 is a block diagram of an apparatus for determining an important feature according to a third embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer device in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart illustrating a method for determining an important feature according to an embodiment of the present invention, where the method is applicable to determining an important feature, and the method can be executed by an apparatus for determining an important feature. As shown in fig. 1, the method specifically includes the following steps:
S110, obtaining at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment.
The target cluster is a cluster where the important feature to be determined is located, the candidate feature is at least one feature in the target cluster, and the important feature is determined from the candidate features. The worker may obtain at least one selected feature from the candidate features by big data analysis or manual selection. For example, if a worker wants to obtain important features of the target cluster about the female population, the worker needs to obtain features of the female population from all candidate features of the target cluster as the selected features. The acquisition mode can manually search the relevant features, and can also automatically acquire the relevant features through keywords through big data analysis. The selected features are relatively important features in the candidate features, and the important features are selected from the relatively important candidate features.
The selected feature includes a feature parameter of the selected feature and index data of each parameter segment. Specifically, the index data of each parameter segment in the selected feature is represented as $(N_1, N_2, \ldots, N_n)$, where n is the number of segments into which the feature parameter of the selected feature is divided. For example, with features related to the female population as the selected features, the selected features may include the young age of the female population and the high education level of the female population; young age and high education level are the feature parameters of these two selected features, respectively. The young-age parameter is divided into four segments, 17-21, 22-26, 27-31 and 32-36 years old, so the number of segments n is 4; the high-education-level parameter is divided into two segments, master's degree and doctoral degree, so the number of segments n is 2. The numbers of women aged 17-21, 22-26, 27-31 and 32-36 are the index data of the segments of the young-age parameter; the number of women with a master's degree, $N_1$, and the number of women with a doctoral degree, $N_2$, are the index data of the segments of the high-education-level parameter.
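The segmentation in the example above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `count_by_segment`, the sample ages, and the rule that values outside every segment are ignored are all hypothetical; only the four age segments come from the text.

```python
from typing import Iterable, List, Tuple

# Age segments from the example in the text: 17-21, 22-26, 27-31, 32-36 (inclusive).
AGE_SEGMENTS: List[Tuple[int, int]] = [(17, 21), (22, 26), (27, 31), (32, 36)]


def count_by_segment(values: Iterable[int],
                     segments: List[Tuple[int, int]]) -> List[int]:
    """Return index data (N_1, ..., N_n): the count of values in each segment.

    Values outside every segment are ignored (an assumption of this sketch).
    """
    counts = [0] * len(segments)
    for value in values:
        for i, (low, high) in enumerate(segments):
            if low <= value <= high:
                counts[i] += 1
                break
    return counts


# Hypothetical ages of members of the target cluster.
sample_ages = [18, 20, 23, 25, 26, 29, 33, 40]
index_data = count_by_segment(sample_ages, AGE_SEGMENTS)
print(index_data)  # [2, 3, 1, 1]
```

Here n = 4, and the age 40 falls outside all segments, so it does not contribute to any $N_i$.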
It should be noted that, in this embodiment, the dividing of each parameter segment and the determining manner of the index data are not specifically limited, and the parameter segments may be manually divided and recorded according to experience, or may be automatically divided and recorded according to big data analysis.
S120, according to the characteristic parameters of the selected characteristics, obtaining at least one comparison characteristic from the candidate comparison characteristics of the comparison cluster, and determining index data of each parameter segment in the comparison characteristics; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features.
At least one comparison feature is obtained from the candidate comparison features of the comparison cluster according to the feature parameter of the selected feature. The comparison cluster includes at least one candidate comparison feature; the worker can obtain the comparison feature from the candidate comparison features and determine the importance of the selected feature from the data of the comparison feature. The comparison cluster may contain data of the target cluster, or may be completely different from the data of the target cluster. For example, if the target cluster is data on the features of the female population, the comparison cluster may be data on the features of both the female and male populations, or data on the features of the male population only.
After the worker determines the comparison feature, the feature parameter of the comparison feature and the index data of each parameter segment are determined, where the feature parameter of the comparison feature is the same as that of the selected feature. Specifically, the index data of each parameter segment in the comparison feature is represented as $(N'_1, N'_2, \ldots, N'_n)$, where n is the number of segments into which the feature parameter of the comparison feature is divided. For example, if the selected feature is the high education level of the female population and the comparison feature is the high education level of the male population, the feature parameter of the selected feature is high education level with n = 2 segments, so the feature parameter of the comparison feature is also high education level with n = 2 segments; both are segmented into master's degree and doctoral degree. If the index data of the segments in the selected feature are the number of women with a master's degree, $N_1$, and the number of women with a doctoral degree, $N_2$, then the index data of the parameter segments in the comparison feature are the number of men with a master's degree, $N'_1$, and the number of men with a doctoral degree, $N'_2$. After the feature parameter of the comparison feature is segmented, the index data of each segment is recorded. It should be noted that this embodiment does not limit how the comparison feature is determined or how the index data is recorded; both may be done manually by a worker or automatically, for example through big data analysis.
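The requirement that the comparison feature use the same feature parameter and the same n segments as the selected feature can be illustrated with a short Python sketch. The helper name `index_data_for` and the sample records are hypothetical; the sketch assumes each record carries one categorical segment label and that a segment with no records has index data 0.

```python
from collections import Counter
from typing import Iterable, List, Sequence

# Segments of the education-level parameter, as in the example in the text.
DEGREE_SEGMENTS = ["master", "doctor"]


def index_data_for(records: Iterable[str], segments: Sequence[str]) -> List[int]:
    """Count records per segment, in segment order; absent segments count as 0."""
    counts = Counter(records)
    return [counts.get(segment, 0) for segment in segments]


# Hypothetical degree labels for members of each cluster.
target_records = ["master", "master", "doctor", "master"]
comparison_records = ["master", "doctor", "doctor"]

selected = index_data_for(target_records, DEGREE_SEGMENTS)        # (N_1, N_2)
comparison = index_data_for(comparison_records, DEGREE_SEGMENTS)  # (N'_1, N'_2)

# Both vectors have the same length n, so N_i lines up with N'_i.
assert len(selected) == len(comparison) == len(DEGREE_SEGMENTS)
print(selected, comparison)  # [3, 1] [1, 2]
```

Driving both counts from one shared segment list is a simple way to guarantee the one-to-one correspondence between segments that the later comparison relies on.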
S130, determining the important characteristics of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristics and the index data of each parameter segment in the comparison characteristics.
If the distribution of the index data $(N_1, N_2, \ldots, N_n)$ of the parameter segments in the selected feature differs greatly from the distribution of the index data $(N'_1, N'_2, \ldots, N'_n)$ of the parameter segments in the comparison feature, the selected feature is distributed differently from the comparison feature to a large extent. If the index data of the parameter segments in the selected feature are the same or approximately the same as those in the comparison feature, the distribution of the index data in the selected feature is not special or different from that of the comparison feature; the selected feature is relatively common and cannot serve as an important feature that represents the target cluster.
Optionally, the important features of the target cluster are determined according to differences of the index data of each corresponding parameter segment of the selected feature and the comparison feature.
Specifically, the index data of the parameter segments in the selected feature correspond one-to-one with the index data of the parameter segments in the comparison feature, that is, $N_1$ corresponds to $N'_1$ and $N_n$ corresponds to $N'_n$. The important features of the target cluster are determined according to the difference in distribution between $N_n$ and $N'_n$. For example, if the difference in distribution between $N_n$ in the selected feature and $N'_n$ in the comparison feature is large, the selected feature can be considered an important feature. Matching the index data of each parameter segment of the selected feature with that of the comparison feature improves the accuracy of judging the importance.
Optionally, if the distribution difference meets a preset standard, determining the currently selected feature as an important feature; and if the distribution difference does not meet the preset standard, determining the currently selected feature as the non-important feature.
Specifically, the worker may preset a distribution difference standard, and if the distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter distribution in the comparison feature conforms to the preset distribution difference standard, determine that the currently selected feature is an important feature; and if the distribution difference between the index data of each parameter segment in the selected characteristic and the index data of each parameter distribution in the comparison characteristic does not meet the preset standard, determining that the currently selected characteristic is the non-important characteristic.
For example, the selected feature is the high education level of the female population, the comparison feature is the high education level of the male population, and the preset distribution difference standard is that the index data of each parameter segment in the selected feature is at least three times the index data of the corresponding parameter segment in the comparison feature. If $N_1 \ge 3N'_1$ and $N_2 \ge 3N'_2$, the distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter segment in the comparison feature meets the preset distribution difference standard, the selected feature can be used as an important feature, and high education level is determined to be an important feature of the target cluster of the female population. If $N_1 < 3N'_1$ or $N_2 < 3N'_2$, the distribution difference does not meet the preset distribution difference standard, and high education level is a non-important feature of the target cluster of the female population. A preset distribution difference standard makes the judgment of importance more flexible, helps meet different business requirements, saves time and reduces workload.
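The three-times standard in this example can be written as a small predicate. The sketch below is illustrative only: the function name `meets_ratio_standard`, the `factor` parameter and the sample counts are assumptions, and the embodiment allows any preset standard, not just a ratio test.

```python
from typing import Sequence


def meets_ratio_standard(selected: Sequence[float],
                         comparison: Sequence[float],
                         factor: float = 3.0) -> bool:
    """True if every N_i >= factor * N'_i (the example standard in the text)."""
    if len(selected) != len(comparison):
        raise ValueError("selected and comparison must have the same number of segments")
    return all(n >= factor * n_prime for n, n_prime in zip(selected, comparison))


# Hypothetical counts (master's degree, doctoral degree) for the two clusters.
print(meets_ratio_standard([300, 90], [100, 30]))  # True: 300 >= 300 and 90 >= 90
print(meets_ratio_standard([300, 80], [100, 30]))  # False: 80 < 90
```

Because the standard is passed in as a parameter, a different business requirement can be expressed by changing `factor` or by swapping in another predicate.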
According to the technical scheme of the embodiment, the selected feature of the target cluster and the comparison feature of the comparison cluster are obtained, the difference between the distribution of each parameter segmentation index data in the selected feature and the distribution of each parameter segmentation index data in the comparison feature is compared, the importance degree of the selected feature is determined, and the important feature of the target cluster is obtained. The problem of low accuracy of important degree judgment in the prior art is solved, waste of time and resources is avoided, and the effect of improving the efficiency of important characteristic determination is achieved.
Example two
Fig. 2 is a schematic flow chart of a method for determining an important feature according to a second embodiment of the present invention, which is further optimized based on the above-mentioned embodiment. As shown in fig. 2, the method specifically includes the following steps:
S210, obtaining at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment.
S220, acquiring at least one contrast characteristic from the candidate contrast characteristics of the contrast cluster according to the characteristic parameters of the selected characteristic, and determining index data of each parameter segment in the contrast characteristics; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features.
And S230, determining the important features of the target clusters according to the difference of the index data of each corresponding parameter segment of the selected features and the comparison features.
The method comprises the steps of determining index data of each parameter segment of selected features and index data of each parameter segment of comparison features, calculating distribution difference of the index data of each corresponding parameter segment of the selected features and the index data of each corresponding parameter segment of the comparison features, and determining the importance degree of the selected features according to the difference value.
Optionally, the density distribution of the index data of each parameter segment of the selected feature is determined as $P_i$, and the density distribution of the index data of each parameter segment of the comparison feature is determined as $P'_i$; the important features of the target cluster are determined according to the difference in the density distributions of the corresponding parameter segments of the selected feature and the comparison feature.
Specifically, the density distribution $P_i$ of the index data of each parameter segment of the selected feature is the ratio of that segment's index data to the total index data of the feature parameter of the selected feature; the density distribution $P'_i$ of the index data of each parameter segment of the comparison feature is the ratio of that segment's index data to the total index data of the feature parameter of the comparison feature.
$P_i$ is calculated as follows:
$$P_i = \frac{N_i}{\mathrm{SumN}}$$
where SumN is the sum of $(N_1, N_2, \ldots, N_n)$, that is,
$$\mathrm{SumN} = \sum_{i=1}^{n} N_i$$
Figure BDA0002252679400000103
$P'_i$ is calculated as follows:
$$P'_i = \frac{N'_i}{\mathrm{SumN}'}$$
where SumN' is the sum of $(N'_1, N'_2, \ldots, N'_n)$, that is,
$$\mathrm{SumN}' = \sum_{i=1}^{n} N'_i$$
Figure BDA0002252679400000112
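The density distribution described above, each segment's index data divided by the total, can be sketched in a few lines of Python. The function name `density_distribution` and the sample index data are hypothetical, and rejecting a non-positive total is a choice of this sketch rather than something the text specifies.

```python
from typing import List, Sequence


def density_distribution(index_data: Sequence[float]) -> List[float]:
    """Return P_i = N_i / SumN for each segment i."""
    total = sum(index_data)  # SumN
    if total <= 0:
        raise ValueError("index data must have a positive sum")
    return [value / total for value in index_data]


# Hypothetical index data for a feature with n = 4 segments.
p = density_distribution([2, 3, 1, 4])
print(p)  # [0.2, 0.3, 0.1, 0.4]
```

The same function applies unchanged to the comparison feature, giving $P'_i$ from $(N'_1, \ldots, N'_n)$. Normalizing both sides this way lets clusters of different sizes be compared by shape rather than by raw counts.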
According to the density distribution $P_i$ of the index data of each parameter segment of the selected feature and the density distribution $P'_i$ of the index data of each parameter segment of the comparison feature, the difference in density distribution of each corresponding parameter segment of the selected feature and the comparison feature is determined.
It can be calculated by the following formula:
Figure BDA0002252679400000113
where Diff ∈ [0, 1]. Diff represents the difference value between the selected feature and the comparison feature, obtained by accumulating the differences in density distribution of the corresponding parameter segments. The importance degree of the selected feature is determined according to this difference value: the larger the difference value, the higher the importance degree of the selected feature. The difference values between each selected feature and its comparison feature can be sorted in ascending or descending order, and the one or more features with the largest difference values are selected as important features of the target cluster.
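The stated range of Diff can be checked with a short derivation. It assumes that Diff is half the sum of the absolute differences between corresponding densities, a form that reproduces the values 0.2814 and 0.2182 obtained in the worked example of this embodiment; the derivation itself is an added illustration rather than part of the original text.

```latex
\begin{aligned}
&P_i \ge 0,\quad P_i' \ge 0,\quad \sum_{i=1}^{n} P_i = \sum_{i=1}^{n} P_i' = 1 \\
&0 \le \sum_{i=1}^{n} \left| P_i - P_i' \right|
   \le \sum_{i=1}^{n} \left( P_i + P_i' \right) = 2 \\
&\Longrightarrow\quad 0 \le \mathrm{Diff}
   = \frac{1}{2} \sum_{i=1}^{n} \left| P_i - P_i' \right| \le 1
\end{aligned}
```

Under this assumption, Diff = 0 only when the two density distributions are identical, and Diff = 1 only when no parameter segment has non-zero index data in both clusters.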
For example, the target cluster is a first customer group and the comparison cluster is a second customer group. Two candidate features are obtained from the first customer group: customer point-in-time assets under management (AUM) asset composition and current-month customer star level. Two comparison features are obtained from the second customer group, namely customer point-in-time AUM asset composition and current-month customer star level, and the important feature is determined from the two candidate features.
Fig. 3 is a histogram of the customer point-in-time AUM asset composition of the first customer group, and Fig. 4 is a histogram of the customer point-in-time AUM asset composition of the second customer group. As shown in Figs. 3 and 4, the candidate feature of the first customer group, customer point-in-time AUM asset composition, has 8 segments (n = 8): demand deposits, fixed deposits, funds, wealth management products, national bonds, precious metals, insurance and bank deposit management. The index data of the parameter segments are 5502349.00, 6599789.43, 807236.14, 3263360.27, 596134.54, 55782.56, 1213064.74 and 217064.13 (unit: ten thousand yuan), respectively. The comparison feature of the second customer group, customer point-in-time AUM asset composition, is divided into the same 8 segments as the candidate feature, and its index data are 6159916.96, 4663196.89, 3078798.3, 9086699.81, 543322.81, 245518.29, 3091689.61 and 576867.38 (unit: ten thousand yuan), respectively.
For the candidate feature in the first customer group, $\mathrm{SumN}_{\text{asset}}$ = 5502349 + 6599789.43 + 807236.14 + 3263360.27 + 596134.54 + 55782.56 + 1213064.74 + 217064.13 = 18254780.81 ten thousand yuan, and $(P_{1,\text{asset}}, P_{2,\text{asset}}, \ldots, P_{8,\text{asset}})$ = (0.3014, 0.3615, 0.0442, 0.1788, 0.0327, 0.0031, 0.0665, 0.0119).
For the comparison feature in the second customer group, $\mathrm{SumN}'_{\text{asset}}$ = 6159916.96 + 4663196.89 + 3078798.3 + 9086699.81 + 543322.81 + 245518.29 + 3091689.61 + 576867.38 = 27446010.05 ten thousand yuan, and $(P'_{1,\text{asset}}, P'_{2,\text{asset}}, \ldots, P'_{8,\text{asset}})$ = (0.2244, 0.1699, 0.1122, 0.3311, 0.0198, 0.0089, 0.1126, 0.0210).
$$\mathrm{Diff}_{\text{asset}} = \frac{1}{2}\sum_{i=1}^{8}\left|P_{i,\text{asset}} - P'_{i,\text{asset}}\right| = 0.2814$$
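The asset composition calculation can be reproduced from the raw segment amounts with the following sketch. The helper names are hypothetical, and the Diff formula used here (half the sum of absolute density differences) is an assumption chosen because it matches the difference value of 0.2814 reported for this feature.

```python
from typing import List, Sequence


def density_distribution(index_data: Sequence[float]) -> List[float]:
    """P_i = N_i / SumN for each parameter segment."""
    total = sum(index_data)
    return [n / total for n in index_data]


def distribution_diff(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed Diff: half the sum of |P_i - P_i'| over corresponding segments."""
    if len(p) != len(q):
        raise ValueError("both features must have the same number of segments")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Segment amounts in ten thousand yuan, copied from the example above.
first_group_aum = [5502349.00, 6599789.43, 807236.14, 3263360.27,
                   596134.54, 55782.56, 1213064.74, 217064.13]
second_group_aum = [6159916.96, 4663196.89, 3078798.3, 9086699.81,
                    543322.81, 245518.29, 3091689.61, 576867.38]

diff_asset = distribution_diff(density_distribution(first_group_aum),
                               density_distribution(second_group_aum))
print(f"Diff (AUM asset composition) = {diff_asset:.3f}")
```

Working from unrounded proportions gives a value that agrees with the published 0.2814 to about three decimal places; the published figure is consistent with densities that were first rounded to four decimals.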
Fig. 5 is a histogram of the current-month customer star level of the first customer group, and Fig. 6 is a histogram of the current-month customer star level of the second customer group. As shown in Figs. 5 and 6, the candidate feature of the first customer group, current-month customer star level, is divided into 8 segments: 0 stars, 1 star, 2 stars, 3 stars, 4 stars, 5 stars, 6 stars and 7 stars. The index data of the parameter segments are 670543, 2221803, 756165, 340931, 226388, 121923, 1974 and 472 people, respectively.
For the candidate feature in the first customer group, $\mathrm{SumN}_{\text{star}}$ = 670543 + 2221803 + 756165 + 340931 + 226388 + 121923 + 1974 + 472 = 4380199 people, and $(P_{1,\text{star}}, P_{2,\text{star}}, \ldots, P_{8,\text{star}})$ = (0.1531, 0.5072, 0.1726, 0.0778, 0.0517, 0.0278, 0.0005, 0.0001).
The comparison feature of the second customer group, current-month customer star level, is divided into the same 8 segments as the candidate feature, and the index data of the parameter segments are 146887, 428009, 278558, 162541, 152535, 108123, 7147 and 3113 people, respectively.
For the comparison feature in the second customer group, $\mathrm{SumN}'_{\text{star}}$ = 146887 + 428009 + 278558 + 162541 + 152535 + 108123 + 7147 + 3113 = 1286913 people, and $(P'_{1,\text{star}}, P'_{2,\text{star}}, \ldots, P'_{8,\text{star}})$ = (0.1141, 0.3326, 0.2165, 0.1263, 0.1185, 0.0840, 0.0056, 0.0024).
$$\mathrm{Diff}_{\text{star}} = \frac{1}{2}\sum_{i=1}^{8}\left|P_{i,\text{star}} - P'_{i,\text{star}}\right| = 0.2182$$
$\mathrm{Diff}_{\text{asset}}$ and $\mathrm{Diff}_{\text{star}}$ are then sorted, giving 0.2814 > 0.2182. Because $\mathrm{Diff}_{\text{asset}} > \mathrm{Diff}_{\text{star}}$, between the two customer groups the candidate feature customer point-in-time AUM asset composition is more representative than current-month customer star level and is the more important feature of the customer group. The importance degree of each candidate feature is thereby expressed as a directly comparable value; calculating it improves the accuracy of the judgment, makes it easier to compile statistics on the data of each candidate feature, reduces the workload and improves the efficiency of determining important features.
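The ranking step can be illustrated with the four-decimal density vectors published in this example. The sketch below is a minimal illustration under the assumption that Diff is half the sum of absolute density differences; the dictionary layout and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def distribution_diff(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed Diff: half the sum of |P_i - P_i'| over corresponding segments."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# (density of the first customer group, density of the second customer group)
published_densities: Dict[str, Tuple[List[float], List[float]]] = {
    "AUM asset composition": (
        [0.3014, 0.3615, 0.0442, 0.1788, 0.0327, 0.0031, 0.0665, 0.0119],
        [0.2244, 0.1699, 0.1122, 0.3311, 0.0198, 0.0089, 0.1126, 0.0210],
    ),
    "customer star level": (
        [0.1531, 0.5072, 0.1726, 0.0778, 0.0517, 0.0278, 0.0005, 0.0001],
        [0.1141, 0.3326, 0.2165, 0.1263, 0.1185, 0.0840, 0.0056, 0.0024],
    ),
}

# Score every candidate feature, then sort with the largest difference first.
ranked = sorted(
    ((name, distribution_diff(p, q)) for name, (p, q) in published_densities.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, diff in ranked:
    print(f"{name}: {diff:.4f}")
```

Sorting in descending order puts the most distinctive feature first, so selecting the top one or more entries corresponds to choosing the features with the largest difference values.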
In this embodiment of the invention, candidate features and comparison features are obtained; the ratio of the index data of each parameter segment to the total data is calculated for both the candidate features and the comparison features; the difference in density distribution of each corresponding parameter segment of the selected feature and the comparison feature is determined from these ratios; and the difference values are sorted, a candidate feature with a larger difference value having a higher importance degree. This allows the importance degree of each candidate feature to be calculated accurately, improves the accuracy of the importance judgment, avoids repeated calculation of the candidate features and reduces the workload.
Example three
Fig. 7 is a block diagram of a device for determining an important feature according to a third embodiment of the present invention, which is capable of executing a method for determining an important feature according to any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 7, the apparatus specifically includes:
a selected feature obtaining module 701, configured to obtain at least one selected feature from candidate features of a target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
a comparison feature obtaining module 702, configured to obtain at least one comparison feature from the candidate comparison features of the comparison cluster according to the feature parameter of the selected feature, and determine index data of each parameter segment in the comparison features; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
the important feature determining module 703 is configured to determine an important feature of the target cluster according to a distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter segment in the comparison feature.
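One way to picture the three modules is as three small functions operating on a shared feature record. The sketch below is an assumption-laden illustration: the `Feature` class, the dictionary-based clusters, the toy numbers and the half-sum-of-absolute-differences score are all hypothetical choices, not the structure prescribed by the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Feature:
    parameter: str                  # feature parameter, e.g. "star level"
    segments: Tuple[str, ...]       # labels of the parameter segments
    index_data: Tuple[float, ...]   # N_1 .. N_n, one value per segment


def get_selected_features(target: Dict[str, Feature], names: List[str]) -> List[Feature]:
    """Module 701: pick selected features from the target cluster's candidates."""
    return [target[name] for name in names]


def get_comparison_feature(comparison: Dict[str, Feature], selected: Feature) -> Feature:
    """Module 702: fetch the comparison feature with the same feature parameter."""
    feature = comparison[selected.parameter]
    if feature.segments != selected.segments:
        raise ValueError("comparison feature must use the same parameter segments")
    return feature


def distribution_diff(a: Feature, b: Feature) -> float:
    """Assumed score for module 703: half the summed absolute density differences."""
    sum_a, sum_b = sum(a.index_data), sum(b.index_data)
    return 0.5 * sum(abs(x / sum_a - y / sum_b)
                     for x, y in zip(a.index_data, b.index_data))


def rank_features(target: Dict[str, Feature], comparison: Dict[str, Feature],
                  names: List[str]) -> List[Tuple[str, float]]:
    """Module 703: score each selected feature, largest difference first."""
    scored = [(f.parameter, distribution_diff(f, get_comparison_feature(comparison, f)))
              for f in get_selected_features(target, names)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy clusters with made-up numbers, only to exercise the three modules.
target_cluster = {
    "channel": Feature("channel", ("app", "branch"), (80, 20)),
    "age band": Feature("age band", ("<40", ">=40"), (50, 50)),
}
comparison_cluster = {
    "channel": Feature("channel", ("app", "branch"), (40, 60)),
    "age band": Feature("age band", ("<40", ">=40"), (45, 55)),
}
ranking = rank_features(target_cluster, comparison_cluster, ["channel", "age band"])
print(ranking)
```

Checking that the segment labels match before scoring mirrors the requirement that the comparison feature share the feature parameter of the selected feature, so that densities are compared segment by segment.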
Optionally, the index data of each parameter segment in the selected feature includes:
the index data of each parameter segment in the selected feature is expressed as $(N_1, N_2, \ldots, N_n)$, where n is the number of segments into which the feature parameter of the selected feature is divided;
the index data of each parameter segment in the comparison feature includes:
the index data of each parameter segment in the comparison feature is expressed as $(N_1', N_2', \ldots, N_n')$, where n is the number of segments into which the feature parameter of the comparison feature is divided;
accordingly, the important feature determining module 703 includes:
a parameter segment corresponding unit, configured to determine the important features of the target cluster according to the difference between the index data of each corresponding parameter segment of the selected feature and the comparison feature.
Optionally, the parameter segment corresponding unit is specifically configured to:
determine the density distribution of the index data of each parameter segment of the selected feature as $P_i$, and determine the density distribution of the index data of each parameter segment of the comparison feature as $P_i'$;
and determine the important features of the target cluster according to the difference in density distribution of each corresponding parameter segment of the selected feature and the comparison feature.
Optionally, the following formula is used to calculate the difference in density distribution of each corresponding parameter segment of the selected feature and the comparison feature:
$$\mathrm{Diff} = \frac{1}{2}\sum_{i=1}^{n}\left|P_i - P_i'\right|$$
where Diff ∈ [0, 1];
Diff represents the difference value between the selected feature and the comparison feature, obtained by accumulating the differences in density distribution of the corresponding parameter segments;
$$P_i = \frac{N_i}{\mathrm{SumN}}$$
SumN is the sum of $(N_1, N_2, \ldots, N_n)$;
$$P_i' = \frac{N_i'}{\mathrm{SumN}'}$$
SumN' is the sum of $(N_1', N_2', \ldots, N_n')$.
Optionally, the important characteristic determining module 703 is further specifically configured to:
if the distribution difference meets a preset standard, determining the currently selected feature as an important feature; and if the distribution difference does not meet the preset standard, determining the currently selected feature as the non-important feature.
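The text does not say what the preset standard is. One possible reading, sketched below purely as an assumption, is a minimum Diff value that a selected feature must reach to count as important. The threshold of 0.25 is hypothetical; the two Diff values are the ones obtained for the customer group example in the method embodiment.

```python
def is_important(diff: float, threshold: float) -> bool:
    """Treat a feature as important when its Diff reaches the preset standard."""
    return diff >= threshold


DIFF_THRESHOLD = 0.25  # hypothetical preset standard, not specified in the patent

# Difference values from the customer group example.
diffs = {"AUM asset composition": 0.2814, "customer star level": 0.2182}

labels = {
    name: "important" if is_important(value, DIFF_THRESHOLD) else "non-important"
    for name, value in diffs.items()
}
print(labels)
```

A fixed threshold classifies each selected feature independently, whereas the sorting approach always returns the top-ranked features regardless of how large their difference values are.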
According to the embodiment of the invention, the selected characteristic of the target cluster and the comparison characteristic of the comparison cluster are obtained, and the difference comparison is carried out on the distribution of each parameter segmentation index data in the selected characteristic and the distribution of each parameter segmentation index data in the comparison characteristic, so that the important characteristic of the target cluster is determined, the determination efficiency of the important characteristic is improved, the time is saved, and the resource overhead is reduced.
Example four
Fig. 8 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 8 illustrates a block diagram of an exemplary computer device 800 suitable for use in implementing embodiments of the present invention. The computer device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present invention.
As shown in fig. 8, computer device 800 is in the form of a general purpose computing device. The components of computer device 800 may include, but are not limited to: one or more processors or processing units 801, a system memory 802, and a bus 803 that couples various system components including the system memory 802 and the processing unit 801.
Bus 803 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 800 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 800 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 802 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)804 and/or cache memory 805. The computer device 800 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 806 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, and commonly referred to as a "hard drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 803 by one or more data media interfaces. Memory 802 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 808 having a set (at least one) of program modules 807 may be stored, for instance, in memory 802, such program modules 807 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may include an implementation of a network environment. Program modules 807 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.
The computer device 800 may also communicate with one or more external devices 809 (e.g., keyboard, pointing device, display 810, etc.), with one or more devices that enable a user to interact with the computer device 800, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 800 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 811. Moreover, computer device 800 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet) via network adapter 812. As shown, the network adapter 812 communicates with the other modules of the computer device 800 via the bus 803. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 801 executes various functional applications and data processing by running a program stored in the system memory 802, for example, to implement a method for determining an important feature provided by an embodiment of the present invention, including:
obtaining at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
according to the characteristic parameters of the selected characteristics, at least one comparison characteristic is obtained from the candidate comparison characteristics of the comparison cluster, and index data of each parameter segment in the comparison characteristics is determined; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
and determining the important characteristics of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristics and the index data of each parameter segment in the comparison characteristics.
Example five
The fifth embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the storage medium stores a computer program, and when the computer program is executed by a processor, the method for determining an important feature according to the fifth embodiment of the present invention is implemented, where the method includes:
obtaining at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
according to the characteristic parameters of the selected characteristics, at least one comparison characteristic is obtained from the candidate comparison characteristics of the comparison cluster, and index data of each parameter segment in the comparison characteristics is determined; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
and determining the important characteristics of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristics and the index data of each parameter segment in the comparison characteristics.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for determining a significant feature, comprising:
obtaining at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
according to the characteristic parameters of the selected characteristics, at least one comparison characteristic is obtained from candidate comparison characteristics of a comparison cluster, and index data of each parameter segment in the comparison characteristics is determined; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
and determining the important characteristics of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristics and the index data of each parameter segment in the comparison characteristics.
2. The method of claim 1, wherein the metric data for each parameter segment in the selected features comprises:
the index data of each parameter segment in the selected feature is expressed as $(N_1, N_2, \ldots, N_n)$, wherein n is the number of segments into which the feature parameter of the selected feature is divided;
the index data of each parameter segment in the comparison feature comprises:
the index data of each parameter segment in the comparison feature is expressed as $(N_1', N_2', \ldots, N_n')$, wherein n is the number of segments into which the feature parameter of the comparison feature is divided;
correspondingly, the determining the important feature of the target cluster according to the distribution difference between the index data of each parameter segment in the selected feature and the index data of each parameter segment in the comparison feature includes:
and determining the important features of the target clusters according to the difference of the index data of each corresponding parameter segment of the selected features and the comparison features.
3. The method of claim 2, wherein determining the important features of the target cluster according to the difference of the index data of each corresponding parameter segment of the selected feature and the comparison feature comprises:
determining the density distribution of the index data of each parameter segment of the selected feature as $P_i$, and determining the density distribution of the index data of each parameter segment of the comparison feature as $P_i'$;
and determining the important features of the target cluster according to the difference in density distribution of the corresponding parameter segments of the selected feature and the comparison feature.
4. A method according to claim 3, wherein the difference in density distribution of each corresponding parameter segment of the selected and contrast features is calculated using the formula:
$$\mathrm{Diff} = \frac{1}{2}\sum_{i=1}^{n}\left|P_i - P_i'\right|$$
wherein Diff ∈ [0, 1];
wherein Diff represents the difference value between the selected feature and the comparison feature, obtained by accumulating the differences in density distribution of the corresponding parameter segments;
$$P_i = \frac{N_i}{\mathrm{SumN}}$$
SumN is the sum of $(N_1, N_2, \ldots, N_n)$;
$$P_i' = \frac{N_i'}{\mathrm{SumN}'}$$
SumN' is the sum of $(N_1', N_2', \ldots, N_n')$.
5. The method of claim 1, wherein determining the important features of the target cluster according to a difference in distribution of index data of each parameter segment in the selected features and index data of each parameter segment in the comparison features further comprises:
if the distribution difference meets a preset standard, determining the currently selected feature as an important feature; and if the distribution difference does not meet the preset standard, determining the currently selected feature as the non-important feature.
6. An apparatus for determining an important feature, comprising:
a selected feature obtaining module, configured to obtain at least one selected feature from the candidate features of the target cluster; the selected features comprise feature parameters of the selected features and index data of each parameter segment;
a comparison feature obtaining module, configured to obtain at least one comparison feature from candidate comparison features of a comparison cluster according to a feature parameter of the selected feature, and determine index data of each parameter segment in the comparison features; wherein the characteristic parameters of the comparison features are the same as the characteristic parameters of the selected features;
and the important characteristic determining module is used for determining the important characteristic of the target cluster according to the distribution difference of the index data of each parameter segment in the selected characteristic and the index data of each parameter segment in the comparison characteristic.
7. The apparatus of claim 6, wherein the metric data for each parameter segment in the selected features comprises:
the index data of each parameter segment in the selected feature is expressed as $(N_1, N_2, \ldots, N_n)$, wherein n is the number of segments into which the feature parameter of the selected feature is divided;
the index data of each parameter segment in the comparison feature comprises:
the index data of each parameter segment in the comparison feature is expressed as $(N_1', N_2', \ldots, N_n')$, wherein n is the number of segments into which the feature parameter of the comparison feature is divided;
accordingly, the significant feature determination module includes:
and the parameter segment corresponding unit is used for determining the important features of the target cluster according to the difference of the index data of each corresponding parameter segment of the selected features and the comparison features.
8. The apparatus according to claim 7, wherein the parameter segment correspondence unit is specifically configured to:
determine the density distribution of the index data of each parameter segment of the selected feature as $P_i$, and determine the density distribution of the index data of each parameter segment of the comparison feature as $P_i'$;
And determining the important features of the target clusters according to the difference of the density distribution of the corresponding parameter segments of the selected features and the comparison features.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, carries out the method of determining an important feature according to any one of claims 1 to 5.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method for determining an important feature according to any one of claims 1 to 5.
CN201911040435.8A 2019-10-29 2019-10-29 Method, device and equipment for determining important features and storage medium Pending CN110796492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911040435.8A CN110796492A (en) 2019-10-29 2019-10-29 Method, device and equipment for determining important features and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911040435.8A CN110796492A (en) 2019-10-29 2019-10-29 Method, device and equipment for determining important features and storage medium

Publications (1)

Publication Number Publication Date
CN110796492A true CN110796492A (en) 2020-02-14

Family

ID=69441895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911040435.8A Pending CN110796492A (en) 2019-10-29 2019-10-29 Method, device and equipment for determining important features and storage medium

Country Status (1)

Country Link
CN (1) CN110796492A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898027A (en) * 2020-08-06 2020-11-06 北京字节跳动网络技术有限公司 Method, device, electronic equipment and computer readable medium for determining feature dimension

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014014678A1 (en) * 2012-07-19 2014-01-23 Qualcomm Incorporated Feature extraction and use with a probability density function and divergence|metric
US20170344620A1 (en) * 2016-05-27 2017-11-30 Adobe Systems Incorporated Feature Summarization Filter With Applications Using Data Analytics
CN109947991A (en) * 2017-10-31 2019-06-28 腾讯科技(深圳)有限公司 A kind of extraction method of key frame, device and storage medium
CN110059749A (en) * 2019-04-19 2019-07-26 成都四方伟业软件股份有限公司 Screening technique, device and the electronic equipment of important feature
CN110322274A (en) * 2019-05-30 2019-10-11 深圳壹账通智能科技有限公司 Crowd portrayal generation method, device and computer equipment based on data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何先军: "《Excel数据处理与分析应用大全》", 30 September 2019 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220915

Address after: 12/F, 15/F, 99 Yincheng Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Applicant after: Jianxin Financial Science and Technology Co.,Ltd.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200214
