CN109582741B - Feature data processing method and device - Google Patents

Feature data processing method and device

Publication number: CN109582741B (granted publication of application CN201811359743.2A; earlier publication CN109582741A)
Authority: CN (China)
Legal status: Active
Original language: Chinese (zh)
Inventors: 刘松吟, 董扬
Assignee: Advanced New Technologies Co Ltd
Abstract

The embodiments of this specification disclose a feature data processing method and device. The method comprises the following steps: determining outlier data in a specified feature of a sample set; scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data remains larger than the non-outlier data in the specified feature of the sample set before scaling; clustering the scaled sample set; and, based on the clusters obtained by the clustering, separately normalizing the specified feature data of the scaled sample set within the specified feature interval corresponding to each cluster.

Description

Feature data processing method and device
Technical Field
Embodiments of the present disclosure relate to the field of data processing, and in particular, to a method and apparatus for processing feature data.
Background
With the continuous development of the internet, users generate more and more feature data in the course of internet use. This feature data can be widely used and converted into useful information; for example, a user access loyalty score, a user value score, or a browsing-section stickiness score can be derived from feature data such as a user's fund purchase amount, number of purchases, and section browsing records. These score values can provide a reference basis for product operations and can also serve as discretized data for model training.
In the process of scoring users, much feature data, such as users' fund purchase amounts, is found to follow an obvious long-tail distribution: the purchase amounts of a large number of users are concentrated in a small head interval, while the purchase amounts of a small number of users are far greater than the average. The latter can be called outlier data.
When feature data with an obvious long-tail distribution is normalized using existing techniques, the distribution of the normalized feature data is still long-tailed, so the normalized values are concentrated in a very small value range, the degree of distinction between them remains very small, and users cannot be evaluated intuitively and reasonably.
Disclosure of Invention
The embodiments of this specification provide a feature data processing method and device to solve the problem of low discrimination of normalized feature data caused by the long-tail distribution of the feature data.
The embodiments of this specification adopt the following technical solutions:
in a first aspect, a feature data processing method is provided, including:
determining outlier data in the specified features of the sample set;
scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling;
clustering the scaled sample set;
and, based on the clusters obtained by the clustering, separately normalizing the specified feature data of the scaled sample set within the specified feature interval corresponding to each cluster.
In a second aspect, there is provided a feature data processing apparatus comprising:
an outlier data determination module that determines outlier data in a specified feature of a sample set;
an outlier data scaling module that scales the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling;
a clustering processing module that clusters the scaled sample set;
and a normalization processing module that, based on the clusters obtained by the clustering, separately normalizes the specified feature data of the scaled sample set within the specified feature interval corresponding to each cluster.
In a third aspect, an electronic device is provided, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor performing the operations of:
determining outlier data in the specified features of the sample set;
scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling;
clustering the scaled sample set;
and, based on the clusters obtained by the clustering, separately normalizing the specified feature data of the scaled sample set within the specified feature interval corresponding to each cluster.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following operations:
determining outlier data in the specified features of the sample set;
scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling;
clustering the scaled sample set;
and, based on the clusters obtained by the clustering, separately normalizing the specified feature data of the scaled sample set within the specified feature interval corresponding to each cluster.
The at least one technical solution adopted by the embodiments of this specification can achieve the following beneficial effects: determining the outlier data in the specified feature of the sample set and scaling it improves the discrimination of the normalized feature data; meanwhile, the scaled outlier data is still larger than the non-outlier data, so the difference between them is preserved; in addition, the embodiments of this specification use clustering to obtain a plurality of clusters and normalize the specified feature data within the specified feature interval corresponding to each cluster, which fully reflects the aggregation pattern of the users and further improves the discrimination of the normalized feature data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a method for processing feature data according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a feature data processing method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a feature data processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a hardware structure of an electronic device for implementing various embodiments of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As shown in FIG. 1, an embodiment of the present disclosure provides a feature data processing method 100 to solve the problem of low discrimination of normalized feature data caused by the long-tail distribution of feature data, where the embodiment 100 includes the following steps:
S102: outlier data in the specified features of the sample set is determined.
The sample set mentioned in the embodiments of this specification may be a sample set of users, or a sample set of another kind (such as animals or plants); the following description takes a user sample set as an example.
The user sample set typically includes a large number of user samples. For each user sample, features of multiple dimensions are typically included, such as the user's age, occupation, amount of funds purchased, number of funds purchased, amount purchased online, number of website logins, age of account registration, etc.
The specified feature mentioned in the embodiments of this specification may be one of the above multi-dimensional features — for example, the user's fund purchase amount, or the user's online purchase amount.
The specified feature generally contains outlier data. For example, if the specified feature is the user's fund purchase amount (possibly the total amount), the fund purchase amounts of a large number of user samples may all be distributed in the interval 0–10,000 yuan, while the fund purchase amounts of a very small number of user samples far exceed this interval. The embodiments of this specification can determine feature data that differs greatly from the majority of the feature data as outlier data — for example, determining fund purchase amounts greater than 10,000 as outlier data.
Alternatively, the embodiments of this specification may employ Chebyshev's theorem, or the maximum–minimum method of a box plot, to determine the outlier data in the specified feature of the sample set.
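The mean-plus-m-standard-deviations rule used here can be sketched as follows (a minimal illustration; the function name and sample values are hypothetical, not taken from the patent):

```python
import numpy as np

def find_outliers(values, m=3):
    """Flag values lying more than m standard deviations above the mean.

    Only the upper tail is flagged, since features such as fund purchase
    amount are non-negative and long-tailed on the right.
    """
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()
    return values > mu + m * sigma

# Most purchase amounts sit in a narrow head; one user is far out.
amounts = np.array([100, 200, 150, 300, 250, 180, 220, 1_000_000])
mask = find_outliers(amounts, m=2)  # flags only the 1,000,000 value
```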
S104: scaling the outlier data in the sample set to obtain a scaled sample set, wherein the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling.
As mentioned above, outlier data is generally much larger than non-outlier data, so the outlier data can be scaled down, i.e., its values reduced, while the non-outlier data is left unprocessed, thereby obtaining the scaled sample set.
It should be noted that in the embodiments of this specification, outlier data and non-outlier data both refer to the specified feature described above; the other feature data of the sample set is generally not processed.
In the embodiments of this specification, the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling. For example, if the feature data of users' fund purchase amounts in the interval 0–10,000 is called non-outlier data and the feature data above 10,000 is called outlier data, then after scaling an outlier value is still above 10,000, but usually much smaller than before scaling.
In addition, for outlier data of different sizes, the order relationship after scaling preserves the order before scaling — for example, the scaled value of the outlier 20,000 is still smaller than the scaled value of 40,000 — so the differences between data are preserved.
By scaling the outlier data, the long-tail effect can be greatly reduced; meanwhile, since the scaled outlier data is larger than the non-outlier data in the specified feature of the sample set before scaling, the differences between data are preserved.
Alternatively, the embodiments of this specification may predetermine a scaling coefficient, which may employ logarithmic processing, and then scale the outlier data based on that coefficient.
S106: clustering the scaled sample set.
In the embodiments of this specification, any effective clustering algorithm — including the k-means algorithm, the expectation–maximization (EM) algorithm, the density-based clustering algorithm DBSCAN, and the like — may be used to cluster the scaled sample set into a plurality of clusters. Each cluster generally contains multiple user samples with high mutual similarity; for example, the user samples in the same cluster have similar user value, similar website browsing-section stickiness, or similar access loyalty.
When clustering the scaled sample set, this step may use not only the specified feature but also a plurality of other features. For example, if the specified feature is the user's fund purchase amount, the clustering may also use the user's age, occupation, number of fund purchases, number of logins, registration years, and other features, thereby improving the accuracy of the clustering result — that is, ensuring high similarity among users within a cluster and weak similarity between user samples in different clusters.
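As a toy illustration of one of the clustering algorithms named above, a minimal one-dimensional k-means might look like the following (function name and data are illustrative only, not from the patent):

```python
import numpy as np

def kmeans_1d(values, k, iters=20, seed=0):
    """Minimal 1-D k-means: returns (centers sorted ascending, label per value)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    centers = rng.choice(values, size=k, replace=False)  # random initial centers
    for _ in range(iters):
        # Assign each value to its nearest center, then recompute the centers.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    order = np.argsort(centers)          # sort clusters by center point,
    remap = np.empty(k, dtype=int)       # as the later interval step expects
    remap[order] = np.arange(k)
    return centers[order], remap[labels]

# Two well-separated purchase-amount groups end up in two clusters.
data = np.array([1.0, 1.2, 0.9, 1.1, 10.0, 10.5, 9.8, 10.2])
centers, labels = kmeans_1d(data, k=2)
```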
S108: and respectively carrying out normalization processing on the appointed characteristic data of the scaled sample set in the appointed characteristic interval corresponding to each cluster based on the clusters after the clustering processing.
Step S106 clusters the scaled sample set into a plurality of clusters, each generally containing multiple user samples. This step may first obtain the center point of each cluster, then sort the clusters in ascending order of center point, and fine-tune the boundaries of the sorted clusters to obtain a plurality of specified feature intervals. This step should ensure that the boundary divisions have clear meaning and are easy to read and understand.
All the specified feature data then falls into its corresponding specified feature interval according to the new boundaries; in general, one cluster corresponds to one specified feature interval. The specified feature data mentioned here includes not only the aforementioned non-outlier data but also the data obtained by scaling the outlier data.
By obtaining a plurality of specified feature intervals through clustering, the embodiments of this specification divide the user samples in the sample set into a plurality of intervals, which fully reflects the aggregation pattern of the users and improves the discrimination between the users' specified feature data.
After the specified feature intervals are obtained through the above operations, the specified feature data falling into each interval can be normalized. For example, if 5 specified feature intervals are obtained, then for each interval, all the specified feature data falling into it can be normalized.
This step may specifically use max–min standardization, z-score standardization, equal-frequency binning, or the like to normalize the specified feature data.
Alternatively, as an embodiment, this step may normalize the specified feature data in the specified feature interval corresponding to each cluster based on the following formula (the original formula figure is not reproduced; the form below is reconstructed from the variable definitions and the score ranges described below):

x̃_i^j = (x_i − y_min^j) / (y_max^j − y_min^j) + (j − 1)

wherein j denotes the number of the specified feature interval corresponding to each cluster, j takes a value between 1 and k, and k is the total number of specified feature intervals;
x_i denotes the i-th specified feature value before normalization in the j-th specified feature interval;
x̃_i^j denotes the i-th specified feature value after normalization in the j-th specified feature interval;
y_max^j denotes the maximum value of the j-th specified feature interval;
y_min^j denotes the minimum value of the j-th specified feature interval.
Through the above formula, a score for the user samples of each specified feature interval is obtained at the same time as the normalization: the scores of the user samples in the 1st specified feature interval lie between 0 and 1; those in the 2nd interval between 1 and 2; …; and those in the k-th interval between k−1 and k.
The score can represent user value, user loyalty, and the like. It can be used directly as reference data for operations, or as input for model training or inference.
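The per-interval scores described above can be sketched as follows (a simple illustration with assumed names, not the patent's own code):

```python
def interval_score(x, boundaries):
    """Max-min normalize x inside its interval, offset by the interval index.

    boundaries = [y0, y1, ..., yk]; a value in the j-th interval
    [y_{j-1}, y_j] maps into [j-1, j], so the k interval scores together
    cover [0, k] and the interval is readable from the integer part.
    """
    for j in range(1, len(boundaries)):
        lo, hi = boundaries[j - 1], boundaries[j]
        if lo <= x <= hi:
            return (j - 1) + (x - lo) / (hi - lo)
    raise ValueError("value outside all intervals")

bounds = [0, 1_000, 10_000, 100_000]   # k = 3 specified feature intervals
score = interval_score(5_500, bounds)  # midpoint of the 2nd interval -> 1.5
```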
According to the feature data processing method provided by the embodiments of this specification, determining the outlier data in the specified feature of the sample set and scaling it improves the discrimination of the normalized feature data; meanwhile, the scaled outlier data is still larger than the non-outlier data, so the difference between them is preserved; in addition, the embodiments of this specification use clustering to obtain a plurality of clusters and normalize the specified feature data within the specified feature interval corresponding to each cluster, which fully reflects the aggregation pattern of the users and further improves the discrimination of the normalized feature data.
The embodiments of this specification are equivalent to scaling the feature data of the specified feature twice: the first time, only the outlier data is scaled, which removes the long-tail characteristic while still maintaining the differences between the feature data; the second time, all the feature data (including the non-outlier data and the scaled outlier data) undergoes normalized scaling, which improves the discrimination of the normalized feature data.
Regarding the improved discrimination of normalized feature data: for example, two users with total fund purchases of 1,000 yuan and 100,000 yuan obviously belong to two categories, but if the whole sample set contains a user with a total purchase of 10,000,000, then under the influence of this outlier the two users are likely to be wrongly classified into one category — that is, the distinction between 1,000 yuan and 100,000 yuan becomes small, which is obviously unreasonable.
Traditional ways of normalizing data include data binning, e.g., equal-width binning. With a fixed number of bins, equal-width binning divides the range between the minimum and maximum of the data set into bins of equal width, and is therefore sensitive to outlier data: under a long-tail distribution the bin width is large, the head bins contain most of the data while the tail bins are essentially empty, the discrimination of data in the head bins cannot be guaranteed, and 1,000 yuan and 100,000 yuan are likely to be placed in the same bin. The embodiments of this specification instead predetermine the outlier data in the specified feature of the sample set and scale it, which reduces the influence of the long-tail distribution as much as possible and improves the discrimination of the normalized feature data.
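The equal-width-binning failure mode just described can be demonstrated numerically (a hypothetical sketch, not code from the patent):

```python
import numpy as np

def equal_width_bin(values, n_bins):
    """Assign each value to one of n_bins equal-width bins over [min, max]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    width = (hi - lo) / n_bins
    idx = ((values - lo) // width).astype(int)
    return np.clip(idx, 0, n_bins - 1)  # the maximum lands in the last bin

# A single 10,000,000 outlier drags 1,000 and 100,000 into the same head bin.
amounts = np.array([1_000, 100_000, 10_000_000])
bins = equal_width_bin(amounts, n_bins=5)
```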
The embodiments of this specification adopt a method of scaling outlier data and then clustering, which eliminates the influence of the long-tail distribution while fully reflecting the aggregation pattern of users — for example, clustering the two users with totals of 1,000 yuan and 100,000 yuan into two categories, thereby maintaining the distinction between the data.
In step S102 of the above embodiment 100, the determination of outlier data in a specified feature of a sample set may specifically be performed by the following method:
determining the mean and standard deviation of the specified feature of the sample set;
determining, as outlier data, the feature data in the specified feature of the sample set that lies outside m standard deviations of the mean, where m is a positive number.
Considering that feature data is generally positive, the feature data lying outside m standard deviations of the mean may specifically be the feature data greater than the mean plus m standard deviations.
This method of determining outlier data fully utilizes the mean and standard deviation of the specified feature and can improve the accuracy of the obtained outlier data.
Further, step S104 of the above embodiment 100 mentions scaling the outlier data in the sample set; specifically, the outlier data may be scaled based on a formula of the following form (the original formula figure is not reproduced; this form is reconstructed only to be consistent with the surrounding description — logarithmic, order-preserving, with a scale factor greater than 1):

x̃_i = (μ + mσ) · scale, where scale = 1 + log(x_i / (μ + mσ))

wherein x̃_i denotes the i-th scaled outlier value;
x_i denotes the i-th outlier value in the sample set;
μ denotes the mean of the specified feature of the sample set;
σ denotes the standard deviation of the specified feature of the sample set.

Adopting the above formula is equivalent to logarithmically processing the outlier data. Since the scale factor in the formula is greater than 1, the scaled outlier data is guaranteed to be greater than the non-outlier data, because all data within the μ + mσ range is non-outlier data.
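The scaling step can be sketched as below. This is only one form consistent with the surrounding description (logarithmic, order-preserving, scale factor above 1); the patent's exact formula is not reproduced here, and all names and parameter values are illustrative assumptions:

```python
import math

def scale_outlier(x, mu, sigma, m=3):
    """Scale an outlier down logarithmically while keeping it above the
    non-outlier ceiling mu + m*sigma; non-outliers pass through unchanged."""
    ceiling = mu + m * sigma
    if x <= ceiling:
        return float(x)                       # non-outlier data: untouched
    scale = 1.0 + math.log(x / ceiling)       # > 1 whenever x > ceiling
    return ceiling * scale

mu, sigma, m = 5_000, 2_000, 2                # ceiling = 9,000
small = scale_outlier(8_000, mu, sigma, m)    # unchanged
big = scale_outlier(20_000, mu, sigma, m)     # shrunk, but still > 9,000
bigger = scale_outlier(40_000, mu, sigma, m)  # order vs. `big` preserved
```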
In the above embodiments, before the normalization processing is performed on the specified feature data in the specified feature section corresponding to each cluster, the method may further include the following steps:
sorting the clusters obtained by the clustering based on their center points;
and fine-tuning the boundaries of the sorted clusters to obtain the specified feature interval corresponding to each cluster.
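One simple way to turn sorted cluster centers into adjacent intervals — a stand-in for the boundary fine-tuning the text leaves unspecified, with hypothetical names — is midpoint boundaries:

```python
def intervals_from_centers(centers, lo, hi):
    """Build interval boundaries [lo, b1, ..., b_{k-1}, hi] from cluster
    centers: each interior boundary is the midpoint of neighboring centers."""
    cs = sorted(centers)
    bounds = [lo]
    for a, b in zip(cs, cs[1:]):
        bounds.append((a + b) / 2)
    bounds.append(hi)
    return bounds

# Three cluster centers over purchase amounts in [0, 100,000].
bounds = intervals_from_centers([500, 5_000, 50_000], lo=0, hi=100_000)
```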
Through this embodiment, a plurality of specified feature intervals is obtained by clustering, and specified feature data of different categories can be divided into different intervals, improving the discrimination between the data.
In the above embodiments, before the normalization processing is performed on the specified feature data in the specified feature section corresponding to each cluster, the method may further include the following steps:
judging whether the specified feature data in the clusters obtained by the clustering follows a long-tail distribution;
optionally, determining, based on the judgment result, whether to re-determine the outlier data in the specified feature of the sample set, which includes:
if the specified feature data in the clusters still follows a long-tail distribution, reducing the value of m and re-determining the outlier data in the specified feature of the sample set based on the reduced value of m.
Through this embodiment, the influence of outlier data can be further avoided, improving the discrimination between the data.
Optionally, clustering the scaled sample set mentioned in the above embodiments includes:
and clustering the scaled sample set according to the characteristics of the scaled sample set and a preset clustering algorithm.
As shown in FIG. 2, an embodiment of the present disclosure provides a feature data processing method 200 to solve the problem of low discrimination of normalized feature data caused by the long-tail distribution of feature data, where the embodiment 200 includes the following steps:
S202: and determining a correlation coefficient matrix among the plurality of features of the sample set, and screening the plurality of features based on the correlation coefficient matrix.
The sample set in the embodiments of this specification may specifically be a user sample set, where each user sample generally includes features of multiple dimensions. This step screens the features and eliminates redundant ones, thereby improving the efficiency of the clustering in subsequent steps.
Specifically, this step may determine a correlation coefficient matrix among the features of the sample set: the closer the correlation coefficient between two features is to 1 or −1, the stronger their correlation; the closer it is to 0, the weaker the correlation.
Based on the above correlation coefficient matrix, if the correlation coefficient between a certain feature and another feature is close to 1 or −1, that feature may be removed. The correlation coefficient may be a Pearson coefficient; alternatively, a chi-square test, an R-squared test, or the like may be used to obtain the correlation coefficient matrix.
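Feature screening via a Pearson correlation matrix can be sketched as follows (function and feature names are hypothetical):

```python
import numpy as np

def drop_correlated(features, names, threshold=0.95):
    """Keep the earlier-listed feature of each highly correlated pair.

    `features` is an (n_samples, n_features) array; a feature is dropped if
    its |Pearson correlation| with an already-kept feature reaches threshold.
    """
    corr = np.corrcoef(features, rowvar=False)
    keep = []
    for j in range(len(names)):
        if all(abs(corr[j, i]) < threshold for i in keep):
            keep.append(j)
    return [names[j] for j in keep]

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# Second column is an exact linear function of the first: redundant.
cols = np.column_stack([x, 2 * x + 1, rng.normal(size=100)])
kept = drop_correlated(cols, ["purchase_amt", "purchase_amt_x2", "age"])
```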
S204: and determining outlier data in the appointed characteristics of the sample set, and scaling the outlier data to obtain a scaled sample set.
In the embodiments of this specification, the specified feature is described by taking the user's fund purchase amount as an example. This step may first obtain the fund purchase amounts (x_1, x_2, …, x_n) of the sample set, and then, according to Chebyshev's theorem, judge the feature data lying outside m standard deviations σ of the mean μ as outlier data.
In this embodiment, since fund purchase amounts are all positive, the feature data outside m standard deviations σ of the mean μ are all large values, and negative values generally do not occur. Of course, in other embodiments, feature data lying outside m standard deviations σ of the mean μ may also be negative.
After the outlier data is determined, it may be scaled based on formulas of the following form (the original formula figures are not reproduced; this form is reconstructed only to be consistent with the surrounding description):

x̃_i = x_i, if x_i ≤ μ + mσ
x̃_i = (μ + mσ) · scale, where scale = 1 + log(x_i / (μ + mσ)), if x_i > μ + mσ

The scale factor adopts logarithmic processing, which removes the long-tail characteristic of the specified feature data while still ensuring the differences between data.

As can be seen from the two formulas above, the specified feature data lying within m standard deviations σ of the mean μ (i.e., the non-outlier data) is not processed; only the outlier data is scaled.

The scale factor is greater than 1, so the scaled outlier data is still greater than the non-outlier data, which further ensures the differences between data. In addition, through the scale factor, the order of outlier data of different sizes after scaling preserves their order before scaling, which likewise ensures the differences between data.

In the two formulas above:
x_i denotes the i-th specified feature value of the sample set (either outlier or non-outlier data);
μ denotes the mean of the specified feature data;
σ denotes the standard deviation of the specified feature data;
x̃_i denotes the i-th processed specified feature value (either data obtained by scaling outlier data, or non-outlier data that was not scaled).
S206: and clustering the scaled sample set to obtain a plurality of clusters.
This step may specifically cluster the scaled sample set based on the processed specified feature data together with the other multiple features, obtaining k clusters, so that the processed specified feature data (x̃_1, x̃_2, …, x̃_n) falls into the k clusters.
The center points (y_1, y_2, …, y_k) of the clusters and the amount of data in each cluster can then be determined. Optionally, this embodiment may also judge, based on the size of each cluster, whether to re-determine the outlier data in the specified feature of the sample set. Specifically:
if the distribution of cluster sizes still shows an obvious long tail, it must be checked whether the value of m in step S204 is too large, so that part of the outlier data was missed; in that case the value of m should be reduced to re-screen the outlier data.
Finally, the clustering results are sorted in ascending order of center point.
S208: and determining the designated characteristic interval corresponding to each cluster.
This step may fine-tune the cluster boundaries based on the sorted cluster center points (y_1, y_2, …, y_k), using visual partitioning, to obtain boundaries (y_0, y_1, y_2, …, y_k), where [y_0, y_1] forms the first interval, [y_1, y_2] forms the second interval, …, and [y_{k−1}, y_k] forms the k-th interval.
This step may follow the principle that the division of the boundaries is clearly meaningful for ease of reading and understanding when performing the boundary trimming, such as trimming the interval [985,10976] to [1000,10000 ].
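The boundary trimming illustrated by [985, 10976] → [1000, 10000] amounts to rounding each boundary to one significant figure; a hypothetical sketch:

```python
import math

def trim_boundary(y):
    """Round a raw cluster boundary to one significant figure so that
    interval edges read cleanly (e.g. 985 -> 1000, 10976 -> 10000)."""
    if y == 0:
        return 0
    exp = math.floor(math.log10(abs(y)))  # order of magnitude of y
    factor = 10 ** exp
    return round(y / factor) * factor

trimmed = [trim_boundary(985), trim_boundary(10976)]
```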
After the k intervals are obtained, all the processed specified feature data (x̃_1, x̃_2, …, x̃_n) falls into its corresponding interval according to the new boundaries.
S210: and carrying out normalization processing on the specified characteristic data falling into the specified characteristic interval.
This step assumes the number of specified feature intervals (intervals for short) is k. For each of the k intervals, outlier data determination and outlier data scaling are applied to the specified feature data falling within it, as in step S204, ensuring that no outlier data remains in any interval.
Then, max-min linear normalization is applied to the specified feature data falling into each specified feature interval; data x_i falling into the jth interval [y_{j-1}, y_j] is transformed as follows:

x_i' = (j - 1) + (x_i - min_j) / (max_j - min_j)

In this way, the processed specified feature data (x_1, x_2, ..., x_n) fall into [0, k], and the effect of long-tail data is removed.

In the above formula, j represents the number of the interval, taking values from 1 to k, where k is the total number of intervals;
x_i represents the ith specified feature data before normalization in the jth interval;
x_i' represents the ith specified feature data after normalization in the jth interval;
max_j represents the maximum value of the jth interval;
min_j represents the minimum value of the jth interval.
Through the above formula, the score of each user sample in each specified feature interval is obtained as normalization is performed: the scores of user samples in the 1st specified feature interval lie between 0 and 1; the scores in the 2nd specified feature interval lie between 1 and 2; ...; the scores in the kth specified feature interval lie between k-1 and k.
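The per-interval max-min normalization and the resulting score ranges can be sketched as follows (the mapping of interval j onto [j-1, j] follows the score ranges described above; the function name and example interval are illustrative):

```python
def normalize(x, j, min_j, max_j):
    """Max-min normalization within the j-th interval (1-based), mapping
    data in [min_j, max_j] onto the score band [j-1, j], consistent with
    the score ranges described in the text."""
    return (j - 1) + (x - min_j) / (max_j - min_j)

# Suppose interval 2 runs from 1000 to 10000:
print(normalize(1000, 2, 1000, 10000))   # -> 1.0 (lower boundary)
print(normalize(10000, 2, 1000, 10000))  # -> 2.0 (upper boundary)
```

Data inside the interval map proportionally between the two boundary scores, so the overall output for k intervals spans [0, k].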
The score can also represent user value, user loyalty, and the like. It can be used directly as reference data for operations and maintenance, or as an input for model training or model use.
In the embodiments of the present specification, the data are divided into a plurality of groups by the clustering algorithm and normalization is performed within each group, so that the influence of long-tail data is eliminated as far as possible while the degree of distinction between the data is preserved.
According to the feature data processing method provided by the embodiments of the present specification, outlier data in the specified features of the sample set are determined and scaled, which improves the degree of distinction of the feature data after normalization; meanwhile, the scaled outlier data remain larger than the non-outlier data, preserving the difference between them; in addition, the embodiments obtain a plurality of clusters by clustering and normalize the specified feature data within the specified feature interval corresponding to each cluster, which fully reflects the aggregation pattern of users and further improves the degree of distinction of the normalized feature data.
When, based on the method provided in the embodiments of the present specification, user value is scored according to the user's fund purchase amount, the proportions of users within each score band of 0-5 points are [58.64%, 27.79%, 8.74%, 4.32%, 0.52%]. Without the method of this embodiment, using equal-width bucketing, the proportions of users within each score band of 0-5 points are [99.98%, 0.02%, 0].
For evaluating users' fund purchase amounts, both the difference between low-net-worth and high-net-worth users and the degree of differentiation within the same group should be considered. Equal-width bucketing, limited by the long-tail effect, classifies the vast majority of users as low-net-worth, which is clearly unreasonable. The partitioning method of the embodiments of the present specification is more accurate: it eliminates the influence of the long-tail distribution while accounting for the purchasing-power differences among users of different tiers.
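The collapse of equal-width bucketing under a long tail can be reproduced on synthetic data (the data below are our own illustration, not the fund-purchase data of the embodiment):

```python
def equal_width_bucket_shares(values, k):
    """Share of samples falling into each of k equal-width buckets.
    Illustrates how a long tail pushes nearly all samples into the
    first bucket; the synthetic data used below is ours, not the
    patent's fund data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for x in values:
        j = min(int((x - lo) / width), k - 1)  # clamp the max value
        counts[j] += 1
    return [c / len(values) for c in counts]

# 99 small purchase amounts and one huge one:
data = list(range(1, 100)) + [1_000_000]
print(equal_width_bucket_shares(data, 5))  # -> [0.99, 0.0, 0.0, 0.0, 0.01]
```

One extreme value stretches the bucket width so far that 99% of the samples land in the first bucket, which is exactly the loss of differentiation the clustering-based partition avoids.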
The above description details embodiments of the feature data processing method. The present specification further provides a feature data processing apparatus, as shown in fig. 3, where the apparatus 300 includes:
an outlier data determination module 302 that may be used to determine outlier data in the specified features of the sample set;
the outlier scaling module 304 may be configured to scale outlier data in the sample set to obtain a scaled sample set, where the scaled outlier data is greater than non-outlier data in a specified feature of the sample set before scaling;
a clustering processing module 306, which may be configured to perform clustering processing on the scaled sample set;
The normalization processing module 308 may be configured to perform normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster based on the clusters after the clustering processing.
According to the feature data processing device provided by the embodiments of the present specification, outlier data in the specified features of the sample set are determined and scaled, which improves the degree of distinction of the feature data after normalization; meanwhile, the scaled outlier data remain larger than the non-outlier data, preserving the difference between them; in addition, the embodiments obtain a plurality of clusters by clustering and normalize the specified feature data within the specified feature interval corresponding to each cluster, which fully reflects the aggregation pattern of users and further improves the degree of distinction of the normalized feature data.
Optionally, as an embodiment, the determining, by the outlier data determination module 302, of outlier data in the specified features of the sample set includes:
determining a mean and standard deviation of specified features of the sample set;
feature data lying more than m standard deviations from the mean among the specified features of the sample set are determined as outlier data, where m is a positive number.
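The mean/standard-deviation rule above can be sketched as follows (the function name, the sample data, and the value of m are illustrative assumptions):

```python
import statistics

def find_outliers(values, m=3.0):
    """Flag values lying more than m standard deviations from the mean.

    Illustrative sketch of the mean/standard-deviation rule described
    above; the function name and default m are our own choices.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [x for x in values if abs(x - mu) > m * sigma]

# A long-tailed sample: most values are small, one is extreme.
sample = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 1000.0]
print(find_outliers(sample, m=2.0))  # -> [1000.0]
```

Note that the extreme value inflates the standard deviation itself, which is why the text later re-checks the cluster distribution and shrinks m if some outliers were missed.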
Optionally, as an embodiment, the scaling of the outlier data in the sample set by the outlier data scaling module 304 includes: the outlier scaling module 304 scales the outlier data in the sample set based on the following formula:
wherein x̃_i represents the ith scaled value of the outlier data;
x_i represents the ith outlier in the sample set;
μ represents the mean of the specified features of the sample set;
σ represents the standard deviation of the specified features of the sample set.
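The scaling formula itself did not survive in this text, so the sketch below is only one plausible choice, not the patent's formula: it logarithmically compresses upper-side outliers while preserving the stated property that scaled outliers remain larger than every non-outlier (all of which lie below μ + mσ), and keeps the relative order of the outliers:

```python
import math

def scale_outlier(x, mu, sigma, m=3.0):
    """Compress an upper-side outlier toward the boundary mu + m*sigma.

    HYPOTHETICAL: the patent's actual scaling formula is not present in
    this text; this logarithmic compression merely satisfies the stated
    property that scaled outliers stay above every non-outlier while
    their order is preserved.
    """
    boundary = mu + m * sigma
    if x <= boundary:
        return x  # not an upper-side outlier; leave unchanged
    return boundary + sigma * math.log1p((x - boundary) / sigma)

mu, sigma = 10.0, 2.0                  # boundary = 16.0 with m = 3
print(scale_outlier(20.0, mu, sigma))  # compressed, yet still above 16
print(scale_outlier(12.0, mu, sigma))  # non-outlier, unchanged
```

Any monotone map that stays above the boundary would serve the same purpose; the log1p form is simply a common choice for taming long tails.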
Optionally, as an embodiment, the apparatus 300 further includes a specified feature interval acquisition module (not shown) that may be configured to:
sorting the clustered clusters based on the center points of the clustered clusters;
and trimming the boundaries of the sorted clusters to obtain designated characteristic intervals corresponding to the clusters.
Optionally, as an embodiment, the apparatus 300 further includes a first determining module (not shown) that may be configured to:
judging whether the specified feature data in the clusters after the clustering processing exhibit a long-tail distribution;
based on the determination result, it is determined whether to redetermine outlier data in the specified features of the sample set.
Optionally, as an embodiment, the determining whether to redetermine the outlier data in the specified features of the sample set based on the determination result includes:
if the specified feature data in the clusters after the clustering processing exhibit a long-tail distribution, reducing the value of m, and redetermining the outlier data in the specified features of the sample set based on the reduced value of m.
Optionally, as an embodiment, the clustering processing module 306 performs clustering processing on the scaled sample set, including:
and clustering the scaled sample set according to the characteristics of the scaled sample set and a preset clustering algorithm.
Optionally, as an embodiment, the apparatus 300 further includes a feature screening module (not shown) that may be configured to:
determining a correlation coefficient matrix between a plurality of features of the sample set;
and screening a plurality of characteristics of the sample set based on the correlation coefficient matrix.
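The correlation-based screening can be sketched as follows (the Pearson statistic, the 0.9 threshold, and the keep-the-earlier-feature rule are illustrative assumptions; the text does not fix these details):

```python
import math
import statistics

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def screen_features(features, threshold=0.9):
    """Keep a feature only if it is not too correlated with any kept one.

    'features' maps feature name -> column of values; the 0.9 threshold
    and the keep-the-earlier-feature rule are illustrative assumptions.
    """
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

cols = {"amount": [1, 2, 3, 4], "amount_x2": [2, 4, 6, 8], "age": [4, 1, 3, 2]}
print(screen_features(cols))  # -> ['amount', 'age']
```

Here 'amount_x2' is perfectly correlated with 'amount' and is dropped, leaving one representative per group of redundant features before clustering.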
Optionally, as an embodiment, the normalizing module 308 performs normalization processing on the specified feature data of the scaled sample set in the specified feature interval corresponding to each cluster, where the normalization processing includes:
normalization processing is performed on the specified feature data in the specified feature interval corresponding to each cluster based on the following formula:

x_i' = (j - 1) + (x_i - min_j) / (max_j - min_j)

wherein j represents the number of the specified feature interval corresponding to each cluster;
x_i represents the ith specified feature data before normalization in the jth specified feature interval;
x_i' represents the ith specified feature data after normalization in the jth specified feature interval;
max_j represents the maximum value of the jth specified feature interval;
min_j represents the minimum value of the jth specified feature interval.
Optionally, as an embodiment, the apparatus 300 further includes a second determining module (not shown) that may be configured to:
judging whether outlier data exists in the designated characteristic interval corresponding to each cluster;
if so, scaling the outlier data in the specified feature interval.
The feature data processing apparatus 300 of the embodiments of the present specification corresponds to the flows of the feature data processing methods 100 and 200 described above; the units/modules in the apparatus 300 and the other operations and/or functions described above are respectively intended to implement the corresponding flows in the feature data processing methods 100 and 200, and are not repeated here for brevity.
An electronic device according to an embodiment of the present specification will be described in detail below with reference to fig. 4. Referring to fig. 4, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. As shown in fig. 4, the memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk storage. Of course, the electronic device may also include the hardware needed to implement other services.
The processor, network interface, and memory may be interconnected by the internal bus, which may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 4, but this does not mean that there is only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code including computer-operating instructions. The memory may include memory and non-volatile storage and provide instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the feature data processing apparatus at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the operations of the method embodiments 100 and 200 described above.
The methods and apparatuses disclosed in the embodiments shown in fig. 1 to fig. 2 may be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device shown in fig. 4 may also execute the methods of fig. 1 to 2 and implement the functions of the embodiments of the feature data processing method shown in fig. 1 to 2, which are not described herein.
Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or a logic device.
The embodiments of the present specification further provide a computer readable storage medium, on which a computer program is stored; when executed by a processor, the computer program implements each process of the foregoing method embodiments 100 and 200 and achieves the same technical effects, which are not repeated here to avoid repetition. The computer readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (13)

1. A method of feature data processing, comprising:
determining outlier data in specified features of a user sample set, wherein the user sample set comprises a large number of user samples, one user sample comprises features of multiple dimensions, and the features contained in the user samples are generated while the user uses the Internet;
scaling the outlier data in the user sample set to obtain a scaled user sample set, wherein the scaled data of the outlier data are larger than the non-outlier data in the specified features of the user sample set before scaling;
clustering the scaled user sample set; and
based on the clusters after the clustering, respectively performing normalization processing on the specified feature data of the scaled user sample set within the specified feature interval corresponding to each cluster.
2. The method of claim 1, wherein the determining outlier data in the specified features of the user sample set comprises:
determining a mean and a standard deviation of the specified features of the user sample set; and
determining, as outlier data, feature data lying more than m standard deviations from the mean among the specified features of the user sample set, wherein m is a positive number.
3. The method of claim 2, scaling outlier data in the set of user samples comprising: scaling the outlier data in the set of user samples based on the formula:
wherein x̃_i represents the ith scaled value of the outlier data;
x_i represents the ith outlier in the user sample set;
μ represents the mean of the specified features of the user sample set;
σ represents the standard deviation of the specified features of the user sample set.
4. A method according to claim 3, wherein before the normalization processing is performed on the specified feature data of the scaled user sample set in the specified feature interval corresponding to each cluster, the method further comprises:
sorting the clustered clusters based on the center points of the clustered clusters;
and trimming the boundaries of the sorted clusters to obtain designated characteristic intervals corresponding to the clusters.
5. The method according to claim 4, wherein before the normalization processing is performed on the specified feature data of the scaled user sample set in the specified feature interval corresponding to each cluster, the method further comprises:
judging whether the specified feature data in the clusters after the clustering processing exhibit a long-tail distribution;
based on the determination, it is determined whether to redetermine outlier data in the specified features of the set of user samples.
6. The method of claim 5, wherein the determining whether to redetermine the outlier data in the specified features of the user sample set based on the determination result comprises:
if the specified feature data in the clusters after the clustering exhibit a long-tail distribution, reducing the value of m, and redetermining the outlier data in the specified features of the user sample set based on the reduced value of m.
7. The method of any of claims 1 to 6, the clustering of the scaled set of user samples comprising:
and clustering the scaled user sample set according to the characteristics of the scaled user sample set and a preset clustering algorithm.
8. The method of claim 7, prior to the determining outlier data in the specified features of the set of user samples, the method further comprising:
determining a correlation coefficient matrix between a plurality of features of the set of user samples;
and screening a plurality of characteristics of the user sample set based on the correlation coefficient matrix.
9. The method of claim 1, wherein the normalizing the specified feature data of the scaled user sample set in the specified feature interval corresponding to each cluster includes:
and respectively carrying out normalization processing on the specified characteristic data in the specified characteristic interval corresponding to each cluster based on the following formula:
Wherein j represents the number of the designated characteristic interval corresponding to each cluster;
x i indicating the ith of the specified feature data before normalization processing in the jth specified feature section;
indicating the ith specified characteristic data after normalization processing in the jth specified characteristic interval; />Representing the maximum value of the j-th designated characteristic interval;
representing the minimum value of the j-th specified characteristic interval.
10. The method of claim 9, wherein before the normalizing the specified feature data of the scaled user sample set in the specified feature interval corresponding to each cluster, the method further comprises:
judging whether outlier data exists in the designated characteristic interval corresponding to each cluster;
if so, scaling the outlier data in the specified feature interval.
11. A feature data processing apparatus comprising:
an outlier data determination module, configured to determine outlier data in specified features of a user sample set, wherein the user sample set comprises a large number of user samples, one user sample comprises features of multiple dimensions, and the features contained in the user samples are generated while the user uses the Internet;
an outlier data scaling module, configured to scale the outlier data in the user sample set to obtain a scaled user sample set, wherein the scaled data of the outlier data are larger than the non-outlier data in the specified features of the user sample set before scaling;
a clustering processing module, configured to perform clustering processing on the scaled user sample set; and
a normalization processing module, configured to respectively perform normalization processing, based on the clusters after the clustering processing, on the specified feature data of the scaled user sample set within the specified feature interval corresponding to each cluster.
12. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor performing the operations of:
determining outlier data in specified features of a user sample set, wherein the user sample set comprises a large number of user samples, one user sample comprises features of multiple dimensions, and the features contained in the user samples are generated while the user uses the Internet;
scaling the outlier data in the user sample set to obtain a scaled user sample set, wherein the scaled data of the outlier data are larger than the non-outlier data in the specified features of the user sample set before scaling;
clustering the scaled user sample set; and
based on the clusters after the clustering, respectively performing normalization processing on the specified feature data of the scaled user sample set within the specified feature interval corresponding to each cluster.
13. A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the operations of:
determining outlier data in specified features of a user sample set, wherein the user sample set comprises a large number of user samples, one user sample comprises features of multiple dimensions, and the features contained in the user samples are generated while the user uses the Internet;
scaling the outlier data in the user sample set to obtain a scaled user sample set, wherein the scaled data of the outlier data are larger than the non-outlier data in the specified features of the user sample set before scaling;
clustering the scaled user sample set; and
based on the clusters after the clustering, respectively performing normalization processing on the specified feature data of the scaled user sample set within the specified feature interval corresponding to each cluster.
CN201811359743.2A 2018-11-15 2018-11-15 Feature data processing method and device Active CN109582741B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811359743.2A CN109582741B (en) 2018-11-15 2018-11-15 Feature data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811359743.2A CN109582741B (en) 2018-11-15 2018-11-15 Feature data processing method and device

Publications (2)

Publication Number Publication Date
CN109582741A CN109582741A (en) 2019-04-05
CN109582741B true CN109582741B (en) 2023-09-05

Family

ID=65922485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811359743.2A Active CN109582741B (en) 2018-11-15 2018-11-15 Feature data processing method and device

Country Status (1)

Country Link
CN (1) CN109582741B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110311909B (en) * 2019-06-28 2021-12-24 平安科技(深圳)有限公司 Method and device for judging abnormity of network access of terminal equipment
CN110555717A (en) * 2019-07-29 2019-12-10 华南理工大学 method for mining potential purchased goods and categories of users based on user behavior characteristics
CN111243743A (en) * 2020-01-17 2020-06-05 深圳前海微众银行股份有限公司 Data processing method, device, equipment and computer readable storage medium
CN111581499B (en) * 2020-04-21 2023-04-28 北京龙云科技有限公司 Data normalization method, device, equipment and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216941A (en) * 2013-05-31 2014-12-17 三星Sds株式会社 Data analysis apparatus and method
CN104462802A (en) * 2014-11-26 2015-03-25 浪潮电子信息产业股份有限公司 Method for analyzing outlier data in large-scale data
CN105378714A (en) * 2013-06-14 2016-03-02 微软技术许可有限责任公司 Fast grouping of time series
CN105612554A (en) * 2013-10-11 2016-05-25 冒纳凯阿技术公司 Method for characterizing images acquired through video medical device
CN106156142A (en) * 2015-04-13 2016-11-23 深圳市腾讯计算机系统有限公司 The processing method of a kind of text cluster, server and system
CN106203103A (en) * 2016-06-23 2016-12-07 百度在线网络技术(北京)有限公司 The method for detecting virus of file and device
CN106649517A (en) * 2016-10-17 2017-05-10 北京京东尚科信息技术有限公司 Data mining method, device and system
CN107330092A (en) * 2017-07-04 2017-11-07 广西电网有限责任公司电力科学研究院 A kind of production business noise data detection and separation method
CN107644032A (en) * 2016-07-21 2018-01-30 中兴通讯股份有限公司 Outlier detection method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033122A1 (en) * 2005-08-04 2007-02-08 First American Real Estate Solutions, Lp Method and apparatus for computing selection criteria for an automated valuation model
US9589045B2 (en) * 2014-04-08 2017-03-07 International Business Machines Corporation Distributed clustering with outlier detection
US10318886B2 (en) * 2015-10-30 2019-06-11 Citrix Systems, Inc. Anomaly detection with K-means clustering and artificial outlier injection

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216941A (en) * 2013-05-31 2014-12-17 Samsung SDS Co., Ltd. Data analysis apparatus and method
CN105378714A (en) * 2013-06-14 2016-03-02 Microsoft Technology Licensing, LLC Fast grouping of time series
CN105612554A (en) * 2013-10-11 2016-05-25 Mauna Kea Technologies Method for characterizing images acquired through a video medical device
CN104462802A (en) * 2014-11-26 2015-03-25 Inspur Electronic Information Industry Co., Ltd. Method for analyzing outlier data in large-scale data
CN106156142A (en) * 2015-04-13 2016-11-23 Shenzhen Tencent Computer Systems Co., Ltd. Text clustering processing method, server and system
CN106203103A (en) * 2016-06-23 2016-12-07 Baidu Online Network Technology (Beijing) Co., Ltd. File virus detection method and device
CN107644032A (en) * 2016-07-21 2018-01-30 ZTE Corporation Outlier detection method and apparatus
CN106649517A (en) * 2016-10-17 2017-05-10 Beijing Jingdong Shangke Information Technology Co., Ltd. Data mining method, device and system
CN107330092A (en) * 2017-07-04 2017-11-07 Electric Power Research Institute of Guangxi Power Grid Co., Ltd. Production and business noise data detection and separation method

Also Published As

Publication number Publication date
CN109582741A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109582741B (en) Feature data processing method and device
CN109544166B (en) Risk identification method and risk identification device
CN108763952B (en) Data classification method and device and electronic equipment
CN112966189B (en) Fund product recommendation system
CN106878242B (en) Method and device for determining user identity category
CN113543117B (en) Prediction method and device for number portability user and computing equipment
CN114943307A (en) Model training method and device, storage medium and electronic equipment
CN114490786B (en) Data sorting method and device
CN114584601A (en) User loss identification and intervention method, system, terminal and medium
CN110599351A (en) Investment data processing method and device
CN112243247B (en) Base station optimization priority determining method and device and computing equipment
CN113255671B (en) Target detection method, system, device and medium for object with large length-width ratio
CN115563268A (en) Text abstract generation method and device, electronic equipment and storage medium
CN115617998A (en) Text classification method and device based on intelligent marketing scene
CN110458581B (en) Method and device for identifying business turnover abnormality of commercial tenant
CN112464970A (en) Regional value evaluation model processing method and device and computing equipment
CN117609393B (en) Metadata consistency testing method
CN113064930A (en) Cold and hot data identification method and device of data warehouse and electronic equipment
CN117091236B (en) Control method of heating ventilation air conditioning system
CN111461892A (en) Method and device for selecting derived variables of risk identification model
CN110738562A (en) Method, device and equipment for generating risk reminding information
CN114299043B (en) Point cloud quality evaluation method and device, electronic equipment and storage medium
CN117708597A (en) Method, apparatus, electronic device and computer readable medium for intention recognition
CN117575110B (en) Mine restoration effect prediction method based on soil reconstruction and related equipment
CN112836745B (en) Target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant after: Advanced New Technologies Co., Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant before: Advantageous New Technologies Co., Ltd.

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, Cayman Islands

Applicant after: Advantageous New Technologies Co., Ltd.

Address before: Fourth floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Limited

GR01 Patent grant