CN112148765B

CN112148765B - Service data processing method, device and storage medium

Info

Publication number: CN112148765B
Application number: CN201910576727.7A
Authority: CN
Inventors: 杨海华
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2024-04-09
Anticipated expiration: 2039-06-28
Also published as: CN112148765A

Abstract

The application provides a method, a device and a storage medium for processing service data. The method comprises: acquiring service data to be processed, wherein the service data has at least two data features and the total data samples contained in the service data carry time marks; dividing the total data samples by time period according to the time marks to obtain bucketed data samples corresponding to each time period; calculating, for each bucketed data sample, a first statistical value corresponding to each data feature; calculating a first fluctuation amplitude of each data feature of the service data from those first statistical values; and performing feature filtering on all data features according to the first fluctuation amplitude of each data feature to obtain key features of the service data. Because the time dimension is taken into account during feature screening, the technical scheme addresses the problem that features change over time and improves the accuracy of feature screening.

Description

Service data processing method, device and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for processing service data, and a storage medium.

Background

With the vigorous development of artificial intelligence technology, machine modeling is an important aspect of the artificial intelligence field, and how to screen stable and important features from multidimensional features of mass data for machine modeling is a key to improve model performance.

In the prior art, feature filtering is generally implemented based on a performance index of a single feature and a preset division threshold. Feature screening of this kind generally considers only the importance of a candidate feature at a certain moment in the processed samples and ignores how the features of the data change over time, so the screened features may be inaccurate.

Disclosure of Invention

The application provides a processing method, a processing device and a storage medium of service data, which are used for solving the problem of inaccurate screened characteristics in the existing characteristic filtering method.

The method for processing service data provided in the first aspect of the present application includes:

acquiring service data to be processed, wherein the service data has at least two data characteristics, and a total data sample contained in the service data has a time mark;

dividing the total data samples contained in the service data by time period according to the time marks to obtain bucketed data samples respectively corresponding to each time period;

Calculating a first statistical value corresponding to each data feature respectively for each barrel-divided data sample, wherein the first statistical value is the ratio of the number of the data samples corresponding to the data feature to the total number of the barrel-divided data samples;

calculating a first fluctuation range of each data characteristic of the service data according to a first statistical value of each data characteristic in each barrel-divided data sample;

and performing feature filtering on all the data features according to the first fluctuation range of each data feature of the service data to obtain key features of the service data.

In one possible implementation manner of the first aspect, the calculating, according to the first statistic value of each data feature in each barrel-divided data sample, a first fluctuation range of each data feature of the service data includes:

calculating an average value of first statistical values corresponding to each data feature;

and aiming at each data feature, obtaining a first fluctuation amplitude of each data feature of the service data according to the average value of the first statistic corresponding to the data feature and the first statistic.

In another possible implementation manner of the first aspect, the performing feature filtering on all the data features according to the first fluctuation range of each data feature of the service data to obtain key features of the service data includes:

And filtering out the data features with the first fluctuation amplitude larger than a preset threshold value from all the data features to obtain the key features of the service data.

In still another possible implementation manner of the first aspect, before performing feature filtering on all the data features according to the first fluctuation range of each data feature of the service data to obtain the key feature of the service data, the method further includes:

calculating importance scores of all data features of the service data by adopting an importance analysis model;

correspondingly, the feature filtering is performed on all the data features according to the first fluctuation range of each data feature of the service data to obtain key features of the service data, including:

and carrying out feature filtering on all the data features according to the importance scores of the data features of the service data and the first fluctuation range of the data features of the service data to obtain key features of the service data.

In the foregoing possible implementation manner of the first aspect, the performing feature filtering on all the data features according to the importance score of each data feature of the service data and the first fluctuation range of each data feature of the service data to obtain the key feature of the service data includes:

Aiming at each data feature, obtaining a comprehensive index value corresponding to each data feature according to the importance score and the first fluctuation amplitude of the data feature;

and obtaining key features of the service data according to the comprehensive index value of each data feature and a preset threshold value.

In a further possible implementation manner of the first aspect, the calculating, using an importance analysis model, an importance score of each data feature of the service data includes:

aiming at each barrel-separated data sample, calculating importance scores corresponding to the data features respectively by adopting an importance analysis model;

calculating a second fluctuation range of each data characteristic of the service data according to the importance score of each data characteristic in each barrel-separated data sample;

correspondingly, the feature filtering is performed on all the data features according to the importance scores of the data features of the service data and the first fluctuation range of the data features of the service data to obtain key features of the service data, including:

and carrying out feature filtering on all the data features according to the second fluctuation range of each data feature of the service data and the first fluctuation range of each data feature of the service data to obtain key features of the service data.

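In yet another possible implementation manner of the first aspect, calculating the importance score of each data feature of the service data includes:

calculating total feature probabilities corresponding to all data features of the service data respectively, wherein the total feature probabilities represent the occurrence probability of all data features in the total data samples;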

calculating the sub-barrel feature probability of each sub-barrel data sample corresponding to the service data of each data feature;

for each data feature, determining a second statistical value corresponding to each sub-barrel data sample according to the total number of the sub-barrel data samples where the data feature is located and the total number of the total data samples contained in the service data, wherein the second statistical value is the ratio of the total number of the corresponding sub-barrel data samples to the total number of the total data samples;

determining the conditional feature probability of the data feature in the service data according to the sub-barrel feature probability of the data feature in each sub-barrel data sample corresponding to the service data and the second statistical value corresponding to each sub-barrel data sample;

and determining importance scores of all the data features of the service data according to the conditional feature probability of all the data features in the service data and the total feature probability corresponding to all the data features of the service data.

A second aspect of the present application provides a processing apparatus for service data, including: the device comprises an acquisition module, a division module, a processing module and a determination module;

the acquisition module is used for acquiring service data to be processed, wherein the service data has at least two data characteristics, and a total data sample contained in the service data has a time mark;

the dividing module is used for dividing the total data samples contained in the service data according to the time marks to obtain barrel-divided data samples respectively corresponding to the time periods;

the processing module is configured to calculate, for each barrel-divided data sample, a first statistic value corresponding to each data feature, where the first statistic value is a ratio of a number of data samples corresponding to the data feature to a total number of barrel-divided data samples, and calculate, according to the first statistic value of each data feature in each barrel-divided data sample, a first fluctuation range of each data feature of the service data;

the determining module is used for carrying out feature filtering on all the data features according to the first fluctuation amplitude of each data feature of the service data to obtain key features of the service data.

In one possible implementation manner of the second aspect, the processing module is specifically configured to calculate, for each data feature, an average value of first statistics values corresponding to the data feature, and obtain, for each data feature, a first fluctuation range of each data feature of the service data according to the first statistics value corresponding to the data feature and the average value corresponding to the first statistics value.

In another possible implementation manner of the second aspect, the determining module is specifically configured to filter, from all data features, data features with a first fluctuation range greater than a preset threshold value, and obtain key features of the service data.

In a further possible implementation manner of the second aspect, the processing module is further configured to calculate, before the determining module performs feature filtering on all the data features according to the first fluctuation range of each data feature of the service data to obtain key features of the service data, an importance score of each data feature of the service data by using an importance analysis model;

correspondingly, the determining module is specifically configured to perform feature filtering on all the data features according to the importance scores of the data features of the service data and the first fluctuation range of the data features of the service data, so as to obtain key features of the service data.

In the foregoing possible implementation manner of the second aspect, the determining module is specifically configured to obtain, for each data feature, a comprehensive index value corresponding to each data feature according to an importance score and a first fluctuation range of the data feature, and obtain, according to the comprehensive index value and a preset threshold of each data feature, a key feature of the service data.

In yet another possible implementation manner of the second aspect, the processing module is specifically configured to calculate, for each barrel-divided data sample, an importance score corresponding to each data feature by using an importance analysis model, and calculate, according to the importance score of each data feature in each barrel-divided data sample, a second fluctuation range of each data feature of the service data;

correspondingly, the determining module is specifically configured to perform feature filtering on all the data features according to the second fluctuation range of each data feature of the service data and the first fluctuation range of each data feature of the service data, so as to obtain key features of the service data.

In a further possible implementation manner of the second aspect, the processing module is specifically configured to perform the following steps:

A third aspect of the present application provides a service data processing apparatus, comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect and any one of the various possible implementations of the first aspect when executing the program.

A fourth aspect of the present application provides a storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the first aspect described above and any of the various possible implementations of the first aspect.

A fifth aspect of the present application provides a program product comprising: a computer program stored in a readable storage medium from which at least one processor can read, the at least one processor executing the computer program for use in the method of the first aspect and any of the various possible implementations of the first aspect.

According to the processing method, the processing device and the storage medium for the service data, the service data to be processed are obtained, the service data are provided with at least two data features, the total data samples contained in the service data are provided with time marks, the total data samples contained in the service data are divided according to time periods according to the time marks to obtain barrel-divided data samples corresponding to each time period respectively, first statistical values corresponding to the data features are calculated for each barrel-divided data sample, the first statistical values are ratios of the number of data samples corresponding to the data features to the total number of barrel-divided data samples, finally first fluctuation ranges of the data features of the service data are calculated according to the first statistical values of the data features in each barrel-divided data sample, and feature filtering is carried out on all the data features according to the first fluctuation ranges of the data features of the service data to obtain key features of the service data. According to the technical scheme, the time dimension is considered in the process of feature screening, the problem that the features are changed along with time migration is solved, and the accuracy of feature screening is improved.

Drawings

Fig. 1 is a flowchart of a first embodiment of a method for processing service data according to an embodiment of the present application;

Fig. 2 is a flowchart of a second embodiment of a method for processing service data according to the embodiment of the present application;

Fig. 3 is a flowchart of a third embodiment of a method for processing service data according to the embodiment of the present application;

Fig. 4 is a flowchart of a fourth embodiment of a method for processing service data according to the embodiment of the present application;

Fig. 5 is a schematic structural diagram of a first embodiment of a processing device for service data provided in the embodiment of the present application;

Fig. 6 is a schematic structural diagram of a second embodiment of a processing device for service data provided in the embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently some, but not all, embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of protection of the present application.

Aiming at the problem that the characteristics of data change along with time so that the screened characteristics are inaccurate in the prior art, the embodiment of the application provides a processing method, a device and a storage medium of service data. According to the technical scheme, the time dimension is considered in the process of feature screening, the problem that the features are changed along with time migration is solved, and the accuracy of feature screening is improved.

It may be understood that the execution body of the embodiment of the present application may be an electronic device, for example, a terminal device, or may be a server, for example, a background processing platform, etc., which may be determined according to an actual situation, which is not described herein again.

The technical scheme of the present application is described in detail below through specific embodiments. It should be noted that the following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 1 is a flowchart of an embodiment of a method for processing service data according to an embodiment of the present application. As shown in fig. 1, the method for processing service data provided in the embodiment of the present application may include the following steps:

Step 11: Acquire service data to be processed, the service data having at least two data features and the total data samples contained in the service data having time stamps.

Data usually has many feature dimensions, that is, data features. To construct a data model that performs well in practical applications, the data features need to be screened: features with poor stability or low importance are filtered out to obtain the key features of the data, and modeling is then performed using the data with the key features.

In practical applications, the features of data are generally not constant: their stability and importance change over time, so that originally non-critical features may become critical and critical features may become non-critical. The embodiments of the present application therefore process data samples that carry time stamps.

The technical solution in this embodiment may process multiple types of data, for example, service data and non-service data, where the service data may include multiple types of communication service data, financial service data, educational service data, and the non-service data may be some data that does not participate in service processing, such as user information data, and the embodiment of the present application does not limit specific classification of service data and non-service data. The present embodiment is explained with respect to service data.

The service data to be processed in this embodiment needs to have a specific characteristic that the service data has at least two data features and the total data sample included in the service data has a time stamp, so that it is possible to process the acquired service data based on the time stamp later and screen out the key features from the multiple data features of the service data.

Step 12: Divide the total data samples contained in the service data by time period according to the time marks to obtain bucketed data samples corresponding to each time period.

In this embodiment, since the total data samples contained in the service data have time stamps, all of the data samples may be divided into a plurality of buckets based on the time stamp of each data sample, one bucket for each time period, thereby obtaining the bucketed data samples corresponding to each time period.

It should be noted that the bucket in this embodiment is actually a set described in the art, that is, in this embodiment, the total data samples included in the service data may be divided into a plurality of data sets, and the time stamps of the plurality of data samples included in each data set are all within a corresponding time period.
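The bucketing described in step 12 can be sketched as follows. This is only an illustration under stated assumptions: the sample layout (a dict with `timestamp` and `features` entries), the helper name `bucket_by_period`, and the choice of one calendar month per time period are all hypothetical and are not specified by the application.

```python
from collections import defaultdict
from datetime import datetime


def bucket_by_period(samples, period_key=lambda ts: (ts.year, ts.month)):
    """Group samples into buckets (sets of samples) keyed by time period.

    Assumption: each sample is a dict whose "timestamp" entry is a datetime.
    The default period is one calendar month; any other key function works.
    """
    buckets = defaultdict(list)
    for sample in samples:
        buckets[period_key(sample["timestamp"])].append(sample)
    return dict(buckets)


# Hypothetical service data: three time-stamped samples.
samples = [
    {"timestamp": datetime(2019, 1, 5), "features": {"a", "b"}},
    {"timestamp": datetime(2019, 1, 20), "features": {"a"}},
    {"timestamp": datetime(2019, 2, 3), "features": {"b"}},
]
buckets = bucket_by_period(samples)
```

Here every sample whose time stamp falls in January 2019 lands in one bucket and the February sample lands in another, matching the description of a bucket as a data set whose time stamps all lie within one time period.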

Step 13: For each bucketed data sample, calculate a first statistical value corresponding to each data feature, wherein the first statistical value is the ratio of the number of data samples having the data feature to the total number of data samples in that bucketed data sample.

In this embodiment, the same operation is performed on each bucketed data sample: the first statistical value of each data feature of the service data is computed within that bucket.

Specifically, for each data feature, the number of data samples having the data feature and the total number of data samples in the bucketed data sample are counted; the ratio of the former to the latter is the first statistical value of the data feature in that bucketed data sample.
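As a minimal sketch of the ratio described in step 13, assume that each sample records the set of features it has; this representation and the name `first_statistic` are illustrative assumptions, not details given in the application.

```python
def first_statistic(bucket, feature):
    """Fraction of the samples in one bucket that have the given feature.

    Assumption: each sample is a dict with a "features" set.
    An empty bucket yields 0.0 to avoid division by zero.
    """
    if not bucket:
        return 0.0
    count = sum(1 for sample in bucket if feature in sample["features"])
    return count / len(bucket)


# Hypothetical bucket holding four samples from one time period.
bucket = [
    {"features": {"a", "b"}},
    {"features": {"a"}},
    {"features": {"b"}},
    {"features": {"a"}},
]
ratio_a = first_statistic(bucket, "a")  # 3 of the 4 samples have "a"
ratio_b = first_statistic(bucket, "b")  # 2 of the 4 samples have "b"
```

Because the value is a count divided by the bucket size, it always lies between 0 and 1.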

Step 14: Calculate the first fluctuation amplitude of each data feature of the service data according to the first statistical value of each data feature in each bucketed data sample.

In this embodiment, the first statistical value of each data feature in each bucketed data sample is determined by the method in step 13. For each data feature of the service data, the average value of its first statistical values may then be calculated first, followed by its first fluctuation amplitude.

Specifically, step 14 may be implemented as follows:

A1: For each data feature, calculate the average value of the first statistical values corresponding to the data feature.

In this embodiment, the first statistical values of the data feature in all bucketed data samples are added to obtain their sum, and the sum is then divided by the number of buckets to obtain the average value of the first statistical values corresponding to the data feature.

A2: For each data feature, obtain the first fluctuation amplitude of the data feature from the first statistical values corresponding to the data feature and their average value.

Illustratively, this embodiment may measure the degree of deviation between the first statistical values of each data feature and their average value using the variance or the mean square error (standard deviation) from probability theory. That is, the first fluctuation amplitude in the present application may be a first variance or a first mean square error.

In this step, for each data feature, the average of the squared differences between each first statistical value corresponding to the data feature and the average value is calculated, yielding the first variance of the data feature. Correspondingly, taking the non-negative square root of the first variance yields the first mean square error.

It should be noted that the embodiments of the present application do not limit the specific form of the first fluctuation amplitude; in other embodiments, it may also be represented directly by the average value of the first statistical values, which is not described in detail herein.
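Steps A1 and A2 can be sketched as follows for a single data feature. The function name `first_fluctuation` and the input values are hypothetical; the computation is the population variance over buckets (dividing by the number of buckets, as in A1) and its square root.

```python
import math


def first_fluctuation(stats):
    """Return (variance, standard deviation) of per-bucket first statistics.

    `stats` holds one first statistical value per bucket for one feature.
    The variance divides by the number of buckets (population variance).
    """
    mean = sum(stats) / len(stats)                              # step A1
    variance = sum((s - mean) ** 2 for s in stats) / len(stats)  # step A2
    return variance, math.sqrt(variance)


# Hypothetical first statistical values of one feature in three buckets.
variance, std_dev = first_fluctuation([0.2, 0.4, 0.6])
```

For these values the average is 0.4, so the squared deviations are 0.04, 0 and 0.04, and the variance is 0.08 / 3. A feature whose ratio is identical in every bucket would give a variance of zero.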

Step 15: Perform feature filtering on all data features according to the first fluctuation amplitude of each data feature of the service data to obtain the key features of the service data.

In this embodiment, when the first statistical values of a data feature differ greatly across the bucketed data samples (that is, the first statistical values fluctuate widely around their average value), the sum of the squared differences between each first statistical value and the average value is large, and so is the mean square error or variance; when the first statistical values differ little across the bucketed data samples, that sum is small, and so is the mean square error or variance. Thus, the larger the first fluctuation amplitude, the more the data feature fluctuates, and the smaller the first fluctuation amplitude, the less it fluctuates.

For example, the data features with larger first fluctuation amplitude can be filtered from the data features of the service data based on the calculated first fluctuation amplitude corresponding to each data feature, so as to obtain the key features of the service data.

As one possible implementation, this step 15 may be implemented as follows:

and filtering out the data features with the first fluctuation amplitude larger than a preset threshold value from all the data features to obtain key features of the service data.

Specifically, in the processing process of the service data, a preset threshold value can be set first, so that after the first fluctuation amplitude corresponding to each data feature is obtained, the data features with the first fluctuation amplitude larger than the preset threshold value can be filtered out from all the data features, and further the key features of the service data are obtained.

It can be appreciated that, in this embodiment, the specific value of the preset threshold may be determined according to the actual situation, which is not described herein.
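The threshold-based filtering of step 15 might look like the sketch below. The feature names, amplitude values and threshold are invented for illustration, and the helper `filter_stable_features` is hypothetical; the application leaves the threshold to be chosen according to the actual situation.

```python
def filter_stable_features(fluctuation, threshold):
    """Drop features whose first fluctuation amplitude exceeds the threshold.

    `fluctuation` maps a feature name to its first fluctuation amplitude.
    Features with an amplitude equal to the threshold are kept.
    """
    return {name for name, amp in fluctuation.items() if amp <= threshold}


# Hypothetical amplitudes for three features.
fluctuation = {"age_band": 0.01, "promo_flag": 0.20, "region": 0.05}
key_features = filter_stable_features(fluctuation, threshold=0.05)
```

With this threshold, only `promo_flag` is filtered out, since it is the only feature whose amplitude is larger than 0.05.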

According to the processing method of the service data, the service data to be processed is obtained, the service data has at least two data features, the total data samples contained in the service data have time marks, the total data samples contained in the service data are divided according to time periods according to the time marks to obtain barrel-divided data samples corresponding to each time period respectively, then first statistical values corresponding to the data features are calculated for each barrel-divided data sample, the first statistical values are ratios of the number of data samples corresponding to the data features to the total number of barrel-divided data samples, finally first fluctuation amplitude of each data feature of the service data is calculated according to the first statistical values of each data feature in each barrel-divided data sample, and feature filtering is carried out on all the data features according to the first fluctuation amplitude of each data feature of the service data to obtain key features of the service data. According to the technical scheme, the time dimension is considered in the process of feature screening, the problem that the features are changed along with time migration is solved, and the accuracy of feature screening is improved.

On the basis of the foregoing embodiments, fig. 2 is a flowchart of a second embodiment of a method for processing service data according to the embodiment of the present application. As shown in fig. 2, before the step 15, the method for processing service data provided in the embodiment of the present application may further include the following steps:

Step 21: Calculate the importance score of each data feature of the service data using an importance analysis model.

Illustratively, in the embodiment of the present application, in order to avoid inaccuracy of feature filtering of a single processing method, after calculating the first fluctuation amplitude of each data feature of the service data through steps 11 to 14 in the embodiment shown in fig. 1, an importance analysis model may also be used to calculate an importance score of each data feature of the service data before feature filtering is performed on all the data features.

It will be appreciated that the higher the importance score of a data feature, the more important the data feature, with the importance decreasing as the importance score decreases.

For example, the importance analysis model may be trained using the same type of data as the service data of the present embodiment, and thus, in the present embodiment, the acquired service data may be input into the importance analysis model, and the importance analysis model may be used to score the importance of each data feature of the service data.

For a specific implementation of this step, reference may be made to the following description of the embodiment shown in fig. 4, which is not repeated here.

Accordingly, the above step 15 may be replaced by the following steps:

Step 22: Perform feature filtering on all data features according to the importance score of each data feature of the service data and the first fluctuation amplitude of each data feature of the service data to obtain the key features of the service data.

In an embodiment, after obtaining a first fluctuation range corresponding to each data feature in service data and an importance score of each data feature, under the condition that the importance score and the first fluctuation range of each data feature are considered at the same time, feature ordering and feature filtering are performed on all data features of the service data, and data features with poor stability and low scores are filtered out of all data features as much as possible, so that key features with good stability and high scores are obtained.

It should be noted that, in the present embodiment, since the first statistical value of each data feature is a value between 0 and 1, the first fluctuation range and the importance score of each data feature are both values between 0 and 1.

Illustratively, this step 22 may be accomplished by:

B1: For each data feature, obtain the composite index value corresponding to the data feature according to the importance score and the first fluctuation amplitude of the data feature.

In this embodiment, as an example, for each data feature, the importance score and the first fluctuation amplitude are summed to obtain the composite index value corresponding to the data feature.

B2: Obtain the key features of the service data according to the composite index value of each data feature and a preset threshold.

From the above analysis, it can be seen that the larger the first fluctuation range of a data feature, the less stable the data feature, and the higher the importance score of a data feature, the more important the data feature. In general, the importance score and the first fluctuation range of each data feature are summed to obtain the comprehensive index value corresponding to that data feature.

However, a high importance score is desirable whereas a large first fluctuation range is not, so the two terms of a direct sum point in opposite directions. For some special data features, for example data features with a higher importance score and a smaller first fluctuation range, directly comparing the summed comprehensive index value with the preset threshold value may cause the data feature to be filtered out by mistake, so that a key feature is lost. To address this problem, the embodiment of the present application may first process the importance score or the first fluctuation range of each data feature.

As an example, the importance score of the data feature is subtracted from 1 to obtain a processed importance score, so that a larger processed importance score indicates a less important data feature. The processed importance score is then summed with the first fluctuation range to obtain an updated comprehensive index value of each data feature, and the data features whose comprehensive index value exceeds the preset threshold value are filtered out of all the data features, so that the obtained key features of the service data are more accurate.

As another example, the first fluctuation range of the data feature is subtracted from 1 to obtain a processed first fluctuation range, so that a larger processed first fluctuation range indicates a more stable data feature. The processed first fluctuation range is then summed with the importance score to obtain an updated comprehensive index value of each data feature, and the data features whose comprehensive index value is smaller than the preset threshold value are filtered out of all the data features, so that the obtained key features of the service data are more accurate.
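The first of the two variants above (importance score subtracted from 1, features above the threshold removed) can be sketched as follows. This is only an illustration: the function name `filter_by_composite_index`, the dictionary layout, the feature names, and the threshold value are assumptions made for the example and are not specified by the present application.

```python
from typing import Dict, List


def filter_by_composite_index(
    importance: Dict[str, float],
    fluctuation: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Keep features whose composite index does not exceed the threshold.

    Composite index = (1 - importance score) + first fluctuation range,
    so a larger index means a less important and/or less stable feature.
    Both inputs are assumed to lie in [0, 1].
    """
    key_features = []
    for name, score in importance.items():
        composite = (1.0 - score) + fluctuation[name]
        if composite <= threshold:
            key_features.append(name)
    return key_features


# Hypothetical values chosen only for illustration.
importance = {"age": 0.9, "city": 0.2, "device": 0.7}
fluctuation = {"age": 0.1, "city": 0.3, "device": 0.8}
print(filter_by_composite_index(importance, fluctuation, threshold=0.5))
```

The second variant would instead compute `score + (1.0 - fluctuation[name])` and keep the features whose index is not smaller than the threshold.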

In the method for processing service data provided by this embodiment, before feature filtering is performed on all data features according to the first fluctuation range of each data feature of the service data to obtain the key features of the service data, an importance analysis model is used to calculate the importance score of each data feature of the service data; feature filtering is then performed on all the data features according to both the importance score and the first fluctuation range of each data feature, so as to obtain the key features of the service data. By combining the first fluctuation range of each data feature with its importance score, this technical solution can determine the key features of the service data more accurately.

On the basis of the foregoing embodiments, fig. 3 is a flowchart of a third embodiment of a method for processing service data according to the embodiment of the present application. As shown in fig. 3, the above step 21 may be implemented by:

Step 31: For each barrel-divided data sample, calculate the importance score corresponding to each data feature by using an importance analysis model.

In this embodiment, when the importance analysis model is used to calculate the importance score of each data feature of the service data, the importance score may, in a manner similar to the method shown in fig. 1, be determined on the basis of each barrel-divided data sample.

That is, for each barrel-divided data sample, the importance analysis model is first used to calculate the importance score of each data feature in that barrel-divided data sample.

Step 32: Calculate a second fluctuation range of each data feature of the service data according to the importance score of each data feature in each barrel-divided data sample.

In this embodiment, similarly to the embodiment shown in fig. 1, for each data feature, the average value of the importance scores of the data feature over the barrel-divided data samples is calculated first, and the second fluctuation range of the data feature is then obtained according to the importance scores of the data feature and that average value.

The specific calculation method of the second fluctuation range is similar to that of the first fluctuation range in the embodiment shown in fig. 1, and will not be described here.

Accordingly, the above step 22 may be implemented by the following steps:

Step 33: Perform feature filtering on all the data features according to the second fluctuation range and the first fluctuation range of each data feature of the service data, to obtain the key features of the service data.

Optionally, in an embodiment of the present application, for each data feature, a comprehensive index value corresponding to the data feature may be obtained according to the second fluctuation range and the first fluctuation range of the data feature, and the key features of the service data may then be obtained according to the comprehensive index value of each data feature and a preset threshold value.

For example, for each data feature, the second fluctuation range and the first fluctuation range of the data feature are first summed to obtain the comprehensive index value corresponding to the data feature, and the data features whose comprehensive index value exceeds the preset threshold value are then filtered out of all the data features to obtain the key features of the service data.
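Steps 31 to 33 can be sketched roughly as below. The text states only that a fluctuation range is derived from the per-bucket values and their average, so the concrete measure used here (mean absolute deviation from the average) is an assumption, as are the names `fluctuation_range` and `key_features` and all numeric values.

```python
from statistics import mean
from typing import Dict, List


def fluctuation_range(values: List[float]) -> float:
    """Assumed measure: mean absolute deviation from the average.

    The exact formula is not given in this passage; this one is illustrative.
    """
    avg = mean(values)
    return mean([abs(v - avg) for v in values])


def key_features(
    first_stats: Dict[str, List[float]],
    bucket_scores: Dict[str, List[float]],
    threshold: float,
) -> List[str]:
    """Sum both fluctuation ranges and drop features above the threshold."""
    kept = []
    for name in first_stats:
        first = fluctuation_range(first_stats[name])     # first fluctuation range
        second = fluctuation_range(bucket_scores[name])  # second fluctuation range
        if first + second <= threshold:
            kept.append(name)
    return kept


# Hypothetical per-bucket first statistics and importance scores.
first_stats = {"age": [0.50, 0.52, 0.48], "device": [0.10, 0.60, 0.20]}
bucket_scores = {"age": [0.30, 0.30, 0.30], "device": [0.05, 0.40, 0.15]}
print(key_features(first_stats, bucket_scores, threshold=0.1))
```

In this example the feature whose statistics and importance scores stay nearly constant across the time periods is retained, while the one that varies strongly is filtered out.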

In the method for processing service data provided by this embodiment, the importance score corresponding to each data feature is calculated for each barrel-divided data sample by using the importance analysis model, the second fluctuation range of each data feature of the service data is calculated according to the importance score of each data feature in each barrel-divided data sample, and feature filtering is then performed on all the data features according to the second fluctuation range and the first fluctuation range of each data feature of the service data, so as to obtain the key features of the service data. This technical solution takes into account the degree of fluctuation of each data feature in both importance score and stability, which further improves the accuracy of the determined key features of the service data.

On the basis of the foregoing embodiments, fig. 4 is a flowchart of a fourth embodiment of a method for processing service data according to the embodiment of the present application. As shown in fig. 4, the above step 21 may be implemented by:

Step 41: Calculate the total feature probability corresponding to each data feature of the service data, where the total feature probability represents the probability of the data feature in the total data samples.

In this embodiment, for each data feature of the service data, the number of times the data feature appears in the total data samples is determined first, and the total feature probability corresponding to the data feature is then determined according to that number of times and the total number of total data samples.

For example, the total feature probability in the present embodiment may also be referred to as the information entropy of the data feature, which represents the complexity (uncertainty) of the data feature. Assuming that a data feature occurs M times in the total data samples and the total number of total data samples is N, where $N \geq M$ and N and M are both positive integers, the total feature probability $P_0$ corresponding to the data feature can be expressed by the formula $P_0 = -(M/N) \times \log(M/N)$.

Step 42: Calculate the sub-bucket feature probability of each data feature in each barrel-divided data sample corresponding to the service data.

Optionally, in an embodiment of the present application, for each data feature, the number of times the data feature occurs in each barrel-divided data sample is determined first, and the sub-bucket feature probability of the data feature in each barrel-divided data sample is then determined according to that number of times and the total number of samples in the barrel-divided data sample.

For example, assuming that a data feature occurs M times in a barrel-divided data sample and the total number of data samples in that barrel-divided data sample is N, where $N \geq M$ and N and M are both positive integers, the sub-bucket feature probability $p_0$ of the data feature in that barrel-divided data sample can be expressed by the formula $p_0 = -(M/N) \times \log(M/N) - ((N-M)/N) \times \log((N-M)/N)$.
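The formula of step 42 can be evaluated with a few lines of code. In the sketch below the function name `bucket_feature_probability` is hypothetical, the natural logarithm is assumed because the text does not state the base, and a term whose probability is 0 is treated as 0, which is the usual entropy convention rather than something the text specifies.

```python
import math


def bucket_feature_probability(m: int, n: int) -> float:
    """Return -(m/n)*log(m/n) - ((n-m)/n)*log((n-m)/n).

    m: number of samples in the bucket in which the feature occurs
    n: total number of samples in the bucket (n > 0, 0 <= m <= n)
    """
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("require n > 0 and 0 <= m <= n")
    total = 0.0
    for count in (m, n - m):
        if count > 0:  # skip zero-probability terms (assumed convention)
            p = count / n
            total -= p * math.log(p)
    return total


# A feature present in half of the bucket is maximally uncertain.
print(bucket_feature_probability(5, 10))
```

A feature that occurs in every sample of the bucket, or in none, yields 0, and the value is largest when the feature occurs in exactly half of the samples.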

Step 43: For each data feature, determine a second statistical value corresponding to each barrel-divided data sample according to the total number of samples in each barrel-divided data sample in which the data feature is located and the total number of total data samples contained in the service data, where the second statistical value is the ratio of the total number of samples in the corresponding barrel-divided data sample to the total number of total data samples.

In this embodiment, for each data feature, the total number of samples in each barrel-divided data sample in which the data feature is located is counted first; that number is then divided by the total number of total data samples contained in the service data, and the resulting ratio is the second statistical value corresponding to that barrel-divided data sample.

Step 44: Determine the conditional feature probability of the data feature in the service data according to the sub-bucket feature probability of the data feature in each barrel-divided data sample corresponding to the service data and the second statistical value corresponding to each barrel-divided data sample.

In this embodiment, step 42 yields the sub-bucket feature probability of the data feature in each barrel-divided data sample, and step 43 yields the second statistical value corresponding to each barrel-divided data sample. The conditional feature probability of the data feature in the service data is obtained by multiplying the sub-bucket feature probability by the second statistical value for each barrel-divided data sample.

It should be noted that the conditional feature probability in the embodiments of the present application is also referred to as conditional entropy, which represents the complexity (uncertainty) of the data feature in each barrel of the data sample.
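Steps 43 and 44 can be written compactly as follows. The symbols $K$ (number of barrel-divided data samples), $N_k$ (number of samples in the k-th one), $N$ (total number of total data samples), $w_k$ (second statistical value), $p_k$ (sub-bucket feature probability) and $P_{\mathrm{cond}}$ are introduced here only for illustration. The summation over the buckets is an assumption based on the standard definition of conditional entropy; the text itself states only that the two quantities are multiplied.

```latex
w_k = \frac{N_k}{N}, \qquad
P_{\mathrm{cond}} = \sum_{k=1}^{K} w_k \, p_k
```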

Step 45: Determine the importance score of each data feature of the service data according to the conditional feature probability of each data feature in the service data and the total feature probability corresponding to each data feature of the service data.

In practical applications, the information gain is equal to the information entropy minus the conditional entropy, as known from the definition of the information gain. In this embodiment, the importance score of each data feature may be expressed in the form of an information gain, that is, the higher the information gain, the higher the importance score of the data feature. Therefore, in this embodiment, the conditional feature probability that each data feature appears in the service data may be subtracted from the total feature probability that each data feature of the service data corresponds to, and the obtained difference is the importance score of each data feature of the service data.
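Steps 41 to 45 can be put together in a short sketch. Several choices here are assumptions rather than statements of the present application: the names `entropy` and `importance_score`, the natural logarithm, the use of the two-term entropy form of step 42 for the total feature probability as well (so that the difference is a conventional information gain), and the summation of the weighted bucket values.

```python
import math
from typing import List, Tuple


def entropy(m: int, n: int) -> float:
    """Two-outcome entropy of a feature occurring in m of n samples."""
    result = 0.0
    for count in (m, n - m):
        if count > 0:
            p = count / n
            result -= p * math.log(p)
    return result


def importance_score(buckets: List[Tuple[int, int]]) -> float:
    """Information gain of one feature.

    buckets: one (m_k, n_k) pair per barrel-divided data sample, where the
    feature occurs in m_k of the n_k samples of that bucket.
    """
    total_m = sum(m for m, _ in buckets)
    total_n = sum(n for _, n in buckets)
    total_probability = entropy(total_m, total_n)  # step 41
    conditional = sum(                             # steps 42 to 44
        (n / total_n) * entropy(m, n) for m, n in buckets
    )
    return total_probability - conditional         # step 45


# Hypothetical counts for one feature over two time periods.
print(importance_score([(5, 10), (5, 10)]))
print(importance_score([(10, 10), (0, 10)]))
```

With these assumptions, the score is zero when the feature occurs at the same rate in every time period and grows as the occurrence rates of the periods diverge.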

In the method for processing service data provided by this embodiment, the total feature probability corresponding to each data feature of the service data is calculated, and the sub-bucket feature probability of each data feature in each barrel-divided data sample corresponding to the service data is calculated. The second statistical value corresponding to each barrel-divided data sample is determined according to the total number of samples in each barrel-divided data sample in which the data feature is located and the total number of total data samples contained in the service data. The conditional feature probability of the data feature in the service data is then determined according to the sub-bucket feature probabilities and the second statistical values, and finally the importance score of each data feature of the service data is determined according to the conditional feature probability and the total feature probability of each data feature. Because the importance score of a data feature is obtained from its total feature probability and its conditional feature probability in the service data, this technical solution achieves high accuracy and is easy to implement.

It should be noted that, in the embodiments of the present application, the above method may be combined with other feature screening methods, such as a collinearity filtering method, to screen the key features. The implementation principle of combining it with those other methods is similar to that described above and is not repeated here.

The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Fig. 5 is a schematic structural diagram of a first embodiment of a processing device for service data provided in the embodiment of the present application. In this embodiment, the processing device for service data may be integrated in an electronic device, or may be integrated in a server, or may be a server. Alternatively, as shown in fig. 5, the apparatus may include: an acquisition module 51, a division module 52, a processing module 53 and a determination module 54.

The acquisition module 51 is configured to acquire service data to be processed, where the service data has at least two data features and the total data samples contained in the service data have time stamps;

The dividing module 52 is configured to divide the total data samples contained in the service data by time period according to the time stamps, to obtain barrel-divided data samples corresponding to each time period;

The processing module 53 is configured to calculate, for each barrel-divided data sample, a first statistical value corresponding to each data feature, where the first statistical value is the ratio of the number of data samples corresponding to the data feature to the total number of samples in the barrel-divided data sample, and to calculate a first fluctuation range of each data feature of the service data according to the first statistical value of each data feature in each barrel-divided data sample;

The determining module 54 is configured to perform feature filtering on all the data features according to the first fluctuation range of each data feature of the service data, so as to obtain the key features of the service data.

For example, in one possible design of the present application, the processing module 53 is specifically configured to calculate, for each data feature, an average value of first statistics corresponding to the data feature, and obtain, for each data feature, a first fluctuation range of each data feature of the service data according to the first statistics corresponding to the data feature and the average value corresponding to the first statistics.

Illustratively, in another possible design of the present application, the determining module 54 is specifically configured to filter, from all the data features, the data features with the first fluctuation range greater than the preset threshold value, so as to obtain the key features of the service data.
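The cooperation of the modules described above can be sketched as a small pipeline. The sketch makes several assumptions that are not taken from the present application: each sample is modeled as a time stamp plus the set of features present in it, a time period is a fixed-width integer interval, the fluctuation range is the mean absolute deviation from the average, and the names `divide`, `first_fluctuation` and `determine` are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Set, Tuple

Sample = Tuple[int, Set[str]]  # (time stamp, features present in the sample)


def divide(samples: List[Sample], period: int) -> Dict[int, List[Set[str]]]:
    """Dividing module: group samples into one bucket per time period."""
    buckets: Dict[int, List[Set[str]]] = defaultdict(list)
    for timestamp, features in samples:
        buckets[timestamp // period].append(features)
    return buckets


def first_fluctuation(buckets: Dict[int, List[Set[str]]], feature: str) -> float:
    """Processing module: first statistic per bucket, then its fluctuation.

    The fluctuation formula (mean absolute deviation) is an assumption.
    """
    stats = [
        sum(feature in s for s in bucket) / len(bucket)
        for bucket in buckets.values()
    ]
    avg = mean(stats)
    return mean([abs(v - avg) for v in stats])


def determine(samples: List[Sample], features: List[str],
              period: int, threshold: float) -> List[str]:
    """Determining module: drop features whose fluctuation exceeds the threshold."""
    buckets = divide(samples, period)
    return [f for f in features if first_fluctuation(buckets, f) <= threshold]


# The acquisition module is represented here by a hard-coded sample list.
samples = [
    (0, {"a", "b"}), (1, {"a"}),   # time period 0
    (10, {"a"}), (11, {"a"}),      # time period 1
]
print(determine(samples, ["a", "b"], period=10, threshold=0.1))
```

Here feature "a" occurs in every sample of both periods and is kept, while feature "b" occurs in half of the first period and not at all in the second, so its fluctuation exceeds the threshold and it is filtered out.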

Illustratively, in another possible design of the present application, the processing module 53 is further configured to calculate, before the determining module 54 performs feature filtering on all the data features according to the first fluctuation range of each data feature of the service data to obtain the key feature of the service data, an importance score of each data feature of the service data by using an importance analysis model;

correspondingly, the determining module 54 is specifically configured to perform feature filtering on all the data features according to the importance score of each data feature of the service data and the first fluctuation range of each data feature of the service data, so as to obtain the key feature of the service data.

As an example, the determining module 54 is specifically configured to obtain, for each data feature, a comprehensive index value corresponding to the data feature according to the importance score and the first fluctuation range of the data feature, and to obtain the key features of the service data according to the comprehensive index value of each data feature and a preset threshold value.

Illustratively, in yet another possible design of the present application, the processing module 53 is specifically configured to calculate, for each barrel-divided data sample, an importance score corresponding to each data feature by using an importance analysis model, and calculate, according to the importance score of each data feature in each barrel-divided data sample, a second fluctuation range of each data feature of the service data;

correspondingly, the determining module 54 is specifically configured to perform feature filtering on all the data features according to the second fluctuation range of each data feature of the service data and the first fluctuation range of each data feature of the service data, so as to obtain the key feature of the service data.

Illustratively, in yet another possible design of the present application, the processing module 53 is specifically configured to perform the following steps:

The apparatus provided in the embodiments of the present application may be used to perform the methods in the embodiments shown in fig. 1 to 4, and the implementation principle and technical effects are similar, and are not described herein again.

It should be understood that the division of the above apparatus into modules is merely a division of logical functions; in an actual implementation, the modules may be fully or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software invoked by a processing element and the others in hardware. For example, the determining module may be a separately provided processing element, or may be integrated in a chip of the above apparatus, or may be stored in a memory of the above apparatus in the form of program code that is called by a processing element of the above apparatus to execute the functions of the determining module. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.

Fig. 6 is a schematic structural diagram of a second embodiment of a processing device for service data provided in the embodiment of the present application. As shown in fig. 6, the processing apparatus of service data may include: a processor 61, a memory 62, a communication interface 63 and a system bus 64, said memory 62 and said communication interface 63 being connected to said processor 61 via said system bus 64 and performing communication with each other, said memory 62 being adapted to store a computer program, said communication interface 63 being adapted to communicate with other devices. The processor 61, when executing the computer program, implements the method of the embodiments shown in fig. 1 to 4 described above.

The system bus shown in fig. 6 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The system bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus. The communication interface is used to enable communication between the database access apparatus and other devices (e.g., clients, read-write libraries, and read-only libraries). The memory may include random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Optionally, the acquisition module 51, the dividing module 52, the processing module 53 and the determining module 54 in fig. 5 may correspond to the processor 61 in the embodiment of the present application.

Optionally, an embodiment of the present application further provides a storage medium in which instructions are stored; when the instructions are run on a computer, they cause the computer to perform the method of the embodiments shown in fig. 1 to 4.

Optionally, an embodiment of the present application further provides a chip for executing instructions, where the chip is configured to perform the method of the embodiment shown in fig. 1 to fig. 4.

The present application also provides a program product, which includes a computer program stored in a storage medium, from which at least one processor can read the computer program, and when the at least one processor executes the computer program, the method of the embodiments shown in fig. 1 to 4 can be implemented.

The term "plurality" herein refers to two or more. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent three cases: A exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship; in a formula, the character "/" indicates that the associated objects before and after it are in a "division" relationship.

It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

It should be understood that, in the embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for processing service data, comprising:

2. The method of claim 1, wherein calculating the first fluctuation range of each data feature of the service data based on the first statistics of each data feature in each barrel data sample comprises:

3. The method according to claim 1, wherein the feature filtering all the data features according to the first fluctuation range of each data feature of the service data to obtain the key feature of the service data includes:

4. The method according to claim 1 or 2, wherein the feature filtering is performed on all data features according to the first fluctuation range of each data feature of the service data, and before obtaining the key feature of the service data, the method further comprises:

5. The method of claim 4, wherein the performing feature filtering on all the data features according to the importance scores of the data features of the service data and the first fluctuation range of the data features of the service data to obtain the key features of the service data comprises:

6. The method of claim 4, wherein calculating an importance score for each data feature of the business data using an importance analysis model comprises:

7. The method of claim 4, wherein calculating an importance score for each data feature of the business data using an importance analysis model comprises:

8. A service data processing apparatus, comprising: the device comprises an acquisition module, a division module, a processing module and a determination module;

9. A service data processing apparatus comprising a processor, a memory and a computer program stored on said memory and executable on the processor, characterized in that the processor implements the method according to any of the preceding claims 1-7 when executing said program.

10. A storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-7.

11. A program product comprising: computer program, characterized in that the computer program, when being executed by a processor, is adapted to carry out the method of any one of the preceding claims 1-7.