CN109783481A

CN109783481A - Data processing method and device

Info

Publication number: CN109783481A
Application number: CN201811561766.1A
Authority: CN
Inventors: 郝向东
Original assignee: New H3C Big Data Technologies Co Ltd
Current assignee: New H3C Big Data Technologies Co Ltd
Priority date: 2018-12-19
Filing date: 2018-12-19
Publication date: 2019-05-21

Abstract

The present disclosure provides a data processing method and device, relating to the technical field of data processing. In the embodiments of the present disclosure, the proportion of abnormal data in a data set is obtained and compared with a preset threshold. If the proportion is greater than or equal to the preset threshold, then for any abnormal data item in the abnormal data subset, the standardized Euclidean distance between that item and every normal data item in the normal data subset is calculated, and the abnormal data item is replaced with the normal data item corresponding to the smallest standardized Euclidean distance. In this way, when data cleaning is performed on a data set with a large data volume, data accuracy is improved and distortion of the data analysis result is reduced.

Description

Data processing method and device

Technical field

The present disclosure relates to the technical field of data processing, and in particular to a data processing method and device.

Background

With the arrival of the big data era, big data analysis technology has developed by leaps and bounds. In big data analysis, a data set usually contains some abnormal data (data that deviates from the expected value or the normal range), which interferes with the analysis and affects its result. Therefore, before big data analysis, data cleaning first needs to be performed on the data set, that is, the abnormal data in it is removed or otherwise processed, to ensure that the data set is reasonable.

In the prior art, data cleaning usually either deletes the abnormal data directly or treats the abnormal data as missing values.

However, when the data volume of the data set is large, these prior-art methods leave many gaps in the data set (for example, missing sequence values in a time-series data set), which seriously distorts the data analysis result.

Summary of the invention

The present disclosure provides a data processing method and device to solve the prior-art problem that the data analysis result is seriously distorted when the data volume of the data set is large.

To achieve the above object, a first aspect of the embodiments of the present disclosure provides a data processing method, the method comprising:

obtaining the proportion of abnormal data in a data set, wherein the data set includes a normal data subset and an abnormal data subset; judging whether the proportion is greater than or equal to a preset threshold; if the proportion is greater than or equal to the preset threshold, then for any abnormal data item in the abnormal data subset, calculating the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset; and replacing the abnormal data item with the normal data item corresponding to the smallest standardized Euclidean distance.

In a second aspect, the embodiments of the present disclosure provide a data processing device, comprising an acquisition module, a judgment module, a first processing module, and a replacement module. The acquisition module is used to obtain the proportion of abnormal data in a data set, wherein the data set includes a normal data subset and an abnormal data subset. The judgment module is used to judge whether the proportion is greater than or equal to a preset threshold. The first processing module is used, if the proportion is greater than or equal to the preset threshold, to calculate, for any abnormal data item in the abnormal data subset, the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset. The replacement module is used to replace the abnormal data item with the normal data item corresponding to the smallest standardized Euclidean distance.

In a third aspect, the embodiments of the present disclosure provide an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor implements the steps of the method of the first aspect when executing the computer program.

In a fourth aspect, the embodiments of the present disclosure provide a computer-readable storage medium storing a computer program, wherein the steps of the method of the first aspect are implemented when the computer program is executed by a processor.

Based on any of the above aspects, the embodiments of the present disclosure have the following advantages:

In the embodiments of the present disclosure, the proportion of abnormal data in a data set is obtained and compared with a preset threshold. If the proportion is greater than or equal to the preset threshold, then for any abnormal data item in the abnormal data subset, the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset is calculated, and the abnormal data item is replaced with the normal data item corresponding to the smallest standardized Euclidean distance. As a result, when data cleaning is performed on a data set with a large data volume, data accuracy is improved and distortion of the data analysis result is reduced.

Brief description of the drawings

Fig. 1 is a flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 2 is another flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 3 is another flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 4 is another flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 5 is another flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 6 is another flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 7 is another flow diagram of the data processing method provided by an embodiment of the present disclosure;

Fig. 8 is a structural diagram of the data processing device provided by an embodiment of the present disclosure;

Fig. 9 is another structural diagram of the data processing device provided by an embodiment of the present disclosure;

Fig. 10 is another structural diagram of the data processing device provided by an embodiment of the present disclosure;

Fig. 11 is a structural diagram of the electronic device provided by an embodiment of the present disclosure.

The realization, functional characteristics, and advantages of the present disclosure are further described in the embodiments below with reference to the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described herein are only used to explain the present disclosure and are not intended to limit it.

On this basis, the embodiments of the present disclosure provide a data processing method that can improve data accuracy and reduce distortion of the data analysis result when data cleaning is performed on a data set with a large data volume. The method can be applied to a computing device with communication and computing capabilities. The computing device may be a server or a workstation, or a personal computer such as a desktop computer or a notebook computer, which is not limited by the present disclosure.

Fig. 1 is a flow diagram of the data processing method provided by an embodiment of the present disclosure. As shown in Fig. 1, the data processing method includes:

S101, obtaining the proportion of abnormal data in a data set, wherein the data set includes a normal data subset and an abnormal data subset.

Specifically, the abnormal data in the data set can be screened out according to a screening rule for abnormal data, which divides the data set into a normal data subset and an abnormal data subset. From the abnormal data screened out, the proportion of abnormal data in the data set can be obtained.

For example, suppose the data set contains the data {96, 97, 97, 96, 98, 110} and the screening rule for abnormal data is: a value whose difference from 95 exceeds 5 is abnormal. Under this rule, the normal data subset is {96, 97, 97, 96, 98} and the abnormal data subset is {110}, so abnormal data accounts for 1/6 of the data set.
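The screening example above can be sketched in Python as follows. This is a minimal illustration rather than the disclosure's implementation; the helper names `split_by_rule` and `abnormal_ratio`, and the encoding of the rule as a lambda, are assumptions introduced here.

```python
from fractions import Fraction


def split_by_rule(data, is_abnormal):
    """Divide a data set into (normal subset, abnormal subset) using a screening rule."""
    normal = [v for v in data if not is_abnormal(v)]
    abnormal = [v for v in data if is_abnormal(v)]
    return normal, abnormal


def abnormal_ratio(data, is_abnormal):
    """Proportion of abnormal entries among all entries of the data set."""
    _, abnormal = split_by_rule(data, is_abnormal)
    return Fraction(len(abnormal), len(data))


data = [96, 97, 97, 96, 98, 110]
rule = lambda v: abs(v - 95) > 5   # "difference from 95 exceeds 5"
normal, abnormal = split_by_rule(data, rule)
ratio = abnormal_ratio(data, rule)
```

`Fraction` is used only so that the result compares exactly with the 1/6 stated in the example; a plain float division would serve equally well in practice.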

It should be noted that the data set may be big data collected according to user requirements, such as power grid traffic data or mobile internet traffic data. The data in the data set may be of types such as text, codes, or numbers. Text-type data can be converted to numeric data, or a mapping between the text-type data and numeric data can be established, so that the data processing method provided by the present disclosure can be applied to the numeric data. The conversion process can be set according to the actual situation and is not specifically limited here.

In addition, data sets of different types or in different scenarios need different screening rules for abnormal data, for example judging by a classification algorithm, by modeling, or by a simple condition. Therefore, in the embodiments of the present disclosure, screening rules for abnormal data may be pre-stored in the server. The type and number of screening rules are not specifically limited here.
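One simple way to establish a mapping between text-type data and numeric data is to assign each distinct value an integer code. The sketch below is only an assumption about how such a mapping might look; the names `build_mapping` and `encode` are hypothetical, and the disclosure leaves the conversion open.

```python
def build_mapping(values):
    """Assign each distinct text value an integer code in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def encode(values, mapping):
    """Convert text values to their numeric codes using an existing mapping."""
    return [mapping[v] for v in values]


colors = ["red", "blue", "red", "green"]
mapping = build_mapping(colors)
codes = encode(colors, mapping)
```

Integer codes impose an ordering on categories that may not exist in the data, so for distance-based cleaning the choice of encoding should be made with the actual data in mind.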

S102, judging whether the proportion is greater than or equal to a preset threshold.

After the proportion of abnormal data in the data set is obtained, it can be compared with the preset threshold to judge whether the proportion is greater than or equal to the preset threshold. The preset threshold can be set according to how abnormal data is to be handled in different data sets. For example, it may be 5%, or another value such as 3% or 10%, which is not specifically limited here.

S103, if the proportion is greater than or equal to the preset threshold, then for any abnormal data item in the abnormal data subset, calculating the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset.

Optionally, if it is judged that the proportion of abnormal data among all data in the data set is greater than or equal to the preset threshold, then for any abnormal data item in the abnormal data subset, the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset is calculated.

For example, in data set A, the abnormal data subset Y includes n abnormal data items (y_1, y_2, …, y_i, …, y_n) and the normal data subset X includes m normal data items (x_1, x_2, …, x_j, …, x_m). If the proportion T of abnormal data in data set A is judged to be greater than or equal to the preset threshold (for example, 5%), every abnormal data item in the abnormal data subset Y needs to be cleaned. The cleaning process is the same for every abnormal data item, so the cleaning of one abnormal data item (assumed to be y_1) is taken as an example.

First, the standardized Euclidean distance between y_1 and each data item in the normal data subset X (x_1, x_2, …, x_j, …, x_m) is calculated. That is, the standardized Euclidean distance d_1 between y_1 and x_1, d_2 between y_1 and x_2, d_j between y_1 and x_j, and so on up to d_m between y_1 and x_m are calculated separately.

S104, replacing the abnormal data item with the normal data item corresponding to the smallest standardized Euclidean distance.

After the standardized Euclidean distances between the abnormal data item y_1 and every normal data item in the normal data subset are obtained, m standardized Euclidean distances are available: d_1, d_2, …, d_j, …, d_m. The normal data item corresponding to the smallest of these distances is selected to replace the abnormal data item y_1. Assuming d_5 is the smallest of these distances, the normal data item x_5 corresponding to d_5 replaces y_1, which completes the processing of the abnormal data item y_1.

Every abnormal data item in the abnormal data subset Y is cleaned in the same way as y_1, which achieves the purpose of cleaning the abnormal data in the data set.

From the above, in the data processing method provided by the embodiments of the present disclosure, the proportion of abnormal data in the data set is obtained and compared with a preset threshold. If the proportion is greater than or equal to the preset threshold, then for any abnormal data item in the abnormal data subset, the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset is calculated, and the abnormal data item is replaced with the normal data item corresponding to the smallest standardized Euclidean distance. As a result, when data cleaning is performed on a data set with a large data volume, data accuracy is improved and distortion of the data analysis result is reduced.

Fig. 2 is another flow diagram of the data processing method provided by an embodiment of the present disclosure.

Optionally, as shown in Fig. 2, in the data processing method provided by the embodiments of the present disclosure, if the proportion is less than the preset threshold, each abnormal data item is processed by the following steps S201-S203. In this embodiment the cleaning process is the same for every abnormal data item, so the cleaning of one abnormal data item (assumed to be b_2 in a data set B) is taken as an example:

S201, obtaining the two data items adjacent to the abnormal data item.

Specifically, if the proportion of abnormal data in the data set is judged to be less than the preset threshold, then for the abnormal data item to be cleaned, its two adjacent data items in the data set are obtained first. In one embodiment, the data in the data set may be sorted in the order in which it was generated or collected.

For example, if data set B contains an abnormal data item b_2, the two data items adjacent to b_2 in data set B can be obtained; assume they are b_1 and b_3. Under the above distinction between abnormal and normal data, and depending on the actual application scenario, b_1 may be either abnormal or normal, and so may b_3.

S202, calculating the average of the two adjacent data items.

Specifically, after the two adjacent data items of each abnormal data item are obtained, their average is calculated.

For example, for the two data items b_1 and b_3 adjacent to the above abnormal data item b_2, the average is:

$$z = \frac{b_1 + b_3}{2}$$

where z is the average of b_1 and b_3.

S203, replacing the abnormal data item with the average.

Specifically, after the average of the two adjacent data items of each abnormal data item is calculated, the corresponding abnormal data item is replaced with that average.

For example, for the above abnormal data item b_2, if the average of its two adjacent data items obtained by the server is z, then b_2 in the data set is replaced with z, which completes the processing of the abnormal data item b_2.
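Steps S201-S203 can be sketched as follows. The function name `replace_with_neighbor_mean` is hypothetical, and raising an error for an item at either end of the data set is an assumption made here, since the text does not say how a missing neighbor is handled.

```python
def replace_with_neighbor_mean(series, index):
    """Return a copy of `series` with the item at `index` replaced by the
    average of its two adjacent items (S201-S203)."""
    if index <= 0 or index >= len(series) - 1:
        # Assumption: an item at either end has only one neighbor, so reject it.
        raise ValueError("item must have two adjacent data items")
    z = (series[index - 1] + series[index + 1]) / 2   # S202
    cleaned = list(series)
    cleaned[index] = z                                # S203
    return cleaned


cleaned = replace_with_neighbor_mean([10, 50, 12], 1)
```

Note that the neighbors are used as they are, whether normal or abnormal, which matches the remark that b_1 and b_3 may each be either.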

Optionally, other correction methods can also be used. For example, if it is judged that the proportion of abnormal data among all data in the data set is less than the preset threshold, each abnormal data item in the data set can also be corrected by a smoothing correction method, which replaces the abnormal data item with a weighted average value. When the proportion is less than the preset threshold, the correction method for abnormal data can be set according to the actual situation and is not specifically limited here.

From the above, in the data processing method provided by this embodiment, when the proportion of abnormal data among all data in the data set is less than the preset threshold, the abnormal data is corrected by the mean correction method or the smoothing correction method. This improves the efficiency of processing abnormal data in the data set and reduces the computing resources occupied during data cleaning.

Fig. 3 is another flow diagram of the data processing method provided by an embodiment of the present disclosure.

Optionally, as shown in Fig. 3, obtaining the proportion of abnormal data in the data set includes:

S301, obtaining the number of abnormal data entries in the abnormal data subset and the number of all data entries in the data set.

Specifically, the abnormal data in the data set can be screened out according to a screening rule for abnormal data, which divides the data set into a normal data subset and an abnormal data subset. The number of abnormal data entries in the abnormal data subset and the number of all data entries in the data set can then be obtained.

S302, calculating the ratio of the number of abnormal data entries to the number of all data entries in the data set, to obtain the proportion of abnormal data in the data set.

Specifically, the proportion of abnormal data in the data set is obtained as the ratio of the number of abnormal data entries to the number of all data entries in the data set.

For example, the above data set A is divided into the abnormal data subset Y (containing n abnormal data items) and the normal data subset X (containing m normal data items), so the proportion of abnormal data among all data in data set A is:

$$T = \frac{n}{n + m}$$

where T is the proportion of abnormal data among all data in the data set.

It should be noted that the data processing method provided by the embodiments of the present disclosure can be applied to vector data of any dimension (for example, data that includes a time dimension). While eliminating abnormal data, it also better retains the influence of each dimensional feature of the abnormal data on the whole data set, which further improves data accuracy. An embodiment is given below to describe the data processing method for N-dimensional vectors in detail.

Fig. 4 is another flow diagram of the data processing method provided by an embodiment of the present disclosure.

Optionally, as shown in Fig. 4, the normal data and abnormal data in the data set are preset N-dimensional vectors, where N is an integer greater than 0.

Specifically, the N-dimensional vector can be preset according to the content the data represents. For example, if the data in the data set describes attributes of boxes such as length, volume, color, and material, and only the length, width, and height of each box are of interest, the N-dimensional vector can be preset as a three-dimensional vector. The N-dimensional vector can be set according to the actual situation and is not limited in the embodiments of the present disclosure.

Then, for any abnormal data item in the abnormal data subset, the step of calculating the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset includes:

For any normal data item:

S401, obtaining the standard deviation of the k-th dimension component, where 1 ≤ k ≤ N.

Assume that any abnormal data item y_i in the abnormal data subset Y (y_1, y_2, …, y_i, …, y_n) is the N-dimensional vector (y_i1, y_i2, …, y_ik, …, y_iN), and any normal data item x_j in the normal data subset X (x_1, x_2, …, x_j, …, x_m) is the N-dimensional vector (x_j1, x_j2, …, x_jk, …, x_jN).

First, the standard deviation s_k of the k-th dimension component is calculated. The calculation of s_k is described in a subsequent embodiment.

S402, calculating the Euclidean distance between the abnormal data item and the normal data item in the k-th dimension from the k-th dimension component of the abnormal data item, the k-th dimension component of the normal data item, and the standard deviation of the k-th dimension component.

Then, the k-th dimension component y_ik of the abnormal data item y_i and the k-th dimension component x_jk of the normal data item x_j are obtained, and the Euclidean distance between y_i and x_j in the k-th dimension is calculated as:

$$d_k = \frac{\lvert y_{ik} - x_{jk} \rvert}{s_k}$$

S403, obtaining the standardized Euclidean distance between the abnormal data item and the normal data item from their Euclidean distances in each of the N dimensions.

Using the expression for d_k, the Euclidean distance between y_i and x_j in each of the N dimensions can be calculated. The standardized Euclidean distance between the abnormal data item y_i and the normal data item x_j is then:

$$d = \sqrt{\sum_{k=1}^{N} d_k^{2}} = \sqrt{\sum_{k=1}^{N} \left( \frac{y_{ik} - x_{jk}}{s_k} \right)^{2}}$$

where d is the standardized Euclidean distance between the abnormal data item y_i and the normal data item x_j, and N is the number of dimensions of y_i and x_j.
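The distance of steps S401-S403 can be sketched as follows, assuming the usual definition of the standardized Euclidean distance: each per-dimension difference is divided by that dimension's standard deviation, the results are squared and summed, and the square root is taken. The function name `standardized_euclidean` is hypothetical, and the standard deviations are passed in as an argument rather than computed here.

```python
import math


def standardized_euclidean(y, x, s):
    """Standardized Euclidean distance between vectors y and x,
    given the per-dimension standard deviations s (all assumed non-zero)."""
    if not (len(y) == len(x) == len(s)):
        raise ValueError("y, x and s must have the same number of dimensions")
    return math.sqrt(sum(((yk - xk) / sk) ** 2 for yk, xk, sk in zip(y, x, s)))


d_plain = standardized_euclidean((3, 4), (0, 0), (1, 1))   # unit spread in both dimensions
d_scaled = standardized_euclidean((3, 4), (0, 0), (3, 4))  # each difference equals one deviation
```

With unit standard deviations the result coincides with the ordinary Euclidean distance. Dividing by s_k keeps a dimension with a large numeric range from dominating the comparison.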

Fig. 5 is another flow diagram of the data processing method provided by an embodiment of the present disclosure.

Optionally, as shown in Fig. 5, obtaining the standard deviation of the k-th dimension component includes:

S501, obtaining the k-th dimension component of the abnormal data item and the k-th dimension components of all normal data items in the normal data subset, to obtain the k-th dimension component set.

Specifically, both the abnormal data and the normal data in the data set are N-dimensional vectors. To obtain the standard deviation of the k-th dimension component, the set formed by the k-th dimension component of the abnormal data item and the k-th dimension components of all normal data items is constructed first: {y_ik, x_1k, x_2k, …, x_jk, …, x_mk}, where y_ik is the k-th dimension component of the abnormal data item y_i and x_jk is the k-th dimension component of the normal data item x_j. Let a_ki denote the i-th component in the k-th dimension component set and M the number of components in the set. The k-th dimension component set is then {a_k1, a_k2, …, a_ki, …, a_kM}.

S502, determining the average of all components in the k-th dimension component set.

Specifically, from the k-th dimension component set {a_k1, a_k2, …, a_ki, …, a_kM}, the average of all components in the set is obtained:

$$\bar{a}_k = \frac{1}{M} \sum_{i=1}^{M} a_{ki}$$

S503, calculating the standard deviation of the k-th dimension component from each component in the k-th dimension component set, the average of all components, and the number of components in the set.

Specifically, continuing the above description, the standard deviation s_k of the k-th dimension component is calculated from each component in the k-th dimension component set, the average of all components, and the number of components in the set, expressed by the formula:

$$s_k = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( a_{ki} - \bar{a}_k \right)^{2}}$$

where 1 ≤ k ≤ N.
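Steps S501-S503 can be sketched as follows. Two points are assumptions made for this sketch: the standard deviation is taken in its population form (dividing by the number of components M), and the dimension index `k` is zero-based as is usual in Python. The function name `component_std` is hypothetical.

```python
import math


def component_std(abnormal, normals, k):
    """Standard deviation of the k-th dimension component set, built from one
    abnormal vector and all normal vectors (k is a zero-based index)."""
    components = [abnormal[k]] + [x[k] for x in normals]          # S501
    m_count = len(components)
    mean = sum(components) / m_count                               # S502
    return math.sqrt(sum((a - mean) ** 2 for a in components) / m_count)  # S503


normals = [(4,), (4,), (4,), (5,), (5,), (7,), (9,)]
s0 = component_std((2,), normals, 0)
```

In the usage above the component set is {2, 4, 4, 4, 5, 5, 7, 9}, whose mean is 5 and whose squared deviations sum to 32, so the population standard deviation is 2.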

In other words, the data processing method provided by the present disclosure first obtains, for any abnormal data item, the standard deviation of the component in each of the N dimensions. Then, from a component of the abnormal data item, the same component of a normal data item, and the standard deviation of that component, it calculates the Euclidean distance between the abnormal data item and the normal data item in that dimension. Repeating this for each of the N dimensions yields the standardized Euclidean distance between the abnormal data item and the normal data item. Finally, from the standardized Euclidean distances between the abnormal data item and all normal data items, the normal data item corresponding to the smallest standardized Euclidean distance is chosen to replace the abnormal data item. This makes the cleaning of abnormal data in the data set more accurate and thereby reduces distortion of the data analysis result.
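The whole procedure summarized above, for one abnormal data item, can be sketched as follows. This is an illustration under stated assumptions, not the disclosure's implementation: the name `nearest_normal` is hypothetical, the standard deviation is taken in its population form, and a dimension whose standard deviation is zero is skipped, which the text does not address.

```python
import math


def nearest_normal(abnormal, normals):
    """Return the normal vector with the smallest standardized Euclidean
    distance to the abnormal vector."""
    dims = range(len(abnormal))
    stds = []
    for k in dims:
        comps = [abnormal[k]] + [x[k] for x in normals]
        mean = sum(comps) / len(comps)
        stds.append(math.sqrt(sum((a - mean) ** 2 for a in comps) / len(comps)))

    def distance(x):
        # Assumption: a dimension with zero spread contributes nothing.
        return math.sqrt(sum(((abnormal[k] - x[k]) / stds[k]) ** 2
                             for k in dims if stds[k] > 0))

    return min(normals, key=distance)


replacement = nearest_normal((10.0, 100.0),
                             [(1.0, 90.0), (9.0, 10.0), (2.0, 95.0)])
```

Because the replacement is an actual record from the normal data subset, every dimension of the substituted value is a combination that really occurred in the data.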

Fig. 6 is another flow diagram of the data processing method provided by an embodiment of the present disclosure.

Optionally, as shown in Fig. 6, the above data processing method further includes:

S601, judging whether abnormal data still exists in the data set.

Specifically, after an abnormal data item in the data set has been processed by the above data processing method, it can be further judged whether abnormal data still exists in the data set.

S602, if not, outputting the data set.

As described above, if it is judged that no abnormal data remains in the data set, the data set can be output for subsequent work such as data analysis and modeling.

It should be noted that if the judgment finds that abnormal data still exists, the same processing steps can be repeated, using the same processing mode as before, until it is confirmed that the data set contains no abnormal data. These processing steps are described in detail above and are not repeated here.

Fig. 7 is another flow diagram of the data processing method provided by an embodiment of the present disclosure.

As shown in Fig. 7, to make the steps of the above data processing method clearer, the embodiments of the present disclosure give a complete description of one optional implementation in which the method performs data cleaning on a data set. In this embodiment, the data set includes a normal data subset and an abnormal data subset, and the data processing method includes:

S710, obtaining the number of abnormal data entries in the abnormal data subset and the number of all data entries in the data set.

S720, calculating the ratio of the number of abnormal data entries to the number of all data entries in the data set, to obtain the proportion of abnormal data in the data set.

S730, judging whether the proportion is greater than or equal to a preset threshold; if so, executing steps S741, S742, S743, S744 and S745 in sequence; otherwise, executing steps S751, S752, S753 and S754 in sequence.

S741, for any abnormal data item in the abnormal data subset and any normal data item in the normal data subset, calculating the standardized Euclidean distance between the abnormal data item and the normal data item.

S742, repeating step S741 to obtain the standardized Euclidean distances between the abnormal data item and every normal data item.

S743, choosing the smallest standardized Euclidean distance from these standardized Euclidean distances.

S744, replacing the abnormal data item with the normal data item corresponding to the smallest standardized Euclidean distance.

S745, judging whether abnormal data still exists in the data set; if so, executing step S741; if not, executing step S760.

S751, obtaining the two data items adjacent to the abnormal data item.

S752, calculating the average of the two adjacent data items.

S753, replacing the abnormal data item with the average.

S754, judging whether abnormal data still exists in the data set; if so, executing step S751; if not, executing step S760.

S760, outputting the data set.

The beneficial effects of the data processing method in this optional implementation have been described in the foregoing method embodiments and are not repeated here.
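The control flow of steps S710-S760 can be sketched as follows, with the two replacement strategies passed in as functions. Everything named here (`clean`, `replace_high`, `replace_low`, `max_passes`, and the scalar stand-ins) is hypothetical; the cap on repeat passes is an assumption added so that the loop always terminates.

```python
def clean(records, is_abnormal, replace_high, replace_low,
          threshold=0.05, max_passes=10):
    """Control flow of S710-S760; the replacement strategies are supplied by the caller."""
    records = list(records)
    abnormal_count = sum(1 for r in records if is_abnormal(r))        # S710
    ratio = abnormal_count / len(records)                              # S720
    replace = replace_high if ratio >= threshold else replace_low      # S730
    for _ in range(max_passes):   # assumption: bound the number of repeat passes
        pending = [i for i, r in enumerate(records) if is_abnormal(r)]  # S745 / S754
        if not pending:
            break
        for i in pending:
            records[i] = replace(records, i)   # S741-S744 or S751-S753
    return records                             # S760


# Scalar stand-ins for the two branches, for illustration only.
def is_abnormal(v):
    return abs(v - 95) > 5

def nearest_normal(records, i):
    # In one dimension every candidate shares the same standard deviation, so the
    # smallest standardized distance is simply the smallest absolute difference.
    normals = [r for r in records if not is_abnormal(r)]
    return min(normals, key=lambda x: abs(records[i] - x))

def neighbor_mean(records, i):
    # Assumes the abnormal item is not at either end of the data set.
    return (records[i - 1] + records[i + 1]) / 2

high = clean([96, 97, 97, 96, 98, 110], is_abnormal, nearest_normal, neighbor_mean)
low = clean([96, 110, 98], is_abnormal, nearest_normal, neighbor_mean, threshold=0.5)
```

In the first call the proportion 1/6 reaches the 5% threshold, so 110 is replaced by its nearest normal value. In the second call the threshold is raised to 50%, so the neighbor-average branch is taken instead.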

Fig. 8 is a structural diagram of the data processing device provided by an embodiment of the present disclosure.

As shown in Fig. 8, the data processing device includes an acquisition module 801, a judgment module 802, a first processing module 803, and a replacement module 804. The acquisition module 801 is used to obtain the proportion of abnormal data in a data set, wherein the data set includes a normal data subset and an abnormal data subset. The judgment module 802 is used to judge whether the proportion is greater than or equal to a preset threshold. The first processing module 803 is used, if the proportion is greater than or equal to the preset threshold, to calculate, for any abnormal data item in the abnormal data subset, the standardized Euclidean distance between the abnormal data item and every normal data item in the normal data subset. The replacement module 804 is used to replace the abnormal data item with the normal data item corresponding to the smallest standardized Euclidean distance.

Optionally, the first processing module 803 is also used, if the proportion is less than the preset threshold, to perform the following for any abnormal data item:

obtaining the two data items adjacent to the abnormal data item, and calculating the average of the two adjacent data items.

The replacement module 804 is also used to replace the abnormal data item with the average.

Optionally, if the proportion is less than the preset threshold, the smoothing correction method can also be used to replace the abnormal data.

Fig. 9 is another structural diagram of the data processing device provided by an embodiment of the present disclosure.

Optionally, as shown in Fig. 9, the acquisition module 801 includes an acquisition submodule 901 and a calculation submodule 902.

The acquisition submodule 901 is used to obtain the number of abnormal data entries in the abnormal data subset and the number of all data entries in the data set.

The calculation submodule 902 is used to calculate the ratio of the number of abnormal data entries to the number of all data entries in the data set, to obtain the proportion of abnormal data in the data set.

Figure 10 is the another structural schematic diagram for the data processing equipment that the embodiment of the present disclosure provides.

Optionally, as shown in Figure 10, the first processing module includes: the first acquisition submodule 1001, first processing Module 1002 and second processing submodule 1003；

The normal data described for any one:

First acquisition submodule 1001, it is poor for obtaining kth dimension component standard, wherein 1≤k≤N；

The first processing submodule 1002, for tieing up component, the normal data according to the kth of the abnormal data Kth ties up component and kth dimension component standard is poor, is calculated what the abnormal data was tieed up with the normal data in kth Euclidean distance；

Second processing submodule 1003, for according to abnormal data and normal data in N-dimensional per one-dimensional Euclidean distance, Obtain the standardization Euclidean distance of abnormal data and normal data.

The first acquisition submodule 1001 includes a standard deviation obtaining module 1004.

The standard deviation obtaining module 1004 is configured to obtain the k-th dimension component of the abnormal data and the k-th dimension components of all normal data in the normal data subset, obtaining a k-th dimension component set;

determine the average value of all components in the k-th dimension component set;

and calculate the k-th dimension component standard deviation according to each component in the k-th dimension component set, the average value of all components, and the number of components in the k-th dimension component set.
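The cooperation of submodules 1001 to 1004 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function names are hypothetical, the standard deviation is assumed to be the population form (dividing by the number of components), the per-dimension distances are assumed to be combined as the square root of the sum of their squares (the usual standardized Euclidean distance), and the code indexes dimensions from 0 whereas the text uses 1≤k≤N.

```python
import math


def component_std(abnormal, normal_subset, k):
    """Standard deviation of the k-th dimension component set.

    The set holds the k-th component of the abnormal record and of every
    normal record. Dividing by the component count is an assumption.
    """
    components = [abnormal[k]] + [row[k] for row in normal_subset]
    mean = sum(components) / len(components)
    variance = sum((c - mean) ** 2 for c in components) / len(components)
    return math.sqrt(variance)


def standardized_distance(abnormal, normal, normal_subset):
    """Standardized Euclidean distance between one abnormal record and one
    normal record, where `normal` is assumed to belong to `normal_subset`."""
    total = 0.0
    for k in range(len(abnormal)):
        s_k = component_std(abnormal, normal_subset, k)
        if s_k == 0:
            # Every component in this dimension is identical, so the
            # difference is zero and the dimension contributes nothing.
            continue
        d_k = (abnormal[k] - normal[k]) / s_k  # distance in dimension k
        total += d_k ** 2
    return math.sqrt(total)


# Hypothetical sample: one abnormal record and two normal records.
abnormal = [4.0, 0.0]
normal_subset = [[0.0, 0.0], [2.0, 0.0]]
nearest = min(
    normal_subset,
    key=lambda row: standardized_distance(abnormal, row, normal_subset),
)
```

In this sample, `nearest` is the normal record with the smallest standardized Euclidean distance, which is the record that would replace the abnormal one.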

The above device is used to execute the foregoing method embodiments and may be integrated in a computing device such as a server or a computer. Its implementation principle and technical effect are described in the foregoing method embodiments and are not repeated here.

The above modules may be one or more integrated circuits configured to implement the above method, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated ASIC), or one or more microprocessors (digital signal processor, abbreviated DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, abbreviated FPGA), etc. As another example, when one of the above modules is implemented in the form of a processing element scheduling program code, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, abbreviated CPU) or another processor that can call program code. As another example, these modules may be integrated together and implemented in the form of a system on chip (system-on-a-chip, abbreviated SOC).

As shown in Fig. 11, the device includes a memory 1102 and a processor 1101. The memory 1102 stores a computer program that can run on the processor 1101, and the processor 1101 implements the steps of the above data processing method when executing the computer program. The specific implementation and technical effect are similar and are not repeated here.

Optionally, an embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; the steps of the above data processing method are implemented when the computer program is executed by a processor.

In the several embodiments provided by the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the modules is only a division by logical function, and other divisions are possible in actual implementation. Multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or modules through some interfaces, and may be electrical, mechanical, or in other forms.

The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present disclosure may be integrated in one processing module, each module may exist physically alone, or two or more modules may be integrated in one module. The above integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

The above integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods of the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, abbreviated ROM), a random access memory (Random Access Memory, abbreviated RAM), a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present disclosure and do not limit the patent scope of the present disclosure. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present disclosure, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present disclosure.

Claims

1. A data processing method, characterized in that the method comprises:

obtaining the proportion of a data set accounted for by abnormal data in the data set, wherein the data set includes a normal data subset and an abnormal data subset;

judging whether the proportion is greater than or equal to a preset threshold;

if the proportion is greater than or equal to the preset threshold, for any abnormal data in the abnormal data subset, calculating the standardized Euclidean distance between the abnormal data and each piece of normal data in the normal data subset;

replacing the abnormal data with the normal data corresponding to the smallest standardized Euclidean distance.

2. The method according to claim 1, characterized in that, if the proportion is less than the preset threshold, for any abnormal data:

obtaining two pieces of data adjacent to the abnormal data;

calculating the average value of the two pieces of adjacent data;

replacing the abnormal data with the average value.

3. The method according to claim 1, characterized in that the obtaining the proportion of the data set accounted for by abnormal data in the data set comprises:

obtaining the number of entries of abnormal data in the abnormal data subset and the number of entries of all data in the data set;

calculating the ratio of the number of entries of the abnormal data to the number of entries of all data in the data set, obtaining the proportion of the data set accounted for by abnormal data in the data set.

4. The method according to claim 1, characterized in that the normal data and the abnormal data in the data set are preset N-dimensional vectors, N being an integer greater than 0;

the calculating the standardized Euclidean distance between the abnormal data and each piece of normal data in the normal data subset comprises:

for any one piece of the normal data:

obtaining a k-th dimension component standard deviation, wherein 1≤k≤N;

calculating the Euclidean distance between the abnormal data and the normal data in the k-th dimension according to the k-th dimension component of the abnormal data, the k-th dimension component of the normal data, and the k-th dimension component standard deviation;

obtaining the standardized Euclidean distance between the abnormal data and the normal data according to the Euclidean distance between the abnormal data and the normal data in each of the N dimensions.

5. The method according to claim 4, characterized in that the obtaining the k-th dimension component standard deviation comprises:

obtaining the k-th dimension component of the abnormal data and the k-th dimension components of all normal data in the normal data subset, obtaining a k-th dimension component set;

determining the average value of all components in the k-th dimension component set;

calculating the k-th dimension component standard deviation according to each component in the k-th dimension component set, the average value of all components, and the number of components in the k-th dimension component set.

6. A data processing device, characterized by comprising: an obtaining module, a judgment module, a first processing module and a replacement module;

the obtaining module is configured to obtain the proportion of a data set accounted for by abnormal data in the data set, wherein the data set includes a normal data subset and an abnormal data subset;

the judgment module is configured to judge whether the proportion is greater than or equal to a preset threshold;

the first processing module is configured to, if the proportion is greater than or equal to the preset threshold, for any abnormal data in the abnormal data subset, calculate the standardized Euclidean distance between the abnormal data and each piece of normal data in the normal data subset;

the replacement module is configured to replace the abnormal data with the normal data corresponding to the smallest standardized Euclidean distance.

7. The device according to claim 6, characterized by further comprising a second processing module;

the second processing module is configured to, if the proportion is less than the preset threshold, for any abnormal data: obtain two pieces of data adjacent to the abnormal data; calculate the average value of the two pieces of adjacent data;

8. The device according to claim 6, characterized in that the obtaining module includes: an acquisition submodule and a calculation submodule;

the acquisition submodule is configured to obtain the number of entries of abnormal data in the abnormal data subset and the number of entries of all data in the data set;

the calculation submodule is configured to calculate the ratio of the number of entries of the abnormal data to the number of entries of all data in the data set, obtaining the proportion of the data set accounted for by abnormal data in the data set.

9. The device according to claim 6, characterized in that the first processing module includes: a first acquisition submodule, a first processing submodule and a second processing submodule;

for any one piece of the normal data:

the first acquisition submodule is configured to obtain a k-th dimension component standard deviation, wherein 1≤k≤N;

the first processing submodule is configured to calculate the Euclidean distance between the abnormal data and the normal data in the k-th dimension according to the k-th dimension component of the abnormal data, the k-th dimension component of the normal data, and the k-th dimension component standard deviation;

the second processing submodule is configured to obtain the standardized Euclidean distance between the abnormal data and the normal data according to the Euclidean distance between the abnormal data and the normal data in each of the N dimensions.

10. The device according to claim 9, characterized in that the first acquisition submodule includes a standard deviation obtaining module,

configured to obtain the k-th dimension component of the abnormal data and the k-th dimension components of all normal data in the normal data subset, obtaining a k-th dimension component set;

determine the average value of all components in the k-th dimension component set; and calculate the k-th dimension component standard deviation according to each component in the k-th dimension component set, the average value of all components, and the number of components in the k-th dimension component set.