CN108154163A - Data processing method, data recognition and learning method, and apparatus therefor - Google Patents

Data processing method, data recognition and learning method, and apparatus therefor

Info

Publication number
CN108154163A
CN108154163A (application CN201611112409.8A)
Authority
CN
China
Prior art keywords
data
grouping
module
sample
central point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611112409.8A
Other languages
Chinese (zh)
Other versions
CN108154163B (en)
Inventor
闫强
李爱华
王晓
葛胜利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201611112409.8A
Publication of CN108154163A
Application granted
Publication of CN108154163B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a data processing method, a data recognition and learning method, and devices therefor. The data processing method includes: dividing sample data into multiple groups by clustering; checking whether each group contains positive-class labeled data, and deleting any of the multiple groups that contains none; determining the total quantity of positive-class labeled data in the multiple groups; determining whether the proportion of the total quantity of positive-class labeled data in the sample data exceeds a predetermined threshold; and, when the proportion exceeds the predetermined threshold, retaining the data processed in step (b). This scheme can accurately obtain the data required for model learning.

Description

Data processing method, data recognition and learning method, and apparatus therefor
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method, a data recognition and learning method, and devices therefor.
Background technology
In data mining and machine learning, classification and recognition are crucial for supervised learning models. The recognition process, however, often faces a serious imbalance between the magnitudes of the positive-class and negative-class samples. Therefore, if the sample data is not preprocessed and only simple model recognition is applied, accuracy is very likely to decline.
Existing data preprocessing includes outlier handling, undersampling, oversampling, and the like. These techniques, however, have problems of their own. For example, outlier handling gives special treatment to data points that deviate from the distribution trend or from dense regions, which may cause negative-class data to be deleted by mistake, particularly in cases where, for example, outlier phenomena necessarily occur in the characteristic data of risky users. Undersampling and oversampling can process the data at the magnitude level of each class, but they still cannot eliminate the masking between the features of the positive and negative classes, and they also break the randomness of sampling.
Therefore, there is a need for a data processing method, and a device therefor, that preprocesses data so as to solve at least some of the above problems.
Summary of the invention
To solve at least some of the above problems, embodiments of the present invention provide a data processing method, a data recognition and learning method, and devices therefor, which obtain the required data with high precision.
According to one aspect of the present invention, a data processing method is provided, including:
(a) dividing sample data into multiple groups by clustering;
(b) checking whether each group contains positive-class labeled data, and deleting any of the multiple groups that contains none;
(c) determining the total quantity of positive-class labeled data in the multiple groups;
(d) determining whether the proportion of the total quantity of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
(e) when the proportion exceeds the first predetermined threshold, retaining the data processed in step (b).
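Steps (b) through (e) above can be sketched in code. The following is a minimal illustration only, not part of the claimed embodiments: it assumes step (a) has already produced one group label per sample, treats the retained data as the base M of the proportion, and uses a function name (`filter_clusters`) and threshold value invented here.

```python
def filter_clusters(labels, is_positive, threshold):
    """Steps (b)-(e): delete groups with no positive-class sample, then
    check whether the positive-class share of the retained data reaches
    the first predetermined threshold."""
    keep = {c for c, pos in zip(labels, is_positive) if pos}  # groups holding a positive sample
    kept = [(c, pos) for c, pos in zip(labels, is_positive) if c in keep]
    positives = sum(pos for _, pos in kept)                   # total positive-class quantity
    balanced = bool(kept) and positives / len(kept) >= threshold
    return kept, balanced
```

For example, with group labels [0, 0, 0, 1, 1, 2] and only the first and last samples positive, group 1 is deleted and the retained positive share is 2/4.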
According to another aspect of the present invention, a method for pattern recognition and learning is provided, including: obtaining the processed multiple groups of sample data by the above data processing method; and performing pattern recognition and learning based on the processed multiple groups of the sample data.
According to another aspect of the present invention, a data processing device is provided, including:
a group division module for dividing sample data into multiple groups by clustering;
a data checking module for checking whether each of the multiple groups contains positive-class labeled data;
a data deletion module for deleting groups that contain no positive-class labeled data;
a data quantity determination module for determining the total quantity of positive-class labeled data in the multiple groups;
a data proportion determination module for determining whether the proportion of the total quantity of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
a data retention module for retaining the data processed by the data deletion module when the proportion exceeds the first predetermined threshold.
According to another aspect of the present invention, a device for pattern recognition and learning is provided, including:
a data group acquisition module for obtaining the processed multiple groups of sample data from the above data processing device; and
a recognition and learning module for performing pattern recognition and learning based on the processed multiple groups of the sample data.
According to another aspect of the present invention, a data processing device is provided, including:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory, so as to perform the above data processing method.
According to another aspect of the present invention, a device for pattern recognition and learning is provided, including:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory, so as to perform the above pattern recognition and learning method.
According to another aspect of the present invention, a memory device carrying a computer program thereon is provided; when the computer program is executed by a processor, it causes the processor to perform the above data processing method.
According to another aspect of the present invention, a memory device carrying a computer program thereon is provided; when the computer program is executed by a processor, it causes the processor to perform the above pattern recognition and learning method.
The above schemes can find the commonality of the required data samples by clustering, reject part of the interfering data, balance the ratio of positive-class to negative-class data, and retain the positive- and negative-class data with higher similarity, so as to accurately obtain the required data samples. Performing model learning on such data samples can greatly improve the precision of the subsequent model learning.
Brief description of the drawings
The above features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 shows a schematic flowchart of a method for pattern recognition and learning according to an embodiment of the present invention;
Fig. 3 shows a schematic block diagram of the structure of a data processing device according to an embodiment of the present invention;
Fig. 4 shows a schematic block diagram of the structure of a group division module according to an embodiment of the present invention;
Fig. 5 shows a schematic block diagram of the structure of a device for pattern recognition and learning according to an embodiment of the present invention;
Fig. 6 shows a specific implementation of a data processing method according to an embodiment of the present invention;
Fig. 7 shows a graph of the relationship between variance and the number of clusters according to an embodiment of the present invention; and
Fig. 8 shows a block diagram of an exemplary hardware arrangement of the devices of Figs. 3 and 5 according to an embodiment of the present invention.
Detailed description
In the following, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the drawings, identical reference numerals denote identical or similar components, even where they appear in different figures. For clarity and conciseness, detailed descriptions of well-known functions and structures are omitted herein, to avoid obscuring the subject matter of the present invention.
In applications involving sample data, the required data may account for only a small proportion of the whole. For example, in application processes such as consumer risk identification and anomaly identification, the data of normal users makes up the major part of the behavioral data collected as samples, while the data of the risky or abnormal users to be identified is often only a small fraction. This results in a disproportion between the negative-class (normal user) data and the positive-class (abnormal/risky user) data in the sample. During model learning, the features of the negative-class data can then severely mask those of the positive-class data, leading to low model learning precision and, in turn, to failure to identify risky users or to risky users being misidentified as normal users.
To this end, embodiments of the present invention provide, with reference to Figs. 1 to 4, a data processing method, a data recognition and learning method, and devices therefor. Here, positive-class data refers to the data that is the target of model learning, and all other data is negative-class data. For example, in the above risk/anomaly identification applications, the data of risky users is positive-class data and the data of normal users is negative-class data.
Fig. 1 shows a schematic flowchart of a data processing method according to an embodiment of the present invention. As shown in Fig. 1, the data processing method includes:
step S110: dividing sample data into multiple groups by clustering;
step S120: checking whether each group contains positive-class labeled data, and deleting any of the multiple groups that contains none;
step S130: determining the total quantity of positive-class labeled data in the multiple groups;
step S140: determining whether the proportion of the total quantity of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
step S150: when the proportion exceeds the first predetermined threshold, retaining the data processed in step S120.
This scheme can preliminarily find the commonality of the required data samples by clustering, reject part of the interfering data, balance the ratio of positive-class to negative-class data, and retain the positive- and negative-class data with higher similarity, so as to accurately obtain the required data samples.
Optionally, in some embodiments, as shown in Fig. 1, the method may further include step S160: when the proportion is below the first predetermined threshold, performing steps S110-S150 again with the data processed in step S120 as the sample data, the number of groups being increased in the repeated step S110.
Through the cycle formed by step S160, interfering data can be rejected repeatedly, the ratio of positive-class to negative-class data can be balanced, and the positive- and negative-class data with higher similarity can be retained, achieving higher processing accuracy.
In some examples, a maximum number of groups can also be set. In that case, in step (f), if the current number of groups reaches or exceeds this maximum, the number of groups is not increased further, the data from the last execution of step (b) is retained, and the data processing ends.
In some examples, the clustering can be implemented with the k-means algorithm, the initial number of groups being determined through the k-means algorithm so that, at that initial number, the overall change of the within-cluster data variance begins to slow.
It should be noted that the present invention is not limited to using the data variance; any indicator characterizing the rate of change of variation can be used to implement this step.
In some examples, step (a) may include:
(a1) randomly selecting multiple sample data points from the sample data as the central points of the multiple groups, the number of selected points being equal to the number of groups to be divided;
(a2) computing the distance from every data sample to each central point, and assigning each data sample to the group of the central point nearest to it;
(a3) computing the average of all data samples in each group, and taking that average as the group's new central point;
(a4) for each group, judging whether the difference between the new central point and the previous central point exceeds a second predetermined threshold; if it does, performing steps (a2) and (a3) again with the new central point; if it does not, taking the new central point as the optimal cluster center; and
(a5) computing the distance from every sample data point to each optimal cluster center, and re-assigning each sample data point to the group of the optimal cluster center nearest to it.
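Sub-steps (a1) through (a5) can be sketched as plain k-means. The sketch below is illustrative only: it assumes squared Euclidean distance, and the function name, tolerance (standing in for the second predetermined threshold), and fixed seed are choices made here, not taken from the patent.

```python
import random

def kmeans_group(data, k, tol=1e-6, seed=0):
    """Sub-steps (a1)-(a5): basic k-means over tuples of coordinates."""
    dist2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    rng = random.Random(seed)
    centers = rng.sample(data, k)                      # (a1) random initial central points
    while True:
        groups = [[] for _ in range(k)]
        for x in data:                                 # (a2) assign each sample to nearest center
            groups[min(range(k), key=lambda i: dist2(x, centers[i]))].append(x)
        new_centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                       for i, g in enumerate(groups)]  # (a3) group average is the new center
        shift = max(dist2(c0, c1) for c0, c1 in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:                               # (a4) centers settled: optimal centers
            break
    # (a5) final re-assignment to the nearest optimal cluster center
    labels = [min(range(k), key=lambda i: dist2(x, centers[i])) for x in data]
    return labels, centers
```

On two well-separated pairs of points, the sketch recovers the two groups regardless of which two points are drawn as initial centers.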
Fig. 2 shows a schematic flowchart of a method for pattern recognition and learning according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
step S210: obtaining the processed multiple groups of sample data by the data processing method of Fig. 1; and
step S220: performing pattern recognition and learning based on the processed multiple groups of the sample data.
It should be noted that the methods shown in Figs. 1 and 2 are merely illustrative, and any modification falling within the scope of the present invention may be made to them. For example, although Fig. 1 shows the cycle formed by step S160 through increasing the number of groups, such a cycle is unnecessary in some embodiments (for example, when a single pass through steps S110-S130 has already obtained satisfactory data).
Fig. 3 shows a schematic block diagram of a data processing device according to an embodiment of the present invention. As shown in Fig. 3, the device includes:
a group division module 310 for dividing sample data into multiple groups by clustering;
a data checking module 320 for checking whether each of the multiple groups contains positive-class labeled data;
a data deletion module 330 for deleting groups that contain no positive-class labeled data;
a data quantity determination module 340 for determining the total quantity of positive-class labeled data in the multiple groups;
a data proportion determination module 350 for determining whether the proportion of the total quantity of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
a data retention module 360 for retaining the data processed by the data deletion module 330 when the proportion exceeds the first predetermined threshold.
In some embodiments, when the proportion is below the first predetermined threshold, the group division module 310 can further perform group division again with the data processed by the data deletion module as the new sample data, so that the operations of the data checking module, data deletion module, data quantity determination module, data proportion determination module, and data retention module are repeated; the number of groups is increased each time the data processed by the data deletion module 330 is re-divided. The operation of the group division module 310 thus forms a cycle of rejecting interfering data, balancing the ratio of positive-class to negative-class data, and retaining the positive- and negative-class data with higher similarity, so more complex data situations can be handled with higher processing accuracy.
The device shown in Fig. 3 may further include a maximum group number setting module 370 for setting the maximum number of groups. If the current number of groups reaches or exceeds this maximum, the group division module 310 no longer increases the number of groups when repeating the division, and the data retention module 360 retains the data last processed by the data deletion module.
In some examples, the group division module 310 can implement the clustering with the k-means algorithm, determining the initial number of groups through the algorithm so that, at that initial number, the overall change of the within-cluster data variance begins to slow. As noted above, any indicator characterizing the rate of change of variation can be used instead.
In some examples, the group division module 310 may further include multiple submodules. As shown in Fig. 4, the group division module 310 includes:
a central point determination submodule 311 for randomly selecting multiple sample data points from the sample data as the central points of the multiple groups, the number of selected points being equal to the number of groups to be divided;
a central point distance determination submodule 312 for computing the distance from every data sample to each central point and assigning each data sample to the group of the central point nearest to it;
a central point re-determination submodule 313 for computing the average of all data samples in each group and taking that average as the group's new central point;
an optimal cluster center determination submodule 314 for judging, for each group, whether the difference between the new central point and the previous central point exceeds a second predetermined threshold; if it does, sending the new central point to the central point distance determination submodule 312 to re-execute the data grouping and new central point determination; if it does not, taking the new central point as the optimal cluster center; and
a group determination submodule 315 for computing the distance from every sample data point to each optimal cluster center and re-assigning each sample data point to the group of the optimal cluster center nearest to it.
Fig. 5 shows a schematic block diagram of a device for pattern recognition and learning according to an embodiment of the present invention. As shown in Fig. 5, the device includes:
a data group acquisition module 410 for obtaining the processed multiple groups of sample data from the data processing device of Fig. 3; and
a recognition and learning module 420 for performing pattern recognition and learning based on the processed multiple groups of the sample data.
It should be noted that the block diagrams of Figs. 3 to 5 are merely illustrative, and concrete implementations may take other forms. For example, in some implementations the data produced by each module/submodule of the devices of Figs. 3 to 5 may also be stored in a storage device (not shown), and the other modules/submodules may obtain the various data generated by each module/submodule by reading from that storage device. In that case, the connection signals between the modules/submodules of the devices of Figs. 3 to 5 may change. Such variations do not depart from the scope of the illustrated embodiments of the present invention and should be regarded as falling within it. For example, although the maximum group number setting module 370 is shown in Fig. 3, it is not necessarily needed in some embodiments.
A specific implementation of the methods and devices of Figs. 1 to 5 will now be described with reference to Fig. 6. It should be noted that Fig. 6 only shows one specific implementation of the method provided by the embodiments of the present invention, and should not be taken as limiting that method. For example, other specific implementations need not use the k-means algorithm below; they may use variants of it, or any other algorithm/method in the field capable of clustering data by commonality.
The steps shown in Fig. 6 roughly correspond to those of Fig. 1: steps A02 and A03 in Fig. 6 may correspond respectively to steps S110 and S120 in Fig. 1, step A04 in Fig. 6 may correspond to steps S130 and S140 in Fig. 1, and steps A05 and A06 in Fig. 6 may correspond respectively to steps S160 and S150 in Fig. 1. This correspondence is, however, neither necessary nor strict; in some variants of the embodiments the content of specific steps may differ.
The k-means clustering algorithm used in the flow of Fig. 6 differs from the usual k-means algorithm. For example, the judgement of the k value is carried out only before the k-means clustering is performed, and no conventional data cleansing is carried out.
In step A01 of Fig. 6, the initial k value of the k-means algorithm may be determined. In the present technical solution, this k value can represent the number of groups (or clusters) to be divided.
The k value can be determined manually, automatically, or by a combination of the two. For example, a data processing practitioner who is very familiar with the data application scenario can specify the number of clusters manually from prior knowledge. In other cases, the k value can also be determined automatically by the system. One automatic determination method for k is given below.
First, a range from 2 to N is specified. In the illustrative example of the present invention, N may be taken as 15 for demonstration. Note, however, that the value of N is not limited to 15; any suitable value can be chosen according to the concrete implementation scenario.
Next, a loop is performed over the chosen range, computing how the within-cluster variance changes as the number of clusters increases. In the example of the present invention, the variance can be computed with the formula D = Σ (Xi - u)^2, where u is the mean of a class, Xi is each data point in that class, and the sum runs over the data points of each of the k classes.
Then, the number of clusters at which the rate of change of the computed variance slows is determined. In the scheme of the embodiments of the present invention, the criterion for a suitable k value can be: a suitable cluster number k corresponds to a turning point before which the within-cluster variance decreases sharply and after which the decrease slows. This step can be performed with reference to the variance-versus-cluster-number graph shown in Fig. 7, or the slowdown can be determined by comparing the rate of variance change with some threshold. In the scheme combining manual and automatic specification, an experienced data analyst can, for example, select a value of 4 or 5 as k with reference to the results shown in Fig. 7. Alternatively, in the fully automatic scheme, the system can compute the rate of variance change and compare it with some threshold to determine the moment at which the variance change slows; this is not elaborated further here. In a preferred embodiment, a smaller value is selected as the initial k; in the example of Fig. 6, the initial k may be selected as 4.
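As a purely illustrative sketch of this automatic selection (the function name and the 0.5 slowdown factor are assumptions made here, not values from the patent), the turning point can be found by scanning the computed variances and stopping where the drop falls well below the preceding drop:

```python
def pick_k(variance_by_k, slowdown=0.5):
    """Pick the cluster number k at the turning point: the first k whose
    variance drop is less than `slowdown` times the preceding drop.
    `variance_by_k` maps each candidate k to the within-cluster variance
    sum of (Xi - u)^2 described above."""
    ks = sorted(variance_by_k)
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        drop_before = variance_by_k[prev] - variance_by_k[cur]
        drop_after = variance_by_k[cur] - variance_by_k[nxt]
        if drop_before > 0 and drop_after < slowdown * drop_before:
            return cur              # variance falls sharply before cur and slowly after
    return ks[-1]                   # no clear turning point: take the largest candidate
```

For a variance curve that falls 100 -> 60 -> 40 and then flattens to 36 -> 34, the sketch returns k = 4, consistent with the 4-or-5 choice described above.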
It should be noted that the initial k value can be re-determined each time the present technical solution is carried out with the k-means algorithm. In repeated processing of data from the same scenario, however, the data processing personnel may, based on prior knowledge, reuse the same initial k value across runs without performing the initial-k determination of step A01.
In step A02 of Fig. 6, the sample data is divided into multiple groups (clusters) using the k-means algorithm. The specific steps are as follows:
1) The number of cluster groups obtained in step A01 is assigned to the variable K, and K data samples are randomly selected as central points.
2) The distance from every data sample to each of the K central points is computed, and each data sample is assigned to the group of the central point nearest to it.
3) The data samples in each group are averaged, and the average is taken as the group's new central point.
4) For each group, the difference between the new central point and the previous central point is judged: if the difference is too large (for example, the distance/difference exceeds some predetermined threshold), steps 2) and 3) are repeated for further iteration; if the difference is small (for example, below that threshold), the iteration stops and the new central point is taken as the optimal cluster center.
5) Based on the optimal cluster centers computed in step 4), each optimal cluster center is (optionally) numbered, the distance from every sample data point to each optimal cluster center is computed, and each sample data point is re-assigned (re-classified) to the group of the optimal cluster center nearest to it.
After the sample data has been divided into groups in step A02, step A03 checks whether each data group contains positive-class data. If a group contains none, all data in that group is deleted from the data set; if it contains positive-class data, the data in the group is retained. A state variable f can be created here: f = true if some data group was deleted, and f = false if no data group was deleted.
Then, in step A04, it is checked whether the magnitudes of the positive-class and negative-class data in the sample meet the requirement, for example whether their ratio is balanced. A threshold parameter a can be set here, and the ratio of the quantity G of positive-class samples to the total quantity M of data samples is compared with this threshold to judge whether the ratio is balanced. For example:
if G/M >= a,
then the ratio is balanced and step A06 is performed;
if G/M < a,
then the ratio is not yet balanced and steps A05 -> A02 -> A03 -> A04 are performed, with the value of the state variable f in step A03 optionally updated accordingly in each cycle.
In step A05, the parameter k of the k-means algorithm is adjusted to make the data more dispersed, so that the positive-class and negative-class data can be adjusted better. Specifically, the value of the parameter k is increased each time this step is performed.
Preferably, this step can set a maximum value x for the parameter k. The larger x is, the larger the number of groups and the fewer the sample data in each group, so the more dispersed the data and, as reflected in the data characteristics, the greater the cohesion within each data group. An excessively large x, however, makes each data group overly cohesive and too weakly independent. The value of x should therefore balance the cohesion of the groups against their independence. With x set, one implementation of step A05 is as follows:
IF (f = true), then k = k, and steps A02->A03->A04 are continued;
IF (f = false and k < x), then k = k + 1, and steps A02->A03->A04 are continued;
IF (f = false and k >= x), then step A06 is performed.
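The three branches above can be expressed as a small helper. A sketch only: `f` is the deletion flag from step A03, `k` the current cluster count, `x` the cap on k; the function returns the next k and whether the loop should continue:

```python
def next_k(f, k, x):
    """Step A05 as described above: decide the next cluster count."""
    if f:              # groups were deleted: re-cluster with the same k
        return k, True
    if k < x:          # nothing deleted and room to grow: disperse further
        return k + 1, True
    return k, False    # k has reached its cap x: stop and go to step A06
```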
It should be noted that the above pseudocode is merely an example of one specific implementation, given to illustrate the technical solution of the present invention. Other pseudocode may be used in other specific examples. For example, in some implementations, the operation of increasing the number of groups is performed in step A05 even if data groups have been deleted in step A03. In that case, the condition on the state variable f in the above pseudocode is unnecessary: as long as the value of k has not reached the maximum value x, it can be increased. A modified example is as follows:
IF (k < x), then k = k + 1, and steps A02->A03->A04 are continued;
IF (k >= x), then step A06 is performed.
In step A06, after the data adjustment and deletion of steps A01-A04 (and, in some cases, also A05), sample data in which the positive and negative classes are relatively balanced compared with the initial data are obtained. Pattern recognition and learning can then be performed on the basis of this sample data.
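Assuming Python/NumPy, the overall flow of steps A01-A06 described above can be sketched end to end as follows. `simple_kmeans` is a minimal stand-in for the k-means procedure, and all names are illustrative rather than taken from the patent; it is a sketch under the assumption that the label array y contains at least one positive (label 1) sample:

```python
import numpy as np

def simple_kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: random initial centers, then alternate
    nearest-center assignment and mean updates. Illustrative only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for g in range(k):
            if (labels == g).any():
                centers[g] = X[labels == g].mean(axis=0)
    return labels

def balance_by_clustering(X, y, k0=2, a=0.5, x=10):
    """Sketch of the A01-A06 loop: cluster, drop groups with no
    positive sample, and grow k until G/M >= a or k reaches the cap x."""
    k = k0
    while True:
        labels = simple_kmeans(X, k)                             # step A02
        good = [g for g in range(k) if y[labels == g].sum() > 0]
        keep = np.isin(labels, good)                             # step A03
        deleted = not keep.all()                                 # state flag f
        X, y = X[keep], y[keep]
        if y.sum() / len(y) >= a:                                # step A04
            return X, y                                          # step A06
        if not deleted:
            if k >= x:                                           # cap reached
                return X, y
            k += 1                                               # step A05
```

Since groups containing positive-class samples are never deleted, all positive samples survive the loop; only negative-class data are discarded, which is the balancing effect the text describes.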
In the technical solutions of the embodiments of the present invention described above with reference to Figs. 1 to 7, part of the negative-class data is discarded, which improves the balance between positive- and negative-class data in the remaining sample data and, in addition, eliminates negative-class data whose attributes differ greatly from those of the positive class. Because the discarded data are not the target of model learning, the reduction in sample size caused by discarding them has almost no influence on subsequent model learning, so the data characteristics are preserved to a great extent. Performing model learning on such a sample, in which the imbalance between the positive and negative classes is relatively small, gives the learned model a stronger recognition capability and thereby improves recognition accuracy.
Fig. 8 is a block diagram showing an exemplary hardware arrangement of the device of Fig. 3 or Fig. 5 according to an embodiment of the present disclosure. The hardware arrangement includes a processor 506 (e.g., a microprocessor (μP), a digital signal processor (DSP), etc.). The processor 506 may be a single processing unit or multiple processing units for performing different actions of the flows described herein. The arrangement may also include an input unit 502 for receiving signals from other entities and an output unit 504 for providing signals to other entities. The input unit 502 and the output unit 504 may be arranged as a single entity or as separate entities.
In addition, the arrangement may include at least one readable storage medium 508 in the form of non-volatile or volatile memory, for example an electrically erasable programmable read-only memory (EEPROM), a flash memory, and/or a hard disk drive. The readable storage medium 508 includes a computer program 510, which includes code/computer-readable instructions that, when executed by the processor 506 in the arrangement, cause the hardware arrangement and/or a device including the hardware arrangement to perform, for example, the flows described above in connection with Fig. 1/Fig. 2 and any variations thereof.
In the case where the device shown in Fig. 3 is implemented, the computer program 510 may be configured as computer program code having the architecture of, for example, computer program modules 510A-510E. Accordingly, in an example embodiment in which the hardware arrangement is used in, for example, the device, the code in the computer program of the arrangement includes: a module 510A for dividing sample data into multiple groups by means of clustering; a module 510B for checking whether each of the multiple groups contains positive-class labeled data and deleting groups that contain no positive-class labeled data; a module 510C for determining the total quantity of positive-class labeled data in the multiple groups; a module 510D for determining whether the proportion of the total quantity of positive-class labeled data in the sample data exceeds a first predetermined threshold; and a module 510E for retaining the data processed by module 510B in the case where the proportion exceeds the first predetermined threshold.
In the case where the device shown in Fig. 5 is implemented, the computer program 510 may be configured as computer program code having only the architecture of, for example, computer program modules 510A-510B. Accordingly, in an example embodiment in which the hardware arrangement is used in, for example, the device, the code in the computer program of the arrangement includes: a module 510A for acquiring the processed multiple groups of sample data based on the processing of the device shown in Fig. 3. The code in the computer program further includes: a module 510B for performing pattern recognition and learning based on the processed multiple groups of the sample data.
The computer program modules may substantially perform the respective actions in the flow shown in Fig. 1 or Fig. 2 so as to simulate the device shown in Fig. 3 or Fig. 5. In other words, when the different computer program modules are executed in the processor 506, they may correspond to the different units described above in the device shown in Fig. 3 or Fig. 5.
Although the code means in the embodiments disclosed above in connection with Fig. 8 are implemented as computer program modules that, when executed in the processor 506, cause the hardware arrangement to perform the actions described above in connection with Fig. 1 or Fig. 2, in alternative embodiments at least one of the code means may be implemented at least partly as a hardware circuit.
The processor may be a single CPU (central processing unit), but may also include two or more processing units. For example, the processor may include a general-purpose microprocessor, an instruction-set processor, and/or a related chipset and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)). The processor may also include onboard memory for caching purposes.
The computer program may be carried by a computer program product connected to the processor. The computer program product may include a computer-readable medium on which the computer program is stored. For example, the computer program product may be a flash memory, a random access memory (RAM), a read-only memory (ROM), or an EEPROM, and in alternative embodiments the above computer program modules may be distributed, in the form of memories within the UE, among different computer program products.
It should be noted that the technical solutions recorded in the embodiments of the present invention may be combined arbitrarily, provided that there is no conflict.
In the several embodiments provided by the present invention, it should be understood that the disclosed methods and apparatuses may be realized in other manners. The apparatus embodiments described above are merely schematic; for example, the division of the units is only a division by logical function, and other division manners are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may all be integrated into one processing unit, or each unit may individually serve as a unit, or two or more units may be integrated into one unit; the above integrated unit may be realized in the form of hardware, or in the form of hardware plus software functional units.
The above description is only for realizing the embodiments of the present invention, and those skilled in the art should understand that any modification or partial replacement that does not depart from the scope of the present invention shall belong to the scope defined by the claims of the present invention; therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (16)

1. A data processing method, comprising:
(a) dividing sample data into multiple groups by means of clustering;
(b) checking whether each group contains positive-class labeled data, and deleting, from the multiple groups, any group that contains no positive-class labeled data;
(c) determining the total quantity of positive-class labeled data in the multiple groups;
(d) determining whether the proportion of the total quantity of the positive-class labeled data in the sample data exceeds a first predetermined threshold; and
(e) in the case where the proportion exceeds the first predetermined threshold, retaining the data processed in step (b).
2. The data processing method according to claim 1, further comprising:
(f) in the case where the proportion is less than the first predetermined threshold, performing steps (a)-(e) with the data processed in step (b) as the sample data, wherein the number of groups is increased in the repeated step (a).
3. The data processing method according to claim 2, further comprising:
setting a maximum number of groups;
wherein, in step (f), if the current number of groups is greater than or equal to the maximum number, the number of groups is not increased further, the data processed in the last execution of step (b) are retained, and the data processing ends.
4. The data processing method according to any one of claims 1 to 3, wherein a k-means algorithm is used to realize the clustering,
wherein the initial number of the groups is determined by the k-means algorithm such that, at the initial number, the overall change of the data variance within each cluster slows down.
5. The data processing method according to claim 4, wherein step (a) comprises:
(a1) randomly selecting multiple data samples from the sample data as the center points of the respective groups of the multiple groups, the number of the multiple data samples being equal to the number of the multiple groups to be divided;
(a2) calculating the distance from every data sample to each center point, and assigning each data sample to the group of its nearest center point;
(a3) calculating the average value of all the data samples in each group, and taking the average value as the new center point of that group;
(a4) for each group, judging whether the difference between the new center point and the previous center point exceeds a second predetermined threshold; if the difference exceeds the second predetermined threshold, performing steps (a2)-(a3) with the new center point; if the difference does not exceed the second predetermined threshold, determining the new center point as an optimal cluster center; and
(a5) calculating the distance from every sample data to each optimal cluster center point, and re-assigning each sample data to the group of its nearest optimal cluster center point.
6. A method for pattern recognition and learning, comprising:
acquiring the processed multiple groups of sample data on the basis of the data processing method according to any one of claims 1 to 5; and
performing pattern recognition and learning based on the processed multiple groups of the sample data.
7. A data processing device, comprising:
a grouping module, configured to divide sample data into multiple groups by means of clustering;
a data checking module, configured to check whether each of the multiple groups contains positive-class labeled data;
a data deletion module, configured to delete groups that contain no positive-class labeled data;
a data quantity determining module, configured to determine the total quantity of positive-class labeled data in the multiple groups;
a data proportion determining module, configured to determine whether the proportion of the total quantity of the positive-class labeled data in the sample data exceeds a first predetermined threshold; and
a data retaining module, configured to retain, in the case where the proportion exceeds the first predetermined threshold, the data processed by the data deletion module.
8. The data processing device according to claim 7, wherein the grouping module is further configured to, in the case where the proportion is less than the first predetermined threshold, perform group division with the data processed by the data deletion module as the sample data, so as to repeat the operations of the data checking module, the data deletion module, the data quantity determining module, the data proportion determining module, and the data retaining module, wherein the number of groups is increased in the group division of the data processed by the data deletion module.
9. The data processing device according to claim 8, further comprising:
a maximum group number setting module, configured to set a maximum number of groups;
wherein, if the current number of groups is greater than or equal to the maximum number, the grouping module does not further increase the number of groups when repeatedly performing group division, and the data retaining module retains the data processed by the data deletion module in the last repetition.
10. The data processing device according to any one of claims 7-9, wherein the grouping module uses a k-means algorithm to realize the clustering,
wherein the grouping module determines the initial number of the groups by the k-means algorithm such that, at the initial number, the overall change of the data variance within each cluster slows down.
11. The data processing device according to claim 8, wherein the grouping module further comprises:
a center point determining submodule, configured to randomly select multiple data samples from the sample data as the center points of the respective groups of the multiple groups, the number of the multiple data samples being equal to the number of the multiple groups to be divided;
a center point distance determining submodule, configured to calculate the distance from every data sample to each center point and to assign each data sample to the group of its nearest center point;
a center point re-determining submodule, configured to calculate the average value of all the data samples in each group and to take the average value as the new center point of that group;
an optimal cluster center determining submodule, configured to judge, for each group, whether the difference between the new center point and the previous center point exceeds a second predetermined threshold, and, if the difference exceeds the second predetermined threshold, to send the new center point to the center point distance determining submodule so as to re-execute the data grouping and the determination of a new center point, and, if the difference does not exceed the second predetermined threshold, to determine the new center point as an optimal cluster center; and
a grouping determining submodule, configured to calculate the distance from every sample data to each optimal cluster center point and to re-assign each sample data to the group of its nearest optimal cluster center point.
12. A device for pattern recognition and learning, comprising:
a data group acquiring module, configured to acquire the processed multiple groups of sample data from the data processing device according to any one of claims 7 to 11; and
an identifying and learning submodule, configured to perform pattern recognition and learning based on the processed multiple groups of the sample data.
13. A data processing device, comprising:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory, so as to perform the method according to any one of claims 1 to 5.
14. A device for pattern recognition and learning, comprising:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory, so as to perform the method according to claim 6.
15. A memory device carrying a computer program thereon, wherein, when the computer program is executed by a processor, the computer program causes the processor to perform the method according to any one of claims 1 to 5.
16. A memory device carrying a computer program thereon, wherein, when the computer program is executed by a processor, the computer program causes the processor to perform the method according to claim 6.
CN201611112409.8A 2016-12-06 2016-12-06 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium Active CN108154163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611112409.8A CN108154163B (en) 2016-12-06 2016-12-06 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium


Publications (2)

Publication Number Publication Date
CN108154163A true CN108154163A (en) 2018-06-12
CN108154163B CN108154163B (en) 2020-11-24

Family

ID=62468532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611112409.8A Active CN108154163B (en) 2016-12-06 2016-12-06 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium

Country Status (1)

Country Link
CN (1) CN108154163B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999615A (en) * 2012-11-29 2013-03-27 合肥工业大学 Diversified image marking and retrieving method based on radial basis function neural network
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data sorting treatment method
CN105868775A (en) * 2016-03-23 2016-08-17 深圳市颐通科技有限公司 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN105912726A (en) * 2016-05-13 2016-08-31 北京邮电大学 Density centrality based sampling and detecting methods of unusual transaction data of virtual assets


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147081A (en) * 2018-09-03 2019-01-04 深圳市智物联网络有限公司 A kind of equipment operation stability analysis method and system
CN109147081B (en) * 2018-09-03 2021-02-26 深圳市智物联网络有限公司 Equipment operation stability analysis method and system
CN109447103A (en) * 2018-09-07 2019-03-08 平安科技(深圳)有限公司 A kind of big data classification method, device and equipment based on hard clustering algorithm
CN109447103B (en) * 2018-09-07 2023-09-29 平安科技(深圳)有限公司 Big data classification method, device and equipment based on hard clustering algorithm
CN109495291A (en) * 2018-09-30 2019-03-19 阿里巴巴集团控股有限公司 Call abnormal localization method, device and server
CN109495291B (en) * 2018-09-30 2021-11-16 创新先进技术有限公司 Calling abnormity positioning method and device and server
CN110427358A (en) * 2019-02-22 2019-11-08 北京沃东天骏信息技术有限公司 Data cleaning method and device and information recommendation method and device
CN110427358B (en) * 2019-02-22 2021-04-30 北京沃东天骏信息技术有限公司 Data cleaning method and device and information recommendation method and device
CN110579708A (en) * 2019-08-29 2019-12-17 爱驰汽车有限公司 Battery capacity identification method and device, computing equipment and computer storage medium
CN110579708B (en) * 2019-08-29 2021-10-22 爱驰汽车有限公司 Battery capacity identification method and device, computing equipment and computer storage medium

Also Published As

Publication number Publication date
CN108154163B (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN108154163A (en) Data processing method, data identification and learning method and its device
TWI703503B (en) Risk transaction identification method, device, server and storage medium
JP5792747B2 (en) Text classification method and system
CN108090508A (en) A kind of classification based training method, apparatus and storage medium
EP3203417B1 (en) Method for detecting texts included in an image and apparatus using the same
CN108595585A (en) Sample data sorting technique, model training method, electronic equipment and storage medium
TWI696964B (en) Object classification method, device, server and storage medium
CN111291865B (en) Gait recognition method based on convolutional neural network and isolated forest
JP6708043B2 (en) Data search program, data search method, and data search device
CN112825576A (en) Method and device for determining cell capacity expansion and storage medium
CN109635669A (en) Image classification method, the training method of device and disaggregated model, device
CN110083507A (en) Key Performance Indicator classification method and device
CN106778731B (en) A kind of license plate locating method and terminal
CN112437053A (en) Intrusion detection method and device
CN110929218A (en) Difference minimization random grouping method and system
EP3067804A1 (en) Data arrangement program, data arrangement method, and data arrangement apparatus
CN112966643A (en) Face and iris fusion recognition method and device based on self-adaptive weighting
CN111428064B (en) Small-area fingerprint image fast indexing method, device, equipment and storage medium
CN106021852B (en) Blood glucose level data exception value calculating method based on density clustering algorithm and device
CN109583492A (en) A kind of method and terminal identifying antagonism image
CN107077617B (en) Fingerprint extraction method and device
CN108364026A (en) A kind of cluster heart update method, device and K-means clustering methods, device
CN104573696B (en) Method and apparatus for handling face characteristic data
US11113580B2 (en) Image classification system and method
CN113705625A (en) Method and device for identifying abnormal life guarantee application families and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant