CN108154163A - Data processing method, data identification and learning method and its device - Google Patents
- Publication number
- CN108154163A (application CN201611112409.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- marshalling
- module
- sample
- central point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a data processing method, a data identification and learning method, and devices therefor. The data processing method includes: (a) dividing sample data into multiple groups by clustering; (b) checking whether each group contains positive-class labeled data, and deleting the groups that contain none; (c) determining the total quantity of positive-class labeled data in the groups; (d) determining whether the proportion of positive-class labeled data in the sample data exceeds a predetermined threshold; and, where the proportion exceeds the predetermined threshold, retaining the data as processed in step (b). This scheme can accurately obtain the data needed for model learning.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method, a data identification and learning method, and devices therefor.
Background art
In data mining and machine learning, classification by a supervised learning model is of central importance. In practice, however, the quantities of positive-class and negative-class data are often imbalanced. If the sample data is not preprocessed and is simply passed to the model, accuracy is likely to drop.
Existing data preprocessing techniques include outlier handling, undersampling, and oversampling, but each has its own problems. Outlier handling applies special treatment to data points that deviate from the overall trend or concentration of the data distribution. This may cause negative-class data to be deleted by mistake, particularly where, for example, the characteristics of risky users necessarily appear as outliers. Undersampling and oversampling can adjust the quantity of data in each class, but they cannot remove the effect of positive-class features masking negative-class features, and they also break the randomness of sampling.
A data processing method and device for preprocessing data are therefore needed that solve at least some of the above problems.
Summary of the invention
To solve at least some of the above problems, embodiments of the present invention provide a data processing method, a data identification and learning method, and devices therefor, which obtain the required data with high accuracy.
According to one aspect of the present invention, a data processing method is provided, including:
(a) dividing sample data into multiple groups (clusters) by clustering;
(b) checking whether each of the groups contains positive-class labeled data, and deleting the groups that contain none;
(c) determining the total quantity of positive-class labeled data in the groups;
(d) determining whether the proportion of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
(e) where the proportion exceeds the first predetermined threshold, retaining the data as processed in step (b).
According to another aspect of the present invention, a method for pattern recognition and learning is provided, including: obtaining processed groups of sample data based on the above data processing method; and performing pattern recognition and learning based on the processed groups of the sample data.
According to another aspect of the present invention, a data processing device is provided, including:
a group division module for dividing sample data into multiple groups by clustering;
a data checking module for checking whether each of the groups contains positive-class labeled data;
a data deletion module for deleting the groups that contain no positive-class labeled data;
a data quantity determination module for determining the total quantity of positive-class labeled data in the groups;
a data ratio determination module for determining whether the proportion of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
a data retention module for retaining, where the proportion exceeds the first predetermined threshold, the data processed by the data deletion module.
According to another aspect of the present invention, a device for pattern recognition and learning is provided, including:
a data group acquisition module for obtaining processed groups of sample data from the above data processing device; and
a recognition and learning submodule for performing pattern recognition and learning based on the processed groups of the sample data.
According to another aspect of the present invention, a data processing device is provided, including:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory so as to perform the above data processing method.
According to another aspect of the present invention, a device for pattern recognition and learning is provided, including:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory so as to perform the above pattern recognition and learning method.
According to another aspect of the present invention, a memory device carrying a computer program is provided. When executed by a processor, the computer program causes the processor to perform the above data processing method.
According to another aspect of the present invention, a memory device carrying a computer program is provided. When executed by a processor, the computer program causes the processor to perform the above pattern recognition and learning method.
The above scheme uses clustering to find what the data samples to be processed have in common, removes part of the interfering data, balances the ratio of positive-class to negative-class data, and retains the positive-class and negative-class data that are most similar to each other, thereby accurately obtaining the required data samples. Performing model learning on such data samples can greatly improve the accuracy of subsequent model learning.
Brief description of the drawings
The above features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a schematic flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 shows a schematic flowchart of a method for pattern recognition and learning according to an embodiment of the present invention;
Fig. 3 shows a schematic block diagram of the structure of a data processing device according to an embodiment of the present invention;
Fig. 4 shows a schematic block diagram of the structure of a group division module according to an embodiment of the present invention;
Fig. 5 shows a schematic block diagram of the structure of a device for pattern recognition and learning according to an embodiment of the present invention;
Fig. 6 shows a specific implementation of the data processing method according to an embodiment of the present invention;
Fig. 7 shows a plot of variance against number of clusters according to an embodiment of the present invention; and
Fig. 8 shows a block diagram of an example hardware arrangement of the devices shown in Fig. 3 and Fig. 5 according to an embodiment of the present invention.
Detailed description
Preferred embodiments of the present invention are described in detail below with reference to the drawings. In the drawings, the same reference numerals denote the same or similar components even when they appear in different figures. For clarity and conciseness, detailed descriptions of well-known functions and structures are omitted so as not to obscure the subject matter of the present invention.
In applications of sample data, the required data may make up only a small proportion of the whole. For example, in applications such as user risk identification and anomaly identification, the data of normal users accounts for most of the behavioral data collected as samples, while the data of the risky or abnormal users to be identified is often a small proportion. As a result, the negative-class (normal user) data and the positive-class (abnormal/risky user) data in the sample are out of proportion. During model learning, the features of the negative-class data can then heavily mask those of the positive-class data, so that learning accuracy is too low and risky users either cannot be identified or are misidentified as normal users.
To address this, embodiments of the present invention provide, with reference to Figs. 1 to 4, a data processing method, a data identification and learning method, and devices therefor. Here, positive-class data refers to the data that is the target of model learning, and all other data is negative-class data. For example, in the risk/anomaly identification application above, the data of risky users is positive-class data and the data of normal users is negative-class data.
Fig. 1 shows a schematic flowchart of a data processing method according to an embodiment of the present invention. As shown in Fig. 1, the data processing method includes:
Step S110: dividing sample data into multiple groups by clustering;
Step S120: checking whether each of the groups contains positive-class labeled data, and deleting the groups that contain none;
Step S130: determining the total quantity of positive-class labeled data in the groups;
Step S140: determining whether the proportion of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
Step S150: where the proportion exceeds the first predetermined threshold, retaining the data as processed in step S120.
The above scheme uses clustering to make a first pass at finding what the data samples to be processed have in common, removes part of the interfering data, balances the ratio of positive-class to negative-class data, and retains the positive-class and negative-class data that are most similar to each other, thereby accurately obtaining the required data samples.
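Steps S120 to S150 can be sketched as a single pass over samples that have already been clustered. This is a minimal illustration, not the patented implementation: the function name `filter_groups` and its parameters are hypothetical, and it assumes the proportion is measured against the data remaining after deletion, since deleting groups without positive-class data never changes the positive count itself.

```python
from collections import defaultdict


def filter_groups(group_ids, is_positive, threshold):
    """One pass of steps S120-S150 over already-clustered samples.

    group_ids[i]   : group index given to sample i by the clustering step (S110)
    is_positive[i] : True if sample i carries a positive-class label
    threshold      : the first predetermined threshold (value chosen by the caller)
    Returns (kept_indices, ratio, balanced).
    """
    members = defaultdict(list)
    for idx, g in enumerate(group_ids):
        members[g].append(idx)

    # S120: keep only the groups that contain at least one positive-class sample.
    kept = [idx
            for g in sorted(members)
            if any(is_positive[i] for i in members[g])
            for idx in members[g]]

    # S130: total number of positive-class samples in the remaining groups.
    positives = sum(1 for i in kept if is_positive[i])

    # S140: proportion of positive-class samples (assumed: among remaining data).
    ratio = positives / len(kept) if kept else 0.0

    # S150: the caller retains `kept` when the ratio exceeds the threshold.
    return kept, ratio, ratio > threshold


# Three groups; only group 1 contains a positive-class sample (index 3).
groups = [0, 0, 0, 1, 1, 2, 2, 2]
labels = [False, False, False, True, False, False, False, False]
print(filter_groups(groups, labels, threshold=0.3))
```

In this toy input, groups 0 and 2 are deleted, leaving one positive and one negative sample.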
Optionally, in some embodiments, as shown in Fig. 1, the method may further include step S160: where the proportion is less than the first predetermined threshold, steps S110 to S150 are performed again with the data processed in step S120 as the sample data, and the number of groups is increased in the repeated step S110.
Through the loop formed by step S160, interfering data can be removed repeatedly, the ratio of positive-class to negative-class data balanced, and the most similar positive-class and negative-class data retained, achieving higher processing accuracy.
In some examples, a maximum number of groups may also be set. In that case, in step (f), which corresponds to step S160, if the current number of groups is greater than or equal to the maximum, the number of groups is no longer increased, the data from the most recent step (b) is retained, and data processing ends.
In some examples, the clustering may be implemented with the k-means algorithm, with the initial number of groups chosen so that, at that number, the overall change in within-cluster data variance slows down.
It should be noted that the present invention is not limited to data variance; any index that characterizes the rate of change of a quantity may be used to implement this step.
In some examples, step (a) may include:
(a1) randomly selecting multiple samples from the sample data as the center points of the respective groups, the number of selected samples being equal to the number of groups to be formed;
(a2) calculating the distance from every data sample to each center point, and assigning each data sample to the group of its nearest center point;
(a3) calculating the mean of all data samples in each group, and taking the mean as the new center point of that group;
(a4) for each group, determining whether the difference between the new center point and the previous center point exceeds a second predetermined threshold; if it does, performing steps (a2) and (a3) again with the new center point; if it does not, taking the new center point as the optimal cluster center; and
(a5) calculating the distance from every sample to each optimal cluster center, and reassigning each sample to the group of its nearest optimal cluster center.
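Steps (a1) to (a5) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation of the invention: Euclidean distance is assumed, `tol` stands in for the second predetermined threshold, and `max_iter` and `seed` are additions for this sketch that the text does not mention.

```python
import math
import random


def kmeans(samples, k, tol=1e-6, max_iter=100, seed=0):
    """Group `samples` (tuples of floats) into k groups, following (a1)-(a5).

    Returns (assignments, centers), where assignments[i] is the group of sample i.
    """
    rng = random.Random(seed)
    # (a1) pick k samples at random as the initial center points.
    centers = rng.sample(samples, k)

    def nearest(point, current_centers):
        # Index of the center closest to `point` (Euclidean distance assumed).
        return min(range(len(current_centers)),
                   key=lambda c: math.dist(point, current_centers[c]))

    for _ in range(max_iter):
        # (a2) assign every sample to the group of its nearest center.
        assignments = [nearest(p, centers) for p in samples]

        # (a3) the mean of each group becomes its new center.
        new_centers = []
        for c in range(k):
            group = [p for p, a in zip(samples, assignments) if a == c]
            if group:
                new_centers.append(tuple(sum(col) / len(group) for col in zip(*group)))
            else:
                new_centers.append(centers[c])  # empty group: keep the old center

        # (a4) stop once no center moved by more than the threshold.
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break

    # (a5) final assignment against the optimal cluster centers.
    assignments = [nearest(p, centers) for p in samples]
    return assignments, centers


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
assignments, centers = kmeans(points, k=2)
print(assignments)
```

With two well-separated clusters as above, the first three points end up in one group and the last three in the other, whichever samples are drawn as initial centers.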
Fig. 2 shows a schematic flowchart of a method for pattern recognition and learning according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
Step S210: obtaining processed groups of sample data by the data processing method shown in Fig. 1; and
Step S220: performing pattern recognition and learning based on the processed groups of the sample data.
It should be noted that the methods shown in Figs. 1 and 2 are only illustrative and may be modified in any way that falls within the scope of the present invention. For example, although Fig. 1 shows the loop formed by step S160, which increases the number of groups, in some embodiments (for example, where a single pass through steps S110 to S130 already yields satisfactory data) such a loop is not necessary.
Fig. 3 shows a schematic block diagram of a data processing device according to an embodiment of the present invention. As shown in Fig. 3, the device includes:
a group division module 310 for dividing sample data into multiple groups by clustering;
a data checking module 320 for checking whether each of the groups contains positive-class labeled data;
a data deletion module 330 for deleting the groups that contain no positive-class labeled data;
a data quantity determination module 340 for determining the total quantity of positive-class labeled data in the groups;
a data ratio determination module 350 for determining whether the proportion of positive-class labeled data in the sample data exceeds a first predetermined threshold; and
a data retention module 360 for retaining, where the proportion exceeds the first predetermined threshold, the data processed by the data deletion module 330.
In some embodiments, where the proportion is less than the first predetermined threshold, the group division module 310 may further perform group division with the data processed by the data deletion module 330 as the sample data, so that the operations of the data checking module, data deletion module, data quantity determination module, data ratio determination module, and data retention module are repeated. The number of groups is increased in this regrouping of the data processed by the data deletion module 330. Through this operation of the group division module 310, a loop is formed that removes interfering data, balances the ratio of positive-class to negative-class data, and retains the most similar positive-class and negative-class data, so that more complex data situations can be handled and higher processing accuracy achieved.
The device shown in Fig. 3 may further include a maximum group number setting module 370 for setting the maximum number of groups. If the current number of groups is greater than or equal to the maximum, the group division module 310 no longer increases the number of groups when repeating the grouping operation, and the data retention module 360 retains the data most recently processed by the data deletion module.
In some examples, the group division module 310 may implement the clustering with the k-means algorithm, with the initial number of groups chosen so that, at that number, the overall change in within-cluster data variance slows down. As noted above, any index that characterizes the rate of change of a quantity may also be used to implement this step.
In some examples, the group division module 310 may further include multiple submodules. As shown in Fig. 4, the group division module 310 includes:
a center point determination submodule 311 for randomly selecting multiple samples from the sample data as the center points of the respective groups, the number of selected samples being equal to the number of groups to be formed;
a center point distance determination submodule 312 for calculating the distance from every data sample to each center point, and assigning each data sample to the group of its nearest center point;
a center point re-determination submodule 313 for calculating the mean of all data samples in each group, and taking the mean as the new center point of that group;
an optimal cluster center determination submodule 314 for determining, for each group, whether the difference between the new center point and the previous center point exceeds a second predetermined threshold; if it does, the new center point is sent to the center point distance determination submodule 312 so that grouping and center point determination are performed again; if it does not, the new center point is taken as the optimal cluster center; and
a group determination submodule 315 for calculating the distance from every sample to each optimal cluster center, and reassigning each sample to the group of its nearest optimal cluster center.
Fig. 5 shows a schematic block diagram of a device for pattern recognition and learning according to an embodiment of the present invention. As shown in Fig. 5, the device includes:
a data group acquisition module 410 for obtaining processed groups of sample data from the data processing device shown in Fig. 3; and
a recognition and learning module 420 for performing pattern recognition and learning based on the processed groups of the sample data.
It should be noted that the block diagrams of Figs. 3 to 5 are only illustrative, and the devices may take other specific forms. For example, in some implementations, the data generated by each module/submodule of the devices shown in Figs. 3 to 5 may be stored in a storage device (not shown), and other modules/submodules may obtain that data by reading it from the storage device. In that case, the connections shown between the modules/submodules in Figs. 3 to 5 may change. Such changes do not depart from the scope of the illustrated embodiments and should be regarded as falling within it. As another example, although Fig. 3 shows the maximum group number setting module 370, in some embodiments this module is not required.
A specific implementation of the methods and devices shown in Figs. 1 to 5 is described below with reference to Fig. 6. Note that Fig. 6 shows only one specific implementation of the method provided by the embodiments of the present invention and should not be regarded as limiting it. For example, other implementations need not use the k-means algorithm described below, and may instead use a variant of it or any other algorithm or method in the field that can cluster data sharing common characteristics.
The steps shown in Fig. 6 correspond roughly to those shown in Fig. 1. For example, steps A02 and A03 in Fig. 6 may correspond to steps S110 and S120 in Fig. 1 respectively, step A04 may correspond to steps S130 and S140, and steps A05 and A06 may correspond to steps S160 and S150 respectively. This correspondence is neither required nor strict, and the content of specific steps may differ in some variants of the embodiments.
The k-means clustering used in the flow of Fig. 6 differs from the way k-means is usually applied. For example, only the k value is determined before k-means clustering is carried out, and no conventional data cleaning is performed.
In step A01 of Fig. 6, the initial k value of the k-means algorithm is determined. In the technical solution of the present invention, k represents the number of groups (clusters) to be formed.
The k value may be specified manually, determined automatically, or determined by a combination of the two. For example, if the data analyst is very familiar with the application scenario of the data, the number of clusters can be specified manually from prior knowledge. In other cases, the system can determine k automatically. One automatic method is given below.
First, a range from 2 to N is specified. In the illustrative example of the present invention, N is set to 15 for demonstration. Note, however, that N is not limited to 15 and can be any suitable value chosen for the concrete scenario.
Next, the range is traversed in a loop, and the change in within-cluster variance is calculated as the number of clusters increases. In the example of the present invention, a formula [not reproduced in this text] may be used to calculate this change in variance, where u is the mean of the k-th cluster and Xi is each data point in the cluster.
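The formula itself did not survive in this text. One form consistent with the symbols described, offered only as an assumption and not as the formula of the original filing, is the within-cluster sum of squared deviations:

```latex
W(K) = \sum_{k=1}^{K} \sum_{X_i \in C_k} \lVert X_i - u_k \rVert^{2}
```

Here K is the number of clusters being tried, u_k is the mean of the k-th cluster, and C_k (a symbol introduced for this sketch) is the set of data points X_i assigned to that cluster. W(K) decreases as K grows, and the quantity of interest is how quickly it decreases from one K to the next.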
Then, the number of clusters at which the rate of change of the calculated variance slows down is determined. In the scheme of the embodiments, a suitable k corresponds to a turning point: before this point the within-cluster variance drops sharply, and after it the decrease slows. This step can be performed by referring to the plot of variance against number of clusters shown in Fig. 7, or by comparing the rate of change of the variance with a threshold. In the scheme combining manual and automatic specification, an experienced data analyst can, for example, refer to the result shown in Fig. 7 and choose 4 or 5 as the value of k. In the fully automatic scheme, the system can, for example, calculate the rate of change of the variance and compare it with a threshold to find the point at which the change slows; this is not described further here. In a preferred embodiment, a smaller value may be chosen as the initial k; in the example of Fig. 6, the initial k may be chosen as 4.
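The threshold comparison described above can be sketched as follows. The text does not specify how the threshold is defined, so this sketch assumes one possibility: a drop in variance counts as slow when it falls below a fraction (`slow_ratio`, a hypothetical parameter) of the largest drop observed. The function name and the variance figures are illustrative only and are not taken from Fig. 7.

```python
def pick_initial_k(variance_by_k, slow_ratio=0.2):
    """Return the first k after which the variance stops dropping quickly.

    variance_by_k : dict mapping each candidate k (consecutive integers, at
                    least two of them) to the within-cluster variance at that k
    slow_ratio    : assumed threshold; the drop from k to k+1 counts as slow
                    when it is below this fraction of the largest drop
    """
    ks = sorted(variance_by_k)
    # Decrease in variance when going from each k to the next one.
    drops = [variance_by_k[a] - variance_by_k[b] for a, b in zip(ks, ks[1:])]
    largest = max(drops)
    for k, drop in zip(ks, drops):
        if drop < slow_ratio * largest:
            return k  # turning point: sharp drops before k, slow drops after
    return ks[-1]     # no slowdown found within the range


# Made-up variances for k = 2..7: sharp drops up to k = 4, then a plateau.
variances = {2: 100.0, 3: 60.0, 4: 30.0, 5: 26.0, 6: 24.0, 7: 23.0}
print(pick_initial_k(variances))
```

For these made-up values the drops are 40, 30, 4, 2, and 1, so the function returns 4, the first k whose following drop is below 20% of the largest one.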
Note that the initial k value may be redetermined each time the technical solution of the present invention is carried out with the k-means algorithm. However, when data for the same scenario is processed repeatedly, the data analyst may, based on prior knowledge, reuse the same initial k across runs without performing the determination of step A01.
In step A02 of Fig. 6, the sample data is divided into multiple groups (clusters) using the k-means algorithm. The specific steps are as follows:
1) Assign the number of groups obtained in step A01 to a variable K, and randomly select K data samples as center points.
2) Calculate the distance from every data sample to the K center points, and assign each data sample to the group of its nearest center point.
3) Average all data samples in each group, and take the result as the new center point of that group.
4) For each group, evaluate the difference between the new center point and the previous one. If the difference is too large (for example, the distance exceeds a predetermined threshold), return to steps 2) and 3) and continue iterating; if the difference is small (for example, below the predetermined threshold), stop iterating and take the new center point as the optimal cluster center.
5) Using the optimal cluster centers from step 4), (optionally) number each optimal cluster center, calculate the distance from every sample to each optimal cluster center, and reassign (classify) each sample to the group of its nearest optimal cluster center.
After the sample data has been divided into groups in step A02, step A03 checks whether each group contains positive-class data. If a group contains none, all data in that group is deleted from the data set; if it contains positive-class data, the data in the group is retained. A state variable f may be created here: f = true if any group was deleted, and f = false if no group was deleted.
Then, in step A04, it is checked whether the quantities of positive-class and negative-class data in the sample meet the requirement, for example whether their ratio is balanced. A threshold parameter a can be set, and the ratio of the number G of positive-class samples to the total number M of data samples is compared with it to decide whether the ratio is balanced. For example:
If G/M ≥ a, the ratio is balanced, and step A06 is performed.
If G/M < a, the ratio is not balanced, and steps A05->A02->A03->A04 are performed; in each cycle, the value of the state variable f may optionally be updated in step A03.
In step A05, the parameter k of the k-means algorithm is adjusted so that the data is split more finely, which allows the balance between positive-class and negative-class data to be adjusted better. Specifically, the value of k is increased each time this step is performed.
Preferably, a maximum value x may be set for the parameter k in this step. The larger x is, the more groups there can be and the fewer samples each group contains, so the data is more dispersed, which appears in the data characteristics as tighter groups. However, an excessively large x makes the groups too tight and their independence too weak. The value of x should therefore balance the tightness of the groups against their independence. With x set, one implementation of step A05 is as follows:
IF (f = true), then k = k, continue with steps A02->A03->A04,
IF (f = false and k < x), then k = k + 1, continue with steps A02->A03->A04,
IF (f = false and k >= x), then go to step A06.
It should be noted that the above pseudocode is only an example illustrating one specific
implementation of the technical solution of the present invention. Different pseudocode may be used
in other specific examples. For example, in some implementations, the operation of increasing the
number of marshallings is still performed in step A05 even if a marshalling has been deleted in step
A03. In that case, the condition on the state variable f in the above pseudocode is unnecessary: as
long as k has not reached the maximum value x, its value can be increased. A modified example is as
follows:
IF (k < x), THEN k = k + 1; continue with steps A02 -> A03 -> A04,
IF (k >= x), THEN perform step A06.
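The modified example can likewise be sketched in Python. The function name `step_a05_simplified` and the returned step labels are hypothetical and serve only to illustrate the two branches.

```python
from typing import Tuple


def step_a05_simplified(k: int, x: int) -> Tuple[int, str]:
    """Variant of step A05 that ignores the state variable f."""
    if k < x:
        # k has not reached the maximum x: increase it and repeat A02 -> A03 -> A04.
        return k + 1, "A02"
    # k has reached the maximum x: go to step A06.
    return k, "A06"
```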
In step A06, after the data adjustment and deletion of steps A01 to A04 (and, in some cases, A05),
sample data are obtained in which the positive class and the negative class (anti-class) are more
balanced than in the initial data. Pattern recognition and learning can be performed on this sample
data.
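The overall flow (cluster, delete marshallings without positive class data, check the ratio, increase k up to the maximum x) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present invention: it follows the simplified variant without the state variable f, it takes G as the number of remaining positive samples and M as the number of all remaining samples, and the names `balance_samples`, `cluster_fn`, and `equal_width_bins` are hypothetical. `equal_width_bins` is only a one-dimensional stand-in for the k-means clustering, used so that the example runs without extra libraries.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], bool]  # (feature vector, is_positive)
ClusterFn = Callable[[List[Sequence[float]], int], List[int]]


def balance_samples(samples: List[Sample], k: int, a: float, x: int,
                    cluster_fn: ClusterFn) -> List[Sample]:
    """Delete marshallings without positive samples until G/M >= a or k reaches x."""
    data = list(samples)
    while data:
        # Cluster the current data into k marshallings (one index per sample).
        groups = cluster_fn([features for features, _ in data], k)
        # Keep only marshallings that contain at least one positive sample.
        keep = {g for g, (_, positive) in zip(groups, data) if positive}
        data = [s for g, s in zip(groups, data) if g in keep]
        # G = positive samples left, M = all samples left (assumed meaning).
        g_count = sum(1 for _, positive in data if positive)
        if not data or g_count / len(data) >= a or k >= x:
            break  # corresponds to reaching step A06
        k += 1  # corresponds to step A05 in the simplified variant
    return data


def equal_width_bins(points: List[Sequence[float]], k: int) -> List[int]:
    """Hypothetical 1-D stand-in for k-means: bin samples by their first feature."""
    lo = min(p[0] for p in points)
    width = (max(p[0] for p in points) - lo) / k or 1.0
    return [min(int((p[0] - lo) / width), k - 1) for p in points]
```

Passing the clustering routine as a parameter keeps the balancing loop independent of how the marshallings are produced, so a real k-means implementation could be substituted for the stand-in.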
In the technical solution of the embodiments of the present invention described above with reference
to Fig. 1 to Fig. 7, part of the negative class (anti-class) data is rejected. This improves the
balance between positive and negative class data in the remaining sample data, and it also removes
negative class data whose attributes differ greatly from those of the positive class. Because the
rejected data are not the target of model learning, the resulting reduction in sample size has almost
no effect on subsequent model learning, so the data characteristics are largely preserved. Performing
model learning on samples whose positive and negative class attributes differ relatively little
strengthens the recognition capability of the learned model and thereby improves recognition accuracy.
Fig. 8 is a block diagram showing an example hardware arrangement of the device shown in Fig. 3 or
Fig. 5 according to an embodiment of the present disclosure. The hardware arrangement includes a
processor 506 (for example, a microprocessor (μP), a digital signal processor (DSP), etc.). The
processor 506 may be a single processing unit or multiple processing units for performing the
different actions of the flows described herein. The arrangement may also include an input unit 502
for receiving signals from other entities and an output unit 504 for providing signals to other
entities. The input unit 502 and the output unit 504 may be arranged as a single entity or as
separate entities.
In addition, the arrangement may include at least one readable storage medium 508 in the form of
non-volatile or volatile memory, for example an electrically erasable programmable read-only memory
(EEPROM), a flash memory, and/or a hard disk drive. The readable storage medium 508 includes a
computer program 510, which includes code/computer-readable instructions that, when executed by the
processor 506 in the arrangement, cause the hardware arrangement and/or a device including the
hardware arrangement to perform, for example, the flows described above in conjunction with Fig. 1/
Fig. 2 and any variants thereof.
When the device shown in Fig. 3 is implemented, the computer program 510 may be configured as computer
program code having, for example, an architecture of computer program modules 510A to 510E. Thus, in
an example embodiment in which the hardware arrangement is used in, for example, a device, the code
in the computer program of the arrangement includes: module 510A, for dividing sample data into
multiple marshallings by clustering; module 510B, for checking whether positive class mark data
exists in each of the multiple marshallings, and deleting marshallings that contain no positive class
mark data; module 510C, for determining the total quantity of positive class mark data in the
multiple marshallings; module 510D, for determining whether the proportion of the total quantity of
positive class mark data in the sample data is greater than a first predetermined threshold; and
module 510E, for retaining the data processed by module 510B when the proportion is greater than the
first predetermined threshold.
When the device shown in Fig. 5 is implemented, the computer program 510 may be configured as computer
program code having, for example, an architecture of only computer program modules 510A and 510B.
Thus, in an example embodiment in which the hardware arrangement is used in, for example, a device,
the code in the computer program of the arrangement includes: module 510A, for obtaining the
processed multiple marshallings of sample data based on the processing of the device shown in Fig. 3;
and module 510B, for performing pattern recognition and learning based on the processed multiple
marshallings of the sample data.
The computer program modules can perform substantially every action in the flow shown in Fig. 1 or
Fig. 2, so as to emulate the device shown in Fig. 3 or Fig. 5. In other words, when the different
computer program modules are executed in the processor 506, they may correspond to the different
units described above in the device shown in Fig. 3 or Fig. 5.
Although the code means in the embodiments disclosed above in conjunction with Fig. 8 are implemented
as computer program modules that, when executed in the processor 506, cause the hardware arrangement
to perform the actions described above in conjunction with Fig. 1 or Fig. 2, in alternative
embodiments at least one of the code means may be implemented at least partly as a hardware circuit.
The processor may be a single CPU (central processing unit), but may also include two or more
processing units. For example, the processor may include a general-purpose microprocessor, an
instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (for
example, an application-specific integrated circuit (ASIC)). The processor may also include on-board
memory for caching purposes.
The computer program may be carried by a computer program product connected to the processor. The
computer program product may include a computer-readable medium on which the computer program is
stored. For example, the computer program product may be a flash memory, a random access memory
(RAM), a read-only memory (ROM), or an EEPROM, and in alternative embodiments the computer program
modules described above may be distributed across different computer program products in the form of
memories within the UE.
It should be noted that the technical solutions described in the embodiments of the present invention
may be combined arbitrarily, provided that they do not conflict.
In the several embodiments provided by the present invention, it should be understood that the
disclosed method and device may be implemented in other ways. The device embodiments described above
are merely illustrative. For example, the division into units is only a division by logical function,
and other divisions are possible in actual implementation: multiple units or components may be
combined or integrated into another system, or some features may be omitted or not performed. In
addition, the coupling, direct coupling, or communication connection between the components shown or
discussed may be an indirect coupling or communication connection through some interfaces, devices,
or units, and may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and the
components shown as units may or may not be physical units; that is, they may be located in one place
or distributed over multiple network units. Some or all of the units may be selected according to
actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated
into a second processing unit, each unit may serve individually as a unit, or two or more units may
be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in
the form of hardware plus software functional units.
The above description is only intended to implement embodiments of the present invention. Those
skilled in the art should understand that any modification or partial replacement that does not
depart from the scope of the present invention shall fall within the scope defined by the claims of
the present invention. Therefore, the protection scope of the present invention shall be subject to
the protection scope of the claims.
Claims (16)
1. A data processing method, comprising:
(a) dividing sample data into multiple marshallings by clustering;
(b) checking whether positive class mark data exists in each of the multiple marshallings, and
deleting marshallings that contain no positive class mark data;
(c) determining the total quantity of positive class mark data in the multiple marshallings;
(d) determining whether the proportion of the total quantity of positive class mark data in the
sample data is greater than a first predetermined threshold; and
(e) when the proportion is greater than the first predetermined threshold, retaining the data
processed in step (b).
2. The data processing method according to claim 1, further comprising:
(f) when the proportion is less than the first predetermined threshold, performing steps (a) to (e)
with the data processed in step (b) as the sample data, wherein the number of marshallings is
increased in the repeated step (a).
3. The data processing method according to claim 2, further comprising:
setting a maximum number of marshallings;
in step (f), if the current number of marshallings is greater than or equal to the maximum number,
no longer increasing the number of marshallings, retaining the data processed in the most recent
step (b), and ending the data processing.
4. The data processing method according to any one of claims 1 to 3, wherein the clustering is
implemented using the k-means algorithm,
wherein the initial number of marshallings is determined by the k-means algorithm such that, at the
initial number, the overall change in the data variance within each cluster slows down.
5. The data processing method according to claim 4, wherein step (a) comprises:
(a1) randomly selecting multiple data samples from the sample data as the respective central points
of the multiple marshallings, the number of selected data samples being equal to the number of
marshallings to be divided;
(a2) calculating the distance from every data sample to each central point, and assigning each data
sample to the marshalling of its nearest central point;
(a3) calculating the average of all data samples in each marshalling, and taking the average as the
new central point of that marshalling;
(a4) for each marshalling, judging whether the difference between the new central point and the
previous central point is greater than a second predetermined threshold; if the difference is greater
than the second predetermined threshold, performing steps (a2) and (a3) using the new central point;
if the difference is not greater than the second predetermined threshold, determining the new central
point as the optimal cluster centre; and
(a5) calculating the distance from every sample data item to each optimal cluster centre, and
reassigning each sample data item to the marshalling of its nearest optimal cluster centre.
6. A method for pattern recognition and learning, comprising:
obtaining processed multiple marshallings of sample data based on the data processing method according
to any one of claims 1 to 5; and
performing pattern recognition and learning based on the processed multiple marshallings of the
sample data.
7. A data processing device, comprising:
a marshalling division module, for dividing sample data into multiple marshallings by clustering;
a data checking module, for checking whether positive class mark data exists in each of the multiple
marshallings;
a data deletion module, for deleting marshallings that contain no positive class mark data;
a data quantity determination module, for determining the total quantity of positive class mark data
in the multiple marshallings;
a data proportion determination module, for determining whether the proportion of the total quantity
of positive class mark data in the sample data is greater than a first predetermined threshold; and
a data retention module, for retaining the data processed by the data deletion module when the
proportion is greater than the first predetermined threshold.
8. The data processing device according to claim 7, wherein the marshalling division module is
further configured to, when the proportion is less than the first predetermined threshold, perform
marshalling division with the data processed by the data deletion module as the sample data, so that
the operations of the data checking module, the data deletion module, the data quantity determination
module, the data proportion determination module, and the data retention module are repeated, wherein
the number of marshallings is increased in the marshalling operation on the data processed by the
data deletion module.
9. The data processing device according to claim 8, further comprising:
a maximum marshalling number setting module, for setting a maximum number of marshallings;
wherein, if the current number of marshallings is greater than or equal to the maximum number, the
marshalling division module no longer increases the number of marshallings when repeating the data
marshalling operation, and the data retention module retains the data most recently processed by the
data deletion module.
10. The data processing device according to any one of claims 7 to 9, wherein the marshalling
division module implements the clustering using the k-means algorithm,
wherein the marshalling division module determines the initial number of marshallings by the k-means
algorithm such that, at the initial number, the overall change in the data variance within each
cluster slows down.
11. The data processing device according to claim 8, wherein the marshalling division module further
comprises:
a central point determination sub-module, for randomly selecting multiple data samples from the
sample data as the respective central points of the multiple marshallings, the number of selected
data samples being equal to the number of marshallings to be divided;
a central point distance determination sub-module, for calculating the distance from every data
sample to each central point, and assigning each data sample to the marshalling of its nearest
central point;
a central point re-determination sub-module, for calculating the average of all data samples in each
marshalling, and taking the average as the new central point of that marshalling;
an optimal cluster centre determination sub-module, for judging, for each marshalling, whether the
difference between the new central point and the previous central point is greater than a second
predetermined threshold; if the difference is greater than the second predetermined threshold,
sending the new central point to the central point distance determination sub-module so that the data
marshalling and the determination of a new central point are performed again; if the difference is
not greater than the second predetermined threshold, determining the new central point as the optimal
cluster centre; and
a marshalling determination sub-module, for calculating the distance from every sample data item to
each optimal cluster centre, and reassigning each sample data item to the marshalling of its nearest
optimal cluster centre.
12. A device for pattern recognition and learning, comprising:
a data marshalling acquisition module, for obtaining processed multiple marshallings of sample data
from the data processing device according to any one of claims 7 to 11; and
a recognition and learning sub-module, for performing pattern recognition and learning based on the
processed multiple marshallings of the sample data.
13. A data processing device, comprising:
a memory, for storing executable instructions; and
a processor, for executing the executable instructions stored in the memory, so as to perform the
method according to any one of claims 1 to 5.
14. A device for pattern recognition and learning, comprising:
a memory, for storing executable instructions; and
a processor, for executing the executable instructions stored in the memory, so as to perform the
method according to claim 11.
15. A memory device carrying a computer program thereon, wherein, when the computer program is
executed by a processor, the computer program causes the processor to perform the method according to
any one of claims 1 to 5.
16. A memory device carrying a computer program thereon, wherein, when the computer program is
executed by a processor, the computer program causes the processor to perform the method according to
claim 11.
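As an illustration of steps (a1) to (a5) recited in claim 5, a minimal Python sketch follows. It is not the implementation of the present invention. The names `kmeans` and `nearest` are hypothetical, Euclidean distance is assumed for "distance", `tol` plays the role of the second predetermined threshold, and `seed` and `max_iter` are additions made only so that the sketch is reproducible and always terminates.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def nearest(p: Point, centres: List[Point]) -> int:
    """Index of the centre closest to p (Euclidean distance, an assumption)."""
    return min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))


def kmeans(points: List[Point], k: int, tol: float = 1e-6,
           seed: int = 0, max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    rng = random.Random(seed)
    # (a1) choose k samples at random as the initial central points
    centres: List[Point] = rng.sample(points, k)
    for _ in range(max_iter):  # max_iter is a safety cap, not part of the claim
        # (a2) assign every sample to its nearest central point
        labels = [nearest(p, centres) for p in points]
        # (a3) the mean of each marshalling becomes its new central point
        new_centres: List[Point] = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centres.append(
                [sum(c) / len(members) for c in zip(*members)] if members else centres[j]
            )
        # (a4) stop once no central point moved by more than tol
        moved = max(math.dist(c, n) for c, n in zip(centres, new_centres))
        centres = new_centres
        if moved <= tol:
            break
    # (a5) final assignment to the nearest optimal cluster centre
    return [nearest(p, centres) for p in points], centres
```

A marshalling that ends up empty keeps its previous central point in this sketch; the claim does not say how that case is handled, so this is a design choice of the example only.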
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611112409.8A CN108154163B (en) | 2016-12-06 | 2016-12-06 | Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108154163A true CN108154163A (en) | 2018-06-12 |
CN108154163B CN108154163B (en) | 2020-11-24 |
Family
ID=62468532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611112409.8A Active CN108154163B (en) | 2016-12-06 | 2016-12-06 | Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108154163B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147081A (en) * | 2018-09-03 | 2019-01-04 | 深圳市智物联网络有限公司 | A kind of equipment operation stability analysis method and system |
CN109447103A (en) * | 2018-09-07 | 2019-03-08 | 平安科技(深圳)有限公司 | A kind of big data classification method, device and equipment based on hard clustering algorithm |
CN109495291A (en) * | 2018-09-30 | 2019-03-19 | 阿里巴巴集团控股有限公司 | Call abnormal localization method, device and server |
CN110427358A (en) * | 2019-02-22 | 2019-11-08 | 北京沃东天骏信息技术有限公司 | Data cleaning method and device and information recommendation method and device |
CN110579708A (en) * | 2019-08-29 | 2019-12-17 | 爱驰汽车有限公司 | Battery capacity identification method and device, computing equipment and computer storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102999615A (en) * | 2012-11-29 | 2013-03-27 | 合肥工业大学 | Diversified image marking and retrieving method based on radial basis function neural network |
CN103488623A (en) * | 2013-09-04 | 2014-01-01 | 中国科学院计算技术研究所 | Multilingual text data sorting treatment method |
CN105868775A (en) * | 2016-03-23 | 2016-08-17 | 深圳市颐通科技有限公司 | Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm |
CN105912726A (en) * | 2016-05-13 | 2016-08-31 | 北京邮电大学 | Density centrality based sampling and detecting methods of unusual transaction data of virtual assets |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147081A (en) * | 2018-09-03 | 2019-01-04 | 深圳市智物联网络有限公司 | A kind of equipment operation stability analysis method and system |
CN109147081B (en) * | 2018-09-03 | 2021-02-26 | 深圳市智物联网络有限公司 | Equipment operation stability analysis method and system |
CN109447103A (en) * | 2018-09-07 | 2019-03-08 | 平安科技(深圳)有限公司 | A kind of big data classification method, device and equipment based on hard clustering algorithm |
CN109447103B (en) * | 2018-09-07 | 2023-09-29 | 平安科技(深圳)有限公司 | Big data classification method, device and equipment based on hard clustering algorithm |
CN109495291A (en) * | 2018-09-30 | 2019-03-19 | 阿里巴巴集团控股有限公司 | Call abnormal localization method, device and server |
CN109495291B (en) * | 2018-09-30 | 2021-11-16 | 创新先进技术有限公司 | Calling abnormity positioning method and device and server |
CN110427358A (en) * | 2019-02-22 | 2019-11-08 | 北京沃东天骏信息技术有限公司 | Data cleaning method and device and information recommendation method and device |
CN110427358B (en) * | 2019-02-22 | 2021-04-30 | 北京沃东天骏信息技术有限公司 | Data cleaning method and device and information recommendation method and device |
CN110579708A (en) * | 2019-08-29 | 2019-12-17 | 爱驰汽车有限公司 | Battery capacity identification method and device, computing equipment and computer storage medium |
CN110579708B (en) * | 2019-08-29 | 2021-10-22 | 爱驰汽车有限公司 | Battery capacity identification method and device, computing equipment and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108154163B (en) | 2020-11-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||