CN108154163B - Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium - Google Patents


Info

Publication number
CN108154163B
Authority
CN
China
Prior art keywords
data
module
grouping
sample
groups
Prior art date
Legal status
Active
Application number
CN201611112409.8A
Other languages
Chinese (zh)
Other versions
CN108154163A (en)
Inventor
闫强
李爱华
王晓
葛胜利
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201611112409.8A priority Critical patent/CN108154163B/en
Publication of CN108154163A publication Critical patent/CN108154163A/en
Application granted granted Critical
Publication of CN108154163B publication Critical patent/CN108154163B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data processing method, a data recognition and learning method, and apparatuses thereof. The data processing method comprises the following steps: (a) dividing sample data into a plurality of groupings by clustering; (b) checking each of the plurality of groupings for the presence of positive-class identification data and deleting groupings that contain none; (c) determining the total amount of positive-class identification data in the plurality of groupings; (d) determining whether the proportion of this total amount in the sample data is greater than a predetermined threshold; and (e) if the proportion is greater than the predetermined threshold, retaining the data processed in step (b). The scheme can obtain the data required for model learning with high accuracy.

Description

Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium
Technical Field
The invention relates to the field of data processing, in particular to a data processing method, a data identification and learning method, a device and a computer readable medium thereof.
Background
In data mining and machine learning, classification and recognition by supervised learning models depend heavily on the identification labels of the samples. In practice, however, the magnitudes of the positive class and the negative class in this identification data are often not balanced. Therefore, if sample data is not preprocessed and simple model recognition is performed directly, problems such as reduced accuracy may result.
Existing data preprocessing includes outlier processing, undersampling, and oversampling, but these techniques suffer from various problems. Outlier processing specially treats data points that deviate from the overall distribution or concentration trend, which may cause valid data to be deleted by mistake — especially in cases such as risky users, whose feature data necessarily appears as outliers. Undersampling and oversampling can process data at the level of each class, but they cannot prevent the features of one class from covering the features of the other, and they break the randomness of sampling.
Accordingly, there is a need for a data processing method and apparatus for preprocessing data to address at least some of the above-mentioned problems.
Disclosure of Invention
In order to solve at least some of the above-described problems, embodiments of the present invention provide a data processing method, a data recognition and learning method, and an apparatus thereof, which obtain desired data with high accuracy.
According to an aspect of the present invention, there is provided a data processing method including:
(a) dividing sample data into a plurality of groups in a clustering mode;
(b) checking each of the plurality of groupings for the presence of positive class identification data and deleting groupings not containing positive class identification data;
(c) determining a total amount of positive class identification data in the plurality of groupings;
(d) determining whether the proportion of the total amount of the positive-class identification data in the sample data is greater than a first predetermined threshold; and
(e) if the proportion is greater than the first predetermined threshold, retaining the data processed in step (b).
According to another aspect of the present invention, there is provided a method for pattern recognition and learning, comprising: obtaining a plurality of processed groups of sample data based on the data processing method; and performing pattern recognition and learning based on the processed plurality of groupings of the sample data.
According to another aspect of the present invention, there is provided a data processing apparatus including:
the grouping dividing module is used for dividing the sample data into a plurality of groups in a clustering mode;
a data checking module for checking whether positive type identification data exists in each of the plurality of groups;
the data deleting module is used for deleting the grouping which does not contain the positive type identification data;
a data quantity determination module for determining a total quantity of positive type identification data in the plurality of groupings;
the data proportion determining module is used for determining whether the proportion of the total amount of the positive identification data in the sample data is greater than a first preset threshold value; and
a data retaining module for retaining the data processed by the data deleting module if the proportion is greater than the first predetermined threshold.
According to another aspect of the present invention, there is provided an apparatus for pattern recognition and learning, including:
a data grouping acquisition module for acquiring a plurality of processed groupings of sample data from the data processing apparatus; and
a recognition and learning module for performing pattern recognition and learning based on the processed plurality of groupings of the sample data.
According to another aspect of the present invention, there is provided a data processing apparatus including:
a memory for storing executable instructions; and
a processor for executing the executable instructions stored in the memory so as to perform the above data processing method.
According to another aspect of the present invention, there is provided an apparatus for pattern recognition and learning, including:
a memory for storing executable instructions; and
a processor for executing executable instructions stored in the memory to perform the pattern recognition and learning methods described above.
According to another aspect of the present invention, there is provided a computer readable medium having embodied thereon a computer program, which when executed by a processor, causes the processor to execute the above-mentioned data processing method.
According to another aspect of the present invention, there is provided a computer readable medium having embodied thereon a computer program, which when executed by a processor, causes the processor to perform the pattern recognition and learning method described above.
The scheme can find the commonality of the data samples to be processed by clustering, eliminate part of the interference data, balance the proportion of the forward-class and reverse-class data, and retain the forward-class and reverse-class data with higher similarity, thereby obtaining the required data samples with high accuracy. Performing model learning on such data samples can greatly improve the accuracy of subsequent model learning.
Drawings
The above features and advantages of the present invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a simplified flow diagram of a data processing method according to an embodiment of the invention;
FIG. 2 illustrates a simplified flow diagram of a method for pattern recognition and learning according to an embodiment of the present invention;
FIG. 3 shows a schematic block diagram of the structure of a data processing apparatus according to an embodiment of the present invention;
FIG. 4 illustrates a schematic block diagram of the structure of a group partitioning module according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram illustrating the structure of an apparatus for pattern recognition and learning according to an embodiment of the present invention;
FIG. 6 illustrates a particular implementation of a data processing method according to an embodiment of the invention;
FIG. 7 shows a graph of variance versus number of clusters according to an embodiment of the invention; and
fig. 8 shows a block diagram of an example hardware arrangement of the apparatus shown in fig. 3/5, according to an embodiment of the invention.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to designate the same or similar components, although they are shown in different drawings. For the purposes of clarity and conciseness, a detailed description of known functions and configurations incorporated herein will be omitted to avoid making the subject matter of the present invention unclear.
In applications of sample data, the required data may account for only a small percentage of the whole. For example, in risk identification and anomaly identification for some users, the behavior data collected as samples is dominated by the data of normal users, while the data of the risky/abnormal users to be identified often makes up only a small proportion. This creates a disproportion between the reverse-class (normal-user) data and the forward-class (abnormal/risky-user) data in the data sample. When model learning is performed, the features of the reverse-class data may then heavily cover those of the forward-class data, making the accuracy of model learning too low, so that risky users cannot be identified or are erroneously identified as normal users.
To this end, embodiments of the present invention provide a data processing method, a data recognition and learning method, and apparatuses thereof, described with reference to figs. 1 to 4. Here, forward-class data refers to data directed to the target of model learning, and all other data is reverse-class data. For example, in the risk identification/anomaly identification application described above, the data of the risky users is the forward-class data, and the data of the normal users is the reverse-class data.
Fig. 1 shows a schematic flow diagram of a data processing method according to an embodiment of the invention. As shown in fig. 1, the data processing method includes:
step S110, dividing the sample data into a plurality of groups in a clustering mode;
step S120, checking whether each group in the plurality of groups has positive type identification data, and deleting the group which does not contain the positive type identification data;
step S130, determining the total amount of the positive type identification data in the plurality of groups;
step S140, determining whether the proportion of the total amount of the positive-class identification data in the sample data is greater than a first predetermined threshold; and
step S150, if the proportion is greater than the first predetermined threshold, retaining the data processed in step S120.
The scheme can preliminarily find the commonality of the data samples to be processed by clustering, eliminate part of the interference data, balance the proportion of the forward-class and reverse-class data, and retain the forward-class and reverse-class data with higher similarity, thereby obtaining the required data samples with high accuracy.
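As an illustration, a single pass of steps S110–S150 can be sketched in Python. This is a minimal sketch, not the patented implementation: it assumes scikit-learn's KMeans as the clustering step and a label array in which 1 marks positive-class identification data.

```python
import numpy as np
from sklearn.cluster import KMeans

def process_once(X, y, k, threshold, seed=0):
    """One illustrative pass of S110-S150. y == 1 marks positive-class data."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)  # S110
    # S120: keep only groupings that contain at least one positive-class sample
    keep = np.array([np.any(y[labels == c] == 1) for c in range(k)])
    mask = keep[labels]
    total_pos = int((y[mask] == 1).sum())                                       # S130
    balanced = bool(total_pos / int(mask.sum()) > threshold)                    # S140
    return X[mask], y[mask], balanced                                           # S150

# Toy sample: 40 normal points near the origin, plus a small separate cluster
# of 8 points holding 7 positives -- the positive class is a minority overall.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (8, 2))])
y = np.array([0] * 40 + [0] + [1] * 7)
Xk, yk, ok = process_once(X, y, k=2, threshold=0.5)
```

Here the grouping of 40 negatives contains no positive-class data and is deleted, so the retained sample is dominated by positives and the proportion check passes.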
Optionally, in some embodiments, as shown in fig. 1, the method may further include a step S160: if the proportion is smaller than the first predetermined threshold, steps S110 to S150 are repeated with the data processed in step S120 taken as the new sample data, the number of groupings being increased in the repeated step S110.
Through the loop formed by step S160, interference data can be eliminated repeatedly, the proportion of forward-class and reverse-class data balanced, and the forward-class and reverse-class data with higher similarity retained, achieving higher processing accuracy.
In some examples, a maximum value for the number of groupings may also be set. In this case, if the current number of groupings is greater than or equal to the maximum value, the number of groupings is no longer increased; the data processed in the most recent execution of step (b) is retained, and the data processing ends.
In some examples, the clustering may be implemented using the k-means algorithm, where an initial number of groupings is determined such that, at the initial number, the overall decrease of the within-cluster data variance begins to slow.
It should be noted that the present invention is not limited to the use of the data variance; any index that can characterize how fast the within-cluster variation changes can be used to implement this step.
In some examples, step (a) may comprise:
(a1) randomly extracting a plurality of sample data in the sample data to respectively serve as the central point of each group in the plurality of groups, wherein the number of the plurality of sample data is equal to that of the plurality of groups to be divided;
(a2) calculating the distance from all the data samples to each central point, and dividing each data sample into a group in which the central point closest to the data sample is positioned;
(a3) calculating the average value of all data samples in each grouping, and taking the average value as a new central point in each grouping;
(a4) determining, for each grouping, whether a difference between the new center point and the previous center point is greater than a second predetermined threshold, and if the difference between the center points is greater than the second predetermined threshold, performing steps (a2) and (a3) using the new center point, and if the difference between the center points is not greater than the second predetermined threshold, determining the new center as an optimal cluster center; and
(a5) calculating the distance from all sample data to each optimal cluster center point, and re-dividing each sample datum into the grouping of the optimal cluster center point closest to it.
FIG. 2 illustrates a simplified flow diagram of a method for pattern recognition and learning, according to an embodiment of the present invention. As shown in fig. 2, the method includes:
step S210, obtaining a plurality of processed groups of sample data according to the data processing method shown in fig. 1; and
step S220, pattern recognition and learning is performed based on the processed plurality of groupings of sample data.
It is noted that the methods shown in figs. 1 and 2 are only schematic; any modification of them may be made without departing from the scope of the present invention. For example, while fig. 1 shows a loop formed by step S160 that increases the number of groupings, in some embodiments (e.g., where a single pass of steps S110 to S150 has already obtained satisfactory data) such a loop is not necessary.
Fig. 3 shows a schematic block diagram of a data processing device according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes:
a grouping dividing module 310, configured to divide the sample data into a plurality of groupings in a clustering manner;
a data checking module 320 for checking whether positive type identification data exists in each of the plurality of groups;
a data deleting module 330 for deleting the grouping that does not contain the positive type identification data;
a data quantity determination module 340 for determining a total quantity of positive class identification data in the plurality of groupings;
a data proportion determining module 350, configured to determine whether a proportion of the total amount of the positive type identification data in the sample data is greater than a first predetermined threshold; and
a data retaining module 360, configured to retain the data processed by the data deleting module 330 if the proportion is greater than the first predetermined threshold.
In some embodiments, if the proportion is smaller than the first predetermined threshold, the grouping dividing module 310 may further be configured to perform grouping on the data processed by the data deleting module 330, taken as new sample data, so that the operations of the data checking module, the data deleting module, the data quantity determining module, the data proportion determining module, and the data retaining module are repeated, the number of groupings being increased in this repeated grouping operation. Through this operation of the grouping dividing module 310, a loop is formed that eliminates interference data, balances the proportion of forward-class and reverse-class data, and retains the forward-class and reverse-class data with higher similarity, so that more complex data situations can be handled with higher processing accuracy.
The apparatus shown in fig. 3 may further include: a maximum grouping number setting module 370 for setting a maximum value for the number of groupings. If the current number of groupings is greater than or equal to this maximum value, the grouping dividing module 310 no longer increases the number of groupings when the grouping operation is repeated, and the data retaining module 360 retains the data most recently processed by the data deleting module.
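The module decomposition of fig. 3 can be mirrored in code. The sketch below is only one illustrative mapping — the class and method names are hypothetical, and scikit-learn's KMeans stands in for the grouping dividing module:

```python
import numpy as np
from sklearn.cluster import KMeans

class DataProcessor:
    """Illustrative stand-in for the apparatus of fig. 3 (names hypothetical)."""

    def __init__(self, threshold, max_groups):
        self.threshold = threshold    # data proportion determining module 350
        self.max_groups = max_groups  # maximum grouping number setting module 370

    def divide(self, X, k, seed=0):   # grouping dividing module 310
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

    def delete_empty(self, y, labels):  # data checking 320 + data deleting 330
        keep = [c for c in np.unique(labels) if np.any(y[labels == c] == 1)]
        return np.isin(labels, keep)

    def run(self, X, y, k, seed=0):
        while True:
            mask = self.delete_empty(y, self.divide(X, k, seed))
            X, y = X[mask], y[mask]
            ratio = float((y == 1).mean())  # quantity 340 + proportion 350
            if ratio > self.threshold or k >= self.max_groups:
                return X, y                 # data retaining module 360
            k += 1                          # repeated division with more groupings

proc = DataProcessor(threshold=0.5, max_groups=5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (8, 2))])
y = np.array([0] * 40 + [0] + [1] * 7)
Xr, yr = proc.run(X, y, k=2)
```

On this toy data the first pass already deletes the positive-free grouping and reaches balance, so the loop exits immediately.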
In some examples, the grouping dividing module 310 may employ the k-means algorithm to implement the clustering, determining an initial number of groupings by the k-means algorithm such that, at the initial number, the decrease of the within-cluster data variance begins to slow. Of course, as mentioned above, this step may be implemented using any index that characterizes this rate of change.
In some examples, the group partitioning module 310 may also include a plurality of sub-modules. As shown in fig. 4, the grouping division module 310 includes:
a center point determining submodule 311, configured to randomly extract, from the sample data, a plurality of sample data as a center point of each of the plurality of groups, where the number of the plurality of sample data is equal to the number of the plurality of groups to be divided;
a center point distance determining submodule 312, configured to calculate distances from all the data samples to each center point, and divide each data sample into a group in which a center point closest to the data sample is located;
the center point re-determination submodule 313 is used for calculating the average value of all the data samples in each grouping, and taking the average value as a new center point in each grouping;
an optimal cluster center determining sub-module 314 for determining, for each grouping, whether a difference between a new center point and a previous center point is greater than a second predetermined threshold, and if the difference between the center points is greater than the second predetermined threshold, transmitting the new center point to the center point distance determining sub-module 312 to perform data grouping and new center point determination again, and if the difference between the center points is not greater than the second predetermined threshold, determining the new center as an optimal cluster center; and
the grouping determination sub-module 315 is configured to calculate distances from all sample data to each optimal cluster center point, and re-divide each sample data into groups where the optimal cluster center point closest to the sample data is located.
FIG. 5 illustrates a schematic block diagram of an apparatus for pattern recognition and learning, according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes:
a data grouping acquisition module 410 for obtaining a plurality of processed groupings of sample data from the data processing apparatus according to fig. 3; and
a recognition and learning module 420 to perform pattern recognition and learning based on the processed plurality of groupings of sample data.
It should be noted that the structural block diagrams of fig. 3 to 5 are only schematic, and the concrete representation thereof can also be given in other forms. For example, in some specific implementations, the data generated by each module/sub-module in the apparatuses shown in fig. 3 to 5 may also be stored in a certain storage device (not shown), and other modules/sub-modules may obtain various data generated by each module/sub-module by reading the data from the storage device. In this case, the connection scheme between the modules/sub-modules in the apparatus shown in fig. 3 to 5 may vary. Such variations, however, are not beyond the scope of the illustrated embodiments of the present invention and should be considered within the scope of the illustrated embodiments of the present invention. For example, while the maximum grouping setting module 370 is shown in fig. 3, in some embodiments, this module is not necessarily required.
A specific implementation of the method/apparatus shown in fig. 1-5 will be described below with reference to fig. 6. It should be noted that fig. 6 only shows a specific implementation of the method provided by the embodiment of the present invention, and should not be considered as a limitation of the method provided by the embodiment of the present invention. For example, in other implementations, instead of the k-means algorithm described below, other variations thereof or any other algorithm/method that can be used in the art for clustering data having commonalities may be used.
The steps shown in fig. 6 may generally correspond to the steps shown in fig. 1; e.g., steps A02 and A03 in fig. 6 may correspond to steps S110 and S120 in fig. 1, respectively, step A04 in fig. 6 may correspond to steps S130 and S140 in fig. 1, and steps A05 and A06 in fig. 6 may correspond to steps S160 and S150 in fig. 1, respectively. However, this correspondence is not strict or required, and the content of specific steps may differ in some variations of the embodiments of the present invention.
The k-means-based clustering used in the flow of fig. 6 differs from the commonly used k-means procedure. For example, the determination of the k value is performed only once, before the k-means clustering, and no conventional data cleansing is performed.
In step A01 of FIG. 6, an initial k value for the k-means algorithm may be determined. In the technical solution of the present invention, the k value may represent the number of groups (or clusters) to be divided.
The determination of the k value can be performed by manual assignment, automatic assignment, or a combination thereof. For example, if the data processing personnel are familiar with the data application scenario, the number of clusters can be manually specified based on a priori experience. In other cases, however, the value of k may be determined automatically by the system. A method for automatic determination of the k value is provided below.
First, a range from 2 to N is specified. In one exemplary example of the present invention, the value of N may be determined to be 15 for illustration. It should be noted, however, that the value of N is not limited to 15, but may be any suitable value according to a specific implementation scenario.
Next, a loop traversal is performed over the selected range, calculating the change of the within-cluster variance as the number of clusters increases. In one example of the invention, the within-cluster variance may be calculated as

D = Σ_{j=1..k} Σ_{Xi ∈ Cj} (Xi − u_j)²

where u_j is the mean (center) of the j-th of the k classes and Xi denotes the individual data points in that class.
Then, the number of clusters at which the decrease of the calculated variance slows is determined. In the solution of the embodiment of the present invention, the criterion for a suitable k value may be: the suitable number of clusters k corresponds to a turning point before which the within-class variance decreases sharply and after which it decreases slowly. This step may be performed with reference to a graph of variance versus number of clusters as shown in fig. 7, or the rate of the variance decrease may be compared with a threshold to detect where it slows. In a scenario combining manual designation with automatic designation by the system, an experienced data analyst may, for example, select 4 or 5 as the value of k with reference to the results shown in fig. 7; in a fully automatic scheme, the rate of the variance decrease may be computed and compared with a certain threshold to determine where the decrease slows, which is not described in detail here. In a preferred embodiment, a smaller initial k value may be chosen; in the example of fig. 6, the initial k value may be chosen to be 4.
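The elbow criterion above can be illustrated with scikit-learn, whose `inertia_` attribute is exactly the within-cluster sum of squared distances D(k) in the formula. The toy data (four well-separated clusters) and the 10%-of-largest-drop cutoff are assumptions chosen for illustration, not part of the patent:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four well-separated toy clusters; D(k) is the within-cluster variance.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (30, 2))
               for c in ((0, 0), (4, 0), (0, 4), (4, 4))])

ks = list(range(2, 10))
D = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
drops = [D[i] - D[i + 1] for i in range(len(D) - 1)]
# The elbow: first k after which the next drop is small relative to the largest.
elbow = next(k for k, d in zip(ks, drops) if d < 0.1 * max(drops))
```

With four true clusters, D(k) falls sharply up to k = 4 and only slightly afterwards, so the turning point is detected at 4 — the same kind of reading an analyst would take from fig. 7.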
It should be noted that the initial k value may be determined anew each time the technical solution of the present invention is implemented with the k-means algorithm. However, in repeated data processing for the same scenario, a data processor may also, based on prior experience, reuse the same initial k value multiple times without performing the initial k value determination of step A01.
In step A02 of FIG. 6, the k-means algorithm is utilized to divide the sample data into a plurality of groupings (clusters). The method comprises the following specific steps:
1) Assign the number of cluster groupings obtained in step A01 to a variable K, and randomly extract K data samples as center points.
2) The distances of all data samples to the K center points are calculated, and each data sample is classified into a group in which the center point closest to the data sample is located.
3) All data samples in each grouping are averaged to serve as the new center point for that grouping.
4) For each grouping, judge the difference between the new center point and the previous one. If the difference is too large (e.g., the distance is greater than a predetermined threshold), return to steps 2) and 3) and continue iterating; if the difference is small (e.g., not greater than the predetermined threshold), stop iterating and determine the new center point as the optimal cluster center.
5) Based on the optimal cluster centers calculated in step 4) (optionally numbering them), calculate the distance from all sample data to each optimal cluster center, and re-divide (classify) each sample datum into the grouping of the optimal cluster center closest to it.
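Steps 1)–5) can be written out directly in numpy. The sketch below follows the steps literally — random initial centers, assignment, re-centering, a convergence test, and a final re-assignment pass; the Euclidean distance and the tolerance value are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, tol=1e-4, seed=0):
    """Steps 1)-5) of A02: random centers, assign, re-center, iterate, final pass."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1)
    while True:
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # step 2)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])       # step 3)
        moved = np.linalg.norm(new - centers)                     # step 4)
        centers = new
        if moved <= tol:                                          # small difference:
            break                                                 # optimal centers found
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers                          # step 5)

# Two well-separated toy clusters of 25 points each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(6, 0.3, (25, 2))])
labels, centers = kmeans(X, k=2)
```

On separated data like this the iteration converges in a few rounds, and the final pass of step 5) assigns each cluster's points a single shared label.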
After the sample data is divided into groupings in step A02, step A03 checks whether each grouping contains positive-class data. If not, all data within the grouping is deleted from the data set; if positive-class data exists, the data within the grouping continues to be retained. Here, a state variable f may be created: f = true if any data grouping was deleted, and f = false otherwise.
Then, in step A04, it is checked whether the magnitudes of the positive-class and negative-class data in the sample are satisfactory, i.e., whether their ratio is balanced. A threshold parameter a can be set, and the ratio of the number G of positive-class samples to the total number M of data samples is compared with it:

if G/M >= a, the ratio is balanced, and step A06 is performed;

if G/M < a, the ratio has not reached balance, and steps A05 -> A02 -> A03 -> A04 are executed, where the value of the state variable f set in step A03 may optionally be updated accordingly on each cycle.
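Steps A03–A04 together amount to a filter plus a balance test. A minimal numpy sketch (the toy labels and the value of a are assumptions for illustration):

```python
import numpy as np

def check_groups(y, labels, a):
    """A03: drop groupings holding no positive-class data (f records whether any
    grouping was deleted).  A04: test whether G/M >= a on what remains."""
    keep = [c for c in np.unique(labels) if np.any(y[labels == c] == 1)]
    f = len(keep) < len(np.unique(labels))   # true if a grouping was deleted
    mask = np.isin(labels, keep)
    G, M = int((y[mask] == 1).sum()), int(mask.sum())
    return mask, f, G / M >= a

# Three groupings; grouping 0 holds no positives and is deleted (so f = True).
y      = np.array([0, 0, 0, 0, 1, 0, 1, 1])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])
mask, f, balanced = check_groups(y, labels, a=0.5)
```

After deletion, G = 3 positives remain out of M = 5 retained samples, so G/M = 0.6 clears the threshold a = 0.5 and the flow would proceed to A06.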
In step A05, the parameter k of the k-means algorithm is adjusted to disperse the data further, so that the positive-class and negative-class data can be better separated. Specifically, the value of the parameter k is increased each time this step is performed.
Preferably, a maximum value x of the parameter k may be set at this step. A larger x means a greater number of groupings and less sample data per grouping, hence a greater degree of data dispersion, so that data with similar characteristics appear more distinctly as a grouping. However, too large a value of x may make individual data groupings too tight and their independence too weak. The value of x should therefore balance the tightness of the groupings against their independence. With the value of x set, one specific implementation of step A05 is as follows:
IF (f = true), then keep k unchanged and proceed to steps A02 -> A03 -> A04,
IF (f = false and k < x), then k = k + 1 and proceed to steps A02 -> A03 -> A04,
IF (f = false and k >= x), then go to step A06.
It should be noted that the above pseudo code is only an example illustrating one specific implementation of the technical solution of the present invention. Other specific examples may use different pseudo code. For instance, in some implementations, step A05 still increases the number of groupings even when a data grouping has been deleted in step A03. In that case, the condition on the state variable f in the above pseudo code is unnecessary, and the value of k can be increased as long as it has not reached the maximum value x. A modified example is as follows:
IF (k < x), then k = k + 1 and proceed to steps A02 -> A03 -> A04,
IF (k >= x), then go to step A06.
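Putting the loop A02 -> A03 -> A04 -> A05 together using the simplified variant just given (no state variable f, k capped at x), one hypothetical sketch is the following. The single-pass `cluster` helper is only a rough stand-in for a full k-means run, and all names are assumptions:

```python
import random

def cluster(samples, k):
    """Rough k-means-style split of (value, label) samples into k groupings:
    pick k random centers and assign each sample to its nearest one."""
    centers = [v for v, _ in random.sample(samples, min(k, len(samples)))]
    groups = [[] for _ in centers]
    for s in samples:
        idx = min(range(len(centers)), key=lambda i: abs(s[0] - centers[i]))
        groups[idx].append(s)
    return groups

def balance_samples(samples, k, a, x):
    """A02-A06: re-cluster with a growing k, dropping all-negative groupings,
    until the positive ratio reaches a or k hits the cap x."""
    while True:
        groups = cluster(samples, k)                          # A02
        groups = [g for g in groups
                  if any(lbl == 1 for _, lbl in g)]           # A03
        samples = [s for g in groups for s in g]
        G = sum(1 for _, lbl in samples if lbl == 1)
        if samples and G / len(samples) >= a:                 # A04: balanced
            return samples                                    # A06
        if k >= x:                                            # cap x reached
            return samples
        k += 1                                                # A05
```

Because only groupings without positive-class data are ever deleted, positive samples are never removed and the positive ratio can only grow or stay the same, which matches the balancing argument of the description.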
In step A06, sample data whose positive and negative classes are relatively balanced compared with the initial sample is obtained after the data adjustment and deletion of steps A01-A04 (and, in some cases, also A05). Pattern recognition and learning may be performed based on this sample data.
In the technical solutions of the embodiments of the present invention described above with reference to figs. 1 to 7, part of the negative-class data is removed, which improves the balance between positive-class and negative-class data in the remaining sample; moreover, the negative-class data removed are those that differ most from the positive-class data. Because such data do not bear on the model's learning target, the reduction in sample size caused by their removal hardly affects subsequent model learning, and the data characteristics are well preserved. Performing model learning on a sample with a relatively small difference between positive and negative class sizes makes the learned model's recognition capability stronger and improves recognition accuracy.
Fig. 8 is a block diagram illustrating an example hardware arrangement of the apparatus shown in fig. 3 or 5 in accordance with an embodiment of the present disclosure. The hardware arrangement includes a processor 506 (e.g., a microprocessor (μ P), a Digital Signal Processor (DSP), etc.). Processor 506 may be a single processing unit or multiple processing units to perform the different actions of the processes described herein. The arrangement may also comprise an input unit 502 for receiving signals from other entities, and an output unit 504 for providing signals to other entities. The input unit 502 and the output unit 504 may be arranged as a single entity or as separate entities.
Furthermore, the arrangement may comprise at least one readable storage medium 508 in the form of a non-volatile or volatile memory, for example an electrically erasable programmable read-only memory (EEPROM), a flash memory, and/or a hard disk drive. The readable storage medium 508 comprises a computer program 510, which computer program 510 comprises code/computer readable instructions, which when executed by the processor 506 in the arrangement, cause the hardware arrangement and/or the device comprising the hardware arrangement to perform a procedure, such as described above in connection with fig. 1/2, and any variants thereof.
In the case of an implementation of the apparatus shown in fig. 3, the computer program 510 may be configured as computer program code having, for example, an architecture of computer program modules 510A-510E. Thus, in an example embodiment when the hardware arrangement is used, for example, in a device, the code in the computer program of the arrangement comprises: a module 510A, configured to divide sample data into a plurality of groups by means of clustering; a module 510B, configured to check whether positive class identification data exists in each of the plurality of groups, and delete a group that does not contain the positive class identification data; a module 510C for determining a total amount of positive class identification data in the plurality of groupings; a module 510D, configured to determine whether a proportion of the total amount of the positive identification data in the sample data is greater than a first predetermined threshold; a module 510E, configured to retain the data processed by the module 510B if the ratio is greater than the first predetermined threshold.
In the case of an implementation of the apparatus shown in fig. 5, the computer program 510 may be configured as computer program code having only an architecture of, for example, computer program modules 510A-510B. Thus, in an example embodiment when the hardware arrangement is used, for example, in a device, the code in the computer program of the arrangement comprises: module 510A for obtaining a processed plurality of groupings of sample data based on processing by the device shown in fig. 3. The code in the computer program further comprises: a module 510B for performing pattern recognition and learning based on the processed plurality of groupings of sample data.
The computer program modules may perform substantially the respective actions of the flows shown in figures 1 or 2 to simulate the apparatus shown in figures 3 or 5. In other words, when different computer program modules are executed in the processor 506, they may correspond to the different units described above in the apparatus shown in fig. 3 or 5.
Although the code means in the embodiment disclosed above in connection with fig. 8 are implemented as computer program modules which, when executed in the processor 506, cause the hardware arrangement to perform the actions described above in connection with fig. 1 or 2, at least one of the code means may, in alternative embodiments, be implemented at least partly as hardware circuits.
The processor may be a single CPU (central processing unit), but may also include two or more processing units. For example, a processor may include a general purpose microprocessor, an instruction set processor, and/or related chip sets and/or special purpose microprocessors (e.g., an Application Specific Integrated Circuit (ASIC)). The processor may also include on-board memory for caching purposes.
The computer program may be carried by a computer program product connected to the processor. The computer program product may comprise a computer readable medium having a computer program stored thereon. For example, the computer program product may be a flash memory, a Random Access Memory (RAM), a Read Only Memory (ROM), an EEPROM, and the above-mentioned computer program modules may in alternative embodiments be distributed in the form of a memory within the UE to the different computer program products.
It should be noted that the technical solutions described in the embodiments of the present invention can be arbitrarily combined without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for implementing the embodiments of the present invention, and those skilled in the art will understand that any modification or partial replacement without departing from the scope of the present invention shall fall within the scope defined by the claims of the present invention, and therefore, the scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A method for processing behavior data indicative of user behavior, comprising:
(a) dividing the sample data of the behavior data into a plurality of groups in a clustering mode;
(b) checking whether positive type identification data exist in each of the groups, and deleting the groups which do not contain the positive type identification data, wherein the positive type identification data are data of users with abnormity;
(c) determining a total amount of positive class identification data in the plurality of groupings;
(d) determining whether the proportion of the total amount of the positive identification data in the sample data is greater than a first preset threshold value;
(e) retaining the data after the processing of step (b) if the ratio is greater than the first predetermined threshold; and
(f) in the case that the ratio is smaller than the first predetermined threshold, performing steps (a)-(e) with the data processed in step (b) as the sample data, wherein the number of groups is increased in the repeatedly performed step (a).
2. The method of claim 1, further comprising:
setting a maximum quantity value of the marshalling;
in step (f), if the number of current groupings is greater than or equal to the maximum number value, the number of groupings is no longer increased, the data processed in the most recent step (b) is retained, and the data processing ends.
3. The method according to claim 1 or 2, wherein the way of clustering is implemented using a k-means algorithm,
wherein an initial number of the groupings is determined by a k-means algorithm such that at the initial number, an overall change in data variance within each cluster is mitigated.
4. The method of claim 3, wherein step (a) comprises:
(a1) randomly extracting a plurality of sample data in the sample data respectively as a central point of each of the plurality of groups, wherein the number of the plurality of sample data is equal to that of the plurality of groups to be divided;
(a2) calculating the distance from all the data samples to each central point, and dividing each data sample into a group in which the central point closest to the data sample is positioned;
(a3) calculating the average value of all data samples in each grouping, and taking the average value as a new central point in each grouping;
(a4) for each of the groups, determining whether a difference between the new center point and a previous center point is greater than a second predetermined threshold, and if the difference between center points is greater than the second predetermined threshold, performing steps (a2) and (a3) using the new center point, and if the difference between center points is not greater than the second predetermined threshold, determining the new center as an optimal cluster center; and
(a5) calculating the distance from all sample data to each optimal cluster center point, and reclassifying each sample data into the grouping in which the optimal cluster center point closest to that sample data is located.
5. A method for pattern recognition and learning, comprising:
obtaining a processed plurality of groupings of sample data based on the method of any of claims 1 to 4; and
performing pattern recognition and learning based on the processed plurality of groupings of the sample data.
6. A data processing apparatus comprising:
the grouping dividing module is used for dividing sample data of the behavior data indicating the user behavior into a plurality of groups in a clustering mode;
the data checking module is used for checking whether positive type identification data exist in each of the plurality of groups, wherein the positive type identification data are data of users with exceptions;
the data deleting module is used for deleting the grouping which does not contain the positive type identification data;
a data quantity determination module for determining a total quantity of positive type identification data in the plurality of groupings;
the data proportion determining module is used for determining whether the proportion of the total amount of the positive identification data in the sample data is greater than a first preset threshold value; and
a data retaining module for retaining the data processed by the data deleting module under the condition that the ratio is greater than the first predetermined threshold value,
wherein the grouping dividing module is further configured to perform grouping division on the data processed by the data deleting module as the sample data to repeat operations of the data checking module, the data deleting module, the data quantity determining module, the data proportion determining module and the data retaining module if the ratio is smaller than the first predetermined threshold, wherein the number of groupings is increased in a grouping operation on the data processed by the data deleting module.
7. The data processing apparatus of claim 6, further comprising:
the maximum grouping number setting module is used for setting the maximum number value of grouping;
wherein, if the number of the current grouping is larger than or equal to the maximum number value, the grouping dividing module does not increase the number of the grouping when the data grouping operation is repeatedly executed, and the data retaining module retains the data processed by the data deleting module at the latest time.
8. The data processing apparatus according to claim 6 or 7, wherein the grouping partitioning module employs a k-means algorithm to implement the manner of clustering,
wherein the group partitioning module determines an initial number of the groups by a k-means algorithm such that at the initial number, an overall change in data variance within each cluster is mitigated.
9. The data processing apparatus of claim 6, wherein the group partitioning module further comprises:
a central point determining submodule, configured to randomly extract, from the sample data, a plurality of sample data as a central point of each of the plurality of groups, where the number of the plurality of sample data is equal to the number of the plurality of groups to be divided;
the central point distance determining submodule is used for calculating the distance from all the data samples to each central point and dividing each data sample into a group in which the central point closest to the data sample is positioned;
the center point re-determination submodule is used for calculating the average value of all data samples in each grouping and taking the average value as a new center point in each grouping;
an optimal cluster center determining sub-module, configured to determine, for each of the groups, whether a difference between the new center point and a previous center point is greater than a second predetermined threshold, and if the difference between the center points is greater than the second predetermined threshold, send the new center point to the center point distance determining sub-module to perform data grouping and new center point determination again, and if the difference between the center points is not greater than the second predetermined threshold, determine the new center as an optimal cluster center; and
and the grouping determination submodule is used for calculating the distance from all sample data to each optimal clustering center point and reclassifying each sample data into a grouping where the optimal clustering center point closest to the sample data is located.
10. An apparatus for pattern recognition and learning, comprising:
a data grouping acquisition module for obtaining a processed plurality of groupings of sample data from the data processing apparatus according to any of claims 6 to 9; and
an identification and learning sub-module to perform pattern identification and learning based on the processed plurality of groupings of the sample data.
11. A data processing apparatus comprising:
a memory for storing executable instructions; and
a processor for executing executable instructions stored in the memory to perform the method of any one of claims 1 to 4.
12. An apparatus for pattern recognition and learning, comprising:
a memory for storing executable instructions; and
a processor for executing executable instructions stored in the memory to perform the method of claim 5.
13. A memory device having a computer program loaded thereon, which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 4.
14. A memory device having a computer program loaded thereon, which, when executed by a processor, causes the processor to carry out the method according to claim 5.
CN201611112409.8A 2016-12-06 2016-12-06 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium Active CN108154163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611112409.8A CN108154163B (en) 2016-12-06 2016-12-06 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium


Publications (2)

Publication Number Publication Date
CN108154163A (en) 2018-06-12
CN108154163B (en) 2020-11-24

Family

ID=62468532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611112409.8A Active CN108154163B (en) 2016-12-06 2016-12-06 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium

Country Status (1)

Country Link
CN (1) CN108154163B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147081B (en) * 2018-09-03 2021-02-26 深圳市智物联网络有限公司 Equipment operation stability analysis method and system
CN109447103B (en) * 2018-09-07 2023-09-29 平安科技(深圳)有限公司 Big data classification method, device and equipment based on hard clustering algorithm
CN109495291B (en) * 2018-09-30 2021-11-16 创新先进技术有限公司 Calling abnormity positioning method and device and server
CN110427358B (en) * 2019-02-22 2021-04-30 北京沃东天骏信息技术有限公司 Data cleaning method and device and information recommendation method and device
CN110579708B (en) * 2019-08-29 2021-10-22 爱驰汽车有限公司 Battery capacity identification method and device, computing equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999615A (en) * 2012-11-29 2013-03-27 合肥工业大学 Diversified image marking and retrieving method based on radial basis function neural network
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data sorting treatment method
CN105868775A (en) * 2016-03-23 2016-08-17 深圳市颐通科技有限公司 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN105912726A (en) * 2016-05-13 2016-08-31 北京邮电大学 Density centrality based sampling and detecting methods of unusual transaction data of virtual assets


Also Published As

Publication number Publication date
CN108154163A (en) 2018-06-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant