
CN109214435A - Data classification method and device

Info

Publication number: CN109214435A
Application number: CN201810956382.3A
Authority: CN
Inventors: 李明; 孙翯; 池天宇; 刘冬阳; 张启龙; 王玲玲; 黎佳林; 胡海波; 张仲朋; 薛旭锋
Original assignee: Beijing Harmony Information Technology Ltd By Share Ltd
Current assignee: Beijing Harmony Information Technology Ltd By Share Ltd
Priority date: 2018-08-21
Filing date: 2018-08-21
Publication date: 2019-01-15

Abstract

An embodiment of the invention discloses a data classification method and device. The method comprises: after randomly combined source data is obtained in real time, first classifying the randomly combined source data to obtain multiple pieces of class data, where each piece of class data corresponds to one class label; meanwhile, determining the weight of the class label corresponding to each piece of class data; and then storing the determined weight of each class label, in correspondence with that class label, in a tag library. In this way, the tag system service module can achieve real-time association between data and class labels, effectively avoiding data classification errors and thereby improving the accuracy and availability of the entire data management system.

Description

Data classification method and device

Technical field

The present invention relates to big data processing technology, and in particular to a data classification method and device.

Background

At present, a big data management system mainly includes three major service modules: a data management platform (DMP), a tag system (LOS), and a model platform (MMP). Under high-concurrency big data conditions, data is usually required to stay updated in real time, as far as possible, across the service modules of the data management system.

However, big data management systems in the related art have obvious defects: 1) some updates between data and labels do not keep the association up to date in real time; 2) the tag library is not kept up to date in real time; 3) the way labels can be set is significantly limited.

In actual data management, these defects of the big data management system directly lead to data classification errors, which affect the accuracy and availability of the entire big data management system.

Summary of the invention

To effectively overcome the defects of big data management systems in the prior art, embodiments of the present invention provide a data classification method and device.

According to a first aspect of the invention, a data classification method is provided. The method comprises: obtaining randomly combined source data; classifying the randomly combined source data to obtain multiple pieces of class data, where each piece of class data corresponds to one class label; determining the weight of the class label corresponding to each piece of class data; and storing the determined weight of each class label, in correspondence with that class label, in a tag library.

According to an embodiment of the invention, classifying the randomly combined source data to obtain multiple pieces of class data comprises: classifying the randomly combined source data based on preset class labels, to obtain class data that each have a mapping relationship with the preset class labels. Determining the weight of the class label corresponding to each piece of class data comprises: while the randomly combined source data is being classified based on the preset class labels, determining the weights of the preset class labels according to at least one feature dimension of the randomly combined source data.

According to an embodiment of the invention, classifying the randomly combined source data to obtain multiple pieces of class data comprises: performing a primary classification of the randomly combined source data according to different feature dimensions using a first specific classification algorithm, to obtain a primary classification result; and performing a secondary classification of the primary classification result by refining the feature dimensions using a second specific classification algorithm, to obtain multiple pieces of class data. Determining the weight of the class label corresponding to each piece of class data comprises: while the primary classification result is undergoing secondary classification by refining each feature dimension using the second specific classification algorithm, determining the weight of the class label corresponding to each piece of class data according to the proportion of that class data among the multiple pieces of class data.

According to an embodiment of the invention, the first specific classification algorithm includes at least one of the following algorithms: clustering, classification tree, random forest.

According to an embodiment of the invention, the second specific classification algorithm includes at least one of the following algorithms: Bayesian classification, logistic regression training.

According to an embodiment of the invention, the method further comprises: in response to use of a class label, performing an update operation on the weight of the class label corresponding to each piece of class data.

According to a second aspect of the invention, a data classification device is provided. The device comprises: an acquisition module, configured to obtain randomly combined source data; a classification processing module, configured to classify the randomly combined source data to obtain multiple pieces of class data, where each piece of class data corresponds to one class label; a determining module, configured to determine the weight of the class label corresponding to each piece of class data; and a storage module, configured to store the determined weight of each class label, in correspondence with that class label, in a tag library.

According to an embodiment of the invention, the classification processing module is further configured to classify the randomly combined source data based on preset class labels, to obtain class data that each have a mapping relationship with the preset class labels; and the determining module is further configured to determine the weights of the preset class labels according to at least one feature dimension of the randomly combined source data, while the classification processing module is classifying the randomly combined source data based on the preset class labels.

According to an embodiment of the invention, the classification processing module is further configured to perform a primary classification of the randomly combined source data according to different feature dimensions using a first specific classification algorithm, to obtain a primary classification result, and to perform a secondary classification of the primary classification result by refining the feature dimensions using a second specific classification algorithm, to obtain multiple pieces of class data; and the determining module is further configured to determine the weight of the class label corresponding to each piece of class data according to the proportion of that class data among the multiple pieces of class data, while the classification processing module is performing the secondary classification by refining each feature dimension using the second specific classification algorithm.

According to an embodiment of the invention, the device further comprises: an update module, configured to perform, in response to use of a class label, an update operation on the weight of the class label corresponding to each piece of class data.

With the data classification method and device of the embodiments of the invention, after randomly combined source data is obtained in real time, the randomly combined source data is first classified to obtain multiple pieces of class data, where each piece of class data corresponds to one class label; meanwhile, the weight of the class label corresponding to each piece of class data is determined; the determined weight of each class label is then stored, in correspondence with that class label, in a tag library. In this way, the tag system service module can achieve real-time association between data and class labels, effectively avoiding data classification errors and thereby improving the accuracy and availability of the entire data management system.

It should be understood that the teachings of the invention need not achieve all of the beneficial effects described above; rather, a specific technical solution may achieve a specific technical effect, and other embodiments of the invention may also achieve beneficial effects not mentioned above.

Brief description of the drawings

The above and other objects, features, and advantages of exemplary embodiments of the invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the invention are shown by way of example and not limitation, in which:

In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.

Fig. 1 shows the structural composition of the data management system of the present invention;

Fig. 2 shows a schematic flowchart of one implementation of the data classification method of an embodiment of the invention;

Fig. 3 shows a schematic flowchart of another implementation of the data classification method of an embodiment of the invention;

Fig. 4 shows a schematic diagram of the composition of the data classification device of an embodiment of the invention.

Detailed description of the embodiments

The principle and spirit of the invention are described below with reference to several exemplary embodiments. It should be understood that these embodiments are provided only to enable those skilled in the art to better understand and thereby implement the invention, and not to limit the scope of the invention in any way. Rather, these embodiments are provided so that the invention will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The technical solution of the invention is further elaborated below with reference to the drawings and specific embodiments.

Fig. 1 shows the structural composition of the data management system of the present invention.

As shown in Fig. 1, the data management system of the present invention includes a data management platform (DMP), a tag system (LOS), and a model platform (MMP).

The DMP is mainly used for cleaning, filtering, and repairing data; the LOS is mainly used for updating and storing data class labels; and the MMP is mainly used for creating, maintaining, and storing models.

The operation of the entire data management system mainly includes the following key steps:

In the first step, the DMP pulls the data and then formats it; the formatted data vectors are input to the service sink, which cleans and filters the data; abnormal data that does not pass the service sink is then repaired, to ensure the accuracy of the data.
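The first step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation described by the patent: the record fields (`speed`, `duration`), the range check standing in for the service sink, and the clamp-based repair are all hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Vector = Tuple[float, float]

# Hypothetical valid ranges used by the stand-in service sink.
MAX_SPEED = 200.0
MAX_DURATION = 24.0


def format_record(raw: Record) -> Vector:
    """Format a pulled record into a numeric vector (speed, duration)."""
    return (float(raw["speed"]), float(raw["duration"]))


def passes_service_sink(vec: Vector) -> bool:
    """Stand-in filter: both components must lie in a plausible range."""
    speed, duration = vec
    return 0.0 <= speed <= MAX_SPEED and 0.0 <= duration <= MAX_DURATION


def repair(vec: Vector) -> Vector:
    """Hypothetical repair: clamp out-of-range components into range."""
    speed, duration = vec
    return (min(max(speed, 0.0), MAX_SPEED), min(max(duration, 0.0), MAX_DURATION))


def dmp_prepare(raw_records: List[Record]) -> List[Vector]:
    """Pull -> format -> filter -> repair, in the order the text describes."""
    prepared = []
    for raw in raw_records:
        vec = format_record(raw)
        if not passes_service_sink(vec):
            vec = repair(vec)  # only abnormal data is repaired
        prepared.append(vec)
    return prepared
```

Records that pass the filter are forwarded unchanged; only records rejected by the filter go through the repair step.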

In the second step, the data cleaned, filtered, and repaired by the DMP is input to the LOS for classification, and the data carrying weights and class labels is stored in the tag library.

In the third step, the MMP analyzes and screens the weighted, class-labeled data stored in the tag library, by automatically creating a model or modifying a model.

It should be added that, in the second step, the class label may be a common label or an anonymous label.

The implementation of the LOS is described in detail below.

Fig. 2 shows a schematic flowchart of an implementation of the data classification method of an embodiment of the invention.

As shown in Fig. 2, the data classification method of the embodiment of the invention includes: operation 201, obtaining randomly combined source data; operation 202, classifying the randomly combined source data to obtain multiple pieces of class data, where each piece of class data corresponds to one class label; operation 203, determining the weight of the class label corresponding to each piece of class data; and operation 204, storing the determined weight of each class label, in correspondence with that class label, in the tag library.
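Operations 201 to 204 can be read as a small pipeline. The sketch below is a minimal illustration, assuming the tag library is a plain dictionary and the weight is each class's share of all items; the function names and the share-based weight rule are hypothetical choices, since the text at this point does not fix a weight formula.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def classify(source: Iterable, label_of: Callable[[object], Hashable]) -> Dict[Hashable, List]:
    """Operation 202: group source items into class data, one class label per group."""
    classes = defaultdict(list)
    for item in source:
        classes[label_of(item)].append(item)
    return dict(classes)


def determine_weights(classes: Dict[Hashable, List]) -> Dict[Hashable, float]:
    """Operation 203 (assumed rule): weight = share of all items in the class."""
    total = sum(len(items) for items in classes.values())
    if total == 0:
        return {}
    return {label: len(items) / total for label, items in classes.items()}


def store(tag_library: Dict[Hashable, float], weights: Dict[Hashable, float]) -> None:
    """Operation 204: store each weight under its class label."""
    tag_library.update(weights)


def run(source: Iterable, label_of: Callable[[object], Hashable],
        tag_library: Dict[Hashable, float]) -> Dict[Hashable, List]:
    """Operations 202-204 applied to source data obtained in operation 201."""
    classes = classify(source, label_of)
    store(tag_library, determine_weights(classes))
    return classes
```

The caller supplies the source data (operation 201) and the labeling function, so the same skeleton could host either the common-label or the anonymous-label implementation.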

In operation 202, depending on the type of class label, there are two different implementations of classifying the randomly combined source data.

For common labels, in operation 202, classifying the randomly combined source data to obtain multiple pieces of class data comprises: classifying the randomly combined source data based on preset class labels, to obtain class data that each have a mapping relationship with the preset class labels. Here, the preset class labels are usually determined manually by the user. The preset class labels include driving score, driving speed, driving duration, and the like.

Correspondingly, in operation 203, determining the weight of the class label corresponding to each piece of class data comprises: while the randomly combined source data is being classified based on the preset class labels, determining the weights of the preset class labels according to at least one feature dimension of the randomly combined source data. The at least one feature dimension may include time, region, position, and the like.

In one application example, a user wants to analyze and screen the driving behavior of all drivers in the Beijing area, and therefore presets a series of class labels, such as driving score, driving speed, and driving duration. The source data is then partitioned using a support vector machine (SVM), thereby classifying the data, and a many-to-many mapping is established between the DMP data and the series of preset class labels. At the same time, the weights are initialized according to at least one feature dimension of the source data, such as time, region, or position. The weights can, of course, be updated dynamically.
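The many-to-many mapping and the weight initialization in this example can be sketched as below. Note the substitutions: simple membership rules stand in for the SVM partition, and the weight rule (the share of a label's records that match a chosen feature-dimension value) is an assumption, because the text does not give a formula. The record fields and the `PRESET_LABELS` table are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical preset class labels, each with a rule standing in for the SVM.
PRESET_LABELS: Dict[str, Callable[[Record], bool]] = {
    "driving score": lambda r: "score" in r,
    "driving speed": lambda r: "speed" in r,
    "driving duration": lambda r: "duration" in r,
}


def map_to_labels(records: List[Record]) -> Dict[str, List[Record]]:
    """Many-to-many mapping: a record may fall under several preset labels."""
    mapping: Dict[str, List[Record]] = {label: [] for label in PRESET_LABELS}
    for rec in records:
        for label, rule in PRESET_LABELS.items():
            if rule(rec):
                mapping[label].append(rec)
    return mapping


def init_weights(mapping: Dict[str, List[Record]], dimension: str, value: object) -> Dict[str, float]:
    """Assumed rule: weight = share of a label's records matching the dimension value."""
    weights = {}
    for label, recs in mapping.items():
        matched = sum(1 for r in recs if r.get(dimension) == value)
        weights[label] = matched / len(recs) if recs else 0.0
    return weights
```

For instance, calling `init_weights(mapping, "region", "Beijing")` would initialize each label's weight from the region dimension; another dimension such as time could be used the same way.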

In this way, the setting of class labels innovatively adds a communication system among the MMP, LOS, and DMP, which improves the activity of the data, tightens the connection among data, labels, and raw data, and improves the accuracy of the data directory.

In classification with common labels, it is easy to see that the entire classification relies mainly on manual labeling and is therefore subjective and limited by the user's prior knowledge. Common labels thus cannot satisfy the user's need to refine the data, nor guarantee the user's control over the data, so an anonymous label system needs to be established.

The classification of source data with anonymous labels is described in detail below.

For anonymous labels, in operation 202, classifying the randomly combined source data to obtain multiple pieces of class data comprises: performing a primary classification of the randomly combined source data according to different feature dimensions using a first specific classification algorithm, to obtain a primary classification result; and performing a secondary classification of the primary classification result by refining the feature dimensions using a second specific classification algorithm, to obtain multiple pieces of class data. The first specific classification algorithm includes at least one of the following algorithms: clustering, classification tree, random forest.

Correspondingly, in operation 203, determining the weight of the class label corresponding to each piece of class data comprises: while the primary classification result is undergoing secondary classification by refining each feature dimension using the second specific classification algorithm, determining the weight of the class label corresponding to each piece of class data according to the proportion of that class data among the multiple pieces of class data. The second specific classification algorithm includes at least one of the following algorithms: Bayesian classification, logistic regression training.

In one application example, the first and second specific classification algorithms may be clustering and Bayesian classification, respectively, to classify the source data. The general idea is as follows: the data is clustered according to different feature dimensions, for example a region dimension, a time dimension, a work dimension, and a housing dimension; Bayesian classification is then used to refine each feature dimension and to initialize the weight of each refined anonymous label.

For example:

Step 1: all data is first serialized and normalized according to coordinate vectors, mainly to facilitate the later similarity computation and to reduce computational complexity.

Step 2: the similarity between data vectors is computed and the data is clustered by similarity; each class stands alone as an anonymous label. Each class is then classified internally, and so on, until the differences between the elements within each class are sufficiently small (by default, this is judged against a configurable similarity).

Step 3: the time period is roughly divided into 24 intervals, one hour per interval, and step 2 is executed for each interval to calculate the weight of each label. There are many ways to compute the weight; one that is relatively easy to understand is given here: sum the element counts of the similar classes in each time interval, denoted S; denote the element count of the corresponding class in each time interval as e; the weight is then e/S × 100%.

Step 4: the same process is executed for all labels to obtain the weight of each label.
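Steps 1 to 3 can be illustrated for a single time interval as follows. This is a sketch under assumptions: min-max scaling stands in for the normalization, a greedy distance-threshold pass stands in for the similarity clustering (the recursive refinement inside each class is omitted), and S is read as the total element count over all classes in the interval, which is only one possible reading of the text. The label names `anon-0`, `anon-1`, ... are hypothetical.

```python
from collections import Counter
from math import dist
from typing import Dict, List, Sequence, Tuple


def normalize(vectors: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Step 1 (assumed form): min-max scale every coordinate to [0, 1]."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(vec, lows, highs))
        for vec in vectors
    ]


def cluster(vectors: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Step 2 (simplified): join the first cluster whose first member is
    within `threshold` (Euclidean distance), else open a new cluster."""
    representatives: List[Sequence[float]] = []
    labels: List[int] = []
    for vec in vectors:
        for idx, rep in enumerate(representatives):
            if dist(vec, rep) <= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


def interval_weights(labels: Sequence[int]) -> Dict[str, float]:
    """Step 3: weight of each anonymous label as e / S * 100 (percent)."""
    counts = Counter(labels)
    total = sum(counts.values())  # S under the reading stated above
    return {f"anon-{k}": 100.0 * e / total for k, e in counts.items()}
```

Running the three functions once per hourly interval, and then repeating for every label, would correspond to steps 3 and 4.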

In this way, the setting of class labels innovatively adds a communication system among the MMP, LOS, and DMP, which improves the activity of the data, tightens the connection among data, labels, and raw data, and improves the accuracy of the data directory. Moreover, embodiments of the invention add anonymous labels, which allow finer data granularity and make it possible to mine the latent features of the data.

According to a possible embodiment of the invention, as shown in Fig. 3, after operation 204 the method further includes: operation 205, in response to use of a class label, performing an update operation on the weight of the class label corresponding to each piece of class data.

Here, the use of a class label may be the MMP triggering a model index. When a model index is triggered, the update operation on the weight of the class label corresponding to each piece of class data is started.

It should be added that, in practical applications, the update operation on the weight of the class label corresponding to each piece of class data may be performed based on user satisfaction.
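One way operation 205 might look is sketched below. The text says only that the update is triggered when a label is used and may be based on user satisfaction; the satisfaction score in [0, 1], the blending rule, and the `rate` parameter are all assumptions made for illustration.

```python
from typing import Dict


def on_label_used(tag_library: Dict[str, float], label: str,
                  satisfaction: float, rate: float = 0.5) -> float:
    """Update a stored weight when its class label is used.

    Assumed rule: blend the old weight with a satisfaction score in [0, 1];
    `rate` controls how strongly the new score moves the weight.
    """
    if not 0.0 <= satisfaction <= 1.0:
        raise ValueError("satisfaction must be in [0, 1]")
    old = tag_library[label]
    new = (1.0 - rate) * old + rate * satisfaction
    tag_library[label] = new
    return new
```

In this sketch the caller (for example, the component that triggers a model index) invokes `on_label_used` each time a label is used, so weights drift toward labels that users find satisfactory.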

Fig. 4 shows a schematic diagram of the composition of the data classification device of an embodiment of the invention. As shown in Fig. 4, the data classification device 40 includes:

an acquisition module 401, configured to obtain randomly combined source data;

a classification processing module 402, configured to classify the randomly combined source data to obtain multiple pieces of class data, where each piece of class data corresponds to one class label;

a determining module 403, configured to determine the weight of the class label corresponding to each piece of class data;

a storage module 404, configured to store the determined weight of each class label, in correspondence with that class label, in the tag library.

According to an embodiment of the invention, the classification processing module 402 is further configured to classify the randomly combined source data based on preset class labels, to obtain class data that each have a mapping relationship with the preset class labels; and the determining module 403 is further configured to determine the weights of the preset class labels according to at least one feature dimension of the randomly combined source data, while the classification processing module is classifying the randomly combined source data based on the preset class labels.

According to an embodiment of the invention, the classification processing module 402 is further configured to perform a primary classification of the randomly combined source data according to different feature dimensions using a first specific classification algorithm, to obtain a primary classification result, and to perform a secondary classification of the primary classification result by refining the feature dimensions using a second specific classification algorithm, to obtain multiple pieces of class data; and the determining module 403 is further configured to determine the weight of the class label corresponding to each piece of class data according to the proportion of that class data among the multiple pieces of class data, while the classification processing module is performing the secondary classification by refining each feature dimension using the second specific classification algorithm.

According to an embodiment of the invention, as shown in Fig. 4, the device 40 further comprises: an update module 405, configured to perform, in response to use of a class label, an update operation on the weight of the class label corresponding to each piece of class data.

It should be noted here that the description of the above data classification device embodiment is similar to the description of the foregoing method embodiment and has beneficial effects similar to those of the method embodiment, so it is not repeated. For technical details not disclosed in the device embodiment of the invention, refer to the description of the method embodiment of the invention.

It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the invention may all be integrated in one processing unit, each unit may serve individually as one unit, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.

Alternatively, if the integrated unit of the invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of the invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of the invention. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.

The above is merely a specific embodiment of the invention, but the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims

1. A data classification method, characterized in that the method comprises:

obtaining randomly combined source data;

classifying the randomly combined source data to obtain multiple pieces of class data, wherein each piece of class data corresponds to one class label;

determining the weight of the class label corresponding to each piece of class data;

storing the determined weight of each class label, in correspondence with that class label, in a tag library.

2. The method according to claim 1, characterized in that

classifying the randomly combined source data to obtain multiple pieces of class data comprises:

classifying the randomly combined source data based on preset class labels, to obtain class data that each have a mapping relationship with the preset class labels;

determining the weight of the class label corresponding to each piece of class data comprises:

while the randomly combined source data is being classified based on the preset class labels, determining the weights of the preset class labels according to at least one feature dimension of the randomly combined source data.

3. The method according to claim 1, characterized in that

a primary classification of the randomly combined source data is performed according to different feature dimensions using a first specific classification algorithm, to obtain a primary classification result;

a secondary classification of the primary classification result is performed by refining the feature dimensions using a second specific classification algorithm, to obtain multiple pieces of class data;

while the primary classification result is undergoing secondary classification by refining each feature dimension using the second specific classification algorithm, the weight of the class label corresponding to each piece of class data is determined according to the proportion of that class data among the multiple pieces of class data.

4. The method according to claim 3, characterized in that the first specific classification algorithm includes at least one of the following algorithms: clustering, classification tree, random forest.

5. The method according to claim 3, characterized in that the second specific classification algorithm includes at least one of the following algorithms: Bayesian classification, logistic regression training.

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

in response to use of a class label, performing an update operation on the weight of the class label corresponding to each piece of class data.

7. A data classification device, characterized in that the device comprises:

an acquisition module, configured to obtain randomly combined source data;

a classification processing module, configured to classify the randomly combined source data to obtain multiple pieces of class data, wherein each piece of class data corresponds to one class label;

a determining module, configured to determine the weight of the class label corresponding to each piece of class data;

a storage module, configured to store the determined weight of each class label, in correspondence with that class label, in a tag library.

8. The device according to claim 7, characterized in that

the classification processing module is further configured to classify the randomly combined source data based on preset class labels, to obtain class data that each have a mapping relationship with the preset class labels;

the determining module is further configured to determine the weights of the preset class labels according to at least one feature dimension of the randomly combined source data, while the classification processing module is classifying the randomly combined source data based on the preset class labels.

9. The device according to claim 7, characterized in that

the classification processing module is further configured to perform a primary classification of the randomly combined source data according to different feature dimensions using a first specific classification algorithm, to obtain a primary classification result, and to perform a secondary classification of the primary classification result by refining the feature dimensions using a second specific classification algorithm, to obtain multiple pieces of class data;

the determining module is further configured to determine the weight of the class label corresponding to each piece of class data according to the proportion of that class data among the multiple pieces of class data, while the classification processing module is performing the secondary classification by refining each feature dimension using the second specific classification algorithm.

10. The device according to any one of claims 7 to 9, characterized in that the device further comprises:

an update module, configured to perform, in response to use of a class label, an update operation on the weight of the class label corresponding to each piece of class data.