CN109299279A

CN109299279A - Data processing method, device, system and medium

Info

Publication number: CN109299279A
Application number: CN201811450760.7A
Authority: CN
Inventors: 朱细智
Original assignee: Beijing Qianxin Technology Co Ltd
Current assignee: Beijing Qianxin Technology Co Ltd
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2019-02-01
Anticipated expiration: 2038-11-29
Also published as: CN109299279B

Abstract

The present disclosure provides a data processing method, comprising: obtaining data and clustering the data to obtain N data categories; extracting M specific data categories from the N data categories; obtaining, from the data, first samples that conform to the specific data categories; determining one or more keywords for each specific data category; screening the first samples according to the keywords to obtain second samples; and generating a classification model from the second samples and calculating the matching degree of the classification model, wherein, if the matching degree is less than a preset threshold, the above operations are repeated until the matching degree of the established classification model is not less than the preset threshold. The disclosure further provides a data processing device, system and medium. By automatically clustering and classifying the data to be processed, the classification standard for that data is determined and the data is classified accurately.

Description

Data processing method, device, system and medium

Technical field

The present disclosure relates to the field of data processing, and in particular to a data processing method, device, system and medium.

Background

At present, enterprises mainly handle disorganized data by identifying it manually and then clustering and classifying it. Typically, staff read the content of the files to understand the topics they express and determine a set of data categories, and then read each file to be processed and assign it to one of those categories.

Because the business of each department is independent while the business of different departments overlaps, most employees find it difficult to judge correctly which categories existing data should be divided into. Because of the limits of employees' knowledge and experience, it is also generally difficult to assign data to a suitable category by manual identification. In addition, the time and economic cost of manual identification are often too high to be practical.

Summary of the invention

In view of the above problems, the present disclosure provides a data processing method, device, system and medium. By automatically clustering and classifying the data to be processed at a specified location, the classification standard for that data is determined, the data is classified accurately, and cost is reduced.

One aspect of the present disclosure provides a data processing method, comprising: obtaining data and clustering the data to obtain N data categories; extracting M specific data categories from the N data categories; obtaining, from the data, first samples that conform to the specific data categories; determining one or more keywords for each specific data category; screening the first samples according to the keywords to obtain second samples; and generating a classification model from the second samples and calculating the matching degree of the classification model, wherein, if the matching degree is less than a preset threshold, the above operations are repeated until the matching degree of the established classification model is not less than the preset threshold.

Optionally, clustering the data further comprises: extracting semantic features of the data; and selecting a clustering algorithm and clustering the data according to the semantic features.

Optionally, screening the first samples according to the keywords further comprises: matching the first samples against the keywords, and selecting the one or more first samples that contain the most keyword types and the most keyword occurrences.

Optionally, generating the classification model from the second samples further comprises: extracting semantic features of the second samples; judging, according to preset rules, the fitness between the semantic features of the second samples and the specific data category, and selecting the one or more semantic features of the second samples with the highest fitness; and generating the classification model from the selected semantic features.

Optionally, calculating the matching degree of the classification model further comprises: classifying the second samples with the classification model to obtain a classification result; and calculating the matching degree of the classification model from the classification result.

Optionally, the matching degree is selected from one or more of accuracy, precision, recall, F1 value, classification report, confusion matrix, ROC curve, and area under the ROC curve.

Optionally, repeating the above operations further comprises: removing second samples that have been deleted from the specific data category; adding second samples that have been newly added to or modified in the specific data category; and updating the second samples and generating a new classification model from the updated second samples.

In another aspect, the present disclosure further provides a data processing electronic device, comprising: a processor; and a memory storing a computer-executable program which, when executed by the processor, causes the processor to perform the above data processing method.

In another aspect, the present disclosure further provides a data processing system, comprising: a clustering module, configured to obtain data and cluster the data to obtain N data categories; a sample determining module, configured to extract M specific data categories from the N data categories, obtain from the data first samples that conform to the specific data categories, determine one or more keywords for each specific data category, and screen the first samples according to the keywords to obtain second samples; a classification model generation module, configured to generate a classification model from the second samples; and a classification model verification module, configured to calculate the matching degree of the classification model and, if the matching degree is less than a preset threshold, to run the above modules again until the matching degree of the established classification model is not less than the preset threshold.

In another aspect, the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above data processing method.

Brief description of the drawings

For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 schematically shows a flowchart of the data processing method according to an embodiment of the present disclosure.

Fig. 2 schematically shows a block diagram of the electronic device according to the present disclosure.

Fig. 3 schematically shows a block diagram of the data processing system according to an embodiment of the present disclosure.

Detailed description

Other aspects, advantages and salient features of the present disclosure will become apparent to those skilled in the art from the following detailed description of exemplary embodiments, taken in conjunction with the accompanying drawings.

In the present disclosure, the terms "include" and "comprise" and their derivatives mean inclusion without limitation; the term "or" is inclusive, meaning and/or.

In this specification, the various embodiments described below to explain the principles of the present disclosure are illustrative only and should not be construed in any way as limiting the scope of the disclosure. The following description, made with reference to the accompanying drawings, is intended to aid a comprehensive understanding of the exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and structures are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for similar functions and operations.

A file server is a device that stores a large number of files and serves them on request. The data processing method provided by the embodiments of the present disclosure is described using the file server of an enterprise customer as an example. A file is one form of data, so the files in the embodiments of the present disclosure can be understood as data. The file server of the enterprise customer stores a disorganized collection of files that may cover categories such as economy, sports, medicine, law, military and energy, but which files belong to which category is not yet known. The data processing method of the present disclosure generates a classification model for this collection of files and thereby classifies the files automatically and accurately.

Fig. 1 schematically shows a flowchart of the data processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes the following operations:

S1: obtain the data to be processed and automatically cluster it to obtain N data categories.

First, the path of the files to be processed is specified and the text content of the files under that path is extracted. Data cleaning is applied to the text content, and feature engineering techniques are then used to automatically extract the semantic features of the cleaned files.

Data cleaning is the process of re-examining and verifying data, with the aim of deleting duplicate information, correcting existing errors, and ensuring data consistency. For example, modal particles, adverbs, prepositions and other words that occur frequently but carry little meaning are filtered out of the files, and the sentences in the files are split into individual words.
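The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in the disclosure: the function name clean_documents and the STOP_WORDS list are hypothetical, and the sketch assumes whitespace-delimited English text (Chinese text would first need a word segmentation step).

```python
import re

# Hypothetical stop-word list; a real deployment would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "very", "to"}

def clean_documents(texts):
    """Drop exact duplicates, split text into words, and filter stop words."""
    seen = set()
    cleaned = []
    for text in texts:
        # Normalize case and whitespace so trivial variants count as duplicates.
        normalized = " ".join(text.lower().split())
        if normalized in seen:
            continue  # delete duplicate information
        seen.add(normalized)
        tokens = re.findall(r"[a-z0-9]+", normalized)  # split into single words
        cleaned.append([t for t in tokens if t not in STOP_WORDS])
    return cleaned
```

The token lists returned here are the kind of input that a later feature extraction or clustering step could consume.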

A semantic feature is a set of words close to the topic of a document. For example, the semantic features of a medical file might be disease, heart disease, tumor, health and medical instrument, and those of a legal file might be criminal law, civil law, copyright, people's court and labor arbitration.

Then, an automatic clustering algorithm is selected and the files to be processed are automatically clustered according to their semantic features to obtain N data categories.

Automatic clustering uses a particular algorithm to map different files to different points in a feature vector space and, according to how densely these points aggregate, groups the corresponding files into specific data categories. Taking the K-means algorithm as an example, once the number of data categories N is supplied, the files to be processed are automatically clustered into N data categories identified by numeric labels (such as 1, 2, 3, ..., N). Files within the same data category are highly similar to one another, while files in different data categories are less similar.
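To make the K-means example concrete, the following is a minimal pure-Python sketch. It assumes each file has already been reduced to a list of words, uses plain term counts as the feature vector, and takes the initial centroids explicitly through a hypothetical init_indices parameter so that the result is reproducible. The disclosure does not prescribe these choices, and a production system would more likely rely on an established library.

```python
from collections import Counter

def vectorize(token_lists):
    """Map each tokenized file to a term-count vector over a shared vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([float(counts[t]) for t in vocab])
    return vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, n_clusters, init_indices, max_iter=100):
    """Plain K-means; returns one numeric label in 1..n_clusters per vector."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [l + 1 for l in labels]
```

For four short files, two about medicine and two about the military, calling kmeans(vectorize(files), 2, init_indices=[0, 2]) groups the medical files under label 1 and the military files under label 2.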

S2: extract M specific data categories from the N data categories, and obtain from the data to be processed first samples that conform to the specific data categories.

First, operations such as moving files and merging categories are applied to the N data categories produced by automatic clustering to obtain Y data categories, and these Y data categories are identified with text labels (such as economy, sports, medicine, law, military, energy, ...).

The automatic clustering algorithm may make errors when clustering the files to be processed, so the accuracy of the clustering result needs to be judged by manually inspecting the file names or file contents. For example, suppose the files in data category 1 mainly relate to the all-round development of students and the files in data category 2 mainly relate to books. If manual inspection shows that data category 1 and data category 2 both relate to education, the two categories need to be merged. The clustering result is adjusted manually until it is optimal, yielding Y data categories, where Y ≤ N, and the numeric labels of these Y data categories are replaced with text labels according to the topic each category expresses.
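The manual adjustment can be recorded as two small lookup tables, as in the sketch below. The names merge_and_relabel, merge_into and text_labels are hypothetical and only illustrate how N numeric labels collapse into Y ≤ N text labels once a reviewer has decided which categories to merge.

```python
def merge_and_relabel(numeric_labels, merge_into, text_labels):
    """Apply reviewer-chosen merges, then swap numeric labels for text labels.

    merge_into maps a cluster number to the cluster that absorbs it;
    text_labels maps each surviving cluster number to its topic name.
    """
    merged = [merge_into.get(label, label) for label in numeric_labels]
    return [text_labels[label] for label in merged]
```

With clusters 1 and 2 both judged to be about education, merge_and_relabel([1, 2, 3, 2, 1], {2: 1}, {1: "education", 3: "medical"}) leaves two categories where there were three.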

Then, the enterprise customer confirms one or more specific data categories among these Y data categories according to its needs. For example, for an enterprise customer whose key businesses are medicine, military and energy, the customer confirms these three data categories as its specific data categories.

Finally, for each specific data category, an appropriate number of files conforming to that category are obtained from the files to be processed as first samples. For example, 1000 files closely related to medicine are obtained from the medical data category as the first samples of the medical data category, 1000 files closely related to the military are obtained from the military data category as the first samples of the military data category, and 1000 files closely related to energy are obtained from the energy data category as the first samples of the energy data category.

S3: determine the keywords of each specific data category, and screen the first samples according to the keywords to obtain second samples.

First, the keywords of each specific data category are determined. This operation is performed by the enterprise customer, who selects the keywords that best represent the content of the first samples. For example, the keywords of the medical data category may be "hospital, surgery, drug, medical instrument, health, physical examination, disease, heart disease, autism, mental illness, AIDS, tumor, cancer, rehabilitation training"; the keywords of the military data category may be "war, peace, military exercise, firearm, weapon, nuclear weapon, conflict, situation, Middle East, Afghanistan, Iraq, Ukraine, fifth-generation fighter, unmanned aerial vehicle, missile, aircraft carrier, Pentagon, Korean Peninsula"; and the keywords of the energy data category may be "new energy, petroleum, coal, natural gas, solar energy, resource, nuclear power station, photovoltaic, clean, production capacity".

Then, using keyword matching, the first samples of each specific data category are matched against the keywords of that category, and the first samples that contain more keyword types and more keyword occurrences are selected as the second samples.
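A simple way to picture this screening is to score each first sample by two numbers: how many distinct keywords it contains and how many keyword occurrences it contains in total. The sketch below is an assumption-laden illustration. The function name screen_by_keywords, the top_k cut-off, and the rule that keyword variety outranks total frequency are all hypothetical, since the disclosure does not say how the two criteria are combined.

```python
def screen_by_keywords(first_samples, keywords, top_k):
    """Keep the top_k samples ranked by keyword variety, then keyword frequency."""
    scored = []
    for index, text in enumerate(first_samples):
        counts = [text.count(k) for k in keywords]   # substring matching
        distinct = sum(1 for c in counts if c > 0)   # keyword types present
        total = sum(counts)                          # keyword occurrences
        scored.append((distinct, total, index))
    # Sort by variety, then frequency, then original order for stable ties.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    # Samples that match no keyword at all are never selected.
    return [first_samples[i] for d, t, i in scored[:top_k] if d > 0]
```

The returned texts play the role of the second samples for one specific data category.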

S4: generate a classification model from the second samples and calculate the matching degree of the classification model; if the matching degree is less than a preset threshold, repeat the above operations until the matching degree of the established classification model is not less than the preset threshold.

First, the text content of the second samples is extracted, data cleaning is applied to the text content, and feature engineering techniques are used to automatically extract the semantic features of the cleaned second samples. These semantic features are words close to the topic of the second samples that are extracted automatically by feature engineering, which overcomes the incompleteness of manually selected semantic features.

Second, the fitness between the semantic features of the second samples and the specific data category is judged according to preset rules, and the one or more semantic features of the second samples with the highest fitness are selected as the most representative semantic features. There are usually several most representative semantic features.

Fitness is the degree of correlation between a semantic feature of the second samples and the topic expressed by the specific data category: the higher the correlation, the higher the fitness. The preset rules may be rules formulated from human experience. For the medical data category, for example, consider the six semantic features "disinfection, tumor, health, surgery, rehabilitation training, physical examination". Suppose that, according to human experience, their relevance to medical topics from high to low is "surgery, tumor, rehabilitation training, physical examination, health, disinfection". If the four semantic features with the highest fitness are to be selected as the most representative, they are "surgery, tumor, rehabilitation training, physical examination".
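The worked example above amounts to sorting the extracted features by an expert-supplied ranking and keeping the first few. A minimal sketch follows, in which the function name select_representative and the representation of the preset rule as an ordered list are assumptions made for illustration.

```python
def select_representative(features, fitness_rank, top_k):
    """Order features by an expert ranking (best first) and keep the top_k.

    fitness_rank is the preset rule: a list of features from highest to
    lowest fitness. Features missing from the ranking are ignored.
    """
    rank = {feature: position for position, feature in enumerate(fitness_rank)}
    known = [f for f in features if f in rank]
    return sorted(known, key=rank.__getitem__)[:top_k]
```

Applied to the six medical features from the text with top_k=4, the function returns surgery, tumor, rehabilitation training and physical examination, in that order.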

Then, a classification algorithm is selected (such as naive Bayes, decision tree, random forest, or support vector machine (SVM)), and a classification model is generated from the most representative semantic features. The second samples are then imported and classified with the resulting classification model to obtain a classification result, and the classification result is compared with the expected result (that is, the second samples of each specific data category described above) to calculate the matching degree of the classification model.
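As one possible reading of this step, the sketch below trains a small multinomial naive Bayes classifier that only looks at the selected semantic features. It is an illustration under stated assumptions rather than the model of the disclosure: the names train_naive_bayes and classify are hypothetical, samples are assumed to be (word list, category) pairs, and Laplace smoothing is an added choice that the text does not mention.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, features):
    """samples: list of (tokens, category); only tokens in `features` count."""
    feature_set = set(features)
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for tokens, category in samples:
        doc_counts[category] += 1
        word_counts[category].update(t for t in tokens if t in feature_set)
    return {"features": feature_set, "docs": doc_counts, "words": word_counts}

def classify(model, tokens):
    """Return the category with the highest log-posterior for the tokens."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["features"])
    best, best_score = None, -math.inf
    for category in sorted(model["docs"]):
        counts = model["words"][category]
        denom = sum(counts.values()) + vocab_size  # Laplace smoothing
        score = math.log(model["docs"][category] / total_docs)
        for t in tokens:
            if t in model["features"]:
                score += math.log((counts[t] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best
```

Running classify over every second sample and collecting the predictions yields the classification result that is compared with the expected categories.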

The matching degree is selected from one or more of accuracy, precision, recall, F1 value, classification report, confusion matrix, ROC curve, and area under the ROC curve. Accuracy is the proportion of correctly classified samples among all samples in the classification result. Precision is the proportion of the samples retrieved in the classification result that are correct. Recall is the proportion of all correct samples that are retrieved. The F1 value is the weighted harmonic mean of precision and recall and serves as an evaluation index combining the two. The classification report is a combined summary of precision, recall and F1 value. The confusion matrix counts, for the classification model, the number of observations assigned to the wrong class and to the right class, and presents the results in a matrix. The ROC curve is a composite indicator that reflects recall (sensitivity) and specificity as continuous variables; the larger the area under the ROC curve, the more effective the classification model.
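For a single category treated as the positive class, the first few of these indicators follow directly from four counts: true positives, false positives, false negatives and true negatives. The sketch below shows the standard definitions; the function name evaluate and the one-versus-rest framing are choices made for illustration.

```python
def evaluate(expected, predicted, positive):
    """Accuracy, precision, recall, F1 and a 2x2 confusion matrix for one class."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == positive and p == positive)
    fp = sum(1 for e, p in pairs if e != positive and p == positive)
    fn = sum(1 for e, p in pairs if e == positive and p != positive)
    tn = sum(1 for e, p in pairs if e != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are the expected class, columns the predicted class.
        "confusion": [[tp, fn], [fp, tn]],
    }
```

The F1 value computed here is the unweighted harmonic mean of precision and recall, which is the usual special case of the weighted form.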

Finally, the matching degree of the classification model is compared with the preset threshold. If the matching degree is less than the preset threshold, the above operations are repeated until the matching degree of the established classification model is not less than the preset threshold.

Take as an example a matching degree that consists of recall rate, accuracy rate and F1 value, with a preset threshold of 95% for the recall rate, 98% for the accuracy rate and 96.5% for the F1 value. When the recall rate of the classification model is not less than 95%, its accuracy rate is not less than 98% and its F1 value is not less than 96.5%, the classification model is released and used to perform data classification. Otherwise, the above operations are repeated: second samples that have been deleted from the specific data category are removed, second samples that have been newly added to or modified in the specific data category are added, the second samples are updated, and a new classification model is generated from the updated second samples. This continues until the new classification model has a recall rate not less than 95%, an accuracy rate not less than 98% and an F1 value not less than 96.5%, at which point the classification model is released.
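The release decision and the surrounding retry loop can be sketched as follows, using the three example thresholds from the text. The helper names, the callables passed in, and the max_rounds safety limit are all hypothetical; the disclosure simply repeats the operations until the thresholds are met.

```python
# Example thresholds from the text: recall 95%, accuracy 98%, F1 96.5%.
THRESHOLDS = {"recall": 0.95, "accuracy": 0.98, "f1": 0.965}

def meets_thresholds(metrics, thresholds=THRESHOLDS):
    """True only if every indicator is not less than its preset threshold."""
    return all(metrics[name] >= limit for name, limit in thresholds.items())

def train_until_acceptable(build_model, evaluate_model, update_samples,
                           samples, max_rounds=10):
    """Rebuild the model on updated second samples until it can be released."""
    for round_number in range(1, max_rounds + 1):
        model = build_model(samples)
        metrics = evaluate_model(model, samples)
        if meets_thresholds(metrics):
            return model, round_number  # release the classification model
        # Remove deleted samples and add new or modified ones, then retry.
        samples = update_samples(samples)
    raise RuntimeError("thresholds not reached within max_rounds")
```

Passing the building, evaluation and sample-update steps in as callables keeps the loop independent of any particular classification algorithm.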

As shown in Fig. 2, the electronic device 200 includes a processor 210 and a computer-readable storage medium 220. The electronic device 200 can execute the method described above with reference to Fig. 1 in order to perform data processing.

Specifically, the processor 210 may include, for example, a general-purpose microprocessor, an instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (for example, an application-specific integrated circuit (ASIC)). The processor 210 may also include on-board memory for caching purposes. The processor 210 may be a single processing unit or multiple processing units for performing the different actions of the method flow according to the embodiment of the present disclosure described with reference to Fig. 1.

The computer-readable storage medium 220 may be, for example, any medium that can contain, store, communicate, propagate or transport instructions. For example, the readable storage medium may include, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus, device or propagation medium. Specific examples of readable storage media include: magnetic storage devices, such as magnetic tape or hard disk (HDD); optical storage devices, such as optical disc (CD-ROM); memory, such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.

The computer-readable storage medium 220 may include a computer program 221, which may include code or computer-executable instructions that, when executed by the processor 210, cause the processor 210 to perform, for example, the method flow described above in connection with Fig. 1 and any variations thereof.

The computer program 221 may be configured to have computer program code that includes, for example, computer program modules. For example, in an exemplary embodiment, the code in the computer program 221 may include one or more program modules, for example module 221A, module 221B, and so on. It should be noted that the division and number of modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation. When these program modules or combinations are executed by the processor 210, the processor 210 performs, for example, the method flow described above in connection with Fig. 1 and any variations thereof.

According to an embodiment of the present disclosure, the computer-readable medium may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus or device. Also in the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transport a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, radio frequency signals, or any suitable combination of the above.

As shown in Fig. 3, the data processing system includes a clustering module 310, a sample determining module 320, a classification model generation module 330 and a classification model verification module 340.

Specifically, the clustering module 310 is configured to obtain the data to be processed, perform data cleaning on it, automatically extract the semantic features of the cleaned data, select an automatic clustering algorithm, and automatically cluster the data to be processed according to its semantic features to obtain N data categories.

The sample determining module 320 is configured to move and merge the N data categories produced by automatic clustering to obtain Y data categories, confirm one or more specific data categories among these Y data categories, obtain from the data to be processed an appropriate amount of data conforming to each specific data category as first samples, determine the keywords of each specific data category, match the first samples using keyword matching, and select the first samples that contain more keyword types and more keyword occurrences as the second samples.

The classification model generation module 330 is configured to extract the text content of the second samples, perform data cleaning on the text content, automatically extract the semantic features of the cleaned second samples, judge the fitness between the semantic features of the second samples and the specific data category according to preset rules, select the one or more semantic features of the second samples with the highest fitness as the most representative semantic features, select a classification algorithm, and generate a classification model from the most representative semantic features.

The classification model verification module 340 is configured to classify the second samples with the resulting classification model to obtain a classification result, calculate the matching degree of the classification model from the classification result, and, if the matching degree is less than the preset threshold, run the above modules again until the matching degree of the established classification model is not less than the preset threshold.

It can be understood that the clustering module 310, the sample determining module 320, the classification model generation module 330 and the classification model verification module 340 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the clustering module 310, the sample determining module 320, the classification model generation module 330 and the classification model verification module 340 may be implemented at least partly as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system in package, or an application-specific integrated circuit (ASIC); or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging circuits; or may be implemented in a suitable combination of software, hardware and firmware. Alternatively, at least one of these modules may be implemented at least partly as a computer program module which, when run by a computer, performs the functions of the corresponding module.

The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Those skilled in the art will understand that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in many ways, even if such combinations or integrations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims may be combined and/or integrated in many ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.

Although the present disclosure has been shown and described with reference to certain exemplary embodiments, those skilled in the art will understand that various changes in form and detail may be made to it without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by their equivalents.

Claims

1. A data processing method, characterized by comprising:

obtaining data, and clustering the data to obtain N data categories;

extracting M specific data categories from the N data categories;

obtaining, from the data, first samples that conform to the specific data categories;

determining one or more keywords for each specific data category;

screening the first samples according to the keywords to obtain second samples;

generating a classification model from the second samples and calculating the matching degree of the classification model, and, if the matching degree is less than a preset threshold, repeating the above operations until the matching degree of the established classification model is not less than the preset threshold.

2. The data processing method according to claim 1, characterized in that clustering the data further comprises:

extracting semantic features of the data;

selecting a clustering algorithm, and clustering the data according to the semantic features.

3. The data processing method according to claim 1, characterized in that screening the first samples according to the keywords further comprises:

matching the first samples against the keywords, and selecting the one or more first samples that contain the most keyword types and the most keyword occurrences.

4. The data processing method according to claim 1, characterized in that generating the classification model from the second samples further comprises:

extracting semantic features of the second samples;

judging, according to preset rules, the fitness between the semantic features of the second samples and the specific data category, and selecting the one or more semantic features of the second samples with the highest fitness;

generating the classification model from the selected one or more semantic features of the second samples.

5. The data processing method according to claim 1, characterized in that calculating the matching degree of the classification model further comprises:

classifying the second samples with the classification model to obtain a classification result;

calculating the matching degree of the classification model from the classification result.

6. The data processing method according to claim 5, characterized in that the matching degree is selected from one or more of accuracy, precision, recall, F1 value, classification report, confusion matrix, ROC curve, and area under the ROC curve.

7. The data processing method according to claim 1, characterized in that repeating the above operations further comprises:

removing second samples that have been deleted from the specific data category;

adding second samples that have been newly added to or modified in the specific data category;

updating the second samples, and generating a new classification model from the updated second samples.

8. A data processing electronic device, characterized by comprising:

a processor;

a memory storing a computer-executable program which, when executed by the processor, causes the processor to perform the data processing method according to any one of claims 1-7.

9. A data processing system, characterized in that the data processing system comprises:

a clustering module, configured to obtain data and cluster the data to obtain N data categories;

a sample determining module, configured to extract M specific data categories from the N data categories, obtain from the data first samples that conform to the specific data categories, determine one or more keywords for each specific data category, and screen the first samples according to the keywords to obtain second samples;

a classification model generation module, configured to generate a classification model from the second samples;

a classification model verification module, configured to calculate the matching degree of the classification model and, if the matching degree is less than a preset threshold, to run the above modules again until the matching degree of the established classification model is not less than the preset threshold.

10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the data processing method according to any one of claims 1-7.