CN109299279A - Data processing method, device, system, and medium - Google Patents
Data processing method, device, system, and medium
- Publication number
- CN109299279A CN109299279A CN201811450760.7A CN201811450760A CN109299279A CN 109299279 A CN109299279 A CN 109299279A CN 201811450760 A CN201811450760 A CN 201811450760A CN 109299279 A CN109299279 A CN 109299279A
- Authority
- CN
- China
- Prior art keywords
- data
- sample
- classification
- classification model
- data processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The present disclosure provides a data processing method, comprising: obtaining data and clustering it to obtain N data categories; extracting M specific data categories from the N data categories; obtaining, from the data, first samples that fit the specific data categories; determining one or more keywords for each specific data category; screening the first samples by those keywords to obtain second samples; generating a classification model from the second samples and calculating its matching degree; and, if the matching degree is less than a preset threshold, repeating the above operations until the matching degree of the resulting classification model is not less than the preset threshold. The present disclosure also provides a data processing device, system, and medium. By automatically clustering and classifying the data to be processed, the method determines a classification standard for that data and achieves its accurate classification.
Description
Technical field
The present disclosure relates to the field of data processing, and in particular to a data processing method, device, system, and medium.
Background technique
At present, enterprises mainly identify and classify disorganized data manually: an employee reads the file contents, infers the topics they express, defines several data categories accordingly, and then reads each file to be processed and assigns it to one of those categories.
Because each enterprise department operates independently while their businesses overlap across disciplines, most employees find it difficult to judge correctly which categories the data should be divided into; limited by individual knowledge and experience, manual classification also tends to place data in unsuitable categories. Moreover, the time and economic cost of manual identification are prohibitively high, making this approach impractical.
Summary of the invention
In view of the above problems, the present disclosure provides a data processing method, device, system, and medium. By automatically clustering and classifying the data to be processed at a specified location, the method determines a classification standard for that data, achieves its accurate classification, and reduces cost.
One aspect of the present disclosure provides a data processing method, comprising: obtaining data and clustering it to obtain N data categories; extracting M specific data categories from the N data categories; obtaining, from the data, first samples that fit the specific data categories; determining one or more keywords of each specific data category; screening the first samples by the keywords to obtain second samples; generating a classification model from the second samples and calculating its matching degree; and, if the matching degree is less than a preset threshold, repeating the above operations until the matching degree of the resulting classification model is not less than the preset threshold.
Optionally, clustering the data further includes: extracting semantic features of the data; and selecting a clustering algorithm to cluster the data according to the semantic features.
Optionally, screening the first samples by the keywords further includes: matching the first samples against the keywords, and filtering out the one or more first samples that contain the most keyword types and the highest keyword frequencies.
Optionally, generating a classification model from the second samples further includes: extracting semantic features of the second samples; judging, by preset rules, the fitness between the semantic features of the second samples and the specific data category; filtering out the semantic features of the one or more second samples with the highest fitness; and generating the classification model from those semantic features.
Optionally, calculating the matching degree of the classification model further includes: classifying the second samples with the classification model to obtain classification results; and calculating the matching degree of the classification model from the classification results.
Optionally, the matching degree is one or more of accuracy, precision, recall, F1 score, classification report, confusion matrix, ROC curve, and area under the ROC curve.
Optionally, repeating the above operations further includes: rejecting second samples that have been deleted from the specific data category; supplementing second samples newly added to or modified in the specific data category; and updating the second samples, then generating a new classification model from the updated second samples.
Another aspect of the present disclosure provides a data processing electronic device, comprising: a processor; and a memory storing a computer-executable program which, when executed by the processor, causes the processor to perform the above data processing method.
Another aspect of the present disclosure provides a data processing system, comprising: a clustering module for obtaining data and clustering it to obtain N data categories; a sample determination module for extracting M specific data categories from the N data categories, obtaining from the data first samples that fit the specific data categories, determining one or more keywords of each specific data category, and screening the first samples by the keywords to obtain second samples; a classification model generation module for generating a classification model from the second samples; and a classification model verification module for calculating the matching degree of the classification model and, if the matching degree is less than a preset threshold, repeating the above modules until the matching degree of the resulting classification model is not less than the preset threshold.
Another aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the above data processing method is implemented.
Brief description of the drawings
For a fuller understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically shows a flowchart of the data processing method provided according to an embodiment of the present disclosure.
Fig. 2 schematically shows a block diagram of an electronic device according to the present disclosure.
Fig. 3 schematically shows a block diagram of the data processing system of an embodiment of the present disclosure.
Detailed description of embodiments
Other aspects, advantages, and salient features of the present disclosure will become apparent to those skilled in the art from the following detailed description of exemplary embodiments taken in conjunction with the accompanying drawings.
In the present disclosure, the terms "include" and "comprise" and their derivatives are inclusive rather than limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below to explain the principles of the disclosure are illustrative only and should not be construed as limiting its scope in any way. The following description with reference to the drawings is intended to aid a comprehensive understanding of exemplary embodiments of the disclosure as defined by the claims and their equivalents. It includes various details to assist understanding, but these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. In addition, descriptions of well-known functions and structures are omitted for clarity and brevity, and the same reference numerals are used throughout the drawings for the same functions and operations.
A file server is a device that stores a large number of files and provides them to clients. The data processing method provided by embodiments of the present disclosure is illustrated by taking the file server of an enterprise client as an example. A file is one form of data, so the files in the embodiments of the present disclosure may be understood as data. The enterprise client's file server stores a disorganized heap of files that may cover categories such as economy, sports, medical care, law, military affairs, and energy, but which files belong to each category is still undetermined. The data processing method of the present disclosure generates a classification model for this disorganized heap of files, thereby classifying them automatically and accurately.
Fig. 1 schematically shows a flowchart of the data processing method provided according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes the following operations:
S1: obtain the data to be processed, automatically cluster it, and obtain N data categories.
First, specify the path of the files to be processed, extract the text content of the files under that path, clean the text content using data cleaning techniques, and automatically extract the semantic features of the cleaned files using feature engineering techniques.
Data cleaning is the process of re-examining and verifying data; its purpose is to delete duplicate information, correct existing errors, and ensure data consistency. For example, modal particles, adverbs, and prepositions that occur frequently but carry little practical meaning are filtered out and deleted, and the sentences in a file are segmented into individual words.
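The cleaning step described above could be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the tokenizer and the stopword list (standing in for the filtered particles, adverbs, and prepositions) are assumptions.

```python
# Hypothetical sketch of the data-cleaning step: segment text into word
# tokens and drop high-frequency, low-content words. The stopword list is
# an illustrative assumption.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are"}

def clean_text(text):
    """Split a document into lowercase word tokens and remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

cleaned = clean_text("The hospital is testing a new drug in the oncology ward.")
# cleaned -> ["hospital", "testing", "new", "drug", "oncology", "ward"]
```

A production system for Chinese text would additionally need a word segmenter, since sentences are not whitespace-delimited.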
Semantic features are words close to a document's topic. For a medical file, the semantic features may be disease, heart disease, tumor, health, medical instruments, and so on; for a legal file, they may be criminal law, civil law, copyright, people's court, labor arbitration, and so on.
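One simple way to extract such features automatically, sketched below under the assumption that plain term frequency is a reasonable stand-in for the patent's unspecified feature engineering (a real system would more likely use TF-IDF or a topic model):

```python
# Hypothetical sketch of automatic semantic-feature extraction: after
# cleaning, treat a document's most frequent remaining tokens as its
# semantic features. Term frequency is an illustrative assumption.
from collections import Counter

def semantic_features(tokens, k=3):
    """Return the k most frequent tokens as the document's features."""
    return [w for w, _ in Counter(tokens).most_common(k)]

tokens = ["tumor", "hospital", "tumor", "drug", "hospital", "tumor"]
feats = semantic_features(tokens, k=2)
# feats -> ["tumor", "hospital"]
```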
Then, an automatic clustering algorithm is selected and the files to be processed are automatically clustered according to their semantic features, yielding N data categories.
Automatic clustering uses a specific algorithm to map different files to different points in a feature vector space and, according to how those points aggregate, gathers the files into particular data categories. Taking the K-means algorithm as an example, given the number of data categories N as input, the files to be processed can be automatically clustered into N data categories denoted by numeric labels (e.g. 1, 2, 3, ..., N), where files within the same data category are highly similar and files in different data categories are not.
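The K-means grouping described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the 2-D vectors, the naive initialization, and the fixed iteration count are all simplifications of a real document-clustering pipeline.

```python
# Minimal K-means sketch: map items (here, document feature vectors) to
# points and gather them into k numbered clusters by nearest centroid.
import math

def kmeans(points, k, iters=10):
    centroids = points[:k]                      # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# two obvious groups in a 2-D feature space
pts = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.1, 9.9)]
groups = kmeans(pts, 2)
```

In practice the points would be high-dimensional TF-IDF vectors and a library implementation with better initialization (e.g. k-means++) would be preferred.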
S2: extract M specific data categories from the N data categories, and obtain from the data to be processed first samples that fit the specific data categories.
First, the N data categories produced by automatic clustering are adjusted by moving and merging files, yielding Y data categories denoted by word labels (e.g. economy, sports, medical care, law, military affairs, energy, ...).
Because the automatic clustering of the files may contain errors, the accuracy of the clustering result needs to be judged by manually inspecting the file names or contents of the files to be processed. For example, if the files in data category 1 mainly concern the all-round development of students and the files in data category 2 mainly concern books, manual inspection finds that both categories relate to education, so data categories 1 and 2 need to be merged. The clustering result is adjusted by such manual operations until it is optimal, yielding Y data categories with Y ≤ N, and the numeric label of each of the Y data categories is replaced by a word label according to the topic it expresses.
Then, the enterprise client confirms one or more specific data categories from the Y data categories according to its needs. For example, if medical care, military affairs, and energy are an enterprise client's key businesses, the client confirms those three data categories as its specific data categories.
Finally, for each specific data category, an appropriate number of files fitting that category are obtained from the files to be processed as first samples. For example, 1000 files closely related to medical care are obtained from the medical data category as its first samples, 1000 files closely related to military affairs from the military data category as its first samples, and 1000 files closely related to energy from the energy data category as its first samples.
S3: determine the keywords of each specific data category, and screen the first samples by those keywords to obtain second samples.
First, the keywords of each specific data category are determined. This operation is performed by the enterprise client, which chooses suitable keywords that best represent the content of the first samples. For example, the keywords of the medical data category may be determined as "hospital, surgery, drug, medical instrument, health, physical examination, disease, heart disease, autism, mental illness, AIDS, tumor, cancer, rehabilitation training"; the keywords of the military data category as "war, peace, military exercise, gun, weapon, nuclear weapon, conflict, situation, Middle East, Afghanistan, Iraq, Ukraine, fifth-generation fighter, drone, missile, aircraft carrier, the Pentagon, Korean Peninsula"; and the keywords of the energy data category as "new energy, petroleum, coal, natural gas, solar energy, resource, nuclear power plant, photovoltaics, clean energy, production capacity".
Then, using keyword matching, the first samples of each specific data category are matched against the keywords obtained, and the first samples that contain more keyword types and higher keyword frequencies are filtered out as the second samples.
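The keyword-screening step could be sketched as follows. The scoring rule (distinct keyword types first, total frequency as tie-break) is an assumption about how "most keyword types and highest keyword frequencies" might be operationalized, not the patent's stated formula.

```python
# Hypothetical sketch of keyword screening: score each first-sample
# document by its keyword coverage and keep the top scorers as second
# samples. Documents are pre-tokenized word lists.
def screen_samples(docs, keywords, top_n):
    def score(tokens):
        hits = [t for t in tokens if t in keywords]
        return (len(set(hits)), len(hits))      # (distinct types, total count)
    ranked = sorted(docs, key=score, reverse=True)
    return ranked[:top_n]

medical_kw = {"hospital", "surgery", "tumor", "drug"}
docs = [
    ["hospital", "surgery", "tumor"],           # 3 keyword types
    ["hospital", "hospital"],                   # 1 type, frequency 2
    ["weather", "sport"],                       # no keyword hits
]
second = screen_samples(docs, medical_kw, top_n=2)
# second -> the two best-matching documents
```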
S4: generate a classification model from the second samples and calculate its matching degree; if the matching degree is less than a preset threshold, repeat the above operations until the matching degree of the resulting classification model is not less than the preset threshold.
First, the text content of the second samples is extracted and cleaned using data cleaning techniques, and the semantic features of the cleaned second samples are automatically extracted using feature engineering techniques. These semantic features, obtained automatically and close to the topics of the second samples, overcome the incompleteness of manually selected semantic features.
Second, the fitness between the semantic features of the second samples and the specific data category is judged by preset rules, and the semantic features of the one or more second samples with the highest fitness are filtered out as the most representative semantic features; usually there are several most representative semantic features.
Fitness is the degree of correlation between a semantic feature of the second samples and the topic expressed by the specific data category: the higher the correlation, the higher the fitness. The preset rules may be rules drawn up from expert experience. For the medical data category, for example, among the six semantic features "disinfection, tumor, health, surgery, rehabilitation training, physical examination", suppose expert experience ranks their correlation with medical topics, from high to low, as "surgery, tumor, rehabilitation training, physical examination, health, disinfection". If the four semantic features with the highest fitness are to be selected as the most representative ones, they are "surgery, tumor, rehabilitation training, physical examination".
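The fitness-based selection just described amounts to a top-k ranking, sketched below. The numeric relevance scores are illustrative stand-ins for the expert-drafted preset rule; they mirror the medical example in the text.

```python
# Hypothetical sketch: rank a category's candidate semantic features by a
# hand-assigned relevance score (the "preset rule") and keep the top k as
# the most representative features. Scores are illustrative assumptions.
def top_features(relevance, k):
    """relevance: feature -> relevance score under the preset rule."""
    return sorted(relevance, key=relevance.get, reverse=True)[:k]

medical_relevance = {
    "surgery": 6, "tumor": 5, "rehabilitation training": 4,
    "physical examination": 3, "health": 2, "disinfection": 1,
}
best = top_features(medical_relevance, k=4)
# best -> ["surgery", "tumor", "rehabilitation training", "physical examination"]
```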
Then, a classification algorithm is selected (e.g. naive Bayes, decision tree, random forest, or a support vector machine), and a classification model is generated from the most representative semantic features obtained. The second samples are imported and classified by the resulting model to obtain classification results; the classification results are compared with the expected results (i.e. the second samples of each specific data category above) to calculate the matching degree of the classification model.
The matching degree is one or more of accuracy, precision, recall, F1 score, classification report, confusion matrix, ROC curve, and area under the ROC curve. Accuracy is the proportion of correctly classified samples among all samples in the classification result. Precision is the proportion of retrieved samples in the classification result that are correct. Recall is the proportion of all correct samples that are retrieved. The F1 score is the weighted harmonic mean of precision and recall, an evaluation index combining both. The classification report presents precision, recall, and F1 score together. The confusion matrix counts, for each class, the observations the classification model assigns to the wrong class and to the right class, and displays the results in matrix form. The ROC curve is a comprehensive indicator reflecting recall and specificity as continuous variables; the larger the area under the ROC curve, the more effective the classification model.
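The three core metrics named above can be computed directly from predicted versus expected labels, as in this sketch (a real system might call a metrics library instead; the label values here are illustrative):

```python
# Sketch of the matching-degree metrics: precision, recall, and F1 for one
# positive class, computed from expected vs. predicted labels.
def precision_recall_f1(expected, predicted, positive):
    tp = sum(1 for e, p in zip(expected, predicted) if e == p == positive)
    fp = sum(1 for e, p in zip(expected, predicted) if e != positive and p == positive)
    fn = sum(1 for e, p in zip(expected, predicted) if e == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

expected  = ["med", "med", "mil", "med"]
predicted = ["med", "mil", "mil", "med"]
p, r, f1 = precision_recall_f1(expected, predicted, positive="med")
# p -> 1.0, r -> 2/3, f1 -> 0.8
```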
Finally, the matching degree of the classification model is compared with the preset threshold; if the matching degree is less than the preset threshold, the above operations are repeated until the matching degree of the resulting classification model is not less than the preset threshold.
Taking a matching degree consisting of recall, accuracy, and F1 score as an example, suppose the preset thresholds are 95% for recall, 98% for accuracy, and 96.5% for F1. When the model's recall is not less than 95%, its accuracy not less than 98%, and its F1 score not less than 96.5%, the classification model is published and used to perform data classification. Otherwise, the above operations are repeated: second samples deleted from the specific data category are rejected, second samples newly added to or modified in the specific data category are supplemented, and the second samples are updated; a new classification model is generated from the updated second samples, until the new model's recall is not less than 95%, its accuracy not less than 98%, and its F1 score not less than 96.5%, at which point the model is published.
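The iterate-until-threshold loop above could be sketched as follows. The `build_model`, `evaluate`, and `update` callables are hypothetical placeholders for the steps described in S1–S4; the toy stand-ins simply raise every metric a little each round so the loop terminates.

```python
# Sketch of the refinement loop: rebuild the classification model from
# updated second samples until every metric meets its preset threshold.
def refine_until_matched(samples, build_model, evaluate, update, thresholds,
                         max_rounds=20):
    for _ in range(max_rounds):
        model = build_model(samples)
        scores = evaluate(model)                 # e.g. {"recall": 0.96, ...}
        if all(scores[m] >= t for m, t in thresholds.items()):
            return model                         # ready to publish
        samples = update(samples)                # reject deleted / add new samples
    raise RuntimeError("model never reached the preset thresholds")

# toy stand-ins: each update round raises every metric by 0.02
model = refine_until_matched(
    samples=0,
    build_model=lambda s: s,
    evaluate=lambda m: {"recall": 0.90 + m, "accuracy": 0.95 + m, "f1": 0.93 + m},
    update=lambda s: s + 0.02,
    thresholds={"recall": 0.95, "accuracy": 0.98, "f1": 0.965},
)
```

The `max_rounds` guard is an added safety assumption; the patent text describes the loop as running until the thresholds are met.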
As shown in Fig. 2, the electronic device 200 includes a processor 210 and a computer-readable storage medium 220. The electronic device 200 can execute the method described above with reference to Fig. 1 to perform message processing.
Specifically, the processor 210 may include, for example, a general-purpose microprocessor, an instruction set processor and/or a related chipset, and/or a special-purpose microprocessor (e.g. an application-specific integrated circuit, ASIC), and may also include onboard memory for caching. The processor 210 may be a single processing unit or multiple processing units for executing the different actions of the method flow described with reference to Fig. 1 according to embodiments of the present disclosure.
The computer-readable storage medium 220 may be, for example, any medium that can contain, store, communicate, propagate, or transport instructions, including but not limited to electric, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, devices, or propagation media. Specific examples of a readable storage medium include: magnetic storage devices such as magnetic tape or hard disk drives (HDD); optical storage devices such as compact discs (CD-ROM); memories such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 220 may include a computer program 221, which may include code/computer-executable instructions that, when executed by the processor 210, cause the processor 210 to perform, for example, the method flow described above in conjunction with Fig. 1 and any variation thereof.
The computer program 221 may be configured with computer program code, for example comprising computer program modules. For instance, in an exemplary embodiment, the code in the computer program 221 may include one or more program modules, for example module 221A, module 221B, and so on. It should be noted that the division and number of modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, and when these combinations are executed by the processor 210, the processor 210 performs, for example, the method flow described above in conjunction with Fig. 1 and any variation thereof.
According to embodiments of the present disclosure, a computer-readable medium may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of a computer-readable storage medium include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, or radio frequency signals, or any suitable combination of the above.
Fig. 3 schematically shows a block diagram of the data processing system of an embodiment of the present disclosure.
As shown in Fig. 3, the data processing system includes a clustering module 310, a sample determination module 320, a classification model generation module 330, and a classification model verification module 340.
Specifically, the clustering module 310 is used to obtain the data to be processed, clean it, automatically extract the semantic features of the cleaned data, select an automatic clustering algorithm, and automatically cluster the data to be processed according to its semantic features to obtain N data categories.
The sample determination module 320 is used to move and merge the N data categories after automatic clustering to obtain Y data categories, confirm one or more specific data categories from the Y data categories, obtain from the data to be processed an appropriate amount of data fitting the specific data categories as first samples, determine the keywords of each specific data category, match the first samples using keyword matching, and filter out the first samples containing more keyword types and higher keyword frequencies as the second samples.
The classification model generation module 330 is used to extract the text content of the second samples, clean it, automatically extract the semantic features of the cleaned second samples, judge by preset rules the fitness between the semantic features of the second samples and the specific data category, filter out the semantic features of the one or more second samples with the highest fitness as the most representative semantic features, select a classification algorithm, and generate a classification model from the most representative semantic features.
The classification model verification module 340 is used to classify the second samples with the resulting classification model to obtain classification results, and to calculate the matching degree of the classification model from the classification results; if the matching degree is less than the preset threshold, the above modules are repeated until the matching degree of the resulting classification model is not less than the preset threshold.
It will be understood that the clustering module 310, the sample determination module 320, the classification model generation module 330, and the classification model verification module 340 may be combined into one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the clustering module 310, the sample determination module 320, the classification model generation module 330, and the classification model verification module 340 may be implemented at least partly as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package, or an application-specific integrated circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or as an appropriate combination of software, hardware, and firmware implementations. Alternatively, at least one of these modules may be implemented at least partly as a computer program module which, when run by a computer, performs the function of the corresponding module.
The flowcharts and block diagrams in the drawings illustrate the possible architecture, functions, and operation of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that shown in the drawings. For example, two successive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Those skilled in the art will understand that the features recited in the various embodiments and/or claims of the present disclosure may be combined in multiple ways, even if such combinations are not expressly recited in the disclosure. In particular, without departing from the spirit or teaching of the disclosure, the features recited in its various embodiments and/or claims may be combined in multiple ways, and all such combinations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the spirit and scope of the disclosure as defined by the following claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by their equivalents.
Claims (10)
1. A data processing method, comprising:
obtaining data, and clustering the data to obtain N data categories;
extracting M specific data categories from the N data categories;
obtaining, from the data, a first sample that conforms to the specific data categories;
determining one or more keywords of each specific data category;
screening the first sample according to the keywords to obtain a second sample; and
generating a classification model according to the second sample, and calculating a matching degree of the classification model; if the matching degree is less than a preset threshold, repeating the above operations until the matching degree of the established classification model is not less than the preset threshold.
2. The data processing method according to claim 1, wherein clustering the data further comprises:
extracting semantic features of the data; and
selecting a clustering algorithm, and clustering the data according to the semantic features.
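Claim 2 leaves the choice of clustering algorithm open. One common (but here purely illustrative) way to "select" one is to score candidates with the silhouette coefficient; synthetic points stand in for the extracted semantic features.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the extracted semantic features.
features, _ = make_blobs(n_samples=60, centers=3, random_state=0)

candidates = [KMeans(n_clusters=3, n_init=10, random_state=0),
              AgglomerativeClustering(n_clusters=3)]
# Pick the candidate whose clustering scores best on the features.
best = max(candidates,
           key=lambda algo: silhouette_score(features, algo.fit_predict(features)))
print(type(best).__name__)
```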
3. The data processing method according to claim 1, wherein screening the first sample according to the keywords further comprises:
matching the first sample against the keywords, and filtering out the one or more first samples that contain the most keyword types and the most keyword occurrences.
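The claim-3 screen — keep the first samples with the most distinct keyword types and the most total keyword occurrences — can be sketched in a few lines. The helper name and toy inputs are our own illustration:

```python
def screen_first_sample(samples, keywords, top_k=2):
    """Rank samples by (distinct keywords matched, total keyword hits)."""
    def score(text):
        tokens = text.lower().split()
        kinds = sum(1 for k in keywords if k in tokens)  # distinct keyword types
        hits = sum(tokens.count(k) for k in keywords)    # total occurrences
        return (kinds, hits)
    return sorted(samples, key=score, reverse=True)[:top_k]

second = screen_first_sample(
    ["free prize free", "free pills prize now", "meeting agenda"],
    keywords={"free", "prize", "pills"},
)
print(second)  # ['free pills prize now', 'free prize free']
```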
4. The data processing method according to claim 1, wherein generating the classification model according to the second sample further comprises:
extracting semantic features of the second sample;
judging the fitness between the semantic features of the second sample and the specific data categories according to a preset rule, and filtering out the semantic features of the one or more second samples with the highest fitness; and
generating the classification model according to the semantic features of the one or more second samples.
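The patent does not specify the "preset rule" for judging fitness in claim 4. One plausible stand-in is statistical feature selection; the sketch below uses a chi-squared filter to keep the features most associated with the categories before fitting the model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free prize now", "cheap pills free", "meeting agenda", "status update meeting"]
y = [1, 1, 0, 0]  # toy category labels

# "Fitness" is judged here by the chi-squared statistic between feature and
# category; only the k best-fitting features reach the classifier.
model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=4), MultinomialNB())
model.fit(texts, y)
print(model.predict(["free pills prize"])[0])  # 1
```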
5. The data processing method according to claim 1, wherein calculating the matching degree of the classification model further comprises:
classifying the second sample using the classification model to obtain a classification result; and
calculating the matching degree of the classification model according to the classification result.
6. The data processing method according to claim 5, wherein the matching degree is one or more selected from accuracy, precision, recall, F1 score, classification report, confusion matrix, ROC curve, and area under the ROC curve.
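The matching-degree candidates listed in claim 6 all have direct counterparts in scikit-learn's metrics module; with made-up toy labels and scores:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 1, 0]   # reference labels (toy values)
y_pred  = [1, 1, 0, 0, 0, 1, 1, 0]   # model's hard predictions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.6, 0.3]  # predicted probabilities

print(accuracy_score(y_true, y_pred))             # 0.75
print(precision_score(y_true, y_pred))            # 0.75
print(recall_score(y_true, y_pred))               # 0.75
print(f1_score(y_true, y_pred))                   # 0.75
print(confusion_matrix(y_true, y_pred).tolist())  # [[3, 1], [1, 3]]
print(roc_auc_score(y_true, y_score))             # 0.875 (area under ROC)
```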
7. The data processing method according to claim 1, wherein repeating the above operations further comprises:
rejecting second samples that have been deleted from the specific data categories;
supplementing second samples that have been newly added to, or modified in, the specific data categories; and
updating the second sample, and generating a new classification model according to the updated second sample.
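The sample refresh in claim 7 amounts to a set update before retraining; a hypothetical helper (name and data are illustrative):

```python
def update_second_sample(second_sample, deleted, added):
    """Drop deleted items, then merge in newly added or modified ones."""
    kept = [s for s in second_sample if s not in deleted]
    return kept + [s for s in added if s not in kept]

updated = update_second_sample(["a", "b", "c"], deleted={"b"}, added=["d", "c"])
print(updated)  # ['a', 'c', 'd']
```

The new classification model would then be generated from `updated`.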
8. A data processing electronic device, comprising:
a processor; and
a memory storing a computer-executable program that, when executed by the processor, causes the processor to perform the data processing method according to any one of claims 1-7.
9. A data processing system, comprising:
a clustering module, configured to obtain data and cluster the data to obtain N data categories;
a sample determination module, configured to extract M specific data categories from the N data categories, obtain from the data a first sample that conforms to the specific data categories, determine one or more keywords of each specific data category, and screen the first sample according to the keywords to obtain a second sample;
a classification model generation module, configured to generate a classification model according to the second sample; and
a classification model verification module, configured to calculate a matching degree of the classification model and, if the matching degree is less than a preset threshold, repeatedly execute the above modules until the matching degree of the established classification model is not less than the preset threshold.
10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the data processing method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811450760.7A CN109299279B (en) | 2018-11-29 | 2018-11-29 | Data processing method, device, system and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299279A true CN109299279A (en) | 2019-02-01 |
CN109299279B CN109299279B (en) | 2020-08-21 |
Family
ID=65142066
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811450760.7A Active CN109299279B (en) | 2018-11-29 | 2018-11-29 | Data processing method, device, system and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299279B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101477633A (en) * | 2009-01-21 | 2009-07-08 | 北京大学 | Method for automatically estimating visual significance of image and video |
CN102693452A (en) * | 2012-05-11 | 2012-09-26 | 上海交通大学 | Multiple-model soft-measuring method based on semi-supervised regression learning |
CN104102700A (en) * | 2014-07-04 | 2014-10-15 | 华南理工大学 | Categorizing method oriented to Internet unbalanced application flow |
US20170161255A1 (en) * | 2015-12-02 | 2017-06-08 | Abbyy Infopoisk Llc | Extracting entities from natural language texts |
CN106951925A (en) * | 2017-03-27 | 2017-07-14 | 成都小多科技有限公司 | Data processing method, device, server and system |
US20180052817A1 (en) * | 2016-08-22 | 2018-02-22 | International Business Machines Corporation | Syntactic classification of natural language sentences with respect to a targeted element |
CN108595585A (en) * | 2018-04-18 | 2018-09-28 | 平安科技(深圳)有限公司 | Sample data sorting technique, model training method, electronic equipment and storage medium |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111078984A (en) * | 2019-11-05 | 2020-04-28 | 深圳奇迹智慧网络有限公司 | Network model publishing method and device, computer equipment and storage medium |
CN111078984B (en) * | 2019-11-05 | 2024-02-06 | 深圳奇迹智慧网络有限公司 | Network model issuing method, device, computer equipment and storage medium |
CN111091915A (en) * | 2019-12-24 | 2020-05-01 | 医渡云(北京)技术有限公司 | Medical data processing method and device, storage medium and electronic equipment |
CN113031877A (en) * | 2021-04-12 | 2021-06-25 | 中国移动通信集团陕西有限公司 | Data storage method, device, equipment and medium |
CN113031877B (en) * | 2021-04-12 | 2024-03-08 | 中国移动通信集团陕西有限公司 | Data storage method, device, equipment and medium |
CN113626385A (en) * | 2021-07-07 | 2021-11-09 | 厦门市美亚柏科信息股份有限公司 | Method and system based on text data reading |
CN113626385B (en) * | 2021-07-07 | 2022-07-15 | 厦门市美亚柏科信息股份有限公司 | Method and system based on text data reading |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299279A (en) | A kind of data processing method, equipment, system and medium | |
Wu et al. | Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation | |
Hon et al. | Deep learning classification in asteroseismology | |
CN107908635A (en) | Establish textual classification model and the method, apparatus of text classification | |
Ramakrishna et al. | Homogeneous adaboost ensemble machine learning algorithms with reduced entropy on balanced data | |
CN108804718A (en) | Data push method, device, electronic equipment and computer readable storage medium | |
CN105640577A (en) | Method and system automatically detecting local lesion in radiographic image | |
Nanni et al. | Ensemble of deep learning, visual and acoustic features for music genre classification | |
CN109800781A (en) | A kind of image processing method, device and computer readable storage medium | |
CN107463605A (en) | The recognition methods and device of low-quality News Resources, computer equipment and computer-readable recording medium | |
CN108960264A (en) | The training method and device of disaggregated model | |
CN109887562A (en) | The similarity of electronic health record determines method, apparatus, equipment and storage medium | |
CN109684476A (en) | A kind of file classification method, document sorting apparatus and terminal device | |
CN111338897A (en) | Identification method of abnormal node in application host, monitoring equipment and electronic equipment | |
CN106529110A (en) | Classification method and equipment of user data | |
van Tulder et al. | Learning features for tissue classification with the classification restricted Boltzmann machine | |
CN109948680A (en) | The classification method and system of medical record data | |
CN110532352A (en) | Text duplicate checking method and device, computer readable storage medium, electronic equipment | |
CN115858886B (en) | Data processing method, device, equipment and readable storage medium | |
CN113569895A (en) | Image processing model training method, processing method, device, equipment and medium | |
Barbara et al. | Classifying Kepler light curves for 12 000 A and F stars using supervised feature-based machine learning | |
JP2023532292A (en) | Machine learning based medical data checker | |
Cao et al. | Supervised contrastive pre-training for mammographic triage screening models | |
CN114662477A (en) | Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium | |
Liu et al. | Automated ICD coding using extreme multi-label long text transformer-based models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100088 Building 3 332, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing
Applicant after: Qianxin Technology Group Co., Ltd.
Address before: Beijing Chaoyang District Jiuxianqiao Road 10, building 15, floor 17, layer 1701-26, 3
Applicant before: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.
GR01 | Patent grant | ||