CN103729428B - Big data classification method and system - Google Patents
Big data classification method and system
- Publication number: CN103729428B (application CN201310727192.1A)
- Authority: CN (China)
- Legal status: Active
Abstract
The invention discloses a big data classification method and system. The method includes a training step, in which input data is divided into input data blocks, classification rules of the form {pattern string => class label} are generated from the blocks, and the rules are written into an HBase rule table; and a testing step, in which the input data blocks are read, pattern strings to be classified are constructed, matching classification rules are looked up in the HBase rule table, and classification results are output. The invention thus provides hypersurface-based big data classification: a hypersurface covering algorithm is implemented on top of the Hadoop map/reduce programming framework and the HBase distributed non-relational database, so that an easily interpreted rule model can be built at low computational cost, big data can be processed quickly and efficiently, and the real-world need to classify explosively growing data is met.
Description
Technical field
The present invention relates to the field of big data analysis, and more particularly to a hypersurface-based big data classification method and system.
Background technology
Classification is an important form of data analysis used to extract models that describe significant data classes. Such a model, called a classifier, is used to predict class labels. Data classification is a two-stage process comprising a learning stage and a classification stage: the learning stage builds the classification model, and the classification stage uses the model to predict the class labels of given data. For example, a classification model can be built to label bank loan applications as safe or risky; such analysis helps us understand the data more comprehensively. Many classification and prediction methods come from machine learning, pattern recognition, and statistics. Most of these are memory-resident algorithms that assume the data volume is small. Classification has wide applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.
Many methods exist for solving classification problems. Single-classifier methods mainly include decision trees, Bayesian classifiers, artificial neural networks, k-nearest neighbours, support vector machines, and association-rule-based classification; in addition, ensemble learning algorithms such as bagging and boosting combine multiple single classifiers.
(1) Decision trees
The decision tree is one of the main techniques for classification and prediction. Decision tree learning is an inductive learning algorithm based on examples: from a set of unordered, random examples it infers classification rules represented as a decision tree. The purpose of constructing a decision tree is to discover the relationship between attributes and classes, and then to use it to predict the class of future records of unknown class. It works in a top-down recursive fashion: attributes are compared at the internal nodes of the tree, branches descend from each node according to the attribute values, and conclusions are reached at the leaf nodes.
The main decision tree algorithms include ID3, C4.5 (C5.0), CART, PUBLIC, SLIQ, and SPRINT. They differ in the technique used to select test attributes, the structure of the generated tree, the pruning method and its timing, and the ability to handle large data sets.
(2) Bayesian classifiers
Bayesian classification algorithms are a class of algorithms that classify using probability statistics, such as the naive Bayes algorithm. These algorithms use Bayes' theorem to predict the probability that a sample of unknown class belongs to each class, and select the most probable class as the final class of the sample. Because the naive Bayes model itself requires a very strong conditional independence assumption, which often does not hold in practice, its classification accuracy can degrade. This has led to Bayesian classification algorithms that relax the independence assumption, such as the TAN algorithm, which adds associations between attribute pairs on top of the naive Bayes network structure.
(3) Artificial neural networks
An artificial neural network is a mathematical model for information processing whose structure resembles the synaptic connections of the brain. In this model, a large number of nodes (also called "neurons" or "units") are interconnected to form a network, i.e. a "neural network", in order to process information. A neural network usually needs to be trained; training is the process by which the network learns. Training changes the values of the network's connection weights so that it acquires the ability to classify, and the trained network can then be used for object recognition.
At present, there are hundreds of different neural network models, including back-propagation networks, radial basis function networks, Hopfield networks, stochastic neural networks, and competitive neural networks. However, current neural networks still commonly suffer from slow convergence, heavy computation, long training time, and lack of interpretability.
(4) k-nearest neighbours
The k-nearest neighbour algorithm is an instance-based classification method: find the k training samples closest to an unknown sample x, see which class holds the majority among these k samples, and assign x to that class. The k-nearest neighbour method is a lazy learning method: it stores the samples and classifies only when a classification is needed. If the sample set is complex, this can incur a very large computational cost, so the method cannot be applied where real-time requirements are strict.
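The "find the k nearest samples, then take a majority vote" procedure described above can be sketched in a few lines of Python (a minimal illustration; the toy samples and the choice k=3 are not from the patent):

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    # training is a list of (vector, label) pairs
    nearest = sorted(training, key=lambda s: math.dist(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D samples: three of class "A" near the origin, two of class "B" far away
samples = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
           ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify((0.5, 0.5), samples))  # all 3 nearest neighbours are "A"
```

Note that all training samples must be kept in memory and every query scans them all, which is exactly the computational-cost drawback the passage points out.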
(5) Support vector machines
The support vector machine is a learning method proposed by Vapnik based on statistical learning theory. Its most notable feature is that, following the structural risk minimization principle, it constructs an optimal separating hyperplane that maximizes the class margin, improving the generalization ability of the learning machine and handling nonlinear, high-dimensional, and local-minimum problems well. For a classification problem, the support vector machine algorithm computes a decision surface from the samples in a region, which then determines the class of unknown samples in that region.
(6) Association-rule-based classification
Association rule mining is an important research area in data mining. In recent years, scholars have extensively studied how to apply association rule mining to classification problems. Associative classification methods mine rules of the form condset → C, where condset is a set of items (or attribute-value pairs) and C is a class label; rules of this form are called class association rules. Associative classification typically consists of two steps: first, an association rule mining algorithm mines from the training data all class association rules satisfying given support and confidence thresholds; second, a heuristic is used to select from the mined class association rules a group of high-quality rules for classification.
(7) Ensemble learning
The complexity of practical applications and the diversity of data often make a single classification method insufficiently effective. Scholars have therefore extensively studied the fusion of multiple classification methods, i.e. ensemble learning. Ensemble learning has become a research hotspot in the international machine learning community and is regarded as one of the four main research directions of current machine learning.
Ensemble learning is a machine learning paradigm that repeatedly invokes a single learning algorithm to obtain different base learners and then combines these learners according to some rule to solve the same problem, which can significantly improve the generalization ability of the learning system. Multiple base learners are mainly combined by (weighted) voting; common algorithms include bagging and boosting.
Because ensemble learning averages multiple classifiers through voting, it can reduce the error of an individual classifier and obtain a model that represents the problem space more accurately, thereby improving classification accuracy.
The main criteria for comparing and evaluating classification methods are: (1) prediction accuracy — the ability of the model to correctly predict the class labels of new samples; (2) computation speed — the time needed to construct the model and to classify with it; (3) robustness — the ability of the model to predict correctly on noisy data or data with missing values; (4) scalability — the ability to construct the model efficiently for very large data sets; (5) conciseness and interpretability of the model description — the more concise and understandable the model description, the more welcome it is.
Judged by these criteria, currently popular classification algorithms have the following problems. Decision tree algorithms are widely applied, conceptually simple, and easy to implement, but because of the tree data structure inherent to decision trees, machine memory becomes the algorithm's bottleneck and large-scale data cannot be processed; moreover, pruning before or after tree construction faces criteria that are hard to determine and high algorithmic complexity. Bayesian classification is also a common classification algorithm, but the naive Bayes model requires a very strong conditional independence assumption that often does not hold in practice, which can seriously harm classification accuracy. The training process of artificial neural networks is extremely complex and commonly suffers from slow convergence, heavy computation, long training time, and lack of interpretability. The k-nearest neighbour algorithm is a lazy learning algorithm: it stores samples and classifies only when needed, without a deliberate separation of training and recognition, but if the sample set is complex it incurs a very large computational cost and cannot handle massive data. Support vector machines construct an optimal separating hypersurface by maximizing the class margin and can classify nonlinear, high-dimensional data, but solving for the separating hyperplane is difficult and the algorithmic complexity is high, making it hard to adapt to large volumes of real-time data. Association-rule-based classification and ensemble learning methods are comparatively recent and still exploratory, and likewise commonly face high algorithmic complexity, difficult convergence, and uncertain classification effectiveness.
Content of the invention
To solve the above problems, an object of the present invention is to provide a hypersurface-based big data classification method and system that address the problems of the prior art described above: unstable prediction accuracy, high computational cost, slow speed, high model complexity, poor interpretability, and inability to process massive data. The method uses a hypersurface-based covering algorithm that can be implemented on top of the Hadoop map/reduce programming framework and the HBase distributed non-relational database, builds an easily interpreted rule model at relatively low computational cost, and processes massive data quickly and efficiently to meet the real-world need to classify explosively growing data.
To achieve the above object, the big data classification method proposed by the invention is characterized in that it comprises the following steps:
a training step, comprising a first map/reduce step executed over multiple rounds, for dividing input data into input data blocks, generating from the input data blocks classification rules of the form {pattern string => class label}, and writing the classification rules into an HBase rule table;
a testing step, comprising one second map/reduce step, for reading the input data blocks, constructing pattern strings to be classified, looking up in the HBase rule table the classification rules matching the pattern strings to be classified, and outputting classification results.
In the big data classification method of the invention, the first map/reduce step specifically comprises one or more first mapping steps and one reduction step. The first mapping step divides the input data into input data blocks of fixed size, reads each input data block line by line, constructs a pattern string by taking the first l digits of each dimension in turn, and generates from the input data block key-value pairs <pattern string, class label>, where l is the current round number. The reduction step merges the key-value pairs into items <pattern string, list<class label>> and judges whether each item is pure: if pure, a rule is written into the HBase rule table; otherwise control returns to the first mapping step. "Pure" means that the percentage of occurrences of some class label in list<class label> reaches a user-set threshold.
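One round of this first map/reduce step can be sketched in plain Python (a sketch under stated assumptions: records are given as per-dimension digit strings with a class label, the HBase write is replaced by a dict, and the position-major ordering of digits in the pattern string is inferred from the embodiment later in the document):

```python
from collections import defaultdict

def map_step(records, l):
    """Emit <pattern string, class label>: the first l digits of every dimension,
    concatenated digit position by digit position."""
    for dims, label in records:
        yield "".join(d[p] for p in range(l) for d in dims), label

def reduce_step(pairs, purity=0.9):
    """Merge pairs into <pattern, list<label>> items; emit a rule when one
    label's share reaches the purity threshold, else mark the item impure."""
    groups = defaultdict(list)
    for pattern, label in pairs:
        groups[pattern].append(label)
    rules, impure = {}, []
    for pattern, labels in groups.items():
        top = max(set(labels), key=labels.count)
        if labels.count(top) / len(labels) >= purity:
            rules[pattern] = top      # would be written to the HBase rule table
        else:
            impure.append(pattern)    # re-mapped in the next round with l + 1
    return rules, impure

data = [(["02", "08"], "T"), (["02", "09"], "B"), (["02", "09"], "B")]
print(reduce_step(map_step(data, 1)))  # round 1: "00" is only 2/3 "B" -> impure
print(reduce_step(map_step(data, 2)))  # round 2: "0028" and "0029" are pure
```

Each round refines only the impure pattern strings by taking one more digit per dimension, mirroring the loop between the mapping and reduction steps in the claim.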
In the big data classification method of the invention, the specific method of storing a rule in the HBase rule table is: judge whether the classification rule already exists in the HBase rule table; if not, store the classification rule in the table; if it exists, store the classification rule by overwriting the existing one.
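The insert-or-overwrite storage just described amounts to an upsert; a minimal sketch, with a plain dict standing in for the HBase rule table (`store_rule` is an illustrative helper, not named in the patent):

```python
def store_rule(rule_table, pattern, label):
    """Upsert a classification rule {pattern => label}: insert it if absent,
    otherwise overwrite the existing rule (the 'covering' update)."""
    previous = rule_table.get(pattern)   # None if the rule did not exist yet
    rule_table[pattern] = label
    return previous

table = {}
store_rule(table, "0000001000100000", "T")          # new rule inserted
old = store_rule(table, "0000001000100000", "M")    # later data overwrite "T"
print(table, old)  # {'0000001000100000': 'M'} T
```

Overwriting rather than versioning keeps exactly one rule per pattern string, which is what lets incremental data update the rule base in place.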
In the big data classification method of the invention, the HBase rule table stores the classification rules in column-oriented fashion.
In the big data classification method of the invention, the second map/reduce step specifically comprises one or more second mapping steps, which read the input data block line by line, construct a pattern string to be classified by taking the first m digits of each dimension in turn (m being a positive integer), and look up in the HBase rule table whether a classification rule matches the pattern string to be classified, until a termination condition is met and a classification result is output. The termination condition is any one of: a matching classification rule is found; the number of digits taken reaches the maximum precision of the input data; or the number of digits taken reaches a user-input threshold.
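The precision-growing lookup of the testing step can be sketched as follows (a sketch, not the patent's implementation: the rule table is a dict, and the position-major digit ordering is assumed to match the training side as inferred from the embodiment):

```python
def classify(dims, rule_table, max_m=None):
    """Grow the pattern-string precision m until a rule matches or a
    termination condition is hit (max input precision / user threshold)."""
    max_m = max_m or min(len(d) for d in dims)   # default: input data precision
    for m in range(1, max_m + 1):
        # digit 1 of every dimension, then digit 2, ... up to digit m
        key = "".join(d[p] for p in range(m) for d in dims)
        if key in rule_table:
            return rule_table[key]
    return None  # no rule matched at any precision

rules = {"0028": "T", "0029": "B"}
print(classify(["02", "09"], rules))  # no match at m=1, matches "0029" at m=2
```

Starting from low precision means a coarse rule covering a whole region can answer the query early, without examining every digit of the sample.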
The invention also proposes a big data classification system, characterized in that the system comprises the following modules:
a training module, comprising a first map/reduce module executed over multiple rounds, for dividing input data into input data blocks, generating from the input data blocks classification rules of the form {pattern string => class label}, and writing the classification rules into an HBase rule table;
a test module, comprising one second map/reduce module, for reading the input data blocks, constructing pattern strings to be classified, looking up in the HBase rule table the classification rules matching the pattern strings to be classified, and outputting classification results.
In the big data classification system of the invention, the first map/reduce module specifically comprises one or more first mapping modules and one reduction module. The first mapping module divides the input data into input data blocks of fixed size, reads each input data block line by line, constructs a pattern string by taking the first l digits of each dimension in turn, and generates from the input data block key-value pairs <pattern string, class label>, where l is the current round number. The reduction module merges the key-value pairs into items <pattern string, list<class label>> and judges whether each item is pure: if pure, a rule is written into the HBase rule table; otherwise control returns to the first mapping module. "Pure" means that the percentage of occurrences of some class label in list<class label> reaches a user-set threshold.
In the big data classification system of the invention, the specific method of storing a rule in the HBase rule table is: judge whether the classification rule already exists in the HBase rule table; if not, store the classification rule in the table; if it exists, store the classification rule by overwriting the existing one.
In the big data classification system of the invention, the HBase rule table stores the classification rules in column-oriented fashion.
In the big data classification system of the invention, the second map/reduce module specifically comprises one or more second mapping modules, which read the input data block line by line, construct a pattern string to be classified by taking the first m digits of each dimension in turn (m being a positive integer), and look up in the HBase rule table whether a classification rule matches the pattern string to be classified, until a termination condition is met and a classification result is output. The termination condition is any one of: a matching classification rule is found; the number of digits taken reaches the maximum precision of the input data; or the number of digits taken reaches a user-input threshold.
The present invention has the following advantages:
(1) The mathematical basis of the hypersurface-based classification algorithm realized by the invention is the Jordan curve theorem from topology: a curve in the plane that does not intersect itself is called a Jordan curve. A closed (end-to-end) Jordan curve in the plane divides the plane into two regions; if a point is taken in each of the two regions and the two points are connected by a curve, that curve must intersect the original closed Jordan curve. A corollary: take any point in space as the starting point of a ray; if the ray has an odd number of intersections with the closed Jordan curve, the point is said to lie inside the closed space enclosed by the curve; if the number of intersections is even, the point lies outside the closed space. The same holds for high-dimensional data. By this theory, the space partitioning and merging of the invention have no negative impact on classification accuracy.
(2) The invention adopts a "divide and conquer" approach, using the digits of each dimension of the input data as the pattern strings that mark the space partition. Taking one digit at a time divides the space into 10^n regions (n being the dimensionality of the input data). Whether the classes of the data points in each region are sufficiently "pure" is then judged (i.e. whether all points belong to the same class, or the proportion belonging to one class reaches a user-set threshold). If a region is "pure" enough, it is labelled with the class held by most data in the region; otherwise the region is further subdivided into 10^n subregions by taking the next digit of each dimension, and the work continues until one of three termination conditions is met: all subdivided regions are "pure"; all digits have been traversed; or the number of rounds reaches a user-set threshold. Finally, the marker string and class label of each region are output to form the rule table used for classification. This divide-and-conquer idea is easy to understand and easy to implement. Meanwhile, since most classified data are decimal, they enter the algorithm's processing flow directly without base conversion, greatly reducing computational complexity. Storing the rules in an HBase rule table gives good scalability and strong real-time performance, the best choice for an industrial-grade database application.
(3) The invention realizes the hypersurface classification algorithm on Hadoop. Using the map/reduce mechanism, big data is divided into small data blocks; multiple map tasks produce in parallel, for each record, the pattern string marking its region of space at the current level; using map/reduce's external-memory sorting mechanism for massive data, the reduce tasks judge in parallel the data purity of each subregion of the space, generating a rule or entering the next round of computation. This design greatly relieves the memory and computation pressure on a single machine and lets the processable data scale grow linearly with the number of machines in the Hadoop cluster, achieving the technical effect of distributed computation over massive data.
(4) The invention uses the distributed non-relational database HBase to store the generated rule table. As a column-oriented database, HBase is highly reliable, high-performance, and scalable, tolerates data redundancy, and responds quickly and efficiently to queries over massive data. The invention designs an extensible HBase-based rule table that can be continually enriched with rules as new data arrive, accommodating parallel insertion of massive data and queries from different sources. This enables the invention to serve the classification needs of real industrial data.
The positive effect of the present invention is that it realizes a hypersurface-based classification algorithm on top of the Hadoop map/reduce framework and the HBase distributed non-relational database. Compared with the prior art, the proposed method and system can process terabyte-scale data, with computing capability rising nearly linearly as Hadoop cluster machines are added, truly achieving distributed computation and greatly improving performance and efficiency. In addition, the invention uses no complex operations; unlike typical classification algorithms with high computational complexity, it reduces system overhead. Using the column-oriented storage structure peculiar to HBase, the query time of the rule table rises only linearly while the data volume grows geometrically, responding to classification demands quickly and efficiently and meeting the real-time processing requirements of a big data environment.
Description of the drawings
Fig. 1 is a flowchart of the parallel training part of the hypersurface classifier of an embodiment of the present invention;
Fig. 2 is a flowchart of the parallel testing part of the hypersurface classifier of an embodiment of the present invention;
Fig. 3 shows the input data format of an embodiment of the present invention;
Fig. 4 shows the training and recognition jobs of the system running on the JobTracker in an embodiment of the present invention;
Fig. 5 shows part of the rule data in the HBase rule table of an embodiment of the present invention;
Fig. 6 shows part of the recognition results of the test process of an embodiment of the present invention;
Fig. 7 shows the test recognition accuracy on the UCLA data sets of an embodiment of the present invention.
Specific embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the big data classification method and system of the present invention are further elaborated below in conjunction with the drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The present invention is based on the distributed open-source software platform Hadoop and the non-relational distributed database HBase. Input data are processed in parallel by the Hadoop mechanism, shortening system run time; HBase's column storage and redundancy-tolerant mechanism ensure that when the knowledge base grows at a geometric rate, query time increases only linearly, making the invention suitable for processing real incremental big data. The concrete scheme is as follows.
The big data classification method of the invention is broadly divided into two steps, a training step and a testing step. The training step comprises a first map/reduce step executed over multiple rounds; the testing step comprises one second map/reduce step. It should be explained that the first map/reduce step includes one or more first mapping steps and one reduction step, while the second map/reduce step includes a second mapping step. The first mapping step partitions the large input data into input data blocks of fixed size (usually 64 MB) before they are processed; the reduction step processes the data blocks generated by the first mapping step and produces the final output. Data insertion into and queries against the distributed database HBase are interleaved into the map/reduce programs of the training and testing steps: the training step writes the generated classification rules into the HBase rule table, while the testing step looks up classification rules in the HBase rule table to recognize the classes of the test data. The invention can process incremental big data: after a batch of training data produced in one period is processed, classification rules are extracted and written into the HBase rule table; the classification rules obtained from the next batch of data produced in the following period continue to be written into the table, so the rule base expands and the classification capability strengthens.
Embodiment
Below, taking as an example training samples composed of the pixels of pictures of the classical 26 English letters, the big data classification method of the present invention is illustrated.
In the present invention, the format of the input data is as follows: training data include class labels, and test data do not (according to specific needs, various data can be pre-processed to conform to the input pattern of the present invention):
……
First mapping/abbreviation step of the training step comprising repeatedly circulation(N is denoted herein as, n is positive integer), each
One mapping/abbreviation step is made up of the first mapping step and abbreviation step.Input training data enters mapping end, is divided into many
Individual fixed size(Usually 64MB)Input block, then each input block is read line by line process parallel.It is defeated
Enter every data shape of data block such as " 02080305010813000606100800080008T ",
" 02080305010813000606100800080008 " represents 256 of alphabetical " T " picture(Binary system)Pixel value, that is,
The pattern of the data of this input block.In this way, regular text message has been converted image information into, can be with
Carry out classification process.Setting is current for the l time circulation(L is positive integer), input data is taken per one-dimensional front l bit digitals composition mould
Formula character string, such as l=1, take per the first one-dimensional bit digital compositional model character string " 0000001000100000 ", will
, used as the key K of mapping output, class formative " T " is used as value V for mapping output, mapping output shape such as key-value pair for the model string
<0000001000100000,T>. Suppose the map stage this time generates three key-value pairs: <0000001000100000,T>, <0000001000100000,B>, <0000001000100000,B>. The reduce stage is entered next: reduce takes the map output <0000001000100000,T>, <0000001000100000,B>, <0000001000100000,B> as input and merges key-value pairs sharing the same key K into items of the form <K,list<V>>, e.g. <0000001000100000,<T,B,B>>. Reduce traverses and analyzes such items: if, for some pattern string such as "0000001000100000", the fraction of the class-label list accounted for by some class label reaches the purity threshold set by the user (a percentage in the range [0,1]), then reduce 1 of the training part produces a classification rule from the <pattern string, list<class label>> item; otherwise the item enters the next round of the loop. For example, with the threshold set to 90%, an item whose dominant label reaches that fraction is called "pure" and a rule is extracted; otherwise the original data is marked "untreated" and enters the next round. Here, for the pattern string "0000001000100000", the label list is <T,B,B>: the most frequent label "B" accounts for only 66%, below the 90% threshold, so the item is marked "untreated" and enters the second round.
In the second round, the map takes the first two digits of each dimension to compose the pattern string as key K and the class label as value V, generating three records: <00000010001000002835183066080808,T>, <00000010001000002835183066080801,B>, <00000010001000002835183066080801,B>. In the reduce step the system assembles values with the same key K into a sequence, producing two items: <00000010001000002835183066080808,<T>> and <00000010001000002835183066080801,<B,B>>. Analyzing the two items separately, in each a single class label clearly accounts for more than 90%, so both are "pure". Reduce therefore produces two classification rules, {00000010001000002835183066080808=>T} and {00000010001000002835183066080801=>B}, and writes both into the rule table of the HBase database. If key-value pairs that are not yet "pure" remain, the loop proceeds to the next round in the same manner until the maximum number of rounds is reached; this maximum, the upper bound on the training loop, is a manually set positive integer and defaults to the precision of the input data if unset. The training process is then finished.
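The per-round purity check described above can be sketched in Python. This is an illustrative sketch, not the actual Hadoop reducer: the function name and the dict-based grouping of map output are assumptions; the real system receives each <K,list<V>> item through the reduce interface.

```python
from collections import Counter

def reduce_round(grouped, purity_threshold=0.9):
    """One reduce pass of the training loop (illustrative sketch).

    grouped maps a pattern string K to the list of class labels
    gathered under that key, e.g. {"0000001000100000": ["T", "B", "B"]}.
    Returns (rules, unresolved): rules holds {pattern: label} for items
    judged "pure"; unresolved lists patterns marked "untreated" that
    proceed to the next round of the loop.
    """
    rules, unresolved = {}, []
    for pattern, labels in grouped.items():
        # Dominant class label and its occurrence count in the list.
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= purity_threshold:  # the item is "pure"
            rules[pattern] = label
        else:
            unresolved.append(pattern)  # marked "untreated"
    return rules, unresolved
```

On the worked example, round one leaves "0000001000100000" unresolved (B accounts for only 2/3), while round two extracts both rules, since each longer pattern string carries a single class label.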
The structure of the HBase rule table is described here. HBase stores data by column: a table is represented by columns and column families, and a column family contains one or more column qualifiers. The row key of each row of the rule table is the pattern string of a classification rule, e.g. "00000010001000002835183066080808" produced in the example above. Each row contains one column family "fam", which contains a single column qualifier whose cell stores the class label corresponding to the pattern string; here the label is "T". New classification rules are produced as incremental data is processed: if the pattern string of a new rule does not yet exist in the table, the rule is newly inserted; if it already exists, the existing value is overwritten. For example, if a new classification rule "00000010001000002835183066080808=>M" is produced and that pattern string already exists in the rule table with class label T, the old label T is overwritten with M. In this way the set of classification rules grows and is updated.
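The insert-or-overwrite policy above can be sketched as a single upsert. This is a minimal sketch: a plain dict stands in for the HBase rule table (row key = pattern string, label stored under column family "fam"); the function name is an assumption.

```python
def upsert_rule(rule_table, pattern, label):
    """Rule growth and update policy (sketch): a rule whose pattern
    string is absent is newly inserted; an existing pattern string has
    its old class label overwritten by the new one.
    Returns the previous label, or None if the rule is new."""
    previous = rule_table.get(pattern)  # None if the pattern is new
    rule_table[pattern] = label         # insert and overwrite are one write
    return previous
```

In HBase itself both cases are one put on the row key, which is what makes the overwrite policy natural there.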
The testing step comprises one second map/reduce step, which includes a second map step. In this map, the data of each data block is read line by line, and the first m digits of each dimension of a line are taken to compose the pattern string to be recognized (m being a positive integer). First the first digit of each dimension is taken; using the line "07100505020608060811071102080509" as an example, the resulting pattern string is "0100000010100000", and the HBase rule table is queried to search whether a rule {0100000010100000=>*} (* being some class label) matches it exactly. If a matching rule such as {0100000010100000=>Z} exists, the pattern string is labeled with class Z; otherwise the second digit of each dimension is appended to form the pattern string "01000000101000007055268681712859" and the query continues. The loop ends when one of the following conditions holds: (1) a matching classification rule is found; (2) the number of digits taken reaches the maximal precision of the input data (in the example above only two digits per dimension are available); (3) the number of digits taken reaches a user-input threshold, e.g. at most three. After the loop ends by one of these three conditions, if a matching classification rule was found, the pattern string and its class label are output, e.g. "07100505020608060811071102080509Z"; otherwise "07100505020608060811071102080509NF" is output (NF stands for NotFound, no rule could be found). The test process then terminates.
Fig. 1 gives the parallel structure of the training stage of the algorithm. Map 1 reads an input data block line by line and constructs <pattern string, class label> key-value pairs. Reduce 1 analyzes the purity of each <pattern string, list<class label>> item and judges whether it is "pure"; if so, it outputs the classification rule {pattern string=>class label} and inserts it into the HBase rule table; otherwise it outputs the original data, which enters the next round of the loop.
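The map side of the round shown in Fig. 1 can be sketched as one function per input line. The "digits,label" line layout and the 16-dimension, two-digits-per-dimension encoding are assumptions inferred from the worked example; the actual mapper emits the pair through the Hadoop map interface.

```python
def map_training(line, l, n_dims=16):
    """Map 1 of the training stage (sketch): emit one
    <pattern string, class label> key-value pair per input line, where
    the pattern string concatenates the first l digits of every
    dimension, position-major (l is the current round number)."""
    digits, label = line.strip().split(",")
    width = len(digits) // n_dims
    dims = [digits[i * width:(i + 1) * width] for i in range(n_dims)]
    pattern = "".join(d[w] for w in range(l) for d in dims)
    return pattern, label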
Fig. 2 gives the parallel structure of the testing stage of the algorithm. Map 1 reads an input data block line by line, constructs the pattern string to be recognized by taking the first m digits of each dimension in turn (m a positive integer), queries the class label accordingly in the HBase rule table, and outputs the final recognition result.
Following the structure described above, examples are shown below in two parts, training and testing. To better guarantee authenticity, runs on the server are presented as screenshots wherever possible. The training and test data use the UCLA standard letter-recognition dataset, containing 20000 records in total, with ten-fold cross validation, as shown in Fig. 3.
The training and test processes of the system run on the Hadoop platform; the JobTracker (the Web page viewing tool with which Hadoop records running tasks) records the whole process, as shown in Fig. 4, where each job (workflow) represents one complete run on the Hadoop platform. The training process comprises job0, job1, job2 and Afterjob: job0 is the pre-processing job, which pre-processes the input data format to meet the system requirements; job1 and job2 are the two rounds of the rule-generation loop; Afterjob is the post-processing job, which uniformly inserts the generated classification rules into the HBase rule table. The testing part comprises job0 and testjob: job0 is the pre-processing job that pre-processes the input, and testjob is the recognition job, which performs recognition and computes accuracy. Each line in the figure is read as follows, taking the first line "job_201310230921_0021 NORMAL hadoop job0 100.00%" as an example: "job_201310230921_0021" indicates that the job was submitted at 9:21 on October 23, 2013, 0021 being the serial number of the jobs submitted that day; "NORMAL" indicates the job execution state is normal; "hadoop" is the user name under which the job runs; "100.00%" indicates the job execution progress is one hundred percent, i.e. finished.
The classification rules produced while the training part runs are inserted into the HBase rule table; Fig. 5 shows the rules in the table whose class label is "J". Each data line gives, in order, the pattern string, column name, timestamp and class label. Take the first data line "00000000010000002735296261481616 column=fam:col timestamp=1385007184062 value=J" as an example; note that "296261481616" appears on a second line only because the command line wraps automatically, and together with the preceding "00000000010000002735" it forms the earlier and latter parts of one string. "00000000010000002735296261481616" is the pattern string; "column=fam:col" indicates the row has one column family "fam" containing one column qualifier "col"; "timestamp=1385007184062" indicates the insertion time of this data line is "1385007184062" (the computer's built-in milliseconds); "value=J" indicates that the value of this line under the "col" qualifier is "J". Taken together, the row states that the input string represented by the pattern string "00000000010000002735296261481616" is classified as "J".
The testing-part job recognizes the input data using ten-fold cross validation; after the original data, the class label identified from the rule database is appended, as shown in Fig. 6.
Fig. 7 shows that on the UCLA Letter recognition standard dataset, using ten-fold cross validation, the recognition accuracy of the testing part is 92%, an excellent performance. The screenshot shows the result of the run under the command line: the first part is internal mechanism information printed by the Hadoop job, giving physical memory size, virtual memory size, number of records output by the reduce process, number of records output by the map process, and so on. What matters is the last sentence, "algorithm precision is 0.92", indicating the verified accuracy of the algorithm is 0.92, i.e. 92%.
Claims (8)
1. A big data classification method, characterized in that the method comprises the following steps:
a training step, comprising a plurality of loops of a first map/reduce step, for dividing input data into input data blocks, generating from the input data blocks classification rules {pattern string=>class label} of pattern strings, and writing the classification rules into an HBase database rule table;
a testing step, comprising one second map/reduce step, for reading the input data blocks, constructing pattern strings to be classified, searching the HBase database rule table for classification rules matching the pattern strings to be classified, and outputting classification results;
wherein the first map/reduce step specifically comprises one or more first map steps and one reduce step, wherein the first map step is used for dividing the input data into input data blocks of fixed size, reading the input data blocks line by line, constructing pattern strings by taking the first l digits of each dimension in turn, and generating key-value pairs <pattern string, class label> from the input data blocks, wherein l is the number of loops performed so far; and the reduce step is used for merging the key-value pairs into items <pattern string, list<class label>> and judging whether each item is pure, writing the rule into the HBase database rule table if it is pure, and otherwise returning to the first map step, wherein pure means that the percentage of occurrences of some class label in list<class label> reaches a threshold set by the user.
2. The big data classification method of claim 1, characterized in that the specific method of storing the rule in the HBase database rule table is: judging whether the classification rule already exists in the HBase database rule table; if it does not exist, storing the classification rule in the HBase database rule table; if it exists, storing the classification rule in the HBase database rule table by way of overwriting.
3. The big data classification method of claim 1 or 2, characterized in that the HBase database rule table stores the classification rules in a column-oriented manner.
4. The big data classification method of claim 1, characterized in that the second map/reduce step specifically comprises one or more second map steps for reading the input data blocks line by line, constructing pattern strings to be classified by taking the first m digits of each dimension in turn, wherein m is a positive integer, and searching the HBase database rule table for a classification rule matching the pattern string to be classified until an end condition is met, then outputting the classification result; wherein the end condition is any one of: a matching classification rule is found; the number of digits taken reaches the maximal precision of the input data; or the number of digits taken reaches a user-input threshold.
5. A big data classification system, characterized in that the system comprises the following modules:
a training module, comprising a plurality of loops of a first map/reduce module, for dividing input data into input data blocks, generating from the input data blocks classification rules {pattern string=>class label} of pattern strings, and writing the classification rules into an HBase database rule table;
a test module, comprising one second map/reduce module, for reading the input data blocks, constructing pattern strings to be classified, searching the HBase database rule table for classification rules matching the pattern strings to be classified, and outputting classification results;
wherein the first map/reduce module specifically comprises one or more first map modules and one reduce module, wherein the first map module is used for dividing the input data into input data blocks of fixed size, reading the input data blocks line by line, constructing pattern strings by taking the first l digits of each dimension in turn, and generating key-value pairs <pattern string, class label> from the input data blocks, wherein l is the number of loops performed so far; and the reduce module is used for merging the key-value pairs into items <pattern string, list<class label>> and judging whether each item is pure, writing the rule into the HBase database rule table if it is pure, and otherwise returning to the first map module, wherein pure means that the percentage of occurrences of some class label in list<class label> reaches a threshold set by the user.
6. The big data classification system of claim 5, characterized in that the specific method of storing the rule in the HBase database rule table is: judging whether the classification rule already exists in the HBase database rule table; if it does not exist, storing the classification rule in the HBase database rule table; if it exists, storing the classification rule in the HBase database rule table by way of overwriting.
7. The big data classification system of claim 5 or 6, characterized in that the HBase database rule table stores the classification rules in a column-oriented manner.
8. The big data classification system of claim 5, characterized in that the second map/reduce module specifically comprises one or more second map modules for reading the input data blocks line by line, constructing pattern strings to be classified by taking the first m digits of each dimension in turn, wherein m is a positive integer, and searching the HBase database rule table for a classification rule matching the pattern string to be classified until an end condition is met, then outputting the classification result; wherein the end condition is any one of: a matching classification rule is found; the number of digits taken reaches the maximal precision of the input data; or the number of digits taken reaches a user-input threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310727192.1A CN103729428B (en) | 2013-12-25 | 2013-12-25 | Big data classification method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103729428A CN103729428A (en) | 2014-04-16 |
CN103729428B true CN103729428B (en) | 2017-04-12 |
Family
ID=50453502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310727192.1A Active CN103729428B (en) | 2013-12-25 | 2013-12-25 | Big data classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103729428B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104111992B (en) * | 2014-07-03 | 2017-05-17 | 北京思特奇信息技术股份有限公司 | Method and system for merging agency results of distributed database |
CN104268260A (en) * | 2014-10-10 | 2015-01-07 | 中国科学院重庆绿色智能技术研究院 | Method, device and system for classifying streaming data |
CN106657222A (en) * | 2016-09-21 | 2017-05-10 | 广东工业大学 | Information management method, system and device based on Internet of Things big data |
CN106778849A (en) * | 2016-12-02 | 2017-05-31 | 杭州普玄科技有限公司 | Data processing method and device |
CN107515897B (en) * | 2017-07-19 | 2021-02-02 | 中国科学院信息工程研究所 | Data set generation method and device in string matching scene and readable storage medium |
CN108228757A (en) * | 2017-12-21 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image search method and device, electronic equipment, storage medium, program |
US11823038B2 (en) | 2018-06-22 | 2023-11-21 | International Business Machines Corporation | Managing datasets of a cognitive storage system with a spiking neural network |
CN110096519A (en) * | 2019-04-09 | 2019-08-06 | 北京中科智营科技发展有限公司 | A kind of optimization method and device of big data classifying rules |
CN110210773A (en) * | 2019-06-10 | 2019-09-06 | 四川长虹电器股份有限公司 | A kind of project iteration appraisal system and method |
CN112115335B (en) * | 2019-06-20 | 2024-05-28 | 百度(中国)有限公司 | Data fusion processing method, device, equipment and storage medium |
CN112258690B (en) * | 2020-10-23 | 2022-09-06 | 中车青岛四方机车车辆股份有限公司 | Data access method and device and data storage method and device |
CN113159600B (en) * | 2021-04-29 | 2023-04-28 | 南方电网深圳数字电网研究院有限公司 | Demand subcontracting management method and system applied to bidding |
US20220405417A1 (en) * | 2021-06-17 | 2022-12-22 | International Business Machines Corporation | Sensitive data classification in non-relational databases |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6778534B1 (en) * | 2000-06-30 | 2004-08-17 | E. Z. Chip Technologies Ltd. | High-performance network processor |
CN102841860A (en) * | 2012-08-17 | 2012-12-26 | 珠海世纪鼎利通信科技股份有限公司 | Large data volume information storage and access method |
CN103268336A (en) * | 2013-05-13 | 2013-08-28 | 刘峰 | Fast data and big data combined data processing method and system |
Non-Patent Citations (1)
Title |
---|
Big Data Mining Platform Based on Cloud Computing; He Qing et al.; ZTE Technology Journal; 2013-08-31; Vol. 19, No. 4; pp. 32-38 *
Also Published As
Publication number | Publication date |
---|---|
CN103729428A (en) | 2014-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103729428B (en) | Big data classification method and system | |
CN109902145B (en) | Attention mechanism-based entity relationship joint extraction method and system | |
CN104318340B (en) | Information visualization methods and intelligent visible analysis system based on text resume information | |
CN112199520A (en) | Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix | |
CN105393264A (en) | Interactive segment extraction in computer-human interactive learning | |
CN110689081A (en) | Weak supervision target classification and positioning method based on bifurcation learning | |
CN104751182A (en) | DDAG-based SVM multi-class classification active learning algorithm | |
CN106778832A (en) | The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization | |
CN113673254B (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN110633365A (en) | Word vector-based hierarchical multi-label text classification method and system | |
CN111339407B (en) | Implementation method of information extraction cloud platform | |
CN112597324A (en) | Image hash index construction method, system and equipment based on correlation filtering | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
Chu et al. | Co-training based on semi-supervised ensemble classification approach for multi-label data stream | |
CN116383399A (en) | Event public opinion risk prediction method and system | |
CN115329120A (en) | Weak label Hash image retrieval framework with knowledge graph embedded attention mechanism | |
CN115827954A (en) | Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment | |
Yao | Design and simulation of integrated education information teaching system based on fuzzy logic | |
Lei et al. | An input information enhanced model for relation extraction | |
Wang et al. | Distant supervised relation extraction with position feature attention and selective bag attention | |
Liu et al. | Community-based dandelion algorithm-enabled feature selection and broad learning system for traffic flow prediction | |
CN116226404A (en) | Knowledge graph construction method and knowledge graph system for intestinal-brain axis | |
CN117151222A (en) | Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium | |
Cao | Design and optimization of a decision support system for sports training based on data mining technology | |
CN114860852A (en) | Knowledge graph construction method for military field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |