CN103729428B - Big data classification method and system - Google Patents

Big data classification method and system

Info

Publication number
CN103729428B
CN103729428B (Application No. CN201310727192.1A)
Authority
CN
China
Prior art keywords
mapping
classifying rules
hbase database
association rule
category
Prior art date
Legal status
Active
Application number
CN201310727192.1A
Other languages
Chinese (zh)
Other versions
CN103729428A (en)
Inventor
何清
吴新宇
庄福振
敖翔
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201310727192.1A priority Critical patent/CN103729428B/en
Publication of CN103729428A publication Critical patent/CN103729428A/en
Application granted granted Critical
Publication of CN103729428B publication Critical patent/CN103729428B/en


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

The invention discloses a big data classification method and system. The method includes a training step, in which input data is divided into input data blocks, classification rules of the form {pattern string => class label} are generated from the input data blocks, and the classification rules are written into an HBase rule table; and a testing step, in which the input data blocks are read, pattern strings to be classified are constructed, classification rules matching the pattern strings to be classified are looked up in the HBase rule table, and classification results are output. The invention thus provides a hypersurface-based big data classification method and system: classification is realized by a hypersurface-based covering algorithm on top of the Hadoop MapReduce programming framework and the HBase distributed non-relational database, an easily interpretable rule model is built at low computational cost, and big data is processed quickly and efficiently, meeting the real-world need to classify explosively growing data.

Description

Big data classification method and system
Technical field
The present invention relates to the field of big data analysis, and in particular to a hypersurface-based big data classification method and system.
Background technology
Classification is an important form of data analysis for extracting models that describe significant data classes. Such a model, called a classifier, is used to predict class labels. Data classification is a two-stage process comprising a learning stage and a classification stage: the learning stage builds the classification model, and the classification stage uses the model to predict the class labels of given data. For example, a classification model can be built to label bank loan applications as safe or risky. Such analysis helps us gain a better overall understanding of the data. Many classification and prediction methods come from machine learning, pattern recognition and statistics. Most of these algorithms are memory-resident and often assume a small data volume. Classification has wide applications, including fraud detection, target marketing, performance prediction, manufacturing and medical diagnosis.
There are many existing methods for solving classification problems. Single-classifier methods mainly include decision trees, Bayesian classifiers, artificial neural networks, k-nearest neighbours, support vector machines and association-rule-based classification; in addition, ensemble learning algorithms that combine single classifiers, such as bagging and boosting, are also used.
(1) Decision trees
A decision tree is one of the main techniques for classification and prediction. Decision tree learning is an example-based inductive learning algorithm that aims to infer classification rules, represented as a decision tree, from a set of unordered, irregular examples. The purpose of constructing a decision tree is to find the relationship between attributes and classes and to use it to predict the classes of future records of unknown class. It adopts a top-down recursive approach: attribute values are compared at the internal nodes of the tree, branches are followed downward from a node according to the attribute values, and conclusions are reached at the leaf nodes.
The main decision tree algorithms are ID3, C4.5 (C5.0), CART, PUBLIC, SLIQ and SPRINT. They differ in the technique used to select test attributes, the structure of the generated decision tree, the pruning method and timing, and the ability to handle large data sets.
(2) Bayesian classifiers
Bayesian classification algorithms are a class of algorithms that classify using probability and statistics, such as the naive Bayes algorithm. These algorithms mainly use Bayes' theorem to predict the probability that a sample of unknown class belongs to each class, and select the most probable class as the final class of the sample. Because naive Bayes itself requires a very strong conditional independence assumption that often does not hold in practice, its classification accuracy declines. This has given rise to Bayesian classification algorithms that relax the independence assumption, such as the TAN algorithm, which adds associations between attribute pairs on top of the naive Bayes network structure.
(3) Artificial neural networks
An artificial neural network is a mathematical model for information processing whose structure resembles the synaptic connections of the brain. In this model, a large number of nodes (also called "neurons" or "units") are connected to each other to form a network, i.e. a "neural network", in order to process information. A neural network usually needs to be trained, and training is the process by which the network learns: training changes the connection weights of the network nodes so that the network acquires the ability to classify, and a trained network can then be used to recognize objects.
At present there are hundreds of different neural network models, such as back-propagation networks, radial basis function networks, Hopfield networks, stochastic neural networks and competitive neural networks. However, current neural networks still commonly suffer from slow convergence, heavy computation, long training times and lack of interpretability.
(4) k-nearest neighbours
The k-nearest neighbour algorithm is an instance-based classification method. It finds the k training samples closest to an unknown sample x and assigns x to the class held by the majority of those k samples. The k-nearest neighbour method is a lazy learning method: it stores the samples and classifies only when classification is needed. If the sample set is complex, this may lead to a very large computational overhead, so it cannot be applied where strong real-time performance is required.
(5) Support vector machines
The support vector machine is a learning method proposed by Vapnik on the basis of statistical learning theory. Its most distinctive feature is that, following the structural risk minimization principle, it constructs an optimal separating hyperplane by maximizing the class margin in order to improve the generalization ability of the learning machine, and it handles problems such as nonlinearity, high dimensionality and local minima well. For classification problems, the support vector machine algorithm computes the decision surface of a region from the samples in that region and thereby determines the classes of unknown samples in the region.
(6) Classification based on association rules
Association rule mining is an important research area in data mining. In recent years, scholars have extensively studied how to apply association rule mining to classification problems. Associative classification methods mine rules of the form condset → C, where condset is a set of items (or attribute-value pairs) and C is a class label; rules of this form are called class association rules. Associative classification typically consists of two steps: in the first step, an association rule mining algorithm mines from the training data all class association rules satisfying given support and confidence thresholds; in the second step, a heuristic is used to select a group of high-quality rules from the mined class association rules for classification.
(7) Ensemble learning
The complexity of practical applications and the diversity of data often make a single classification method insufficiently effective. Scholars have therefore extensively studied the fusion of multiple classification methods, i.e. ensemble learning. Ensemble learning has become a research hotspot of the international machine learning community and is regarded as one of the four main research directions of current machine learning.
Ensemble learning is a machine learning paradigm that repeatedly invokes a single learning algorithm to obtain different base learners and then combines these learners according to certain rules to solve the same problem, which can significantly improve the generalization ability of the learning system. Multiple base learners are mainly combined by (weighted) voting; common algorithms include bagging and boosting.
Because ensemble learning combines and averages multiple classifiers by voting, it can reduce the error of a single classifier and obtain a model that represents the problem space more accurately, thereby improving classification accuracy.
The criteria for comparing and evaluating classification methods mainly include: (1) prediction accuracy, the ability of the model to correctly predict the class labels of new samples; (2) computation speed, including the time to construct the model and the time to classify with it; (3) robustness, the ability of the model to predict correctly on noisy data or data with missing values; (4) scalability, the ability to construct the model effectively for very large data sets; (5) conciseness and interpretability of the model description: the more concise and understandable the model description is, the more welcome it is.
Judged by these criteria, the currently popular classification algorithms have the following problems. Decision tree algorithms are widely used, simple in idea and relatively easy to implement, but because of the tree data structure inherent to decision trees, machine memory becomes the algorithm's bottleneck and large-scale data cannot be processed; moreover, pruning before and after construction of the tree faces criteria that are hard to determine and high algorithmic complexity. Bayesian classification is also a common classification algorithm, but naive Bayes requires a very strong conditional independence assumption that often does not hold in practice, which seriously affects classification accuracy. The training process of artificial neural networks is extremely complex and commonly suffers from slow convergence, heavy computation, long training times and lack of interpretability. The k-nearest neighbour algorithm is a lazy learning algorithm that stores the samples and classifies only when needed, without a deliberate separation of training and recognition, but if the sample set is complex it incurs a very large computational overhead and cannot handle massive data. Support vector machines can construct an optimal separating hypersurface by maximizing the class margin and classify nonlinear, high-dimensional data, but solving for the separating hyperplane is difficult, the algorithmic complexity is high, and large volumes of real-time data are hard to handle. Association-rule-based classification and ensemble learning have comparatively fewer methods at present and are still at an exploratory stage, and they too generally face high algorithmic complexity, difficulty in converging and uncertain classification performance.
Summary of the invention
In order to solve the above problems, it is an object of the present invention to provide a hypersurface-based big data classification method and system that address the problems of the above prior art: unstable prediction accuracy, high computational cost, low speed, high model complexity, poor interpretability and inability to process massive data. The method uses a covering algorithm based on hypersurfaces, can be implemented on top of the Hadoop MapReduce programming framework and the HBase distributed non-relational database, and can build an easily interpretable rule model at relatively low computational cost and process massive data quickly and efficiently, so as to meet the real-world need to classify explosively growing data.
To achieve the above object, the big data classification method proposed by the invention is characterised in that the method comprises the following steps:
a training step, comprising multiple iterations of a first Map/Reduce step, for dividing input data into input data blocks, generating classification rules of the form {pattern string => class label} from the input data blocks, and writing the classification rules into an HBase rule table;
a testing step, comprising one second Map/Reduce step, for reading the input data blocks, constructing pattern strings to be classified, looking up in the HBase rule table the classification rules matching the pattern strings to be classified, and outputting classification results.
In the big data classification method of the present invention, the first Map/Reduce step specifically comprises:
one or more first map steps and one reduce step, wherein the first map step is used to divide the input data into input data blocks of fixed size, read each input data block line by line, construct a pattern string by taking the first l digits of each dimension in turn, and generate key-value pairs <pattern string, class label> from the input data block, where l is the current iteration number; the reduce step is used to merge the key-value pairs into items <pattern string, list<class label>> and judge whether each item is pure; if it is pure, a rule is written into the HBase rule table, otherwise control returns to the first map step, where "pure" means that the percentage of occurrences of some class label in the list<class label> reaches a user-set threshold.
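As a minimal, non-authoritative sketch of this first Map/Reduce step, the two scripts below could be run as a Hadoop Streaming mapper and reducer (plain Python over stdin/stdout; the record layout of two decimal digits per dimension with a trailing class label, the digit-plane ordering of the pattern string, the ITERATION_L environment variable and the 0.9 purity threshold are illustrative assumptions taken from or added to the worked example later in this description, not a definitive implementation of the patented method):

```python
#!/usr/bin/env python
# mapper.py -- first map step: emit <pattern string, class label> for iteration l
import os
import sys

L = int(os.environ.get("ITERATION_L", "1"))   # current iteration number (assumed env var)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    digits, label = line[:-1], line[-1]       # last character is the class label
    dims = [digits[i:i + 2] for i in range(0, len(digits), 2)]   # two decimal digits per dimension
    pattern = "".join(d[k] for k in range(L) for d in dims)      # first L digits of every dimension
    print(f"{pattern}\t{label}")
```

```python
#!/usr/bin/env python
# reducer.py -- reduce step: purity test on <pattern string, list<class label>>
import sys
from collections import Counter
from itertools import groupby

THRESHOLD = 0.9                               # user-set purity threshold (assumed value)

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin if line.strip())
for pattern, group in groupby(pairs, key=lambda kv: kv[0]):      # Hadoop delivers keys sorted
    labels = [label for _, label in group]
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= THRESHOLD:
        print(f"RULE\t{pattern}\t{label}")    # pure: {pattern string => class label}, to be written to HBase
    else:
        print(f"PENDING\t{pattern}")          # not pure: re-processed in iteration L+1
```

A driver would rerun this pair with ITERATION_L increased by one for the records whose pattern strings were reported as PENDING, until no pattern remains or the data precision is exhausted.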
In the big data classification method of the present invention, the specific method for storing the rules in the HBase rule table is:
judging whether the classification rule already exists in the HBase rule table; if it does not exist, storing the classification rule into the HBase rule table; if it does exist, storing the classification rule into the HBase rule table by overwriting.
In the big data classification method of the present invention, the HBase rule table stores the classification rules column-wise.
In the big data classification method of the present invention, the second Map/Reduce step specifically comprises:
one or more second map steps, for reading the input data block line by line, constructing a pattern string to be classified by taking the first m digits of each dimension in turn, where m is a positive integer, and looking up in the HBase rule table whether a classification rule matching the pattern string to be classified exists, until a termination condition is met, whereupon the classification result is output; the termination condition is any one of the following: a matching classification rule is found, the number of digits taken reaches the maximum precision of the input data, or the number of digits taken reaches a user-input threshold.
The invention also proposes a big data classification system, characterised in that the system comprises the following modules:
a training module, comprising multiple iterations of a first Map/Reduce module, for dividing input data into input data blocks, generating classification rules of the form {pattern string => class label} from the input data blocks, and writing the classification rules into an HBase rule table;
a testing module, comprising one second Map/Reduce module, for reading the input data blocks, constructing pattern strings to be classified, looking up in the HBase rule table the classification rules matching the pattern strings to be classified, and outputting classification results.
In the big data classification system of the present invention, the first Map/Reduce module specifically comprises:
one or more first map modules and one reduce module, wherein the first map module is used to divide the input data into input data blocks of fixed size, read each input data block line by line, construct a pattern string by taking the first l digits of each dimension in turn, and generate key-value pairs <pattern string, class label> from the input data block, where l is the current iteration number; the reduce module is used to merge the key-value pairs into items <pattern string, list<class label>> and judge whether each item is pure; if it is pure, a rule is written into the HBase rule table, otherwise control returns to the first map module, where "pure" means that the percentage of occurrences of some class label in the list<class label> reaches a user-set threshold.
In the big data classification system of the present invention, the specific method for storing the rules in the HBase rule table is:
judging whether the classification rule already exists in the HBase rule table; if it does not exist, storing the classification rule into the HBase rule table; if it does exist, storing the classification rule into the HBase rule table by overwriting.
In the big data classification system of the present invention, the HBase rule table stores the classification rules column-wise.
In the big data classification system of the present invention, the second Map/Reduce module specifically comprises:
one or more second map modules, for reading the input data block line by line, constructing a pattern string to be classified by taking the first m digits of each dimension in turn, where m is a positive integer, and looking up in the HBase rule table whether a classification rule matching the pattern string to be classified exists, until a termination condition is met, whereupon the classification result is output; the termination condition is any one of the following: a matching classification rule is found, the number of digits taken reaches the maximum precision of the input data, or the number of digits taken reaches a user-input threshold.
The present invention has the following advantages:
(1) The mathematical foundation of the hypersurface-based classification algorithm realized by the present invention is the Jordan curve theorem from topology: a curve in the plane that does not intersect itself is called a Jordan curve. A closed (end-to-end) Jordan curve in the plane divides the plane into two regions; if a point is taken in each of the two regions and the two points are connected by a curve, that curve must intersect the original closed Jordan curve. A corollary is: take any point in space as the starting point of a ray; if the ray has an odd number of intersection points with the closed Jordan curve, the point is said to lie inside the closed region enclosed by the curve, and if the number of intersection points is even, the point lies outside the closed region. The same holds for high-dimensional data. From this theory, the space partitioning and merging of the present invention will not adversely affect classification accuracy;
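Stated compactly (a restatement of the corollary above in notation, not an addition to the method), for a closed Jordan curve $J$, a point $p \notin J$ and a ray $r$ from $p$ in general position:

$$p \text{ lies inside } J \iff |\, r \cap J \,| \equiv 1 \pmod 2,$$

and the same parity test carries over to closed hypersurfaces in higher dimensions.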
(2) The present invention adopts the idea of "divide and conquer", using the digit values of each dimension of the input data as the pattern string that characterizes the space partition. Taking decimal digits, the space is divided into 10^n regions (n is the dimensionality of the input data). For each region it is judged whether the classes of the data points in the region are sufficiently "pure" (all of the same class, or the proportion belonging to the same class reaching a user-set threshold). If sufficiently "pure", the region is labelled with the class of the majority of the data in it; otherwise the region is further subdivided into 10^n subregions by taking the next digit of each dimension, and this work continues until any of three termination conditions is met: all subregions are "pure", all digits have been traversed, or the number of iterations reaches a user-set threshold. The characterizing string of each region together with its class label is finally output for classification, forming the rule table. This divide-and-conquer idea is easy to understand and convenient to implement. Meanwhile, using decimal digits allows most classification data (usually in decimal) to enter the processing flow of the algorithm directly without base conversion, greatly reducing computational complexity. Storing the rules in an HBase rule table gives good scalability and real-time performance, and is the best choice for industrial-grade database applications.
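A toy two-dimensional sketch of this decimal partition (the sample values and the helper name region_key are illustrative assumptions; values are assumed to lie in [0, 1)): at level l each dimension contributes its first l decimal digits, so level l distinguishes at most (10^l)^n regions and level 1 gives exactly the 10^n regions mentioned above.

```python
def region_key(point, l):
    """Digit-plane key of the region containing `point` at subdivision level l."""
    digits = [f"{v:.6f}".replace("0.", "")[:l] for v in point]   # first l decimal digits per dimension
    return "".join(d[k] for k in range(l) for d in digits)

p = (0.2745, 0.8132)          # n = 2 dimensions
print(region_key(p, 1))       # '28'   -> one of 10**2 level-1 regions
print(region_key(p, 2))       # '2871' -> one of 10**4 level-2 regions
```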
(3) The present invention realizes the hypersurface classification algorithm on Hadoop. Using the Map/Reduce mechanism, big data is split into small data blocks; multiple map tasks produce in parallel, for each data record, the pattern string characterizing its region of space at the current level; and, using the external-memory sorting mechanism of Map/Reduce for massive data, the reduce side judges the data purity of each subregion of the space in parallel and either generates a rule or enters the next level of computation. This design greatly relieves the memory and computation pressure of a single machine, allows the scale of processable data to grow linearly with the number of machines in the Hadoop cluster, and achieves the technical effect of distributed computation over massive data;
(4) The present invention uses the distributed non-relational database HBase to store the generated rule table. As a column-oriented database, HBase has high reliability, high performance and scalability, tolerates data redundancy, and responds efficiently and quickly to queries over massive data. The present invention designs an extensible rule table based on HBase, which can continuously enrich the rule table with newly added data and supports parallel insertion of massive data and query operations from different sources. This enables the invention to be applied to the classification needs of real industrial data.
The positive effect of the present invention is that it realizes a hypersurface-based classification algorithm on the Hadoop Map/Reduce framework and the HBase distributed non-relational database. Compared with the prior art, the new method and system proposed by the present invention can process terabyte-scale data, and computing capability rises nearly linearly with the number of machines in the Hadoop cluster, truly realizing distributed computation and greatly improving performance and efficiency. In addition, the present invention does not use complicated operations; unlike general classification algorithms with high computational complexity, it can reduce system overhead. Using the column-oriented storage table structure specific to HBase, the query time of the rule table rises only linearly while the data volume grows geometrically, responding to classification demands efficiently and quickly and meeting the real-time processing requirements of the system in a big data environment.
Description of the drawings
Fig. 1 is a flowchart of the parallel training part of the hypersurface classifier according to an embodiment of the present invention
Fig. 2 is a flowchart of the parallel testing part of the hypersurface classifier according to an embodiment of the present invention
Fig. 3 shows the input data format of the embodiment of the present invention
Fig. 4 shows the training and recognition jobs of the system running on the JobTracker in the embodiment of the present invention
Fig. 5 shows part of the rule data in the HBase rule table of the embodiment of the present invention
Fig. 6 shows part of the recognition results of the test process of the embodiment of the present invention
Fig. 7 shows the test recognition accuracy on the UCLA data set in the embodiment of the present invention
Specific embodiment
In order to make the objects, technical solutions and advantages of the present invention clearer, the big data classification method and system of the present invention are further elaborated below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
The present invention is based on the distributed open-source software platform Hadoop and the non-relational distributed database HBase. It uses the Hadoop mechanism to process input data in parallel, shortening system running time, and uses HBase column storage and its redundancy-tolerant mechanism so that, when the knowledge base data grows geometrically, query time increases only linearly, making it suitable for processing real incremental big data. The concrete scheme is as follows.
The big data classification method of the present invention is broadly divided into two steps, a training step and a testing step. The training step comprises multiple iterations of a first Map/Reduce step, and the testing step comprises one second Map/Reduce step. Note that the first Map/Reduce step comprises one or more first map steps and one reduce step, and the second Map/Reduce step comprises a second map step. The first map step partitions the large volume of input data into input data blocks of fixed size (usually 64 MB) before further processing; the reduce step processes the data blocks generated by the first map step and produces the final output. Data insertion and query operations on the distributed database HBase are interspersed in the Map/Reduce programs of the training step and the testing step: the training step writes the generated classification rules into the HBase rule table, and the testing step looks up the classification rules in the HBase rule table and uses them to identify the classes of the test data. The present invention can process incremental big data: after a batch of training data produced in one period is processed, classification rules are extracted and written into the HBase rule table, and the classification rules obtained from the next batch of data produced in the following period continue to be written into the HBase rule table, so the classification rule base is expanded and classification capability is strengthened.
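This incremental aspect can be pictured with a small stand-in sketch (a plain Python dictionary replaces the persistent HBase rule table; the pattern strings and the T-to-M override come from the embodiment below, while the helper name absorb_batch is an illustrative assumption):

```python
rule_table = {}                                  # stands in for the persistent HBase rule table

def absorb_batch(new_rules):
    """Merge the classification rules extracted from one batch of training data."""
    for pattern, label in new_rules.items():
        rule_table[pattern] = label              # insert if new, overwrite if already present

absorb_batch({"00000010001000002835183066080808": "T"})    # rules from one period's batch
absorb_batch({"00000010001000002835183066080808": "M",     # a later batch: same pattern re-labelled
              "00000000010000002735296261481616": "J"})
print(rule_table["00000010001000002835183066080808"])      # M  (former class label T overridden)
```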
Embodiment
Below, the big data classification method of the present invention is illustrated by taking as an example training samples composed of the pixels of pictures of the classic 26 English letters.
In the present invention, the format of the input data is as follows: training data contains the class label, and test data does not (according to specific requirements, various data can be pre-processed so as to conform to the input pattern of the present invention):
……
The training step comprises multiple iterations (denoted here as n, n a positive integer) of the first Map/Reduce step, each consisting of a first map step and a reduce step. The input training data enters the map side and is divided into multiple input data blocks of fixed size (usually 64 MB), and each input data block is then read and processed line by line in parallel. Each record of an input data block has the form "02080305010813000606100800080008T", where "02080305010813000606100800080008" represents the pixel values of the picture of the letter "T", i.e. the pattern of this record of the input data block. In this way, image information is converted into regular text information that can be processed for classification.
Suppose the current iteration is the l-th (l a positive integer); the first l digits of each dimension of the input data are taken to compose the pattern string. For l = 1, for example, the first digit of each dimension is taken to compose the pattern string "0000001000100000"; this pattern string serves as the key K of the map output and the class label "T" serves as the value V, so the map output is a key-value pair such as <0000001000100000, T>. Suppose three key-value pairs are generated this time: <0000001000100000, T>, <0000001000100000, B>, <0000001000100000, B>. The reduce stage follows: the reduce takes the map output <0000001000100000, T>, <0000001000100000, B>, <0000001000100000, B>, etc. as input and merges key-value pairs with the same key K into items of the form <K, list<V>>, such as <0000001000100000, <T, B, B>>. The reduce traverses and analyses such items: if, for some pattern string such as "0000001000100000", the percentage of occurrences of some class label in the set of class labels reaches the purity threshold set by the user (the purity threshold is a percentage in the range [0, 1] set by the user), then reduce 1 of the training part can produce a classification rule from the item <pattern string, list<class label>>; otherwise the next iteration is entered. For example, with a threshold of 90%, an item meeting it is called "pure" and a rule is extracted; otherwise the original data is marked "unprocessed" and the next iteration is entered. Here, for the pattern string "0000001000100000" with class label sequence <T, B, B>, the most frequent label "B" accounts for 66%, below the 90% threshold, so the item is marked "unprocessed" and the second iteration is entered.
The map of the second iteration takes the first two digits of each dimension to compose the pattern string as the key K, with the class label as the value V, generating three records <00000010001000002835183066080808, T>, <00000010001000002835183066080801, B>, <00000010001000002835183066080801, B>. The system assembles them by identical key K into sequences for the reduce step, producing two items <00000010001000002835183066080808, <T>> and <00000010001000002835183066080801, <B, B>>. Clearly, analysing the two items separately, each has one class label accounting for more than 90%, so each is called "pure". The reduce therefore produces two classification rules, {00000010001000002835183066080808 => T} and {00000010001000002835183066080801 => B}, which are written into the HBase rule table. If there are still key-value pairs that are not sufficiently "pure", the next iteration is entered in the above manner until the maximum number of iterations is reached; the maximum number of iterations is a manually set positive integer giving the maximum number of iterations of the training part, and defaults to the precision of the input data if not set. The training process then finishes.
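The two iterations just described can be reproduced with a small self-contained sketch (plain Python stands in for Hadoop and for the HBase rule table; the record for "T" and the 90% threshold come from the example above, while the two "B" records are reconstructed from the level-2 pattern strings given there and are otherwise an assumption):

```python
from collections import Counter, defaultdict

RECORDS = [                                    # last character of each record is the class label
    "02080305010813000606100800080008T",       # record from the example above
    "02080305010813000606100800080001B",       # reconstructed from the level-2 pattern ...0801
    "02080305010813000606100800080001B",
]
THRESHOLD = 0.9                                # 90% purity threshold
rule_table = {}                                # stands in for the HBase rule table

def pattern(digits, l):
    dims = [digits[i:i + 2] for i in range(0, len(digits), 2)]   # 16 two-digit dimensions
    return "".join(d[k] for k in range(l) for d in dims)         # digit-plane concatenation

pending, l = RECORDS, 1
while pending and l <= 2:                      # 2 = precision of this example data
    groups = defaultdict(list)
    for rec in pending:                        # map: emit <pattern string, class label>
        groups[pattern(rec[:-1], l)].append(rec[-1])
    nxt = []
    for pat, labels in groups.items():         # reduce: purity test
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= THRESHOLD:
            rule_table[pat] = label            # rule {pattern string => class label}
        else:
            nxt += [r for r in pending if pattern(r[:-1], l) == pat]   # marked "unprocessed"
    pending, l = nxt, l + 1

print(rule_table)
# {'00000010001000002835183066080808': 'T', '00000010001000002835183066080801': 'B'}
```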
The structure of the HBase rule table is explained here. HBase stores data column-wise; its table structure is represented by rows, column families and column qualifiers. The key of each row of the rule table is the pattern string of a classification rule, for example "00000010001000002835183066080808" produced in the example above; each row contains one column family "fam", the column family contains one column qualifier, and the column qualifier stores the class label corresponding to the pattern string, here "T". New classification rules are produced when incremental data is processed: if the pattern string of a classification rule does not yet exist in the table, the rule is newly inserted; if it already exists, it is overwritten. For example, if a new classification rule "00000010001000002835183066080808 => M" is produced and the rule already exists in the rule table with former class label T, the former class label T is overridden with M. In this way the growth and update of the classification rules is completed.
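A sketch of this row layout using the happybase HBase client (it assumes an HBase Thrift gateway on localhost and that a table "rule_table" with column family "fam" already exists; the table name is an assumption, while the row key, "fam", "col" and the label "J" follow the embodiment and Fig. 5):

```python
import happybase

conn = happybase.Connection("localhost")         # HBase Thrift gateway (assumed to be running)
table = conn.table("rule_table")                 # assumed table name; column family 'fam' pre-created

def store_rule(pattern, label):
    """Insert or overwrite a rule {pattern string => class label}.  The 'check, then
    overwrite' described above collapses to a single put(), because an HBase put()
    on an existing row key simply stores a newer version of the value."""
    table.put(pattern.encode(), {b"fam:col": label.encode()})

store_rule("00000000010000002735296261481616", "J")
print(table.row(b"00000000010000002735296261481616"))   # {b'fam:col': b'J'}
```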
The testing step comprises one second Map/Reduce step, which includes a second map step. In the map, the data of each data block is read line by line, and for each line the first m digits of each dimension (m a positive integer) are taken to compose the pattern string to be identified. First the first digit of each dimension is taken; taking "07100505020608060811071102080509" as an example, the pattern string is "0100000001010000", and the HBase rule table is queried to see whether a rule {0100000001010000 => *} (* being some class label) matches it exactly. If there is a matching rule such as {0100000001010000 => Z}, the pattern string is labelled with class Z; otherwise the second digit of each dimension is also taken to compose the pattern string "01000000010100007055268681712859" and the query continues. The process terminates when one of the following conditions is met: 1, a matching classification rule is found; 2, the number of digits taken reaches the maximum precision of the input data (in the above example only the second digit can be taken); 3, the number of digits taken reaches a user-input threshold (for example, only up to the third digit may be taken). After terminating according to these three conditions, if a matching classification rule exists, the pattern string and the class label are output, e.g. "07100505020608060811071102080509Z"; otherwise "07100505020608060811071102080509NF" is output (NF means NotFound, i.e. no rule was found). The test process ends.
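A minimal sketch of this lookup loop (a dictionary lookup stands in for the HBase query; the example record and the "Z"/"NF" outputs follow the text above, while the helper names and the toy rule table are assumptions):

```python
def classify(record, rule_lookup, max_digits=None):
    """Second map step (sketch): grow the pattern string one digit per dimension at a
    time and query the rule table until one of the three termination conditions holds."""
    dims = [record[i:i + 2] for i in range(0, len(record), 2)]   # two decimal digits per dimension
    precision = max(len(d) for d in dims)                        # condition 2: data precision
    limit = min(precision, max_digits) if max_digits else precision   # condition 3: user threshold
    for m in range(1, limit + 1):
        pattern = "".join(d[k] for k in range(m) for d in dims)
        label = rule_lookup(pattern)                             # HBase rule-table query stand-in
        if label is not None:                                    # condition 1: matching rule found
            return record + label
    return record + "NF"                                         # NotFound

rules = {"0100000001010000": "Z"}                                # toy rule table
print(classify("07100505020608060811071102080509", rules.get))
# 07100505020608060811071102080509Z
```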
Fig. 1 shows the parallel structure of the training stage of the algorithm. Map 1 reads the input data block line by line and constructs <pattern string, class label> key-value pairs. Reduce 1 analyses the purity of the items <pattern string, list<class label>> and judges whether they are "pure"; if pure, the classification rule {pattern string => class label} is output and inserted into the HBase rule table, otherwise the original data is output and the next iteration is entered.
Fig. 2 shows the parallel structure of the testing stage of the algorithm. Map 1 reads the input data block line by line, constructs the pattern string to be identified by taking the first m digits (m a positive integer) of each dimension in turn, queries the class label accordingly in the HBase rule table, and outputs the final recognition result.
According to the above structure, the training part and the testing part are illustrated below with examples. To better guarantee authenticity, the runs on the server are presented as screenshots where possible. The training and test data use the UCLA standard letter recognition data set, containing 20000 records in total, with ten-fold cross validation, as shown in Fig. 3.
The training and testing processes of the system run on the Hadoop platform, and the JobTracker (the web page viewing tool that records Hadoop task runs) records the whole process, as shown in Fig. 4, where each job represents one complete operation run on the Hadoop platform. The training process includes job0, job1, job2 and Afterjob: job0 is a pre-processing job that pre-processes the input data format so that it meets the system requirements, job1 and job2 are the two iterations that generate rules, and Afterjob is a post-processing job that uniformly inserts the generated classification rules into the HBase rule table. The testing part includes job0 and testjob: job0 is a pre-processing job that pre-processes the input patterns, and testjob is the recognition job that performs recognition and computes accuracy. Each line in the figure has the following meaning; taking the first line "job_201310230921_0021 NORMAL hadoop job0 100.00%" as an example, "job_201310230921_0021" indicates that the job was submitted at 09:21 on 23 October 2013, 0021 is the number of the job submitted that day, "NORMAL" means the job execution state is normal, "hadoop" is the user name that executed the job, and "100.00%" means the job execution progress is one hundred percent, i.e. finished.
The classification rules produced during the run of the training part are inserted into the HBase rule table; Fig. 5 shows the classification rules for "J" in part of the HBase rule table. Each data line shows, in order, the pattern string, column name, timestamp and class label. Take the first line of data "00000000010000002735296261481616 column=fam:col timestamp=1385007184062 value=J" as an example (note that "296261481616" appears on a new line only because the command line wraps automatically; it is the latter part of one character string together with the preceding "00000000010000002735"). "00000000010000002735296261481616" is the pattern string, "column=fam:col" indicates that the row has one column family "fam" containing one column qualifier "col", "timestamp=1385007184062" indicates that the insertion time of this data line is "1385007184062" (the computer's internal millisecond timestamp), and "value=J" indicates that the value of this line under the column qualifier "col" is "J". Overall, it means that the class of the input string represented by the pattern string "00000000010000002735296261481616" is "J".
The testing-part job identifies the input data, using ten-fold cross validation; the class label identified from the rule base is appended after the original data, as shown in Fig. 6.
Fig. 7 shows that, on the UCLA Letter recognition standard data set with ten-fold cross validation, the recognition accuracy of the testing part is 92%, an excellent performance. What is captured in the figure is the display of the run result under the command line: the first part is the internal mechanism information printed by the Hadoop job run, showing physical memory size, the number of records output by the reduce process, virtual memory size, the number of records output by the map process, and so on; importantly, the last sentence, "algorithm precision is 0.92", means that the accuracy after algorithm verification is 0.92, i.e. 92%.

Claims (8)

1. A big data classification method, characterised in that the method comprises the following steps:
a training step, comprising multiple iterations of a first Map/Reduce step, for dividing input data into input data blocks, generating classification rules of the form {pattern string => class label} from the input data blocks, and writing the classification rules into an HBase rule table;
a testing step, comprising one second Map/Reduce step, for reading the input data blocks, constructing pattern strings to be classified, looking up in the HBase rule table the classification rules matching the pattern strings to be classified, and outputting classification results;
wherein the first Map/Reduce step specifically comprises:
one or more first map steps and one reduce step, wherein the first map step is used to divide the input data into input data blocks of fixed size, read each input data block line by line, construct a pattern string by taking the first l digits of each dimension in turn, and generate key-value pairs <pattern string, class label> from the input data block, where l is the current iteration number; the reduce step is used to merge the key-value pairs into items <pattern string, list<class label>> and judge whether each item is pure; if it is pure, the rule is written into the HBase rule table, otherwise control returns to the first map step, where "pure" means that the percentage of occurrences of some class label in the list<class label> reaches a user-set threshold.
2. The big data classification method according to claim 1, characterised in that the specific method for storing the rules in the HBase rule table is:
judging whether the classification rule exists in the HBase rule table; if it does not exist, storing the classification rule into the HBase rule table; if it does exist, storing the classification rule into the HBase rule table by overwriting.
3. The big data classification method according to claim 1 or 2, characterised in that the HBase rule table stores the classification rules column-wise.
4. The big data classification method according to claim 1, characterised in that the second Map/Reduce step specifically comprises:
one or more second map steps, for reading the input data block line by line, constructing a pattern string to be classified by taking the first m digits of each dimension in turn, where m is a positive integer, and looking up in the HBase rule table whether a classification rule matching the pattern string to be classified exists, until a termination condition is met, whereupon the classification result is output; the termination condition is any one of the following: a matching classification rule is found, the number of digits taken reaches the maximum precision of the input data, or the number of digits taken reaches a user-input threshold.
5. A big data classification system, characterised in that the system comprises the following modules:
a training module, comprising multiple iterations of a first Map/Reduce module, for dividing input data into input data blocks, generating classification rules of the form {pattern string => class label} from the input data blocks, and writing the classification rules into an HBase rule table;
a testing module, comprising one second Map/Reduce module, for reading the input data blocks, constructing pattern strings to be classified, looking up in the HBase rule table the classification rules matching the pattern strings to be classified, and outputting classification results;
the first Map/Reduce module specifically comprising:
one or more first map modules and one reduce module, wherein the first map module is used to divide the input data into input data blocks of fixed size, read each input data block line by line, construct a pattern string by taking the first l digits of each dimension in turn, and generate key-value pairs <pattern string, class label> from the input data block, where l is the current iteration number; the reduce module is used to merge the key-value pairs into items <pattern string, list<class label>> and judge whether each item is pure; if it is pure, the rule is written into the HBase rule table, otherwise control returns to the first map module, where "pure" means that the percentage of occurrences of some class label in the list<class label> reaches a user-set threshold.
6. The big data classification system according to claim 5, characterised in that the specific method for storing the rules in the HBase rule table is:
judging whether the classification rule exists in the HBase rule table; if it does not exist, storing the classification rule into the HBase rule table; if it does exist, storing the classification rule into the HBase rule table by overwriting.
7. The big data classification system according to claim 5 or 6, characterised in that the HBase rule table stores the classification rules column-wise.
8. The big data classification system according to claim 5, characterised in that the second Map/Reduce module specifically comprises:
one or more second map modules, for reading the input data block line by line, constructing a pattern string to be classified by taking the first m digits of each dimension in turn, where m is a positive integer, and looking up in the HBase rule table whether a classification rule matching the pattern string to be classified exists, until a termination condition is met, whereupon the classification result is output; the termination condition is any one of the following: a matching classification rule is found, the number of digits taken reaches the maximum precision of the input data, or the number of digits taken reaches a user-input threshold.
CN201310727192.1A 2013-12-25 2013-12-25 Big data classification method and system Active CN103729428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310727192.1A CN103729428B (en) 2013-12-25 2013-12-25 Big data classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310727192.1A CN103729428B (en) 2013-12-25 2013-12-25 Big data classification method and system

Publications (2)

Publication Number Publication Date
CN103729428A CN103729428A (en) 2014-04-16
CN103729428B true CN103729428B (en) 2017-04-12

Family

ID=50453502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310727192.1A Active CN103729428B (en) 2013-12-25 2013-12-25 Big data classification method and system

Country Status (1)

Country Link
CN (1) CN103729428B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104111992B (en) * 2014-07-03 2017-05-17 北京思特奇信息技术股份有限公司 Method and system for merging agency results of distributed database
CN104268260A (en) * 2014-10-10 2015-01-07 中国科学院重庆绿色智能技术研究院 Method, device and system for classifying streaming data
CN106657222A (en) * 2016-09-21 2017-05-10 广东工业大学 Information management method, system and device based on Internet of Things big data
CN106778849A (en) * 2016-12-02 2017-05-31 杭州普玄科技有限公司 Data processing method and device
CN107515897B (en) * 2017-07-19 2021-02-02 中国科学院信息工程研究所 Data set generation method and device in string matching scene and readable storage medium
CN108228757A (en) * 2017-12-21 2018-06-29 北京市商汤科技开发有限公司 Image search method and device, electronic equipment, storage medium, program
US11823038B2 (en) 2018-06-22 2023-11-21 International Business Machines Corporation Managing datasets of a cognitive storage system with a spiking neural network
CN110096519A (en) * 2019-04-09 2019-08-06 北京中科智营科技发展有限公司 A kind of optimization method and device of big data classifying rules
CN110210773A (en) * 2019-06-10 2019-09-06 四川长虹电器股份有限公司 A kind of project iteration appraisal system and method
CN112115335B (en) * 2019-06-20 2024-05-28 百度(中国)有限公司 Data fusion processing method, device, equipment and storage medium
CN112258690B (en) * 2020-10-23 2022-09-06 中车青岛四方机车车辆股份有限公司 Data access method and device and data storage method and device
CN113159600B (en) * 2021-04-29 2023-04-28 南方电网深圳数字电网研究院有限公司 Demand subcontracting management method and system applied to bidding
US20220405417A1 (en) * 2021-06-17 2022-12-22 International Business Machines Corporation Sensitive data classification in non-relational databases

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6778534B1 (en) * 2000-06-30 2004-08-17 E. Z. Chip Technologies Ltd. High-performance network processor
CN102841860A (en) * 2012-08-17 2012-12-26 珠海世纪鼎利通信科技股份有限公司 Large data volume information storage and access method
CN103268336A (en) * 2013-05-13 2013-08-28 刘峰 Fast data and big data combined data processing method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6778534B1 (en) * 2000-06-30 2004-08-17 E. Z. Chip Technologies Ltd. High-performance network processor
CN102841860A (en) * 2012-08-17 2012-12-26 珠海世纪鼎利通信科技股份有限公司 Large data volume information storage and access method
CN103268336A (en) * 2013-05-13 2013-08-28 刘峰 Fast data and big data combined data processing method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于云计算的大数据挖掘平台 [Big data mining platform based on cloud computing]; 何清 et al.; 中兴通讯技术 (ZTE Technology Journal); 2013-08-31; Vol. 19, No. 4; pp. 32-38 *

Also Published As

Publication number Publication date
CN103729428A (en) 2014-04-16

Similar Documents

Publication Publication Date Title
CN103729428B (en) Big data classification method and system
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN104318340B (en) Information visualization methods and intelligent visible analysis system based on text resume information
CN112199520A (en) Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN105393264A (en) Interactive segment extraction in computer-human interactive learning
CN110689081A (en) Weak supervision target classification and positioning method based on bifurcation learning
CN104751182A (en) DDAG-based SVM multi-class classification active learning algorithm
CN106778832A (en) The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN110633365A (en) Word vector-based hierarchical multi-label text classification method and system
CN111339407B (en) Implementation method of information extraction cloud platform
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
CN111582506A (en) Multi-label learning method based on global and local label relation
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
CN116383399A (en) Event public opinion risk prediction method and system
CN115329120A (en) Weak label Hash image retrieval framework with knowledge graph embedded attention mechanism
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
Yao Design and simulation of integrated education information teaching system based on fuzzy logic
Lei et al. An input information enhanced model for relation extraction
Wang et al. Distant supervised relation extraction with position feature attention and selective bag attention
Liu et al. Community-based dandelion algorithm-enabled feature selection and broad learning system for traffic flow prediction
CN116226404A (en) Knowledge graph construction method and knowledge graph system for intestinal-brain axis
CN117151222A (en) Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium
Cao Design and optimization of a decision support system for sports training based on data mining technology
CN114860852A (en) Knowledge graph construction method for military field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant