CN108021679A

CN108021679A - A parallelized text classification method for power equipment defects

Info

Publication number: CN108021679A
Application number: CN201711288010.XA
Authority: CN
Inventors: 杨祎; 宇文梦柯; 王智翔; 白德盟; 辜超; 郭志红; 陈玉峰; 闫丹凤; 李贞�; 林颖; 李程启; 秦佳峰; 郑文杰; 李娜
Original assignee: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date: 2017-12-07
Filing date: 2017-12-07
Publication date: 2018-05-11

Abstract

The invention discloses a parallelized text classification method for power equipment defects. A domain dictionary is added to the user dictionary, and the defect cases are preprocessed by word segmentation and stop-word removal. A text corpus of power grid fault cases is collected with a crawler and used to train Spark's word2vec, yielding word vectors for the domain. The preprocessed defect cases are vectorized with these word vectors so that each case has a text representation, and the representations form a matrix. The matrix is input to an SVM multi-class classifier, which is trained and then used for classification to obtain the classification results.

Description

A parallelized text classification method for power equipment defects

Technical field

The present invention relates to a parallelized text classification method for power equipment defects.

Background

A text classification algorithm mainly comprises four steps: preprocessing, text feature extraction, text representation, and classification. For Chinese text, preprocessing mainly consists of word segmentation and stop-word removal. Text feature extraction mainly uses methods based on word-frequency statistics, represented by tf-idf and TextRank, and methods based on topic models, represented by LDA. Text representation mainly uses either the one-hot scheme, which ignores context, or schemes based on word2vec. For the final classification step, general-purpose data mining classification algorithms can be considered. In a text classification task for a specific domain, the main issue is to adapt the preprocessing, feature extraction, and other steps to the linguistic and professional characteristics of that domain. The algorithm also needs to be adapted to the length of the texts being classified. Long texts can usually be classified directly with the above pipeline, and the results are generally better than for short texts, mainly because long texts carry more information. If the same pipeline is applied directly to short texts, their already sparse features can be lost. For short texts, therefore, usually only stop-word filtering is applied, and keywords are not screened further with algorithms such as tf-idf.

For electric power defect texts, the severity of a defect has in the past been judged manually from the defect description, based on experience, and assigned to one of three categories: "serious", "general", and "critical". This not only incurs substantial labor cost but also leads to inconsistent judgments because of subjective differences between people. Automatic classification by means of a text classification algorithm is therefore very valuable, but research on classifying defect cases in the power domain is still scarce.

Word segmentation is generally based on a default dictionary. This gives accurate segmentation for general public-domain text, but for the texts in this scenario the default dictionary alone rarely gives good results. The domain must be taken into account by adding a specialized power-industry dictionary to the default dictionary of ansj, because accurate segmentation is an important prerequisite for training word2vec accurately.

Likewise, word2vec is usually trained on a general corpus, whereas the texts addressed by the invention are highly specialized. A large amount of text from the domain therefore has to be collected first and used to train the word2vec word vectors. The subsequent text representation is then built on this training result.

The pipeline is built on the Spark parallel framework in order to compute efficiently on big-data input. However, the SVM classification package in the platform's mllib is a binary classifier, which makes it difficult to handle the multi-class scenario encountered here.

Summary of the invention

To solve the above problems, the present invention proposes a parallelized text classification method for power equipment defects. The invention addresses the classification of defect urgency for highly specialized electric power defect case texts, for which a traditional analysis pipeline applied directly rarely gives satisfactory classification results. When the data volume is large, the analysis can be completed efficiently on the Spark parallel framework, enabling classification analysis at big-data scale.

To achieve these goals, the present invention adopts the following technical solution:

A parallelized text classification method for power equipment defects, comprising the following steps:

(1) adding a domain dictionary to the user dictionary, and preprocessing the defect cases by word segmentation and stop-word removal;

(2) collecting a text corpus of power grid fault cases with a crawler, and training it with Spark's word2vec to obtain the word vectors of the domain;

(3) converting the words of the original defect cases obtained in step (1) into the corresponding word vectors from step (2), and building a text representation of the case data in matrix form;

(4) inputting the matrix into the SVM multi-class classifier, which is trained and then used for classification to obtain the classification results.

Further, the order of step (1) and step (2) may be exchanged.

Further, in step (1), word segmentation is performed as follows: the text data is read from HDFS into the program's data structure, one text per line, and stored as RDD[String]; the domain dictionary is imported into the user dictionary of ansj by calling the Library.makeForest interface in ansj, which completes the segmentation dictionary and yields the full dictionary used as the basis for segmentation; each corpus entry is segmented with Spark's map operation in precise segmentation mode, that is, by calling the ToAnalysis.parse interface in ansj, so that the map operator segments all sentences in parallel.

In step (1), stop words are removed as follows: the stop-word list is imported from HDFS into the program's data structure, with one stop word per line in the original input, and stored as RDD[String]; Spark's map operator is applied to each segmented result, and every word that appears in the stop-word set is filtered out, so that the map operator filters all texts in parallel; the result is arranged as RDD[Array[String]], one processed case per line, each consisting of words separated by spaces; the processed result is kept in the data structure and written to HDFS in txt format.

Further, in step (2), a large number of texts from the domain are collected by means of a crawler as part of the training corpus for the domain word vectors. The collected external data is merged with the defect cases to be analyzed to form the training corpus, which is preprocessed by word segmentation and stop-word removal. Spark's word2vec package is called: the word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained word vectors. The case texts to be analyzed are read from HDFS into the data structure, and each word in a case is replaced with its trained vector.

Further, the vectors of the words of each case are averaged to give the overall feature of that case. The results are arranged so that each line corresponds to the feature of one case, in the form "D_j urgency category label", and are written to HDFS in txt format.

Further, in step (3), the text feature data is imported into the data structure and the case data is split into a training set and a test set. The number of iterations is set, the model is built with stochastic gradient descent and trained on the training set, and the training result is evaluated by precision or recall. If the evaluation does not meet the set condition, the iteration parameters and model parameters are adjusted again until the output meets the set condition.

Further, in step (4), the SVM algorithm is improved so that it can handle multi-class scenarios. The specific improvement is as follows:

(4-1) the original case data is divided by urgency class, and the resulting sub data sets are combined in pairs to form new combined data sets;

(4-2) each combined data set in the training set is input to Spark's binary SVM package to train a model;

(4-3) the data in the test set is input to each of the three trained binary SVM classifiers for classification; each classifier casts a vote according to its classification result, and after all three classifiers have voted the votes are summed to give the final classification result.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The invention classifies the urgency of power equipment defect case texts more accurately than general-purpose algorithms; designing the algorithm around the characteristics of the domain improves the reliability of the overall algorithm. The pipeline is also parallelized on Spark, so that, compared with a serial algorithm, it adapts better to big data and reduces running time.

The invention classifies defect urgency for power equipment defect case texts. In the word2vec word-vector step, the vectors are trained on a domain corpus. Compared with the usual result of training on an open corpus, this result reflects the descriptive language of the domain more accurately, and it can be used directly in other applications on texts of this domain, such as clustering and association analysis.

In addition, the invention rewrites the binary classification algorithm of the Spark platform into a multi-class algorithm while retaining the highly parallel nature of the complex SVM classification process, filling the gap of SVM multi-class methods on the Spark framework. The whole analysis pipeline is designed and implemented in parallel on Spark and, compared with an ordinary non-parallel approach, adapts better to the big-data scenarios found in practice.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their description are used to explain the application and do not unduly limit it.

Fig. 1 is the overall flow chart of the present invention;

Fig. 2 is the flow chart of the preprocessing process of the present invention;

Fig. 3 is the flow chart of the feature representation process of the present invention;

Fig. 4 is the flow chart of the classification process of the present invention;

Fig. 5 is the flow chart of the SVM multi-class classification of the present invention;

Fig. 6 is a schematic comparison of the execution times of the present invention at different data scales.

Detailed description:

The invention is further described below with reference to the accompanying drawings and embodiments.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field of the application.

It should be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are used only to describe the structural relationships of the components or elements of the invention, do not refer specifically to any component or element of the invention, and are not to be understood as limiting the invention.

In the present invention, terms such as "fixed", "connected", and "connection" should be interpreted broadly: a connection may be fixed, integral, or detachable, and may be direct or indirect through an intermediary. Researchers or technicians in the field can determine the specific meaning of these terms in the present invention according to the circumstances, and the terms are not to be understood as limiting the invention.

The present invention addresses the classification of defect urgency for highly specialized electric power defect case texts, for which a traditional analysis pipeline applied directly rarely gives satisfactory classification results. When the data volume is large, the analysis can be completed efficiently on the Spark parallel framework, enabling classification analysis at big-data scale.

As shown in Fig. 1, the process consists of four main steps: text preprocessing, training of the domain word2vec word vectors, text feature representation, and urgency classification using SVM.

The preprocessing step mainly comprises word segmentation and stop-word removal. Training the word2vec word vectors requires the collected external corpus of domain-related text, to ensure that the result is general. Text feature representation mainly uses the average of the word vectors. The improved SVM multi-class classifier is first trained and then used to classify the cases. The four steps are described in detail below.

Text Pretreatment and parallelization

This step mainly comprises word segmentation and stop-word removal, together with the Spark-based parallel design of the preprocessing. The flow is shown in Fig. 2.

The operating steps are as follows:

1. The text data is read from HDFS into the program's data structure, one text per line, and stored as RDD[String];

2. The domain dictionary is imported into the user dictionary of ansj by calling the Library.makeForest interface in ansj, which completes the segmentation dictionary. The result is the full dictionary (default dictionary + domain dictionary) used as the basis for segmentation;

3. Each corpus entry is segmented with Spark's map operation, here in precise segmentation mode, that is, by calling the ToAnalysis.parse interface in ansj. The parallelism of this step lies in the map operator segmenting all sentences in parallel, which saves time;

4. The stop-word list is imported from HDFS into the program's data structure, with one stop word per line in the original input, and stored as RDD[String];

5. Spark's map operator is applied to each segmented result to remove stop words. That is, every word produced by segmentation that appears in the stop-word set is filtered out. The parallelism of this step lies in the map operator filtering all texts in parallel;

6. After the above processing, the result is arranged as RDD[Array[String]], one processed case per line, each consisting of words separated by spaces. The processed result is kept in the data structure and written to HDFS in txt format.
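The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the ansj or Spark implementation: the forward maximum matching segmenter is a simple stand-in for ansj's ToAnalysis.parse, the list comprehension stands in for an RDD map, and the function names and sample words are hypothetical.

```python
from typing import Iterable, List, Set


def build_dictionary(default_words: Iterable[str], domain_words: Iterable[str]) -> Set[str]:
    """Merge the default dictionary with the domain dictionary (step 2)."""
    return set(default_words) | set(domain_words)


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: a simple stand-in for a real segmenter (step 3)."""
    max_len = max((len(w) for w in dictionary), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def preprocess(lines: Iterable[str], dictionary: Set[str], stop_words: Set[str]) -> List[str]:
    """Segment each line, drop stop words, join the rest with spaces (steps 3 to 6)."""
    # On Spark, this per-line work would run inside a map over the RDD.
    return [
        " ".join(w for w in segment(line.strip(), dictionary) if w not in stop_words)
        for line in lines
    ]


# Hypothetical sample: a default dictionary plus three power-domain terms.
dictionary = build_dictionary(["的"], ["变压器", "套管", "渗油"])
print(preprocess(["变压器的套管渗油"], dictionary, stop_words={"的"}))
```

Without the domain terms in the dictionary, the same sentence would fall apart into single characters, which is the effect the domain dictionary is meant to prevent.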

This step mainly comprises training to obtain the vectorized representation of the domain vocabulary, converting the words of each case into their vectors, and vectorizing each case text to obtain its feature. It also includes the parallel design based on Spark. The flow is shown in Fig. 3.

The operating steps are as follows:

1. A large number of texts from the domain are collected by means such as crawlers as part of the training corpus for the domain word vectors. In the present invention, 62643 relevant papers were crawled from Hownet using Python and the Scrapy framework as external data;

2. The collected external data is merged with the defect cases to be analyzed to form the training corpus. This corpus is segmented as in Step 1 to Step 3 of the preprocessing part, and the segmented result is kept in the data structure;

3. Spark's word2vec package is called. The word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained word vectors. The result is arranged in the form "word vector", one entry per line, and written to HDFS in txt format. The parallelism of this step comes from calling the word2vec package of the Spark framework: because the package is written on top of the platform's mechanisms, the complex training process is parallelized on Spark to the greatest extent. During training, the vector dimension setVectorSize is set to 200 and setMinCount is set to 0; the other parameters keep their defaults;

4. The case texts to be analyzed are read from HDFS into the data structure, and each word in a case is replaced with its trained vector;

5. The vectors of the words of each case are averaged to give the overall feature of the case. That is, if the vectorized representation of word i in a document is $W_i=(w_{i1},w_{i2},\dots,w_{i200})$ and the m words of case j are represented after vectorization as $M_{D_j}=(W_{j1},W_{j2},\dots,W_{jm})$, then the feature vector of the case is $D_j=\frac{1}{m}\sum_{i=1}^{m}W_{ji}$.

The results of the previous step are arranged so that each line corresponds to the feature of one case, in the form "D_j urgency category label". Here D_j is a 1×200 vector whose elements are separated by spaces. In the present invention the three classes "critical", "serious", and "general" are labeled 3, 2, and 1 respectively, and the category label and D_j are separated by an English comma ",". The results are arranged in this format and written to HDFS in txt format.
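The averaging and output format can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, words without a trained vector are skipped, an empty case maps to a zero vector, and the vector is written before the label (the text does not fix that order). The toy vectors have 2 dimensions instead of 200.

```python
from typing import Dict, List, Sequence


def average_vector(words: Sequence[str], vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Mean of the vectors of the words that have a trained embedding."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim  # assumption: all-zero feature for a case with no known word
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def format_row(feature: Sequence[float], label: int) -> str:
    """Space-separated vector elements, then a comma, then the class label."""
    return " ".join(repr(x) for x in feature) + "," + str(label)


# Toy word vectors; label 3 stands for "critical" as in the text.
toy_vectors = {"套管": [1.0, 2.0], "渗油": [3.0, 4.0]}
feature = average_vector(["套管", "渗油"], toy_vectors, dim=2)
print(format_row(feature, label=3))
```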

Text classification and parallelization, as shown in Fig. 4:

This step mainly comprises five operations: splitting the case data into a training set and a test set, building the SVM model, training the model, evaluating the result with evaluation metrics, and tuning the parameters. The module first rewrites the binary SVM package in mllib into a multi-class algorithm, which is then encapsulated and applied in the classification pipeline. The Spark-based parallelism of this flow comes from calling and improving the native SVM package.

The overall operating steps of the process are as follows:

1. The vectorized text features on HDFS are imported into a data structure in LIBSVM format;

2. The data is loaded with MLUtils.loadLibSVMFile and its randomSplit interface is called to divide the case data into a training set and a test set in the proportions 60% and 40%;

3. The number of iterations numIterations is set to 150 and the other parameters keep their default values; the SVMWithSGD.train operator is called, which builds the model with stochastic gradient descent;

4. The model is trained on the training set, and the model.predict operator is called to predict the classes of the test set;

5. The predictions are compared with the actual results, and precision, recall, and the F1 value are used as evaluation metrics to assess the result;

6. Return to Step 3 and reset the parameters; Steps 3 to 6 are repeated until a satisfactory result is reached, finally giving the corresponding parameters and model.
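The split and training steps above can be illustrated with a small pure-Python sketch. It is a simplified stand-in for what a call such as SVMWithSGD.train does conceptually, not Spark's implementation: it runs full-batch subgradient descent on the L2-regularized hinge loss without an intercept, and the step size, regularization value, decay schedule, function names, and toy data are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[int, List[float]]  # (label in {0, 1}, feature vector)


def random_split(data: Sequence[Sample], train_fraction: float = 0.6,
                 seed: int = 42) -> Tuple[List[Sample], List[Sample]]:
    """Assign each sample to the training set with probability train_fraction."""
    rng = random.Random(seed)
    train: List[Sample] = []
    test: List[Sample] = []
    for sample in data:
        (train if rng.random() < train_fraction else test).append(sample)
    return train, test


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(train: Sequence[Sample], num_iterations: int = 150,
                     step_size: float = 1.0, reg_param: float = 0.01) -> List[float]:
    """Subgradient descent on the L2-regularized hinge loss (no intercept)."""
    dim = len(train[0][1])
    w = [0.0] * dim
    for t in range(1, num_iterations + 1):
        grad = [0.0] * dim
        for label, x in train:
            y = 2 * label - 1  # map {0, 1} to {-1, +1}
            if y * dot(w, x) < 1.0:  # sample violates the margin
                for k in range(dim):
                    grad[k] -= y * x[k]
        eta = step_size / (t ** 0.5)  # decaying step size
        w = [wi - eta * (g / len(train) + reg_param * wi) for wi, g in zip(w, grad)]
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    return 1 if dot(w, x) >= 0.0 else 0


# Toy, linearly separable data standing in for two urgency classes.
toy = [(1, [2.0, 1.0]), (1, [3.0, 2.0]), (1, [2.5, 0.5]),
       (0, [-2.0, -1.0]), (0, [-3.0, -2.0]), (0, [-1.0, -2.5])]
weights = train_linear_svm(toy, num_iterations=150)
print([predict(weights, x) for _, x in toy])
```

In the tuning loop of Step 6, the values passed to the training function are what would be changed between rounds.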

The present invention improves the SVM package of mllib so that it can handle multi-class scenarios. The main idea is described as being based on the one-to-many method; in the steps below, one binary classifier is trained for each pair of classes and the classifiers are combined by voting.

The concrete operating steps are as follows:

1. The vectorized text data of the three categories "general", "serious", and "critical" is separated. The data sets are combined in pairs, as ("general", "serious"), ("critical", "serious"), and ("general", "critical"), to form three new combined data sets;

2. Each combined data set in the training set is input to Spark's binary SVM package to train a model (the specific steps are shown in Fig. 4);

3. For each data item in the test set, its initial vote count is set to ("general", "serious", "critical") = (0, 0, 0);

4. The data in the test set is input to each of the three trained binary SVM classifiers for classification. Each classifier votes according to its classification result. For example, if the ("general", "critical") classifier judges the class to be "general", then $v_{(\text{general},\,\text{critical})} = (1, 0, 0)$, where each vote vector lists the votes for ("general", "serious", "critical");

5. After all three classifiers have voted, the votes are summed:

$v = v_{(\text{general},\,\text{serious})} + v_{(\text{critical},\,\text{serious})} + v_{(\text{general},\,\text{critical})}$.

The final class is the one corresponding to the maximum component of $v$.
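The pairwise combination and voting can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the binary classifiers are passed in as plain functions (in the described pipeline they would be the trained binary SVM models), all names are hypothetical, and a three-way tie is broken by class order, which the text does not specify.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

CLASSES = ("general", "serious", "critical")
Features = Sequence[float]
Sample = Tuple[str, Features]                 # (class name, feature vector)
BinaryClassifier = Callable[[Features], str]  # returns one of its two class names


def pairwise_datasets(samples: Sequence[Sample]) -> Dict[Tuple[str, str], List[Sample]]:
    """Split the data by class, then build one two-class data set per pair."""
    by_class: Dict[str, List[Sample]] = {c: [] for c in CLASSES}
    for label, x in samples:
        by_class[label].append((label, x))
    return {(a, b): by_class[a] + by_class[b] for a, b in combinations(CLASSES, 2)}


def vote(classifiers: Dict[Tuple[str, str], BinaryClassifier], x: Features) -> str:
    """Each pairwise classifier casts one vote; the class with most votes wins."""
    tally = {c: 0 for c in CLASSES}  # initial votes (0, 0, 0)
    for classify in classifiers.values():
        tally[classify(x)] += 1
    # Assumption: on a tie, the first class in CLASSES order is returned.
    return max(CLASSES, key=lambda c: tally[c])


# Hypothetical stub classifiers that threshold a single severity score.
stubs: Dict[Tuple[str, str], BinaryClassifier] = {
    ("general", "serious"): lambda x: "serious" if x[0] > 1.0 else "general",
    ("general", "critical"): lambda x: "critical" if x[0] > 1.5 else "general",
    ("serious", "critical"): lambda x: "critical" if x[0] > 2.0 else "serious",
}
print(vote(stubs, [1.8]))
```

With three classes there are exactly three pairs, which matches the three classifiers described in the steps above.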

In the text classification step, the underlying SVM package is a native algorithm tool of the Spark platform, and its development made full use of the parallel features of the framework. The parallel performance of this step is therefore also satisfactory.

As an application example, all experiments in the present invention were run on a locally built Spark cluster comprising 1 master node and 3 slave nodes. The cluster has 2.88 TB of disk and 32 GB of memory in total. The Spark version is 1.6.0 and the Hadoop version is 2.7.0.

The experiment is evaluated mainly in terms of classification accuracy. The final results are measured by precision P, recall R, and the F1 value, which are calculated as follows:
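The standard definitions of these three metrics, which the description appears to rely on, are given below. The symbols TP, FP, and FN (true positives, false positives, and false negatives for one urgency class) are assumptions introduced here; how the per-class values are combined over the three classes is not stated in the text.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
```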

The following three schemes are compared with the scheme of the present invention:

Comparison scheme 1: tf-idf representation + naive Bayes;

Comparison scheme 2: tf-idf representation + SVM;

Comparison scheme 3: word2vec trained on a general corpus + SVM;

The present invention: word2vec trained on the domain corpus + SVM;

Table 1. Comparison of the classification results of the different schemes

The comparison above shows that the schemes based on word2vec + SVM generally perform better than the others. Among them, the word2vec vectors trained on the domain corpus adapt better to the classification task of this scenario than those trained on a general corpus.

To verify the speed-up of the algorithm after parallelization, the data set was prepared at four scales: 200K, 20M, 500M, and 1G. For parallel execution on the Spark framework, each executor has a fixed number of cores, and the number of cores directly determines how many tasks run in parallel within each executor. The more cores are configured in total, therefore, the higher the degree of parallelism of the program. Because the cluster has 48 cores in total, the product of num-executors and executor-cores must be less than 48. After experimental tuning, the parallel tests used the following parameter configuration, which gives 12 × 3 = 36 cores:

--deploy-mode cluster

--master yarn-cluster

--num-executors 12

--executor-cores 3

--executor-memory 16G

--driver-memory 8G

The algorithm flow of the present invention was run on the data of the four scales with the single-machine parameters (num-executors=1) and with the parallel parameters above. The execution times are shown in Fig. 6.

At all four scales the single-machine run takes longer than the parallel run. As the data set grows, the single-machine time increases sharply, whereas the time of the parallel algorithm grows more gently. In summary, the parallel algorithm takes less time than the single-machine algorithm, and its advantage becomes more pronounced as the data scale grows.

The present invention addresses the task of classifying defect urgency for power equipment defect case texts. In the word2vec word-vector step, the vectors are trained on a domain corpus. Compared with the usual result of training on an open corpus, this result reflects the descriptive language of the domain more accurately, and it can be used directly in other applications on texts of this domain, such as clustering and association analysis. In addition, the binary classification algorithm of the Spark platform is rewritten into a multi-class algorithm while retaining the highly parallel nature of the complex SVM classification process, filling the gap of SVM multi-class methods on the Spark framework. The whole analysis pipeline is designed and implemented in parallel on Spark and, compared with an ordinary non-parallel approach, adapts better to the big-data scenarios found in practice.

The above are only preferred embodiments of the application and do not limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the application shall fall within the scope of protection of the application.

Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made without creative effort on the basis of the technical solution of the present invention remain within the scope of protection of the invention.

Claims

1. A parallelized text classification method for power equipment defects, characterized by comprising the following steps:

(1) adding a domain dictionary to the user dictionary, and preprocessing the defect cases by word segmentation and stop-word removal;

(2) collecting a text corpus of power grid fault cases with a crawler, and training it with Spark's word2vec to obtain the word vectors of the domain;

(3) converting the words of the original defect cases obtained in step (1) into the corresponding word vectors from step (2), and building a text representation of the case data in matrix form;

(4) inputting the matrix into the SVM multi-class classifier, which is trained and then used for classification to obtain the classification results.

2. The parallelized text classification method for power equipment defects according to claim 1, characterized in that the order of step (1) and step (2) is exchanged.

3. The parallelized text classification method for power equipment defects according to claim 1, characterized in that, in step (1), word segmentation is performed as follows: the text data is read from HDFS into the program's data structure, one text per line, and stored as RDD[String].

4. The parallelized text classification method for power equipment defects according to claim 3, characterized in that the domain dictionary is imported into the user dictionary of ansj by calling the Library.makeForest interface in ansj, which completes the segmentation dictionary and yields the full dictionary used as the basis for segmentation; each corpus entry is segmented with Spark's map operation in precise segmentation mode, that is, by calling the ToAnalysis.parse interface in ansj, so that the map operator segments all sentences in parallel.

5. The parallelized text classification method for power equipment defects according to claim 1, characterized in that, in step (1), stop words are removed as follows: the stop-word list is imported from HDFS into the program's data structure, with one stop word per line in the original input, and stored as RDD[String]; Spark's map operator is applied to each segmented result, and every word that appears in the stop-word set is filtered out, so that the map operator filters all texts in parallel; the result is arranged as RDD[Array[String]], one processed case per line, each consisting of words separated by spaces; the processed result is kept in the data structure and written to HDFS in txt format.

6. The parallelized text classification method for power equipment defects according to claim 1, characterized in that, in step (2), a large number of texts from the domain are collected by means of a crawler as part of the training corpus for the domain word vectors; the collected external data is merged with the defect cases to be analyzed to form the training corpus, which is preprocessed by word segmentation and stop-word removal; Spark's word2vec package is called, the word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained word vectors; the case texts to be analyzed are read from HDFS into the data structure, and each word in a case is replaced with its trained vector.

7. The parallelized text classification method for power equipment defects according to claim 6, characterized in that the vectors of the words of each case are averaged to give the overall feature of that case; the results are arranged so that each line corresponds to the feature of one case, in the form "D_j urgency category label", and are written to HDFS in txt format.

8. The parallelized text classification method for power equipment defects according to claim 1, characterized in that, in step (3), the text feature data is imported into the data structure and the case data is split into a training set and a test set; the number of iterations is set, the model is built with stochastic gradient descent and trained on the training set, and the training result is evaluated by precision or recall; if the evaluation does not meet the set condition, the iteration parameters and model parameters are adjusted again until the output meets the set condition.

9. The parallelized text classification method for power equipment defects according to claim 1, characterized in that, in step (4), the SVM algorithm is improved so that it can handle multi-class scenarios, the specific improvement being as follows:

(4-1) the original case data is divided by urgency class, and the resulting sub data sets are combined in pairs to form new combined data sets;

(4-2) each combined data set in the training set is input to Spark's binary SVM package to train a model;

(4-3) the data in the test set is input to each of the three trained binary SVM classifiers for classification; each classifier casts a vote according to its classification result, and after all three classifiers have voted the votes are summed to give the final classification result.