CN108021679A - A parallelized text classification method for power equipment defects

Info

Publication number
CN108021679A
Authority
CN
China
Prior art keywords
case
result
data
text
parallelization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711288010.XA
Other languages
Chinese (zh)
Inventor
杨祎
宇文梦柯
王智翔
白德盟
辜超
郭志红
陈玉峰
闫丹凤
李贞�
林颖
李程启
秦佳峰
郑文杰
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Beijing University of Posts and Telecommunications, Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201711288010.XA priority Critical patent/CN108021679A/en
Publication of CN108021679A publication Critical patent/CN108021679A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a parallelized text classification method for power equipment defects. A domain dictionary is added to the user dictionary, and the defect cases are preprocessed by word segmentation and stop-word removal. A text corpus of power grid fault cases is collected with a web crawler and used to train Spark's word2vec, yielding word vectors for the domain. The words of the preprocessed defect cases are replaced by their word vectors, so that the defect cases are represented as a matrix. The matrix is input to an SVM multi-class classifier, which is trained and then used for classification to obtain the classification results.

Description

A parallelized text classification method for power equipment defects
Technical field
The present invention relates to a parallelized text classification method for power equipment defects.
Background
A text classification algorithm mainly comprises four steps: preprocessing, text feature extraction, text representation, and classification. For Chinese text, preprocessing mainly consists of word segmentation and stop-word removal. Text feature extraction mainly uses methods based on word-frequency statistics, represented by tfidf and Textrank, and methods based on topic models, represented by lda. Text representation mainly uses either the one-hot scheme, which ignores context, or schemes based on word2vec. For the final classification step, general data-mining classification algorithms can be considered. In a text classification task for a specific domain, the main concern is to adapt preprocessing, feature extraction, and the other steps to the language characteristics and specialized vocabulary of that domain. The algorithm must also be adapted to the length of the texts being classified. Long texts can usually be classified directly with the flow above, and the results are generally better than for short texts, mainly because long texts carry more information. Applying the same flow directly to short texts loses features that are already sparse, so for short texts one usually only filters stop words and does not further screen keywords with algorithms such as tfidf.
In electric power defect texts, the severity of a defect has in the past been judged manually from the defect description, based on experience, and assigned to one of three categories: "serious", "general", or "critical". This not only incurs a large labor cost but also produces inconsistent judgments because of subjective differences between people. Automatic classification with a text classification algorithm is therefore very worthwhile, but there has so far been little research on classifying defect cases in the power domain.
Word segmentation is usually based on a default dictionary, which segments general public-domain text accurately. For the texts in this scenario, however, the default dictionary alone does not give good results. The domain must be taken into account by adding a specialized power-industry dictionary to the default dictionary of ansj, because accurate segmentation is an important prerequisite for training word2vec accurately.
Similarly, word2vec is usually trained on a general corpus, whereas the texts targeted by this invention are highly specialized. A large amount of text from the domain must therefore be collected first and used to train the word2vec word vectors. The subsequent text representation then builds on this training result.
The flow is built on the Spark parallel framework in order to compute efficiently on big-data inputs. However, the SVM classification package in the platform's mllib is a binary classifier, which cannot directly handle the multi-class problem encountered in this scenario.
Summary of the invention
To solve the above problems, the present invention proposes a parallelized text classification method for power equipment defects. Electric power defect case texts are highly specialized, and classifying defect urgency directly with a traditional analysis flow rarely gives satisfactory results. The invention addresses this problem, and when the data volume is large it completes the analysis efficiently on the Spark parallel framework, enabling classification analysis at big-data scale.
To achieve the above object, the present invention adopts the following technical solution:
A parallelized text classification method for power equipment defects, comprising the following steps:
(1) adding a domain dictionary to the user dictionary, and preprocessing the defect cases by word segmentation and stop-word removal;
(2) collecting a text corpus of power grid fault cases with a web crawler, and training Spark's word2vec on it to obtain word vectors for the domain;
(3) converting the words of the original defect cases obtained in step (1) into the corresponding word vectors of step (2), and representing the case data as text features in matrix form;
(4) inputting the matrix into an SVM multi-class classifier, which is trained and then used for classification to obtain the classification results.
Further, the order of step (1) and step (2) may be exchanged.
Further, in step (1), word segmentation proceeds as follows: the text data is read from HDFS into a program data structure, one text per line, stored as RDD[String]; the domain dictionary is imported into the user dictionary of ansj by calling the Library.makeForest interface of ansj, which completes the segmentation dictionary and yields the full dictionary used as the basis for segmentation; each text is segmented with Spark's map operation using precise segmentation, that is, the ToAnalysis.parse interface of ansj, so that the map operator segments all sentences in parallel.
In step (1), stop-word removal proceeds as follows: the stop-word list is imported from HDFS into a program data structure, with one stop word per line in the input, stored as RDD[String]; Spark's map operator applies stop-word removal to each segmented result, filtering out every word that appears in the stop-word set, so that all texts are filtered in parallel; the results are organized as RDD[Array[String]], one processed case per line, each consisting of several words separated by spaces; the processed results are kept in the data structure and written to HDFS in txt format.
Further, in step (2), a large amount of domain text is collected by web crawling as part of the training corpus for the domain word vectors; the collected external data is merged with the defect cases to be analyzed to form the training corpus, which is segmented and stripped of stop words; Spark's word2vec package is called, the word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained word vectors; the case texts to be analyzed are read from HDFS into a data structure, and each word in a case is replaced with its trained vector.
Further, the vectors of the words in each case are averaged to give the overall feature of the case; the results are organized so that each line holds the feature of one case in the form "D_j, urgency-level label", and are written to HDFS in txt format.
Further, in step (3), the text feature data is imported into a data structure, the case data is split into a training set and a test set, the number of iterations is set, the model is built with stochastic gradient descent and trained on the training set, and the result is evaluated by precision or recall; if the evaluation does not meet the set condition, the iteration parameters and model parameters are adjusted again until the output meets the set condition.
Further, in step (4), the SVM algorithm is improved so that it can handle multi-class problems. The specific improvement is:
(4-1) the original case data is divided by urgency class, and the resulting sub-datasets are combined in pairs to form new combined datasets;
(4-2) each combined dataset from the training set is input into Spark's SVM binary classification package to train a model;
(4-3) the test-set data is input into each of the three trained SVM binary classifiers for class judgment; each classifier votes according to its classification result, and after all three classifiers the votes are added to obtain the final classification result.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention classifies the urgency of power equipment defect case texts more accurately than general-purpose algorithms, because the algorithm design incorporates the characteristics of the domain, which improves overall reliability. The flow is also parallelized on Spark, so that compared with a serial algorithm it adapts better to big-data conditions and reduces running time.
For the task of classifying the defect urgency of power equipment defect case texts, the word2vec word-vector stage is trained on a domain corpus. Compared with the usual training on an open corpus, the result reflects the descriptive language of the domain more accurately, and it can be reused directly in other applications on domain text, such as clustering and association analysis.
The present invention also rewrites the binary classification algorithm of the Spark platform into a multi-class algorithm, retaining the highly parallel nature of the complex SVM classification process and filling the gap of SVM multi-class classification on the Spark framework. The whole analysis flow is designed and implemented in parallel on Spark, so that compared with a general non-parallel approach it adapts better to the big-data scenarios of practical applications.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide further understanding of the application. The illustrative embodiments and their descriptions explain the application and do not unduly limit it.
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the flow chart of the preprocessing process;
Fig. 3 is the flow chart of the feature representation process;
Fig. 4 is the classification flow chart;
Fig. 5 is the flow chart of SVM multi-class classification;
Fig. 6 is a schematic comparison of execution times at different data scales.
Embodiment:
The invention is further described below with reference to the accompanying drawings and embodiments.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field of this application.
It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of this application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
In the present invention, terms such as "up", "down", "left", "right", "front", "rear", "vertical", "horizontal", "side", and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are relative terms used only to describe the structural relationships of the components or elements of the invention, do not refer specifically to any component or element, and should not be understood as limiting the invention.
In the present invention, terms such as "affixed", "connected", and "connection" should be interpreted broadly: they may denote a fixed connection, an integral connection, or a detachable connection, and the connection may be direct or indirect through an intermediary. Researchers and technicians in the field can determine the specific meaning of these terms in the present invention according to the circumstances, and the terms should not be understood as limiting the invention.
Electric power defect case texts are highly specialized, and classifying defect urgency directly with a traditional analysis flow rarely gives satisfactory results. The present invention addresses this problem, and when the data volume is large it completes the analysis efficiently on the Spark parallel framework, enabling classification analysis at big-data scale.
As shown in Fig. 1, the process consists of four steps: text preprocessing, training of the domain word2vec word vectors, text feature representation, and urgency classification with SVM.
Preprocessing mainly comprises word segmentation and stop-word removal. Training the word2vec word vectors requires the collected external corpus of related domain text, to ensure the generality of the result. Text feature representation mainly averages the word vectors. The modified SVM multi-class classifier is first trained and then used to classify the cases. The four steps are described in detail below.
Text preprocessing and parallelization
This step mainly comprises word segmentation and stop-word removal, together with the Spark-based parallel design of the preprocessing. The flow is shown in Fig. 2.
The operating steps of this stage:
1. Read the text data from HDFS into a program data structure, one text per line, stored as RDD[String];
2. Import the domain dictionary into the user dictionary of ansj by calling the Library.makeForest interface of ansj, which completes the segmentation dictionary. The resulting full dictionary (default dictionary + domain dictionary) is the basis for segmentation;
3. Segment each text with Spark's map operation, using precise segmentation, that is, the ToAnalysis.parse interface of ansj. The parallelization of this step lies in the map operator, which segments all sentences in parallel and saves time;
4. Import the stop-word list from HDFS into a program data structure. The input has one stop word per line and is stored as RDD[String];
5. Apply stop-word removal to each segmented result with Spark's map operator. That is, every word produced by segmentation that appears in the stop-word set is filtered out. The parallelization of this step lies in the map operator, which filters all texts in parallel;
6. After the above processing, organize the results as RDD[Array[String]], one processed case per line, each consisting of several words separated by spaces. Keep the processed results in the data structure and write them to HDFS in txt format.
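The preprocessing steps above can be sketched on a single machine as follows. This is only an illustration under stated assumptions, not the patent's implementation: a simple forward maximum matching segmenter stands in for ansj's ToAnalysis.parse, a plain Python loop stands in for Spark's map operator, and the tiny dictionaries, the stop-word set, and the function names `segment` and `preprocess` are hypothetical.

```python
def segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest dictionary word.

    Stand-in for a real segmenter such as ansj; single characters are the fallback.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(cases, default_dict, domain_dict, stop_words):
    """Segment each case with the full dictionary, drop stop words, join with spaces."""
    full_dict = set(default_dict) | set(domain_dict)  # default dictionary + domain dictionary
    result = []
    for case in cases:  # in Spark this loop would be a map over an RDD[String]
        words = [w for w in segment(case, full_dict) if w not in stop_words]
        result.append(" ".join(words))
    return result


# Hypothetical toy data: one defect description, two small dictionaries, one stop word.
default_dict = {"变压器", "油", "异常"}
domain_dict = {"主变压器", "油位"}
stop_words = {"的"}

print(preprocess(["主变压器的油位异常"], default_dict, domain_dict, stop_words))
# ['主变压器 油位 异常']
print(preprocess(["主变压器的油位异常"], default_dict, set(), stop_words))
# ['主 变压器 油 位 异常']
```

The second call shows why the domain dictionary matters: without it, the domain terms are split into fragments that carry less meaning for the later word-vector training.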
The next stage, domain word-vector training and text feature representation, mainly comprises training the vector representation of the domain vocabulary, converting the words of each case into their corresponding vectors, and representing each case text as a vector that serves as the feature of that text. It also includes the Spark-based parallel design. The flow is shown in Fig. 3.
The operating steps of this stage:
1. Collect a large amount of domain text by means such as web crawling, as part of the training corpus for the domain word vectors. The present invention crawled 62643 related papers from Hownet using Python and the Scrapy framework as external data;
2. Merge the collected external data with the defect cases to be analyzed to form the training corpus. This training corpus is segmented, that is, Steps 1 to 3 of the preprocessing part are applied, and the segmented result is stored in a data structure;
3. Call Spark's word2vec package: the word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained vectors. The result is organized in the form "word vector", one entry per line, and written to HDFS in txt format. The parallelization of this step lies in calling the word2vec package of the Spark framework: because the package is written on the platform's own mechanisms, it parallelizes the complex training process on Spark as far as possible. During training, the vector dimension is set to 200 with setVectorSize, setMinCount is set to 0, and the other parameters keep their defaults;
4. Read the case texts to be analyzed from HDFS into a data structure, and replace each word in a case with its trained vector;
5. Average the vectors of the words in each case to obtain the overall feature of the case. That is, if the vector of word i in a document is $W_i = (w_{i1}, w_{i2}, \ldots, w_{i200})$ and the m words of case j after vectorization are $M\_D_j = (W_{j1}, W_{j2}, \ldots, W_{jm})$, then the feature vector of the case is $D_j = \frac{1}{m}\sum_{k=1}^{m} W_{jk}$;
6. Organize the result of the previous step so that each line holds the feature of one case, in the form "D_j, urgency-level label". Here D_j is a 1*200 vector whose elements are separated by spaces. In the present invention the three classes "critical", "serious", and "general" are labeled 3, 2, and 1 respectively, and the label and D_j are separated by an English comma ",". The results are arranged in this format and written to HDFS in txt format.
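The averaging and output format described above can be sketched as follows. This is a minimal single-machine illustration under assumptions: the 2-dimensional toy vectors stand in for the 200-dimensional trained vectors, the function names `case_feature` and `format_line` are hypothetical, and skipping words that have no vector is an added safeguard that the text does not describe.

```python
def case_feature(words, vectors, dim):
    """Average the word vectors of one case to get its feature vector D_j."""
    known = [vectors[w] for w in words if w in vectors]  # assumption: skip unknown words
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def format_line(feature, label):
    """Elements separated by spaces, then an English comma, then the class label."""
    return " ".join(str(x) for x in feature) + "," + str(label)


# Hypothetical toy vectors; the real ones would come from the trained word2vec model.
vectors = {"油位": [1.0, 2.0], "异常": [3.0, 4.0]}
feature = case_feature(["油位", "异常"], vectors, dim=2)
print(format_line(feature, 3))  # label 3 = "critical"
# 2.0 3.0,3
```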
Text classification and parallelization (as shown in Fig. 4):
This step mainly comprises five operations: splitting the case data into a training set and a test set, building the SVM model, training the model, evaluating the result with evaluation metrics, and tuning parameters. The module first rewrites the binary SVM package in mllib into a multi-class algorithm, then encapsulates it and applies it in the classification flow. The Spark-based parallelization of this flow lies in calling and improving the native SVM package.
The overall operating steps of the process:
1. Import the vectorized text features from HDFS into a data structure in LIBSVM format;
2. Load the data with MLUtils.loadLibSVMFile and call the randomSplit interface on the result to divide the case data into a training set and a test set in a 60%/40% ratio;
3. Set the number of iterations numIterations to 150, leave the other parameters at their default values, and call the SVMWithSGD.train operator, which builds the model by stochastic gradient descent;
4. Train the model on the training set, and call the model.predict operator to predict the class of each test-set case;
5. Compare the predictions with the actual results, and evaluate them using precision, recall, and the F1 value as metrics;
6. Return to step 3, reset the parameters, and repeat steps 3 to 6 until a satisfactory result is reached, finally obtaining the corresponding parameters and model.
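The split and the tune-until-satisfied loop of the steps above can be sketched as follows. This is a hedged illustration, not the Spark code: `split_cases` cuts a shuffled list at exactly 60%, whereas Spark's randomSplit samples each element independently so its split sizes are only approximate, and `tune`, `train_fn`, and `eval_fn` are hypothetical names for the retrain-and-evaluate cycle.

```python
import random


def split_cases(cases, train_fraction=0.6, seed=42):
    """Shuffle with a fixed seed and cut into a training set and a test set."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def tune(train_fn, eval_fn, candidates, target):
    """Try each candidate parameter setting until the evaluation score reaches the target."""
    for params in candidates:
        model = train_fn(params)   # corresponds to steps 3 and 4
        score = eval_fn(model)     # corresponds to step 5
        if score >= target:
            return params, model, score
    return None                    # no candidate was satisfactory


train, test = split_cases(range(10))
print(len(train), len(test))
# 6 4
```

In practice `train_fn` would wrap the SVM training call with a given iteration count, and `eval_fn` would compute one of the metrics on the test set.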
The present invention improves the SVM package of mllib so that it can handle multi-class problems. The main idea is a pairwise (one-versus-one) decomposition.
The concrete operating steps of the process:
1. Separate the vectorized text data of the three classes "general", "serious", and "critical". Combine the datasets in pairs as ("general", "serious"), ("critical", "serious"), and ("general", "critical") to form three new combined datasets;
2. Input each combined dataset from the training set into Spark's SVM binary classification package and train a model (the specific steps are shown in Fig. 4);
3. For each item in the test set, initialize its class votes as ("general", "serious", "critical") = (0, 0, 0);
4. Input the test-set data into each of the three trained SVM binary classifiers for class judgment. Each classifier votes according to its classification result. For example, when the ("general", "critical") classifier judges the class to be "general", then ("general", "serious", "critical")_(general, critical) = (1, 0, 0);
5. After all three classifiers, add the votes: ("general", "serious", "critical") = ("general", "serious", "critical")_(general, serious) + ("general", "serious", "critical")_(critical, serious) + ("general", "serious", "critical")_(general, critical).
The final class is the one corresponding to max("general", "serious", "critical").
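The pairwise training and voting described above can be sketched as follows. This is an illustration under assumptions: a nearest-class-mean learner (`centroid_binary`) stands in for Spark's binary SVM so that the example stays self-contained, the function names are hypothetical, and ties are broken by class order, which the text does not specify.

```python
from itertools import combinations

CLASSES = ("general", "serious", "critical")


def centroid_binary(subset, a, b):
    """Stand-in binary learner (nearest class mean), used here in place of an SVM."""
    def mean(label):
        rows = [x for x, y in subset if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    ca, cb = mean(a), mean(b)

    def dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    return lambda x: a if dist(x, ca) <= dist(x, cb) else b


def train_pairwise(train_set, train_binary):
    """Train one binary classifier per pair of classes on the matching sub-dataset."""
    models = {}
    for a, b in combinations(CLASSES, 2):
        subset = [(x, y) for x, y in train_set if y in (a, b)]
        models[(a, b)] = train_binary(subset, a, b)
    return models


def classify(models, features):
    """Each pairwise classifier casts one vote; the class with the most votes wins."""
    votes = {c: 0 for c in CLASSES}
    for predict in models.values():
        votes[predict(features)] += 1
    return max(CLASSES, key=lambda c: votes[c]), votes  # ties: first class in CLASSES


# Hypothetical one-dimensional toy features.
train_set = [([0.0], "general"), ([1.0], "general"),
             ([5.0], "serious"), ([6.0], "serious"),
             ([10.0], "critical"), ([11.0], "critical")]
models = train_pairwise(train_set, centroid_binary)
print(classify(models, [9.0]))
# ('critical', {'general': 0, 'serious': 1, 'critical': 2})
```

Because the three binary models are independent of one another, they can be trained and applied separately, which is what lets the scheme reuse an existing binary classifier unchanged.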
In the text classification step, the underlying SVM package is a native algorithm tool of the Spark platform, which was developed to exploit the parallel features of the framework as fully as possible. The parallel performance of this step is therefore also satisfactory.
As an application example, all experiments of the present invention were carried out on a Spark cluster built from 1 master node and 3 local slave nodes. The cluster has 2.88T of disk and 32G of total memory. The Spark version is 1.6.0 and the Hadoop version is 2.7.0.
The experiment is assessed mainly in terms of classification accuracy. The final result is measured by precision P, recall R, and the F1 value, calculated for a given class as $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, and $F1 = \frac{2PR}{P + R}$, where TP is the number of cases correctly assigned to the class, FP the number of cases wrongly assigned to it, and FN the number of cases of the class assigned elsewhere.
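The three metrics can be computed per class as in the sketch below. It assumes the standard definitions, P = TP/(TP+FP), R = TP/(TP+FN), and F1 = 2PR/(P+R); how the results are aggregated across the three classes is not stated in the text, so the function `per_class_metrics` (a hypothetical name) reports a single class only.

```python
def per_class_metrics(actual, predicted, cls):
    """Precision, recall, and F1 for one class, from parallel lists of labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == cls and p == cls)  # correctly assigned to cls
    fp = sum(1 for a, p in pairs if a != cls and p == cls)  # wrongly assigned to cls
    fn = sum(1 for a, p in pairs if a == cls and p != cls)  # cls cases assigned elsewhere
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 = "critical", 2 = "serious", 1 = "general".
actual = [3, 3, 2, 1, 1, 2]
predicted = [3, 2, 2, 1, 3, 2]
print(per_class_metrics(actual, predicted, 3))
# (0.5, 0.5, 0.5)
```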
The following three schemes are compared with the scheme of the present invention:
Comparison scheme 1: tfidf representation + naive Bayes;
Comparison scheme 2: tfidf representation + SVM;
Comparison scheme 3: word2vec trained on a general corpus + SVM;
The present invention: word2vec trained on the domain corpus + SVM;
Table 1: Comparison of the classification results of the different schemes
From the comparison above, the schemes based on word2vec+SVM generally perform better than the other schemes. Among them, word2vec vectors trained on the domain corpus adapt better to the classification task of this scenario than those trained on a general corpus.
To verify the speed-up of the algorithm after parallelization, the dataset was prepared at four scales: 200K, 20M, 500M, and 1G. For parallelism on the Spark framework, each executor has a fixed number of cores, and the number of cores directly determines how many tasks run in parallel within each executor. The more execution cores are configured in total, the higher the degree of parallelism of the program. Since the cluster has 48 cores in total, the product of num-executors and executor-cores must stay below 48. After experimental tuning, the parallel tests here use the following parameter configuration:
--deploy-mode cluster
--master yarn-cluster
--num-executors 12
--executor-cores 3
--executor-memory 16G
--driver-memory 8G
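As a quick arithmetic check of the configuration above, the number of task slots is the product of num-executors and executor-cores, and it should stay within the 48 cores of the cluster. The helper below is only a sketch, and the function name `parallel_slots` is hypothetical.

```python
def parallel_slots(num_executors, executor_cores, cluster_cores):
    """Return the number of parallel task slots, or fail if it exceeds the cluster."""
    slots = num_executors * executor_cores
    if slots > cluster_cores:
        raise ValueError("configuration asks for more cores than the cluster has")
    return slots


print(parallel_slots(num_executors=12, executor_cores=3, cluster_cores=48))
# 36
```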
The algorithm flow of the present invention was run on the four data scales with a single-machine configuration (num-executors=1) and with the parallel configuration above. The execution times are shown in Fig. 6.
It can be seen that at all four scales the single-machine run takes longer than the parallel run. As the dataset grows, the single-machine time increases sharply, while the parallel time grows comparatively gently. Overall, the parallel algorithm takes less time than the single-machine algorithm, and its advantage becomes more pronounced as the data scale grows.
The present invention addresses the task of classifying the defect urgency of power equipment defect case texts. The word2vec word-vector stage is trained on a domain corpus. Compared with the usual training on an open corpus, the result reflects the descriptive language of the domain more accurately, and it can be reused directly in other applications on domain text, such as clustering and association analysis. In addition, the binary classification algorithm of the Spark platform is rewritten into a multi-class algorithm, retaining the highly parallel nature of the complex SVM classification process and filling the gap of SVM multi-class classification on the Spark framework. The whole analysis flow is designed and implemented in parallel on Spark, so that compared with a general non-parallel approach it adapts better to the big-data scenarios of practical applications.
The foregoing is merely the preferred embodiment of the application and is not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principle of the application shall fall within its scope of protection.
Although the embodiments of the present invention are described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative effort remain within its scope of protection.

Claims (9)

1. A parallelized power equipment defect text classification method, characterized in that it comprises the following steps:
(1) adding a domain dictionary to the user dictionary, and preprocessing the defect cases by word segmentation and stop-word removal;
(2) collecting a text corpus of power grid fault cases with a crawler algorithm, and training on it with the word2vec of Spark to obtain word vector representations for the domain;
(3) converting the words in the original defect cases obtained in step (1) into the corresponding word vectors from step (2), and representing the case data as text features in matrix form;
(4) inputting the matrix into an SVM multi-class classifier for training and classification to obtain the classification results.
2. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that: the order of step (1) and step (2) is interchanged.
3. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that: in step (1), word segmentation is performed as follows: the text data is read from HDFS into a program data structure, each line being one piece of text data, and the data structure used for storage has the form RDD[String].
4. The parallelized power equipment defect text classification method as claimed in claim 3, characterized in that: the domain lexicon is imported into the user lexicon of ansj by calling the Library.makeForest interface in ansj, which completes the segmentation dictionary and yields a complete dictionary that serves as the basis for segmentation; each corpus entry is segmented using the map operation of Spark with precise segmentation, that is, by calling the ToAnalysis.parse interface in ansj, and the map operator segments all sentences simultaneously in parallel.
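The effect of completing the segmentation dictionary with a domain lexicon can be illustrated with a small sketch. It does not use ansj or Spark: forward maximum matching is swapped in as a simple stand-in segmenter, a plain `map` stands in for the Spark map operator, and the dictionary contents and sample sentences are hypothetical.

```python
def segment(sentence, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

base_dictionary = {"设备", "发现"}           # hypothetical general-purpose words
domain_lexicon = {"主变", "变压器", "渗油"}   # hypothetical power-domain terms
full_dictionary = base_dictionary | domain_lexicon  # the "completed" dictionary

cases = ["主变变压器渗油", "发现设备渗油"]     # one case text per line
# Each sentence is segmented independently, which is what makes a parallel map possible.
segmented = list(map(lambda s: segment(s, full_dictionary), cases))
print(segmented)
```

Without the domain lexicon, terms such as 变压器 would be split into single characters, which is why the claim completes the dictionary before segmenting.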
5. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that: in step (1), stop-word removal is performed as follows: the stop-word list is imported from HDFS into a program data structure, the original input having one stop word per line and the storage data structure having the form RDD[String]; the map operator of Spark is used to remove stop words from each segmented result, each word obtained by segmentation being filtered out if it appears in the stop-word set, and the map operator filters all texts simultaneously in parallel; the results are arranged in the form RDD[Array[String]], each line holding the processing result of one case as a number of words separated by spaces; the processed results are stored in the data structure and output to HDFS in txt format.
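A minimal sketch of the stop-word filtering step, in plain Python rather than Spark. The stop-word set and the segmented cases are hypothetical, and a list comprehension stands in for the per-text map operator.

```python
# Hypothetical stop-word list; on disk it would hold one stop word per line.
stop_words = {"的", "了", "在"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical segmentation results, one token list per case.
segmented_cases = [
    ["变压器", "的", "油位", "偏低"],
    ["在", "巡视", "中", "发现", "渗油", "了"],
]

# Each case is filtered independently of the others.
filtered = [remove_stop_words(tokens, stop_words) for tokens in segmented_cases]

# One output line per case, words separated by spaces.
lines = [" ".join(tokens) for tokens in filtered]
print("\n".join(lines))
```

Holding the stop words in a set keeps each membership test constant-time, which matters when the same lookup runs for every token of every case.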
6. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that: in step (2), a large number of texts in the domain are collected by crawler as part of the training corpus for the domain word vectors; the collected external data is merged with the defect cases to be analyzed to form the training corpus, which is preprocessed by word segmentation and stop-word removal; the word2vec algorithm package of Spark is called, the result of the previous step is input into the word2vec model through the word2Vec.fit operator to train the word vectors, and the trained word vectors are obtained through the model.getVectors operator; the case texts to be analyzed are read from HDFS into a data structure, and each word in a case is replaced with its corresponding trained vector.
7. The parallelized power equipment defect text classification method as claimed in claim 6, characterized in that: the vectors of the words in each case are averaged to serve as the overall feature of that case; the computed results are arranged so that each line corresponds to the feature of one case, in the format "Dj urgency-level category label", and are output to HDFS in txt format.
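The averaging of word vectors into a single case feature can be sketched as follows. The two-dimensional vectors and the tokens are hypothetical, and skipping words that have no trained vector is an assumption of this sketch rather than something the claim specifies.

```python
def case_feature(tokens, vectors):
    """Average the vectors of the known tokens to get one feature vector per case."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of this case has a trained vector
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

# Hypothetical trained word vectors (real ones would have many more dimensions).
vectors = {"变压器": [1.0, 0.0], "渗油": [0.0, 1.0]}

feature = case_feature(["变压器", "渗油", "未登录词"], vectors)
print(feature)
```

Averaging gives every case a feature of the same dimension regardless of its length, which is what allows the cases to be stacked into a matrix for the classifier.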
8. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that: in step (3), the text feature data is imported into a data structure, the case data is split into a training set and a test set, the number of iterations is set, the model is built using stochastic gradient descent and trained on the training set, and the training result is evaluated using accuracy or recall; if the evaluation result does not meet the set condition, the iteration parameters and model parameters are readjusted until the output meets the set condition.
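The split, train and evaluate loop described in this claim can be sketched in plain Python. This is not the Spark implementation: the linear SVM below is a small hand-written hinge-loss model fitted by stochastic gradient descent, and the function names, step size, regularization value and sample data are all assumptions made for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_svm_sgd(rows, iterations=20, step=0.1, reg=0.01, seed=0):
    """Linear SVM (hinge loss, L2 penalty) fitted by SGD; labels must be -1 or +1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(rows[0][0]), 0.0
    for _ in range(iterations):
        order = list(rows)
        rng.shuffle(order)
        for x, y in order:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks w on every step.
            w = [wi * (1.0 - step * reg) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: push the boundary away from this sample.
                w = [wi + step * y * xi for wi, xi in zip(w, x)]
                b += step * y
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Share of the cases truly labelled `label` that were predicted as `label`."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical two-dimensional case features with binary labels.
cases = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([2.0, 3.0], 1), ([3.0, 2.0], 1),
         ([2.5, 2.5], 1), ([-2.0, -2.0], -1), ([-3.0, -3.0], -1),
         ([-2.0, -3.0], -1), ([-3.0, -2.0], -1), ([-2.5, -2.5], -1)]
train, test = train_test_split(cases)
model = train_svm_sgd(train)
y_true = [y for _, y in test]
y_pred = [predict(model, x) for x, _ in test]
print(accuracy(y_true, y_pred))
```

The readjustment described in the claim would wrap `train_svm_sgd` in a loop that changes `iterations`, `step` or `reg` until the chosen metric meets the set condition.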
9. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that: in step (4), the SVM algorithm is improved so that it can handle multi-class scenarios, the specific improvement being:
(4-1) the original case data is divided by urgency class, and the resulting sub-datasets are combined in pairs to form new combined datasets;
(4-2) each combined dataset in the original training set is input into the SVM binary classification toolkit of Spark to train a model;
(4-3) the data in the test set is input into each of the three trained SVM binary classifiers for class judgment; each classifier casts a vote according to its classification result, and after all three classifiers have voted, the votes are summed to obtain the final classification result.
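Steps (4-1) to (4-3) describe a one-vs-one decomposition with majority voting, which can be sketched as below. To keep the example short, a nearest-centroid rule is swapped in for the SVM binary classifier; the class names and feature values are hypothetical.

```python
from collections import Counter
from itertools import combinations

def train_pair_classifier(rows_a, label_a, rows_b, label_b):
    """Stand-in binary classifier: predict the label of the nearer class centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    ca, cb = centroid(rows_a), centroid(rows_b)
    def predict(x):
        da = sum((xi - ci) ** 2 for xi, ci in zip(x, ca))
        db = sum((xi - ci) ** 2 for xi, ci in zip(x, cb))
        return label_a if da <= db else label_b
    return predict

def train_one_vs_one(data_by_class):
    """(4-1) and (4-2): pair up the per-class subsets and train one classifier per pair."""
    return [train_pair_classifier(data_by_class[a], a, data_by_class[b], b)
            for a, b in combinations(sorted(data_by_class), 2)]

def classify(x, classifiers):
    """(4-3): every pairwise classifier votes; the label with the most votes wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical urgency classes with two-dimensional case features.
data_by_class = {
    "general": [[0.0, 0.0], [0.2, 0.1]],
    "serious": [[1.0, 1.0], [1.1, 0.9]],
    "critical": [[2.0, 2.0], [2.1, 1.9]],
}
classifiers = train_one_vs_one(data_by_class)
print(len(classifiers), classify([1.0, 1.1], classifiers))
```

Three classes yield three pairwise classifiers, matching the count in the claim. If all three disagree, `most_common` returns the first label it counted, so a real system would need an explicit tie-breaking rule.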
CN201711288010.XA 2017-12-07 2017-12-07 A kind of power equipments defect file classification method of parallelization Pending CN108021679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711288010.XA CN108021679A (en) 2017-12-07 2017-12-07 A kind of power equipments defect file classification method of parallelization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711288010.XA CN108021679A (en) 2017-12-07 2017-12-07 A kind of power equipments defect file classification method of parallelization

Publications (1)

Publication Number Publication Date
CN108021679A true CN108021679A (en) 2018-05-11

Family

ID=62078915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711288010.XA Pending CN108021679A (en) 2017-12-07 2017-12-07 A kind of power equipments defect file classification method of parallelization

Country Status (1)

Country Link
CN (1) CN108021679A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804558A (en) * 2018-05-22 2018-11-13 北京航空航天大学 A kind of defect report automatic classification method based on semantic model
CN109101483A (en) * 2018-07-04 2018-12-28 浙江大学 A kind of wrong identification method for electric inspection process text
CN109101481A (en) * 2018-06-25 2018-12-28 北京奇艺世纪科技有限公司 A kind of name entity recognition method, device and electronic equipment
CN109146152A (en) * 2018-08-01 2019-01-04 北京京东金融科技控股有限公司 Incident classification prediction technique and device on a kind of line
CN110287321A (en) * 2019-06-26 2019-09-27 南京邮电大学 A kind of electric power file classification method based on improvement feature selecting
CN110781671A (en) * 2019-10-29 2020-02-11 西安科技大学 Knowledge mining method for intelligent IETM fault maintenance record text
CN110895565A (en) * 2019-11-29 2020-03-20 国网湖南省电力有限公司 Method and system for classifying fault defect texts of power equipment
CN111177367A (en) * 2019-11-11 2020-05-19 腾讯科技(深圳)有限公司 Case classification method, classification model training method and related products
CN111191447A (en) * 2019-12-18 2020-05-22 东软集团股份有限公司 Equipment defect classification method, device and equipment
CN111241811A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Method, apparatus, computer device and storage medium for determining search term weight
CN111931861A (en) * 2020-09-09 2020-11-13 北京志翔科技股份有限公司 Anomaly detection method for heterogeneous data set and computer-readable storage medium
CN112749079A (en) * 2019-10-31 2021-05-04 中国移动通信集团浙江有限公司 Defect classification method and device for software test and computing equipment
CN114444469A (en) * 2022-01-11 2022-05-06 国家电网有限公司客户服务中心 Processing device based on 95598 customer service data resources
CN116383390A (en) * 2023-06-05 2023-07-04 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform
CN117057312A (en) * 2023-10-11 2023-11-14 北京洛斯达科技发展有限公司 Python-based precise splitting method for extra-high voltage engineering water conservation design document

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389379A (en) * 2015-11-20 2016-03-09 重庆邮电大学 Rubbish article classification method based on distributed feature representation of text
CN105550200A (en) * 2015-12-02 2016-05-04 北京信息科技大学 Chinese segmentation method oriented to patent abstract
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
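YODE: "SVM multi-class classification: combining multiple binary classifiers", Sina Blog, HTTP://BLOG.SINA.COM.CN/S/BLOG_4C98B96001009B8D.HTML *
冯贵川: "Research on Word2vec-based text modeling and classification", China Master's Theses Full-text Database, Information Science and Technology (monthly), Computer Software and Computer Applications *
风中迷茫的蛤蛤: "ansj word segmentation tutorial", CSDN Blog, HTTPS://BLOG.CSDN.NET/A360616218/ARTICLE/DETAILS/75268959 *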

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804558A (en) * 2018-05-22 2018-11-13 北京航空航天大学 A kind of defect report automatic classification method based on semantic model
CN109101481A (en) * 2018-06-25 2018-12-28 北京奇艺世纪科技有限公司 A kind of name entity recognition method, device and electronic equipment
CN109101481B (en) * 2018-06-25 2022-07-22 北京奇艺世纪科技有限公司 Named entity identification method and device and electronic equipment
CN109101483A (en) * 2018-07-04 2018-12-28 浙江大学 A kind of wrong identification method for electric inspection process text
CN109101483B (en) * 2018-07-04 2020-04-14 浙江大学 Error identification method for power inspection text
CN109146152A (en) * 2018-08-01 2019-01-04 北京京东金融科技控股有限公司 Incident classification prediction technique and device on a kind of line
CN110287321A (en) * 2019-06-26 2019-09-27 南京邮电大学 A kind of electric power file classification method based on improvement feature selecting
CN110781671A (en) * 2019-10-29 2020-02-11 西安科技大学 Knowledge mining method for intelligent IETM fault maintenance record text
CN110781671B (en) * 2019-10-29 2023-02-14 西安科技大学 Knowledge mining method for intelligent IETM fault maintenance record text
CN112749079A (en) * 2019-10-31 2021-05-04 中国移动通信集团浙江有限公司 Defect classification method and device for software test and computing equipment
CN112749079B (en) * 2019-10-31 2023-12-26 中国移动通信集团浙江有限公司 Defect classification method and device for software test and computing equipment
CN111177367A (en) * 2019-11-11 2020-05-19 腾讯科技(深圳)有限公司 Case classification method, classification model training method and related products
CN110895565A (en) * 2019-11-29 2020-03-20 国网湖南省电力有限公司 Method and system for classifying fault defect texts of power equipment
CN111191447B (en) * 2019-12-18 2023-07-14 东软集团股份有限公司 Equipment defect classification method, device and equipment
CN111191447A (en) * 2019-12-18 2020-05-22 东软集团股份有限公司 Equipment defect classification method, device and equipment
CN111241811A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Method, apparatus, computer device and storage medium for determining search term weight
CN111241811B (en) * 2020-01-06 2024-05-10 平安科技(深圳)有限公司 Method, apparatus, computer device and storage medium for determining search term weight
CN111931861A (en) * 2020-09-09 2020-11-13 北京志翔科技股份有限公司 Anomaly detection method for heterogeneous data set and computer-readable storage medium
CN114444469A (en) * 2022-01-11 2022-05-06 国家电网有限公司客户服务中心 Processing device based on 95598 customer service data resources
CN114444469B (en) * 2022-01-11 2024-07-09 国家电网有限公司客户服务中心 Processing device based on 95598 customer service data resource
CN116383390A (en) * 2023-06-05 2023-07-04 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform
CN116383390B (en) * 2023-06-05 2023-08-08 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform
CN117057312A (en) * 2023-10-11 2023-11-14 北京洛斯达科技发展有限公司 Python-based precise splitting method for extra-high voltage engineering water conservation design document
CN117057312B (en) * 2023-10-11 2023-12-29 北京洛斯达科技发展有限公司 Python-based precise splitting method for extra-high voltage engineering water conservation design document

Similar Documents

Publication Publication Date Title
CN108021679A (en) A kind of power equipments defect file classification method of parallelization
JP7090936B2 (en) ESG-based corporate evaluation execution device and its operation method
CN103631859B (en) Intelligent review expert recommending method for science and technology projects
CN107944480A (en) A kind of enterprises ' industry sorting technique
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN107861942A (en) A kind of electric power based on deep learning is doubtful to complain work order recognition methods
CN107330011A (en) The recognition methods of the name entity of many strategy fusions and device
CN106446230A (en) Method for optimizing word classification in machine learning text
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN106651057A (en) Mobile terminal user age prediction method based on installation package sequence table
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN109948340A (en) The PHP-Webshell detection method that a kind of convolutional neural networks and XGBoost are combined
CN108268431B (en) The method and apparatus of paragraph vectorization
CN110472040A (en) Extracting method and device, storage medium, the computer equipment of evaluation information
CN109299264A (en) File classification method, device, computer equipment and storage medium
CN107832290B (en) Method and device for identifying Chinese semantic relation
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN109684476A (en) A kind of file classification method, document sorting apparatus and terminal device
CN110134961A (en) Processing method, device and the storage medium of text
CN110889412B (en) Medical long text positioning and classifying method and device in physical examination report
CN110097096A (en) A kind of file classification method based on TF-IDF matrix and capsule network
CN109271516A (en) Entity type classification method and system in a kind of knowledge mapping
CN110232128A (en) Topic file classification method and device
CN106649250A (en) Method and device for identifying emotional new words
CN108363691A (en) A kind of field term identifying system and method for 95598 work order of electric power

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180511