CN108021679A - A parallelized text classification method for power equipment defects - Google Patents
- Publication number
- CN108021679A CN201711288010.XA
- Authority
- CN
- China
- Prior art keywords
- case
- result
- data
- text
- parallelization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a parallelized text classification method for power equipment defects. A domain dictionary is added to the user dictionary, and the defect cases are preprocessed by word segmentation and stop-word removal. A text corpus of power grid fault cases is collected with a web crawler and used to train Spark's word2vec, which yields word vectors for the domain. The words of the preprocessed defect cases are replaced with these word vectors, so that each case is represented as a vector and the cases together form a matrix. The matrix is fed into an SVM multi-class classifier, which is trained and then used for classification to obtain the classification results.
Description
Technical field
The present invention relates to a parallelized text classification method for power equipment defects.
Background art
A text classification algorithm mainly comprises four steps: preprocessing, text feature extraction, text representation, and classification. For Chinese text, preprocessing mainly consists of word segmentation and stop-word removal. Text feature extraction mainly uses word-frequency methods, represented by tfidf and Textrank, and topic-model methods, represented by lda. Text representation is mainly either one-hot, which ignores context, or based on word2vec. For the final classification step, any general-purpose data mining classification algorithm can be considered. In a text classification task for a specific domain, the main concern is to adapt the preprocessing, feature extraction and other steps to the language and terminology of that domain. The algorithm also needs to be adapted to the length of the texts being classified. Long texts can usually be classified directly with the above flow, and the results are generally better than for short texts, mainly because long texts carry more information. For short texts, applying the above flow directly loses features that are already scarce, so usually only stop-word filtering is applied and keywords are not screened further with algorithms such as tfidf.
For power defect texts, the severity of a defect has in the past been judged manually from the defect description, based on experience, and assigned to one of three classes: "serious", "general" and "critical". This incurs substantial labor cost, and the subjective differences between people lead to inconsistent judgments. Automatic classification by means of a text classification algorithm is therefore very valuable, yet there is still little research on defect case classification in the power domain.
Word segmentation is usually based on a default dictionary. For general public-domain text this gives accurate segmentation, but for the texts in this scenario the default dictionary alone does not give good results. The domain must be taken into account by adding a specialized power-industry dictionary to the default dictionary of ansj. Accurate segmentation is an important prerequisite for training word2vec accurately.
Likewise, word2vec is usually trained on a general corpus, whereas the texts addressed by this invention are highly specialized. A large amount of text from the domain must therefore be collected first and used to train the word2vec word vectors. The training result then serves as the basis for the subsequent text representation.
The flow is built on the Spark parallel framework in order to compute efficiently on big-data inputs. However, the SVM classification package in the platform's mllib is a binary classifier, so it is difficult to handle the multi-class scenario encountered here.
Summary of the invention
To solve the above problems, the present invention proposes a parallelized text classification method for power equipment defects. It addresses the classification of defect urgency for highly specialized power defect case texts, for which a traditional analysis flow does not give satisfactory results. When the data volume is large, the analysis can be completed efficiently on the Spark parallel framework, achieving classification at big-data scale.
To achieve these goals, the present invention adopts the following technical scheme:
A parallelized text classification method for power equipment defects, comprising the following steps:
(1) add a domain dictionary to the user dictionary, and preprocess the defect cases by word segmentation and stop-word removal;
(2) collect a text corpus of power grid fault cases with a web crawler, and train Spark's word2vec on it to obtain word vectors for the domain;
(3) convert the words of the defect cases obtained in step (1) into the corresponding word vectors from step (2), and represent the case data as text features in matrix form;
(4) feed the matrix into an SVM multi-class classifier, train it, and classify to obtain the classification results.
Further, the order of step (1) and step (2) may be exchanged.
Further, in step (1), word segmentation is performed as follows. The text data is read from HDFS into a program data structure, one text per line, stored as RDD[String]. The domain dictionary is imported into the user dictionary of ansj by calling the Library.makeForest interface of ansj, which completes the segmentation dictionary; the resulting complete dictionary is the basis for segmentation. Each text is segmented with Spark's map operation using precise segmentation, that is, by calling the ToAnalysis.parse interface of ansj, so that the map operator segments all sentences in parallel.
In step (1), stop-word removal is performed as follows. The stop-word list is read from HDFS into a program data structure; the input has one stop word per line and is stored as RDD[String]. Spark's map operator applies stop-word removal to each segmented text: every word produced by segmentation that appears in the stop-word set is filtered out, and the map operator filters all texts in parallel. The result is organized as RDD[Array[String]], one processed case per line, each consisting of words separated by spaces. The processed result is kept in the data structure and written to HDFS in txt format.
Further, in step (2), a large amount of domain text is collected with a web crawler as part of the training corpus for the domain word vectors. The collected external data is merged with the defect cases to be analyzed to form the training corpus, which is preprocessed by word segmentation and stop-word removal. Spark's word2vec package is then called: the word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained word vectors. The case texts to be analyzed are read from HDFS into a data structure, and the words in each case are replaced with the corresponding trained vectors.
Further, the word vectors of each case are averaged to give the overall feature of the case. The computed results are arranged so that each line holds the feature of one case, consisting of the feature vector Dj and the urgency class label, and are written to HDFS in txt format.
Further, in step (3), the text feature data is imported into a data structure and the case data is split into a training set and a test set. The number of iterations is set, the model is built with stochastic gradient descent and trained on the training set, and the training result is evaluated by precision or recall. If the evaluation does not meet the set condition, the iteration and model parameters are adjusted again until the output meets the condition.
Further, in step (4), the SVM algorithm is improved so that it can handle multi-class scenarios. The specific improvement is:
(4-1) divide the original case data by urgency class, and combine the resulting subsets in pairs to form new combined datasets;
(4-2) input each combined dataset from the training set into Spark's binary SVM classification package and train a model;
(4-3) input the data in the test set into each of the three trained binary SVM classifiers; each classifier votes according to its classification result, and after all three classifiers the votes are added to obtain the final classification result.
Compared with the prior art, the beneficial effects of the present invention are:
The invention classifies the urgency of power equipment defect case texts more accurately than general-purpose algorithms; designing the algorithm around the characteristics of the domain improves its overall reliability. The flow is also parallelized on Spark, so compared with a serial algorithm it adapts better to big data and reduces running time.
For the task of classifying defect urgency from power equipment defect case texts, the word2vec word vectors are trained on a domain corpus. Compared with the usual training on an open corpus, the result captures the descriptive language of the field more accurately, and it can be used directly in other applications on domain text, such as clustering and association analysis.
In addition, the invention rewrites the binary classification algorithm of the Spark platform into a multi-class algorithm, retains the highly parallel nature of the complex SVM classification process, and fills the gap of SVM multi-class classification on the Spark framework. The whole analysis flow is designed and implemented in parallel on Spark, so compared with a general non-parallel mode it adapts better to big-data scenarios in practical applications.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide further understanding of the application. The illustrative embodiments and their descriptions explain the application and do not unduly limit it.
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the flow chart of the preprocessing process;
Fig. 3 is the flow chart of the feature representation process;
Fig. 4 is the classification flow chart;
Fig. 5 is the flow chart of the SVM multi-class classification;
Fig. 6 is a comparison of execution times at different data scales.
Detailed description:
The invention is further described below with reference to the drawings and embodiments.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which the application belongs.
It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular form is intended to include the plural form. It should further be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side" and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are relative terms used only to facilitate describing the structural relationships of the components or elements of the invention, do not refer specifically to any component or element, and are not to be understood as limiting the invention.
In the present invention, terms such as "affixed", "connected" and "connection" are to be interpreted broadly: they may denote a fixed connection, an integral connection or a detachable connection, and a connection may be direct or indirect through an intermediary. Researchers or technicians in the field can determine the specific meaning of these terms in the present invention according to the circumstances, and the terms are not to be considered as limiting the invention.
The present invention addresses the classification of defect urgency for highly specialized power defect case texts, for which a traditional analysis flow does not give satisfactory results. When the data volume is large, the analysis can be completed efficiently on the Spark parallel framework, achieving classification at big-data scale.
As shown in Fig. 1, the process consists of four steps: text preprocessing, training of the domain word2vec word vectors, text feature representation, and urgency classification with SVM.
Preprocessing mainly comprises word segmentation and stop-word removal. Training the word2vec word vectors requires the collected external corpus of related domain text, to ensure the generality of the result. Text feature representation mainly uses the average of the word vectors. The modified SVM multi-class classifier is first trained and then used to classify cases. The four steps are described in detail below.
Text preprocessing and parallelization
This step mainly comprises word segmentation and stop-word removal, together with the Spark-based parallel design of the preprocessing. The flow is shown in Fig. 2.
The operating steps are:
1. Read the text data from HDFS into a program data structure, one text per line, stored as RDD[String];
2. Import the domain dictionary into the user dictionary of ansj by calling the Library.makeForest interface of ansj, which completes the segmentation dictionary. The resulting complete dictionary (default dictionary + domain dictionary) is the basis for segmentation;
3. Segment each text with Spark's map operation, using precise segmentation, that is, by calling the ToAnalysis.parse interface of ansj. The parallelism of this step lies in the map operator segmenting all sentences in parallel, which saves time;
4. Read the stop-word list from HDFS into a program data structure. The input has one stop word per line and is stored as RDD[String];
5. Apply stop-word removal to each segmented text with Spark's map operator. That is, every word produced by segmentation that appears in the stop-word set is filtered out. The parallelism of this step lies in the map operator filtering all texts at the same time;
6. Organize the result as RDD[Array[String]], one processed case per line, each consisting of words separated by spaces. Keep the processed result in the data structure and write it to HDFS in txt format.
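The six steps above can be sketched without Spark or ansj. The following is a minimal illustration only: the forward-maximum-matching function `segment` is a toy stand-in for ansj's ToAnalysis.parse (it is not how ansj works internally), Python's built-in `map` stands in for the Spark map operator, and the dictionaries, stop words and sample cases are all hypothetical.

```python
# Toy stand-in for the preprocessing flow: dictionary-based segmentation
# followed by stop-word filtering. Neither ansj nor Spark is used here.

def segment(text, dictionary, max_len=6):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def preprocess(texts, dictionary, stop_words):
    # map(...) mirrors the per-record Spark map operator; each case becomes
    # one line of space-separated words.
    return list(map(
        lambda t: " ".join(remove_stop_words(segment(t, dictionary), stop_words)),
        texts,
    ))


default_dict = {"主变", "渗漏", "严重"}            # hypothetical default dictionary
domain_dict = default_dict | {"油枕", "呼吸器"}     # hypothetical domain terms added
stop_words = {"的", "了"}                          # hypothetical stop-word list

cases = ["主变的油枕渗漏", "呼吸器渗漏严重了"]
print(preprocess(cases, default_dict, stop_words))
print(preprocess(cases, domain_dict, stop_words))
```

With only the default dictionary, the domain terms are broken into single characters; once they are added, they survive as whole words, which is the effect the domain dictionary is meant to have.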
The text feature representation step mainly comprises training the vector representations of the domain vocabulary, converting the words of each case into the corresponding vectors, and representing each case text as a vector that serves as the feature of that text, together with the Spark-based parallel design. The flow is shown in Fig. 3.
The operating steps are:
1. Collect a large amount of domain text by means such as a web crawler, as part of the training corpus for the domain word vectors. The present invention crawled 62643 relevant papers from Hownet with Python and the Scrapy framework as external data;
2. Merge the collected external data with the defect cases to be analyzed to form the training corpus. Segment this corpus, that is, apply Step.1 to Step.3 of the preprocessing part, and keep the segmented result in a data structure;
3. Call Spark's word2vec package. The word2Vec.fit operator feeds the result of the previous step into the word2vec model to train the word vectors, and the model.getVectors operator retrieves the trained word vectors. Organize the result in the form "word vector", one per line, and write it to HDFS in txt format. The parallelism of this step lies in calling the word2vec package of the Spark framework: because the package is written on top of the platform's own mechanisms, it parallelizes the complex training process as far as the framework allows. During training, the vector dimension setVectorSize is set to 200 and setMinCount is set to 0; all other parameters use their defaults;
4. Read the case texts to be analyzed from HDFS into a data structure, and replace the words in each case with the corresponding trained vectors;
5. Average the word vectors of each case to give the overall feature of the case. That is, if word i in a document is represented after vectorization as W_i = (w_{i1}, w_{i2}, \ldots, w_{i200}), and the m words of case j are represented after vectorization as M\_D_j = (W_{j1}, W_{j2}, \ldots, W_{jm}), then the feature vector of the case is
D_j = \frac{1}{m} \sum_{k=1}^{m} W_{jk}
The results of the previous step are arranged so that each line holds the feature of one case, in the form "Dj urgency class label". Here Dj is a 1*200 vector whose elements are separated by spaces. In the present invention the three classes "critical", "serious" and "general" are labeled 3, 2 and 1 respectively, and the class label and Dj are separated by an English comma ",". The results are arranged in this format and written to HDFS in txt format.
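The averaging in step 5 and the output line format can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, the vectors have 3 dimensions instead of 200, words without a trained vector are simply skipped, and placing the vector before the comma and the label after it is a guess, since the text only says that the two are separated by a comma.

```python
def average_vectors(word_vectors):
    """Element-wise mean of a case's word vectors (all of equal dimension)."""
    m = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / m for d in range(dim)]


def case_feature(tokens, vectors):
    # Assumption: words without a trained vector are skipped. With
    # setMinCount = 0 every word of the training corpus should have one.
    found = [vectors[t] for t in tokens if t in vectors]
    return average_vectors(found)


def format_line(feature, label):
    # Assumed layout: vector elements separated by spaces, then a comma,
    # then the class label (3 = critical, 2 = serious, 1 = general).
    return " ".join(str(x) for x in feature) + "," + str(label)


# Hypothetical 3-dimensional vectors; the document uses 200 dimensions.
vectors = {"主变": [1.0, 0.0, 2.0], "渗漏": [3.0, 2.0, 0.0]}
feature = case_feature(["主变", "渗漏"], vectors)
print(format_line(feature, 2))
```

Averaging discards word order and gives every word equal weight, which keeps the feature dimension fixed at the word-vector size regardless of how long a case description is.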
Text classification and parallelization, as shown in Fig. 4:
This step mainly comprises five operations: splitting the case data into a training set and a test set, building the SVM model, training the model, evaluating the result with evaluation metrics, and tuning parameters. The module first rewrites the binary SVM package in mllib into a multi-class algorithm, which is then encapsulated and applied in the classification flow. The Spark-based parallelization of this flow lies in calling and improving the native SVM package.
The overall operating steps of the process are:
1. Import the vectorized text features on HDFS into a data structure in LIBSVM format;
2. Load the data with MLUtils.loadLibSVMFile and call randomSplit on the result to divide the case data into a training set and a test set in the ratio 60% to 40%;
3. Set the number of iterations numIterations to 150, leave the other parameters at their defaults, and call the SVMWithSGD.train operator, which uses stochastic gradient descent, to build the model;
4. Train the model on the training set, and call the model.predict operator to predict the classes of the test set;
5. Compare the predictions with the actual labels, and evaluate the result using precision, recall and the F1 value as metrics;
6. Return to step.3, reset the parameters, and repeat step.3 to step.6 until a satisfactory result is reached, finally obtaining the corresponding parameters and model.
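The split-train-evaluate-retune loop above can be outlined in plain Python. This is a rough sketch with hypothetical helper names (`split_60_40`, `tune`); the training and scoring functions at the bottom are stubs with made-up scores whose only purpose is to exercise the loop, and the exact 60/40 cut differs from Spark's randomSplit, which samples each record independently and therefore yields only approximately 60/40.

```python
import random


def split_60_40(records, seed=42):
    """Shuffle and cut at 60% (an exact cut, unlike Spark's randomSplit)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]


def tune(train_fn, score_fn, train_set, test_set, candidates, target):
    """Try each candidate parameter setting until score_fn reaches the target.

    Returns (best score, its parameters, its model).
    """
    best = None
    for params in candidates:
        model = train_fn(train_set, params)
        score = score_fn(model, test_set)
        if best is None or score > best[0]:
            best = (score, params, model)
        if score >= target:
            break
    return best


records = list(range(10))
train, test = split_60_40(records)

# Stubs for illustration only: the "model" is just its parameters and the
# score is invented, standing in for SVM training and F1 evaluation.
stub_train = lambda data, params: params
stub_score = lambda model, data: min(1.0, model["numIterations"] / 150)

best = tune(stub_train, stub_score, train, test,
            [{"numIterations": 50}, {"numIterations": 150}], target=0.9)
print(len(train), len(test), best[1])
```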
The present invention improves the SVM package of mllib so that it can handle multi-class scenarios. The main idea is described as based on the one-to-many method; in the procedure below, one binary classifier is trained for each pair of classes and their votes are combined.
The concrete operating steps of the process are:
1. Separate the vectorized text data of the three classes "general", "serious" and "critical". Combine the datasets in pairs as ("general", "serious"), ("critical", "serious") and ("general", "critical") to form three new combined datasets;
2. Input each combined dataset from the training set into Spark's binary SVM classification package and train a model (the specific steps are shown in Fig. 4);
3. For each item in the test set, initialize its class votes to ("general", "serious", "critical") = (0, 0, 0);
4. Input the test data into each of the three trained binary SVM classifiers. Each classifier votes according to its classification result. For example, if the ("general", "critical") classifier decides "general", then ("general", "serious", "critical")_(general, critical) = (1, 0, 0);
5. After all three classifiers, add the votes, that is, ("general", "serious", "critical") = ("general", "serious", "critical")_(general, serious) + ("general", "serious", "critical")_(critical, serious) + ("general", "serious", "critical")_(general, critical).
The final class is the one corresponding to max("general", "serious", "critical").
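The pairwise voting in steps 1 to 5 can be sketched as follows. This is a minimal illustration, not the Spark implementation: `pairwise_datasets` and `vote` are hypothetical names, the binary classifiers are threshold stubs on a single made-up number rather than trained SVMs, and breaking ties by class order is an assumption because the text does not say how ties are resolved.

```python
from itertools import combinations

CLASSES = ["general", "serious", "critical"]


def pairwise_datasets(samples):
    """Group (features, label) samples into one dataset per pair of classes."""
    return {
        pair: [(x, y) for x, y in samples if y in pair]
        for pair in combinations(CLASSES, 2)
    }


def vote(x, classifiers):
    """Sum one vote per binary classifier and return (winner, vote counts).

    classifiers maps a class pair to a function returning one class of that pair.
    """
    votes = {c: 0 for c in CLASSES}
    for pair, predict in classifiers.items():
        votes[predict(x)] += 1
    # Assumption: ties are broken by the order of CLASSES.
    return max(CLASSES, key=lambda c: votes[c]), votes


def threshold_classifier(low, high, cut):
    """Stub binary classifier on a single number, for illustration only."""
    return lambda x: low if x < cut else high


samples = [(0.5, "general"), (2.5, "serious"), (3.5, "critical")]
datasets = pairwise_datasets(samples)

classifiers = {
    ("general", "serious"): threshold_classifier("general", "serious", 1.0),
    ("general", "critical"): threshold_classifier("general", "critical", 2.0),
    ("serious", "critical"): threshold_classifier("serious", "critical", 3.0),
}
winner, votes = vote(2.5, classifiers)
print(winner, votes)
```

With three classes this scheme needs three binary classifiers, and each of them can be trained independently, which is what makes it a natural fit for a binary SVM package.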
In the text classification step, the underlying SVM package is a native algorithm tool of the Spark platform, and its development parallelized the algorithm by exploiting the parallel features of the framework as far as possible. The parallel performance of this step is therefore also satisfactory.
As an application example, all experiments in the present invention were carried out on a Spark cluster consisting of 1 master node and 3 local slave nodes. The cluster has a disk capacity of 2.88T and a total memory of 32G. The Spark version is 1.6.0 and the Hadoop version is 2.7.0.
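The experiment is evaluated mainly in terms of classification quality. The final results are measured by precision P, recall R and the F1 value, which are calculated as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives:
P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2PR}{P + R}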
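The three metrics can be computed per class with a few lines of Python. The sketch below uses the standard definitions of precision, recall and F1; whether the experiments report them per class or averaged over classes is not stated, and the label lists are invented purely to exercise the function.

```python
def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented labels (3 = critical, 2 = serious, 1 = general), for illustration only.
actual = [3, 3, 2, 1, 1, 2]
predicted = [3, 2, 2, 1, 3, 2]
print(precision_recall_f1(actual, predicted, positive=3))
```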
The following three schemes are compared with the scheme of the present invention:
Comparison scheme 1: tfidf representation + naive Bayes;
Comparison scheme 2: tfidf representation + SVM;
Comparison scheme 3: word2vec trained on a general corpus + SVM;
The present invention: word2vec trained on the domain corpus + SVM.
Table 1: Comparison of classification results for the different schemes
The comparison above shows that the schemes based on word2vec + SVM generally outperform the others. Among them, word2vec vectors trained on the domain corpus suit the classification task of this scenario better than vectors trained on a general corpus.
To verify the speed-up of the algorithm after parallelization, the dataset was prepared at four scales: 200K, 20M, 500M and 1G. For parallel execution on the Spark framework, each executor has a fixed number of cores, and the number of cores directly determines how many tasks run in parallel within each executor. The more cores are allocated in total, the higher the parallelism of the program. Since the cluster has 48 cores in total, the product of num-executors and executor-cores must stay below 48. After experimental tuning, the parallel experiments used the following parameter configuration:
--deploy-mode cluster
--master yarn-cluster
--num-executors 12
--executor-cores 3
--executor-memory 16G
--driver-memory 8G
The algorithm flow of the present invention was run on the four data scales with the single-machine setting (num-executors=1) and with the parallel setting above. The execution times are shown in Fig. 6.
At all four scales, single-machine execution takes longer than parallel execution. As the dataset grows, the single-machine time rises sharply, whereas the parallel time grows comparatively gently. In summary, the parallel algorithm takes less time than the single-machine algorithm, and its advantage becomes more pronounced as the data scale grows.
The present invention targets the task of classifying defect urgency from power equipment defect case texts. The word2vec word vectors are trained on a domain corpus. Compared with the usual training on an open corpus, the result captures the descriptive language of the field more accurately, and it can be used directly in other applications on domain text, such as clustering and association analysis. In addition, the binary classification algorithm of the Spark platform is rewritten into a multi-class algorithm, which retains the highly parallel nature of the complex SVM classification process and fills the gap of SVM multi-class classification on the Spark framework. The whole analysis flow is designed and implemented in parallel on Spark, so compared with a general non-parallel mode it adapts better to big-data scenarios in practical applications.
The foregoing describes only preferred embodiments of the application and does not limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principle of the application shall fall within its scope of protection.
Although the embodiments of the present invention are described above with reference to the drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that modifications or variations which can be made on the basis of the technical scheme of the invention without creative effort remain within the scope of protection of the invention.
Claims (9)
1. a kind of power equipments defect file classification method of parallelization, it is characterized in that:Comprise the following steps:
(1) field dictionary is added in user-oriented dictionary, defect case is pre-processed, segmented and gone stop words;
(2) crawler algorithm is utilized, the corpus of text of electric network fault case is collected, is trained, obtained using the word2vec of Spark
The term vector in the field is taken to represent;
(3) word in the genetic defects case for obtaining step (1) is converted into the corresponding term vector of step (2), and by case
Data carry out text representation, form the form of matrix;
(4) by Input matrix into SVM multi-categorizers, it is trained and classifies, obtain classification results.
2. a kind of power equipments defect file classification method of parallelization as claimed in claim 1, it is characterized in that:The step
(1) and step (2) order exchange.
3. a kind of power equipments defect file classification method of parallelization as claimed in claim 1, it is characterized in that:The step
(1) in, the processing method segmented is:Text data is read in the data structure of program from HDFS, each behavior
One text data, the data structure of storage is RDD [String] form.
4. The parallelized power equipment defect text classification method as claimed in claim 3, characterized in that: the domain lexicon is imported into the user lexicon of ansj by calling the Library.makeForest interface in ansj, which completes the segmentation dictionary and yields a complete dictionary that serves as the basis for word segmentation; the map operation of Spark is used to segment each corpus entry, using precise segmentation, that is, calling the ToAnalysis.parse interface in ansj; the map operator segments all sentences in parallel.
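As a non-limiting illustration of why the domain lexicon is merged into the segmentation dictionary before segmenting, the following Python sketch may be considered. It does not use ansj: a toy forward-maximum-matching segmenter stands in for the ToAnalysis.parse interface, the built-in map stands in for Spark's map operator, and the dictionaries, the sample sentences and the function name `segment` are all hypothetical.

```python
def segment(sentence, dictionary, max_len=4):
    """Toy forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


base_dictionary = {"变压器", "渗漏", "油"}   # hypothetical general-purpose dictionary
domain_lexicon = {"主变压器", "渗漏油"}       # hypothetical power-domain terms
full_dictionary = base_dictionary | domain_lexicon  # the "completed" dictionary

cases = ["主变压器渗漏油", "变压器渗漏"]      # one defect record per element

# Each sentence is segmented independently, so the work can be distributed;
# on Spark the same function would be passed to the RDD's map operator.
segmented = list(map(lambda s: segment(s, full_dictionary), cases))

without_domain = segment(cases[0], base_dictionary)
```

In this toy setting the base dictionary alone splits the first record into four fragments, whereas the completed dictionary keeps the two domain terms intact.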
5. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that, in step (1), stop-word removal is performed as follows: the stop-word list is imported from HDFS into a data structure of the program, the original input having one stop word per line, and is stored in RDD[String] form; the map operator of Spark is used to remove stop words from each segmented result, filtering out every segmented word that appears in the stop-word set, and the map operator performs this filtering on all texts in parallel; the results are organized into RDD[Array[String]] form, in which each line is the processing result of one case and each result consists of several words separated by spaces; the processed results are stored in the data structure and output to HDFS in txt format.
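The stop-word filtering described above can be sketched in plain Python as follows. This is only an illustrative stand-in for the Spark map operator; the stop words, the sample cases and the function name `remove_stop_words` are assumptions introduced for the example.

```python
def remove_stop_words(words, stop_words):
    """Keep only the words that do not appear in the stop-word set."""
    return [w for w in words if w not in stop_words]


# Hypothetical stop-word list (one word per line in the source file).
stop_words = {"的", "了", "在"}

# Hypothetical segmented cases, one list of words per case.
segmented_cases = [
    ["变压器", "的", "油位", "在", "下降"],
    ["断路器", "发生", "了", "跳闸"],
]

# Each case is filtered independently, which is what makes a parallel map possible.
filtered = [remove_stop_words(case, stop_words) for case in segmented_cases]

# One line per case, words separated by spaces, ready to be written as txt.
output_lines = [" ".join(case) for case in filtered]
```

Holding the stop words in a set keeps each membership check constant-time, which matters when the same list is applied to every word of every case.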
6. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that, in step (2): a crawler is used to collect a large amount of text from the domain as part of the training corpus for the domain word vectors; the collected external data is merged with the defect cases to be analyzed to form the training corpus, which is preprocessed by word segmentation and stop-word removal; the word2vec algorithm package of Spark is called, and the result of the previous step is input into the word2vec model using the word2Vec.fit operator to train the word vectors; the trained word vectors are obtained through the model.getVectors operator; the case texts to be analyzed are read from HDFS into the data structure, and each word in a case is replaced with its corresponding trained vector.
7. The parallelized power equipment defect text classification method as claimed in claim 6, characterized in that: the vectors of the words of each case are averaged to give the overall feature of that case; the computed results are organized so that each line corresponds to the feature of one case, in the format "Dj urgency-level category label", and are output to HDFS in txt format.
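By way of illustration only, the averaging of word vectors into one feature per case might look like the Python sketch below. The two-dimensional vectors, the label value, the choice to skip words that have no trained vector, and the exact layout of the output line (feature values followed by the label) are assumptions, not details stated in the claim.

```python
def case_feature(words, vectors):
    """Average the vectors of the words in one case; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None  # no word of this case has a trained vector
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Hypothetical trained word vectors (real ones would come from the word2vec model).
vectors = {
    "变压器": [1.0, 0.0],
    "渗漏": [0.0, 1.0],
    "油": [0.5, 0.5],
}

case_words = ["变压器", "渗漏", "油"]
urgency_label = 2  # hypothetical urgency-level category label

feature = case_feature(case_words, vectors)

# One text line per case: the averaged feature followed by the label.
line = " ".join(f"{x:.4f}" for x in feature) + " " + str(urgency_label)
```

Averaging yields a fixed-length feature regardless of how many words a case contains, which is what allows the cases to be stacked into a matrix for the classifier.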
8. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that, in step (3): the text feature data is imported into the data structure; the case data is split into a training set and a test set; the number of iterations is set; the model is built using stochastic gradient descent and trained on the training set; the training result is evaluated using accuracy or recall; if the evaluation result does not meet the set condition, the iteration parameters and model parameters are readjusted until the output result meets the set condition.
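The split, train, evaluate and readjust loop described above can be sketched as follows. This is a minimal pure-Python linear SVM with hinge loss and no intercept, trained by stochastic gradient descent on hypothetical two-dimensional data; it is not Spark's implementation, and the learning rate, regularization value, accuracy target and doubling schedule are all assumptions.

```python
import random


def train_svm_sgd(samples, iterations, learning_rate=0.1, reg=0.01, seed=0):
    """Linear SVM (hinge loss, L2 penalty, no intercept) fitted by SGD."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    for _ in range(iterations):
        x, y = rng.choice(samples)                       # one random sample per step
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [wi * (1 - learning_rate * reg) for wi in w]  # L2 shrinkage
        if margin < 1:                                    # hinge loss is active
            w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def accuracy(w, samples):
    return sum(predict(w, x) == y for x, y in samples) / len(samples)


# Hypothetical case features with labels +1 / -1.
data = [((1.0 + 0.1 * i, 1.5 + 0.05 * i), 1) for i in range(10)]
data += [((-1.0 - 0.1 * i, -1.5 - 0.05 * i), -1) for i in range(10)]

random.Random(42).shuffle(data)
train_set, test_set = data[:16], data[16:]   # 80 % training, 20 % test

iterations, target = 5, 0.9                  # assumed starting point and condition
while True:
    w = train_svm_sgd(train_set, iterations)
    score = accuracy(w, test_set)
    if score >= target or iterations >= 1000:  # cap prevents an endless loop
        break
    iterations *= 2                          # readjust and train again
```

Recall could be substituted for accuracy in the evaluation step without changing the structure of the loop.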
9. The parallelized power equipment defect text classification method as claimed in claim 1, characterized in that, in step (4), the SVM algorithm is improved so that it can handle multi-class scenarios, the specific improvement being:
(4-1) the original case data is divided according to each urgency class, and the resulting sub-datasets are combined in pairs to form new combined datasets;
(4-2) each combined dataset in the original training set is input into the SVM binary classification toolkit of Spark to train a model;
(4-3) the data in the test set is input into each of the three trained SVM binary classifiers for class determination; each classifier casts a vote according to its classification result, and after the three classifiers have voted, the votes are summed to obtain the final classification result.
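The pairwise (one-versus-one) voting scheme of steps (4-1) to (4-3) can be sketched as follows. To keep the example short, a nearest-centroid rule is used as a stand-in for each binary SVM, so this shows only the pairing and voting logic; the three class names and the sample points are hypothetical.

```python
from collections import Counter
from itertools import combinations


def train_binary(points_a, points_b):
    """Stand-in binary classifier (nearest class centroid, not an SVM)."""
    def centroid(points):
        return [sum(c) / len(points) for c in zip(*points)]

    ca, cb = centroid(points_a), centroid(points_b)

    def classify(x):
        da = sum((xi - ci) ** 2 for xi, ci in zip(x, ca))
        db = sum((xi - ci) ** 2 for xi, ci in zip(x, cb))
        return 0 if da <= db else 1   # index of the winning class in the pair

    return classify


# (4-1) Case data divided by urgency class (hypothetical names and features).
by_class = {
    "general": [(0.0, 0.0), (0.2, 0.1)],
    "serious": [(5.0, 0.0), (5.2, 0.3)],
    "critical": [(0.0, 5.0), (0.1, 5.3)],
}

# (4-2) One binary classifier per pair of classes: three pairs for three classes.
classifiers = {
    (a, b): train_binary(by_class[a], by_class[b])
    for a, b in combinations(by_class, 2)
}


# (4-3) Every pairwise classifier votes; the class with the most votes wins.
def classify(x):
    votes = Counter()
    for pair, classifier in classifiers.items():
        votes[pair[classifier(x)]] += 1
    return votes.most_common(1)[0][0]
```

With three classes a sample can in principle receive one vote for each class; this sketch then returns whichever class was counted first, and a real system would need an explicit tie-breaking rule.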
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711288010.XA CN108021679A (en) | 2017-12-07 | 2017-12-07 | A kind of power equipments defect file classification method of parallelization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711288010.XA CN108021679A (en) | 2017-12-07 | 2017-12-07 | A kind of power equipments defect file classification method of parallelization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108021679A (en) | 2018-05-11 |
Family
ID=62078915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711288010.XA Pending CN108021679A (en) | 2017-12-07 | 2017-12-07 | A kind of power equipments defect file classification method of parallelization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108021679A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804558A (en) * | 2018-05-22 | 2018-11-13 | 北京航空航天大学 | A kind of defect report automatic classification method based on semantic model |
CN109101483A (en) * | 2018-07-04 | 2018-12-28 | 浙江大学 | A kind of wrong identification method for electric inspection process text |
CN109101481A (en) * | 2018-06-25 | 2018-12-28 | 北京奇艺世纪科技有限公司 | A kind of name entity recognition method, device and electronic equipment |
CN109146152A (en) * | 2018-08-01 | 2019-01-04 | 北京京东金融科技控股有限公司 | Incident classification prediction technique and device on a kind of line |
CN110287321A (en) * | 2019-06-26 | 2019-09-27 | 南京邮电大学 | A kind of electric power file classification method based on improvement feature selecting |
CN110781671A (en) * | 2019-10-29 | 2020-02-11 | 西安科技大学 | Knowledge mining method for intelligent IETM fault maintenance record text |
CN110895565A (en) * | 2019-11-29 | 2020-03-20 | 国网湖南省电力有限公司 | Method and system for classifying fault defect texts of power equipment |
CN111177367A (en) * | 2019-11-11 | 2020-05-19 | 腾讯科技(深圳)有限公司 | Case classification method, classification model training method and related products |
CN111191447A (en) * | 2019-12-18 | 2020-05-22 | 东软集团股份有限公司 | Equipment defect classification method, device and equipment |
CN111241811A (en) * | 2020-01-06 | 2020-06-05 | 平安科技(深圳)有限公司 | Method, apparatus, computer device and storage medium for determining search term weight |
CN111931861A (en) * | 2020-09-09 | 2020-11-13 | 北京志翔科技股份有限公司 | Anomaly detection method for heterogeneous data set and computer-readable storage medium |
CN112749079A (en) * | 2019-10-31 | 2021-05-04 | 中国移动通信集团浙江有限公司 | Defect classification method and device for software test and computing equipment |
CN114444469A (en) * | 2022-01-11 | 2022-05-06 | 国家电网有限公司客户服务中心 | Processing device based on 95598 customer service data resources |
CN116383390A (en) * | 2023-06-05 | 2023-07-04 | 南京数策信息科技有限公司 | Unstructured data storage method for management information and cloud platform |
CN117057312A (en) * | 2023-10-11 | 2023-11-14 | 北京洛斯达科技发展有限公司 | Python-based precise splitting method for extra-high voltage engineering water conservation design document |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105389379A (en) * | 2015-11-20 | 2016-03-09 | 重庆邮电大学 | Rubbish article classification method based on distributed feature representation of text |
CN105550200A (en) * | 2015-12-02 | 2016-05-04 | 北京信息科技大学 | Chinese segmentation method oriented to patent abstract |
CN105740424A (en) * | 2016-01-29 | 2016-07-06 | 湖南大学 | Spark platform based high efficiency text classification method |
- 2017-12-07 CN CN201711288010.XA patent/CN108021679A/en active Pending
Non-Patent Citations (3)
Title |
---|
YODE: "SVM多类分类---多个二值分类combine", 《新浪博客HTTP://BLOG.SINA.COM.CN/S/BLOG_4C98B96001009B8D.HTML》 * |
冯贵川: "基于Word2vec的文本建模及分类研究", 《中国优秀硕士学位论文全文数据库 信息科技(月刊)计算机软件及计算机应用》 * |
风中迷茫的蛤蛤: "ansj分词教程", 《CSDN博客HTTPS://BLOG.CSDN.NET/A360616218/ARTICLE/DETAILS/75268959》 * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804558A (en) * | 2018-05-22 | 2018-11-13 | 北京航空航天大学 | A kind of defect report automatic classification method based on semantic model |
CN109101481A (en) * | 2018-06-25 | 2018-12-28 | 北京奇艺世纪科技有限公司 | A kind of name entity recognition method, device and electronic equipment |
CN109101481B (en) * | 2018-06-25 | 2022-07-22 | 北京奇艺世纪科技有限公司 | Named entity identification method and device and electronic equipment |
CN109101483A (en) * | 2018-07-04 | 2018-12-28 | 浙江大学 | A kind of wrong identification method for electric inspection process text |
CN109101483B (en) * | 2018-07-04 | 2020-04-14 | 浙江大学 | Error identification method for power inspection text |
CN109146152A (en) * | 2018-08-01 | 2019-01-04 | 北京京东金融科技控股有限公司 | Incident classification prediction technique and device on a kind of line |
CN110287321A (en) * | 2019-06-26 | 2019-09-27 | 南京邮电大学 | A kind of electric power file classification method based on improvement feature selecting |
CN110781671A (en) * | 2019-10-29 | 2020-02-11 | 西安科技大学 | Knowledge mining method for intelligent IETM fault maintenance record text |
CN110781671B (en) * | 2019-10-29 | 2023-02-14 | 西安科技大学 | Knowledge mining method for intelligent IETM fault maintenance record text |
CN112749079A (en) * | 2019-10-31 | 2021-05-04 | 中国移动通信集团浙江有限公司 | Defect classification method and device for software test and computing equipment |
CN112749079B (en) * | 2019-10-31 | 2023-12-26 | 中国移动通信集团浙江有限公司 | Defect classification method and device for software test and computing equipment |
CN111177367A (en) * | 2019-11-11 | 2020-05-19 | 腾讯科技(深圳)有限公司 | Case classification method, classification model training method and related products |
CN110895565A (en) * | 2019-11-29 | 2020-03-20 | 国网湖南省电力有限公司 | Method and system for classifying fault defect texts of power equipment |
CN111191447B (en) * | 2019-12-18 | 2023-07-14 | 东软集团股份有限公司 | Equipment defect classification method, device and equipment |
CN111191447A (en) * | 2019-12-18 | 2020-05-22 | 东软集团股份有限公司 | Equipment defect classification method, device and equipment |
CN111241811A (en) * | 2020-01-06 | 2020-06-05 | 平安科技(深圳)有限公司 | Method, apparatus, computer device and storage medium for determining search term weight |
CN111241811B (en) * | 2020-01-06 | 2024-05-10 | 平安科技(深圳)有限公司 | Method, apparatus, computer device and storage medium for determining search term weight |
CN111931861A (en) * | 2020-09-09 | 2020-11-13 | 北京志翔科技股份有限公司 | Anomaly detection method for heterogeneous data set and computer-readable storage medium |
CN114444469A (en) * | 2022-01-11 | 2022-05-06 | 国家电网有限公司客户服务中心 | Processing device based on 95598 customer service data resources |
CN114444469B (en) * | 2022-01-11 | 2024-07-09 | 国家电网有限公司客户服务中心 | Processing device based on 95598 customer service data resource |
CN116383390A (en) * | 2023-06-05 | 2023-07-04 | 南京数策信息科技有限公司 | Unstructured data storage method for management information and cloud platform |
CN116383390B (en) * | 2023-06-05 | 2023-08-08 | 南京数策信息科技有限公司 | Unstructured data storage method for management information and cloud platform |
CN117057312A (en) * | 2023-10-11 | 2023-11-14 | 北京洛斯达科技发展有限公司 | Python-based precise splitting method for extra-high voltage engineering water conservation design document |
CN117057312B (en) * | 2023-10-11 | 2023-12-29 | 北京洛斯达科技发展有限公司 | Python-based precise splitting method for extra-high voltage engineering water conservation design document |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108021679A (en) | A kind of power equipments defect file classification method of parallelization | |
JP7090936B2 (en) | ESG-based corporate evaluation execution device and its operation method | |
CN103631859B (en) | Intelligent review expert recommending method for science and technology projects | |
CN107944480A (en) | A kind of enterprises ' industry sorting technique | |
WO2018028077A1 (en) | Deep learning based method and device for chinese semantics analysis | |
CN107861942A (en) | A kind of electric power based on deep learning is doubtful to complain work order recognition methods | |
CN107330011A (en) | The recognition methods of the name entity of many strategy fusions and device | |
CN106446230A (en) | Method for optimizing word classification in machine learning text | |
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
CN106651057A (en) | Mobile terminal user age prediction method based on installation package sequence table | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN109948340A (en) | The PHP-Webshell detection method that a kind of convolutional neural networks and XGBoost are combined | |
CN108268431B (en) | The method and apparatus of paragraph vectorization | |
CN110472040A (en) | Extracting method and device, storage medium, the computer equipment of evaluation information | |
CN109299264A (en) | File classification method, device, computer equipment and storage medium | |
CN107832290B (en) | Method and device for identifying Chinese semantic relation | |
CN105045913B (en) | File classification method based on WordNet and latent semantic analysis | |
CN109684476A (en) | A kind of file classification method, document sorting apparatus and terminal device | |
CN110134961A (en) | Processing method, device and the storage medium of text | |
CN110889412B (en) | Medical long text positioning and classifying method and device in physical examination report | |
CN110097096A (en) | A kind of file classification method based on TF-IDF matrix and capsule network | |
CN109271516A (en) | Entity type classification method and system in a kind of knowledge mapping | |
CN110232128A (en) | Topic file classification method and device | |
CN106649250A (en) | Method and device for identifying emotional new words | |
CN108363691A (en) | A kind of field term identifying system and method for 95598 work order of electric power |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180511 |