CN110389932B - Automatic classification method and device for power files - Google Patents

Info

Publication number
CN110389932B
Authority
CN
China
Prior art keywords
classified
power
word
title
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910588345.6A
Other languages
Chinese (zh)
Other versions
CN110389932A (en)
Inventor
徐小天
李敏
孙跃
高冉馨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
North China Electric Power Research Institute Co Ltd
Original Assignee
State Grid Corp of China SGCC
North China Electric Power Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and North China Electric Power Research Institute Co Ltd
Priority to CN201910588345.6A
Publication of CN110389932A
Application granted
Publication of CN110389932B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/152 File search processing using file content signatures, e.g. hash values
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The application provides an automatic classification method and device for power files. The method comprises the following steps: generating a corpus set, a title set and a vocabulary set according to the power files to be classified; generating a K-dimensional input vector based on the vocabulary set, wherein K is the number of keywords in the vocabulary set; training the input vector with the corpus set and the title set using word embedding, so as to compress the K-dimensional input vector into a C-dimensional word vector; counting the word frequency of each keyword in the vocabulary set in each power file to be classified and generating a vector corresponding to each power file to be classified; constructing a matrix from the C-dimensional word vectors and calculating the product of the matrix and the vector corresponding to each power file to be classified; and performing cluster analysis on the product results, dividing the power files to be classified into a preset number of categories using the Minkowski distance as the vector distance in the cluster analysis. The method and the device can improve classification efficiency and classification accuracy.

Description

Automatic classification method and device for power files
Technical Field
The invention relates to the field of data processing, in particular to an automatic classification method and device for power files.
Background
In technical services such as power production and debugging, technicians generate a large number of periodic technical reports in the course of their work for the purposes of data retention and knowledge sharing. These reports are enormous in number and, because they were written in different periods by different authors, the templates and naming rules they use differ greatly. Furthermore, because of their age, the technical reports are often kept on technicians' personal electronic storage media, and only the year and author may be recorded when they are collected, so the vast majority of collected reports lack the logical relationships needed to sort them into categories.
In the prior art, a method for manually classifying reports is mainly adopted. During manual classification, a large number of reports need to be checked one by one, so that the category of each report is determined, and the classification of a large number of collected reports is realized.
Therefore, in the prior art, a large amount of manpower is required to be invested by adopting a manual classification method, the classification cost is increased, and the classification efficiency is low.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an automatic classification method and device for power files, which can reduce user operation and improve classification efficiency and classification accuracy.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, the present invention provides a method for automatically classifying power files, including:
generating a corpus set, a title set and a vocabulary set according to the power file to be classified; the vocabulary set consists of keywords in each title of the power file to be classified;
generating a K-dimensional input vector based on the vocabulary set; k is the number of keywords in the vocabulary set;
training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector;
counting word frequencies of all keywords in the vocabulary set in all to-be-classified power files respectively and generating vectors corresponding to all to-be-classified power files respectively;
constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and vectors corresponding to the electric power files to be classified respectively;
and performing clustering analysis on each product result and dividing the power file to be classified into a preset number of categories by using the Minkowski distance as a vector distance in the clustering analysis.
Further, the method also comprises the following steps:
respectively calculating the mean value of a plurality of product results in each category, and respectively determining each power file corresponding to each product result with the minimum difference value with the mean value in each category;
hash values of the titles of the respective power files are adopted as tags of the respective categories.
Further, the generating a corpus set, a title set and a vocabulary set according to the power file to be classified includes:
extracting a title, an abstract and a text first section of the electric power file to be classified;
carrying out sentence-splitting processing on the extracted abstract and the first segment of the text to obtain a corpus set;
obtaining a title set based on the extracted titles and performing word segmentation processing on the extracted titles to obtain keywords in each title;
the vocabulary set is composed of keywords in each title of the power file to be classified.
Further, the word embedding mode includes: at least one of Word2Vec and GloVe.
Further, the counting the word frequency of each keyword in the vocabulary set in each power file to be classified respectively includes:
and calculating the word frequency of each keyword in the vocabulary set in each power file to be classified by adopting a TF-IDF mode.
Further, the cluster analysis employs at least one of K-Means and Gaussian mixture models.
In a second aspect, the present invention provides an automatic classification device for power files, including:
the collection unit is used for generating a corpus collection, a title collection and a vocabulary collection according to the power file to be classified; the vocabulary set consists of key words in each title of the power file to be classified;
the vector unit is used for generating a K-dimensional input vector based on the vocabulary set; k is the number of the keywords in the vocabulary set;
the training unit is used for training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector;
the word frequency unit is used for counting the word frequency of each keyword in the vocabulary set in each electric power file to be classified and generating a vector corresponding to each electric power file to be classified;
the matrix unit is used for constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and the vectors corresponding to the electric power files to be classified respectively;
and the classifying unit is used for carrying out cluster analysis on each product result and dividing the power file to be classified into a preset number of categories by using the Minkowski distance as the vector distance in the cluster analysis.
Further, the device also includes:
the mean value unit is used for respectively calculating the mean values of a plurality of product results in each category and respectively determining each power file corresponding to each product result with the minimum difference value with the mean value in each category;
a title unit for adopting the hash value of the title of each power file as the label of each category.
Further, the collection unit includes:
the extracting subunit is used for extracting the title, the abstract and the text first section of the power file to be classified;
the first generating subunit is used for carrying out sentence-splitting processing on the extracted abstract and the first segment of the text to obtain a corpus set;
the second generation subunit is used for obtaining a title set based on the extracted titles and performing word segmentation processing on the extracted titles to obtain keywords in each title;
the vocabulary set is composed of keywords in each title of the power files to be classified.
Further, the word frequency unit includes:
and the word frequency subunit is used for calculating the word frequency of each keyword in the vocabulary set in each electric power file to be classified in a TF-IDF mode.
Further, the word embedding mode includes: at least one of Word2Vec and GloVe.
Further, the cluster analysis employs at least one of K-Means and Gaussian mixture models.
In a third aspect, the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the power file automatic classification method.
In a fourth aspect, the present invention provides a computer readable storage medium, on which a computer program is stored, which computer program, when executed by a processor, implements the steps of the method for automatically classifying power files.
According to the technical scheme, the invention provides the automatic classification method and the automatic classification device for the power files, wherein a corpus set, a title set and a vocabulary set are generated according to the power files to be classified; the vocabulary set consists of key words in each title of the power file to be classified; generating a K-dimensional input vector based on the vocabulary set; wherein K is the number of the words in the word set; training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector; counting word frequencies of all keywords in the word set in all to-be-classified power files respectively and generating vectors corresponding to all to-be-classified power files respectively; constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and the vectors corresponding to the electric power files to be classified respectively; and clustering analysis is carried out on each product result, and the Minkowski distance is used as the vector distance in the clustering analysis to divide the power file to be classified into a preset number of categories, so that automatic classification of the power file can be realized, the problem of low classification efficiency caused by manual classification is avoided, the labor cost is reduced, and the classification efficiency and the classification accuracy are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic diagram of a communication structure of an apparatus for automatically classifying power files according to the present invention.
Fig. 2 is a schematic view of another communication structure of the method and apparatus for automatically classifying power files according to the present invention.
Fig. 3 is a schematic flow chart of an automatic classification method for power files according to an embodiment of the present invention.
Fig. 4 is a schematic flowchart of another method for automatically classifying power files according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an automatic classification apparatus for power files according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of another automatic power file sorting apparatus according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
For effective storage and shared use of data, research institutions in the power industry need to classify power reports by business type, year, service object (such as power plant units, transformer substations and new energy stations), equipment model and other criteria, so that technical reports of the same category are related in business terms. Because of the large number of reports and the inconsistent naming rules described above, sorting these power technical reports by hand is a difficult task.
Considering that the existing manual classification method requires a large amount of manpower, increases the classification cost and has low classification efficiency, the invention provides an automatic classification method, an automatic classification device, an electronic device and a computer-readable storage medium for power files, in which a corpus set, a title set and a vocabulary set are generated according to the power files to be classified; the vocabulary set consists of keywords in each title of the power files to be classified; a K-dimensional input vector is generated based on the vocabulary set, wherein K is the number of keywords in the vocabulary set; the input vector is trained with the corpus set and the title set using word embedding, so as to compress the K-dimensional input vector into a C-dimensional word vector; the word frequency of each keyword in the vocabulary set is counted in each power file to be classified and a vector corresponding to each power file to be classified is generated; a matrix is constructed from the C-dimensional word vectors and the product of the matrix and the vector corresponding to each power file to be classified is calculated; and cluster analysis is performed on the product results, with the Minkowski distance used as the vector distance to divide the power files to be classified into a preset number of categories. In this way, automatic classification of the power files can be realized, the low classification efficiency of manual classification is avoided, the labor cost is reduced, and the classification efficiency and the classification accuracy are improved.
Based on the above content, the present invention further provides an automatic classification device for power files, which may be a server A1, and referring to fig. 1, the server A1 may be in communication connection with a client device B1, a user may input a power file to be classified and other related data into the client device B1, the client device B1 may send the power file to be classified and other related data to the server A1 on line, and the server A1 may receive the power file to be classified and other related data sent by the client device B1 on line, and then classify the power file to be classified off line or on line according to the power file to be classified. Then, the server A1 sends the classification result to the client device B1 on line, so that the user can obtain the final classification result through the client device B1.
Further, the server A1 may be further in communication connection with a power file collecting device C1 to be classified, as shown in fig. 2, the power file collecting device C1 to be classified may directly obtain the power file to be classified and other related data from a target area, or may be in communication connection with a database D1 to obtain the corresponding power file to be classified and other related data from the database D1. Then, the to-be-classified power file collecting device C1 sends the to-be-classified power file and other related data to the server A1.
It is understood that the client device B1 may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. Smart wearable devices may include smart glasses, smart watches, smart bracelets, and the like.
In practical applications, the part for automatically classifying the power file may be performed on the server A1 side as described above, that is, the architecture shown in fig. 1, or all operations may be completed in the client device B1. Specifically, the selection may be performed according to the processing capability of the client device B1, the limitation of the user usage scenario, and the like. The invention is not limited in this regard. If all the operations are completed in the client device B1, the client device B1 may further include a processor for performing specific processing of automatically classifying the power file.
The client device may have a communication module (i.e., a communication unit) and may be communicatively connected to a remote server to implement data transmission with the server. For example, the communication unit may transmit the power file to be classified and other related data input by the user to the server, so that the server automatically classifies the power file according to the power file to be classified and other related data. The communication unit may also receive the classification result returned by the server. The server may include a server on the task scheduling center side, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
The server and the client device may communicate using any suitable network protocol, including network protocols not yet developed at the filing date of the present application. The network protocol may include, for example, the TCP/IP protocol, the UDP/IP protocol, the HTTP protocol, the HTTPS protocol, or the like. Of course, the network protocol may also include, for example, an RPC protocol (Remote Procedure Call Protocol), a REST protocol (Representational State Transfer Protocol), and the like, layered on top of the above protocols.
In order to effectively improve the classification efficiency and the classification accuracy, the embodiment of the automatic classification method for the power file provided by the invention is as follows, with reference to fig. 3, the automatic classification method for the power file specifically includes the following contents:
S101: generating a corpus set, a title set and a vocabulary set according to the power file to be classified; the vocabulary set consists of keywords in each title of the power file to be classified;
In this step, the power files to be classified are retrieved from the power file library, and the title, abstract and first paragraph of the main text of each file are extracted. All the extracted abstracts and first paragraphs are split into sentences to obtain a corpus set, recorded as $S = \{s_1, s_2, \ldots, s_L\}$. The titles of the power files to be classified form a title set, recorded as $S_t = \{s_t^1, s_t^2, \ldots, s_t^M\}$.
Word segmentation is performed on the extracted titles to obtain the keywords in each title. Specifically, the title set $S_t$ is used as the input of a word segmenter, which outputs K non-repeated keywords; these K keywords form the vocabulary set $W_t = \{w_1, w_2, \ldots, w_K\}$.
It should be noted that the title set is used as a basis for article classification, and the corpus set is used as a corpus for training to determine the inter-vocabulary connection. Furthermore, keywords appearing in the title are used as the basis for classifying the power file, and the clauses in the abstract and the first segment of the text are used as training linguistic data for determining the connection between the vocabularies.
In this step, while the word segmenter outputs the K non-repeated keywords, high-frequency words that are meaningless for classification, such as auxiliary words and quantifiers, are removed.
From the above description, the keywords are only from the title of the power file, and the training corpus used is from the abstract and the key part of the text of the power file, so that the calculation amount of the word embedding training process is reduced.
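Step S101 can be illustrated with a minimal sketch. It is not the patented implementation: the regular-expression tokenizer merely stands in for a real word segmenter, and the stop-word list, the dictionary keys (`title`, `abstract`, `first_paragraph`) and the sample files are all hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would use a domain-specific one.
STOP_WORDS = {"the", "of", "a", "an", "and", "for", "on", "in"}

def split_sentences(text):
    """Split text into sentences on common Western and CJK terminators."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]

def segment(title):
    """Stand-in word segmenter: lowercase alphanumeric tokens minus stop words."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_sets(files):
    """files: list of dicts with 'title', 'abstract' and 'first_paragraph' keys."""
    corpus, titles, vocabulary = [], [], []
    for f in files:
        titles.append(f["title"])
        corpus.extend(split_sentences(f["abstract"]))
        corpus.extend(split_sentences(f["first_paragraph"]))
        for word in segment(f["title"]):
            if word not in vocabulary:  # keep keywords non-repeated, in first-seen order
                vocabulary.append(word)
    return corpus, titles, vocabulary

# Made-up sample files, for illustration only.
files = [
    {"title": "Commissioning report of the boiler unit",
     "abstract": "The boiler was tested. Results were normal.",
     "first_paragraph": "This report covers unit 1."},
    {"title": "Test report of transformer protection",
     "abstract": "Protection relays were checked.",
     "first_paragraph": "The substation was inspected. No faults were found."},
]
corpus, titles, vocabulary = build_sets(files)
```

Here `corpus` plays the role of S, `titles` of S_t and `vocabulary` of W_t; only the titles feed the vocabulary, which is what keeps K small.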
S102: generating a K-dimensional input vector based on the vocabulary set; K is the number of keywords in the vocabulary set;
In this step, the vocabulary set $W_t$ is used to generate a K-dimensional input vector $V_{init}(w_i)$ for each keyword, recorded as $V_{init}(w_i) = [v_{w_i}^1, v_{w_i}^2, \ldots, v_{w_i}^K]^T$.
Here, $w_i$ is the i-th keyword in the vocabulary set $W_t$, $i = 1, 2, \ldots, K$, where K is the number of keywords in the vocabulary set; $v_{w_i}^j$ is the j-th element of the K-dimensional input vector of the i-th keyword, $j = 1, 2, \ldots, K$.
If $j = i$, then $v_{w_i}^j = 1$; if $j \neq i$, then $v_{w_i}^j = 0$.
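The input vector defined above is a one-hot encoding. A minimal sketch follows; the four-keyword vocabulary is hypothetical, and the code uses 0-based indices where the text uses 1-based ones.

```python
def one_hot(index, size):
    """Return the K-dimensional one-hot vector with a 1 at `index` (0-based)."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if j == index else 0 for j in range(size)]

# Hypothetical vocabulary set W_t with K = 4 keywords.
vocabulary = ["boiler", "transformer", "relay", "turbine"]
K = len(vocabulary)
input_vectors = {w: one_hot(i, K) for i, w in enumerate(vocabulary)}
```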
S103: training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector;
In this step, the corpus set and the title set are used to train the input vectors by means of a statistics-based or prediction-based word embedding method; specifically, at least one of Word2Vec and GloVe may be used. Word2Vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words: the network takes a word as input and predicts the words in adjacent positions, and under the bag-of-words assumption in Word2Vec the order of those words is unimportant. After training is completed, the Word2Vec model can map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network. GloVe is an unsupervised learning algorithm for obtaining word vector representations. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit linear substructures of the word vector space.
In this step, after training is completed, the i-th keyword $w_i$ in the vocabulary set $W_t$ is compressed from $V_{init}(w_i)$ to a preset dimension C, giving the C-dimensional word vector $V_c(w_i)$.
It should be noted that the K-dimensional input vector is reduced to a C-dimensional word vector, with K > C.
Each C-dimensional word vector $V_c(w_i)$ obtained corresponds one-to-one with the original keyword $w_i$. According to the preset value of C, the keywords $w_i$ from the original titles are mapped into a relatively lower-dimensional space.
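The compression from K to C dimensions can be sketched as follows. In a Word2Vec-style network, multiplying a one-hot input by the learned K x C input weight matrix simply selects one row of that matrix. The sketch does not train anything: the weight values are made up and only stand in for what Word2Vec or GloVe training would produce.

```python
def embed(one_hot_vec, weights):
    """Multiply a 1 x K one-hot vector by a K x C weight matrix."""
    K, C = len(weights), len(weights[0])
    if len(one_hot_vec) != K:
        raise ValueError("one-hot length must equal the number of rows in weights")
    return [sum(one_hot_vec[k] * weights[k][c] for k in range(K)) for c in range(C)]

# Hypothetical K x C weight matrix (K = 4 keywords, C = 2); values are made up,
# standing in for the result of word embedding training.
weights = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.8, 0.2],
]
v_init = [0, 0, 1, 0]          # one-hot vector of the third keyword
v_c = embed(v_init, weights)   # equals the third row of `weights`
```

In practice the multiplication is replaced by a direct row lookup, since every other term in the sum is zero.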
S104: counting word frequencies of all keywords in the vocabulary set in all to-be-classified power files respectively and generating vectors corresponding to all to-be-classified power files respectively;
In this step, the word frequency of each keyword in each power file to be classified is counted, with the relative word frequency of each keyword computed in the TF-IDF manner.
For the m-th power file $d_m$, the frequency of occurrence of the i-th keyword $w_i$ is denoted $f_{d_m}^i$. Each power file $d_m$ therefore yields a K-dimensional vector $V_f(d_m) = [f_{d_m}^1, f_{d_m}^2, \ldots, f_{d_m}^K]^T$, which indicates the frequency of occurrence of all title keywords in that power file.
It should be noted that TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.
From the above description, the power file is represented by the word frequency linear combination of the word vectors obtained by using the word embedding mode, so that the calculation cost of vectorization of the power file can be effectively reduced while the interconnection of words is kept, and the training calculation amount of power file clustering is reduced.
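The per-file frequency vector can be sketched as below. The source does not specify which TF-IDF variant is used, so the formula here (term count divided by document length, times the natural logarithm of N over document frequency) is an assumption, as are the sample documents.

```python
import math

def tf_idf_vectors(docs, vocabulary):
    """docs: list of token lists. Returns one K-dimensional vector per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each keyword.
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    vectors = []
    for d in docs:
        vec = []
        for w in vocabulary:
            tf = d.count(w) / len(d) if d else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Made-up tokenized documents and vocabulary, for illustration only.
vocabulary = ["boiler", "relay", "report"]
docs = [
    ["boiler", "test", "report", "boiler"],
    ["relay", "test", "report"],
]
vectors = tf_idf_vectors(docs, vocabulary)
```

With this unsmoothed variant, a keyword that appears in every file (such as "report" above) receives weight zero; smoothed IDF formulas are a common alternative when that is undesirable.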
S105: constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and the vectors corresponding to the electric power files to be classified respectively;
In this step, a matrix $M_w$ is constructed from the C-dimensional word vectors, recorded as $M_w = [V_c(w_1), V_c(w_2), \ldots, V_c(w_K)]$. $M_w$ is a C × K matrix whose K columns are the C-dimensional word vectors of the keywords. The product of this matrix and the vector corresponding to each power file to be classified is then calculated: for the m-th power file $d_m$, $V_{d_m} = M_w V_f(d_m)$ is computed, so $V_{d_m}$ is a C-dimensional vector.
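The matrix-vector product of step S105 amounts to a frequency-weighted sum of word vectors. A minimal sketch, with made-up word vectors and weights (K = 3, C = 2):

```python
def document_vector(word_vectors, freq_vector):
    """Compute V_d = M_w V_f, where the columns of M_w are the word vectors.

    word_vectors: list of K word vectors, each of length C.
    freq_vector:  list of K keyword weights for one document.
    """
    if len(word_vectors) != len(freq_vector):
        raise ValueError("need one weight per keyword")
    C = len(word_vectors[0])
    return [sum(f * v[c] for f, v in zip(freq_vector, word_vectors)) for c in range(C)]

# Hypothetical values: K = 3 keywords embedded in C = 2 dimensions.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
freq_vector = [0.5, 0.0, 0.25]
v_d = document_vector(word_vectors, freq_vector)
```

Storing the word vectors as a list of columns avoids building the C x K matrix explicitly while computing the same product.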
S106: and performing clustering analysis on each product result and dividing the power file to be classified into a preset number of categories by using the Minkowski distance as a vector distance in the clustering analysis.
In this step, cluster analysis is performed on the products of the matrix and the vectors corresponding to the power files to be classified. At least one of the K-Means and Gaussian mixture model algorithms may be used for the cluster analysis; with the Minkowski distance as the vector distance, the power files to be classified are divided into H categories according to the preset category number H. Clustering may be run multiple times with different values of H, and a suitable H is then selected according to the acceptable upper limit of H and the clustering results.
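A minimal K-Means sketch using the Minkowski distance in the assignment step is given below. Several choices are assumptions made for brevity and are not taken from the source: centroids are initialised from the first H points, the centroid update is the plain arithmetic mean regardless of the distance order `p`, and the sample points are made up.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def k_means(points, h, p=2, iterations=100):
    """Partition `points` into `h` clusters; returns one cluster index per point."""
    centroids = [list(pt) for pt in points[:h]]  # simplistic deterministic start
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid under the Minkowski distance.
        new_labels = [
            min(range(h), key=lambda j: minkowski(pt, centroids[j], p))
            for pt in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(h):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up document vectors forming two well-separated groups.
points = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
labels = k_means(points, h=2)
```

To explore different values of H as the text describes, `k_means` can be called once per candidate H and the resulting partitions compared.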
Further, when the Minkowski distance is used as the vector distance in clustering, it reduces to the Manhattan distance when its order parameter is 1, to the Euclidean distance when the parameter is 2, and to the Chebyshev distance as the parameter approaches infinity.
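The three special cases can be checked numerically with a short sketch. The function and variable names are illustrative only.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Limit of the Minkowski distance as p approaches infinity."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, 1)   # |3| + |4|
euclidean = minkowski(a, b, 2)   # sqrt(9 + 16)
limit = chebyshev(a, b)          # max(|3|, |4|)
```

For large finite orders, `minkowski(a, b, p)` approaches `chebyshev(a, b)` from above.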
As can be seen from the above description, in the automatic classification method for power files according to the embodiment of the present invention, a corpus set, a title set, and a vocabulary set are generated according to a power file to be classified; the vocabulary set consists of key words in each title of the power file to be classified; generating a K-dimensional input vector based on the vocabulary set; wherein K is the number of the words in the word set; training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector; counting word frequencies of all keywords in the word set in all to-be-classified power files respectively and generating vectors corresponding to all to-be-classified power files respectively; constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and the vectors corresponding to the electric power files to be classified respectively; the method has the advantages that each product result is subjected to cluster analysis, the Minkowski distance is used as the vector distance in the cluster analysis to divide the power files to be classified into the preset number of categories, automatic classification of the power files can be achieved, the problem of low classification efficiency caused by manual classification is avoided, meanwhile, labor cost is reduced, and classification efficiency and classification accuracy are improved.
Based on the above embodiment of the automatic classification method for power files, an embodiment of the present invention provides a further embodiment of the method. Referring to fig. 4, on the basis of the above embodiment, the method further includes:
S107: respectively calculating the mean value of the product results in each category, and determining, in each category, the power file corresponding to the product result with the smallest difference from the mean value;
S108: adopting the hash value of the title of each determined power file as the label of the corresponding category.
In the present embodiment, after the power files to be classified have been divided into H categories, the mean V_mean of the product results in each category is calculated, the Euclidean distance is used to find the product result V_dm* with the smallest difference from V_mean, and the hash value of the title of the power file corresponding to V_dm* is used as the label of the category. Because the power file closest to the mean differs from category to category, the labels of the categories differ from one another, which gives each category of classified power files a name.
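Steps S107 and S108 can be sketched as below. The text only says "hash value", so the use of SHA-256 is an assumption, as are the function name `category_labels` and the sample data.

```python
import hashlib
import math

def category_labels(product_results, titles, labels):
    """Map each category to the hash of its most central file's title."""
    result = {}
    for cat in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cat]
        dim = len(product_results[idx[0]])
        # V_mean: component-wise mean of the product results in this category.
        mean = [sum(product_results[i][d] for i in idx) / len(idx)
                for d in range(dim)]
        # V_dm*: the product result closest to V_mean (Euclidean distance).
        nearest = min(idx, key=lambda i: math.dist(product_results[i], mean))
        # SHA-256 is an assumed choice of hash function.
        digest = hashlib.sha256(titles[nearest].encode("utf-8")).hexdigest()
        result[cat] = digest
    return result

# Hypothetical product results, titles, and cluster assignments.
product_results = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.1],
                   [10.0, 10.0], [10.0, 12.0], [10.0, 11.2]]
titles = ["Relay test A", "Relay test B", "Relay test C",
          "Line patrol A", "Line patrol B", "Line patrol C"]
labels = [0, 0, 0, 1, 1, 1]
tags = category_labels(product_results, titles, labels)
```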
As can be seen from the above description, the automatic classification method for power files provided in the embodiments of the present invention can effectively cluster a large number of unordered power files according to the word frequency and word order characteristics of the keywords of the power files, thereby classifying the power files automatically and ensuring that the power files within each cluster are related to one another in business terms. The method also avoids the low efficiency of manual classification, reduces labor cost, and improves classification efficiency and accuracy.
An embodiment of the present invention provides a specific implementation of an automatic classification device for power files that is capable of implementing all of the content of the automatic classification method for power files. Referring to fig. 5, the automatic classification device for power files specifically includes:
a collection unit 10, configured to generate a corpus set, a title set and a vocabulary set according to the power files to be classified, wherein the vocabulary set consists of the keywords in each title of the power files to be classified;
a vector unit 20, configured to generate a K-dimensional input vector based on the vocabulary set, wherein K is the number of keywords in the vocabulary set;
a training unit 30, configured to train the input vector using the corpus set and the title set based on a word embedding manner, so that the K-dimensional input vector is compressed into a C-dimensional word vector;
a word frequency unit 40, configured to count the word frequencies of the keywords in the vocabulary set in each power file to be classified and to generate a vector corresponding to each power file to be classified;
a matrix unit 50, configured to construct a matrix from the C-dimensional word vectors and to calculate the product of the matrix and the vector corresponding to each power file to be classified;
and a classifying unit 60, configured to perform cluster analysis on each product result and to divide the power files to be classified into a preset number of categories by using the Minkowski distance as the vector distance in the cluster analysis.
Further, referring to fig. 6, on the basis of the above automatic classification device for power files, the automatic classification device for power files further includes:
a mean value unit 70, configured to calculate the mean value of the product results in each category and to determine, in each category, the power file corresponding to the product result with the smallest difference from the mean value;
a title unit 80, configured to adopt the hash value of the title of each determined power file as the label of the corresponding category.
Further, the collection unit 10 includes:
an extracting subunit, configured to extract the title, the abstract and the first paragraph of the body text of each power file to be classified;
a first generating subunit, configured to perform sentence splitting on the extracted abstracts and first paragraphs to obtain the corpus set;
a second generating subunit, configured to obtain the title set from the extracted titles and to perform word segmentation on the extracted titles to obtain the keywords in each title;
wherein the vocabulary set is composed of the keywords in each title of the power files to be classified.
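The three subunits can be sketched as below. This is only an illustration: the dictionary keys (`title`, `abstract`, `first_paragraph`), the stop-word list, and the regular-expression tokenizer are assumptions. Chinese titles would need a proper word segmenter in place of `title_keywords`, which is a simple stand-in.

```python
import re

# Hypothetical stop-word list; a real system would use a domain-specific one.
STOPWORDS = {"a", "and", "for", "in", "of", "on", "the"}

def split_sentences(text):
    """Split text after Chinese or Western sentence-ending punctuation."""
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p.strip() for p in parts if p.strip()]

def title_keywords(title):
    """Placeholder word segmentation: word tokens minus stop words."""
    words = re.findall(r"\w+", title.lower())
    return [w for w in words if w not in STOPWORDS]

def build_sets(files):
    """Return (corpus_set, title_set, vocabulary) for a list of file records."""
    corpus_set, title_set, vocabulary = [], [], []
    for f in files:
        title_set.append(f["title"])
        # Corpus: sentences from the abstract and the first body paragraph.
        for part in (f["abstract"], f["first_paragraph"]):
            corpus_set.extend(split_sentences(part))
        # Vocabulary: distinct title keywords, in order of first appearance.
        for w in title_keywords(f["title"]):
            if w not in vocabulary:
                vocabulary.append(w)
    return corpus_set, title_set, vocabulary

files = [{
    "title": "Inspection of the Transformer Relay",
    "abstract": "Relays fail. They need checks.",
    "first_paragraph": "This report covers 2018.",
}]
corpus_set, title_set, vocabulary = build_sets(files)
```

The length of `vocabulary` is the K used for the input vectors.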
Further, the word frequency unit 40 includes:
a word frequency subunit, configured to calculate the word frequency of each keyword in the vocabulary set in each power file to be classified using TF-IDF.
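A minimal TF-IDF sketch follows. The text does not specify which TF-IDF variant is used, so the smoothed inverse document frequency below is an assumed, commonly used form, and the token lists are hypothetical.

```python
import math

def tf_idf_vectors(documents, vocabulary):
    """documents: list of token lists; returns one K-dim vector per document."""
    n = len(documents)
    # Document frequency: number of documents containing each keyword.
    doc_freq = {w: sum(1 for d in documents if w in d) for w in vocabulary}
    vectors = []
    for tokens in documents:
        vec = []
        for w in vocabulary:
            tf = tokens.count(w) / len(tokens) if tokens else 0.0
            # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1.
            idf = math.log((1 + n) / (1 + doc_freq[w])) + 1.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

vocabulary = ["relay", "transformer", "test"]
documents = [["relay", "relay", "test"], ["transformer", "test"]]
vectors = tf_idf_vectors(documents, vocabulary)
```

Each row of `vectors` is the K-dimensional vector for one power file that is later multiplied by the word-vector matrix.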
Further, the word embedding manner includes at least one of Word2Vec and GloVe.
Further, the cluster analysis employs at least one of K-Means and Gaussian mixture models.
The embodiment of the automatic classification device for power files provided by the present invention may be used to execute the processing flow of the above embodiment of the automatic classification method for power files; its functions are not described again here, and reference may be made to the detailed description of the method embodiment.
As can be seen from the above description, the automatic classification device for power files according to the embodiment of the present invention generates a corpus set, a title set, and a vocabulary set according to the power files to be classified, wherein the vocabulary set consists of the keywords in each title of the power files to be classified; generates a K-dimensional input vector based on the vocabulary set, wherein K is the number of keywords in the vocabulary set; trains the input vector using the corpus set and the title set based on a word embedding manner, so that the K-dimensional input vector is compressed into a C-dimensional word vector; counts the word frequencies of the keywords in the vocabulary set in each power file to be classified and generates a vector corresponding to each power file to be classified; constructs a matrix from the C-dimensional word vectors and calculates the product of the matrix and the vector corresponding to each power file to be classified; and performs cluster analysis on the product results, with the Minkowski distance used as the vector distance in the cluster analysis, to divide the power files to be classified into a preset number of categories. The power files can thus be classified automatically, which avoids the low efficiency of manual classification, reduces labor cost, and improves classification efficiency and accuracy.
The embodiment of the present invention further provides a specific implementation manner of an electronic device, which is capable of implementing all steps in the automatic classification method of the power file in the foregoing embodiment, and referring to fig. 7, the electronic device specifically includes the following contents:
a processor 601, a memory 602, a communication interface 603, and a bus 604;
the processor 601, the memory 602 and the communication interface 603 communicate with one another through the bus 604; the processor 601 is configured to call a computer program in the memory 602, and when executing the computer program the processor implements all the steps of the automatic classification method for power files in the above embodiments, for example the following steps: generating a corpus set, a title set and a vocabulary set according to the power files to be classified, wherein the vocabulary set consists of the keywords in each title of the power files to be classified; generating a K-dimensional input vector based on the vocabulary set, wherein K is the number of keywords in the vocabulary set; training the input vector using the corpus set and the title set based on a word embedding manner, so that the K-dimensional input vector is compressed into a C-dimensional word vector; counting the word frequencies of the keywords in the vocabulary set in each power file to be classified and generating a vector corresponding to each power file to be classified; constructing a matrix from the C-dimensional word vectors and calculating the product of the matrix and the vector corresponding to each power file to be classified; and performing cluster analysis on each product result and dividing the power files to be classified into a preset number of categories by using the Minkowski distance as the vector distance in the cluster analysis.
An embodiment of the present invention further provides a computer-readable storage medium capable of implementing all the steps of the automatic classification method for power files in the above embodiments. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of the automatic classification method for power files in the above embodiments, for example the following steps: generating a corpus set, a title set and a vocabulary set according to the power files to be classified, wherein the vocabulary set consists of the keywords in each title of the power files to be classified; generating a K-dimensional input vector based on the vocabulary set, wherein K is the number of keywords in the vocabulary set; training the input vector using the corpus set and the title set based on a word embedding manner, so that the K-dimensional input vector is compressed into a C-dimensional word vector; counting the word frequencies of the keywords in the vocabulary set in each power file to be classified and generating a vector corresponding to each power file to be classified; constructing a matrix from the C-dimensional word vectors and calculating the product of the matrix and the vector corresponding to each power file to be classified; and performing cluster analysis on each product result and dividing the power files to be classified into a preset number of categories by using the Minkowski distance as the vector distance in the cluster analysis.
Although the present invention provides method steps as described in the examples or flowcharts, more or fewer steps may be included based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a vehicle human interaction device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "upper", "lower", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly and encompass, for example, both fixed and removable coupling as well as integral coupling; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. 
The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention is not limited to any single aspect, nor is it limited to any single embodiment, nor is it limited to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present invention may be utilized alone or in combination with one or more other aspects and/or embodiments thereof.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (10)

1. An automatic classification method for power files is characterized by comprising the following steps:
generating a corpus set, a title set and a vocabulary set according to the power file to be classified; the vocabulary set consists of keywords in each title of the power file to be classified;
generating a K-dimensional input vector based on the vocabulary set; K is the number of the keywords in the vocabulary set;
training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector;
counting word frequencies of all keywords in the vocabulary set in all to-be-classified power files respectively and generating vectors corresponding to all to-be-classified power files respectively;
constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and the vectors corresponding to the electric power files to be classified respectively;
performing cluster analysis on each product result and dividing the power file to be classified into a preset number of categories by using the Minkowski distance as a vector distance in the cluster analysis;
further comprising:
respectively calculating the mean value of a plurality of product results in each category, and respectively determining each power file corresponding to each product result with the minimum difference value with the mean value in each category;
adopting the hash value of the title of each power file as a label of each category;
the generating of the corpus set, the title set and the vocabulary set according to the power file to be classified comprises the following steps:
extracting the title, the abstract and the first paragraph of the body text of the power file to be classified;
performing sentence splitting on the extracted abstract and first paragraph to obtain a corpus set;
obtaining a title set based on the extracted titles and performing word segmentation processing on the extracted titles to obtain keywords in each title;
the vocabulary set is composed of keywords in each title of the power file to be classified.
2. The automatic classification method for the power file according to claim 1, characterized in that the word embedding manner comprises: at least one of Word2Vec and GloVe.
3. The method according to claim 1, wherein the counting word frequency of each keyword in the vocabulary set in each power file to be classified respectively comprises:
and calculating the word frequency of each keyword in the vocabulary set in each power file to be classified by adopting a TF-IDF mode.
4. The method of claim 1, wherein the cluster analysis employs at least one of K-Means and gaussian mixture models.
5. An automatic classification device for power files is characterized by comprising:
the collection unit is used for generating a corpus set, a title set and a vocabulary set according to the power file to be classified; the vocabulary set consists of keywords in each title of the power file to be classified;
a vector unit for generating a K-dimensional input vector based on the vocabulary set; K is the number of keywords in the vocabulary set;
the training unit is used for training the input vector by using the corpus set and the title set based on a word embedding mode so as to compress the K-dimensional input vector into a C-dimensional word vector;
the word frequency unit is used for counting the word frequency of each keyword in the vocabulary set in each electric power file to be classified and generating a vector corresponding to each electric power file to be classified;
the matrix unit is used for constructing a matrix according to the C-dimensional word vectors and calculating products of the matrix and the vectors corresponding to the electric power files to be classified respectively;
the classification unit is used for carrying out clustering analysis on each product result and dividing the power file to be classified into a preset number of categories by using the Minkowski distance as a vector distance in the clustering analysis;
further comprising:
the mean value unit is used for respectively calculating the mean values of a plurality of product results in each category and respectively determining each power file corresponding to each product result with the minimum difference value with the mean value in each category;
a title unit configured to adopt hash values of titles of the respective power files as labels of the respective categories;
the collection unit includes:
the extracting subunit is used for extracting the title, the abstract and the first paragraph of the body text of the power file to be classified;
the first generating subunit is used for performing sentence splitting on the extracted abstract and first paragraph to obtain a corpus set;
the second generation subunit is used for obtaining a title set based on the extracted titles and performing word segmentation processing on the extracted titles to obtain keywords in each title;
the vocabulary set is composed of keywords in each title of the power file to be classified.
6. The automatic classification device for power files according to claim 5, wherein the word frequency unit comprises:
and the word frequency subunit is used for calculating the word frequency of each keyword in the vocabulary set in each electric power file to be classified in a TF-IDF mode.
7. The automatic classification device for the power files according to claim 5, wherein the word embedding manner comprises: at least one of Word2Vec and GloVe.
8. The automatic classification device of claim 5, characterized in that the cluster analysis employs at least one of K-Means and Gaussian mixture models.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for automatically classifying power files according to any one of claims 1 to 4 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for automatically classifying power files according to any one of claims 1 to 4.
CN201910588345.6A 2019-07-02 2019-07-02 Automatic classification method and device for power files Active CN110389932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910588345.6A CN110389932B (en) 2019-07-02 2019-07-02 Automatic classification method and device for power files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910588345.6A CN110389932B (en) 2019-07-02 2019-07-02 Automatic classification method and device for power files

Publications (2)

Publication Number Publication Date
CN110389932A CN110389932A (en) 2019-10-29
CN110389932B true CN110389932B (en) 2023-01-13

Family

ID=68286118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910588345.6A Active CN110389932B (en) 2019-07-02 2019-07-02 Automatic classification method and device for power files

Country Status (1)

Country Link
CN (1) CN110389932B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955774B (en) * 2019-11-08 2022-10-11 武汉光谷信息技术股份有限公司 Word frequency distribution-based character classification method, device, equipment and medium
CN110990563A (en) * 2019-11-18 2020-04-10 北京信息科技大学 Artificial intelligence-based traditional culture material library construction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159998A (en) * 2015-09-08 2015-12-16 海南大学 Keyword calculation method based on document clustering
CN109189934A (en) * 2018-11-13 2019-01-11 平安科技(深圳)有限公司 Public sentiment recommended method, device, computer equipment and storage medium
CN109299266A (en) * 2018-10-16 2019-02-01 中国搜索信息科技股份有限公司 A kind of text classification and abstracting method for Chinese news emergency event
CN109753567A (en) * 2019-01-31 2019-05-14 安徽大学 A kind of file classification method of combination title and text attention mechanism

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430442B2 (en) * 2016-03-09 2019-10-01 Symantec Corporation Systems and methods for automated classification of application network activity
US11074285B2 (en) * 2017-05-10 2021-07-27 Yva.Ai, Inc. Recursive agglomerative clustering of time-structured communications
CN109635107A (en) * 2018-11-19 2019-04-16 北京亚鸿世纪科技发展有限公司 The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source
CN109783637A (en) * 2018-12-12 2019-05-21 国网浙江省电力有限公司杭州供电公司 Electric power overhaul text mining method based on deep neural network
CN109933670B (en) * 2019-03-19 2021-06-04 中南大学 Text classification method for calculating semantic distance based on combined matrix

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于正文和标题文本分类的主题建模;郑诚等;《计算机应用与软件》;20170915;全文 *

Also Published As

Publication number Publication date
CN110389932A (en) 2019-10-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant