CN110399606A

CN110399606A - A kind of unsupervised electric power document subject matter generation method and system

Info

Publication number: CN110399606A
Application number: CN201811488091.2A
Authority: CN
Inventors: 刘迪; 陈静; 崔迎宝; 陈薇; 邱镇; 王腾蛟; 刘园园
Original assignee: Peking University; State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd; National Network Information and Communication Industry Group Co Ltd
Current assignee: Peking University; State Grid Corp of China SGCC; State Grid Zhejiang Electric Power Co Ltd; National Network Information and Communication Industry Group Co Ltd
Priority date: 2018-12-06
Filing date: 2018-12-06
Publication date: 2019-11-01
Anticipated expiration: 2038-12-06
Also published as: CN110399606B

Abstract

The present invention provides a kind of unsupervised electric power document subject matter generation method and system, for quickly generating the document subject matter of power domain.First with correlation analysis in the present invention, document data relevant to specific area is screened, recycles clustering method to find generic document, subject distillation then is carried out to it, and apply this in subject extraction system, so that the theme for extracting specific area is more feasible.

Description

A kind of unsupervised electric power document subject matter generation method and system

Technical field

The present invention relates to document subject matter extractions, and in particular to a kind of unsupervised electric power document subject matter generation method and system, Belong to natural language processing and computer software field.

Background technique

In recent years, with the high speed development of internet, the exponential growth of data on each news release platform, how The compression that data mixed and disorderly to magnanimity carry out high quality is extracted, make user efficiently searched from these data useful information at For the research emphasis of current natural language processing field.The compression extraction of data relates generally to document subject matter technology, document subject matter Extraction is divided into extraction-type and production.Extraction-type subject methods are to carry out assessment marking to the sentence in original text, and selecting most can generation Several sentences of table original text purport are as full text theme.Production subject methods are to make computer using technologies such as machine learning The sentence of non-original text is reconfigured, original text theme is generated.Since production theme is limited by natural language understanding technology, The theme readability of generation is not high, and stability is lower, while when training generates the model of theme, need high quality has supervision Chinese data is limited by manpower, and the standard digest data of existing specific area are few, so that obtaining has the training data of supervision to become It obtains extremely difficult.Extraction-type theme generating mode avoids machine and understands text and need to reorganize the problem of language, utilizes The readable sentence generation theme of original text, extracts document information, readable high, can largely reduce the letter of user Breath load.

Wuhan University proposes a kind of more document automatic theme extracting methods based on hybrid machine learning model, first directly It connects and vectorization is carried out to document using word2vec, then using preparatory trained classifier, the document of vectorization is carried out Classification, the main purpose of classification is the sentence that theme is suitable as in the data for find original document, then to being suitable as theme Sentence utilize TextRank algorithm carry out subject distillation.Inner Mongol Normal University proposes a kind of based on LDA and TextRank In conjunction with more document automatic theme methods, be to be pre-processed to original document first, establish topic model, obtain comparing in document More important sentence, then the topic model when considering operation node weights, obtains iterative formula, and TextRank is recycled to calculate Method carries out subject distillation to the multiple documents under same subject.Both the above method can not all lead the data of specific area Topic is extracted, and for first method after classifying to document, the sentence for being suitble to do theme may be comprising largely and specific area Unrelated sentence, at the same second method using LDA carry out theme modeling when also can exist the data unrelated for field into The case where row modeling.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the existing technology and insufficient, and it is raw to provide a kind of unsupervised electric power document subject matter At method and system, for quickly generating the document subject matter of power domain.First with correlation analysis, sieve in the present invention Document data relevant to specific area is selected, recycles clustering method to find generic document, theme then is carried out to it and is mentioned It takes, and this is applied in subject extraction system, so that the theme for extracting specific area is more feasible.

To achieve the above object, The technical solution adopted by the invention is as follows:

A kind of unsupervised electric power document subject matter generation method, step include:

Acquisition public sentiment initial data is simultaneously organized into document, transforms a document to tfidf vector；

Whether occurred in power domain vocabulary according to the word in the document, and combine the tfidf vector generated, calculated To matching value；

Document using matching value greater than 0 is as document relevant to power domain；

Tfidf vector is converted by document relevant to power domain, is clustered, obtains different classes of document；

Cutting is carried out to obtained different classes of document, using the sentence of cutting and the corresponding word list of sentence as node It is added to non-directed graph；

Vectorization is carried out to the corresponding word list of sentence after cutting, calculates the space length between vector, and as text Similarity between shelves, is added in non-directed graph using the similarity as side, corresponds with node, and non-directed graph is completed in building；

The node of the non-directed graph is ranked up by similarity is descending, the sentence of K node on behalf in the top Theme as document.

Further, system acquisition initial data is monitored from state's net public sentiment, data acquisition source includes wechat public platform, new The texts distribution platform such as unrestrained microblogging, discussion bar, forum, news.

Further, the initial data of acquisition includes the title and content of document.

Further, document is subjected to tfidf vectorization expression, the instruction of tfidf vector using trained tfidf vector Practicing step includes:

It takes out several initial data at random from document, filters out symbol and English alphabet useless in data；

Word cutting is carried out to document using participle tool pyltp, word list is converted by text data, removes stop words；

For the word list after removal stop words, generated using TfidfVectorizer () the function training in sklearn Tfidf vector, and generate word corresponding to element index in tfidf vector.

Further, the document matches value calculating method are as follows:

Word in document is ranked up by the value for corresponding to tfidf is descending, generates new word list；

The word list is traversed, it is right if the biggish word of tfidf value of the word list has appeared in power domain vocabulary Tfidf value sums up；If the smaller word of tfidf value of the word list has appeared in power domain vocabulary, to tfidf value 1/2 sum up；Two-part value adduction, obtains the addition and value of the tfidf of word in document；

To obtaining addition and value and be balanced that final matching value is calculated, calculation formula is as follows:

Wherein, scores is matching value, and doc_score is the addition and value of tfidf, and n is the number of word in document.

Further, the biggish word of tfidf value refers to preceding the 15% of word list in the described word list, in the word list The lesser word of tfidf value is be worth word list rear 85%.

Further, it is by the method and step that above-mentioned document relevant to power domain is converted into tfidf vector:

For document relevant to power domain, filter out unrelated symbol and letter, then carry out word cutting, removal deactivate Word generates word list；

Using the word list of generation, training generates tfidf vector；

Using the tfidf vector of the generation, tfidf vector is converted by the relevant document of power domain.

Further, clustering method includes Kmeans, Dbscan.

Further, as clustered using Kmeans method, step includes:

The tfidf vector that the relevant document of power domain is converted using TruncatedSVD () method in sklearn Carry out dimensionality reduction；

To the vector after dimensionality reduction, KMeans () clustering in sklearn is utilized；

File system is written into document under classification and each classification after cluster；

Further, as clustered using Dbscan method, step includes:

Gathered using the tfidf vector that DBSCAN () method in sklearn converts the relevant document of power domain Class；

Local file system is written into file under multiple classifications and each classification after cluster.

Further, obtained different classes of document is cut into short using the punctuation mark in document as separator Sentence；

Short sentence after cutting is subjected to word cutting using participle tool pyltp, removes stop words, generates the word column of each sentence Table.

Further, to the method and step of obtained different classes of document vectorization are as follows:

Tfidf vector is generated according to the training of the word list of sentence；

Word2vec vectorization expression is carried out to word each in word list, each word is obtained according to the tfidf vector of document Weight in a document, be calculated according to the following formula it is different classes of under document vectorization result:

Wherein, v is vectorization as a result, V_nIt is indicated for the word2vec vectorization of word, ω_nFor the element in tfidf vector, n For the natural number more than or equal to 1.

Further, K is to the sentence quantity for generating document subject matter after Documents Similarity sequence.

A kind of unsupervised quick electric power document subject matter generation system, including memory and processor, memory storage Computer program, the program are configured as being executed by the processor, which includes for executing each step in the above method Instruction.

The principle of the present invention is, on the basis of existing text data, calculates original document matching value relevant to electric power, When the big Mr. Yu's threshold value of matching value, it is believed that there is correlation with electric power, by calculating the correlation of original document and power domain, Filter out the data highly relevant with power domain；Then using Kmeans or Dbscan clustering to these clustering documents, Find the multiple documents under different classifications；Subject distillation is carried out using TextRank algorithm to document after cluster respectively again.

Compared with prior art, the present invention having the advantages that:

1. it realizes using unlabelled electric power document as the input of system, it can using clustering algorithm and text subject algorithm Rapidly to obtain document subject matter, realize that theme generates system, so that the document subject matter for obtaining power domain is more practical, feasibility It is higher, using bigger with promotional value.

2. the invention proposes combine word2vec and tfidf advantage in view of preferably carrying out vectorization to document Vectorization mode is carried out to document, being effectively utilized influences vector bring in the prior word of document, so that generate Vectors for documents has more expressive force.

3. by carrying out modular design to system, by Text similarity computing, cluster, the combination of the functions such as subject distillation Together, so that the flexible row of system is higher, robustness is stronger.

4. the present invention provides a kind of unsupervised themes to generate system, avoids supervision message and obtains difficult problem, And theme is extracted in the way of unsupervised and has higher readable and stronger stability.

Detailed description of the invention

Fig. 1 is a kind of flow chart of unsupervised electric power document subject matter generation method of the present embodiment.

Specific embodiment

To enable features described above and advantage of the invention to be clearer and more comprehensible, with reference to the accompanying drawing, to specific reality of the invention The mode of applying is further described.

The present embodiment provides a kind of unsupervised electric power document subject matter generation method, with realize electric power document under unsupervised Theme is quickly generated, as shown in Figure 1, including the following steps:

1. having the data of correlation in matching initial data with power domain, the specific steps are as follows:

1.1 monitor system acquisition initial data from state's net public sentiment, and data acquisition source includes that wechat public platform, Sina are micro- The texts distribution platforms such as rich, discussion bar, forum and news；

1.2 collected initial data include the title and content of document, arrange document；

1.3 take out several initial data at random from document, and training generates tfidf vector, and the training process is as follows:

1.3.1 several initial data are randomly selected first, filter out some useless symbols and English alphabet in data；

1.3.2 word cutting is carried out to document using participle tool pyltp, converts word list for text data, removal deactivates Word；

1.3.3 the good word list of stop words final finishing is removed, TfidfVectorizer () function training in sklearn is utilized Tfidf vector is generated, and generates word corresponding to element index in tfidf vector；

1.4 read document one by one, similar with 1.3.1 treatment process, by the removal of unrelated symbol and then word cutting in data, then Vectorization expression is carried out to the document using trained tfidf vector, the result that document vectorization indicates is denoted as doc_csr.

1.5 read in local file with the higher electric power vocabulary of the electric power degree of correlation, save in memory, are denoted as target_ word_set。

1.6 calculate the matching value of document and power domain vocabulary, are denoted as doc_scores；Specific step is as follows:

1.6.1 the element value of non-zero in doc_csr vector is found first, and storage in lists, is denoted as scores；

1.6.2 to the descending sequence of scores, the index in corresponding scores is generated, sorted_ptr is denoted as；

1.6.3 according to scores and sorted_ptr, the list of the descending sequence of ifidf value is generated, sorted_ is denoted as scores；

1.6.4 according to sorted_ptr and idx_2_word, the corresponding descending sequence of tfidf value of document vocabulary is generated List, be denoted as sorted_words；

1.6.5 word list sorted_words is traversed, is calculated, finally obtains doc_scores；The calculating process is such as Under:

1.6.5.1 given threshold top_k_word_num, threshold value top_k_word_num are usually set to document word frequency The 15% of doc_words_num, i.e. doc_words_num*0.15；

1.6.5.2 to top_k_word_num word preceding in sorted_words, if there is in target_word_set In, then to the word, corresponding tfidf value is summed up in sorted_scores；

1.6.5.3 when the quantity of word in sorted_words is greater than top_k_word_num, then in sorted_score 1/2 adduction of element value of the element index greater than top_k_word_num；

1.6.5.4, the value of above-mentioned two step is added together, obtains addition and value；

1.7 finally obtain document and its matching value with power domain；

2. by being clustered with document that power domain has correlation, the specific steps are as follows:

B2.1 given threshold relevant_score_threshold, threshold range be greater than 0 and less than 1, when document with When the matching value score of power domain is greater than relevant_score_threshold, indicate that the document has with power domain Correlation；

2.2 read the score of document one by one, compare score and relevant_score_threshold size, find complete Portion's document relevant to power domain；

2.3 pairs of documents relevant to power domain filter out unrelated symbol and letter and carry out word cutting again, and removal deactivates Word generates word list；

2.4 using the word list generated, and training generates tfidf vector, and indicates V to document progress vectorization₂；

The different clustering method of 2.5 selections, clustering method includes but is not limited to Kmeans, Dbscan；

2.6 as selected Kmeans method to be clustered, and sorting procedure is as follows:

It 2.6.1 is the efficiency for improving cluster, to V₂Dimensionality reduction is carried out using the TruncatedSVD () method in sklearn；

2.6.3 classification number K >=1 of cluster, including but not limited to 3,5,8,10 are set；To the vector after dimensionality reduction, utilize KMeans () clustering in sklearn；

2.6.4 local disk is written into the document under the classification and each classification after cluster, in case theme uses when generating；

2.7 as selected Dbscan method to be clustered, and sorting procedure is as follows:

2.7.1 using DBSCAN () method in sklearn to V₂It is clustered；

2.7.2 file system is written into the file under the multiple classifications and each classification after cluster；

3. the different classes of document rapidly extracting theme after pair cluster, the specific steps are as follows:

3.1 be successively read it is different classes of under document, to document with ",！... " it is separator, it is short by document cutting Sentence；

The 3.2 no nodes of building do not have the non-directed graph on side；

Each sentence after cutting is carried out word cutting using participle tool pyltp by 3.3, then removes stop words, is generated every Word list [the W of a sentence₁,W₂,W₃,......,W_n-1,W_n]；

3.4 are successively added non-directed graph using the sentence of cutting and the corresponding word list of sentence as the node of non-directed graph, and right The node of addition is numbered, number 0,1 ..., n；

3.5 calculate similarity of the non-directed graphs two-by-two between node, the side as non-directed graph；Calculating process is as follows:

3.5.1 reading local word2vec vector；

3.5.2 tfidf vector is generated using the word list training of sentence.

3.5.3 the tfidf vector for utilizing each sentence of word2vec vector sum, generates the vector of each document, generated Journey is as follows:

3.5.3.1 document word segmentation result is [air ', ' switch ', ' dress ', ' family ', ' have a power failure ', ' electric shock '], ' air ' It is V in word2vec vector₁, and so on, the vector of ' electric shock ' in word2vec is V₆；

3.5.3.2 according to the tfidf vector [ω of document₁,ω₂,ω₃,......,ω_n], obtain each word in a document Weight；

3.5.3.3 generating document vector isFormula is as follows:

3.5.4 the space length between vector, the similarity as document two-by-two are calculated using the document vector generated；

3.5.5 it is added to non-directed graph using the similarity between document as the side of non-directed graph, is completed by the number of node Complete non-directed graph is completed in the one-to-one correspondence of node and side, final building；

3.5.6 descending sequence is carried out according to side (i.e. similarity) to the node for the non-directed graph being built into；

3.5.7 theme of the K node (i.e. sentence) as document before similarity ranking, K can take greater than 1 and be less than wireless The arbitrary value of node of graph sum, the general sentence quantity controlled by K value when generating theme.

The following are the Experimental comparisons of the method for the present invention and art methods:

Electric power document: community, power plant gos deep into area under one's jurisdiction and carries out work of flood prevention with all strength, to carry out answering for heavy rain extreme weather conscientiously To work, it is ensured that life and property safety of people maintains area under one's jurisdiction normal order.Community workers are to local city Safety of Flood Control publicity and security risk investigation are carried out, the awareness of safety of community resident is strengthened.On August 14th, 2018, society, power plant Area has carried out security risk investigation to local old one-storey house, low-lying location, the keypoint parts such as advertising board.Community deputy secretary Leading staff for local, there are the regions of security risk to have carried out emphasis investigation, especially for the experienced one-storey house of season man The key areas such as area, paper mill cottage area and low-lying location are conscientiously checked, and heavy rain, which is once attacked, will jeopardize subjective reflection Life security.In addition, the work that community is on duty during also strengthening flood control.Personnel awaits orders in community, protects within 24 hours Hold the unimpeded of communication, it is ensured that personnel are in place.

The result that art methods are extracted: on August 14th, 2018, community, power plant is to local old one-storey house, low-lying Location, the keypoint parts such as advertising board have carried out security risk investigation.Community deputy secretary leads staff for area under one's jurisdiction memory Emphasis investigation is carried out in the region of security risk, especially for the experienced cottage area of season man, paper mill cottage area and low-lying land The key areas such as section are conscientiously checked, and heavy rain, which is once attacked, will jeopardize the life security of subjective reflection.

The result that the method for the present invention is extracted: community, power plant gos deep into area under one's jurisdiction and carries out work of flood prevention with all strength, to carry out heavy rain conscientiously The reply work of extreme weather, it is ensured that life and property safety of people maintains area under one's jurisdiction normal order.Community workers are to linchpin City carries out Safety of Flood Control publicity and security risk investigation in area, strengthens the awareness of safety of community resident.

In a part that the result extracted by the prior art it can be seen from the result of said extracted has related generally to original text Hold, can not be as the theme of original text shelves, and the result that the method for the present invention is extracted can more summarize the content of original text, be very suitable to make It is the theme.

It is finally noted that the purpose publicized and implemented is to help to further understand the present invention, but this field Technical staff is understood that without departing from the spirit and scope of the invention and the appended claims, various substitutions and modifications It is all possible.Therefore, the present invention should not be limited to embodiment disclosure of that, and the scope of protection of present invention is with right Subject to the range that claim defines.

Claims

1. a kind of unsupervised electric power document subject matter generation method, step include:

Whether occurred in power domain vocabulary according to the word in the document, and combine the tfidf vector generated, is calculated With value；

Cutting is carried out to obtained different classes of document, is added using the sentence of cutting and the corresponding word list of sentence as node To non-directed graph；

Vectorization is carried out to the corresponding word list of sentence after cutting, calculates the space length between vector, and as document it Between similarity, be added in non-directed graph using the similarity as side, with node correspond, building complete non-directed graph；

The node of the non-directed graph is ranked up by similarity is descending, the sentence conduct of K node on behalf in the top The theme of document, K are to the sentence quantity for generating document subject matter after sequencing of similarity.

2. the method as described in claim 1, which is characterized in that monitor system acquisition initial data from state's net public sentiment, data are adopted Collecting source includes wechat public platform, Sina weibo, discussion bar, forum, news.

3. the method as described in claim 1, which is characterized in that the initial data of acquisition include document title and content。

4. the method as described in claim 1, which is characterized in that transform a document to tfidf using trained tfidf vector The training step of vector, tfidf vector includes:

5. the method as described in claim 1, which is characterized in that the method and step that the document matches value calculates includes:

Traverse the word list, if preceding 15% word of the word list has appeared in power domain vocabulary, to tfidf value into Row adduction；If rear 85% word of the word list has appeared in power domain vocabulary, add to the 1/2 of tfidf value With；Two-part value adduction, obtains the addition and value of the tfidf of word in document；

6. the method as described in claim 1, which is characterized in that convert tfidf vector for document relevant to power domain Method and step include:

For document relevant to power domain, unrelated symbol and letter are filtered out, then carries out word cutting, removal stop words, it is raw At word list；

Using the word list of generation, training generates tfidf vector；

7. the method as described in claim 1, which is characterized in that clustering method includes Kmeans, Dbscan；

The step of being clustered using Kmeans method include:

The tfidf vector that the relevant document of power domain converts is carried out using TruncatedSVD () method in sklearn Dimensionality reduction；

The step of being clustered using Dbscan method include:

It is clustered using the tfidf vector that DBSCAN () method in sklearn converts the relevant document of power domain；

8. the method as described in claim 1, which is characterized in that obtained different classes of document, by the punctuate in document Symbol is cut into short sentence as separator；Short sentence after cutting is subjected to word cutting using participle tool pyltp, removes stop words, Generate the word list of each sentence.

9. the method as described in claim 1, which is characterized in that the method and step of obtained different classes of document vectorization Include:

Word2vec vectorization expression is carried out to word each in word list, each word is obtained in text according to the tfidf vector of document Shelves in weight, be calculated according to the following formula it is different classes of under document vectorization result；

Wherein, v is vectorization as a result, V_nIt is indicated for the word2vec vectorization of word, ω_nFor the element in tfidf vector, n is big In the natural number for being equal to 1.

10. a kind of unsupervised quick electric power document subject matter generates system, including memory and processor, memory storage meter Calculation machine program, the program are configured as being executed by the processor, which includes any for executing the claims 1 to 9 The instruction of each step in the method.