CN110399606A - Unsupervised electric power document theme generation method and system - Google Patents

Unsupervised electric power document theme generation method and system

Info

Publication number
CN110399606A
Authority
CN
China
Prior art keywords
document
word
tfidf
vector
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811488091.2A
Other languages
Chinese (zh)
Other versions
CN110399606B (en)
Inventor
刘迪
陈静
崔迎宝
陈薇
邱镇
王腾蛟
刘园园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
National Network Information and Communication Industry Group Co Ltd
Original Assignee
Peking University
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
National Network Information and Communication Industry Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, State Grid Corp of China SGCC, State Grid Zhejiang Electric Power Co Ltd, National Network Information and Communication Industry Group Co Ltd filed Critical Peking University
Priority to CN201811488091.2A priority Critical patent/CN110399606B/en
Publication of CN110399606A publication Critical patent/CN110399606A/en
Application granted granted Critical
Publication of CN110399606B publication Critical patent/CN110399606B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The present invention provides an unsupervised electric power document theme generation method and system for quickly generating document themes in the power domain. The invention first uses correlation analysis to screen document data relevant to a specific domain, then uses clustering to group documents of the same category, and finally extracts themes from each group. Applying this pipeline in a theme extraction system makes extracting themes for a specific domain more practical.

Description

Unsupervised electric power document theme generation method and system
Technical field
The present invention relates to document theme extraction, and in particular to an unsupervised electric power document theme generation method and system, belonging to the fields of natural language processing and computer software.
Background technique
In recent years, with the rapid development of the internet, the data on news publishing platforms has grown exponentially. How to compress and distill this massive, disorderly data with high quality, so that users can efficiently find useful information in it, has become a research focus of natural language processing. Data compression and distillation mainly involve document theme techniques, which are divided into extractive and generative approaches. Extractive theme methods score the sentences of the original text and select the sentences that best represent its gist as the theme of the full text. Generative theme methods use machine learning and related techniques to have the computer recombine sentences that do not appear in the original text and generate the theme. Generative themes are limited by natural language understanding technology: the generated themes have low readability and poor stability, and training a theme generation model requires high-quality supervised Chinese data. Such data is limited by manpower, and standard abstract data for specific domains is scarce, so obtaining supervised training data is extremely difficult. Extractive theme generation avoids the problem of the machine having to understand the text and reorganize the language; it generates the theme from readable sentences of the original text, extracts the document's information, is highly readable, and can greatly reduce the user's information load.
Wuhan University proposed a multi-document automatic theme extraction method based on a hybrid machine learning model: documents are first vectorized directly with word2vec, and a pre-trained classifier then classifies the vectorized documents, the main purpose of the classification being to find the sentences in the original documents suitable as themes; the TextRank algorithm is then applied to those sentences for theme extraction. Inner Mongolia Normal University proposed a multi-document automatic theme method combining LDA and TextRank: the original documents are preprocessed first and a topic model is built to obtain the more important sentences in the documents; node weights from the topic model are then considered at run time to obtain the iterative formula, and TextRank is applied to the multiple documents under the same topic. Neither method can extract themes for data of a specific domain: in the first method, after the documents are classified, the sentences suitable as themes may include many sentences unrelated to the specific domain, while in the second method the topic modeling with LDA may likewise model data unrelated to the domain.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide an unsupervised electric power document theme generation method and system for quickly generating document themes in the power domain. The invention first uses correlation analysis to screen document data relevant to a specific domain, then uses clustering to group documents of the same category, and finally performs theme extraction on each group. Applying this pipeline in a theme extraction system makes extracting themes for a specific domain more practical.
To achieve the above object, the technical solution adopted by the invention is as follows:
An unsupervised electric power document theme generation method, the steps of which include:
collecting public-opinion raw data, arranging it into documents, and transforming the documents into tfidf vectors;
computing a matching value according to whether the words in a document appear in the power-domain vocabulary, combined with the generated tfidf vector;
taking documents whose matching value is greater than 0 as documents relevant to the power domain;
converting the documents relevant to the power domain into tfidf vectors and clustering them to obtain documents of different categories;
cutting the documents of each category into sentences, and adding each sentence and its corresponding word list to an undirected graph as a node;
vectorizing the word list of each sentence, computing the spatial distance between the vectors as the similarity between documents, and adding the similarities to the undirected graph as edges in one-to-one correspondence with the nodes, thereby completing the undirected graph;
sorting the nodes of the undirected graph by descending similarity and taking the sentences represented by the top K nodes as the theme of the document.
Further, the raw data is collected from the State Grid public-opinion monitoring system, and the data sources include text publishing platforms such as WeChat official accounts, Sina Weibo, Tieba, forums and news sites.
Further, the collected raw data includes the title and content of each document.
Further, the documents are represented as tfidf vectors using a trained tfidf vectorizer, and the training of the tfidf vectors includes the following steps:
taking out several raw documents at random and filtering out useless symbols and English letters in the data;
segmenting the documents into words with the pyltp segmentation tool, converting the text data into word lists, and removing stop words;
for the word lists after stop-word removal, training a tfidf vectorizer with the TfidfVectorizer() function in sklearn to generate the tfidf vectors, and generating the word corresponding to each element index in a tfidf vector (a sketch of this training step is given below).
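By way of illustration, a minimal sketch of this training step with scikit-learn might look as follows; the sample corpus, stop-word list and whitespace tokenizer are placeholders (the patent uses the pyltp segmenter), and get_feature_names_out is assumed to be available (older scikit-learn versions expose get_feature_names instead).

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

raw_docs = ["第一篇 原始 文档 内容 ...", "第二篇 原始 文档 内容 ..."]   # placeholder raw documents
stop_words = {"的", "了", "是"}                                       # illustrative stop-word list

def clean(text):
    # Filter out English letters and useless symbols.
    return re.sub(r"[A-Za-z]+|[^\w\s]+", " ", text)

def tokenize(text):
    # Whitespace splitting stands in for the pyltp word segmenter used in the patent.
    return [w for w in text.split() if w and w not in stop_words]

docs = [" ".join(tokenize(clean(d))) for d in raw_docs]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)             # tfidf vectors of the sampled documents
idx_2_word = vectorizer.get_feature_names_out()    # word corresponding to each element index
```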
Further, the document matching value is calculated as follows:
sorting the words in a document in descending order of their tfidf values to generate a new word list;
traversing the word list: if a word with a larger tfidf value in the word list appears in the power-domain vocabulary, its tfidf value is added to the sum; if a word with a smaller tfidf value in the word list appears in the power-domain vocabulary, one half of its tfidf value is added to the sum; the two parts are added together to obtain the summed tfidf value of the words in the document;
the summed value is then balanced to obtain the final matching value, according to the following formula:
scores = doc_score / n
where scores is the matching value, doc_score is the summed tfidf value, and n is the number of words in the document.
Further, the words with larger tfidf values refer to the front 15% of the word list, and the words with smaller tfidf values are the rear 85% of the word list (a sketch of this computation is given below).
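A sketch of this matching-value computation is given below; it assumes that "balancing" means dividing the summed tfidf weights by the number of distinct words occurring in the document, and the variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def matching_score(doc_row, idx_2_word, power_vocab, top_ratio=0.15):
    """Matching value of one document against the power-domain vocabulary.

    doc_row     -- dense 1-D tfidf row of the document
    idx_2_word  -- feature names aligned with the tfidf columns
    power_vocab -- set of power-domain words (the target vocabulary)
    """
    nz = np.flatnonzero(doc_row)                    # words that actually occur in the document
    if nz.size == 0:
        return 0.0
    order = nz[np.argsort(doc_row[nz])[::-1]]       # word indices, descending tfidf value
    n = order.size
    top_k = max(1, int(n * top_ratio))              # front 15% = the "larger tfidf" words
    doc_score = 0.0
    for rank, i in enumerate(order):
        if idx_2_word[i] in power_vocab:
            weight = 1.0 if rank < top_k else 0.5   # rear 85% counted at half their tfidf
            doc_score += weight * doc_row[i]
    return doc_score / n                            # assumed "balancing": divide by word count

# e.g. score = matching_score(tfidf[0].toarray().ravel(), idx_2_word, {"停电", "变电站"})
```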
Further, the method for converting the above documents relevant to the power domain into tfidf vectors comprises:
for the documents relevant to the power domain, filtering out irrelevant symbols and letters, segmenting the text into words, removing stop words, and generating word lists;
training a tfidf vectorizer with the generated word lists to generate tfidf vectors;
converting the documents relevant to the power domain into tfidf vectors with the generated vectorizer.
Further, the clustering method includes Kmeans and Dbscan.
Further, when clustering with the Kmeans method, the steps include:
reducing the dimensionality of the tfidf vectors converted from the documents relevant to the power domain with the TruncatedSVD() method in sklearn;
clustering the reduced vectors with KMeans() in sklearn;
writing the categories after clustering and the documents under each category to the file system.
Further, when clustering with the Dbscan method, the steps include:
clustering the tfidf vectors converted from the documents relevant to the power domain with the DBSCAN() method in sklearn;
writing the categories after clustering and the files under each category to the local file system (a sketch of both clustering options is given below).
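A compact sketch of both clustering options with scikit-learn follows; the reduced dimensionality, the number of clusters and the DBSCAN parameters are illustrative defaults rather than values fixed by the patent.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans, DBSCAN

def cluster_kmeans(tfidf_matrix, n_clusters=5, n_components=100):
    # Reduce the sparse tfidf vectors first to make clustering more efficient.
    reduced = TruncatedSVD(n_components=n_components).fit_transform(tfidf_matrix)
    return KMeans(n_clusters=n_clusters, random_state=0).fit_predict(reduced)

def cluster_dbscan(tfidf_matrix, eps=0.5, min_samples=5):
    # DBSCAN is applied to the tfidf vectors directly; noise points get the label -1.
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(tfidf_matrix)

# The documents can then be grouped by their labels and written out one directory per category.
```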
Further, the documents of each category are cut into short sentences using the punctuation marks in the documents as separators;
the short sentences are then segmented into words with the pyltp segmentation tool and stop words are removed, producing the word list of each sentence (see the sketch below).
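A sketch of this sentence-cutting and word-segmentation step is given below; the LTP model path, the stop-word set and the exact punctuation set are assumptions, and the Segmentor API shown (load/segment/release) reflects the classic pyltp interface.

```python
import re
from pyltp import Segmentor   # LTP word segmenter referenced in the patent

CWS_MODEL = "ltp_data/cws.model"     # assumed path to the local LTP segmentation model
STOP_WORDS = {"的", "了", "是"}       # illustrative stop-word list

def split_sentences(document):
    # Cut the document into short sentences at punctuation marks.
    return [s for s in re.split(r"[，。！？；,.!?;]", document) if s.strip()]

def sentence_word_lists(document):
    segmentor = Segmentor()
    segmentor.load(CWS_MODEL)
    word_lists = []
    for sentence in split_sentences(document):
        words = [w for w in segmentor.segment(sentence) if w not in STOP_WORDS]
        word_lists.append(words)
    segmentor.release()
    return word_lists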
Further, the method for vectorizing the documents of each category comprises:
training a tfidf vector from the word lists of the sentences;
representing each word in a word list with a word2vec vector, obtaining the weight of each word in the document from the tfidf vector of the document, and computing the vectorized result of the documents under each category according to the following formula:
v = ω1·V1 + ω2·V2 + ... + ωn·Vn
where v is the vectorized result, Vn is the word2vec representation of a word, ωn is the corresponding element of the tfidf vector, and n is a natural number greater than or equal to 1 (a sketch of this weighted sum is given below).
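A minimal sketch of this tfidf-weighted word2vec sum is shown below; how the word2vec vectors are obtained (for example from a pre-trained gensim model) and the vector dimension are assumptions not fixed by the patent.

```python
import numpy as np

def sentence_vector(words, word2vec, tfidf_weights, dim=100):
    """Tfidf-weighted sum of word2vec vectors: v = ω1·V1 + ω2·V2 + ... + ωn·Vn.

    words         -- word list of one sentence
    word2vec      -- mapping word -> numpy vector (e.g. the wv attribute of a gensim model)
    tfidf_weights -- mapping word -> tfidf weight of that word in the document
    dim           -- dimension of the word2vec vectors (an assumed value)
    """
    v = np.zeros(dim)
    for w in words:
        if w in word2vec and w in tfidf_weights:
            v += tfidf_weights[w] * np.asarray(word2vec[w])
    return v
```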
Further, K is the number of sentences used to generate the document theme after the similarity ranking.
An unsupervised fast electric power document theme generation system comprises a memory and a processor; the memory stores a computer program configured to be executed by the processor, and the program includes instructions for executing each step of the above method.
The principle of the present invention is that, on the basis of existing text data, a matching value between each original document and the power domain is calculated; when the matching value exceeds a certain threshold, the document is considered to be correlated with electric power, and by calculating the correlation between the original documents and the power domain the data highly relevant to the power domain is screened out. Kmeans or Dbscan clustering is then used to cluster these documents and find the groups of documents under different categories, and finally theme extraction is performed on the clustered documents with the TextRank algorithm.
Compared with the prior art, the present invention has the following advantages:
1. Using unlabelled electric power documents as the input of the system, document themes can be obtained rapidly with the clustering algorithm and the text theme algorithm, realizing a theme generation system, so that obtaining document themes for the power domain is more practical and feasible and has greater application and promotion value.
2. In order to vectorize documents better, the invention proposes a vectorization mode that combines the advantages of word2vec and tfidf, effectively exploiting the influence of the dominant words of a document on its vector, so that the generated document vectors are more expressive.
3. Through the modular design of the system, functions such as text similarity computation, clustering and theme extraction are combined, so that the system is more flexible and more robust.
4. The present invention provides an unsupervised theme generation system, which avoids the difficulty of obtaining supervision information, and the themes extracted in an unsupervised way have higher readability and stronger stability.
Detailed description of the invention
Fig. 1 is a flow chart of the unsupervised electric power document theme generation method of this embodiment.
Specific embodiment
To enable features described above and advantage of the invention to be clearer and more comprehensible, with reference to the accompanying drawing, to specific reality of the invention The mode of applying is further described.
This embodiment provides an unsupervised electric power document theme generation method to realize fast theme generation for electric power documents without supervision. As shown in Fig. 1, it includes the following steps:
1. Matching the data in the raw data that is correlated with the power domain, with the following specific steps:
1.1 Raw data is collected from the State Grid public-opinion monitoring system; the data sources include text publishing platforms such as WeChat official accounts, Sina Weibo, Tieba, forums and news sites.
1.2 The collected raw data includes the title and content of each document; the documents are arranged.
1.3 Several raw documents are taken out at random and a tfidf vectorizer is trained; the training process is as follows:
1.3.1 Several raw documents are first selected at random, and useless symbols and English letters in the data are filtered out.
1.3.2 The documents are segmented into words with the pyltp segmentation tool, the text data is converted into word lists, and stop words are removed.
1.3.3 From the word lists after stop-word removal, a tfidf vectorizer is trained with the TfidfVectorizer() function in sklearn to generate the tfidf vectors, and the word corresponding to each element index in a tfidf vector is generated.
1.4 The documents are read one by one and processed as in 1.3.1: irrelevant symbols are removed from the data and the text is segmented into words; each document is then vectorized with the trained tfidf vectorizer, and the vectorized representation of the document is denoted doc_csr.
1.5 A local file containing the vocabulary highly correlated with electric power is read in and kept in memory, denoted target_word_set.
1.6 The matching value between each document and the power-domain vocabulary, denoted doc_scores, is calculated with the following specific steps:
1.6.1 The non-zero element values in the doc_csr vector are found first and stored in a list, denoted scores.
1.6.2 scores is sorted in descending order and the corresponding indices into scores are generated, denoted sorted_ptr.
1.6.3 From scores and sorted_ptr, the list of tfidf values in descending order is generated, denoted sorted_scores.
1.6.4 From sorted_ptr and idx_2_word, the list of the document's words in descending order of their tfidf values is generated, denoted sorted_words.
1.6.5 The word list sorted_words is traversed and doc_scores is calculated; the calculation process is as follows:
1.6.5.1 A threshold top_k_word_num is set; it is usually set to 15% of the document's word count doc_words_num, i.e. doc_words_num*0.15.
1.6.5.2 For the first top_k_word_num words in sorted_words, if a word appears in target_word_set, its corresponding tfidf value in sorted_scores is added to the sum.
1.6.5.3 When the number of words in sorted_words is greater than top_k_word_num, for the words whose element index in sorted_scores is greater than top_k_word_num, one half of the element value is added to the sum.
1.6.5.4 The values of the above two steps are added together to obtain the summed value.
1.7 Each document and its matching value with the power domain are finally obtained;
the summed value is balanced to obtain the final matching value, according to the following formula:
scores = doc_score / n
where scores is the matching value, doc_score is the summed tfidf value, and n is the number of words in the document.
2. Clustering the documents that are correlated with the power domain, with the following specific steps:
2.1 A threshold relevant_score_threshold is set, with a range greater than 0 and less than 1; when the matching value score between a document and the power domain is greater than relevant_score_threshold, the document is considered to be correlated with the power domain.
2.2 The score of each document is read one by one and compared with relevant_score_threshold, finding all documents relevant to the power domain.
2.3 For the documents relevant to the power domain, irrelevant symbols and letters are filtered out, the text is segmented into words again, stop words are removed, and word lists are generated.
2.4 A tfidf vectorizer is trained with the generated word lists and the documents are vectorized, producing V2.
2.5 A clustering method is selected; the clustering methods include but are not limited to Kmeans and Dbscan.
2.6 If the Kmeans method is selected, the clustering steps are as follows:
2.6.1 To improve the efficiency of clustering, the dimensionality of V2 is reduced with the TruncatedSVD() method in sklearn.
2.6.2 The number of clusters K≥1 is set, including but not limited to 3, 5, 8 or 10; the reduced vectors are clustered with KMeans() in sklearn.
2.6.3 The categories after clustering and the documents under each category are written to the local disk for use during theme generation.
2.7 If the Dbscan method is selected, the clustering steps are as follows:
2.7.1 V2 is clustered with the DBSCAN() method in sklearn.
2.7.2 The categories after clustering and the files under each category are written to the file system.
3. Rapidly extracting themes from the documents of each category after clustering, with the following specific steps:
3.1 The documents under each category are read in turn, and each document is cut into short sentences using punctuation marks such as "，", "。" and "！" as separators.
3.2 An undirected graph with no nodes and no edges is constructed.
3.3 Each sentence after cutting is segmented into words with the pyltp segmentation tool and stop words are removed, producing the word list of each sentence [W1, W2, W3, ......, Wn-1, Wn].
3.4 Each sentence after cutting and its corresponding word list are added to the undirected graph as a node in turn, and the added nodes are numbered 0, 1, ..., n.
3.5 The pairwise similarities between the nodes of the undirected graph are calculated as its edges; the calculation process is as follows:
3.5.1 The local word2vec vectors are read.
3.5.2 A tfidf vector is trained from the word lists of the sentences.
3.5.3 The vector of each sentence is generated from the word2vec vectors and the tfidf vector of each sentence; the generation process is as follows:
3.5.3.1 Suppose the word segmentation result of a document is ['air', 'switch', 'install', 'home', 'power failure', 'electric shock']; the word2vec vector of 'air' is V1, and so on, and the word2vec vector of 'electric shock' is V6.
3.5.3.2 From the tfidf vector of the document [ω1, ω2, ω3, ......, ωn], the weight of each word in the document is obtained.
3.5.3.3 The document vector v is generated according to the following formula:
v = ω1·V1 + ω2·V2 + ... + ωn·Vn
3.5.4 The spatial distance between the generated document vectors is calculated as the pairwise similarity between documents.
3.5.5 The similarities between documents are added to the undirected graph as its edges; through the node numbers, the nodes and edges are put into one-to-one correspondence, and the complete undirected graph is finally built.
3.5.6 The nodes of the built undirected graph are sorted in descending order according to their edges (i.e. the similarities).
3.5.7 The first K nodes (i.e. sentences) in the similarity ranking are taken as the theme of the document; K can take any value greater than 1 and less than the total number of nodes in the graph, and the number of sentences in the generated theme is generally controlled by the value of K (a sketch of this graph-ranking step is given below).
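A sketch of this graph construction and ranking step follows; cosine similarity stands in for the "spatial distance" between vectors, and scoring each node by the sum of its edge weights is one reading of the descending similarity ranking, since the patent does not spell out how the per-node ordering is aggregated.

```python
import numpy as np

def top_k_sentences(sentences, vectors, k=2):
    """Build the undirected similarity graph and return the top-K sentences as the theme.

    sentences -- list of short sentences (the graph nodes)
    vectors   -- list of their document vectors (see the weighted word2vec sum above)
    k         -- number of sentences kept as the theme, 1 < k < number of nodes
    """
    V = np.vstack(vectors)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    sim = (V / norms) @ (V / norms).T           # cosine similarity as the edge weights
    np.fill_diagonal(sim, 0.0)                  # no self-loops in the undirected graph
    node_score = sim.sum(axis=1)                # aggregate each node's edge weights (assumption)
    top = np.argsort(node_score)[::-1][:k]      # descending order, keep the first K nodes
    return [sentences[i] for i in sorted(top)]  # emit in original sentence order
```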
The following is an experimental comparison between the method of the present invention and the prior art methods:
Electric power document: The power plant community went deep into its jurisdiction to carry out flood prevention work with all its strength, in order to conscientiously carry out the response work for rainstorms and extreme weather, ensure the safety of people's lives and property, and maintain normal order in the jurisdiction. Community workers carried out flood-control safety publicity and hidden-danger investigation in the local urban area and strengthened the safety awareness of community residents. On August 14, 2018, the power plant community carried out a hidden-danger investigation of key locations such as local old one-storey houses, low-lying road sections and advertising boards. The deputy secretary of the community led the staff to carry out a key investigation of the areas in the jurisdiction with hidden dangers; key areas such as the old one-storey house district, the paper-mill cottage district and low-lying road sections in particular were checked conscientiously, since once a rainstorm strikes it will endanger the life safety of the residents. In addition, the community also strengthened the duty work during the flood season: community personnel remain on standby, keep communication open 24 hours a day, and ensure that personnel are in place.
The result extracted by the prior art method: On August 14, 2018, the power plant community carried out a hidden-danger investigation of key locations such as local old one-storey houses, low-lying road sections and advertising boards. The deputy secretary of the community led the staff to carry out a key investigation of the areas in the jurisdiction with hidden dangers; key areas such as the old one-storey house district, the paper-mill cottage district and low-lying road sections in particular were checked conscientiously, since once a rainstorm strikes it will endanger the life safety of the residents.
The result extracted by the method of the present invention: The power plant community went deep into its jurisdiction to carry out flood prevention work with all its strength, in order to conscientiously carry out the response work for rainstorms and extreme weather, ensure the safety of people's lives and property, and maintain normal order in the jurisdiction. Community workers carried out flood-control safety publicity and hidden-danger investigation in the local urban area and strengthened the safety awareness of community residents.
As can be seen from the above extraction results, the result extracted by the prior art only relates to part of the content of the original text and cannot serve as the theme of the original document, while the result extracted by the method of the present invention better summarizes the content of the original text and is well suited to serve as its theme.
Finally, it should be noted that the purpose of publishing the embodiments is to help further understand the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the present invention should not be limited to the content disclosed in the embodiments, and the scope of protection of the present invention is subject to the scope defined by the claims.

Claims (10)

1. An unsupervised electric power document theme generation method, the steps of which comprise:
collecting public-opinion raw data, arranging it into documents, and transforming the documents into tfidf vectors;
computing a matching value according to whether the words in a document appear in the power-domain vocabulary, combined with the generated tfidf vector;
taking documents whose matching value is greater than 0 as documents relevant to the power domain;
converting the documents relevant to the power domain into tfidf vectors and clustering them to obtain documents of different categories;
cutting the documents of each category into sentences, and adding each sentence and its corresponding word list to an undirected graph as a node;
vectorizing the word list of each sentence, computing the spatial distance between the vectors as the similarity between documents, and adding the similarities to the undirected graph as edges in one-to-one correspondence with the nodes, thereby completing the undirected graph;
sorting the nodes of the undirected graph by descending similarity and taking the sentences represented by the top K nodes as the theme of the document, where K is the number of sentences used to generate the document theme after the similarity ranking.
2. The method according to claim 1, wherein the raw data is collected from the State Grid public-opinion monitoring system, and the data sources include WeChat official accounts, Sina Weibo, Tieba, forums and news sites.
3. The method according to claim 1, wherein the collected raw data includes the title and content of each document.
4. The method according to claim 1, wherein the documents are transformed into tfidf vectors with a trained tfidf vectorizer, and the training of the tfidf vectors comprises:
taking out several raw documents at random and filtering out useless symbols and English letters in the data;
segmenting the documents into words with the pyltp segmentation tool, converting the text data into word lists, and removing stop words;
for the word lists after stop-word removal, training a tfidf vectorizer with the TfidfVectorizer() function in sklearn to generate the tfidf vectors, and generating the word corresponding to each element index in a tfidf vector.
5. The method according to claim 1, wherein the calculation of the document matching value comprises:
sorting the words in a document in descending order of their tfidf values to generate a new word list;
traversing the word list: if a word in the front 15% of the word list appears in the power-domain vocabulary, its tfidf value is added to the sum; if a word in the rear 85% of the word list appears in the power-domain vocabulary, one half of its tfidf value is added to the sum; the two parts are added together to obtain the summed tfidf value of the words in the document;
the summed value is then balanced to obtain the final matching value, according to the following formula:
scores = doc_score / n
where scores is the matching value, doc_score is the summed tfidf value, and n is the number of words in the document.
6. The method according to claim 1, wherein the method for converting the documents relevant to the power domain into tfidf vectors comprises:
for the documents relevant to the power domain, filtering out irrelevant symbols and letters, segmenting the text into words, removing stop words, and generating word lists;
training a tfidf vectorizer with the generated word lists to generate tfidf vectors;
converting the documents relevant to the power domain into tfidf vectors with the generated vectorizer.
7. The method according to claim 1, wherein the clustering method includes Kmeans and Dbscan;
the steps of clustering with the Kmeans method include:
reducing the dimensionality of the tfidf vectors converted from the documents relevant to the power domain with the TruncatedSVD() method in sklearn;
clustering the reduced vectors with KMeans() in sklearn;
writing the categories after clustering and the documents under each category to the file system;
the steps of clustering with the Dbscan method include:
clustering the tfidf vectors converted from the documents relevant to the power domain with the DBSCAN() method in sklearn;
writing the categories after clustering and the files under each category to the local file system.
8. The method according to claim 1, wherein the documents of each category are cut into short sentences using the punctuation marks in the documents as separators; the short sentences are then segmented into words with the pyltp segmentation tool and stop words are removed, producing the word list of each sentence.
9. The method according to claim 1, wherein the method for vectorizing the documents of each category comprises:
training a tfidf vector from the word lists of the sentences;
representing each word in a word list with a word2vec vector, obtaining the weight of each word in the document from the tfidf vector of the document, and computing the vectorized result of the documents under each category according to the following formula:
v = ω1·V1 + ω2·V2 + ... + ωn·Vn
where v is the vectorized result, Vn is the word2vec representation of a word, ωn is the corresponding element of the tfidf vector, and n is a natural number greater than or equal to 1.
10. An unsupervised fast electric power document theme generation system, comprising a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, and the program includes instructions for executing each step of the method according to any one of claims 1 to 9.
CN201811488091.2A 2018-12-06 2018-12-06 Unsupervised electric power document theme generation method and system Active CN110399606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811488091.2A CN110399606B (en) 2018-12-06 2018-12-06 Unsupervised electric power document theme generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811488091.2A CN110399606B (en) 2018-12-06 2018-12-06 Unsupervised electric power document theme generation method and system

Publications (2)

Publication Number Publication Date
CN110399606A true CN110399606A (en) 2019-11-01
CN110399606B CN110399606B (en) 2023-04-07

Family

ID=68322559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811488091.2A Active CN110399606B (en) 2018-12-06 2018-12-06 Unsupervised electric power document theme generation method and system

Country Status (1)

Country Link
CN (1) CN110399606B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990676A (en) * 2019-11-28 2020-04-10 福建亿榕信息技术有限公司 Social media hotspot topic extraction method and system
CN111079442A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Vectorization representation method and device of document and computer equipment
CN111241288A (en) * 2020-01-17 2020-06-05 烟台海颐软件股份有限公司 Emergency sensing system of large centralized power customer service center and construction method
CN112270191A (en) * 2020-11-18 2021-01-26 国网北京市电力公司 Method and device for extracting work order text theme
CN112380342A (en) * 2020-11-10 2021-02-19 福建亿榕信息技术有限公司 Electric power document theme extraction method and device
CN113591475A (en) * 2021-08-03 2021-11-02 美的集团(上海)有限公司 Unsupervised interpretable word segmentation method and device and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009093651A (en) * 2007-10-05 2009-04-30 Fujitsu Ltd Modeling topics using statistical distribution
US20110231411A1 (en) * 2008-08-08 2011-09-22 Holland Bloorview Kids Rehabilitation Hospital Topic Word Generation Method and System
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN105243152A (en) * 2015-10-26 2016-01-13 同济大学 Graph model-based automatic abstracting method
CN106294314A (en) * 2016-07-19 2017-01-04 北京奇艺世纪科技有限公司 Topics Crawling method and device
CN106407182A (en) * 2016-09-19 2017-02-15 国网福建省电力有限公司 A method for automatic abstracting for electronic official documents of enterprises
CN106844328A (en) * 2016-08-23 2017-06-13 华南师范大学 A kind of new extensive document subject matter semantic analysis and system
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN108090049A (en) * 2018-01-17 2018-05-29 山东工商学院 Multi-document summary extraction method and system based on sentence vector

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009093651A (en) * 2007-10-05 2009-04-30 Fujitsu Ltd Modeling topics using statistical distribution
US20110231411A1 (en) * 2008-08-08 2011-09-22 Holland Bloorview Kids Rehabilitation Hospital Topic Word Generation Method and System
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN105243152A (en) * 2015-10-26 2016-01-13 同济大学 Graph model-based automatic abstracting method
CN106294314A (en) * 2016-07-19 2017-01-04 北京奇艺世纪科技有限公司 Topics Crawling method and device
CN106844328A (en) * 2016-08-23 2017-06-13 华南师范大学 A kind of new extensive document subject matter semantic analysis and system
CN106407182A (en) * 2016-09-19 2017-02-15 国网福建省电力有限公司 A method for automatic abstracting for electronic official documents of enterprises
CN106997382A (en) * 2017-03-22 2017-08-01 山东大学 Innovation intention label automatic marking method and system based on big data
CN108090049A (en) * 2018-01-17 2018-05-29 山东工商学院 Multi-document summary extraction method and system based on sentence vector

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张波飞 et al.: "Research on multi-document automatic summarization based on the combination of LDA and TextRank", 《软件导刊》 (Software Guide) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990676A (en) * 2019-11-28 2020-04-10 福建亿榕信息技术有限公司 Social media hotspot topic extraction method and system
CN111079442A (en) * 2019-12-20 2020-04-28 北京百度网讯科技有限公司 Vectorization representation method and device of document and computer equipment
CN111079442B (en) * 2019-12-20 2021-05-18 北京百度网讯科技有限公司 Vectorization representation method and device of document and computer equipment
US11403468B2 (en) 2019-12-20 2022-08-02 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for generating vector representation of text, and related computer device
CN111241288A (en) * 2020-01-17 2020-06-05 烟台海颐软件股份有限公司 Emergency sensing system of large centralized power customer service center and construction method
CN112380342A (en) * 2020-11-10 2021-02-19 福建亿榕信息技术有限公司 Electric power document theme extraction method and device
CN112270191A (en) * 2020-11-18 2021-01-26 国网北京市电力公司 Method and device for extracting work order text theme
CN113591475A (en) * 2021-08-03 2021-11-02 美的集团(上海)有限公司 Unsupervised interpretable word segmentation method and device and electronic equipment

Also Published As

Publication number Publication date
CN110399606B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN110399606A (en) A kind of unsupervised electric power document subject matter generation method and system
CN107193803B (en) Semantic-based specific task text keyword extraction method
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN108268668B (en) Topic diversity-based text data viewpoint abstract mining method
CN111950273A (en) Network public opinion emergency automatic identification method based on emotion information extraction analysis
Abu-Errub Arabic text classification algorithm using TFIDF and chi square measurements
CN105224520B (en) A kind of Chinese patent document term automatic identifying method
CN104298732B (en) The personalized text sequence of network-oriented user a kind of and recommendation method
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN109815400A (en) Personage's interest extracting method based on long text
CN103092966A (en) Vocabulary mining method and device
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN107526792A (en) A kind of Chinese question sentence keyword rapid extracting method
CN108038204A (en) For the viewpoint searching system and method for social media
JP3735336B2 (en) Document summarization method and system
Bölücü et al. Hate Speech and Offensive Content Identification with Graph Convolutional Networks.
CN110929022A (en) Text abstract generation method and system
CN111339778B (en) Text processing method, device, storage medium and processor
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN107590163B (en) The methods, devices and systems of text feature selection
CN111930885B (en) Text topic extraction method and device and computer equipment
CN111538893B (en) Method for extracting network security new words from unstructured data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant