CN110399606A - Unsupervised electric power document theme generation method and system - Google Patents
Unsupervised electric power document theme generation method and system
- Publication number
- CN110399606A CN201811488091.2A CN201811488091A
- Authority
- CN
- China
- Prior art keywords
- document
- word
- tfidf
- vector
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The present invention provides an unsupervised electric power document theme generation method and system for quickly generating document themes in the power domain. The invention first uses correlation analysis to screen out document data relevant to the target domain, then applies clustering to group similar documents, and finally performs theme extraction on each cluster. Applying this pipeline in a theme extraction system makes extracting domain-specific themes more practical.
Description
Technical field
The present invention relates to document theme extraction, and in particular to an unsupervised electric power document theme generation method and system. It belongs to the fields of natural language processing and computer software.
Background technique
In recent years, with the rapid development of the internet, the volume of data on news publishing platforms has grown exponentially. How to compress and distill high-quality information from this massive, disorderly data, so that users can efficiently find useful information in it, has become a research focus in natural language processing. Data compression and distillation mainly rely on document theme techniques, which fall into two categories: extractive and generative. Extractive theme methods score the sentences of the original text and select the several sentences that best represent its gist as the theme of the full text. Generative theme methods use machine learning and related technologies to have the computer recompose sentences not present in the original text and generate a theme. Because generative methods are limited by current natural language understanding technology, the themes they produce have poor readability and low stability; moreover, training a theme generation model requires high-quality supervised Chinese data, which is constrained by manual labeling effort. Standard abstract data for specific domains is scarce, making supervised training data extremely difficult to obtain. Extractive theme generation avoids the need for the machine to understand the text and reorganize language: it builds the theme from readable sentences of the original text and extracts the document's information, so the result is highly readable and can greatly reduce the user's information load.
Wuhan University proposed a multi-document automatic theme extraction method based on a hybrid machine learning model. It first vectorizes documents directly with word2vec, then uses a pre-trained classifier to classify the vectorized documents; the main purpose of classification is to find, within the original documents, the sentences suitable as themes. Theme extraction is then performed on those sentences with the TextRank algorithm. Inner Mongolia Normal University proposed a multi-document automatic theme method combining LDA and TextRank: the original documents are first preprocessed and a topic model is built to obtain the more important sentences in the documents; node weights are then taken into account in the topic model to derive an iterative formula, and the TextRank algorithm is used to extract themes from multiple documents under the same topic. Neither method can extract themes from domain-specific data: in the first method, after classification the sentences deemed suitable as themes may include many sentences unrelated to the target domain, while the second method, when modeling topics with LDA, may likewise model data unrelated to the domain.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art and provide an unsupervised electric power document theme generation method and system for quickly generating document themes in the power domain. The invention first uses correlation analysis to screen out document data relevant to the target domain, then applies clustering to group similar documents, and finally performs theme extraction on each cluster; applying this in a theme extraction system makes extracting domain-specific themes more practical.
To achieve the above object, the technical solution adopted by the invention is as follows:
An unsupervised electric power document theme generation method, whose steps include:
Collecting public opinion raw data, organizing it into documents, and transforming each document into a tfidf vector;
Computing a matching value for each document according to whether its words appear in a power domain vocabulary, combined with the generated tfidf vector;
Taking documents whose matching value is greater than 0 as documents relevant to the power domain;
Converting the power-domain-relevant documents into tfidf vectors and clustering them to obtain documents of different categories;
Splitting the documents of each category into sentences, and adding each sentence together with its corresponding word list as a node of an undirected graph;
Vectorizing the word list of each sentence, computing the spatial distance between vectors as the similarity between documents, and adding each similarity to the undirected graph as an edge in one-to-one correspondence with the nodes, completing the construction of the undirected graph;
Sorting the nodes of the undirected graph by similarity in descending order; the sentences represented by the top K nodes serve as the theme of the document.
Further, the raw data is collected from the State Grid public opinion monitoring system; data sources include text publishing platforms such as WeChat official accounts, Sina Weibo, Tieba, forums, and news sites.
Further, the collected raw data includes the title and content of each document.
Further, documents are represented as tfidf vectors using a trained tfidf vectorizer; the training steps of the tfidf vectorizer include:
Randomly sampling a number of raw data items from the documents and filtering out useless symbols and English letters;
Segmenting the documents into words with the word segmentation tool pyltp, converting the text data into word lists, and removing stop words;
Training on the word lists after stop word removal with the TfidfVectorizer() function in sklearn to generate tfidf vectors, together with the mapping from element indices in the tfidf vector to words.
Further, the document matching value is computed as follows:
Sorting the words in the document by their corresponding tfidf values in descending order to generate a new word list;
Traversing the word list: if a word with a larger tfidf value appears in the power domain vocabulary, its tfidf value is added to the sum; if a word with a smaller tfidf value appears in the power domain vocabulary, 1/2 of its tfidf value is added; the two partial sums are added together to obtain the total tfidf sum of the words in the document;
The resulting sum is then normalized to obtain the final matching value, using the following formula:

scores = doc_score / n

where scores is the matching value, doc_score is the tfidf sum, and n is the number of words in the document.
Further, in the word list, the words with larger tfidf values are the top 15% of the word list, and the words with smaller tfidf values are the remaining 85% of the word list.
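A minimal pure-Python sketch of this matching-value computation. The example tfidf weights and the vocabulary are invented for illustration, and the final `doc_score / n` normalization is an assumption based on the description (the patent's formula itself is not reproduced in the text).

```python
def matching_value(word_tfidf, domain_vocab, top_frac=0.15):
    """Compute the document/domain matching value described above.

    word_tfidf:   list of (word, tfidf) pairs for one document.
    domain_vocab: set of power-domain vocabulary words.
    Words in the top `top_frac` of the tfidf ranking contribute their
    full tfidf value; the remaining words contribute half. The sum is
    then normalized by the number of words (assumed normalization).
    """
    ranked = sorted(word_tfidf, key=lambda p: p[1], reverse=True)
    if not ranked:
        return 0.0
    top_k = int(len(ranked) * top_frac)
    doc_score = 0.0
    for i, (word, value) in enumerate(ranked):
        if word in domain_vocab:
            doc_score += value if i < top_k else value / 2
    return doc_score / len(ranked)

# Illustrative tfidf weights and vocabulary (not from the patent).
doc = [("transformer", 0.9), ("outage", 0.7), ("grid", 0.5),
       ("weather", 0.3), ("city", 0.2), ("today", 0.1), ("rain", 0.05)]
vocab = {"transformer", "outage", "grid"}
print(matching_value(doc, vocab))
```

Documents whose matching value exceeds the chosen threshold are kept as power-domain-relevant.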
Further, the method steps for converting the power-domain-relevant documents into tfidf vectors are:
For each document relevant to the power domain, filtering out irrelevant symbols and letters, then performing word segmentation and removing stop words to generate a word list;
Training a tfidf vectorizer on the generated word lists;
Using the trained tfidf vectorizer to convert the power-domain-relevant documents into tfidf vectors.
Further, the clustering methods include Kmeans and Dbscan.
Further, if the Kmeans method is used for clustering, the steps include:
Reducing the dimensionality of the tfidf vectors converted from the power-domain-relevant documents using the TruncatedSVD() method in sklearn;
Applying KMeans() cluster analysis in sklearn to the reduced vectors;
Writing the categories after clustering, and the documents under each category, to the file system.
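The Kmeans branch maps directly onto sklearn. In this sketch the tfidf matrix is replaced by a toy random matrix with two separable groups, and the component count and cluster count are illustrative choices the patent leaves open.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy stand-in for the documents' tfidf matrix (rows = documents):
# two groups of 10 rows, offset so they form clear clusters.
rng = np.random.RandomState(0)
tfidf = np.vstack([rng.rand(10, 20) + 2, rng.rand(10, 20)])

# Reduce dimensionality to speed up clustering.
svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(tfidf)

# Cluster the reduced vectors; the cluster count is illustrative
# (the patent suggests values such as 3, 5, 8, 10).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
print(labels)
```

Each label identifies the category a document is written under; in the patent, the documents of each category are then written to disk for theme generation.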
Further, if the Dbscan method is used for clustering, the steps include:
Clustering the tfidf vectors converted from the power-domain-relevant documents using the DBSCAN() method in sklearn;
Writing the multiple categories after clustering, and the files under each category, to the local file system.
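Similarly, the Dbscan branch can be sketched with sklearn's DBSCAN; `eps` and `min_samples` are illustrative parameters the patent does not specify, and the input is again a toy stand-in for the tfidf vectors.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated toy groups standing in for tfidf vectors.
rng = np.random.RandomState(0)
vecs = np.vstack([rng.rand(10, 4) * 0.1, rng.rand(10, 4) * 0.1 + 5])

# DBSCAN discovers the number of clusters itself; -1 marks noise points.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(vecs)
print(labels)
```

Unlike Kmeans, no cluster count needs to be chosen in advance, which suits corpora where the number of topic categories is unknown.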
Further, the documents of each category are cut into short sentences using the punctuation marks in the documents as separators; the short sentences after cutting are segmented into words with the word segmentation tool pyltp and stop words are removed, generating a word list for each sentence.
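A sketch of the sentence-cutting step. The separator set and stop word list are illustrative, and pyltp word segmentation is stubbed out with a whitespace split (pyltp itself segments unspaced Chinese text), since this sketch only illustrates the cutting logic.

```python
import re

# Punctuation separators (assumed set; the embodiment names marks
# such as "," and "!") and an illustrative stop word list.
SEPARATORS = r"[,，。！!？?；;]"
STOP_WORDS = {"的", "了", "and", "the"}

def cut_sentences(document):
    """Cut a document into short sentences at punctuation marks."""
    return [s.strip() for s in re.split(SEPARATORS, document) if s.strip()]

def to_word_list(sentence):
    """Placeholder for pyltp word segmentation: whitespace split
    plus stop word removal."""
    return [w for w in sentence.split() if w not in STOP_WORDS]

doc = "the grid failed, repair crews arrived! power restored"
sentences = cut_sentences(doc)
word_lists = [to_word_list(s) for s in sentences]
print(sentences)
print(word_lists)
```

Each (sentence, word list) pair later becomes one node of the undirected graph.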
Further, the method steps for vectorizing the documents of each category are:
Training tfidf vectors from the word lists of the sentences;
Representing each word in the word list as a word2vec vector, obtaining each word's weight in the document from the document's tfidf vector, and computing the vectorized result of the documents under each category according to the following formula:

v = ω1V1 + ω2V2 + ... + ωnVn

where v is the vectorization result, Vn is the word2vec vector of a word, ωn is the corresponding element in the tfidf vector, and n is a natural number greater than or equal to 1.
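The combination of word2vec vectors and tfidf weights can be sketched with numpy as a weighted sum, as reconstructed from the description (the patent's original equation is an image); the toy word vectors and weights below are placeholders.

```python
import numpy as np

def document_vector(word_vectors, tfidf_weights):
    """Combine per-word word2vec vectors V_i with tfidf weights w_i
    into one document vector v = sum_i w_i * V_i (reconstructed
    formula, assumed from the surrounding description)."""
    V = np.asarray(word_vectors, dtype=float)   # shape (n_words, dim)
    w = np.asarray(tfidf_weights, dtype=float)  # shape (n_words,)
    return w @ V                                # weighted sum over words

# Toy word2vec vectors for a three-word sentence and their tfidf weights.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, 0.3, 0.2]
v = document_vector(V, w)
print(v)  # array of shape (2,)
```

Weighting by tfidf lets the words most characteristic of a document dominate its vector, which is the advantage the invention claims over plain averaging.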
Further, K is the number of sentences used to generate the document theme after sorting by document similarity.
An unsupervised fast electric power document theme generation system includes a memory and a processor; the memory stores a computer program configured to be executed by the processor, and the program includes instructions for executing each step of the above method.
The principle of the present invention is as follows: on the basis of existing text data, a matching value between each original document and the power domain is computed; when the matching value exceeds a threshold, the document is considered correlated with electric power. By computing the correlation between the original documents and the power domain, the data highly relevant to the power domain is filtered out. Kmeans or Dbscan cluster analysis is then applied to these documents to find the multiple documents under each category, and finally the TextRank algorithm is used to extract a theme from each cluster.
Compared with the prior art, the present invention has the following advantages:
1. It takes unlabeled electric power documents as system input and uses clustering and text theme algorithms to obtain document themes rapidly, realizing a theme generation system; this makes obtaining power domain document themes more practical and feasible, with greater application and promotion value.
2. To better vectorize documents, the invention proposes a vectorization scheme that combines the advantages of word2vec and tfidf, effectively exploiting the influence of a document's salient words on the vector, so that the generated document vectors are more expressive.
3. Through modular system design, functions such as text similarity computation, clustering, and theme extraction are combined, making the system more flexible and robust.
4. The invention provides an unsupervised theme generation system that avoids the difficulty of obtaining supervision information; themes extracted in an unsupervised way have higher readability and stronger stability.
Brief description of the drawings
Fig. 1 is a flowchart of the unsupervised electric power document theme generation method of this embodiment.
Specific embodiment
To make the above features and advantages of the invention clearer and easier to understand, the specific embodiments of the invention are further described below with reference to the accompanying drawing.
This embodiment provides an unsupervised electric power document theme generation method to realize rapid, unsupervised theme generation for electric power documents. As shown in Fig. 1, it includes the following steps:
1. Matching the raw data that is correlated with the power domain; the specific steps are as follows:
1.1 Collect raw data from the State Grid public opinion monitoring system; data sources include text publishing platforms such as WeChat official accounts, Sina Weibo, Tieba, forums, and news sites;
1.2 The collected raw data includes the title and content of each document; organize the documents;
1.3 Randomly sample a number of raw data items from the documents and train a tfidf vectorizer; the training process is as follows:
1.3.1 First randomly select a number of raw data items and filter out useless symbols and English letters;
1.3.2 Segment the documents into words with the word segmentation tool pyltp, convert the text data into word lists, and remove stop words;
1.3.3 On the cleaned word lists, train with the TfidfVectorizer() function in sklearn to generate tfidf vectors, together with the mapping from element indices in the tfidf vector to words;
1.4 Read the documents one by one and process them as in 1.3.1: remove irrelevant symbols from the data and then segment words, then vectorize each document with the trained tfidf vectorizer; the vectorized representation of a document is denoted doc_csr.
1.5 Read the power vocabulary with high power-domain relevance from a local file and keep it in memory, denoted target_word_set.
1.6 Compute the matching value between each document and the power domain vocabulary, denoted doc_scores; the specific steps are as follows:
1.6.1 First find the non-zero element values in the doc_csr vector and store them in a list, denoted scores;
1.6.2 Sort scores in descending order and generate the corresponding indices into scores, denoted sorted_ptr;
1.6.3 From scores and sorted_ptr, generate the list of tfidf values in descending order, denoted sorted_scores;
1.6.4 From sorted_ptr and idx_2_word, generate the list of the document's words sorted by tfidf value in descending order, denoted sorted_words;
1.6.5 Traverse the word list sorted_words and compute doc_scores; the computation process is as follows:
1.6.5.1 Set a threshold top_k_word_num, usually 15% of the document's word count doc_words_num, i.e. doc_words_num*0.15;
1.6.5.2 For the first top_k_word_num words of sorted_words, if a word appears in target_word_set, add its corresponding tfidf value in sorted_scores to the sum;
1.6.5.3 When the number of words in sorted_words exceeds top_k_word_num, for the words whose element index in sorted_scores is greater than top_k_word_num, add 1/2 of their element values to the sum;
1.6.5.4 Add the values of the above two steps together to obtain the total sum;
1.7 Finally obtain each document and its matching value with the power domain;
The total sum is normalized to obtain the final matching value, using the following formula:

scores = doc_score / n

where scores is the matching value, doc_score is the tfidf sum, and n is the number of words in the document.
2. Cluster the documents that are correlated with the power domain; the specific steps are as follows:
2.1 Set a threshold relevant_score_threshold in the range greater than 0 and less than 1; when a document's matching value score with the power domain is greater than relevant_score_threshold, the document is considered correlated with the power domain;
2.2 Read the score of each document and compare score with relevant_score_threshold to find all documents relevant to the power domain;
2.3 For the documents relevant to the power domain, again filter out irrelevant symbols and letters, perform word segmentation, and remove stop words to generate word lists;
2.4 Train a tfidf vectorizer on the generated word lists and vectorize the documents; the result is denoted V2;
2.5 Select a clustering method; clustering methods include but are not limited to Kmeans and Dbscan;
2.6 If the Kmeans method is selected, the clustering steps are as follows:
2.6.1 To improve clustering efficiency, reduce the dimensionality of V2 using the TruncatedSVD() method in sklearn;
2.6.2 Set the number of clusters K >= 1, including but not limited to 3, 5, 8, 10; apply KMeans() cluster analysis in sklearn to the reduced vectors;
2.6.3 Write the categories after clustering, and the documents under each category, to local disk for later use in theme generation;
2.7 If the Dbscan method is selected, the clustering steps are as follows:
2.7.1 Cluster V2 using the DBSCAN() method in sklearn;
2.7.2 Write the multiple categories after clustering, and the files under each category, to the file system;
3. Rapidly extract a theme from the documents of each category after clustering; the specific steps are as follows:
3.1 Read the documents of each category in turn, and cut each document into short sentences using punctuation marks (",", "!", etc.) as separators;
3.2 Build an undirected graph with no nodes and no edges;
3.3 Segment each sentence after cutting into words with the word segmentation tool pyltp, then remove stop words, generating the word list [W1, W2, W3, ..., Wn-1, Wn] for each sentence;
3.4 Add each sentence and its corresponding word list as a node of the undirected graph, numbering the added nodes 0, 1, ..., n;
3.5 Compute the pairwise similarity between nodes of the undirected graph as the edges of the graph; the computation process is as follows:
3.5.1 Read the local word2vec vectors;
3.5.2 Train tfidf vectors from the word lists of the sentences;
3.5.3 Using the word2vec vectors and the tfidf vector of each sentence, generate the vector of each document; the generation process is as follows:
3.5.3.1 Suppose the word segmentation result of a document is ['air', 'switch', 'install', 'home', 'power outage', 'electric shock']; the vector of 'air' in word2vec is V1, and so on, up to the vector of 'electric shock' in word2vec, which is V6;
3.5.3.2 From the document's tfidf vector [ω1, ω2, ω3, ..., ωn], obtain the weight of each word in the document;
3.5.3.3 Generate the document vector v according to the following formula:

v = ω1V1 + ω2V2 + ... + ωnVn
3.5.4 Using the generated document vectors, compute the spatial distance between vectors as the pairwise document similarity;
3.5.5 Add the similarity between documents to the undirected graph as edges; through the node numbers, nodes and edges are put in one-to-one correspondence, and the construction of the complete undirected graph is finished;
3.5.6 Sort the nodes of the constructed undirected graph in descending order by their edges (i.e. similarity);
3.5.7 The top K nodes (i.e. sentences) by similarity serve as the theme of the document; K can take any value greater than 1 and less than the total number of nodes in the undirected graph, and the K value generally controls the number of sentences in the generated theme.
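Steps 3.4 to 3.5.7 can be sketched end to end: build a complete undirected graph over sentence vectors, use cosine similarity as edge weights, and take the top-K sentences. Ranking each node by the sum of its incident edge similarities is an assumption, since the patent only says nodes are sorted by similarity in descending order; the toy sentences and vectors are placeholders.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_sentences(sentences, vectors, k):
    """Build a complete undirected graph whose edge weights are the
    pairwise cosine similarities, then return the k sentences whose
    total incident edge weight is largest (assumed ranking criterion)."""
    n = len(sentences)
    score = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):  # each undirected edge once
            sim = cosine(vectors[i], vectors[j])
            score[i] += sim
            score[j] += sim
    order = sorted(range(n), key=lambda i: score[i], reverse=True)
    return [sentences[i] for i in order[:k]]

# Toy sentence vectors: the first two point the same way, the third differs.
sents = ["grid outage reported", "outage repaired quickly", "unrelated note"]
vecs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.0, 1.0])]
print(top_k_sentences(sents, vecs, k=2))
```

Sentences most similar to the rest of the document accumulate the highest scores, so the selected K sentences act as the document's theme.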
The following is an experimental comparison between the method of the present invention and a prior art method:
Electric power document: The power plant community went deep into its jurisdiction to carry out flood prevention work with all its strength, in order to conscientiously handle the response to extreme rainstorm weather, ensure the safety of people's lives and property, and maintain normal order in the jurisdiction. Community workers carried out flood control safety publicity and safety hazard inspections in the local area and strengthened the safety awareness of community residents. On August 14, 2018, the power plant community carried out safety hazard inspections of key locations such as old one-story houses, low-lying areas, and advertising boards. The community deputy secretary led the staff in a focused inspection of local areas with safety hazards, conscientiously checking key areas such as the dilapidated one-story house area, the paper mill cottage area, and low-lying areas where a rainstorm, once it strikes, would endanger residents' life safety. In addition, the community also strengthened the on-duty work during the flood season: community personnel remained on standby, kept communications open around the clock, and ensured that personnel were in place.
The result extracted by the prior art method: On August 14, 2018, the power plant community carried out safety hazard inspections of key locations such as old one-story houses, low-lying areas, and advertising boards. The community deputy secretary led the staff in a focused inspection of areas in the jurisdiction with safety hazards, conscientiously checking key areas such as the dilapidated cottage area, the paper mill cottage area, and low-lying areas where a rainstorm, once it strikes, would endanger residents' life safety.
The result extracted by the method of the present invention: The power plant community went deep into its jurisdiction to carry out flood prevention work with all its strength, in order to conscientiously handle the response to extreme rainstorm weather, ensure the safety of people's lives and property, and maintain normal order in the jurisdiction. Community workers carried out flood control safety publicity and safety hazard inspections in the jurisdiction and strengthened the safety awareness of community residents.
As can be seen from the above results, the result extracted by the prior art only involves part of the original content and cannot serve as the theme of the original document, while the result extracted by the method of the present invention better summarizes the content of the original text and is well suited to serve as its theme.
Finally, it should be noted that the purpose of the disclosed embodiments is to help further understand the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the embodiments; the scope of protection of the invention is defined by the claims.
Claims (10)
1. An unsupervised electric power document theme generation method, whose steps include:
Collecting public opinion raw data, organizing it into documents, and transforming each document into a tfidf vector;
Computing a matching value for each document according to whether its words appear in a power domain vocabulary, combined with the generated tfidf vector;
Taking documents whose matching value is greater than 0 as documents relevant to the power domain;
Converting the power-domain-relevant documents into tfidf vectors and clustering them to obtain documents of different categories;
Splitting the documents of each category into sentences, and adding each sentence together with its corresponding word list as a node of an undirected graph;
Vectorizing the word list of each sentence, computing the spatial distance between vectors as the similarity between documents, and adding each similarity to the undirected graph as an edge in one-to-one correspondence with the nodes, completing the construction of the undirected graph;
Sorting the nodes of the undirected graph by similarity in descending order; the sentences represented by the top K nodes serve as the theme of the document, where K is the number of sentences used to generate the document theme after sorting by similarity.
2. The method of claim 1, wherein the raw data is collected from the State Grid public opinion monitoring system, and data sources include WeChat official accounts, Sina Weibo, Tieba, forums, and news sites.
3. The method of claim 1, wherein the collected raw data includes the title and content of each document.
4. The method of claim 1, wherein documents are transformed into tfidf vectors using a trained tfidf vectorizer, and the training steps of the tfidf vectorizer include:
Randomly sampling a number of raw data items from the documents and filtering out useless symbols and English letters;
Segmenting the documents into words with the word segmentation tool pyltp, converting the text data into word lists, and removing stop words;
Training on the word lists after stop word removal with the TfidfVectorizer() function in sklearn to generate tfidf vectors, together with the mapping from element indices in the tfidf vector to words.
5. The method of claim 1, wherein the method steps for computing the document matching value include:
Sorting the words in the document by their corresponding tfidf values in descending order to generate a new word list;
Traversing the word list: if a word in the top 15% of the word list appears in the power domain vocabulary, its tfidf value is added to the sum; if a word in the remaining 85% appears in the power domain vocabulary, 1/2 of its tfidf value is added; the two partial sums are added together to obtain the total tfidf sum of the words in the document;
The resulting sum is normalized to obtain the final matching value, using the following formula:

scores = doc_score / n

where scores is the matching value, doc_score is the tfidf sum, and n is the number of words in the document.
6. The method of claim 1, wherein the method steps for converting the power-domain-relevant documents into tfidf vectors include:
For each document relevant to the power domain, filtering out irrelevant symbols and letters, then performing word segmentation and removing stop words to generate a word list;
Training a tfidf vectorizer on the generated word lists;
Using the trained tfidf vectorizer to convert the power-domain-relevant documents into tfidf vectors.
7. The method of claim 1, wherein the clustering methods include Kmeans and Dbscan;
The steps for clustering with the Kmeans method include:
Reducing the dimensionality of the tfidf vectors converted from the power-domain-relevant documents using the TruncatedSVD() method in sklearn;
Applying KMeans() cluster analysis in sklearn to the reduced vectors;
Writing the categories after clustering, and the documents under each category, to the file system;
The steps for clustering with the Dbscan method include:
Clustering the tfidf vectors converted from the power-domain-relevant documents using the DBSCAN() method in sklearn;
Writing the multiple categories after clustering, and the files under each category, to the local file system.
8. The method of claim 1, wherein the documents of each category are cut into short sentences using the punctuation marks in the documents as separators; the short sentences after cutting are segmented into words with the word segmentation tool pyltp and stop words are removed, generating a word list for each sentence.
9. The method of claim 1, wherein the method steps for vectorizing the documents of each category include:
Training tfidf vectors from the word lists of the sentences;
Representing each word in the word list as a word2vec vector, obtaining each word's weight in the document from the document's tfidf vector, and computing the vectorized result of the documents under each category according to the following formula:

v = ω1V1 + ω2V2 + ... + ωnVn

where v is the vectorization result, Vn is the word2vec vector of a word, ωn is the corresponding element in the tfidf vector, and n is a natural number greater than or equal to 1.
10. An unsupervised fast electric power document theme generation system, including a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program including instructions for executing each step of the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811488091.2A CN110399606B (en) | 2018-12-06 | 2018-12-06 | Unsupervised electric power document theme generation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399606A true CN110399606A (en) | 2019-11-01 |
CN110399606B CN110399606B (en) | 2023-04-07 |
Family
ID=68322559
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811488091.2A Active CN110399606B (en) | 2018-12-06 | 2018-12-06 | Unsupervised electric power document theme generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399606B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990676A (en) * | 2019-11-28 | 2020-04-10 | 福建亿榕信息技术有限公司 | Social media hotspot topic extraction method and system |
CN111079442A (en) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
CN111241288A (en) * | 2020-01-17 | 2020-06-05 | 烟台海颐软件股份有限公司 | Emergency sensing system of large centralized power customer service center and construction method |
CN112270191A (en) * | 2020-11-18 | 2021-01-26 | 国网北京市电力公司 | Method and device for extracting work order text theme |
CN112380342A (en) * | 2020-11-10 | 2021-02-19 | 福建亿榕信息技术有限公司 | Electric power document theme extraction method and device |
CN113591475A (en) * | 2021-08-03 | 2021-11-02 | 美的集团(上海)有限公司 | Unsupervised interpretable word segmentation method and device and electronic equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009093651A (en) * | 2007-10-05 | 2009-04-30 | Fujitsu Ltd | Modeling topics using statistical distribution |
US20110231411A1 (en) * | 2008-08-08 | 2011-09-22 | Holland Bloorview Kids Rehabilitation Hospital | Topic Word Generation Method and System |
CN104268200A (en) * | 2013-09-22 | 2015-01-07 | 中科嘉速(北京)并行软件有限公司 | Unsupervised named entity semantic disambiguation method based on deep learning |
CN105243152A (en) * | 2015-10-26 | 2016-01-13 | 同济大学 | Graph model-based automatic abstracting method |
CN106294314A (en) * | 2016-07-19 | 2017-01-04 | 北京奇艺世纪科技有限公司 | Topics Crawling method and device |
CN106407182A (en) * | 2016-09-19 | 2017-02-15 | 国网福建省电力有限公司 | A method for automatic abstracting for electronic official documents of enterprises |
CN106844328A (en) * | 2016-08-23 | 2017-06-13 | 华南师范大学 | A new large-scale document topic semantic analysis method and system |
CN106997382A (en) * | 2017-03-22 | 2017-08-01 | 山东大学 | Innovation intention label automatic marking method and system based on big data |
CN108090049A (en) * | 2018-01-17 | 2018-05-29 | 山东工商学院 | Multi-document summary extraction method and system based on sentence vector |
History
- 2018-12-06: Application CN201811488091.2A filed in China; granted as CN110399606B (status: Active)
Non-Patent Citations (1)
Title |
---|
张波飞 et al., "Research on multi-document automatic summarization combining LDA and TextRank" (基于LDA与TextRank结合的多文档自动摘要研究), 《软件导刊》 (Software Guide) *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990676A (en) * | 2019-11-28 | 2020-04-10 | 福建亿榕信息技术有限公司 | Social media hotspot topic extraction method and system |
CN111079442A (en) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
CN111079442B (en) * | 2019-12-20 | 2021-05-18 | 北京百度网讯科技有限公司 | Vectorization representation method and device of document and computer equipment |
US11403468B2 (en) | 2019-12-20 | 2022-08-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for generating vector representation of text, and related computer device |
CN111241288A (en) * | 2020-01-17 | 2020-06-05 | 烟台海颐软件股份有限公司 | Emergency sensing system of large centralized power customer service center and construction method |
CN112380342A (en) * | 2020-11-10 | 2021-02-19 | 福建亿榕信息技术有限公司 | Electric power document theme extraction method and device |
CN112270191A (en) * | 2020-11-18 | 2021-01-26 | 国网北京市电力公司 | Method and device for extracting work order text theme |
CN113591475A (en) * | 2021-08-03 | 2021-11-02 | 美的集团(上海)有限公司 | Unsupervised interpretable word segmentation method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110399606B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110399606A (en) | Unsupervised electric power document theme generation method and system | |
CN107193803B (en) | Semantic-based specific task text keyword extraction method | |
CN109189901B (en) | Method for automatically discovering new classification and corresponding corpus in intelligent customer service system | |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer | |
CN108573047A (en) | Training method and device for an automatic Chinese document classification module |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN108268668B (en) | Topic-diversity-based opinion summary mining method for text data |
CN111950273A (en) | Automatic identification method for network public opinion emergencies based on sentiment information extraction and analysis |
Abu-Errub | Arabic text classification algorithm using TFIDF and chi square measurements | |
CN105224520B (en) | Automatic term recognition method for Chinese patent documents |
CN104298732B (en) | Personalized text ranking and recommendation method for network-oriented users |
CN101702167A (en) | Method for extracting attribution and comment word with template based on internet | |
CN109815400A (en) | Person interest extraction method based on long texts |
CN103092966A (en) | Vocabulary mining method and device | |
CN107357895A (en) | Bag-of-words-based text representation processing method |
CN107526792A (en) | Rapid keyword extraction method for Chinese question sentences |
CN108038204A (en) | Opinion search system and method for social media |
JP3735336B2 (en) | Document summarization method and system | |
Bölücü et al. | Hate Speech and Offensive Content Identification with Graph Convolutional Networks. | |
CN110929022A (en) | Text abstract generation method and system | |
CN111339778B (en) | Text processing method, device, storage medium and processor | |
CN110597982A (en) | Short text topic clustering algorithm based on word co-occurrence network | |
CN107590163B (en) | Text feature selection method, device and system |
CN111930885B (en) | Text topic extraction method and device and computer equipment | |
CN111538893B (en) | Method for extracting network security new words from unstructured data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |