CN108052593A - Topic keyword extraction method based on topic word vectors and network structure - Google Patents

Info

Publication number
CN108052593A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711315360.0A
Other languages
Chinese (zh)
Other versions
CN108052593B (en)
Inventor
胡晓慧
李超
曾庆田
戴明弟
赵中英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Application filed by Shandong University of Science and Technology
Priority to CN201711315360.0A
Publication of CN108052593A
Application granted
Publication of CN108052593B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic keyword extraction method based on topic word vectors and network structure, and relates to the technical field of extracting keywords from text. The method performs topic clustering on a text corpus with an LDA topic model and, for each topic, obtains the 100 keywords most relevant to that topic (the top 100). Each word in the corpus is represented as a word vector using word2vec, and the semantic similarity between every pair of words is computed from these vectors. For each keyword, the 5 semantically most similar words (the top 5) are found, and the keywords together with their top-5 similar words form a new keyword set. A keyword network is then built over each set, and the top 20 words of each set are taken as the keywords of the topic. The method extracts keywords with high word frequency in the documents and also effectively finds keywords that have low word frequency but a strong relation to the topic.

Description

Topic keyword extraction method based on topic word vectors and network structure
Technical field
The present invention relates to the technical field of extracting keywords from text, and in particular to a topic keyword extraction method based on topic word vectors and network structure.
Background
Representation learning is widely used in natural language processing. Word vectors learned with word2vec capture the semantics and syntactic regularities of words well, while topic models explain how topics cluster at the document level. Research that fuses topic models with word-vector representations of topic keywords is therefore increasingly widespread.
LDA topic model: among the topic models that have been proposed, LDA is a generative model that can generalize topic distributions. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics, and each topic is in turn modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. The LDA modeling process can be described as finding a topic mixture for each resource, i.e. P(z | d), with each topic described by another probability distribution, i.e. P(t | z). Formally:
$$P(t_i \mid d) = \sum_{j=1}^{Z} P(t_i \mid z_i = j)\, P(z_i = j \mid d)$$
where P(t_i | d) is the probability of the i-th term in a given document d, z_i is the latent topic, P(t_i | z_i = j) is the probability of t_i in topic j, and P(z_i = j | d) is the probability of topic j in document d. The number of latent topics Z must be defined in advance. LDA uses a Dirichlet prior and a fixed number of topics to estimate the topic-word distribution P(t | z) and the document-topic distribution P(z | d) from an unlabeled corpus.
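As a minimal sketch of how the two distributions P(t | z) and P(z | d) described above combine into a per-document term probability, the following Python sums over the latent topics. The function name word_prob_in_doc and the toy probabilities are illustrative assumptions and do not come from the patent.

```python
def word_prob_in_doc(p_t_given_z, p_z_given_d):
    """Return P(t | d) = sum_j P(t | z=j) * P(z=j | d) for every term t."""
    vocab_size = len(p_t_given_z[0])
    n_topics = len(p_z_given_d)
    return [
        sum(p_t_given_z[j][t] * p_z_given_d[j] for j in range(n_topics))
        for t in range(vocab_size)
    ]

# Toy numbers (hypothetical): 2 topics over a 3-term vocabulary.
p_t_given_z = [
    [0.7, 0.2, 0.1],  # topic 0: P(t | z=0)
    [0.1, 0.3, 0.6],  # topic 1: P(t | z=1)
]
p_z_given_d = [0.25, 0.75]  # topic mixture of one document

p_t_given_d = word_prob_in_doc(p_t_given_z, p_z_given_d)
```

Because each row of the topic-word matrix and the topic mixture both sum to 1, the resulting term probabilities also sum to 1.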
LDA is a very widely used topic model, and most other topic models are extensions of it. However, the keywords LDA extracts are generally too broad to reflect the topic of an article well, and the method proposed by the present invention is an innovation in this respect.
Word embedding: word embedding encodes each word as a vector (a word vector) according to syntactic and semantic information, so that similar words lie close to each other in the vector space. After a language model has been built from natural text and word vectors obtained, the vectors can serve as input to a neural network for syntactic analysis, sentiment analysis, and so on, or be used as auxiliary features to extend an existing model. Word vectors alone, however, cannot identify the intended topic of a text and must be combined with a topic model.
Existing unsupervised keyword extraction techniques mainly include TF-IDF, topic models, and TextRank. Their technical shortcomings are mainly the following:
TF-IDF is a common weighting technique for information retrieval and data mining. It measures the importance of search keywords and also performs well when applied to extracting keywords from text. However, TF-IDF is based on word frequency and the cross entropy of the keyword probability distribution; it considers neither the order in which words occur nor the relation between each word and its context.
Widely used topic models such as LDA can mine topics from documents well, but the keywords they extract are too broad: many have high word frequency yet are unrelated to the topic and cannot reflect it well, so they are unsuitable as keywords.
TextRank is a graph-based ranking algorithm for text. It splits the text into sentences, builds a graph model from the co-occurrence of words within their context, and extracts keywords according to the PageRank values in the graph. By considering word frequency and word co-occurrence, the algorithm extracts the keywords of a single document simply and effectively, but it cannot identify or cluster the topics of multiple documents, and therefore cannot extract the keywords of documents under a particular topic.
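The point that TF-IDF ignores word order can be illustrated with a small sketch. It uses the common term-frequency times log inverse-document-frequency form, which is an assumption here since the patent does not give a formula; the function name tf_idf and the toy documents are also hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = [
    ["library", "opening", "hours", "notice"],
    ["notice", "hours", "opening", "library"],  # same words, different order
    ["exam", "schedule", "notice"],
]
scores = tf_idf(docs)
```

The first two documents receive identical scores even though their word order differs, and a word that occurs in every document ("notice") scores zero regardless of its context.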
Summary of the invention
In view of the above deficiencies, the object of the present invention is to provide a keyword extraction method that combines the LDA topic model with word embedding and extracts the keywords of texts on the same topic through network propagation of similarity. The method extracts keywords with high word frequency in the documents and also effectively finds keywords with low word frequency but a strong relation to the topic.
The present invention adopts the following technical solution:
As shown in Fig. 1, a topic keyword extraction method based on topic word vectors and network structure comprises:
Segmenting the original text corpus into words;
Performing topic clustering on the text corpus with an LDA topic model, and obtaining for each topic the set of the 100 keywords most relevant to the topic, KeywordsSet_1 = {k_1, ..., k_100};
Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;
For each keyword in KeywordsSet_1, finding the 5 semantically most similar words; the words in KeywordsSet_1 together with their top-5 similar words form a new keyword set KeywordsSet_2;
Building a keyword network with each keyword in KeywordsSet_2 as a node and the reciprocal of the semantic similarity between two words as the edge weight, and, according to the PageRank value of each node, taking the top 20 words in KeywordsSet_2 as the keywords of the topic to form the final keyword set KeywordsSet_final.
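The cosine similarity and top-5 expansion steps can be sketched in plain Python as follows. This is an illustration only: the helper names cosine and expand_keywords and the two-dimensional toy vectors are assumptions, and real word2vec vectors would have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(keywords, vectors, top_n=5):
    """Return the keywords plus each keyword's top_n most similar words, deduplicated."""
    expanded = list(keywords)
    seen = set(keywords)
    for kw in keywords:
        neighbours = sorted(
            (w for w in vectors if w != kw),
            key=lambda w: cosine(vectors[kw], vectors[w]),
            reverse=True,
        )[:top_n]
        for w in neighbours:
            if w not in seen:
                seen.add(w)
                expanded.append(w)
    return expanded

# Hypothetical 2-D word vectors for illustration.
vectors = {
    "exam": [1.0, 0.0],
    "test": [0.9, 0.1],
    "quiz": [0.8, 0.3],
    "library": [0.0, 1.0],
    "book": [0.1, 0.9],
}
keyword_set_2 = expand_keywords(["exam"], vectors, top_n=2)
```

With top_n=5 and a 100-word initial set, the expanded set holds at most 600 words per topic before deduplication.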
Preferably, in the word segmentation step the original text is split into word sequences for the subsequent topic clustering and keyword extraction. When the segmentation result is used as input to word2vec, special characters are removed; when it is used as input to LDA, function words, place names that cannot serve as topic keywords, and repeated prepositions unrelated to the topic are removed.
Preferably, when topic clustering is performed on the text corpus with the LDA topic model, perplexity, as used in language modeling, measures the quality of the model: a lower perplexity indicates better generalization. Perplexity is computed as:
$$\mathrm{perplexity} = e^{-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}} \qquad (1)$$
where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
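Perplexity as defined here is the exponential of the negative average log-likelihood, where the likelihood of each word is its topic-word probability P(w_i | t_j) weighted by the document-topic probability P(t_j | d) and summed over the K topics. A minimal sketch, assuming N is the length of the word list passed in and using hypothetical uniform probabilities:

```python
import math

def perplexity(words, p_w_given_t, p_t_given_d):
    """exp(-(1/N) * sum_i log sum_j P(w_i | t_j) * P(t_j | d))."""
    log_likelihood = 0.0
    for w in words:
        # Mixture probability of word w under the document's topic distribution.
        mixture = sum(
            p_w_given_t[j][w] * p_t_given_d[j] for j in range(len(p_t_given_d))
        )
        log_likelihood += math.log(mixture)
    return math.exp(-log_likelihood / len(words))

# Toy model (hypothetical): 2 topics, each uniform over 4 words.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
ppl = perplexity(["a", "b", "c", "d"], [uniform, uniform], [0.5, 0.5])
```

For a model that spreads probability uniformly over four words, the perplexity equals the vocabulary size, which matches the usual reading of perplexity as an effective branching factor.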
Preferably, in the word vector generation step, the segmentation result of the combined title and body text is used as input to train the model that yields a word vector for each word.
Preferably, the keyword network is built by the following steps:
S1: Using the cosine between word vectors, find within the same topic the 5 words semantically most similar to each initial keyword obtained in the topic clustering step, remove duplicates, and merge them with KeywordsSet_1 to form the new keyword set KeywordsSet_2;
S2: For each topic, compute the pairwise similarity between the words in KeywordsSet_2 and use its reciprocal as the weight between the two nodes;
S3: Set a threshold and filter out edges whose similarity is below the threshold;
S4: Build the keyword network of each topic;
S5: Extract the topic keywords: once the keyword network is built, compute the PageRank values in each topic network, take the top 20 nodes in descending order of PageRank, and use the corresponding words as the keyword set KeywordsSet_final of the topic.
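Steps S2 to S5 can be sketched in plain Python as follows. The PageRank here is a standard damped power iteration over a weighted undirected graph with a damping factor of 0.85; the patent does not specify the PageRank variant or its parameters, so these are assumptions, as are the function names, the toy similarity values, and the threshold (which must be positive so that no zero similarity is inverted).

```python
from itertools import combinations

def build_keyword_network(words, similarity, threshold):
    """S2-S4: undirected graph {word: {neighbour: weight}}, weight = 1 / similarity."""
    graph = {w: {} for w in words}
    for a, b in combinations(words, 2):
        sim = similarity(a, b)
        if sim >= threshold:  # S3: drop edges whose similarity is below the threshold
            graph[a][b] = graph[b][a] = 1.0 / sim
    return graph

def pagerank(graph, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; isolated nodes spread their rank uniformly."""
    n = len(graph)
    rank = {w: 1.0 / n for w in graph}
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    for _ in range(iterations):
        dangling = sum(rank[w] for w in graph if out_weight[w] == 0.0)
        new_rank = {}
        for w in graph:
            incoming = sum(rank[v] * graph[v][w] / out_weight[v] for v in graph[w])
            new_rank[w] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank

def top_keywords(graph, k=20):
    """S5: the k words with the highest PageRank value."""
    rank = pagerank(graph)
    return sorted(rank, key=rank.get, reverse=True)[:k]

# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
toy_sims = {
    frozenset(("exam", "course")): 0.8,
    frozenset(("exam", "teaching")): 0.7,
    frozenset(("course", "teaching")): 0.2,
}

def toy_similarity(a, b):
    return toy_sims.get(frozenset((a, b)), 0.1)

words = ["exam", "course", "teaching", "library"]
graph = build_keyword_network(words, toy_similarity, threshold=0.5)
ranks = pagerank(graph)
best = top_keywords(graph, k=1)
```

In this toy network "exam" is connected to two other words and ends up with the highest rank, while "library" has no surviving edges and keeps only the baseline rank.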
The present invention has the following beneficial effects:
The invention first clusters the text corpus with the LDA topic model; second, it represents each word in the text corpus as a word vector using word2vec; then, for each keyword in the documents of a topic, it obtains the 5 most similar words, which together with the keywords form a new keyword set. Finally, with keywords as nodes and the similarity between words determining the edge weights, it builds a keyword network and takes the core nodes of the network as the keywords of the topic;
The method combines the LDA topic model with word embedding and extracts the keywords of texts on the same topic through network propagation of similarity. It extracts keywords with high word frequency in the documents and at the same time effectively finds keywords with low word frequency but a strong relation to the topic;
While taking word frequency into account, the method performs a second round of keyword discovery based on word-vector relations: words with low frequency but similar semantics are added to the candidate keyword set, which reasonably widens the range of candidate keywords and makes the final keywords of a topic more closely related semantically;
By introducing word vectors and building the network from the distances between word vectors, the method finds keywords of similar meaning under the same topic more accurately and thus produces more accurate results.
Description of the drawings
Fig. 1 is a flow chart of the topic keyword extraction method based on topic word vectors and network structure;
Fig. 2 is the perplexity curve;
Fig. 3 is the keyword distribution map for teaching notices;
Fig. 4 is the keyword distribution map for award and evaluation notices;
Fig. 5 is the keyword distribution map for library notices.
Detailed description
The specific embodiments of the present invention are further described below with reference to the drawings:
As shown in Fig. 1, a topic keyword extraction method based on topic word vectors and network structure comprises:
Segmenting the original text corpus into words;
Performing topic clustering on the text corpus with an LDA topic model, and obtaining for each topic the set of the 100 keywords most relevant to the topic, KeywordsSet_1 = {k_1, ..., k_100};
Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;
For each keyword in KeywordsSet_1, finding the 5 semantically most similar words; the words in KeywordsSet_1 together with their top-5 similar words form a new keyword set KeywordsSet_2;
Building a keyword network with each keyword in KeywordsSet_2 as a node and the reciprocal of the semantic similarity between two words as the edge weight, and, according to the PageRank value of each node, taking the top 20 words in KeywordsSet_2 as the keywords of the topic to form the final keyword set KeywordsSet_final.
In the word segmentation step, the original text is split into word sequences for the subsequent topic clustering and keyword extraction. When the segmentation result is used as input to word2vec, special characters are removed; when it is used as input to LDA, function words, place names that cannot serve as topic keywords, and frequently repeated prepositions that are largely unrelated to the topic are removed.
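The two filtering rules can be sketched as follows, assuming the text has already been segmented into tokens. The three filter lists are hypothetical placeholders, since the patent does not enumerate the function words, place names, or prepositions it removes, and the special-character test (no letter, digit, or CJK character in the token) is likewise an assumption.

```python
import re

# Hypothetical filter lists; the patent does not enumerate them.
FUNCTION_WORDS = {"的", "了", "和"}
PLACE_NAMES = {"青岛", "山东"}
PREPOSITIONS = {"在", "对", "关于"}

def is_special(token):
    """True if the token contains no letter, digit, or CJK character."""
    return re.search(r"\w", token) is None

def tokens_for_word2vec(tokens):
    """word2vec input: only special characters are removed."""
    return [t for t in tokens if not is_special(t)]

def tokens_for_lda(tokens):
    """LDA input: additionally drop function words, place names, and prepositions."""
    drop = FUNCTION_WORDS | PLACE_NAMES | PREPOSITIONS
    return [t for t in tokens_for_word2vec(tokens) if t not in drop]

tokens = ["关于", "图书馆", "的", "通知", "！"]
w2v_tokens = tokens_for_word2vec(tokens)
lda_tokens = tokens_for_lda(tokens)
```

Keeping function words in the word2vec input preserves the context windows the embedding model learns from, whereas LDA benefits from a vocabulary restricted to words that can actually characterize a topic.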
As shown in Fig. 2, when topic clustering is performed on the text corpus with the LDA topic model, perplexity, as used in language modeling, measures the quality of the model: a lower perplexity indicates better generalization. Perplexity is computed as:
$$\mathrm{perplexity} = e^{-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}} \qquad (1)$$
where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K. The number of topics is varied, and the optimal number is obtained by computing the perplexity of the data set under each topic number.
Choosing the number of topics at the inflection point of the curve keeps the perplexity of the data set small without making the number of topics too large. The topic distribution of each document and the word distribution of each topic are then obtained, and the 100 words with the highest LDA values under each topic are selected as the initial keyword set.
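One way to make the inflection-point choice and the top-100 selection concrete is sketched below. The rule used for the inflection point (the smallest topic number after which perplexity improves by less than a relative tolerance) is an assumption, since the patent only refers to the curve in Fig. 2; the function names and all numbers are hypothetical.

```python
def pick_topic_count(perplexity_by_k, tolerance=0.05):
    """Smallest K after which perplexity improves by less than `tolerance` (relative)."""
    ks = sorted(perplexity_by_k)
    for prev, cur in zip(ks, ks[1:]):
        gain = (perplexity_by_k[prev] - perplexity_by_k[cur]) / perplexity_by_k[prev]
        if gain < tolerance:
            return prev
    return ks[-1]

def top_words(topic_word_probs, n=100):
    """For each topic, the n words with the highest P(word | topic)."""
    return [
        sorted(probs, key=probs.get, reverse=True)[:n]
        for probs in topic_word_probs
    ]

# Hypothetical perplexity values for candidate topic numbers.
perplexity_by_k = {5: 900.0, 10: 600.0, 15: 450.0, 20: 440.0, 25: 435.0}
chosen_k = pick_topic_count(perplexity_by_k)

# Hypothetical topic-word distribution for a single topic.
topics = [{"exam": 0.5, "course": 0.3, "notice": 0.2}]
initial_keywords = top_words(topics, n=2)
```

In practice the candidate values would come from fitting LDA once per topic number and evaluating formula (1) on the data set.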
In the word vector generation step, the segmentation result of the combined title and body text is used as input to train the model that yields a word vector for each word. This scheme uses the CBOW model with the window size set to 5 to predict the probability of the current center word, and uses negative sampling, which distinguishes the target word from words drawn from a noise distribution by logistic regression. Table 1 (word2vec model training parameter settings) gives the descriptions and default values of the key training parameters.
Table 1
This finally yields a high-dimensional vector representation of every word in the text, and the word vector model gives the similarity between any two words, i.e. their semantic distance.
The keyword network is built by the following steps:
S1: Using the cosine between word vectors, find within the same topic the 5 words semantically most similar to each initial keyword obtained in the topic clustering step, remove duplicates, and merge them with KeywordsSet_1 to form the new keyword set KeywordsSet_2;
S2: For each topic, compute the pairwise similarity between the words in KeywordsSet_2 and use its reciprocal as the weight between the two nodes;
S3: Set a threshold and filter out edges whose similarity is below the threshold; the results for different threshold values are shown in Table 2:
Table 2
1/similarity    Topic similarity
0.05            0.41
0.1             0.44
0.15            0.48
0.2             0.49
0.25            0.52
0.3             0.59
0.35            0.55
0.4             0.57
0.45            0.56
0.5             0.52
0.55            0.50
The table shows that the cohesion among the keywords of a topic is highest when the threshold is 0.3 (i.e. similarity > 3.33).
S4: Build the keyword network of each topic;
S5: Extract the topic keywords: once the keyword network is built, compute the PageRank values in each topic network, take the top 20 nodes in descending order of PageRank, and use the corresponding words to form the new keyword set KeywordsSet_final of the topic.
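The threshold choice reported in Table 2 amounts to picking the row with the highest topic similarity. A short sketch using the values from the table (the variable names are illustrative):

```python
# Table 2: threshold on 1/similarity -> topic similarity (values from the table).
table2 = {
    0.05: 0.41, 0.1: 0.44, 0.15: 0.48, 0.2: 0.49, 0.25: 0.52, 0.3: 0.59,
    0.35: 0.55, 0.4: 0.57, 0.45: 0.56, 0.5: 0.52, 0.55: 0.50,
}

# Pick the threshold whose keyword sets are most cohesive.
best_threshold = max(table2, key=table2.get)
best_score = table2[best_threshold]
```

The selection is a simple grid search; on a different corpus the same sweep would need to be repeated, since the best threshold depends on the similarity distribution of that corpus.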
As shown in Figs. 3 to 5, the scheme of the present invention was evaluated experimentally. A total of 9,802 on-campus news announcements published by a university between 2002 and 2017 were first crawled. After word segmentation, topic keywords were extracted through topic mining, word vector computation, and keyword network construction, and the results were compared with the keywords obtained by the conventional LDA topic model.
In the figures, darker words reflect the topic better, and lighter words are less relevant to the topic. Larger words rank higher under this method. It can be seen that the method of the present invention better extracts keywords that represent the topic by combining word frequency and semantics.
Of course, the above description does not limit the present invention, and the present invention is not limited to the above examples. Variations, modifications, additions, or substitutions made by those skilled in the art within the essential scope of the present invention also fall within its scope of protection.

Claims (5)

1. A topic keyword extraction method based on topic word vectors and network structure, characterized in that it comprises:
Segmenting the original text corpus into words;
Performing topic clustering on the text corpus with an LDA topic model, and obtaining for each topic the set of the 100 keywords most relevant to the topic, KeywordsSet_1 = {k_1, ..., k_100};
Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;
For each keyword in KeywordsSet_1, finding the 5 semantically most similar words; the words in KeywordsSet_1 together with their top-5 similar words form a new keyword set KeywordsSet_2;
Building a keyword network with each keyword in KeywordsSet_2 as a node and the reciprocal of the semantic similarity between two words as the edge weight, and, according to the PageRank value of each node, taking the top 20 words in KeywordsSet_2 as the keywords of the topic to form the final keyword set KeywordsSet_final.
2. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, in the word segmentation step, the original text is split into word sequences for the subsequent topic clustering and keyword extraction; when the segmentation result is used as input to word2vec, special characters are removed; when it is used as input to LDA, function words, place names that cannot serve as topic keywords, and repeated prepositions unrelated to the topic are removed.
3. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, when topic clustering is performed on the text corpus with the LDA topic model, perplexity, as used in language modeling, measures the quality of the model, a lower perplexity indicating better generalization, and perplexity is computed as:
$$\mathrm{perplexity} = e^{-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}} \qquad (1)$$
where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
4. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, in the word vector generation step, the segmentation result of the combined title and body text is used as input to train the model that yields a word vector for each word.
5. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that the keyword network is built by the following steps:
S1: Using the cosine between word vectors, find within the same topic the 5 words semantically most similar to each initial keyword obtained in the topic clustering step, remove duplicates, and merge them with KeywordsSet_1 to form the new keyword set KeywordsSet_2;
S2: For each topic, compute the pairwise similarity between the words in KeywordsSet_2 and use its reciprocal as the weight between the two nodes;
S3: Set a threshold and filter out edges whose similarity is below the threshold;
S4: Build the keyword network of each topic;
S5: Extract the topic keywords: once the keyword network is built, compute the PageRank values in each topic network, take the top 20 nodes in descending order of PageRank, and use the corresponding words as the keyword set KeywordsSet_final of the topic.
CN201711315360.0A 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure Active CN108052593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711315360.0A CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711315360.0A CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Publications (2)

Publication Number Publication Date
CN108052593A true CN108052593A (en) 2018-05-18
CN108052593B CN108052593B (en) 2020-09-22

Family

ID=62124320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711315360.0A Active CN108052593B (en) 2017-12-12 2017-12-12 Topic keyword extraction method based on topic word vector and network structure

Country Status (1)

Country Link
CN (1) CN108052593B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110060983A1 (en) * 2009-09-08 2011-03-10 Wei Jia Cai Producing a visual summarization of text documents
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
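Yujun Wen et al.: "Research on Keyword extraction based on Word2Vec weighted TextRank", 2016 2nd IEEE International Conference on Computer and Communications *
Zeng Qingtian et al.: "Topic keyword extraction method fusing topic word embedding and network structure analysis", Data Analysis and Knowledge Discovery *
Wei Qiangshen: "Domain keyword extraction: combining LDA and Word2Vec", China Master's Theses Full-text Database, Information Science and Technology *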

Cited By (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829822B (en) * 2018-06-12 2023-10-27 腾讯科技(深圳)有限公司 Media content recommendation method and device, storage medium and electronic device
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 Topic phrase extraction method
CN108984519A (en) * 2018-06-14 2018-12-11 华东理工大学 Dual-mode-based automatic event corpus construction method and device and storage medium
CN108984519B (en) * 2018-06-14 2022-07-05 华东理工大学 Dual-mode-based automatic event corpus construction method and device and storage medium
CN110020034B (en) * 2018-06-29 2023-12-08 程宇镳 Information quotation analysis method and system
CN110020034A (en) * 2018-06-29 2019-07-16 程宇镳 Information quotation analysis method and system
CN109086355A (en) * 2018-07-18 2018-12-25 北京航天云路有限公司 Hot spot association relationship analysis method and system based on news topic words
CN109376352B (en) * 2018-08-28 2022-11-29 中山大学 Patent text modeling method based on word2vec and semantic similarity
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 Patent text modeling method based on word2vec and semantic similarity
CN109522928A (en) * 2018-10-15 2019-03-26 北京邮电大学 Text theme sentiment analysis method, apparatus, electronic equipment and storage medium
CN109284366A (en) * 2018-10-17 2019-01-29 徐佳慧 Construction method and device of a homogeneous network for investment and financing institutions
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109636645A (en) * 2018-12-13 2019-04-16 平安医疗健康管理股份有限公司 Medical insurance monitoring and managing method, unit and computer readable storage medium
CN109710759A (en) * 2018-12-17 2019-05-03 北京百度网讯科技有限公司 Text dividing method, device, computer equipment and readable storage medium
CN109885831A (en) * 2019-01-30 2019-06-14 广州杰赛科技股份有限公司 Keyword extraction method, device, equipment and computer readable storage medium
CN109885831B (en) * 2019-01-30 2023-06-02 广州杰赛科技股份有限公司 Keyword extraction method, device, equipment and computer readable storage medium
CN110442855A (en) * 2019-04-10 2019-11-12 北京捷通华声科技股份有限公司 Voice analysis method and system
CN110442855B (en) * 2019-04-10 2023-11-07 北京捷通华声科技股份有限公司 Voice analysis method and system
CN110046228A (en) * 2019-04-18 2019-07-23 合肥工业大学 Short text topic identification method and system
CN110046228B (en) * 2019-04-18 2021-06-11 合肥工业大学 Short text topic identification method and system
CN110209941B (en) * 2019-06-03 2021-01-15 北京卡路里信息技术有限公司 Method for maintaining push content pool, push method, device, medium and server
CN110222347A (en) * 2019-06-20 2019-09-10 首都师范大学 Method for detecting off-topic compositions
CN110287321A (en) * 2019-06-26 2019-09-27 南京邮电大学 Electric power file classification method based on improved feature selection
CN110472005B (en) * 2019-06-27 2023-09-15 中山大学 Unsupervised keyword extraction method
CN110472005A (en) * 2019-06-27 2019-11-19 中山大学 Unsupervised keyword extraction method
CN110427492B (en) * 2019-07-10 2023-08-15 创新先进技术有限公司 Keyword library generation method and device and electronic equipment
CN110427492A (en) * 2019-07-10 2019-11-08 阿里巴巴集团控股有限公司 Keyword library generation method and device and electronic equipment
CN110750619A (en) * 2019-08-15 2020-02-04 中国平安财产保险股份有限公司 Chat record keyword extraction method and device, computer equipment and storage medium
CN110750619B (en) * 2019-08-15 2024-05-28 中国平安财产保险股份有限公司 Chat record keyword extraction method and device, computer equipment and storage medium
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN110717329A (en) * 2019-09-10 2020-01-21 上海开域信息科技有限公司 Method for carrying out approximate search and quickly extracting advertisement text theme based on word vector
CN110807326B (en) * 2019-10-24 2023-04-28 江汉大学 Short text keyword extraction method combining GPU-DMM and text features
CN111026866A (en) * 2019-10-24 2020-04-17 北京中科闻歌科技股份有限公司 Domain-oriented text information extraction clustering method, device and storage medium
CN110807326A (en) * 2019-10-24 2020-02-18 江汉大学 Short text keyword extraction method combining GPU-DMM and text features
CN110851602A (en) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Method and device for topic clustering
CN110851570B (en) * 2019-11-14 2023-04-18 中山大学 Unsupervised keyword extraction method based on Embedding technology
CN110851570A (en) * 2019-11-14 2020-02-28 中山大学 Unsupervised keyword extraction method based on Embedding technology
CN110991175B (en) * 2019-12-10 2024-04-09 爱驰汽车有限公司 Method, system, equipment and storage medium for generating text in multi-mode
CN110991175A (en) * 2019-12-10 2020-04-10 爱驰汽车有限公司 Text generation method, system, device and storage medium under multiple modes
CN111079422B (en) * 2019-12-13 2023-07-14 北京小米移动软件有限公司 Keyword extraction method, keyword extraction device and storage medium
CN111078838A (en) * 2019-12-13 2020-04-28 北京小米智能科技有限公司 Keyword extraction method, keyword extraction device and electronic equipment
CN111079422A (en) * 2019-12-13 2020-04-28 北京小米移动软件有限公司 Keyword extraction method, device and storage medium
CN111078838B (en) * 2019-12-13 2023-08-18 北京小米智能科技有限公司 Keyword extraction method, keyword extraction device and electronic equipment
EP3835995A1 (en) * 2019-12-13 2021-06-16 Beijing Xiaomi Mobile Software Co., Ltd. Method and device for keyword extraction and storage medium
US11580303B2 (en) 2019-12-13 2023-02-14 Beijing Xiaomi Mobile Software Co., Ltd. Method and device for keyword extraction and storage medium
CN113139379B (en) * 2020-01-20 2023-12-22 中国电信股份有限公司 Information identification method and system
CN113139379A (en) * 2020-01-20 2021-07-20 中国电信股份有限公司 Information identification method and system
CN111401040B (en) * 2020-03-17 2021-06-18 上海爱数信息技术股份有限公司 Keyword extraction method suitable for word text
CN111401040A (en) * 2020-03-17 2020-07-10 上海爱数信息技术股份有限公司 Keyword extraction method suitable for word text
CN111428489A (en) * 2020-03-19 2020-07-17 北京百度网讯科技有限公司 Comment generation method and device, electronic equipment and storage medium
CN111428489B (en) * 2020-03-19 2023-08-29 北京百度网讯科技有限公司 Comment generation method and device, electronic equipment and storage medium
CN111950264A (en) * 2020-08-05 2020-11-17 广东工业大学 Text data enhancement method and knowledge element extraction method
CN111950264B (en) * 2020-08-05 2024-04-26 广东工业大学 Text data enhancement method and knowledge element extraction method
CN112100317A (en) * 2020-09-24 2020-12-18 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112100317B (en) * 2020-09-24 2022-10-14 南京邮电大学 Feature keyword extraction method based on theme semantic perception
CN112270185A (en) * 2020-10-29 2021-01-26 山西大学 Text representation method based on topic model
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN113011133A (en) * 2021-02-23 2021-06-22 吉林大学珠海学院 Single cell correlation technique data analysis method based on natural language processing
CN113051917A (en) * 2021-04-23 2021-06-29 东南大学 Document implicit time inference method based on time window text similarity
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113407679B (en) * 2021-06-30 2023-10-03 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113378512A (en) * 2021-07-05 2021-09-10 中国科学技术信息研究所 Automatic indexing-based generation method of stepless dynamic evolution theme cloud picture
CN113378512B (en) * 2021-07-05 2023-05-26 中国科学技术信息研究所 Automatic indexing-based stepless dynamic evolution subject cloud image generation method
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification
CN113505581A (en) * 2021-07-27 2021-10-15 北京工商大学 Education big data text analysis method based on APSO-LSTM network
CN113591476A (en) * 2021-08-10 2021-11-02 闪捷信息科技有限公司 Data label recommendation method based on machine learning
CN113673223A (en) * 2021-08-25 2021-11-19 北京智通云联科技有限公司 Keyword extraction method and system based on semantic similarity
CN115168600A (en) * 2022-06-23 2022-10-11 广州大学 Value chain knowledge discovery method under personalized customization
US20240046119A1 (en) * 2022-06-23 2024-02-08 Guangzhou University Value chain knowledge discovery method under personalized customization
CN116431814B (en) * 2023-06-06 2023-09-05 北京中关村科金技术有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN116431814A (en) * 2023-06-06 2023-07-14 北京中关村科金技术有限公司 Information extraction method, information extraction device, electronic equipment and readable storage medium
CN116975246A (en) * 2023-08-03 2023-10-31 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal
CN116975246B (en) * 2023-08-03 2024-04-26 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal

Also Published As

Publication number Publication date
CN108052593B (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN108052593A (en) Topic keyword extraction method based on topic word vectors and network structure
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
Devika et al. Sentiment analysis: a comparative study on different approaches
CN106055538B (en) Automatic text label extraction method combining topic model and semantic analysis
CN106156204B (en) Text label extraction method and device
Thakkar et al. Graph-based algorithms for text summarization
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN105528437B (en) Question answering system construction method based on structured text knowledge extraction
CN110020189A (en) Article recommendation method based on Chinese similarity measures
CN102622338B (en) Computer-assisted computing method of semantic distance between short texts
Jafari et al. Automatic text summarization using fuzzy inference
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN106202294B (en) Related news computing method and device based on keyword and topic model fusion
CN107992542A (en) Similar article recommendation method based on topic model
CN110222172B (en) Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN110188349A (en) Automated writing method based on extractive multi-document summarization
CN108804595B (en) Short text representation method based on word2vec
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN109815400A (en) Personage's interest extracting method based on long text
CN111625622B (en) Domain ontology construction method and device, electronic equipment and storage medium
Qiu et al. Advanced sentiment classification of tibetan microblogs on smart campuses based on multi-feature fusion
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
Subramaniam et al. Test model for rich semantic graph representation for Hindi text using abstractive method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant