CN108052593A - Topic keyword extraction method based on topic word vector and network structure - Google Patents
Topic keyword extraction method based on topic word vector and network structure
- Publication number
- CN108052593A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a topic keyword extraction method based on topic word vectors and network structure, and relates to the technical field of extracting keywords from text. The method performs topic clustering on a text corpus using the LDA topic model and, for each topic, obtains the 100 keywords most relevant to that topic (top 100). Each word in the text corpus is represented as a word vector using word2vec, and the semantic similarity between every pair of words is computed. For each keyword, the five most semantically similar words (top 5) are found, and the keywords together with these words form a new keyword set. A keyword network is then built, and the top 20 words of each set are taken as the keywords of the topic. The method can extract keywords that occur frequently in the documents and can also effectively find keywords that occur less frequently but are strongly related to the topic.
Description
Technical field
The present invention relates to the technical field of extracting keywords from text, and in particular to a topic keyword extraction method based on topic word vectors and network structure.
Background
With the wide application of representation learning in natural language processing, word vectors learned with word2vec can describe and capture the semantics and syntactic regularities of words well, while topic models can explain how topics aggregate at the document level. Research on word vector representations that fuse topic models with topic keywords is therefore increasingly widespread.
LDA topic model: among the topic models that have been proposed, LDA is a generative model capable of generalizing topic distributions. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics; each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. The modeling process of LDA can be described as finding the corresponding topic mixture for each resource (i.e., P(z | d)), where each topic is described by another probability distribution (i.e., P(t | z)). This can be formally expressed as:

$$P(t_i \mid d) = \sum_{j=1}^{Z} P(t_i \mid z_i = j)\, P(z_i = j \mid d)$$

where P(t_i | d) is the probability of the i-th term for a given document d, and z_i is the latent topic. P(t_i | z_i = j) is the probability of t_i in topic j. P(z_i = j | d) is the probability of topic j in document d. The number Z of latent topics must be defined in advance. LDA uses Dirichlet prior distributions and a fixed number of topics to estimate the topic-term distribution P(t | z) and the document-topic distribution P(z | d) from an unlabeled corpus.
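The mixture relation described above, in which the probability of a term in a document is obtained by summing P(t | z = j) P(z = j | d) over the topics, can be illustrated with a minimal sketch. The function name and all numbers are hypothetical and chosen only for the example; they are not taken from the patent.

```python
# Toy illustration of P(t | d) = sum over topics j of P(t | z = j) * P(z = j | d).
# All probabilities below are made-up values for a model with Z = 2 topics.

def term_prob_given_doc(p_term_given_topics, p_topics_given_doc):
    """Mix the per-topic term probabilities by the document's topic weights."""
    return sum(
        p_tz * p_zd
        for p_tz, p_zd in zip(p_term_given_topics, p_topics_given_doc)
    )

p_term_in_topics = [0.10, 0.02]  # P(t | z = 1), P(t | z = 2)
doc_topic_mix = [0.75, 0.25]     # P(z = 1 | d), P(z = 2 | d)

p_t_d = term_prob_given_doc(p_term_in_topics, doc_topic_mix)
print(round(p_t_d, 4))
```

A term that is likely under the document's dominant topic contributes most of the mixture, which is why frequent but topic-neutral words can still score highly.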
LDA is a widely used topic model, and most other topic models are extensions of LDA. However, the keywords LDA extracts are, on the whole, too broad and do not reflect the topic of an article well. The method proposed by the present invention is therefore an innovation.
Word embedding: word embedding encodes each word as a vector (a word vector) according to syntactic and semantic information, so that similar words lie close to one another in the vector space. After a language model is built from statistics over natural text and the word vectors are obtained, the vectors can be used as the input of a neural network for syntactic analysis, sentiment analysis, and so on, or as auxiliary features to extend existing models. However, word vectors alone cannot identify the intended topic of a text, so they must be combined with a topic model.
Existing unsupervised keyword extraction techniques mainly include TF-IDF, topic models, and TextRank. Their technical shortcomings are mainly the following:
TF-IDF is a common weighting technique for information retrieval and data mining. It measures the importance of search keywords and also achieves good results when applied to text keyword extraction. However, TF-IDF is based on word frequency and the cross entropy of the keyword probability distribution; it considers neither the order in which words appear nor the relationship between each word and its context.
Widely used topic models such as LDA can mine topics from documents well, but the keywords they extract are too broad: many words have high frequency yet are unrelated to the topic and do not reflect it well, so they are unsuitable as keywords.
TextRank is a graph-based ranking algorithm for text. It splits the text into sentences, builds a graph model from the co-occurrence of words within their context, and extracts keywords according to the PageRank values in the graph. The algorithm is concise and, taking word frequency and word co-occurrence into account, extracts the keywords of a single document effectively, but it cannot identify and cluster the topics of multiple documents and therefore cannot extract the keywords of the documents under a particular topic.
Summary of the invention
The purpose of the present invention is to address the above shortcomings by proposing a keyword extraction method that combines the LDA topic model with word embedding and uses similarity propagation over a network to extract the keywords of texts on the same topic. The method can extract keywords that occur frequently in the documents and can also effectively find keywords that occur less frequently but are strongly related to the topic.
The present invention adopts the following technical solution:
As shown in Fig. 1, a topic keyword extraction method based on topic word vectors and network structure specifically includes:
Segmenting the original text corpus into words;
Performing topic clustering on the text corpus based on the LDA topic model, and obtaining, for each topic, the keyword set KeywordsSet1 = {k1, ..., k100} of the 100 words most relevant to the topic (top 100);
Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;
For each keyword in KeywordsSet1, finding the five most semantically similar words (top 5); the words in KeywordsSet1 and their top-5 similar words together form a new keyword set KeywordsSet2;
Taking each keyword in KeywordsSet2 as a node and the reciprocal of the semantic similarity between two words as the edge weight, building a keyword network, and, according to the PageRank value of each node, taking the top 20 words of KeywordsSet2 as the keywords of the topic, which form the final keyword set KeywordsSetfinal.
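The expansion from KeywordsSet1 to KeywordsSet2 described in the steps above can be sketched as follows. This is an illustration only: the function names, the two-dimensional toy vectors, and the example words are hypothetical, and real word2vec vectors would have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(seed_keywords, vectors, top_n=5):
    """Return the seeds plus each seed's top_n most similar words, deduplicated."""
    expanded = list(seed_keywords)
    seen = set(seed_keywords)
    for keyword in seed_keywords:
        if keyword not in vectors:
            continue  # no vector available for this seed
        neighbours = sorted(
            (word for word in vectors if word != keyword),
            key=lambda word: cosine(vectors[keyword], vectors[word]),
            reverse=True,
        )[:top_n]
        for word in neighbours:
            if word not in seen:
                seen.add(word)
                expanded.append(word)
    return expanded

# Hypothetical toy vectors standing in for word2vec output.
toy_vectors = {
    "exam": [1.0, 0.1],
    "test": [0.9, 0.2],
    "course": [0.7, 0.6],
    "library": [0.0, 1.0],
}
keywords_set_2 = expand_keywords(["exam"], toy_vectors, top_n=2)
print(keywords_set_2)
```

With top_n set to 5 and a seed list of 100 words, the result holds at most 600 words per topic before duplicates are removed.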
Preferably, in the word segmentation, the acquired original text is divided into word sequences for subsequent topic clustering and keyword extraction. When the segmentation result is used as the input of word2vec, special symbols are removed; when it is used as the input of LDA, function words, place names that cannot serve as topic keywords, and repeated prepositions unrelated to the topic are removed.
Preferably, topic clustering is performed on the text corpus based on the LDA topic model, and perplexity, as used in language modeling, measures the quality of the model: a lower perplexity indicates better generalization performance. Perplexity is calculated as:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}\right) \qquad (1)$$

where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
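The perplexity measure described above, the exponential of the negative average log-probability of the words under the topic mixture, can be sketched as follows. The data layout (a list of (document, word) pairs and plain dictionaries of probabilities) is an assumption made for the example, and N is taken here as the length of that list, whereas the text defines N as the number of distinct words in the corpus.

```python
import math

def perplexity(pairs, p_word_given_topic, p_topic_given_doc):
    """exp(-(1/N) * sum_i log sum_j P(w_i | t_j) * P(t_j | d)).

    pairs: list of (doc_id, word) entries; N is taken as len(pairs).
    p_word_given_topic: one dict per topic mapping word -> probability.
    p_topic_given_doc: dict mapping doc_id -> list of topic probabilities.
    Every word must have non-zero probability under the mixture.
    """
    log_likelihood = 0.0
    for doc_id, word in pairs:
        mixture = sum(
            topic.get(word, 0.0) * p_topic_given_doc[doc_id][j]
            for j, topic in enumerate(p_word_given_topic)
        )
        log_likelihood += math.log(mixture)
    return math.exp(-log_likelihood / len(pairs))

# Hypothetical model: two topics, each uniform over a four-word vocabulary.
uniform = {"exam": 0.25, "course": 0.25, "book": 0.25, "loan": 0.25}
value = perplexity(
    [("d1", "exam"), ("d1", "book")],
    [uniform, dict(uniform)],
    {"d1": [0.5, 0.5]},
)
print(value)
```

For a model that is uniform over a vocabulary of four words, the perplexity equals the vocabulary size, which gives a quick sanity check for an implementation.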
Preferably, in the word vector generation, the word segmentation result of the combined title and body text is used as input to obtain a model that represents each word as a word vector.
Preferably, in the keyword network construction, the construction steps specifically include:
S1: Using the cosine relation between word vectors, computing, under the same topic, the top 5 words most semantically similar to each initial keyword obtained in the topic clustering step, removing duplicates, and combining them with the keyword set KeywordsSet1 to form the new keyword set KeywordsSet2;
S2: For each topic, computing the pairwise similarity of the words in KeywordsSet2 and taking its reciprocal as the weight between two nodes;
S3: Setting a threshold and filtering out edges whose similarity is below the threshold;
S4: Building the keyword network of each topic;
S5: Extracting topic keywords: after the keyword network is built, computing the top 20 nodes of each topic network in descending order of PageRank value, and taking the corresponding words as the keyword set KeywordsSetfinal of the topic.
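Steps S2 to S5 can be sketched as below. This is one possible reading rather than the patented implementation: edges are kept when their similarity is at least the threshold, the edge weight is the reciprocal of the similarity as stated in S2, and PageRank is computed by a simple power iteration over an undirected weighted graph. The function names, the damping factor of 0.85, the iteration count, and the toy similarities are all assumptions.

```python
def build_edges(similarity, min_similarity):
    """S2/S3: drop pairs below the threshold; weight = 1 / similarity.

    min_similarity should be positive so that no division by zero occurs.
    """
    return {
        pair: 1.0 / sim
        for pair, sim in similarity.items()
        if sim >= min_similarity
    }

def weighted_pagerank(nodes, edges, damping=0.85, iterations=50):
    """S4/S5: power-iteration PageRank on an undirected weighted graph."""
    neighbours = {n: {} for n in nodes}
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbour m passes on its rank in proportion to edge weight.
            incoming = sum(
                rank[m] * weight / sum(neighbours[m].values())
                for m, weight in neighbours[n].items()
            )
            updated[n] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank

def top_keywords(rank, k=20):
    """Return the k nodes with the highest PageRank value."""
    return sorted(rank, key=rank.get, reverse=True)[:k]

# Hypothetical pairwise similarities for one topic.
toy_similarity = {
    ("exam", "course"): 0.8,
    ("exam", "teaching"): 0.5,
    ("exam", "grade"): 0.5,
    ("course", "teaching"): 0.2,
}
toy_nodes = ["exam", "course", "teaching", "grade"]
toy_edges = build_edges(toy_similarity, min_similarity=0.4)
toy_rank = weighted_pagerank(toy_nodes, toy_edges)
print(top_keywords(toy_rank, k=2))
```

In the toy graph the word connected to every other word ends up with the highest rank, which matches the idea of taking the core nodes of the network as the topic keywords.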
The present invention has the following beneficial effects:
The present invention first clusters the text corpus based on the LDA topic model; second, it represents each word in the text corpus as a word vector using word2vec; then, for each keyword, it obtains the top 5 most similar words in the documents of the topic, which together with the keywords form a new keyword set. Finally, with keywords as nodes and the similarity between words as the edge weights, it builds a keyword network and takes the core nodes of the network as the keywords of the topic;
The method combines the LDA topic model with word embedding and uses similarity propagation over a network to extract the keywords of texts on the same topic; it can extract keywords that occur frequently in the documents and, at the same time, effectively find keywords that occur less frequently but are strongly related to the topic;
On top of word frequency, the method performs a second round of keyword discovery based on word vector relations, adding words that are not frequent but are semantically similar to the candidate keyword set. This reasonably widens the range of candidate keywords, so that the keywords finally obtained for a topic are more closely related semantically;
The method introduces word vectors and builds the network from the distances between them, which finds keywords with similar meanings under the same topic more accurately and thus yields more accurate results.
Description of the drawings
Fig. 1 is a flow chart of the topic keyword extraction method based on topic word vectors and network structure;
Fig. 2 is the perplexity curve;
Fig. 3 is the keyword distribution map for teaching notices;
Fig. 4 is the keyword distribution map for award and evaluation notices;
Fig. 5 is the keyword distribution map for library notices.
Detailed description of the embodiments
The specific embodiments of the present invention are further described below with reference to the drawings:
As shown in Fig. 1, a topic keyword extraction method based on topic word vectors and network structure specifically includes:
Segmenting the original text corpus into words;
Performing topic clustering on the text corpus based on the LDA topic model, and obtaining, for each topic, the keyword set KeywordsSet1 = {k1, ..., k100} of the 100 words most relevant to the topic (top 100);
Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;
For each keyword in KeywordsSet1, finding the five most semantically similar words (top 5); the words in KeywordsSet1 and their top-5 similar words together form a new keyword set KeywordsSet2;
Taking each keyword in KeywordsSet2 as a node and the reciprocal of the semantic similarity between two words as the edge weight, building a keyword network, and, according to the PageRank value of each node, taking the top 20 words of KeywordsSet2 as the keywords of the topic, which form the final keyword set KeywordsSetfinal.
In the word segmentation, the acquired original text is divided into word sequences for subsequent topic clustering and keyword extraction. When the segmentation result is used as the input of word2vec, special symbols are removed; when it is used as the input of LDA, function words, place names that cannot serve as topic keywords, and prepositions that repeat in large numbers but are unrelated to the topic are removed.
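The two filtering rules for the segmentation output can be sketched as follows, assuming the text has already been split into tokens by a word segmenter. The stop lists and example tokens are hypothetical placeholders; a real system would use curated lists of function words and place names.

```python
# Hypothetical stop lists; the patent does not specify their contents.
FUNCTION_WORDS = {"of", "the", "in"}
PLACE_NAMES = {"beijing"}

def tokens_for_word2vec(tokens):
    """Keep tokens that contain at least one letter or digit (drops symbols)."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

def tokens_for_lda(tokens):
    """Additionally drop function words (including prepositions) and place names."""
    return [
        t
        for t in tokens_for_word2vec(tokens)
        if t.lower() not in FUNCTION_WORDS and t.lower() not in PLACE_NAMES
    ]

segmented = ["Notice", "of", "the", "exam", "in", "Beijing", "!", "##"]
print(tokens_for_word2vec(segmented))
print(tokens_for_lda(segmented))
```

Keeping function words in the word2vec input preserves the context windows used for training, while removing them from the LDA input keeps uninformative words out of the topic-word distributions.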
As shown in Fig. 2, topic clustering is performed on the text corpus based on the LDA topic model, and perplexity, as used in language modeling, measures the quality of the model: a lower perplexity indicates better generalization performance. Perplexity is calculated as:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}\right) \qquad (1)$$

where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K. The number of topics is varied, and the optimal number of topics is obtained by computing the perplexity of the data set for each number of topics.
Selecting the value at the inflection point of the curve keeps the perplexity of the data set small without making the number of topics too large. The topic distribution of each document and the word distribution of each topic are then obtained, and the top 100 words by LDA value under each topic are selected as the initial keyword set.
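Choosing the number of topics at the inflection point of the perplexity curve can be approximated in several ways; the text does not say how the point is located. The sketch below uses one simple heuristic as an assumption: among evenly spaced candidate topic counts, pick the one where the drop in perplexity slows down the most (the largest discrete second difference). The function name and the perplexity values are hypothetical.

```python
def pick_topic_count(perplexity_by_k):
    """Return the topic count where the perplexity curve bends most sharply.

    perplexity_by_k maps evenly spaced candidate topic counts to perplexity.
    The bend at k is (drop before k) minus (drop after k).
    """
    ks = sorted(perplexity_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = perplexity_by_k[prev_k] - perplexity_by_k[k]
        drop_after = perplexity_by_k[k] - perplexity_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up perplexity values for five candidate topic counts.
toy_curve = {5: 900.0, 10: 700.0, 15: 480.0, 20: 460.0, 25: 450.0}
chosen_k = pick_topic_count(toy_curve)
print(chosen_k)
```

In practice the curve is often inspected visually as well, since a heuristic of this kind is sensitive to noise in the perplexity estimates.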
In the word vector generation, the word segmentation result of the combined title and body text is used as input to obtain a model that represents each word as a word vector. In this scheme, the CBOW model is selected with the window size set to 5 to predict the probability of the current center word, and negative sampling is selected, which uses logistic regression to distinguish target words from words drawn from a noise distribution. Table 1 (word2vec model training parameter settings) gives the explanation and default values of the key parameters in the training process.
Table 1
High-dimensional vector representations of all words in the text are finally obtained, and the word vector model gives the similarity relation between all words, i.e., their semantic distance.
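To make the CBOW setting with a window size of 5 more concrete, the sketch below shows how (context, center word) training pairs could be formed from a token sequence. It only illustrates the windowing; it does not train a model and does not implement negative sampling. The function name and tokens are hypothetical.

```python
def cbow_pairs(tokens, window=5):
    """Return (context_words, center_word) pairs for a CBOW-style model.

    The context of a word is up to `window` tokens on each side of it.
    """
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, center))
    return pairs

# A window of 1 keeps the toy output short; the scheme above uses 5.
toy_pairs = cbow_pairs(["a", "b", "c", "d"], window=1)
for context, center in toy_pairs:
    print(context, "->", center)
```

A CBOW model is trained to predict the center word from the combined context vectors, so words that share contexts end up with similar vectors.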
In the keyword network construction, the construction steps specifically include:
S1: Using the cosine relation between word vectors, computing, under the same topic, the top 5 words most semantically similar to each initial keyword obtained in the topic clustering step, removing duplicates, and combining them with the keyword set KeywordsSet1 to form the new keyword set KeywordsSet2;
S2: For each topic, computing the pairwise similarity of the words in KeywordsSet2 and taking its reciprocal as the weight between two nodes;
S3: Setting a threshold and filtering out edges whose similarity is below the threshold; the results for different threshold values are shown in Table 2:
Table 2
1/similarity | Topic similarity |
---|---|
0.05 | 0.41 |
0.1 | 0.44 |
0.15 | 0.48 |
0.2 | 0.49 |
0.25 | 0.52 |
0.3 | 0.59 |
0.35 | 0.55 |
0.4 | 0.57 |
0.45 | 0.56 |
0.5 | 0.52 |
0.55 | 0.50 |
As the table shows, when the threshold is set to 0.3 (i.e., similarity > 3.33), the cohesion between keywords under the same topic is higher.
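The choice of threshold can be reproduced from the values in Table 2 by taking the row with the highest topic similarity. The sketch below simply encodes the table as listed; the variable names are hypothetical.

```python
# (threshold on 1/similarity, topic similarity) rows copied from Table 2.
TABLE_2 = [
    (0.05, 0.41), (0.10, 0.44), (0.15, 0.48), (0.20, 0.49),
    (0.25, 0.52), (0.30, 0.59), (0.35, 0.55), (0.40, 0.57),
    (0.45, 0.56), (0.50, 0.52), (0.55, 0.50),
]

# Pick the row whose topic similarity (second column) is largest.
best_threshold, best_topic_similarity = max(TABLE_2, key=lambda row: row[1])
print(best_threshold, best_topic_similarity)
```

The topic similarity rises up to the 0.3 row and is lower for every larger threshold in the table, so that row is the maximum.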
S4: Building the keyword network of each topic;
S5: Extracting topic keywords: after the keyword network is built, computing the top 20 nodes of each topic network in descending order of PageRank value, and taking the corresponding words to form the new keyword set KeywordsSetfinal of the topic.
As shown in Figs. 3 to 5, the scheme of the present invention was tested experimentally. A total of 9,802 campus news announcements published by a university from 2002 to 2017 were first crawled; after word segmentation, topic keywords were extracted through topic mining, word vector computation, keyword network construction, and other steps, and the results were compared with the keywords obtained by the traditional LDA topic model.
In the figures, darker words are those that reflect the topic better, and lighter words have lower relevance to the topic. Larger words rank higher under this method. It can be seen that the method of the present invention, by combining word frequency and semantics, better extracts keywords that represent the topic.
Of course, the above description does not limit the present invention, and the present invention is not limited to the above examples. Variations, modifications, additions, or substitutions made by those skilled in the art within the essential scope of the present invention also fall within the protection scope of the present invention.
Claims (5)
1. A topic keyword extraction method based on topic word vectors and network structure, characterized in that it specifically includes:
Segmenting the original text corpus into words;
Performing topic clustering on the text corpus based on the LDA topic model, and obtaining, for each topic, the keyword set KeywordsSet1 = {k1, ..., k100} of the 100 words most relevant to the topic (top 100);
Representing each word in the text corpus as a word vector using word2vec, and obtaining the semantic similarity between every pair of words by computing the cosine between their word vectors;
For each keyword in KeywordsSet1, finding the five most semantically similar words (top 5); the words in KeywordsSet1 and their top-5 similar words together form a new keyword set KeywordsSet2;
Taking each keyword in KeywordsSet2 as a node and the reciprocal of the semantic similarity between two words as the edge weight, building a keyword network, and, according to the PageRank value of each node, taking the top 20 words of KeywordsSet2 as the keywords of the topic, which form the final keyword set KeywordsSetfinal.
2. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, in the word segmentation, the acquired original text is divided into word sequences for subsequent topic clustering and keyword extraction; when the segmentation result is used as the input of word2vec, special symbols are removed; when it is used as the input of LDA, function words, place names that cannot serve as topic keywords, and repeated prepositions unrelated to the topic are removed.
3. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that topic clustering is performed on the text corpus based on the LDA topic model, and perplexity, as used in language modeling, measures the quality of the model, a lower perplexity indicating better generalization performance; perplexity is calculated as follows:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{i=1}^{N} \log \sum_{j=1}^{K} P(w_i \mid t_j)\, P(t_j \mid d)}{N}\right) \qquad (1)$$

where P(w_i | t_j) is the distribution of word w_i over topic t_j, P(t_j | d) is the distribution of topic t_j over document d, N is the total number of distinct words in the corpus, K is the number of topics, i = 1, ..., N, and j = 1, ..., K.
4. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, in the word vector generation, the word segmentation result of the combined title and body text is used as input to obtain a model that represents each word as a word vector.
5. The topic keyword extraction method based on topic word vectors and network structure according to claim 1, characterized in that, in the keyword network construction, the construction steps specifically include:
S1: Using the cosine relation between word vectors, computing, under the same topic, the top 5 words most semantically similar to each initial keyword obtained in the topic clustering step, removing duplicates, and combining them with the keyword set KeywordsSet1 to form the new keyword set KeywordsSet2;
S2: For each topic, computing the pairwise similarity of the words in KeywordsSet2 and taking its reciprocal as the weight between two nodes;
S3: Setting a threshold and filtering out edges whose similarity is below the threshold;
S4: Building the keyword network of each topic;
S5: Extracting topic keywords: after the keyword network is built, computing the top 20 nodes of each topic network in descending order of PageRank value, and taking the corresponding words as the keyword set KeywordsSetfinal of the topic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711315360.0A CN108052593B (en) | 2017-12-12 | 2017-12-12 | Topic keyword extraction method based on topic word vector and network structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711315360.0A CN108052593B (en) | 2017-12-12 | 2017-12-12 | Topic keyword extraction method based on topic word vector and network structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108052593A true CN108052593A (en) | 2018-05-18 |
CN108052593B CN108052593B (en) | 2020-09-22 |
Family
ID=62124320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711315360.0A Active CN108052593B (en) | 2017-12-12 | 2017-12-12 | Topic keyword extraction method based on topic word vector and network structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108052593B (en) |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108920454A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of theme phrase extraction method |
CN108984519A (en) * | 2018-06-14 | 2018-12-11 | 华东理工大学 | Event corpus method for auto constructing, device and storage medium based on double mode |
CN109086355A (en) * | 2018-07-18 | 2018-12-25 | 北京航天云路有限公司 | Hot spot association relationship analysis method and system based on theme of news word |
CN109284366A (en) * | 2018-10-17 | 2019-01-29 | 徐佳慧 | A kind of construction method and device of the homogenous network towards investment and financing mechanism |
CN109376352A (en) * | 2018-08-28 | 2019-02-22 | 中山大学 | A kind of patent text modeling method based on word2vec and semantic similarity |
CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | Based on RNN, the news recommended method of attention mechanism and theme characterizing method |
CN109522928A (en) * | 2018-10-15 | 2019-03-26 | 北京邮电大学 | Theme sentiment analysis method, apparatus, electronic equipment and the storage medium of text |
CN109636645A (en) * | 2018-12-13 | 2019-04-16 | 平安医疗健康管理股份有限公司 | Medical insurance monitoring and managing method, unit and computer readable storage medium |
CN109710759A (en) * | 2018-12-17 | 2019-05-03 | 北京百度网讯科技有限公司 | Text dividing method, device, computer equipment and readable storage medium storing program for executing |
CN109885831A (en) * | 2019-01-30 | 2019-06-14 | 广州杰赛科技股份有限公司 | Key Term abstracting method, device, equipment and computer readable storage medium |
CN110020034A (en) * | 2018-06-29 | 2019-07-16 | 程宇镳 | A kind of information citation analysis method and system |
CN110046228A (en) * | 2019-04-18 | 2019-07-23 | 合肥工业大学 | Short text subject identifying method and system |
CN110222347A (en) * | 2019-06-20 | 2019-09-10 | 首都师范大学 | A kind of detection method that digresses from the subject of writing a composition |
CN110287321A (en) * | 2019-06-26 | 2019-09-27 | 南京邮电大学 | A kind of electric power file classification method based on improvement feature selecting |
CN110427492A (en) * | 2019-07-10 | 2019-11-08 | 阿里巴巴集团控股有限公司 | Generate the method, apparatus and electronic equipment of keywords database |
CN110442855A (en) * | 2019-04-10 | 2019-11-12 | 北京捷通华声科技股份有限公司 | A kind of speech analysis method and system |
CN110472005A (en) * | 2019-06-27 | 2019-11-19 | 中山大学 | A kind of unsupervised keyword extracting method |
CN110717329A (en) * | 2019-09-10 | 2020-01-21 | 上海开域信息科技有限公司 | Method for carrying out approximate search and quickly extracting advertisement text theme based on word vector |
CN110750619A (en) * | 2019-08-15 | 2020-02-04 | 中国平安财产保险股份有限公司 | Chat record keyword extraction method and device, computer equipment and storage medium |
CN110807326A (en) * | 2019-10-24 | 2020-02-18 | 江汉大学 | Short text keyword extraction method combining GPU-DMM and text features |
CN110851570A (en) * | 2019-11-14 | 2020-02-28 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN110851602A (en) * | 2019-11-13 | 2020-02-28 | 精硕科技(北京)股份有限公司 | Method and device for topic clustering |
CN110991175A (en) * | 2019-12-10 | 2020-04-10 | 爱驰汽车有限公司 | Text generation method, system, device and storage medium under multiple modes |
CN111026866A (en) * | 2019-10-24 | 2020-04-17 | 北京中科闻歌科技股份有限公司 | Domain-oriented text information extraction clustering method, device and storage medium |
CN111079422A (en) * | 2019-12-13 | 2020-04-28 | 北京小米移动软件有限公司 | Keyword extraction method, device and storage medium |
CN111078838A (en) * | 2019-12-13 | 2020-04-28 | 北京小米智能科技有限公司 | Keyword extraction method, keyword extraction device and electronic equipment |
CN111401040A (en) * | 2020-03-17 | 2020-07-10 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111428489A (en) * | 2020-03-19 | 2020-07-17 | 北京百度网讯科技有限公司 | Comment generation method and device, electronic equipment and storage medium |
CN111950264A (en) * | 2020-08-05 | 2020-11-17 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN112100317A (en) * | 2020-09-24 | 2020-12-18 | 南京邮电大学 | Feature keyword extraction method based on theme semantic perception |
CN110209941B (en) * | 2019-06-03 | 2021-01-15 | 北京卡路里信息技术有限公司 | Method for maintaining push content pool, push method, device, medium and server |
CN112270185A (en) * | 2020-10-29 | 2021-01-26 | 山西大学 | Text representation method based on topic model |
CN112508376A (en) * | 2020-11-30 | 2021-03-16 | 中国科学院深圳先进技术研究院 | Index system construction method |
CN113011133A (en) * | 2021-02-23 | 2021-06-22 | 吉林大学珠海学院 | Single cell correlation technique data analysis method based on natural language processing |
CN113051917A (en) * | 2021-04-23 | 2021-06-29 | 东南大学 | Document implicit time inference method based on time window text similarity |
CN113139379A (en) * | 2020-01-20 | 2021-07-20 | 中国电信股份有限公司 | Information identification method and system |
CN113378512A (en) * | 2021-07-05 | 2021-09-10 | 中国科学技术信息研究所 | Automatic indexing-based generation method of stepless dynamic evolution theme cloud picture |
CN113407679A (en) * | 2021-06-30 | 2021-09-17 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
CN113486176A (en) * | 2021-07-08 | 2021-10-08 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
CN113505581A (en) * | 2021-07-27 | 2021-10-15 | 北京工商大学 | Education big data text analysis method based on APSO-LSTM network |
CN113591476A (en) * | 2021-08-10 | 2021-11-02 | 闪捷信息科技有限公司 | Data label recommendation method based on machine learning |
CN113673223A (en) * | 2021-08-25 | 2021-11-19 | 北京智通云联科技有限公司 | Keyword extraction method and system based on semantic similarity |
CN115168600A (en) * | 2022-06-23 | 2022-10-11 | 广州大学 | Value chain knowledge discovery method under personalized customization |
CN116431814A (en) * | 2023-06-06 | 2023-07-14 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
CN108829822B (en) * | 2018-06-12 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Media content recommendation method and device, storage medium and electronic device |
CN116975246A (en) * | 2023-08-03 | 2023-10-31 | 深圳市博锐高科科技有限公司 | Data acquisition method, device, chip and terminal |
CN110750619B (en) * | 2019-08-15 | 2024-05-28 | 中国平安财产保险股份有限公司 | Chat record keyword extraction method and device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110060983A1 (en) * | 2009-09-08 | 2011-03-10 | Wei Jia Cai | Producing a visual summarization of text documents |
CN104778161A (en) * | 2015-04-30 | 2015-07-15 | 车智互联(北京)科技有限公司 | Keyword extracting method based on Word2Vec and Query log |
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
- 2017-12-12: CN application CN201711315360.0A (patent CN108052593B), status: Active
Non-Patent Citations (3)
Title |
---|
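YUJUN WEN et al.: "Research on Keyword extraction based on Word2Vec weighted TextRank", 2016 2nd IEEE International Conference on Computer and Communications * |
曾庆田 et al.: "Topic keyword extraction method fusing topic word embedding and network structure analysis", Data Analysis and Knowledge Discovery (《数据分析与知识发现》) * |
韦强申: "Domain keyword extraction: combining LDA and Word2Vec", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |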
Cited By (74)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829822B (en) * | 2018-06-12 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Media content recommendation method and device, storage medium and electronic device |
CN108920454A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of theme phrase extraction method |
CN108984519A (en) * | 2018-06-14 | 2018-12-11 | 华东理工大学 | Dual-mode-based automatic event corpus construction method, device and storage medium |
CN108984519B (en) * | 2018-06-14 | 2022-07-05 | 华东理工大学 | Dual-mode-based automatic event corpus construction method and device and storage medium |
CN110020034B (en) * | 2018-06-29 | 2023-12-08 | 程宇镳 | Information quotation analysis method and system |
CN110020034A (en) * | 2018-06-29 | 2019-07-16 | 程宇镳 | A kind of information citation analysis method and system |
CN109086355A (en) * | 2018-07-18 | 2018-12-25 | 北京航天云路有限公司 | Hot spot association relationship analysis method and system based on theme of news word |
CN109376352B (en) * | 2018-08-28 | 2022-11-29 | 中山大学 | Patent text modeling method based on word2vec and semantic similarity |
CN109376352A (en) * | 2018-08-28 | 2019-02-22 | 中山大学 | A kind of patent text modeling method based on word2vec and semantic similarity |
CN109522928A (en) * | 2018-10-15 | 2019-03-26 | 北京邮电大学 | Theme sentiment analysis method, apparatus, electronic equipment and the storage medium of text |
CN109284366A (en) * | 2018-10-17 | 2019-01-29 | 徐佳慧 | A kind of construction method and device of the homogenous network towards investment and financing mechanism |
CN109492157B (en) * | 2018-10-24 | 2021-08-31 | 华侨大学 | News recommendation method and theme characterization method based on RNN and attention mechanism |
CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | News recommendation method and theme characterization method based on RNN and attention mechanism |
CN109636645A (en) * | 2018-12-13 | 2019-04-16 | 平安医疗健康管理股份有限公司 | Medical insurance monitoring and managing method, unit and computer readable storage medium |
CN109710759A (en) * | 2018-12-17 | 2019-05-03 | 北京百度网讯科技有限公司 | Text dividing method, device, computer equipment and readable storage medium |
CN109885831A (en) * | 2019-01-30 | 2019-06-14 | 广州杰赛科技股份有限公司 | Keyword extraction method, device, equipment and computer readable storage medium |
CN109885831B (en) * | 2019-01-30 | 2023-06-02 | 广州杰赛科技股份有限公司 | Keyword extraction method, device, equipment and computer readable storage medium |
CN110442855A (en) * | 2019-04-10 | 2019-11-12 | 北京捷通华声科技股份有限公司 | A kind of speech analysis method and system |
CN110442855B (en) * | 2019-04-10 | 2023-11-07 | 北京捷通华声科技股份有限公司 | Voice analysis method and system |
CN110046228A (en) * | 2019-04-18 | 2019-07-23 | 合肥工业大学 | Short text subject identifying method and system |
CN110046228B (en) * | 2019-04-18 | 2021-06-11 | 合肥工业大学 | Short text topic identification method and system |
CN110209941B (en) * | 2019-06-03 | 2021-01-15 | 北京卡路里信息技术有限公司 | Method for maintaining push content pool, push method, device, medium and server |
CN110222347A (en) * | 2019-06-20 | 2019-09-10 | 首都师范大学 | A method for detecting off-topic compositions |
CN110287321A (en) * | 2019-06-26 | 2019-09-27 | 南京邮电大学 | A kind of electric power file classification method based on improvement feature selecting |
CN110472005B (en) * | 2019-06-27 | 2023-09-15 | 中山大学 | Unsupervised keyword extraction method |
CN110472005A (en) * | 2019-06-27 | 2019-11-19 | 中山大学 | A kind of unsupervised keyword extracting method |
CN110427492B (en) * | 2019-07-10 | 2023-08-15 | 创新先进技术有限公司 | Keyword library generation method and device and electronic equipment |
CN110427492A (en) * | 2019-07-10 | 2019-11-08 | 阿里巴巴集团控股有限公司 | Keyword library generation method, device and electronic equipment |
CN110750619A (en) * | 2019-08-15 | 2020-02-04 | 中国平安财产保险股份有限公司 | Chat record keyword extraction method and device, computer equipment and storage medium |
CN110750619B (en) * | 2019-08-15 | 2024-05-28 | 中国平安财产保险股份有限公司 | Chat record keyword extraction method and device, computer equipment and storage medium |
CN110717329B (en) * | 2019-09-10 | 2023-06-16 | 上海开域信息科技有限公司 | Method for performing approximate search based on word vector to rapidly extract advertisement text theme |
CN110717329A (en) * | 2019-09-10 | 2020-01-21 | 上海开域信息科技有限公司 | Method for carrying out approximate search and quickly extracting advertisement text theme based on word vector |
CN110807326B (en) * | 2019-10-24 | 2023-04-28 | 江汉大学 | Short text keyword extraction method combining GPU-DMM and text features |
CN111026866A (en) * | 2019-10-24 | 2020-04-17 | 北京中科闻歌科技股份有限公司 | Domain-oriented text information extraction clustering method, device and storage medium |
CN110807326A (en) * | 2019-10-24 | 2020-02-18 | 江汉大学 | Short text keyword extraction method combining GPU-DMM and text features |
CN110851602A (en) * | 2019-11-13 | 2020-02-28 | 精硕科技(北京)股份有限公司 | Method and device for topic clustering |
CN110851570B (en) * | 2019-11-14 | 2023-04-18 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN110851570A (en) * | 2019-11-14 | 2020-02-28 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN110991175B (en) * | 2019-12-10 | 2024-04-09 | 爱驰汽车有限公司 | Method, system, equipment and storage medium for generating text in multi-mode |
CN110991175A (en) * | 2019-12-10 | 2020-04-10 | 爱驰汽车有限公司 | Text generation method, system, device and storage medium under multiple modes |
CN111079422B (en) * | 2019-12-13 | 2023-07-14 | 北京小米移动软件有限公司 | Keyword extraction method, keyword extraction device and storage medium |
CN111078838A (en) * | 2019-12-13 | 2020-04-28 | 北京小米智能科技有限公司 | Keyword extraction method, keyword extraction device and electronic equipment |
CN111079422A (en) * | 2019-12-13 | 2020-04-28 | 北京小米移动软件有限公司 | Keyword extraction method, device and storage medium |
CN111078838B (en) * | 2019-12-13 | 2023-08-18 | 北京小米智能科技有限公司 | Keyword extraction method, keyword extraction device and electronic equipment |
EP3835995A1 (en) * | 2019-12-13 | 2021-06-16 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and device for keyword extraction and storage medium |
US11580303B2 (en) | 2019-12-13 | 2023-02-14 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and device for keyword extraction and storage medium |
CN113139379B (en) * | 2020-01-20 | 2023-12-22 | 中国电信股份有限公司 | Information identification method and system |
CN113139379A (en) * | 2020-01-20 | 2021-07-20 | 中国电信股份有限公司 | Information identification method and system |
CN111401040B (en) * | 2020-03-17 | 2021-06-18 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111401040A (en) * | 2020-03-17 | 2020-07-10 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111428489A (en) * | 2020-03-19 | 2020-07-17 | 北京百度网讯科技有限公司 | Comment generation method and device, electronic equipment and storage medium |
CN111428489B (en) * | 2020-03-19 | 2023-08-29 | 北京百度网讯科技有限公司 | Comment generation method and device, electronic equipment and storage medium |
CN111950264A (en) * | 2020-08-05 | 2020-11-17 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN111950264B (en) * | 2020-08-05 | 2024-04-26 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
CN112100317A (en) * | 2020-09-24 | 2020-12-18 | 南京邮电大学 | Feature keyword extraction method based on theme semantic perception |
CN112100317B (en) * | 2020-09-24 | 2022-10-14 | 南京邮电大学 | Feature keyword extraction method based on theme semantic perception |
CN112270185A (en) * | 2020-10-29 | 2021-01-26 | 山西大学 | Text representation method based on topic model |
CN112508376A (en) * | 2020-11-30 | 2021-03-16 | 中国科学院深圳先进技术研究院 | Index system construction method |
CN113011133A (en) * | 2021-02-23 | 2021-06-22 | 吉林大学珠海学院 | Single cell correlation technique data analysis method based on natural language processing |
CN113051917A (en) * | 2021-04-23 | 2021-06-29 | 东南大学 | Document implicit time inference method based on time window text similarity |
CN113407679A (en) * | 2021-06-30 | 2021-09-17 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
CN113407679B (en) * | 2021-06-30 | 2023-10-03 | 竹间智能科技(上海)有限公司 | Text topic mining method and device, electronic equipment and storage medium |
CN113378512A (en) * | 2021-07-05 | 2021-09-10 | 中国科学技术信息研究所 | Automatic indexing-based generation method of stepless dynamic evolution theme cloud picture |
CN113378512B (en) * | 2021-07-05 | 2023-05-26 | 中国科学技术信息研究所 | Automatic indexing-based stepless dynamic evolution subject cloud image generation method |
CN113486176A (en) * | 2021-07-08 | 2021-10-08 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
CN113505581A (en) * | 2021-07-27 | 2021-10-15 | 北京工商大学 | Education big data text analysis method based on APSO-LSTM network |
CN113591476A (en) * | 2021-08-10 | 2021-11-02 | 闪捷信息科技有限公司 | Data label recommendation method based on machine learning |
CN113673223A (en) * | 2021-08-25 | 2021-11-19 | 北京智通云联科技有限公司 | Keyword extraction method and system based on semantic similarity |
CN115168600A (en) * | 2022-06-23 | 2022-10-11 | 广州大学 | Value chain knowledge discovery method under personalized customization |
US20240046119A1 (en) * | 2022-06-23 | 2024-02-08 | Guangzhou University | Value chain knowledge discovery method under personalized customization |
CN116431814B (en) * | 2023-06-06 | 2023-09-05 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
CN116431814A (en) * | 2023-06-06 | 2023-07-14 | 北京中关村科金技术有限公司 | Information extraction method, information extraction device, electronic equipment and readable storage medium |
CN116975246A (en) * | 2023-08-03 | 2023-10-31 | 深圳市博锐高科科技有限公司 | Data acquisition method, device, chip and terminal |
CN116975246B (en) * | 2023-08-03 | 2024-04-26 | 深圳市博锐高科科技有限公司 | Data acquisition method, device, chip and terminal |
Also Published As
Publication number | Publication date |
---|---|
CN108052593B (en) | 2020-09-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052593A (en) | | A kind of subject key words extracting method based on descriptor vector sum network structure |
CN106844658B (en) | | Automatic construction method and system of Chinese text knowledge graph |
Devika et al. | | Sentiment analysis: a comparative study on different approaches |
CN106055538B (en) | | The automatic abstracting method of the text label that topic model and semantic analysis combine |
CN106156204B (en) | | Text label extraction method and device |
Thakkar et al. | | Graph-based algorithms for text summarization |
CN109376352B (en) | | Patent text modeling method based on word2vec and semantic similarity |
CN105528437B (en) | | A kind of question answering system construction method extracted based on structured text knowledge |
CN110020189A (en) | | A kind of article recommended method based on Chinese Similarity measures |
CN102622338B (en) | | Computer-assisted computing method of semantic distance between short texts |
Jafari et al. | | Automatic text summarization using fuzzy inference |
WO2019080863A1 (en) | | Text sentiment classification method, storage medium and computer |
CN104391942A (en) | | Short text characteristic expanding method based on semantic atlas |
CN106202294B (en) | | Related news computing method and device based on keyword and topic model fusion |
CN107992542A (en) | | A kind of similar article based on topic model recommends method |
CN110222172B (en) | | Multi-source network public opinion theme mining method based on improved hierarchical clustering |
CN110188349A (en) | | A kind of automation writing method based on extraction-type multiple file summarization method |
CN108804595B (en) | | Short text representation method based on word2vec |
CN108920482B (en) | | Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model |
CN108681574A (en) | | A kind of non-true class quiz answers selection method and system based on text snippet |
CN109815400A (en) | | Personage's interest extracting method based on long text |
CN111625622B (en) | | Domain ontology construction method and device, electronic equipment and storage medium |
Qiu et al. | | Advanced sentiment classification of tibetan microblogs on smart campuses based on multi-feature fusion |
Sadr et al. | | Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms |
Subramaniam et al. | | Test model for rich semantic graph representation for Hindi text using abstractive method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||