CN106055538A - Automatic extraction method for text labels in combination with theme model and semantic analyses - Google Patents
- Publication number
- CN106055538A CN106055538A CN201610361639.1A CN201610361639A CN106055538A CN 106055538 A CN106055538 A CN 106055538A CN 201610361639 A CN201610361639 A CN 201610361639A CN 106055538 A CN106055538 A CN 106055538A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- theme
- lda
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking (under G—Physics; G06—Computing; Calculating or counting; G06F—Electric digital data processing; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities)
- G06F40/30—Semantic analysis (under G06F40/00—Handling natural language data)
Abstract
The invention relates to an automatic extraction method for text labels that combines a topic model with semantic analysis, and belongs to the technical field of computer applications. The method comprises preprocessing, LDA modeling, context analysis and label extraction. The preprocessing comprises removing low-frequency words, removing stop words and removing markup information, where stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks. The LDA modeling process yields two matrices: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic. Compared with conventional counting-based methods, the method takes the correlations between words in documents into consideration and makes full use of key features of the context information, thereby obtaining the label information of documents.
Description
Technical field
The present invention relates to an automatic extraction method for text labels that combines a topic model with semantic analysis, and belongs to the technical field of computer applications.
Background technology
In the DT (data technology) era, information on the Internet grows explosively, and all kinds of text data emerge in an endless stream, such as diversified news and massive original articles from self-media. Faced with such rich and varied information, people urgently need automated tools to help them find the key information they need accurately and rapidly from a vast sea of information; label extraction arose against this background. Labels are an important way to quickly obtain the key information of a text and to grasp its topic, and have important applications in fields such as information retrieval, natural language processing and intelligent recommendation. Many websites provide users with a function to attach labels to objects of interest (such as pictures, videos, books and films), making it convenient for users to share, manage, collect and retrieve those objects. Fig. 1 (a) and Fig. 1 (b) show labels for books and films on Douban.
The LDA (Latent Dirichlet Allocation) model is a generative model of document topics and is currently the most widely applied probabilistic topic model; it makes more complete text-generation assumptions than other models. Building on PLSA, the LDA model uses a K-dimensional latent random variable obeying a Dirichlet distribution to represent the topic mixture proportions of a document, and thereby simulates the generation process of documents. The document representations and latent semantic structures obtained with LDA have been applied successfully in many related areas of text processing. The LDA model is a generative probabilistic model with a multi-layer structure comprising three layers of nodes: documents, topics and words. Topics over words obey a multinomial distribution, while documents over topics obey a Dirichlet distribution. LDA places a Dirichlet prior on the topic mixture weights θ, generating the parameter θ from a hyperparameter α, i.e., a parameter of a parameter. LDA is an unsupervised machine-learning technique that can be used to identify topic information hidden in large-scale document collections or corpora. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thus converting text information into numerical information that is easy to model. Each topic is represented by a probability distribution over many words, and each document is represented by a probability distribution over several topics.
Current label extraction methods mainly fall into the following two categories, with the following shortcomings:

1. Generating labels from statistical information about the vocabulary of a text, such as TF-IDF (term frequency-inverse document frequency) or mutual information, then sorting the candidates and choosing the several highest-ranked as keywords; this is an unsupervised method. Its advantage is that it is simple and fast and requires no manual annotation. However, it cannot effectively and comprehensively use multiple kinds of information to rank candidate keywords. In addition, it does not take the correlations between words into account, namely that a document is actually composed of several latent topics, each of which is composed of several words.

2. Generating labels by machine-learning methods, also called supervised methods. The main idea is to convert the label extraction problem into a binary classification problem of judging whether each candidate keyword is a label. The document collection must first be annotated with labels and then split into training data and test data, which are used to generate a classification model. Through training, this approach can learn how information from multiple dimensions influences the judgment of keywords, so its results are better. However, annotating the training set is very time-consuming and laborious, and document topics change drastically over time, so keeping the training set annotated at all times is unrealistic.
Summary of the invention
To overcome the above deficiencies, the present invention provides an automatic extraction method for text labels that combines a topic model with semantic analysis.
The technical scheme adopted by the present invention is as follows:
An automatic extraction method for text labels combining a topic model and semantic analysis comprises the following steps:
First step: preprocessing;
Second step: LDA modeling and context analysis;
Third step: label extraction.
The preprocessing of the first step is as follows: if low-frequency words, stop words or markup information are present, the preprocessing includes removing low-frequency words, removing stop words and removing markup information. The low-frequency words are words that occur in only one or two texts; the stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup information is web-page markup or other markup-language text information, including html and css.
The LDA modeling process of the second step is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document; the other is a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
The context analysis includes the following dimensions:
(1) term frequency,
(2) document frequency,
(3) part of speech,
(4) word position,
(5) TF-IDF.
The method of context analysis comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each text segment;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute term frequency, document frequency and TF-IDF using standard methods.
After the preprocessing of the first step, each document forms a feature vector, defined as follows: suppose there are N documents, M words and K topics. The LDA modeling process is that, after the documents are processed by the LDA model, two matrices are obtained: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
The method of the label extraction of the third step is as follows:
Combining the result of the LDA model with the feature quantities obtained from the context analysis of the words, the weight of word w in text d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) denotes the LDA score of word w in document d, Score_word(d, w) denotes the score of word w in document d after context analysis, and α and β denote the weights of the LDA algorithm and the context analysis method, respectively;

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K denotes the number of topics set for the LDA model, Topic(t, d) denotes the probability value of the t-th topic of document d in the "document-topic" matrix, and Word(w, t) denotes the probability value of word w under topic t in the "topic-word" matrix;

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) denotes the TF-IDF value of word w in document d, f(w, d) denotes the term-frequency weight of word w in document d, g(w, d) denotes the document-frequency weight of word w in document d, p(w, d) denotes the weight of the position of the word, and q(w) denotes the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants denoting the weights of TF-IDF, term frequency, document frequency, word position and part of speech, respectively, in the context analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions mapped to different intervals. Through the above calculation, the weight Weight(d, w) of each word w in document d is obtained; the words are sorted by weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
The beneficial effects of the invention are as follows:
Compared with current statistics-based methods, the present invention not only considers the correlations between words in a document, but also makes full use of key features in the context information, finally obtaining the label information of the document.
Accompanying drawing explanation
Fig. 1 (a) schematically illustrates a first example of labels for books and films on Douban;
Fig. 1 (b) schematically illustrates a second example of labels for books and films on Douban;
Fig. 2 schematically illustrates the flow of the present invention;
Fig. 3 schematically illustrates the LDA model processing flow.
Detailed description of the invention
The present invention will be further described below in conjunction with the accompanying drawings:
As shown in Fig. 2, the automatic extraction method for text labels combining a topic model and semantic analysis includes the following steps:
First step: preprocessing;
Second step: LDA modeling and context analysis;
Third step: label extraction.
The preprocessing of the first step is as follows: if low-frequency words, stop words or markup information are present, the preprocessing includes removing low-frequency words, removing stop words and removing markup information. The low-frequency words are words that occur in only one or two texts; the stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup information is web-page markup or other markup-language text information, including html and css.
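The preprocessing step above can be sketched as follows. This is a minimal, stdlib-only illustration, not the patent's implementation: the stop-word list and the `min_doc_count` threshold are assumptions (the patent only says low-frequency words occur in "one to two texts", so words in fewer than three documents are dropped here), and a real system for Chinese text would use a proper segmenter and stop-word list.

```python
import re
from collections import Counter

# Hypothetical English stop-word list for the demo; the patent targets
# auxiliary words, function words and punctuation, listed per language.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def preprocess(docs, min_doc_count=3):
    """Remove markup, stop words and low-frequency words (sketch)."""
    tokenized = []
    for raw in docs:
        text = re.sub(r"<[^>]+>", " ", raw)             # strip html/css tags
        words = re.findall(r"[A-Za-z]+", text.lower())  # crude tokenizer
        tokenized.append([w for w in words if w not in STOP_WORDS])
    # Low-frequency words: appear in fewer than min_doc_count documents.
    df = Counter(w for doc in tokenized for w in set(doc))
    return [[w for w in doc if df[w] >= min_doc_count] for doc in tokenized]
```

For example, feeding it raw HTML snippets removes the tags first, then stop words, then any word that survives in fewer than three documents.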
The LDA modeling process of the second step is: after the preprocessing of the first step, each document forms a feature vector, defined as follows: suppose there are N documents, M words and K topics. As shown in Fig. 3, the LDA modeling process is that, after the documents are processed by the LDA model, two matrices are obtained: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
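To illustrate how the two matrices are used downstream, the sketch below computes a per-document, per-word LDA score by summing, over all K topics, the product of the document's topic probability and the topic's word probability. The matrices here are made-up toy values, not the output of a real LDA run (which would come from, e.g., Gibbs sampling or variational inference):

```python
# Toy "document-topic" (N×K) and "topic-word" (K×M) matrices.
doc_topic = [
    [0.8, 0.2],   # document 0: mostly topic 0
    [0.1, 0.9],   # document 1: mostly topic 1
]
topic_word = [
    [0.5, 0.4, 0.1],  # topic 0 over a 3-word vocabulary
    [0.1, 0.2, 0.7],  # topic 1
]

def score_lda(d, w):
    """Sum over topics t of Topic(t, d) * Word(w, t) for word w in doc d."""
    K = len(topic_word)
    return sum(doc_topic[d][t] * topic_word[t][w] for t in range(K))
```

Word 2, which dominates topic 1, scores much higher in the topic-1-heavy document 1 (0.64) than in document 0 (0.22); for a fixed document, the scores over the whole vocabulary sum to 1, since they form the document's word distribution under the model.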
The context analysis includes the following dimensions:
(1) term frequency, i.e., the number of occurrences of a word within one document;
(2) document frequency, i.e., how many documents in the whole document collection contain the word;
(3) part of speech: nouns and noun phrases carry stronger semantics, so their weights are higher;
(4) word position, i.e., the location of the word; words at different positions, such as in the title, abstract, conclusion or body, receive different weights;
(5) TF-IDF, a statistical measure whose main idea is that the more frequently a word occurs in a document, and the fewer other documents it occurs in, the stronger its ability to distinguish that document, and therefore the larger its weight should be.
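A minimal TF-IDF computation matching the idea in dimension (5) can be sketched as follows. The patent does not fix an exact formula, so this uses one common variant (relative term frequency times a smoothed inverse document frequency); the `+ 1` in the denominator is an assumption to avoid division by zero:

```python
import math

def tf_idf(word, doc, corpus):
    """TF-IDF of `word` in `doc`, where `corpus` is a list of token lists."""
    tf = doc.count(word) / len(doc)                # relative term frequency
    df = sum(1 for d in corpus if word in d)       # document frequency
    idf = math.log(len(corpus) / (1 + df))         # smoothed inverse doc freq
    return tf * idf
```

Under this variant, a word that appears in every document gets an IDF of at most zero, while a word concentrated in few documents gets a positive score, exactly the discriminative behavior the text describes.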
The method of the context analysis involved in the second step comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each text segment, such as title, body, bold text and font size;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute term frequency, document frequency and TF-IDF using standard methods.
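Step 1 above, extracting position information from html tags, can be sketched with the standard-library parser. The tag-to-position mapping is a hypothetical stand-in (the patent only states that title, body, bold text and font size are distinguished), and a production system would handle more tags and malformed markup:

```python
from html.parser import HTMLParser

# Hypothetical mapping from an html tag to a position label.
POSITION_TAGS = {"title": "title", "h1": "title", "b": "bold", "strong": "bold"}

class PositionExtractor(HTMLParser):
    """Record, for each text fragment, where in the page it occurred."""
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.fragments = []   # (position label, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop until the matching open tag is removed (tolerant of bad nesting).
        while self.stack and self.stack.pop() != tag:
            pass

    def handle_data(self, data):
        if data.strip():
            pos = next((POSITION_TAGS[t] for t in reversed(self.stack)
                        if t in POSITION_TAGS), "body")
            self.fragments.append((pos, data.strip()))

p = PositionExtractor()
p.feed("<html><title>LDA labels</title><p>Plain text with <b>bold words</b>.</p></html>")
```

After `feed`, `p.fragments` pairs each text run with a label such as "title", "bold" or "body", which a later step can turn into the discrete position weight.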
The label extraction method of the third step is:
Combining the result of the LDA model with the feature quantities obtained from the context analysis of the words, the weight of word w in text d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) denotes the LDA score of word w in document d, Score_word(d, w) denotes the score of word w in document d after context analysis, and α and β denote the weights of the LDA algorithm and the context analysis method, respectively;

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K denotes the number of topics set for the LDA model, Topic(t, d) denotes the probability value of the t-th topic of document d in the "document-topic" matrix, and Word(w, t) denotes the probability value of word w under topic t in the "topic-word" matrix;

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) denotes the TF-IDF value of word w in document d, f(w, d) denotes the term-frequency weight of word w in document d, g(w, d) denotes the document-frequency weight of word w in document d, p(w, d) denotes the weight of the position of the word, and q(w) denotes the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants denoting the weights of TF-IDF, term frequency, document frequency, word position and part of speech, respectively, in the context analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions mapped to different intervals. Through the above calculation, the weight Weight(d, w) of each word w in document d is obtained; the words are sorted by weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
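The final weighting-and-ranking step can be sketched as below. The coefficients and the per-word component scores are illustrative stand-ins, not values from the patent, which leaves the constants as design parameters:

```python
# Assumed weights for the LDA score and the context-analysis score.
ALPHA, BETA = 0.6, 0.4

def weight(score_lda, score_word):
    """Linear combination of the two component scores for one word."""
    return ALPHA * score_lda + BETA * score_word

def top_labels(candidates, k=3):
    """Sort one document's candidate words by weight, keep the top k.

    `candidates` is a list of (word, score_lda, score_word) triples.
    """
    ranked = sorted(candidates, key=lambda c: weight(c[1], c[2]), reverse=True)
    return [word for word, _, _ in ranked[:k]]

# Made-up candidate triples for a single document.
cands = [("topic", 0.9, 0.7), ("model", 0.8, 0.9), ("the", 0.1, 0.2),
         ("label", 0.7, 0.8)]
```

Calling `top_labels(cands)` here ranks "model", "topic" and "label" ahead of the low-scoring "the", mirroring how the method keeps only the highest-weighted words or phrases as the document's labels.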
Compared with current statistics-based methods, the present invention not only considers the correlations between words in a document, but also makes full use of key features in the context information, finally obtaining the label information of the document.
For those of ordinary skill in the art, the present invention is only described exemplarily through specific embodiments. Obviously, the implementation of the present invention is not limited by the above description; as long as insubstantial improvements of various kinds are made using the method concept and technical scheme of the present invention, or the concept and technical scheme of the present invention are directly applied to other occasions without improvement, they all fall within the protection scope of the present invention.
Claims (3)
1. An automatic extraction method for text labels combining a topic model and semantic analysis, characterized by comprising the following steps:
First step: preprocessing; if low-frequency words, stop words or markup information are present, the preprocessing includes removing low-frequency words, removing stop words and removing markup information; the low-frequency words are words that occur in only one or two texts; the stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup information is web-page markup or other markup-language text information, including html and css;
Second step: LDA modeling and context analysis; the LDA modeling process is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document; the other is a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic;
the context analysis includes the following dimensions:
(1) term frequency,
(2) document frequency,
(3) part of speech,
(4) word position,
(5) TF-IDF;
the method of context analysis comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each text segment;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute term frequency, document frequency and TF-IDF using standard methods;
Third step: label extraction.
2. The automatic extraction method for text labels combining a topic model and semantic analysis according to claim 1, characterized in that: in the second step, after preprocessing, each document forms a feature vector; suppose there are N documents, M words and K topics; the LDA modeling process is that, after the documents are processed by the LDA model, two matrices are obtained: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
3. The automatic extraction method for text labels combining a topic model and semantic analysis according to claim 1, characterized in that in the third step the method of label extraction is as follows:
Combining the result of the LDA model with the feature quantities obtained from the context analysis of the words, the weight of word w in text d is:
Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),
where Score_LDA(d, w) denotes the LDA score of word w in document d, Score_word(d, w) denotes the score of word w in document d after context analysis, and α and β denote the weights of the LDA algorithm and the context analysis method, respectively;
Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),
where K denotes the number of topics set for the LDA model, Topic(t, d) denotes the probability value of the t-th topic of document d in the "document-topic" matrix, and Word(w, t) denotes the probability value of word w under topic t in the "topic-word" matrix;
Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),
where TfIdf(w, d) denotes the TF-IDF value of word w in document d, f(w, d) denotes the term-frequency weight of word w in document d, g(w, d) denotes the document-frequency weight of word w in document d, p(w, d) denotes the weight of the position of the word, and q(w) denotes the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants denoting the weights of TF-IDF, term frequency, document frequency, word position and part of speech, respectively, in the context analysis algorithm;
f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions mapped to different intervals; through the above calculation, the weight Weight(d, w) of each word w in document d is obtained, the words are sorted by weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610361639.1A CN106055538B (en) | 2016-05-26 | 2016-05-26 | The automatic abstracting method of the text label that topic model and semantic analysis combine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106055538A true CN106055538A (en) | 2016-10-26 |
CN106055538B CN106055538B (en) | 2019-03-08 |
Family
ID=57175892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610361639.1A Active CN106055538B (en) | 2016-05-26 | 2016-05-26 | The automatic abstracting method of the text label that topic model and semantic analysis combine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106055538B (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502988A (en) * | 2016-11-02 | 2017-03-15 | 深圳市空谷幽兰人工智能科技有限公司 | The method and apparatus that a kind of objective attribute target attribute is extracted |
CN106649844A (en) * | 2016-12-30 | 2017-05-10 | 浙江工商大学 | Unstructured text data enhanced distributed large-scale data dimension extracting method |
CN107169021A (en) * | 2017-04-07 | 2017-09-15 | 华为机器有限公司 | Method and apparatus for predicting application function label |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | 东软集团股份有限公司 | A kind of document subject matter determines method and device |
CN108304509A (en) * | 2018-01-19 | 2018-07-20 | 华南理工大学 | A kind of comment spam filter method for indicating mutually to learn based on the multidirectional amount of text |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
CN108959431A (en) * | 2018-06-11 | 2018-12-07 | 中国科学院上海高等研究院 | Label automatic generation method, system, computer readable storage medium and equipment |
CN109213988A (en) * | 2017-06-29 | 2019-01-15 | 武汉斗鱼网络科技有限公司 | Barrage subject distillation method, medium, equipment and system based on N-gram model |
CN109376270A (en) * | 2018-09-26 | 2019-02-22 | 青岛聚看云科技有限公司 | A kind of data retrieval method and device |
CN109635102A (en) * | 2018-11-19 | 2019-04-16 | 浙江工业大学 | Topic model method for improving based on user's interaction |
CN110032639A (en) * | 2018-12-27 | 2019-07-19 | 中国银联股份有限公司 | By the method, apparatus and storage medium of semantic text data and tag match |
CN110222331A (en) * | 2019-04-26 | 2019-09-10 | 平安科技(深圳)有限公司 | Lie recognition methods and device, storage medium, computer equipment |
CN110347977A (en) * | 2019-06-28 | 2019-10-18 | 太原理工大学 | A kind of news automated tag method based on LDA model |
CN110380954A (en) * | 2017-04-12 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Data sharing method and device, storage medium and electronic device |
CN110727794A (en) * | 2018-06-28 | 2020-01-24 | 上海传漾广告有限公司 | System and method for collecting and analyzing network semantics and summarizing and analyzing content |
CN110781659A (en) * | 2018-07-11 | 2020-02-11 | 株式会社Ntt都科摩 | Text processing method and text processing device based on neural network |
CN111079042A (en) * | 2019-12-03 | 2020-04-28 | 杭州安恒信息技术股份有限公司 | Webpage hidden link detection method and device based on text theme |
CN111160025A (en) * | 2019-12-12 | 2020-05-15 | 日照睿安信息科技有限公司 | Method for actively discovering case keywords based on public security text |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111695358A (en) * | 2020-06-12 | 2020-09-22 | 腾讯科技(深圳)有限公司 | Method and device for generating word vector, computer storage medium and electronic equipment |
CN112287679A (en) * | 2020-10-16 | 2021-01-29 | 国网江西省电力有限公司电力科学研究院 | Structured extraction method and system for text information in scientific and technological project review |
CN112559853A (en) * | 2019-09-26 | 2021-03-26 | 北京沃东天骏信息技术有限公司 | User label generation method and device |
CN112732743A (en) * | 2021-01-12 | 2021-04-30 | 北京久其软件股份有限公司 | Data analysis method and device based on Chinese natural language |
US11030483B2 (en) | 2018-08-07 | 2021-06-08 | International Business Machines Corporation | Generating and ordering tags for an image using subgraph of concepts |
WO2022183991A1 (en) * | 2021-03-01 | 2022-09-09 | 国家电网有限公司 | Document classification method and apparatus, and electronic device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080319974A1 (en) * | 2007-06-21 | 2008-12-25 | Microsoft Corporation | Mining geographic knowledge using a location aware topic model |
US20120041953A1 (en) * | 2010-08-16 | 2012-02-16 | Microsoft Corporation | Text mining of microblogs using latent topic labels |
CN103164463A (en) * | 2011-12-16 | 2013-06-19 | 国际商业机器公司 | Method and device for recommending labels |
CN103365978A (en) * | 2013-07-01 | 2013-10-23 | 浙江大学 | Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model |
CN103390051A (en) * | 2013-07-25 | 2013-11-13 | 南京邮电大学 | Topic detection and tracking method based on microblog data |
CN103425710A (en) * | 2012-05-25 | 2013-12-04 | 北京百度网讯科技有限公司 | Subject-based searching method and device |
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | Theme-vocabulary distribution establishing method and system based on document segmenting |
CN103778207A (en) * | 2014-01-15 | 2014-05-07 | 杭州电子科技大学 | LDA-based news comment topic digging method |
CN103942340A (en) * | 2014-05-09 | 2014-07-23 | 电子科技大学 | Microblog user interest recognizing method based on text mining |
CN105608166A (en) * | 2015-12-18 | 2016-05-25 | Tcl集团股份有限公司 | Label extracting method and device |
Non-Patent Citations (4)
Title |
---|
NIDAA GHALIB ALI et al.: "A Hybrid of Statistical and Machine Learning Methods for Arabic Keyphrase Extraction", Asian Journal of Applied Sciences, 2015 *
LIU NA et al.: "Multi-document automatic summarization algorithm based on important LDA topics", Journal of Frontiers of Computer Science and Technology *
LIU MUFAN: "Research on spam web page detection methods based on topic and semantics", China Master's Theses Full-text Database, Information Science and Technology *
SHI JING et al.: "Topic word extraction method based on the LDA model", Computer Engineering *
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502988A (en) * | 2016-11-02 | 2017-03-15 | Shenzhen Konggu Youlan Artificial Intelligence Technology Co., Ltd. | Method and apparatus for extracting target attributes |
CN106502988B (en) * | 2016-11-02 | 2019-06-07 | Guangdong Huihe Technology Development Co., Ltd. | Method and apparatus for extracting target attributes |
CN106649844A (en) * | 2016-12-30 | 2017-05-10 | Zhejiang Gongshang University | Unstructured-text-enhanced distributed large-scale data dimension extraction method |
CN106649844B (en) * | 2016-12-30 | 2019-10-18 | Zhejiang Gongshang University | Unstructured-text-enhanced distributed large-scale data dimension extraction method |
CN107169021A (en) * | 2017-04-07 | 2017-09-15 | Huawei Machine Co., Ltd. | Method and apparatus for predicting application function label |
CN110380954A (en) * | 2017-04-12 | 2019-10-25 | Tencent Technology (Shenzhen) Co., Ltd. | Data sharing method and device, storage medium and electronic device |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | Neusoft Corporation | Document topic determination method and device |
CN109213988B (en) * | 2017-06-29 | 2022-06-21 | Wuhan Douyu Network Technology Co., Ltd. | Barrage topic extraction method, medium, device and system based on N-gram model |
CN109213988A (en) * | 2017-06-29 | 2019-01-15 | Wuhan Douyu Network Technology Co., Ltd. | Barrage topic extraction method, medium, device and system based on N-gram model |
CN108304509A (en) * | 2018-01-19 | 2018-07-20 | South China University of Technology | Spam comment filtering method based on mutual learning of multiple text vector representations |
CN108304509B (en) * | 2018-01-19 | 2021-12-21 | South China University of Technology | Spam comment filtering method based on mutual learning of multiple text vector representations |
CN108536679B (en) * | 2018-04-13 | 2022-05-20 | Tencent Technology (Chengdu) Co., Ltd. | Named entity recognition method, device, equipment and computer-readable storage medium |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | Tencent Technology (Chengdu) Co., Ltd. | Named entity recognition method, device, equipment and computer-readable storage medium |
CN108959431A (en) * | 2018-06-11 | 2018-12-07 | Shanghai Advanced Research Institute, Chinese Academy of Sciences | Automatic label generation method, system, computer-readable storage medium and device |
CN110727794A (en) * | 2018-06-28 | 2020-01-24 | Shanghai Chuanyang Advertising Co., Ltd. | System and method for collecting and analyzing network semantics and summarizing and analyzing content |
CN110781659A (en) * | 2018-07-11 | 2020-02-11 | NTT DOCOMO, Inc. | Neural-network-based text processing method and text processing device |
US11030483B2 (en) | 2018-08-07 | 2021-06-08 | International Business Machines Corporation | Generating and ordering tags for an image using subgraph of concepts |
CN109376270A (en) * | 2018-09-26 | 2019-02-22 | Qingdao Jukanyun Technology Co., Ltd. | Data retrieval method and device |
CN109635102A (en) * | 2018-11-19 | 2019-04-16 | Zhejiang University of Technology | Topic model improvement method based on user interaction |
CN109635102B (en) * | 2018-11-19 | 2021-05-11 | Zhejiang University of Technology | Topic model improvement method based on user interaction |
CN110032639B (en) * | 2018-12-27 | 2023-10-31 | China UnionPay Co., Ltd. | Method, device and storage medium for matching semantic text data with a tag |
US11586658B2 (en) | 2018-12-27 | 2023-02-21 | China Unionpay Co., Ltd. | Method and device for matching semantic text data with a tag, and computer-readable storage medium having stored instructions |
CN110032639A (en) * | 2018-12-27 | 2019-07-19 | China UnionPay Co., Ltd. | Method, device and storage medium for matching semantic text data with a tag |
CN110222331A (en) * | 2019-04-26 | 2019-09-10 | Ping An Technology (Shenzhen) Co., Ltd. | Lie recognition method and device, storage medium and computer device |
CN110347977A (en) * | 2019-06-28 | 2019-10-18 | Taiyuan University of Technology | Automatic news tagging method based on LDA model |
CN112559853A (en) * | 2019-09-26 | 2021-03-26 | Beijing Wodong Tianjun Information Technology Co., Ltd. | User tag generation method and device |
CN112559853B (en) * | 2019-09-26 | 2024-01-12 | Beijing Wodong Tianjun Information Technology Co., Ltd. | User tag generation method and device |
CN111079042B (en) * | 2019-12-03 | 2023-08-15 | Hangzhou Anheng Information Technology Co., Ltd. | Webpage hidden link detection method and device based on text topics |
CN111079042A (en) * | 2019-12-03 | 2020-04-28 | Hangzhou Anheng Information Technology Co., Ltd. | Webpage hidden link detection method and device based on text topics |
CN111160025A (en) * | 2019-12-12 | 2020-05-15 | Rizhao Ruian Information Technology Co., Ltd. | Method for actively discovering case keywords based on public security texts |
CN111507098B (en) * | 2020-04-17 | 2023-03-21 | Tencent Technology (Shenzhen) Co., Ltd. | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | Tencent Technology (Shenzhen) Co., Ltd. | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111695358A (en) * | 2020-06-12 | 2020-09-22 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for generating word vectors, computer storage medium and electronic equipment |
CN111695358B (en) * | 2020-06-12 | 2023-08-08 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for generating word vectors, computer storage medium and electronic equipment |
CN112287679A (en) * | 2020-10-16 | 2021-01-29 | Electric Power Research Institute of State Grid Jiangxi Electric Power Co., Ltd. | Structured extraction method and system for text information in science and technology project review |
CN112732743A (en) * | 2021-01-12 | 2021-04-30 | Beijing Jiuqi Software Co., Ltd. | Data analysis method and device based on Chinese natural language |
CN112732743B (en) * | 2021-01-12 | 2023-09-22 | Beijing Jiuqi Software Co., Ltd. | Data analysis method and device based on Chinese natural language |
WO2022183991A1 (en) * | 2021-03-01 | 2022-09-09 | State Grid Corporation of China | Document classification method and apparatus, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN106055538B (en) | 2019-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106055538A (en) | Automatic extraction method for text labels in combination with theme model and semantic analyses | |
US9779085B2 (en) | Multilingual embeddings for natural language processing | |
Lin et al. | Joint sentiment/topic model for sentiment analysis | |
Read et al. | Weakly supervised techniques for domain-independent sentiment classification | |
CN103049435B (en) | Text fine granularity sentiment analysis method and device | |
El-Halees | Mining opinions in user-generated contents to improve course evaluation | |
Rajan et al. | Automatic classification of Tamil documents using vector space model and artificial neural network | |
Das et al. | Part of speech tagging in odia using support vector machine | |
CN103473380B (en) | A kind of computer version sensibility classification method | |
Hamza et al. | An arabic question classification method based on new taxonomy and continuous distributed representation of words | |
Haque et al. | Literature review of automatic multiple documents text summarization | |
Qiu et al. | Advanced sentiment classification of tibetan microblogs on smart campuses based on multi-feature fusion | |
Wahbeh et al. | Comparative assessment of the performance of three WEKA text classifiers applied to arabic text | |
Sadr et al. | Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms | |
Jebari et al. | A new approach for implicit citation extraction | |
Hassan et al. | Automatic document topic identification using wikipedia hierarchical ontology | |
Spatiotis et al. | Sentiment analysis for the Greek language | |
Khan et al. | Sentiment analysis at sentence level for heterogeneous datasets | |
Shah et al. | An automatic text summarization on Naive Bayes classifier using latent semantic analysis | |
Zhang et al. | Positive, negative, or mixed? Mining blogs for opinions | |
Alam et al. | Bangla news trend observation using lda based topic modeling | |
Ma et al. | Analysis of three methods for web-based opinion mining | |
Kumar et al. | Aspect-Based Sentiment Analysis of Tweets Using Independent Component Analysis (ICA) and Probabilistic Latent Semantic Analysis (pLSA) | |
Ba-Alwi et al. | Arabic text summarization using latent semantic analysis | |
Singh et al. | An Insight into Word Sense Disambiguation Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai 201203
Patentee after: Daguan Data Co., Ltd.
Address before: Room 1208, No. 2305 Zuchongzhi Road, Zhangjiang, Pudong New Area, Shanghai 200000
Patentee before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co., Ltd.