CN106055538A - Automatic extraction method for text labels in combination with theme model and semantic analyses - Google Patents

Automatic extraction method for text labels in combination with theme model and semantic analyses

Info

Publication number
CN106055538A
CN106055538A (application CN201610361639.1A)
Authority
CN
China
Prior art keywords
word
document
theme
lda
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610361639.1A
Other languages
Chinese (zh)
Other versions
CN106055538B (en)
Inventor
于敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Co ltd
Original Assignee
Datagrand Information Technology (Shanghai) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Information Technology (Shanghai) Co., Ltd.
Priority to CN201610361639.1A
Publication of CN106055538A
Application granted
Publication of CN106055538B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention relates to an automatic extraction method for text labels combining a topic model and semantic analysis, and pertains to the technical field of computer applications. The method comprises pre-processing, LDA modeling, contextual analysis and label extraction. The pre-processing comprises the following steps: removing low-frequency words, removing stop-words and removing markup information, wherein stop-words are auxiliary words that carry hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation. The LDA modeling process comprises the following steps: after processing by the LDA model, two matrices are obtained: one is an N×K document-topic matrix, each row of which is the latent topic distribution of one document, and the other is a K×M topic-word matrix, each row of which is the word distribution of one topic. Compared with conventional counting-based methods, the method takes the correlation between words in a document into consideration and makes full use of a key resource, the contextual information, so that the label information of the document is obtained.

Description

Automatic text label extraction method combining a topic model and semantic analysis
Technical field
The present invention relates to an automatic text label extraction method that combines a topic model with semantic analysis, and belongs to the technical field of computer applications.
Background technology
In the DT (data technology) era, Internet information grows explosively and all kinds of text data emerge in an endless stream, such as diversified news and massive self-media articles. Facing such rich and varied information, people urgently need automated tools to help them find, accurately and rapidly, the key information they need from this vast sea of information; label extraction arose against exactly this background. Labels are an important means of quickly obtaining the key information of a text and grasping its theme, and have important applications in fields such as information retrieval, natural language processing and intelligent recommendation. Many websites provide users with the function of tagging objects of interest (such as pictures, videos, books and films), making it convenient for users to share, manage, collect and retrieve objects. Fig. 1(a) and Fig. 1(b) show labels for books and films on Douban.
The LDA (Latent Dirichlet Allocation) model is a document topic generation model and is currently the most widely applied probabilistic topic model; it makes more complete text-generation assumptions than other models. On the basis of PLSA, the LDA model uses a K-dimensional latent random variable obeying a Dirichlet distribution to represent the topic mixture proportion of a document, and uses it to simulate the document generation process. The document representation and latent semantic structure obtained with LDA have been successfully applied to many related areas of text processing. The LDA model is a multi-layer generative probabilistic model containing a three-layer structure of documents, topics and words. Topics obey a multinomial distribution over words, while documents obey a Dirichlet distribution over topics. LDA places a Dirichlet prior on the topic mixture weights θ, generating the parameter θ with a hyper-parameter α, i.e. a parameter of a parameter. LDA is an unsupervised machine learning technique that can be used to identify topic information hidden in large-scale document sets or corpora. It adopts the bag-of-words method, which treats each document as a word frequency vector, thereby converting text information into numerical information that is easy to model. Each topic is represented as a probability distribution over many words, and each document is represented as a probability distribution over several topics.
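For reference, the generative process described in the preceding paragraph can be written in standard LDA notation (this is the usual textbook formulation; the topic-word prior β is standard notation and is not named in the patent text):

```latex
\begin{aligned}
\phi_t   &\sim \mathrm{Dirichlet}(\beta)              && t = 1,\dots,K   \;\text{(word distribution of topic } t\text{)}\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)             && d = 1,\dots,N   \;\text{(topic mixture of document } d\text{)}\\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d)         && \text{(topic assignment of the } n\text{-th word of } d\text{)}\\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})   && \text{(the observed word)}
\end{aligned}
```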
Current label extraction methods are mainly of the following two kinds, with the following shortcomings:
1. Generating labels from statistical information about the vocabulary of a text, such as TF-IDF (term frequency-inverse document frequency) and mutual information, then sorting the candidates and choosing the highest-ranking ones as keywords; this is essentially an unsupervised method. Its advantage is that it is simple and fast and requires no manual annotation. However, it cannot effectively combine multiple kinds of information to rank candidate keywords. In addition, it does not take into account the correlation between words, namely that a document is actually composed of several latent topics, each of which is composed of several words.
2. Generating labels by machine learning, also called the supervised method. Its main idea is to convert the label extraction problem into a binary classification problem of judging whether each candidate keyword is a label. It first requires labelling the document set, then splitting it into training data and test data, which are used to build a classification model. Through training, this method can learn how strongly information from multiple dimensions influences the judgement of keywords, so its effect is better. However, annotating the training set is very time-consuming and laborious, document topics also change drastically over time, and re-annotating the training set at any time is unrealistic.
Summary of the invention
In order to overcome the above deficiencies, the present invention provides an automatic text label extraction method combining a topic model and semantic analysis.
The technical scheme adopted by the present invention is as follows:
The automatic text label extraction method combining a topic model and semantic analysis comprises the following steps:
First step: pre-processing;
Second step: LDA modeling and contextual analysis;
Third step: label extraction.
The pre-processing of the first step is performed as follows: if low-frequency words, stop-words and markup tag information are present, the pre-processing includes removing low-frequency words, removing stop-words and removing the markup tag information. A low-frequency word is a word that appears in only one or two texts; stop-words are auxiliary words that carry hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup tag information is markup-language information in web page text or other markup-language text, where other markup-language text information includes html and css.
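A minimal pre-processing sketch along the lines of this step is shown below. The stop-word list, the regular expression used to strip markup and the tokenizer are illustrative assumptions; the patent only prescribes what is removed, not how.

```python
import re
from collections import Counter

STOP_WORDS = {"的", "了", "和", "是", "在"}   # illustrative stop-word list (auxiliary/function words)
TAG_RE = re.compile(r"<[^>]+>")               # strips html/css-style markup tags

def preprocess(raw_docs, tokenizer, min_doc_freq=3):
    """Remove markup, stop-words, punctuation and low-frequency words from each document."""
    # 1. remove markup information and tokenize; drop stop-words and pure punctuation tokens
    token_docs = [[w for w in tokenizer(TAG_RE.sub(" ", doc))
                   if w not in STOP_WORDS and w.strip() and not re.fullmatch(r"\W+", w)]
                  for doc in raw_docs]
    # 2. document frequency: in how many documents each word appears
    doc_freq = Counter(w for doc in token_docs for w in set(doc))
    # 3. drop low-frequency words, i.e. words appearing in only one or two texts
    return [[w for w in doc if doc_freq[w] >= min_doc_freq] for doc in token_docs]

# usage sketch: token_docs = preprocess(raw_docs, tokenizer=lambda s: s.split())
```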
The LDA modeling process of the second step is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K document-topic matrix, in which each row is the latent topic distribution of one document; the other is a K×M topic-word matrix, in which each row is the word distribution of one topic;
The contextual analysis includes the following dimensions:
(1) word frequency,
(2) document frequency,
(3) part of speech,
(4) word position,
(5) TF-IDF;
The method of contextual analysis comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each piece of text;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute word frequency, document frequency and TF-IDF using standard methods well known in the industry;
After the pre-processing of the first step, each document forms a feature vector. The feature vector is defined as follows: suppose there are N documents, M words and K topics. The LDA modeling process is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K document-topic matrix, in which each row is the latent topic distribution of one document; the other is a K×M topic-word matrix, in which each row is the word distribution of one topic.
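A sketch of this modeling step using scikit-learn's LatentDirichletAllocation is given below; the patent does not prescribe a particular LDA implementation, so the library choice and parameter values are assumptions. The two returned matrices play the roles of the N×K document-topic matrix and the K×M topic-word matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def build_lda_matrices(token_docs, K=50):
    """Return the N x K document-topic matrix, the K x M topic-word matrix and the vocabulary."""
    # bag-of-words: each document becomes a word-frequency vector (N documents, M distinct words)
    vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)  # documents are already token lists
    X = vectorizer.fit_transform(token_docs)

    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    doc_topic = lda.fit_transform(X)                 # N x K; row d gives Topic(t, d) for t = 1..K
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # K x M; row t gives Word(w, t)
    return doc_topic, topic_word, vectorizer.get_feature_names_out()
```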
The label extraction method of the third step is as follows:
Combining the result of the LDA model with the feature quantities obtained from the word contextual analysis, the weight of word w in document d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) is the LDA score of word w in document d, Score_word(d, w) is the score of word w in document d after contextual analysis, and α and β are the weights of the LDA algorithm and of the contextual analysis method respectively,

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K is the number of topics set for the LDA model, Topic(t, d) is the probability value of the t-th topic of document d in the document-topic matrix, and Word(w, t) is the probability value of word w under topic t in the topic-word matrix,

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) is the TF-IDF value of word w in document d, f(w, d) is the word-frequency weight of word w in document d, g(w, d) is the document-frequency weight of word w in document d, p(w, d) is the weight of the position of the word, and q(w) is the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants, the weights of TF-IDF, word frequency, document frequency, word position and part of speech respectively in the word contextual analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions, each mapped onto a different interval. Through the above calculation the weight Weight(d, w) of every word w in document d is obtained; the words are sorted by this weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
The method has the following beneficial effects:
Compared with current statistics-based methods, the present invention not only considers the correlation between words in a document, but also makes full use of key features in the contextual information, and finally obtains the label information of the document.
Description of the drawings
Fig. 1(a) schematically illustrates labels for books and films on Douban (first example);
Fig. 1(b) schematically illustrates labels for books and films on Douban (second example);
Fig. 2 schematically illustrates the flow chart of the present invention;
Fig. 3 schematically illustrates the LDA model processing flow chart.
Detailed description of the invention
The present invention will be further described below in conjunction with the accompanying drawings:
As shown in Fig. 2, the automatic text label extraction method combining a topic model and semantic analysis comprises the following steps:
First step: pre-processing;
Second step: LDA modeling and contextual analysis;
Third step: label extraction.
The pre-processing of the first step is performed as follows: if low-frequency words, stop-words and markup tag information are present, the pre-processing includes removing low-frequency words, removing stop-words and removing the markup tag information. A low-frequency word is a word that appears in only one or two texts; stop-words are auxiliary words that carry hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup tag information is markup-language information in web page text or other markup-language text, where other markup-language text information includes html and css.
The LDA modeling involved in the second step is as follows: after the pre-processing of the first step, each document forms a feature vector. The feature vector is defined as follows: suppose there are N documents, M words and K topics. As shown in Fig. 3, the LDA modeling process is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K document-topic matrix, in which each row is the latent topic distribution of one document; the other is a K×M topic-word matrix, in which each row is the word distribution of one topic.
The contextual analysis includes the following dimensions:
(1) word frequency, i.e. the number of occurrences of a word in one document;
(2) document frequency, i.e. how many documents in the whole document set contain the word;
(3) part of speech: nouns and noun phrases carry stronger semantics, so their weight is correspondingly higher;
(4) word position, i.e. where the word is located; words at different positions such as the title, the abstract, the conclusion and the body of an article have different weights;
(5) TF-IDF: TF-IDF is a statistical method whose main idea is that the more frequently a word appears in a document, and the fewer times it appears in other documents, the stronger its ability to distinguish this document, so its weight value should be correspondingly larger; a reference formulation is given after this list.
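For reference, a common textbook formulation of the TF-IDF value sketched in item (5) is given below (the +1 smoothing in the denominator is an assumption; the patent does not fix a particular variant):

```latex
\mathrm{TfIdf}(w, d) \;=\;
\underbrace{\frac{n_{w,d}}{\sum_{w'} n_{w',d}}}_{\text{frequency of } w \text{ in } d}
\;\cdot\;
\underbrace{\log\frac{N}{1 + \mathrm{df}(w)}}_{\text{inverse document frequency}}
```

where n_{w,d} is the number of occurrences of word w in document d, N is the total number of documents and df(w) is the number of documents containing w.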
The method of contextual analysis involved in the second step comprises the following steps, a sketch of which follows the list:
1. according to the html tag information of the text, obtain the position information of each piece of text, such as title, body text, bold text, font size, etc.;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute word frequency, document frequency and TF-IDF using standard methods well known in the industry.
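The three numbered steps above could be realised as in the following sketch; BeautifulSoup for html position information and jieba.posseg for segmentation and part-of-speech tagging are assumed tool choices, which the patent does not name.

```python
import math
from collections import Counter

from bs4 import BeautifulSoup    # assumed html parser
import jieba.posseg as pseg      # assumed Chinese segmenter with part-of-speech tagging

def html_positions(html):
    """Step 1: map each text fragment to a coarse position label (title, bold, body ...)."""
    soup = BeautifulSoup(html, "html.parser")
    positions = []
    for tag in soup.find_all(["title", "h1", "h2", "b", "strong", "p"]):
        label = ("title" if tag.name in ("title", "h1")
                 else "bold" if tag.name in ("b", "strong") else "body")
        positions.append((label, tag.get_text(" ", strip=True)))
    return positions

def segment_with_pos(text):
    """Step 2: word segmentation plus part-of-speech tagging."""
    return [(p.word, p.flag) for p in pseg.cut(text)]

def tf_df_tfidf(token_docs):
    """Step 3: per-document word frequency, document frequency, and TF-IDF."""
    N = len(token_docs)
    df = Counter(w for doc in token_docs for w in set(doc))
    tf = [Counter(doc) for doc in token_docs]
    tfidf = [{w: (c / max(len(doc), 1)) * math.log(N / (1 + df[w])) for w, c in counts.items()}
             for doc, counts in zip(token_docs, tf)]
    return tf, df, tfidf
```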
The label extraction method of the third step is:
Combining the result of the LDA model with the feature quantities obtained from the word contextual analysis, the weight of word w in document d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) is the LDA score of word w in document d, Score_word(d, w) is the score of word w in document d after contextual analysis, and α and β are the weights of the LDA algorithm and of the contextual analysis method respectively,

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K is the number of topics set for the LDA model, Topic(t, d) is the probability value of the t-th topic of document d in the document-topic matrix, and Word(w, t) is the probability value of word w under topic t in the topic-word matrix,

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) is the TF-IDF value of word w in document d, f(w, d) is the word-frequency weight of word w in document d, g(w, d) is the document-frequency weight of word w in document d, p(w, d) is the weight of the position of the word, and q(w) is the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants, the weights of TF-IDF, word frequency, document frequency, word position and part of speech respectively in the word contextual analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions, each mapped onto a different interval. Through the above calculation the weight Weight(d, w) of every word w in document d is obtained; the words are sorted by this weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
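A sketch of this final scoring step is given below, assuming the document-topic and topic-word matrices from the LDA step and per-word context features (tfidf, f, g, p, q) computed beforehand. All coefficient values are placeholders, since the patent leaves α, β, ρ, γ, ξ, μ and σ as tunable constants.

```python
def extract_labels(d, words, doc_topic, topic_word, vocab_index,
                   tfidf, f, g, p, q,
                   alpha=0.6, beta=0.4, rho=0.4, gamma=0.2, xi=0.1, mu=0.2, sigma=0.1,
                   top_n=5):
    """Weight(d, w) = alpha * Score_LDA(d, w) + beta * Score_word(d, w); return the top-n words as labels."""
    def score_lda(w):
        wi = vocab_index.get(w)            # column of word w in the topic-word matrix
        if wi is None:
            return 0.0
        # sum over topics t of Topic(t, d) * Word(w, t)
        return float((doc_topic[d] * topic_word[:, wi]).sum())

    def score_word(w):
        return (rho * tfidf.get(w, 0.0) + gamma * f.get(w, 0.0) + xi * g.get(w, 0.0)
                + mu * p.get(w, 0.0) + sigma * q.get(w, 0.0))

    weights = {w: alpha * score_lda(w) + beta * score_word(w) for w in set(words)}
    return sorted(weights, key=weights.get, reverse=True)[:top_n]
```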
Compared with current statistics-based methods, the present invention not only considers the correlation between words in a document, but also makes full use of key features in the contextual information, and finally obtains the label information of the document.
For those of ordinary skill in the art, the present invention has merely been described by way of example through specific embodiments. Obviously, implementation of the present invention is not limited by the above description; any insubstantial improvement made using the method concept and technical scheme of the present invention, or any direct application of the concept and technical scheme of the present invention to other occasions without improvement, falls within the protection scope of the present invention.

Claims (3)

1. An automatic text label extraction method combining a topic model and semantic analysis, characterised in that it comprises the following steps:
First step: pre-processing; if low-frequency words, stop-words and markup tag information are present, the pre-processing includes removing low-frequency words, removing stop-words and removing the markup tag information; the low-frequency words are words that appear in only one or two texts; the stop-words are auxiliary words that carry hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup tag information is markup-language information in web page text or other markup-language text, where other markup-language text information includes html and css;
Second step: LDA modeling and contextual analysis; the LDA modeling process is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K document-topic matrix, in which each row is the latent topic distribution of one document; the other is a K×M topic-word matrix, in which each row is the word distribution of one topic;
The contextual analysis includes the following dimensions:
(1) word frequency,
(2) document frequency,
(3) part of speech,
(4) word position,
(5) TF-IDF;
The method of contextual analysis comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each piece of text;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute word frequency, document frequency and TF-IDF using standard methods well known in the industry;
Third step: label extraction.
2. The automatic text label extraction method combining a topic model and semantic analysis according to claim 1, characterised in that: in the second step, after pre-processing, each document forms a feature vector; suppose there are N documents, M words and K topics; the LDA modeling process is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K document-topic matrix, in which each row is the latent topic distribution of one document; the other is a K×M topic-word matrix, in which each row is the word distribution of one topic.
3. The automatic text label extraction method combining a topic model and semantic analysis according to claim 1, characterised in that in the third step the label extraction method is as follows:
Combining the result of the LDA model with the feature quantities obtained from the word contextual analysis, the weight of word w in document d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) is the LDA score of word w in document d, Score_word(d, w) is the score of word w in document d after contextual analysis, and α and β are the weights of the LDA algorithm and of the contextual analysis method respectively,

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K is the number of topics set for the LDA model, Topic(t, d) is the probability value of the t-th topic of document d in the document-topic matrix, and Word(w, t) is the probability value of word w under topic t in the topic-word matrix,

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) is the TF-IDF value of word w in document d, f(w, d) is the word-frequency weight of word w in document d, g(w, d) is the document-frequency weight of word w in document d, p(w, d) is the weight of the position of the word, and q(w) is the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants, the weights of TF-IDF, word frequency, document frequency, word position and part of speech respectively in the word contextual analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions, each mapped onto a different interval. Through the above calculation the weight Weight(d, w) of every word w in document d is obtained; the words are sorted by this weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
CN201610361639.1A 2016-05-26 2016-05-26 Automatic extraction method for text labels in combination with theme model and semantic analyses Active CN106055538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610361639.1A CN106055538B (en) 2016-05-26 2016-05-26 Automatic extraction method for text labels in combination with theme model and semantic analyses

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610361639.1A CN106055538B (en) 2016-05-26 2016-05-26 Automatic extraction method for text labels in combination with theme model and semantic analyses

Publications (2)

Publication Number Publication Date
CN106055538A true CN106055538A (en) 2016-10-26
CN106055538B CN106055538B (en) 2019-03-08

Family

ID=57175892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610361639.1A Active CN106055538B (en) 2016-05-26 2016-05-26 Automatic extraction method for text labels in combination with theme model and semantic analyses

Country Status (1)

Country Link
CN (1) CN106055538B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502988A (en) * 2016-11-02 2017-03-15 深圳市空谷幽兰人工智能科技有限公司 The method and apparatus that a kind of objective attribute target attribute is extracted
CN106649844A (en) * 2016-12-30 2017-05-10 浙江工商大学 Unstructured text data enhanced distributed large-scale data dimension extracting method
CN107169021A (en) * 2017-04-07 2017-09-15 华为机器有限公司 Method and apparatus for predicting application function label
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 A kind of document subject matter determines method and device
CN108304509A (en) * 2018-01-19 2018-07-20 华南理工大学 A kind of comment spam filter method for indicating mutually to learn based on the multidirectional amount of text
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Label automatic generation method, system, computer readable storage medium and equipment
CN109213988A (en) * 2017-06-29 2019-01-15 武汉斗鱼网络科技有限公司 Barrage subject distillation method, medium, equipment and system based on N-gram model
CN109376270A (en) * 2018-09-26 2019-02-22 青岛聚看云科技有限公司 A kind of data retrieval method and device
CN109635102A (en) * 2018-11-19 2019-04-16 浙江工业大学 Topic model method for improving based on user's interaction
CN110032639A (en) * 2018-12-27 2019-07-19 中国银联股份有限公司 By the method, apparatus and storage medium of semantic text data and tag match
CN110222331A (en) * 2019-04-26 2019-09-10 平安科技(深圳)有限公司 Lie recognition methods and device, storage medium, computer equipment
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 A kind of news automated tag method based on LDA model
CN110380954A (en) * 2017-04-12 2019-10-25 腾讯科技(深圳)有限公司 Data sharing method and device, storage medium and electronic device
CN110727794A (en) * 2018-06-28 2020-01-24 上海传漾广告有限公司 System and method for collecting and analyzing network semantics and summarizing and analyzing content
CN110781659A (en) * 2018-07-11 2020-02-11 株式会社Ntt都科摩 Text processing method and text processing device based on neural network
CN111079042A (en) * 2019-12-03 2020-04-28 杭州安恒信息技术股份有限公司 Webpage hidden link detection method and device based on text theme
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
CN111507098A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN111695358A (en) * 2020-06-12 2020-09-22 腾讯科技(深圳)有限公司 Method and device for generating word vector, computer storage medium and electronic equipment
CN112287679A (en) * 2020-10-16 2021-01-29 国网江西省电力有限公司电力科学研究院 Structured extraction method and system for text information in scientific and technological project review
CN112559853A (en) * 2019-09-26 2021-03-26 北京沃东天骏信息技术有限公司 User label generation method and device
CN112732743A (en) * 2021-01-12 2021-04-30 北京久其软件股份有限公司 Data analysis method and device based on Chinese natural language
US11030483B2 (en) 2018-08-07 2021-06-08 International Business Machines Corporation Generating and ordering tags for an image using subgraph of concepts
WO2022183991A1 (en) * 2021-03-01 2022-09-09 国家电网有限公司 Document classification method and apparatus, and electronic device


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080319974A1 (en) * 2007-06-21 2008-12-25 Microsoft Corporation Mining geographic knowledge using a location aware topic model
US20120041953A1 (en) * 2010-08-16 2012-02-16 Microsoft Corporation Text mining of microblogs using latent topic labels
CN103164463A (en) * 2011-12-16 2013-06-19 国际商业机器公司 Method and device for recommending labels
CN103425710A (en) * 2012-05-25 2013-12-04 北京百度网讯科技有限公司 Subject-based searching method and device
CN103365978A (en) * 2013-07-01 2013-10-23 浙江大学 Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data
CN103678277A (en) * 2013-12-04 2014-03-26 东软集团股份有限公司 Theme-vocabulary distribution establishing method and system based on document segmenting
CN103778207A (en) * 2014-01-15 2014-05-07 杭州电子科技大学 LDA-based news comment topic digging method
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining
CN105608166A (en) * 2015-12-18 2016-05-25 Tcl集团股份有限公司 Label extracting method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NIDAA GHALIB ALI et al.: "A Hybrid of Statistical and Machine Learning Methods for Arabic Keyphrase Extraction", Asian Journal of Applied Sciences, 2015 *
刘娜 (Liu Na) et al.: "Multi-document automatic summarization algorithm based on important LDA topics", Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) *
刘慕凡 (Liu Mufan): "Research on cheating web page detection methods based on topic and semantics", China Master's Theses Full-text Database, Information Science and Technology series (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
石晶 (Shi Jing) et al.: "Topic word extraction method based on the LDA model", Computer Engineering (《计算机工程》) *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502988A (en) * 2016-11-02 2017-03-15 深圳市空谷幽兰人工智能科技有限公司 The method and apparatus that a kind of objective attribute target attribute is extracted
CN106502988B (en) * 2016-11-02 2019-06-07 广东惠禾科技发展有限公司 A kind of method and apparatus that objective attribute target attribute extracts
CN106649844A (en) * 2016-12-30 2017-05-10 浙江工商大学 Unstructured text data enhanced distributed large-scale data dimension extracting method
CN106649844B (en) * 2016-12-30 2019-10-18 浙江工商大学 The enhanced distributed large-scale data dimension abstracting method of unstructured text data
CN107169021A (en) * 2017-04-07 2017-09-15 华为机器有限公司 Method and apparatus for predicting application function label
CN110380954A (en) * 2017-04-12 2019-10-25 腾讯科技(深圳)有限公司 Data sharing method and device, storage medium and electronic device
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 A kind of document subject matter determines method and device
CN109213988B (en) * 2017-06-29 2022-06-21 武汉斗鱼网络科技有限公司 Barrage theme extraction method, medium, equipment and system based on N-gram model
CN109213988A (en) * 2017-06-29 2019-01-15 武汉斗鱼网络科技有限公司 Barrage subject distillation method, medium, equipment and system based on N-gram model
CN108304509A (en) * 2018-01-19 2018-07-20 华南理工大学 A kind of comment spam filter method for indicating mutually to learn based on the multidirectional amount of text
CN108304509B (en) * 2018-01-19 2021-12-21 华南理工大学 Junk comment filtering method based on text multi-directional expression mutual learning
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Label automatic generation method, system, computer readable storage medium and equipment
CN110727794A (en) * 2018-06-28 2020-01-24 上海传漾广告有限公司 System and method for collecting and analyzing network semantics and summarizing and analyzing content
CN110781659A (en) * 2018-07-11 2020-02-11 株式会社Ntt都科摩 Text processing method and text processing device based on neural network
US11030483B2 (en) 2018-08-07 2021-06-08 International Business Machines Corporation Generating and ordering tags for an image using subgraph of concepts
CN109376270A (en) * 2018-09-26 2019-02-22 青岛聚看云科技有限公司 A kind of data retrieval method and device
CN109635102A (en) * 2018-11-19 2019-04-16 浙江工业大学 Topic model method for improving based on user's interaction
CN109635102B (en) * 2018-11-19 2021-05-11 浙江工业大学 Theme model lifting method based on user interaction
CN110032639B (en) * 2018-12-27 2023-10-31 中国银联股份有限公司 Method, device and storage medium for matching semantic text data with tag
US11586658B2 (en) 2018-12-27 2023-02-21 China Unionpay Co., Ltd. Method and device for matching semantic text data with a tag, and computer-readable storage medium having stored instructions
CN110032639A (en) * 2018-12-27 2019-07-19 中国银联股份有限公司 By the method, apparatus and storage medium of semantic text data and tag match
CN110222331A (en) * 2019-04-26 2019-09-10 平安科技(深圳)有限公司 Lie recognition methods and device, storage medium, computer equipment
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 A kind of news automated tag method based on LDA model
CN112559853A (en) * 2019-09-26 2021-03-26 北京沃东天骏信息技术有限公司 User label generation method and device
CN112559853B (en) * 2019-09-26 2024-01-12 北京沃东天骏信息技术有限公司 User tag generation method and device
CN111079042B (en) * 2019-12-03 2023-08-15 杭州安恒信息技术股份有限公司 Webpage hidden chain detection method and device based on text theme
CN111079042A (en) * 2019-12-03 2020-04-28 杭州安恒信息技术股份有限公司 Webpage hidden link detection method and device based on text theme
CN111160025A (en) * 2019-12-12 2020-05-15 日照睿安信息科技有限公司 Method for actively discovering case keywords based on public security text
CN111507098B (en) * 2020-04-17 2023-03-21 腾讯科技(深圳)有限公司 Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN111507098A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN111695358A (en) * 2020-06-12 2020-09-22 腾讯科技(深圳)有限公司 Method and device for generating word vector, computer storage medium and electronic equipment
CN111695358B (en) * 2020-06-12 2023-08-08 腾讯科技(深圳)有限公司 Method and device for generating word vector, computer storage medium and electronic equipment
CN112287679A (en) * 2020-10-16 2021-01-29 国网江西省电力有限公司电力科学研究院 Structured extraction method and system for text information in scientific and technological project review
CN112732743A (en) * 2021-01-12 2021-04-30 北京久其软件股份有限公司 Data analysis method and device based on Chinese natural language
CN112732743B (en) * 2021-01-12 2023-09-22 北京久其软件股份有限公司 Data analysis method and device based on Chinese natural language
WO2022183991A1 (en) * 2021-03-01 2022-09-09 国家电网有限公司 Document classification method and apparatus, and electronic device

Also Published As

Publication number Publication date
CN106055538B (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN106055538A (en) Automatic extraction method for text labels in combination with theme model and semantic analyses
US9779085B2 (en) Multilingual embeddings for natural language processing
Lin et al. Joint sentiment/topic model for sentiment analysis
Read et al. Weakly supervised techniques for domain-independent sentiment classification
CN103049435B (en) Text fine granularity sentiment analysis method and device
El-Halees Mining opinions in user-generated contents to improve course evaluation
Rajan et al. Automatic classification of Tamil documents using vector space model and artificial neural network
Das et al. Part of speech tagging in odia using support vector machine
CN103473380B (en) A kind of computer version sensibility classification method
Hamza et al. An arabic question classification method based on new taxonomy and continuous distributed representation of words
Haque et al. Literature review of automatic multiple documents text summarization
Qiu et al. Advanced sentiment classification of tibetan microblogs on smart campuses based on multi-feature fusion
Wahbeh et al. Comparative assessment of the performance of three WEKA text classifiers applied to arabic text
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
Jebari et al. A new approach for implicit citation extraction
Hassan et al. Automatic document topic identification using wikipedia hierarchical ontology
Spatiotis et al. Sentiment analysis for the Greek language
Khan et al. Sentiment analysis at sentence level for heterogeneous datasets
Shah et al. An automatic text summarization on Naive Bayes classifier using latent semantic analysis
Zhang et al. Positive, negative, or mixed? Mining blogs for opinions
Alam et al. Bangla news trend observation using lda based topic modeling
Ma et al. Analysis of three methods for web-based opinion mining
Kumar et al. Aspect-Based Sentiment Analysis of Tweets Using Independent Component Analysis (ICA) and Probabilistic Latent Semantic Analysis (pLSA)
Ba-Alwi et al. Arabic text summarization using latent semantic analysis
Singh et al. An Insight into Word Sense Disambiguation Techniques

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai 201203

Patentee after: Daguan Data Co.,Ltd.

Address before: Room 1208, No. 2305 Zuchongzhi Road, Zhangjiang, Pudong New Area, Shanghai, 200000

Patentee before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.