CN106055538A - Automatic extraction method for text labels in combination with theme model and semantic analyses - Google Patents
- Publication number
- CN106055538A CN106055538A CN201610361639.1A CN201610361639A CN106055538A CN 106055538 A CN106055538 A CN 106055538A CN 201610361639 A CN201610361639 A CN 201610361639A CN 106055538 A CN106055538 A CN 106055538A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- theme
- lda
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking (under G—Physics; G06—Computing; Calculating or counting; G06F—Electric digital data processing; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities)
- G06F40/30—Semantic analysis (under G06F40/00—Handling natural language data)
Abstract
The invention relates to an automatic extraction method for text labels that combines a topic model with semantic analysis, and belongs to the technical field of computer applications. The method comprises preprocessing, LDA modeling, context analysis and label extraction. The preprocessing comprises removing low-frequency words, removing stop words and removing markup information, where stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks. The LDA modeling process yields two matrices: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic. Compared with conventional counting-based methods, the method takes the correlations between words in documents into consideration and makes full use of key features of the context information, thereby obtaining the label information of documents.
Description
Technical field
The present invention relates to an automatic extraction method for text labels that combines a topic model with semantic analysis, and belongs to the technical field of computer applications.
Background technology
In the DT (data technology) era, information on the Internet grows explosively, and all kinds of text data emerge in an endless stream, such as diversified news and massive original articles from self-media. Faced with such rich and varied information, people urgently need automated tools to help them find the key information they need accurately and rapidly from a vast sea of information; label extraction arose against this background. Labels are an important way to quickly obtain the key information of a text and to grasp its topic, and have important applications in fields such as information retrieval, natural language processing and intelligent recommendation. Many websites provide users with a function to attach labels to objects of interest (such as pictures, videos, books and films), making it convenient for users to share, manage, collect and retrieve those objects. Fig. 1 (a) and Fig. 1 (b) show labels for books and films on Douban.
The LDA (Latent Dirichlet Allocation) model is a generative model of document topics and is currently the most widely applied probabilistic topic model; it makes more complete text-generation assumptions than other models. Building on PLSA, the LDA model uses a K-dimensional latent random variable obeying a Dirichlet distribution to represent the topic mixture proportions of a document, and thereby simulates the generation process of documents. The document representations and latent semantic structures obtained with LDA have been applied successfully in many related areas of text processing. The LDA model is a generative probabilistic model with a multi-layer structure comprising three layers of nodes: documents, topics and words. Topics over words obey a multinomial distribution, while documents over topics obey a Dirichlet distribution. LDA places a Dirichlet prior on the topic mixture weights θ, generating the parameter θ from a hyperparameter α, i.e., a parameter of a parameter. LDA is an unsupervised machine-learning technique that can be used to identify topic information hidden in large-scale document collections or corpora. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thus converting text information into numerical information that is easy to model. Each topic is represented by a probability distribution over many words, and each document is represented by a probability distribution over several topics.
Current label extraction methods mainly fall into the following two categories, with the following shortcomings:

1. Generating labels from statistical information about the vocabulary of a text, such as TF-IDF (term frequency-inverse document frequency) or mutual information, then sorting the candidates and choosing the several highest-ranked as keywords; this is an unsupervised method. Its advantage is that it is simple and fast and requires no manual annotation. However, it cannot effectively and comprehensively use multiple kinds of information to rank candidate keywords. In addition, it does not take the correlations between words into account, namely that a document is actually composed of several latent topics, each of which is composed of several words.

2. Generating labels by machine-learning methods, also called supervised methods. The main idea is to convert the label extraction problem into a binary classification problem of judging whether each candidate keyword is a label. The document collection must first be annotated with labels and then split into training data and test data, which are used to generate a classification model. Through training, this approach can learn how information from multiple dimensions influences the judgment of keywords, so its results are better. However, annotating the training set is very time-consuming and laborious, and document topics change drastically over time, so keeping the training set annotated at all times is unrealistic.
Summary of the invention
To overcome the above deficiencies, the present invention provides an automatic extraction method for text labels that combines a topic model with semantic analysis.
The technical scheme adopted by the present invention is as follows:
An automatic extraction method for text labels combining a topic model and semantic analysis comprises the following steps:
First step: preprocessing;
Second step: LDA modeling and context analysis;
Third step: label extraction.
The preprocessing of the first step is as follows: if low-frequency words, stop words or markup information are present, the preprocessing includes removing low-frequency words, removing stop words and removing markup information. The low-frequency words are words that occur in only one or two texts; the stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup information is web-page markup or other markup-language text information, including html and css.
The LDA modeling process of the second step is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document; the other is a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
The context analysis includes the following dimensions:
(1) term frequency,
(2) document frequency,
(3) part of speech,
(4) word position,
(5) TF-IDF.
The method of context analysis comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each text segment;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute term frequency, document frequency and TF-IDF using standard methods.
After the preprocessing of the first step, each document forms a feature vector, defined as follows: suppose there are N documents, M words and K topics. The LDA modeling process is that, after the documents are processed by the LDA model, two matrices are obtained: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
The method of the label extraction of the third step is as follows:
Combining the result of the LDA model with the feature quantities obtained from the context analysis of the words, the weight of word w in text d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) denotes the LDA score of word w in document d, Score_word(d, w) denotes the score of word w in document d after context analysis, and α and β denote the weights of the LDA algorithm and the context analysis method, respectively;

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K denotes the number of topics set for the LDA model, Topic(t, d) denotes the probability value of the t-th topic of document d in the "document-topic" matrix, and Word(w, t) denotes the probability value of word w under topic t in the "topic-word" matrix;

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) denotes the TF-IDF value of word w in document d, f(w, d) denotes the term-frequency weight of word w in document d, g(w, d) denotes the document-frequency weight of word w in document d, p(w, d) denotes the weight of the position of the word, and q(w) denotes the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants denoting the weights of TF-IDF, term frequency, document frequency, word position and part of speech, respectively, in the context analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions mapped to different intervals. Through the above calculation, the weight Weight(d, w) of each word w in document d is obtained; the words are sorted by weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
The beneficial effects of the invention are as follows:
Compared with current statistics-based methods, the present invention not only considers the correlations between words in a document, but also makes full use of key features in the context information, finally obtaining the label information of the document.
Accompanying drawing explanation
Fig. 1 (a) schematically illustrates a first example of labels for books and films on Douban;
Fig. 1 (b) schematically illustrates a second example of labels for books and films on Douban;
Fig. 2 schematically illustrates the flow of the present invention;
Fig. 3 schematically illustrates the LDA model processing flow.
Detailed description of the invention
The present invention will be further described below in conjunction with the accompanying drawings:
As shown in Fig. 2, the automatic extraction method for text labels combining a topic model and semantic analysis includes the following steps:
First step: preprocessing;
Second step: LDA modeling and context analysis;
Third step: label extraction.
The preprocessing of the first step is as follows: if low-frequency words, stop words or markup information are present, the preprocessing includes removing low-frequency words, removing stop words and removing markup information. The low-frequency words are words that occur in only one or two texts; the stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup information is web-page markup or other markup-language text information, including html and css.
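The preprocessing step above can be sketched as follows. This is a minimal, stdlib-only illustration, not the patent's implementation: the stop-word list and the `min_doc_count` threshold are assumptions (the patent only says low-frequency words occur in "one to two texts", so words in fewer than three documents are dropped here), and a real system for Chinese text would use a proper segmenter and stop-word list.

```python
import re
from collections import Counter

# Hypothetical English stop-word list for the demo; the patent targets
# auxiliary words, function words and punctuation, listed per language.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def preprocess(docs, min_doc_count=3):
    """Remove markup, stop words and low-frequency words (sketch)."""
    tokenized = []
    for raw in docs:
        text = re.sub(r"<[^>]+>", " ", raw)             # strip html/css tags
        words = re.findall(r"[A-Za-z]+", text.lower())  # crude tokenizer
        tokenized.append([w for w in words if w not in STOP_WORDS])
    # Low-frequency words: appear in fewer than min_doc_count documents.
    df = Counter(w for doc in tokenized for w in set(doc))
    return [[w for w in doc if df[w] >= min_doc_count] for doc in tokenized]
```

For example, feeding it raw HTML snippets removes the tags first, then stop words, then any word that survives in fewer than three documents.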
The LDA modeling process of the second step is: after the preprocessing of the first step, each document forms a feature vector, defined as follows: suppose there are N documents, M words and K topics. As shown in Fig. 3, the LDA modeling process is that, after the documents are processed by the LDA model, two matrices are obtained: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
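To illustrate how the two matrices are used downstream, the sketch below computes a per-document, per-word LDA score by summing, over all K topics, the product of the document's topic probability and the topic's word probability. The matrices here are made-up toy values, not the output of a real LDA run (which would come from, e.g., Gibbs sampling or variational inference):

```python
# Toy "document-topic" (N×K) and "topic-word" (K×M) matrices.
doc_topic = [
    [0.8, 0.2],   # document 0: mostly topic 0
    [0.1, 0.9],   # document 1: mostly topic 1
]
topic_word = [
    [0.5, 0.4, 0.1],  # topic 0 over a 3-word vocabulary
    [0.1, 0.2, 0.7],  # topic 1
]

def score_lda(d, w):
    """Sum over topics t of Topic(t, d) * Word(w, t) for word w in doc d."""
    K = len(topic_word)
    return sum(doc_topic[d][t] * topic_word[t][w] for t in range(K))
```

Word 2, which dominates topic 1, scores much higher in the topic-1-heavy document 1 (0.64) than in document 0 (0.22); for a fixed document, the scores over the whole vocabulary sum to 1, since they form the document's word distribution under the model.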
The context analysis includes the following dimensions:
(1) term frequency, i.e., the number of occurrences of a word within one document;
(2) document frequency, i.e., how many documents in the whole document collection contain the word;
(3) part of speech: nouns and noun phrases carry stronger semantics, so their weights are higher;
(4) word position, i.e., the location of the word; words at different positions, such as in the title, abstract, conclusion or body, receive different weights;
(5) TF-IDF, a statistical measure whose main idea is that the more frequently a word occurs in a document, and the fewer other documents it occurs in, the stronger its ability to distinguish that document, and therefore the larger its weight should be.
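A minimal TF-IDF computation matching the idea in dimension (5) can be sketched as follows. The patent does not fix an exact formula, so this uses one common variant (relative term frequency times a smoothed inverse document frequency); the `+ 1` in the denominator is an assumption to avoid division by zero:

```python
import math

def tf_idf(word, doc, corpus):
    """TF-IDF of `word` in `doc`, where `corpus` is a list of token lists."""
    tf = doc.count(word) / len(doc)                # relative term frequency
    df = sum(1 for d in corpus if word in d)       # document frequency
    idf = math.log(len(corpus) / (1 + df))         # smoothed inverse doc freq
    return tf * idf
```

Under this variant, a word that appears in every document gets an IDF of at most zero, while a word concentrated in few documents gets a positive score, exactly the discriminative behavior the text describes.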
The method of the context analysis involved in the second step comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each text segment, such as title, body, bold text and font size;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute term frequency, document frequency and TF-IDF using standard methods.
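Step 1 above, extracting position information from html tags, can be sketched with the standard-library parser. The tag-to-position mapping is a hypothetical stand-in (the patent only states that title, body, bold text and font size are distinguished), and a production system would handle more tags and malformed markup:

```python
from html.parser import HTMLParser

# Hypothetical mapping from an html tag to a position label.
POSITION_TAGS = {"title": "title", "h1": "title", "b": "bold", "strong": "bold"}

class PositionExtractor(HTMLParser):
    """Record, for each text fragment, where in the page it occurred."""
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.fragments = []   # (position label, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop until the matching open tag is removed (tolerant of bad nesting).
        while self.stack and self.stack.pop() != tag:
            pass

    def handle_data(self, data):
        if data.strip():
            pos = next((POSITION_TAGS[t] for t in reversed(self.stack)
                        if t in POSITION_TAGS), "body")
            self.fragments.append((pos, data.strip()))

p = PositionExtractor()
p.feed("<html><title>LDA labels</title><p>Plain text with <b>bold words</b>.</p></html>")
```

After `feed`, `p.fragments` pairs each text run with a label such as "title", "bold" or "body", which a later step can turn into the discrete position weight.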
The label extraction method of the third step is:
Combining the result of the LDA model with the feature quantities obtained from the context analysis of the words, the weight of word w in text d is:

Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),

where Score_LDA(d, w) denotes the LDA score of word w in document d, Score_word(d, w) denotes the score of word w in document d after context analysis, and α and β denote the weights of the LDA algorithm and the context analysis method, respectively;

Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),

where K denotes the number of topics set for the LDA model, Topic(t, d) denotes the probability value of the t-th topic of document d in the "document-topic" matrix, and Word(w, t) denotes the probability value of word w under topic t in the "topic-word" matrix;

Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),

where TfIdf(w, d) denotes the TF-IDF value of word w in document d, f(w, d) denotes the term-frequency weight of word w in document d, g(w, d) denotes the document-frequency weight of word w in document d, p(w, d) denotes the weight of the position of the word, and q(w) denotes the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants denoting the weights of TF-IDF, term frequency, document frequency, word position and part of speech, respectively, in the context analysis algorithm.

f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions mapped to different intervals. Through the above calculation, the weight Weight(d, w) of each word w in document d is obtained; the words are sorted by weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
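The final weighting-and-ranking step can be sketched as below. The coefficients and the per-word component scores are illustrative stand-ins, not values from the patent, which leaves the constants as design parameters:

```python
# Assumed weights for the LDA score and the context-analysis score.
ALPHA, BETA = 0.6, 0.4

def weight(score_lda, score_word):
    """Linear combination of the two component scores for one word."""
    return ALPHA * score_lda + BETA * score_word

def top_labels(candidates, k=3):
    """Sort one document's candidate words by weight, keep the top k.

    `candidates` is a list of (word, score_lda, score_word) triples.
    """
    ranked = sorted(candidates, key=lambda c: weight(c[1], c[2]), reverse=True)
    return [word for word, _, _ in ranked[:k]]

# Made-up candidate triples for a single document.
cands = [("topic", 0.9, 0.7), ("model", 0.8, 0.9), ("the", 0.1, 0.2),
         ("label", 0.7, 0.8)]
```

Calling `top_labels(cands)` here ranks "model", "topic" and "label" ahead of the low-scoring "the", mirroring how the method keeps only the highest-weighted words or phrases as the document's labels.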
Compared with current statistics-based methods, the present invention not only considers the correlations between words in a document, but also makes full use of key features in the context information, finally obtaining the label information of the document.
For those of ordinary skill in the art, the present invention is only described exemplarily through specific embodiments. Obviously, the implementation of the present invention is not limited by the above description; as long as insubstantial improvements of various kinds are made using the method concept and technical scheme of the present invention, or the concept and technical scheme of the present invention are directly applied to other occasions without improvement, they all fall within the protection scope of the present invention.
Claims (3)
1. An automatic extraction method for text labels combining a topic model and semantic analysis, characterized by comprising the following steps:
First step: preprocessing; if low-frequency words, stop words or markup information are present, the preprocessing includes removing low-frequency words, removing stop words and removing markup information; the low-frequency words are words that occur in only one or two texts; the stop words are auxiliary words carrying hardly any information, words reflecting sentence grammatical structure, and all function words and punctuation marks; the markup information is web-page markup or other markup-language text information, including html and css;
Second step: LDA modeling and context analysis; the LDA modeling process is: after the documents are processed by the LDA model, two matrices are obtained: one is an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document; the other is a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic;
the context analysis includes the following dimensions:
(1) term frequency,
(2) document frequency,
(3) part of speech,
(4) word position,
(5) TF-IDF;
the method of context analysis comprises the following steps:
1. according to the html tag information of the text, obtain the position information of each text segment;
2. perform word segmentation and part-of-speech tagging on the text to obtain each individual word and its part-of-speech information;
3. compute term frequency, document frequency and TF-IDF using standard methods;
Third step: label extraction.
2. The automatic extraction method for text labels combining a topic model and semantic analysis according to claim 1, characterized in that: in the second step, after preprocessing, each document forms a feature vector; suppose there are N documents, M words and K topics; the LDA modeling process is that, after the documents are processed by the LDA model, two matrices are obtained: an N×K "document-topic" matrix, each element of which corresponds to the latent topic distribution of a document, and a K×M "topic-word" matrix, each element of which corresponds to the word distribution of a topic.
3. The automatic extraction method for text labels combining a topic model and semantic analysis according to claim 1, characterized in that in the third step the method of label extraction is as follows:
Combining the result of the LDA model with the feature quantities obtained from the context analysis of the words, the weight of word w in text d is:
Weight(d, w) = α · Score_LDA(d, w) + β · Score_word(d, w),
where Score_LDA(d, w) denotes the LDA score of word w in document d, Score_word(d, w) denotes the score of word w in document d after context analysis, and α and β denote the weights of the LDA algorithm and the context analysis method, respectively;
Score_LDA(d, w) = Σ_{t=1}^{K} Topic(t, d) · Word(w, t),
where K denotes the number of topics set for the LDA model, Topic(t, d) denotes the probability value of the t-th topic of document d in the "document-topic" matrix, and Word(w, t) denotes the probability value of word w under topic t in the "topic-word" matrix;
Score_word(d, w) = ρ · TfIdf(w, d) + γ · f(w, d) + ξ · g(w, d) + μ · p(w, d) + σ · q(w),
where TfIdf(w, d) denotes the TF-IDF value of word w in document d, f(w, d) denotes the term-frequency weight of word w in document d, g(w, d) denotes the document-frequency weight of word w in document d, p(w, d) denotes the weight of the position of the word, and q(w) denotes the part-of-speech weight of the word; ρ, γ, ξ, μ and σ are constants denoting the weights of TF-IDF, term frequency, document frequency, word position and part of speech, respectively, in the context analysis algorithm;
f(w, d), g(w, d), p(w, d) and q(w) are all discrete functions mapped to different intervals; through the above calculation, the weight Weight(d, w) of each word w in document d is obtained, the words are sorted by weight from high to low, and the several words or phrases with the largest weights are taken as the labels of the document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610361639.1A CN106055538B (en) | 2016-05-26 | 2016-05-26 | The automatic abstracting method of the text label that topic model and semantic analysis combine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106055538A true CN106055538A (en) | 2016-10-26 |
CN106055538B CN106055538B (en) | 2019-03-08 |
Family
ID=57175892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610361639.1A Active CN106055538B (en) | 2016-05-26 | 2016-05-26 | The automatic abstracting method of the text label that topic model and semantic analysis combine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106055538B (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502988A (en) * | 2016-11-02 | 2017-03-15 | 深圳市空谷幽兰人工智能科技有限公司 | The method and apparatus that a kind of objective attribute target attribute is extracted |
CN106649844A (en) * | 2016-12-30 | 2017-05-10 | 浙江工商大学 | Unstructured text data enhanced distributed large-scale data dimension extracting method |
CN107169021A (en) * | 2017-04-07 | 2017-09-15 | 华为机器有限公司 | Method and apparatus for predicting application function label |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | 东软集团股份有限公司 | A kind of document subject matter determines method and device |
CN108304509A (en) * | 2018-01-19 | 2018-07-20 | 华南理工大学 | A kind of comment spam filter method for indicating mutually to learn based on the multidirectional amount of text |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
CN108959431A (en) * | 2018-06-11 | 2018-12-07 | 中国科学院上海高等研究院 | Label automatic generation method, system, computer readable storage medium and equipment |
CN109213988A (en) * | 2017-06-29 | 2019-01-15 | 武汉斗鱼网络科技有限公司 | Barrage subject distillation method, medium, equipment and system based on N-gram model |
CN109376270A (en) * | 2018-09-26 | 2019-02-22 | 青岛聚看云科技有限公司 | A kind of data retrieval method and device |
CN109635102A (en) * | 2018-11-19 | 2019-04-16 | 浙江工业大学 | Topic model method for improving based on user's interaction |
CN110032639A (en) * | 2018-12-27 | 2019-07-19 | 中国银联股份有限公司 | By the method, apparatus and storage medium of semantic text data and tag match |
CN110222331A (en) * | 2019-04-26 | 2019-09-10 | 平安科技(深圳)有限公司 | Lie recognition methods and device, storage medium, computer equipment |
CN110347977A (en) * | 2019-06-28 | 2019-10-18 | 太原理工大学 | A kind of news automated tag method based on LDA model |
CN110380954A (en) * | 2017-04-12 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Data sharing method and device, storage medium and electronic device |
CN110727794A (en) * | 2018-06-28 | 2020-01-24 | 上海传漾广告有限公司 | System and method for collecting and analyzing network semantics and summarizing and analyzing content |
CN110781659A (en) * | 2018-07-11 | 2020-02-11 | 株式会社Ntt都科摩 | Text processing method and text processing device based on neural network |
CN111079042A (en) * | 2019-12-03 | 2020-04-28 | 杭州安恒信息技术股份有限公司 | Webpage hidden link detection method and device based on text theme |
CN111160025A (en) * | 2019-12-12 | 2020-05-15 | 日照睿安信息科技有限公司 | Method for actively discovering case keywords based on public security text |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111695358A (en) * | 2020-06-12 | 2020-09-22 | 腾讯科技(深圳)有限公司 | Method and device for generating word vector, computer storage medium and electronic equipment |
CN112287679A (en) * | 2020-10-16 | 2021-01-29 | 国网江西省电力有限公司电力科学研究院 | Structured extraction method and system for text information in scientific and technological project review |
CN112559853A (en) * | 2019-09-26 | 2021-03-26 | 北京沃东天骏信息技术有限公司 | User label generation method and device |
CN112732743A (en) * | 2021-01-12 | 2021-04-30 | 北京久其软件股份有限公司 | Data analysis method and device based on Chinese natural language |
US11030483B2 (en) | 2018-08-07 | 2021-06-08 | International Business Machines Corporation | Generating and ordering tags for an image using subgraph of concepts |
WO2022183991A1 (en) * | 2021-03-01 | 2022-09-09 | 国家电网有限公司 | Document classification method and apparatus, and electronic device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080319974A1 (en) * | 2007-06-21 | 2008-12-25 | Microsoft Corporation | Mining geographic knowledge using a location aware topic model |
US20120041953A1 (en) * | 2010-08-16 | 2012-02-16 | Microsoft Corporation | Text mining of microblogs using latent topic labels |
CN103164463A (en) * | 2011-12-16 | 2013-06-19 | 国际商业机器公司 | Method and device for recommending labels |
CN103365978A (en) * | 2013-07-01 | 2013-10-23 | 浙江大学 | Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model |
CN103390051A (en) * | 2013-07-25 | 2013-11-13 | 南京邮电大学 | Topic detection and tracking method based on microblog data |
CN103425710A (en) * | 2012-05-25 | 2013-12-04 | 北京百度网讯科技有限公司 | Subject-based searching method and device |
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | Theme-vocabulary distribution establishing method and system based on document segmenting |
CN103778207A (en) * | 2014-01-15 | 2014-05-07 | 杭州电子科技大学 | LDA-based news comment topic digging method |
CN103942340A (en) * | 2014-05-09 | 2014-07-23 | 电子科技大学 | Microblog user interest recognizing method based on text mining |
CN105608166A (en) * | 2015-12-18 | 2016-05-25 | Tcl集团股份有限公司 | Label extracting method and device |
Non-Patent Citations (4)
Title |
---|
NIDAA GHALIB ALI et al.: "A Hybrid of Statistical and Machine Learning Methods for Arabic Keyphrase Extraction", Asian Journal of Applied Sciences, 2015 *
LIU NA et al.: "Multi-document automatic summarization algorithm based on important LDA topics", Journal of Frontiers of Computer Science and Technology *
LIU MUFAN: "Research on spam web page detection methods based on topic and semantics", China Master's Theses Full-text Database, Information Science and Technology *
SHI JING et al.: "Topic word extraction method based on the LDA model", Computer Engineering *
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502988A (en) * | 2016-11-02 | 2017-03-15 | Shenzhen Konggu Youlan Artificial Intelligence Technology Co., Ltd. | Method and apparatus for extracting target attributes |
CN106502988B (en) * | 2016-11-02 | 2019-06-07 | Guangdong Huihe Technology Development Co., Ltd. | Method and apparatus for extracting target attributes |
CN106649844A (en) * | 2016-12-30 | 2017-05-10 | Zhejiang Gongshang University | Unstructured-text-enhanced distributed large-scale data dimension extraction method |
CN106649844B (en) * | 2016-12-30 | 2019-10-18 | Zhejiang Gongshang University | Unstructured-text-enhanced distributed large-scale data dimension extraction method |
CN107169021A (en) * | 2017-04-07 | 2017-09-15 | Huawei Machine Co., Ltd. | Method and apparatus for predicting application function label |
CN110380954A (en) * | 2017-04-12 | 2019-10-25 | Tencent Technology (Shenzhen) Co., Ltd. | Data sharing method and device, storage medium and electronic device |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | Neusoft Corporation | Document topic determination method and device |
CN109213988B (en) * | 2017-06-29 | 2022-06-21 | Wuhan Douyu Network Technology Co., Ltd. | Barrage topic extraction method, medium, device and system based on N-gram model |
CN109213988A (en) * | 2017-06-29 | 2019-01-15 | Wuhan Douyu Network Technology Co., Ltd. | Barrage topic extraction method, medium, device and system based on N-gram model |
CN108304509A (en) * | 2018-01-19 | 2018-07-20 | South China University of Technology | Spam comment filtering method based on mutual learning of multiple text vector representations |
CN108304509B (en) * | 2018-01-19 | 2021-12-21 | South China University of Technology | Spam comment filtering method based on mutual learning of multiple text vector representations |
CN108536679B (en) * | 2018-04-13 | 2022-05-20 | Tencent Technology (Chengdu) Co., Ltd. | Named entity recognition method, device, equipment and computer-readable storage medium |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | Tencent Technology (Chengdu) Co., Ltd. | Named entity recognition method, device, equipment and computer-readable storage medium |
CN108959431A (en) * | 2018-06-11 | 2018-12-07 | Shanghai Advanced Research Institute, Chinese Academy of Sciences | Automatic label generation method, system, computer-readable storage medium and device |
CN110727794A (en) * | 2018-06-28 | 2020-01-24 | Shanghai Chuanyang Advertising Co., Ltd. | System and method for collecting and analyzing network semantics and summarizing and analyzing content |
CN110781659A (en) * | 2018-07-11 | 2020-02-11 | NTT DOCOMO, Inc. | Neural-network-based text processing method and text processing device |
US11030483B2 (en) | 2018-08-07 | 2021-06-08 | International Business Machines Corporation | Generating and ordering tags for an image using subgraph of concepts |
CN109376270A (en) * | 2018-09-26 | 2019-02-22 | Qingdao Jukanyun Technology Co., Ltd. | Data retrieval method and device |
CN109635102A (en) * | 2018-11-19 | 2019-04-16 | Zhejiang University of Technology | Topic model improvement method based on user interaction |
CN109635102B (en) * | 2018-11-19 | 2021-05-11 | Zhejiang University of Technology | Topic model improvement method based on user interaction |
CN110032639B (en) * | 2018-12-27 | 2023-10-31 | China UnionPay Co., Ltd. | Method, device and storage medium for matching semantic text data with a tag |
US11586658B2 (en) | 2018-12-27 | 2023-02-21 | China Unionpay Co., Ltd. | Method and device for matching semantic text data with a tag, and computer-readable storage medium having stored instructions |
CN110032639A (en) * | 2018-12-27 | 2019-07-19 | China UnionPay Co., Ltd. | Method, device and storage medium for matching semantic text data with a tag |
CN110222331A (en) * | 2019-04-26 | 2019-09-10 | Ping An Technology (Shenzhen) Co., Ltd. | Lie recognition method and device, storage medium and computer device |
CN110347977A (en) * | 2019-06-28 | 2019-10-18 | Taiyuan University of Technology | Automatic news tagging method based on LDA model |
CN112559853A (en) * | 2019-09-26 | 2021-03-26 | Beijing Wodong Tianjun Information Technology Co., Ltd. | User tag generation method and device |
CN112559853B (en) * | 2019-09-26 | 2024-01-12 | Beijing Wodong Tianjun Information Technology Co., Ltd. | User tag generation method and device |
CN111079042B (en) * | 2019-12-03 | 2023-08-15 | Hangzhou Anheng Information Technology Co., Ltd. | Webpage hidden link detection method and device based on text topics |
CN111079042A (en) * | 2019-12-03 | 2020-04-28 | Hangzhou Anheng Information Technology Co., Ltd. | Webpage hidden link detection method and device based on text topics |
CN111160025A (en) * | 2019-12-12 | 2020-05-15 | Rizhao Ruian Information Technology Co., Ltd. | Method for actively discovering case keywords based on public security texts |
CN111507098B (en) * | 2020-04-17 | 2023-03-21 | Tencent Technology (Shenzhen) Co., Ltd. | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | Tencent Technology (Shenzhen) Co., Ltd. | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111695358A (en) * | 2020-06-12 | 2020-09-22 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for generating word vectors, computer storage medium and electronic equipment |
CN111695358B (en) * | 2020-06-12 | 2023-08-08 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for generating word vectors, computer storage medium and electronic equipment |
CN112287679A (en) * | 2020-10-16 | 2021-01-29 | Electric Power Research Institute of State Grid Jiangxi Electric Power Co., Ltd. | Structured extraction method and system for text information in science and technology project review |
CN112732743A (en) * | 2021-01-12 | 2021-04-30 | Beijing Jiuqi Software Co., Ltd. | Data analysis method and device based on Chinese natural language |
CN112732743B (en) * | 2021-01-12 | 2023-09-22 | Beijing Jiuqi Software Co., Ltd. | Data analysis method and device based on Chinese natural language |
WO2022183991A1 (en) * | 2021-03-01 | 2022-09-09 | State Grid Corporation of China | Document classification method and apparatus, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN106055538B (en) | 2019-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106055538A (en) | Automatic extraction method for text labels in combination with theme model and semantic analyses | |
US9779085B2 (en) | Multilingual embeddings for natural language processing | |
Lin et al. | Joint sentiment/topic model for sentiment analysis | |
Read et al. | Weakly supervised techniques for domain-independent sentiment classification | |
CN103049435B (en) | Text fine granularity sentiment analysis method and device | |
El-Halees | Mining opinions in user-generated contents to improve course evaluation | |
Rajan et al. | Automatic classification of Tamil documents using vector space model and artificial neural network | |
Das et al. | Part of speech tagging in odia using support vector machine | |
CN103473380B (en) | A kind of computer version sensibility classification method | |
Hamza et al. | An arabic question classification method based on new taxonomy and continuous distributed representation of words | |
Haque et al. | Literature review of automatic multiple documents text summarization | |
Qiu et al. | Advanced sentiment classification of tibetan microblogs on smart campuses based on multi-feature fusion | |
Wahbeh et al. | Comparative assessment of the performance of three WEKA text classifiers applied to arabic text | |
Sadr et al. | Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms | |
Jebari et al. | A new approach for implicit citation extraction | |
Hassan et al. | Automatic document topic identification using wikipedia hierarchical ontology | |
Spatiotis et al. | Sentiment analysis for the Greek language | |
Khan et al. | Sentiment analysis at sentence level for heterogeneous datasets | |
Shah et al. | An automatic text summarization on Naive Bayes classifier using latent semantic analysis | |
Zhang et al. | Positive, negative, or mixed? Mining blogs for opinions | |
Alam et al. | Bangla news trend observation using lda based topic modeling | |
Ma et al. | Analysis of three methods for web-based opinion mining | |
Kumar et al. | Aspect-Based Sentiment Analysis of Tweets Using Independent Component Analysis (ICA) and Probabilistic Latent Semantic Analysis (pLSA) | |
Ba-Alwi et al. | Arabic text summarization using latent semantic analysis | |
Singh et al. | An Insight into Word Sense Disambiguation Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai 201203
Patentee after: Daguan Data Co., Ltd.
Address before: Room 1208, No. 2305 Zuchongzhi Road, Zhangjiang, Pudong New Area, Shanghai 200000
Patentee before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co., Ltd.