CN105138538A

CN105138538A - Cross-domain knowledge discovery-oriented topic mining method

Info

Publication number: CN105138538A
Application number: CN201510398749.0A
Authority: CN
Inventors: 靳晓明; 韩春晖
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2015-07-08
Filing date: 2015-07-08
Publication date: 2015-12-09
Anticipated expiration: 2035-07-08
Also published as: CN105138538B

Abstract

The invention discloses a topic mining method oriented to cross-domain knowledge discovery, comprising: constructing a source domain text set and a target domain set; extracting latent class feature information and latent semantic information from the source domain text set; extracting latent feature information and latent semantic information from the target domain set; automatically clustering the texts in the target domain set into the latent style components; modeling the semantic information of the target domain set in the latent topic components; and modeling the latent topic components of the semantic information of the target domain set. The method has the following advantages: features of source domain texts are mined automatically and can be used to identify and classify target domain texts; the text feature information of the source domain is accurately transferred into the clustering of target domain texts; and text content in the target domain that differs from the source domain is found automatically.

Description

A topic mining method oriented to cross-domain knowledge discovery

Technical field

The invention belongs to the technical field of computer text mining, relates to topic model technology, and specifically relates to a topic mining method oriented to cross-domain knowledge discovery.

Background

With the development of the Internet, the emergence of more and more online platforms has caused text resources to grow explosively, and the huge data volume and complicated analysis process often make it very difficult for users to obtain the knowledge they need. For example, when people want to find valuable local news events or hot topics in a social network, they can only rely on keyword search to find the information they are looking for. Such retrieval is often very inefficient: people usually have to try a large number of search keywords, or browse a large number of search results, before they can find the information they want. To improve the efficiency with which users obtain information, text mining technology has emerged to help people organize and manage text information. The main text mining technologies at present include traditional topic model technology, supervised topic model technology, and cross-domain topic model technology.

These technologies each have advantages and disadvantages, summarized as follows:

1. Traditional topic mining technology models text information with a mixture model based on probability statistics, so that the model can automatically mine the latent semantic information in the text and users can quickly understand the content involved. With a topic model, one can obtain not only the information mainly covered by a text collection but also the content information of each document. Common topic models include the Probabilistic Latent Semantic Analysis (PLSA) model [1] and the Latent Dirichlet Allocation (LDA) model [2]. However, such technology considers only the text itself in the collection and cannot use other useful information, such as the class information of the texts.

2. Supervised topic mining technology builds on traditional topic mining by fusing the class information of texts into the topic mining process, so that texts with the same features contain the same topics as far as possible, which improves the topic mining capability. These topic models fuse prior knowledge into unsupervised topic mining in different ways. Work that incorporates document-level prior knowledge into topic models includes the supervised Latent Dirichlet Allocation (sLDA) model [3] proposed by Blei, which incorporates the class label of a text into the topic model as a response variable and models it with a generalized linear model. The class label information it introduces improves the topic feature representation of texts and thus better serves classification and regression problems. The labeled Latent Dirichlet Allocation (lLDA) model [4] proposed by Ramage directly establishes a one-to-one mapping between latent topics and document class labels, which effectively solves the attribution problem in multi-label text collections. However, this kind of topic mining method places higher requirements on the text data and sometimes even requires more human effort to provide the necessary supervision information.

3. Cross-domain text mining technology mainly addresses the automatic extraction of similarities and differences in the latent semantic information of texts from different domains. The main work includes the Cross-Collection Mixture Model (ccMix) [5] proposed by Zhai, which can mine the topic information that occurs jointly across different domains and, for these common topics, find the part that is shared and the part that is specific to each domain. On the basis of this work, Paul proposed the cross-collection Latent Dirichlet Allocation (ccLDA) model [6], which moves ccMix from the PLSA framework to the LDA framework, giving the model the advantage of LDA, namely that it can perform inference on newly arriving texts. In addition, this model reduces the number of parameters in ccMix, so that the model parameters do not grow with the amount of text data and the model can better mine text according to its inherent features. However, cross-domain topic models cannot use the information of different domains to help users screen for the information they need.
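For reference, the mixture assumption shared by the PLSA and LDA models mentioned in item 1 can be written as below. The notation (word $w$, document $d$, topic variable $z$, number of topics $K$) is chosen here for illustration and does not come from this document.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

In LDA the per-document topic proportions $p(z \mid d)$ are additionally drawn from a Dirichlet prior, which is what makes inference on newly arriving documents possible.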

References

[1] Hofmann T. Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1999: 50-57.

[2] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. The Journal of Machine Learning Research, 2003, 3: 993-1022.

[3] Mcauliffe J D, Blei D M. Supervised topic models. Advances in neural information processing systems. 2008: 121-128.

[4] Ramage D, Hall D, Nallapati R, et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 2009: 248-256.

[5] Zhai C X, Velivelli A, Yu B. A cross-collection mixture model for comparative text mining. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004: 743-748.

[6] Paul M. Cross-collection topic models: Automatically comparing and contrasting text. Urbana, 2009, 51: 61801.

Summary of the invention

The present invention aims to solve at least one of the above technical problems.

To this end, the object of the invention is to propose a topic mining method oriented to cross-domain knowledge discovery.

To achieve this object, an embodiment of one aspect of the invention discloses a topic mining method oriented to cross-domain knowledge discovery, comprising the following steps: A: for a given text data set with class labels, constructing a source domain text set; for a given text data set without class labels, constructing a target domain set; B: extracting the latent class feature information of each class of text from the source domain text set and modeling the latent class feature information in latent style components; extracting the latent semantic information of the text from the source domain text set and modeling it in latent topic components; C: extracting the latent feature information and latent semantic information of all texts from the target domain set; D: according to the latent style components and the latent feature information extracted from the target domain set, automatically clustering the texts in the target domain set into the latent style components; according to the latent topic components and the latent feature information extracted from the target domain set, modeling the semantic information of the target domain set in the latent topic components; and E: modeling the latent topic components of the semantic information of the target domain set.
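Step A can be pictured with a minimal Python sketch. It assumes, purely for illustration, that the input is a list of `(text, label)` pairs in which `None` stands for "no class label"; the record layout and the name `build_domain_sets` are hypothetical and not part of the described method, which receives the two data sets separately.

```python
from typing import List, Optional, Tuple

# Hypothetical record type: (text, class label); None means "no class label".
Record = Tuple[str, Optional[str]]


def build_domain_sets(
    records: List[Record],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split records into a labeled source set and an unlabeled target set."""
    source_set = [(text, label) for text, label in records if label is not None]
    target_set = [text for text, label in records if label is None]
    return source_set, target_set


records: List[Record] = [
    ("stocks fell sharply today", "finance"),
    ("the home team won the final", "sports"),
    ("a new phone was released this week", None),
]
source_set, target_set = build_domain_sets(records)
print(len(source_set), len(target_set))
```

Keeping the labels attached to the source texts matters because step B learns the style components under supervision from exactly those labels.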

The topic mining method oriented to cross-domain knowledge discovery according to the embodiment of the invention automatically mines the text features of the source domain, and these features can be used to identify and classify texts in the target domain. The text feature information of the source domain is effectively transferred into the clustering of target domain texts, making the clustering more accurate. Content in the target text that is similar to the source text is filtered out automatically: the method models the content of both source domain and target domain texts and distinguishes the similarities and dissimilarities of the content, thereby finding the text content in the target domain that differs from the source domain.

In addition, the topic mining method oriented to cross-domain knowledge discovery according to the above embodiment of the invention may also have the following additional technical features:

Further, the method also comprises, between step A and step B: AB: preprocessing the text data in the source domain text set and the target domain set.

Further, the preprocessing comprises stop-word removal and stemming.
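The preprocessing step (stop-word removal followed by stemming) could look roughly like the sketch below. The stop-word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins; the document does not name a specific stop-word list or stemming algorithm, and a real system would more likely use an established stemmer.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}
# Hypothetical suffix list, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The traders reacted to rising markets")
print(tokens)
```

The same function would be applied to both the source domain text set and the target domain set so that the two share one vocabulary.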

Further, in step D, modeling the semantic information of the target domain set in the latent topic components, according to the latent topic components and the latent feature information extracted from the target domain set, further comprises: D1: filtering out content in the target domain set that is similar to the source text, and retaining the content unique to the target domain set; and D2: modeling the unique content in the latent topic components.

Further, steps B through D are iterated until both the latent style components and the latent topic components converge.
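The iteration of steps B through D until both sets of components converge can be sketched as a generic loop. The document does not say how convergence is measured, so the maximum absolute parameter change and the tolerance `tol` below are assumptions, and `update` is a hypothetical callable standing in for one pass of steps B to D.

```python
from typing import Callable, List, Tuple

Params = List[float]
UpdateFn = Callable[[Params, Params], Tuple[Params, Params]]


def max_abs_change(old: Params, new: Params) -> float:
    """Largest absolute difference between two parameter vectors."""
    return max(abs(a - b) for a, b in zip(old, new))


def iterate_until_converged(
    style: Params,
    topic: Params,
    update: UpdateFn,
    tol: float = 1e-6,
    max_iters: int = 1000,
) -> Tuple[Params, Params, int]:
    """Repeat `update` until style AND topic parameters both stop changing."""
    for iteration in range(1, max_iters + 1):
        new_style, new_topic = update(style, topic)
        converged = (
            max_abs_change(style, new_style) < tol
            and max_abs_change(topic, new_topic) < tol
        )
        style, topic = new_style, new_topic
        if converged:
            return style, topic, iteration
    return style, topic, max_iters


def toy_update(style: Params, topic: Params) -> Tuple[Params, Params]:
    """Toy stand-in for one B-to-D pass: style tends to 1.0, topic to 0.0."""
    return [0.5 * s + 0.5 for s in style], [0.5 * t for t in topic]


style, topic, n_iters = iterate_until_converged([0.0], [1.0], toy_update)
print(style, topic, n_iters)
```

Requiring both conditions in the same iteration mirrors the wording that the style components and the topic components must all converge.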

Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawing, in which:

Fig. 1 is a flowchart of one embodiment of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the accompanying drawing, in which identical or similar reference numerals denote identical or similar elements, or elements having identical or similar functions, throughout. The embodiments described below with reference to the drawing are exemplary, serve only to explain the invention, and should not be construed as limiting it.

In the description of the invention, it should be understood that the orientations or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the drawing. They are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the invention. In addition, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

In the description of the invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" should be interpreted broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; or it may be internal communication between two elements. A person of ordinary skill in the art can understand the specific meanings of these terms in the invention according to the specific situation.

These and other aspects of the embodiments of the invention will become clear with reference to the following description and the drawing. The description and drawing specifically disclose some particular implementations of the embodiments to show some of the ways in which the principles of the embodiments may be practiced, but it should be understood that the scope of the embodiments is not limited thereto. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.

A topic mining method oriented to cross-domain knowledge discovery according to an embodiment of the invention is described below with reference to the drawing.

Fig. 1 is a flowchart of one embodiment of the invention; refer to Fig. 1.

(1) Extracting the latent class features of source domain text

On the basis of a traditional topic model, the model introduces a new mixture model for each document to model the latent class features of the source domain; this is referred to as style. For all source domain documents, the latent components of the style mixture model correspond one-to-one with the documents' class labels, and the model is learned under supervision.
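The one-to-one correspondence between style components and class labels can be illustrated with a small Python sketch. It only shows the bookkeeping, not the learning itself; the function names and the choice to order components alphabetically by label are assumptions made for the example.

```python
from typing import Dict, List


def build_style_index(labels: List[str]) -> Dict[str, int]:
    """Give each distinct class label its own style component index."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def style_assignments(labels: List[str], style_index: Dict[str, int]) -> List[int]:
    """Under supervision, a source document's style component is fixed by its label."""
    return [style_index[label] for label in labels]


source_labels = ["sports", "finance", "sports"]
style_index = build_style_index(source_labels)
assignments = style_assignments(source_labels, style_index)
print(style_index, assignments)
```

Because the mapping is a bijection, the number of style components equals the number of distinct class labels in the source domain.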

(2) Extracting the latent semantic information of source domain text

The model retains the basic assumption of traditional topic models that a document is associated with a mixture of a series of topics. Each latent component that models class features has a series of latent topic components associated with it, so that the non-feature content information in a source domain text is modeled by the topics corresponding to its class. To distinguish the content of source domain texts from that of the target domain, the topics that model source domain text content are marked.

(3) Extracting the latent features of target domain text

As with source domain text, the model also introduces a new mixture model for each target domain document to model the latent features of the target domain. Unlike source domain text, the distribution of a target domain text over the latent components of the mixture model is learned without supervision.

(4) Extracting the latent semantic information of target domain text

The latent semantic information of target domain text is modeled in the same way as that of source domain text; the difference is that the topics assigned to target domain text content do not need to be marked.

(5) Clustering target domain text

Because the style mixture model newly introduced for the source domain and the target domain is shared within the model, target domain texts can be clustered according to their distribution over the style mixture model and the class label corresponding to each latent component of the style mixture model.
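One simple way to read this clustering step is sketched below: each target document is assigned the class label of the style component on which its learned distribution puts the most weight. Taking the arg-max is an assumption made here; the text only states that clustering uses the distribution and the component labels. The function name and sample numbers are hypothetical.

```python
from typing import List, Sequence


def cluster_target_documents(
    style_distributions: Sequence[Sequence[float]],
    style_labels: Sequence[str],
) -> List[str]:
    """Assign each document the label of its highest-weight style component."""
    clusters = []
    for distribution in style_distributions:
        best = max(range(len(distribution)), key=distribution.__getitem__)
        clusters.append(style_labels[best])
    return clusters


# Hypothetical learned distributions over two style components.
style_labels = ["finance", "sports"]
style_distributions = [[0.8, 0.2], [0.1, 0.9], [0.55, 0.45]]
clusters = cluster_target_documents(style_distributions, style_labels)
print(clusters)
```

This works only because source and target documents share the same style components, so a component index means the same class in both domains.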

(6) Screening newly discovered domain knowledge

For the topic information corresponding to each component of the style mixture model, all topics involved in source domain texts were marked in step (2). Consequently, all topics that have not been marked relate to content covered only by target domain texts, so these unmarked topics are the newly discovered domain knowledge.
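The screening rule amounts to a set difference per style component: topics that were marked while modeling source text are dropped, and whatever remains is reported as new. The sketch below assumes topics are identified by string ids and that the marked set is available after training; both the ids and the function name are hypothetical.

```python
from typing import Dict, List, Set


def screen_new_topics(
    topics_by_style: Dict[str, List[str]],
    marked_source_topics: Set[str],
) -> Dict[str, List[str]]:
    """Keep, per style component, only the topics never marked by source text."""
    return {
        style: [topic for topic in topics if topic not in marked_source_topics]
        for style, topics in topics_by_style.items()
    }


# Hypothetical topic ids grouped by the style component they belong to.
topics_by_style = {"finance": ["f0", "f1", "f2"], "sports": ["s0", "s1"]}
marked_source_topics = {"f0", "f1", "s0", "s1"}
new_knowledge = screen_new_topics(topics_by_style, marked_source_topics)
print(new_knowledge)
```

An empty list for a style component simply means the target domain contributed no content beyond what the source domain already covers for that class.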

In addition, other aspects and effects of the topic mining method oriented to cross-domain knowledge discovery according to the embodiment of the invention are known to those skilled in the art and are not repeated here, in order to reduce redundancy.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims

1. A topic mining method oriented to cross-domain knowledge discovery, characterized by comprising the following steps:

A: for a given text data set with class labels, constructing a source domain text set; for a given text data set without class labels, constructing a target domain set;

B: extracting the latent class feature information of each class of text from the source domain text set and modeling the latent class feature information in latent style components; extracting the latent semantic information of the text from the source domain text set and modeling it in latent topic components;

C: extracting the latent feature information and latent semantic information of the text from the target domain set;

D: according to the latent style components and the latent feature information extracted from the target domain set, automatically clustering the texts in the target domain set into the latent style components; according to the latent topic components and the latent feature information extracted from the target domain set, modeling the semantic information of the target domain set in the latent topic components; and

E: modeling the latent topic components of the semantic information of the target domain set.

2. The topic mining method oriented to cross-domain knowledge discovery according to claim 1, characterized by further comprising, between step A and step B:

AB: preprocessing the text data in the source domain text set and the target domain set.

3. The topic mining method oriented to cross-domain knowledge discovery according to claim 2, characterized in that the preprocessing comprises stop-word removal and stemming.

4. The topic mining method oriented to cross-domain knowledge discovery according to claim 1, characterized in that, in step D, modeling the semantic information of the target domain set in the latent topic components, according to the latent topic components and the latent feature information extracted from the target domain set, further comprises:

D1: filtering out content in the target domain set that is similar to the source text, and retaining the content unique to the target domain set; and

D2: modeling the unique content in the latent topic components.

5. The topic mining method oriented to cross-domain knowledge discovery according to claim 4, characterized in that steps B through D are iterated until both the latent style components and the latent topic components converge.