CN105138538A - Cross-domain knowledge discovery-oriented topic mining method - Google Patents

Publication number
CN105138538A
Authority
CN
China
Prior art keywords
text
potential
target domain
assembly
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510398749.0A
Other languages
Chinese (zh)
Other versions
CN105138538B (en)
Inventor
靳晓明
韩春晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201510398749.0A
Publication of CN105138538A
Application granted
Publication of CN105138538B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic mining method oriented to cross-domain knowledge discovery, comprising: constructing a source domain text set and a target domain set; extracting latent class feature information and latent semantic information from the source domain text set; extracting latent feature information and latent semantic information from the target domain set; automatically aggregating the texts in the target domain set into latent style components; modeling the semantic information of the target domain set in latent topic components; and modeling the latent topic components of the semantic information of the target domain set. The method has the following advantages: features of the source domain texts are mined automatically and used to identify and classify texts in the target domain; text feature information of the source domain is accurately transferred into the text clustering of the target domain; and text content in the target domain that differs from the source domain is found automatically.

Description

A topic mining method oriented to cross-domain knowledge discovery
Technical field
The invention belongs to the technical field of computer text mining, relates to topic model technology, and specifically relates to a topic mining method oriented to cross-domain knowledge discovery.
Background
With the development of the Internet, the emergence of more and more network platforms has caused text resources to grow explosively, and the huge data volume and complicated analysis often make it very difficult for users to obtain the knowledge they need. For example, when people want to find valuable local news events or hot topics in a social network, they can only rely on keyword search to find the information they want. Such retrieval is often very inefficient: people usually have to try a large number of search keywords, or browse a large number of search results, before they can find the information they want. To improve the efficiency with which users obtain information, text mining technology has emerged to help people organize and manage text information. The main text mining technologies at present include traditional topic model technology, supervised topic model technology, and cross-domain topic model technology.
Each of these technologies has its own advantages and disadvantages, summarized as follows:
1. Traditional topic mining technology models text information with a mixture model based on probability statistics, so that the model can automatically mine the latent semantic information in the text and users can quickly understand the content the text involves. Through a topic model, not only the information mainly involved in the text collection but also the content information of each document can be obtained. Common topic models include the Probabilistic Latent Semantic Analysis (PLSA) model [1] and the Latent Dirichlet Allocation (LDA) model [2]. However, such technology considers only the text information in the text collection and cannot make use of other useful information, such as the class information of the text.
2. Supervised topic mining technology builds on traditional topic mining by fusing the class information of the text into the topic mining process, so that texts with the same features contain the same topics as far as possible, which improves topic mining. These topic models with prior knowledge fuse the prior knowledge into unsupervised topic mining in different ways. Work that incorporates document-level prior knowledge into the topic model includes the supervised Latent Dirichlet Allocation (sLDA) model [3] proposed by Blei, which incorporates the class label of the text into the topic model as a response variable and models it with a generalized linear model. The class label information it introduces improves the topic feature representation of the text and thus better serves classification and regression problems. The labeled Latent Dirichlet Allocation (lLDA) model [4] proposed by Ramage directly establishes a one-to-one mapping between latent topics and document class labels, effectively solving the credit attribution problem in multi-labeled text collections. However, this kind of topic mining method places higher requirements on the text data, and sometimes considerable human effort is needed to provide the required supervision information.
3. Cross-domain text mining technology mainly addresses automatically extracting the similarities and differences in the latent semantic information of texts from different domains. The main work includes the Cross-Collection Mixture Model (ccMix) [5] proposed by Zhai, which can mine the topic information that occurs jointly across different domains and, for these shared topics, identify the part common to all domains and the part specific to each domain. Building on this work, Paul proposed the cross-collection Latent Dirichlet Allocation (ccLDA) model [6], which moves ccMix from the PLSA framework to the LDA framework. This gives the model the advantage of LDA, namely that it can perform inference on newly arriving texts. In addition, the model reduces the number of parameters in ccMix, so that the model parameters do not grow with the amount of text data and the model can better mine text according to its inherent features. However, cross-domain topic models cannot use the information of different domains to help users filter out the information they need.
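For reference, the mixture form behind the two baseline models named in item 1 can be written out. The notation below (d for a document, w for a word, z for a latent topic, theta_d and phi_z for the document-topic and topic-word distributions, alpha for the Dirichlet hyperparameter) is the conventional one from the topic-model literature and is an assumption here, since the patent text introduces no symbols of its own.

```latex
% PLSA [1]: each word in document d is drawn from a mixture over latent topics z
p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)

% LDA [2]: the same mixture, with a Dirichlet prior on the per-document proportions
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{dn} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{dn} \sim \mathrm{Multinomial}(\phi_{z_{dn}})
```

The prior on theta_d is what lets LDA infer topic proportions for documents not seen during training, which is the advantage item 3 attributes to ccLDA over ccMix.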
List of references
[1] Hofmann T. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1999: 50-57.
[2] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. The Journal of Machine Learning Research, 2003, 3: 993-1022.
[3] Mcauliffe J D, Blei D M. Supervised topic models. Advances in Neural Information Processing Systems. 2008: 121-128.
[4] Ramage D, Hall D, Nallapati R, et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 2009: 248-256.
[5] Zhai C X, Velivelli A, Yu B. A cross-collection mixture model for comparative text mining. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004: 743-748.
[6] Paul M. Cross-collection topic models: Automatically comparing and contrasting text. Urbana, 2009, 51: 61801.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, the object of the invention is to propose a topic mining method oriented to cross-domain knowledge discovery.
To achieve the above object, an embodiment of one aspect of the invention discloses a topic mining method oriented to cross-domain knowledge discovery, comprising the following steps: A: for a given text data set with class labels, constructing a source domain text set; for a given text data set without class labels, constructing a target domain set; B: extracting the latent class feature information of each class of text from the source domain text set and modeling the latent class feature information in latent style components; extracting the latent semantic information of the text from the source domain text set and modeling it in latent topic components; C: extracting the latent feature information and latent semantic information of all texts from the target domain set; D: according to the latent style components and the latent feature information extracted from the target domain set, automatically aggregating the texts in the target domain set into the latent style components; according to the latent topic components and the latent feature information extracted from the target domain set, modeling the semantic information of the target domain set in the latent topic components; and E: modeling the latent topic components of the semantic information of the target domain set.
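Step A can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the input is a single list of (text, label) pairs in which unlabeled texts carry None; the patent itself speaks of two given data sets, and the names build_domains, corpus, source_set, and target_set are hypothetical.

```python
from typing import List, Optional, Tuple

Document = Tuple[str, Optional[str]]  # (text, class label or None); assumed layout


def build_domains(corpus: List[Document]):
    """Split a corpus into a labeled source set and an unlabeled target set."""
    source = [(text, label) for text, label in corpus if label is not None]
    target = [text for text, label in corpus if label is None]
    return source, target


# Toy data, invented for this example only.
corpus = [
    ("stocks fell sharply today", "finance"),
    ("the team won the final", "sports"),
    ("a new phone was announced", None),
]
source_set, target_set = build_domains(corpus)
```

Keeping the label next to each source text is what later allows a style component to be tied to a class, while the target set carries text only.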
According to the topic mining method oriented to cross-domain knowledge discovery of the embodiment of the invention, the text features of the source domain are mined automatically, and these features can be used to identify and classify texts in the target domain; the text feature information of the source domain is effectively transferred into the text clustering of the target domain, making the clustering process more accurate; and content in the target texts that is similar to the source texts is automatically filtered out. The method can model the content of both the source domain texts and the target domain texts and distinguish the similarities and differences in their content, thereby finding the text content in the target domain that differs from the source domain.
In addition, the topic mining method oriented to cross-domain knowledge discovery according to the above embodiment of the invention may also have the following additional technical features:
Further, between step A and step B the method also comprises: AB: preprocessing the text data in the source domain text set and the target domain set.
Further, the preprocessing comprises stop-word removal and stemming.
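The preprocessing of step AB might look like the following Python sketch. The stop-word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins; the patent names neither a stop list nor a stemming algorithm, and a real pipeline would more likely use an established stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Naive suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The cats are mining topics in the texts")
```

Applying the same function to both the source and the target sets keeps their vocabularies comparable, which the later cross-domain comparison depends on.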
Further, in step D, modeling the semantic information of the target domain set in the latent topic components according to the latent topic components and the latent feature information extracted from the target domain set further comprises: D1: filtering out the content in the target domain set that is similar to the source texts and retaining the content unique to the target domain set; and D2: modeling the unique content in the latent topic components.
Further, steps B to D are iterated until the latent style components and the latent topic components have both converged.
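The iteration over steps B to D can be pictured as a generic fixed-point loop. The Python sketch below is only that: the patent does not state its update rule or convergence test, so update, tol, and max_iter are hypothetical, and the toy update stands in for one pass of steps B to D over a flat list of parameters.

```python
def iterate_until_converged(update, params, tol=1e-6, max_iter=1000):
    """Apply `update` repeatedly until no parameter changes by more than `tol`."""
    for iteration in range(1, max_iter + 1):
        new_params = update(params)
        change = max(abs(n - o) for n, o in zip(new_params, params))
        params = new_params
        if change < tol:
            return params, iteration
    return params, max_iter


def toy_update(p):
    """Toy stand-in for one pass of steps B-D: move each value halfway to 1.0."""
    return [x + 0.5 * (1.0 - x) for x in p]


params, n_iter = iterate_until_converged(toy_update, [0.0, 0.2])
```

In the method as claimed, the stopping condition would have to hold for the style components and the topic components together; here a single maximum over all parameters plays that role.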
Additional aspects and advantages of the invention will be given in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of one embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the invention, and are not to be construed as limiting the invention.
In the description of the invention, it should be understood that the orientations or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings, are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention. In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the invention, it should be noted that, unless otherwise expressly specified and limited, the terms "installed", "connected", and "coupled" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the specific circumstances.
These and other aspects of the embodiments of the invention will become clear with reference to the following description and drawings. In the description and drawings, some specific implementations of the embodiments are disclosed to show some of the ways in which the principles of the embodiments may be practiced, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
A topic mining method oriented to cross-domain knowledge discovery according to an embodiment of the invention is described below with reference to the drawings.
Fig. 1 is a flowchart of one embodiment of the invention; please refer to Fig. 1.
(1) Extracting the latent class features of the source domain texts
On the basis of the traditional topic model, the model introduces a new mixture model for each document to model the latent class features of the source domain, referred to as style. For every source domain document, the latent component corresponding to it in the style mixture model is in one-to-one correspondence with the document's class label, and the model is learned under supervision.
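The one-to-one correspondence between class labels and latent style components can be sketched in Python as follows. Representing a source document's style as a one-hot distribution is an interpretation made for this example, not something the patent spells out, and the function names are hypothetical.

```python
def style_index(labels):
    """Map each distinct class label to one latent style component index."""
    return {label: k for k, label in enumerate(sorted(set(labels)))}


def source_style_distribution(label, index):
    """One-hot distribution: a source document's style is fixed by its label."""
    dist = [0.0] * len(index)
    dist[index[label]] = 1.0
    return dist


# Toy labels, invented for this example only.
labels = ["sports", "finance", "sports"]
index = style_index(labels)
dists = [source_style_distribution(label, index) for label in labels]
```

Because the label determines the style component directly, nothing about the style of a source document has to be inferred, which is the sense in which this part of the model is supervised.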
(2) Extracting the latent semantic information of the source domain texts
The model retains the basic assumption of the traditional topic model that a document is associated with a mixture of a series of topics. Each latent component that models a class feature has a series of latent topic components associated with it, so that the non-feature content information in the source domain texts is modeled by the topics corresponding to the class to which the text belongs. To distinguish the content of the source domain texts from the content of the target domain, the topics that model source domain text content are marked.
(3) Extracting the latent features of the target domain texts
As with the source domain texts, the model also introduces a new mixture model for each target domain document to model its latent features. Unlike the source domain texts, the distribution of target domain texts over the latent components of the mixture model is learned without supervision.
(4) Extracting the latent semantic information of the target domain texts
The latent semantic information of the target domain texts is modeled in the same way as that of the source domain texts; the difference is that the topics assigned to target domain text content do not need to be marked.
(5) Clustering the target domain texts
Because the newly introduced style mixture model is shared by the source domain and the target domain, the target domain texts can be clustered according to their distribution over the style mixture model and the class label corresponding to each latent component of the style mixture model.
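One simple way to realize this clustering is to give each target document the label of its most probable style component. The Python sketch below assumes that argmax rule and assumes the per-document style distributions have already been inferred; the patent states neither, and cluster_target and the toy numbers are hypothetical.

```python
def cluster_target(style_dists, component_labels):
    """Assign each target document the label of its most probable style component."""
    assignments = []
    for dist in style_dists:
        best = max(range(len(dist)), key=dist.__getitem__)
        assignments.append(component_labels[best])
    return assignments


# Component k carries the class label it was tied to in the source domain.
component_labels = ["finance", "sports"]
# Toy inferred style distributions for three target documents.
target_style_dists = [[0.8, 0.2], [0.1, 0.9], [0.55, 0.45]]
clusters = cluster_target(target_style_dists, component_labels)
```

The third document shows the limit of a hard assignment: its distribution is nearly even, so an application might prefer to keep the full distribution rather than a single label.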
(6) Screening the newly discovered domain knowledge
For the topic information under each style mixture model, all topics involved in the source domain texts were marked in step (2). All topic information that has not been marked is therefore content involved only in the target domain texts, so these unmarked topics are the newly discovered domain knowledge.
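The screening step reduces to a set difference once topic assignments are available. The Python sketch below assumes each word occurrence has already been assigned a (style, topic) pair during inference; that representation, the function name new_topics, and the toy assignments are all hypothetical.

```python
def new_topics(source_assignments, target_assignments):
    """Return (style, topic) pairs used by target texts but never by source texts."""
    marked = set(source_assignments)  # topics touched by any source-domain text
    return sorted(set(target_assignments) - marked)


# Toy (style, topic) assignments, invented for this example only.
source_assignments = [(0, 0), (0, 1), (1, 0)]
target_assignments = [(0, 1), (0, 2), (1, 0), (1, 1), (1, 1)]
novel = new_topics(source_assignments, target_assignments)
```

Here topics (0, 2) and (1, 1) appear only in the target assignments, so they are the candidates for newly discovered domain knowledge.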
In addition, other components and functions of the topic mining method oriented to cross-domain knowledge discovery of the embodiment of the invention are known to those skilled in the art and, to reduce redundancy, are not described again.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (5)

1. A topic mining method oriented to cross-domain knowledge discovery, characterized by comprising the following steps:
A: for a given text data set with class labels, constructing a source domain text set; for a given text data set without class labels, constructing a target domain set;
B: extracting the latent class feature information of each class of text from the source domain text set and modeling the latent class feature information in latent style components; extracting the latent semantic information of the text from the source domain text set and modeling it in latent topic components;
C: extracting the latent feature information and latent semantic information of the text from the target domain set;
D: according to the latent style components and the latent feature information extracted from the target domain set, automatically aggregating the texts in the target domain set into the latent style components; according to the latent topic components and the latent feature information extracted from the target domain set, modeling the semantic information of the target domain set in the latent topic components; and
E: modeling the latent topic components of the semantic information of the target domain set.
2. The topic mining method oriented to cross-domain knowledge discovery according to claim 1, characterized in that, between step A and step B, the method further comprises:
AB: preprocessing the text data in the source domain text set and the target domain set.
3. The topic mining method oriented to cross-domain knowledge discovery according to claim 2, characterized in that the preprocessing comprises stop-word removal and stemming.
4. The topic mining method oriented to cross-domain knowledge discovery according to claim 1, characterized in that, in step D, modeling the semantic information of the target domain set in the latent topic components according to the latent topic components and the latent feature information extracted from the target domain set further comprises:
D1: filtering out the content in the target domain set that is similar to the source texts and retaining the content unique to the target domain set; and
D2: modeling the unique content in the latent topic components.
5. The topic mining method oriented to cross-domain knowledge discovery according to claim 4, characterized in that steps B to D are iterated until the latent style components and the latent topic components have both converged.
CN201510398749.0A 2015-07-08 2015-07-08 A topic mining method oriented to cross-domain knowledge discovery Active CN105138538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510398749.0A CN105138538B (en) 2015-07-08 2015-07-08 A topic mining method oriented to cross-domain knowledge discovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510398749.0A CN105138538B (en) 2015-07-08 2015-07-08 A topic mining method oriented to cross-domain knowledge discovery

Publications (2)

Publication Number Publication Date
CN105138538A true CN105138538A (en) 2015-12-09
CN105138538B CN105138538B (en) 2018-08-03

Family

ID=54723888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510398749.0A Active CN105138538B (en) 2015-07-08 2015-07-08 A topic mining method oriented to cross-domain knowledge discovery

Country Status (1)

Country Link
CN (1) CN105138538B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760426A (en) * 2016-01-28 2016-07-13 仲恺农业工程学院 Subject community mining method for online social networking service
CN108897815A (en) * 2018-06-20 2018-11-27 淮阴工学院 A kind of multi-tag file classification method based on similarity model and FastText
CN109194605A (en) * 2018-07-02 2019-01-11 中国科学院信息工程研究所 A kind of suspected threat index Proactive authentication method and system based on open source information
CN110738715A (en) * 2018-07-19 2020-01-31 北京大学 automatic migration method of dynamic text special effect based on sample
CN110957042A (en) * 2020-01-17 2020-04-03 广州慧视医疗科技有限公司 Prediction and simulation method of eye diseases under different conditions based on domain knowledge migration
US11423333B2 (en) 2020-03-25 2022-08-23 International Business Machines Corporation Mechanisms for continuous improvement of automated machine learning
CN115168600A (en) * 2022-06-23 2022-10-11 广州大学 Value chain knowledge discovery method under personalized customization

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770580A (en) * 2009-01-04 2010-07-07 中国科学院计算技术研究所 Training method and classification method of cross-field text sentiment classifier
CN103324708A (en) * 2013-06-18 2013-09-25 哈尔滨工程大学 Method of transfer learning from long text to short text

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAIDONG GAO ET AL: "Supervised Cross-collection Topic Modeling", 《MM’12》 *
RUI ZHAO ET AL: "Supervised Adaptive-transfer PLSA for Cross-Domain Text Classification", 《2014 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOP》 *
YANG BAO ET AL: "A Partially Supervised Cross-Collection Topic Model for Cross-Domain Text Classification", 《CIKM’13》 *
李良豪: "跨领域文本分类算法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760426A (en) * 2016-01-28 2016-07-13 仲恺农业工程学院 Subject community mining method for online social networking service
CN105760426B (en) * 2016-01-28 2018-12-21 仲恺农业工程学院 A kind of theme community's method for digging towards online social networks
CN108897815A (en) * 2018-06-20 2018-11-27 淮阴工学院 A kind of multi-tag file classification method based on similarity model and FastText
CN108897815B (en) * 2018-06-20 2021-07-16 淮阴工学院 Multi-label text classification method based on similarity model and FastText
CN109194605A (en) * 2018-07-02 2019-01-11 中国科学院信息工程研究所 A kind of suspected threat index Proactive authentication method and system based on open source information
CN109194605B (en) * 2018-07-02 2020-08-25 中国科学院信息工程研究所 Active verification method and system for suspicious threat indexes based on open source information
CN110738715A (en) * 2018-07-19 2020-01-31 北京大学 automatic migration method of dynamic text special effect based on sample
CN110738715B (en) * 2018-07-19 2021-07-09 北京大学 Automatic migration method of dynamic text special effect based on sample
CN110957042A (en) * 2020-01-17 2020-04-03 广州慧视医疗科技有限公司 Prediction and simulation method of eye diseases under different conditions based on domain knowledge migration
US11423333B2 (en) 2020-03-25 2022-08-23 International Business Machines Corporation Mechanisms for continuous improvement of automated machine learning
CN115168600A (en) * 2022-06-23 2022-10-11 广州大学 Value chain knowledge discovery method under personalized customization

Also Published As

Publication number Publication date
CN105138538B (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN105138538A (en) Cross-domain knowledge discovery-oriented topic mining method
JP6286104B2 (en) Display method, apparatus, server, program and recording medium for social network information stream
CN103365924B (en) A kind of method of internet information search, device and terminal
CN103617169B (en) A kind of hot microblog topic extracting method based on Hadoop
CN103699689B (en) Method and device for establishing event repository
CN102419778B (en) Information searching method for discovering and clustering sub-topics of query statement
CN104035975B (en) It is a kind of to realize the method that remote supervisory character relation is extracted using Chinese online resource
CN105468605A (en) Entity information map generation method and device
CN102142089B (en) Semantic binary tree-based image annotation method
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
CN104504024B (en) Keyword method for digging based on content of microblog and system
CN103617239A (en) Method and device for identifying named entity and method and device for establishing classification model
CN104536956A (en) A Microblog platform based event visualization method and system
CN103020159A (en) Method and device for news presentation facing events
CN103246748B (en) Automatically manage the technology of filec descriptor
CN104462547A (en) Configurable webpage data acquisition method and system
CN105912665B (en) The model conversion and data migration method of a kind of Neo4j to relevant database
CN107832440B (en) Data mining method, device, server and computer readable storage medium
CN110232126A (en) Hot spot method for digging and server and computer readable storage medium
CN104216979A (en) Chinese technology patent automatic classification system and method for patent classification by using system
CN102542061A (en) Intelligent product classification method
CN103995885A (en) Method and device for recognizing entity names
CN104915405A (en) Microblog query expansion method based on multiple layers
CN103049557A (en) Website resource management method and website resource management device
CN105956158A (en) Automatic extraction method of network neologism on the basis of mass microblog texts and use information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant