CN105843798A - Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages - Google Patents
Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages
- Publication number
- CN105843798A (application number CN201610205217.5A)
- Authority
- CN
- China
- Prior art keywords
- long
- information
- designated
- divide
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages. The method comprises the following steps: an acquisition rule r of a user is obtained; searches are performed through the search interfaces provided by different media according to the acquisition rule r, and the total set of acquired messages is denoted S; the messages in set S are divided into a long message set LS and a short message set SS, and fusion calculation is performed on each set separately; the messages in sets LS and SS are then fused again, yielding u categories. The messages acquired from the internet are divided into long messages and short messages, represented and fused according to the divide-and-conquer strategy, and finally the keywords of each category are extracted so that long and short texts can be fused further. This effectively addresses the sparseness of features extracted from short messages and the poor results obtained when long and short texts are processed together.
Description
Technical field
The invention belongs to the field of network information processing. Specifically, it is an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages.
Background
With the rapid development of the internet, all kinds of media have become important places for people to express their views, including news sites, post bars (Tieba), forums, microblogs, and WeChat. Acquiring and fusing internet information is the foundation of all kinds of internet information processing tools.
Message length varies widely across internet media: some messages contain a few words or a few dozen words, while others contain hundreds or thousands. Messages of different lengths also call for different processing techniques. For long messages, word weights can be extracted, so the commonly used fusion technique is mostly the vector space model (VSM). For short messages, fusion usually relies on string similarity or the Jaccard coefficient. Existing methods for acquiring and fusing internet information do not combine the two approaches organically; they handle only one of the two cases, so the fusion of the acquired information is unsatisfactory.
Summary of the invention
The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing a new internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages. The method divides the acquired messages into two types, long and short, and applies a different representation model and a different similarity calculation to each type, so that the acquired internet information is fused effectively.
The technical problem is solved by the following technical scheme. The invention is an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, characterized in that it comprises the following steps:
(1) An acquisition rule r of the user is obtained. The preferred operating steps are as follows:
(1-1) the topic created by the user is read;
(1-2) an acquisition rule r of the topic is obtained.
(2) Searches are performed with acquisition rule r through the search interfaces provided by different media, and the total set of acquired messages is denoted S. The preferred operating steps are as follows:
(2-1) the message sets obtained with acquisition rule r from m different media are denoted S1, S2, ..., Sm respectively;
(2-2) the union of S1, S2, ..., Sm is taken, and the result is denoted S.
(3) The messages in set S are divided into long messages and short messages, and fusion calculation is performed on each group separately. The preferred operating steps are as follows:
(3-1) messages in set S that are longer than 140 characters are treated as long messages, giving the long message set LS; all other messages are treated as short messages, giving the short message set SS;
(3-2) the long message set LS is represented with the VSM model, and the similarity between long messages is calculated with the cosine distance, yielding p categories, denoted LS1, LS2, ..., LSp respectively;
(3-3) each message in the short message set SS is segmented into words and stop words are filtered out, so that each message is represented as a set of words; the similarity between short messages is then calculated with the Jaccard coefficient, yielding q categories, denoted SS1, SS2, ..., SSq respectively.
(4) The messages in the long message set LS and the short message set SS are fused again, yielding u categories. The preferred operating steps are as follows:
(4-1) the TF (term frequency) method is used to calculate the weights of the words in each category LSi (1 ≤ i ≤ p) of set LS, and 20 words are selected as the feature words of LSi; the feature word set is denoted LSi-FW;
(4-2) the TF method is used to calculate the weights of the words in each category SSj (1 ≤ j ≤ q) of set SS, and 20 words are selected as the feature words of SSj; the feature word set is denoted SSj-FW;
(4-3) the Jaccard coefficient is used to calculate the similarity between LSi-FW and SSj-FW, finally yielding u categories.
The method divides the messages acquired from the internet into two types, long messages and short messages, represents them and performs fusion calculation following a divide-and-conquer strategy, and finally extracts the keywords of each category to fuse the long and short texts further. This effectively addresses the sparseness of features extracted from short messages and the poor results obtained when long and short texts are processed together.
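For reference, the two similarity measures named in the steps above have the following standard textbook definitions. The symbols are assumptions introduced here for illustration and do not appear in the original: a and b are the VSM weight vectors of two long messages indexed by word t, and A and B are the word sets of two short messages (or two feature word sets).

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{t} a_t \, b_t}{\sqrt{\sum_{t} a_t^{2}} \; \sqrt{\sum_{t} b_t^{2}}},
\qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
```

Both values lie between 0 and 1 for non-negative weights, which is what allows them to be compared against fixed similarity thresholds.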
Brief description of the drawings
Fig. 1 is a flow chart of the internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages;
Fig. 2 is a flow chart of the part of the method that divides the messages in set S into long messages and short messages and performs fusion calculation on each group separately;
Fig. 3 is a flow chart of the part of the method that fuses the messages in sets LS and SS again to obtain u categories.
Detailed description
The method is described in further detail below with reference to the drawings and an embodiment, so that those skilled in the art can better understand the invention; the description does not limit the claims of the invention.
Embodiment 1. With reference to Figs. 1-3, an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages comprises the following steps: (1) an acquisition rule r of the user is obtained; (2) searches are performed with acquisition rule r through the search interfaces provided by different media, and the total set of acquired messages is denoted S; (3) the messages in set S are divided into long messages and short messages, and fusion calculation is performed on each group separately; (4) the messages in sets LS and SS are fused again, yielding u categories.
In step (1), the topic created by the user is read. A topic may correspond to several acquisition rules; each acquisition rule r of the topic is taken in turn, and acquisition and fusion then proceed in the same way for every rule.
In step (2), for each acquisition rule r and each of the m different media, a search is performed through the search interface that the medium provides, and the messages on a specified number of result pages are retrieved. The message sets obtained from the different media are S1, S2, ..., Sm respectively. The m message sets are then merged into one large set, denoted S.
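Step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `SearchFn` signature, the `collect` function, the `pages` default, and the two stub media are all hypothetical stand-ins for real search interfaces, which the text does not specify.

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: a search interface takes (rule, pages) and returns messages.
SearchFn = Callable[[str, int], List[str]]


def collect(rule: str, media: Dict[str, SearchFn], pages: int = 2) -> Set[str]:
    """Query each of the m media with acquisition rule r and return the union S."""
    per_medium = [set(search(rule, pages)) for search in media.values()]  # S1..Sm
    return set().union(*per_medium)


# Stub search interfaces standing in for real media (assumptions for this demo).
def fake_news(rule: str, pages: int) -> List[str]:
    return [f"{rule}: long report", f"{rule}: shared post"]


def fake_microblog(rule: str, pages: int) -> List[str]:
    return [f"{rule}: shared post", f"{rule}: short note"]


S = collect("smog", {"news": fake_news, "microblog": fake_microblog})
```

Taking a set union means that a message returned by more than one medium appears in S only once.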
Step (3) divides set S into two classes, a long message set and a short message set. The length threshold follows the 140-character limit that microblogs place on input length. The division yields the long message set LS and the short message set SS.
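The division by length can be sketched as below. The function and constant names are assumptions introduced for illustration; only the 140-character threshold and the "longer than" comparison come from the text.

```python
from typing import Iterable, List, Tuple

LONG_THRESHOLD = 140  # characters; messages longer than this count as long


def split_by_length(S: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split message set S into the long set LS and the short set SS."""
    LS: List[str] = []
    SS: List[str] = []
    for msg in S:
        (LS if len(msg) > LONG_THRESHOLD else SS).append(msg)
    return LS, SS


LS, SS = split_by_length(["a" * 141, "b" * 140, "short"])
```

A message of exactly 140 characters falls into SS, because only messages strictly longer than 140 characters are treated as long.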
The long message set LS is represented with the VSM model. Suppose a long message LS-Infor1 contains ls1 words; then LS-Infor1 can be expressed as {<word1, weight1>, <word2, weight2>, ..., <wordls1, weightls1>}, where wordi (1 ≤ i ≤ ls1) is a word and weighti (1 ≤ i ≤ ls1) is the weight of that word. When the messages within LS are fused, similarity is calculated with the cosine distance. The similarity threshold LS-f is determined experimentally: LS-f = 0.7. This yields p categories, denoted LS1, LS2, ..., LSp respectively.
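The long-message fusion can be sketched as follows. Several choices here are assumptions, since the text fixes only the VSM representation, the cosine measure, and the threshold LS-f = 0.7: words are obtained by whitespace splitting (real Chinese text would need a word segmenter), weights are raw term frequencies, categories are formed by a greedy single pass against each category's first member, and a similarity equal to the threshold counts as a match.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

LS_F = 0.7  # similarity threshold LS-f from the text


def to_vector(message: str) -> Dict[str, float]:
    """Represent a message as {word: weight}; weight = raw term frequency (assumption)."""
    return {word: float(n) for word, n in Counter(message.split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_long(messages: List[str], threshold: float = LS_F) -> List[List[str]]:
    """Greedy single-pass grouping (assumption): a message joins the first category
    whose first member is similar enough, otherwise it opens a new category."""
    categories: List[Tuple[Dict[str, float], List[str]]] = []
    for msg in messages:
        vec = to_vector(msg)
        for representative, members in categories:
            if cosine(vec, representative) >= threshold:
                members.append(msg)
                break
        else:
            categories.append((vec, [msg]))
    return [members for _, members in categories]


docs = [
    "smog covers the city today",
    "smog covers the city again today",
    "team wins the final match",
]
groups = cluster_long(docs)  # the first two messages share a category
```

The number of resulting groups corresponds to p, and each inner list corresponds to one category LSi.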
Each message in the short message set SS is segmented into words and stop words are filtered out. Suppose a short message SS-Infor1 contains ss1 words; then SS-Infor1 can be expressed as {<word1>, <word2>, ..., <wordss1>}, where wordi (1 ≤ i ≤ ss1) is a word. When the messages within SS are fused, similarity is calculated with the Jaccard coefficient. The similarity threshold SS-f is determined experimentally: SS-f = 0.8. This yields q categories, denoted SS1, SS2, ..., SSq respectively.
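The short-message fusion can be sketched in the same spirit. As before, this is an illustration under assumptions rather than the patented implementation: the stop-word list is a tiny English placeholder, segmentation is whitespace splitting, and categories are formed by a greedy single pass against each category's first member. Only the word-set representation, the Jaccard measure, and SS-f = 0.8 come from the text.

```python
from typing import List, Set, Tuple

SS_F = 0.8  # similarity threshold SS-f from the text
STOP_WORDS = {"the", "a", "is", "of"}  # illustrative placeholder list


def to_word_set(message: str) -> Set[str]:
    """Segment a message (whitespace split, an assumption) and drop stop words."""
    return {w for w in message.lower().split() if w not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_short(messages: List[str], threshold: float = SS_F) -> List[List[str]]:
    """Greedy single-pass grouping (assumption), mirroring the long-message case."""
    categories: List[Tuple[Set[str], List[str]]] = []
    for msg in messages:
        words = to_word_set(msg)
        for representative, members in categories:
            if jaccard(words, representative) >= threshold:
                members.append(msg)
                break
        else:
            categories.append((words, [msg]))
    return [members for _, members in categories]


posts = [
    "the smog is heavy in Beijing today",
    "smog heavy in Beijing today",
    "great match tonight",
]
short_groups = cluster_short(posts)
```

Because short messages carry too few words for meaningful weights, set overlap is used in place of weighted vectors; the number of groups corresponds to q.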
Step (4) uses the TF (term frequency) method to calculate the weights of the words in each category LSi (1 ≤ i ≤ p) of set LS, and selects 20 words (a number determined experimentally) as the feature words of LSi; the feature word set is denoted LSi-FW. In the same way, the TF method is used to calculate the weights of the words in each category SSj (1 ≤ j ≤ q) of set SS, and 20 words (a number determined experimentally) are selected as the feature words of SSj; the feature word set is denoted SSj-FW. The Jaccard coefficient is then used to calculate the similarity between the feature word sets LSi-FW and SSj-FW of the categories, and the final fusion yields u categories. The similarity threshold LS-SS-f is determined experimentally: LS-SS-f = 0.7.
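The final fusion step can be sketched as follows. The top-20 selection by term frequency, the Jaccard comparison, and the threshold 0.7 come from the text; everything else is an assumption made for illustration, including the function names, the policy of merging each short category into the first sufficiently similar long category, and the premise that messages are already segmented and stop-word filtered.

```python
from collections import Counter
from typing import List, Set

TOP_K = 20     # number of feature words per category, from the text
LS_SS_F = 0.7  # similarity threshold LS-SS-f from the text


def feature_words(category: List[str], k: int = TOP_K) -> Set[str]:
    """Top-k words by term frequency over all messages of one category."""
    tf = Counter(word for msg in category for word in msg.split())
    return {word for word, _ in tf.most_common(k)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuse(long_cats: List[List[str]], short_cats: List[List[str]],
         threshold: float = LS_SS_F) -> List[List[str]]:
    """Merge each short category SSj into the first long category LSi whose
    feature words are similar enough (assumed policy); otherwise keep it apart."""
    fused = [list(cat) for cat in long_cats]
    long_fw = [feature_words(cat) for cat in long_cats]  # LSi-FW
    for short_cat in short_cats:
        fw = feature_words(short_cat)  # SSj-FW
        for i, lfw in enumerate(long_fw):
            if jaccard(fw, lfw) >= threshold:
                fused[i].extend(short_cat)
                break
        else:
            fused.append(list(short_cat))
    return fused


long_cats = [["smog alert city smog alert"], ["match final team"]]
short_cats = [["smog alert city"], ["rain tomorrow"]]
fused = fuse(long_cats, short_cats)
u = len(fused)
```

Comparing compact feature word sets instead of full messages puts long and short categories on the same footing, which is how the method sidesteps comparing texts of very different lengths directly.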
Claims (5)
1. An internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, characterized in that it comprises the following steps:
(1) an acquisition rule r of the user is obtained;
(2) searches are performed with acquisition rule r through the search interfaces provided by different media, and the total set of acquired messages is denoted S;
(3) the messages in set S are divided into long messages and short messages, and fusion calculation is performed on each group separately;
(4) the messages in the long message set LS and the short message set SS are fused again, yielding u categories.
2. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that, in step (1), the topic created by the user is first read, and an acquisition rule r of the topic is then obtained.
3. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that, in step (2), the message sets obtained with acquisition rule r from m different media are first denoted S1, S2, ..., Sm respectively; the union of S1, S2, ..., Sm is then taken, and the result is denoted S.
4. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that step (3) specifically comprises the following steps:
A. messages in set S that are longer than 140 characters are treated as long messages, giving the long message set LS; all other messages are treated as short messages, giving the short message set SS;
B. the long message set LS is represented with the VSM model, and the similarity between long messages is calculated with the cosine distance, yielding p categories, denoted LS1, LS2, ..., LSp respectively;
C. each message in the short message set SS is segmented into words and stop words are filtered out, so that each message is represented as a set of words; the similarity between short messages is then calculated with the Jaccard coefficient, yielding q categories, denoted SS1, SS2, ..., SSq respectively.
5. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to any one of claims 1-4, characterized in that step (4) specifically comprises the following steps:
A. the TF method is used to calculate the weights of the words in each category LSi (1 ≤ i ≤ p) of the long message set LS, and 20 words are selected as the feature words of LSi; the feature word set is denoted LSi-FW;
B. the TF method is used to calculate the weights of the words in each category SSj (1 ≤ j ≤ q) of the short message set SS, and 20 words are selected as the feature words of SSj; the feature word set is denoted SSj-FW;
C. the Jaccard coefficient is used to calculate the similarity between LSi-FW and SSj-FW, finally yielding u categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610205217.5A CN105843798A (en) | 2016-04-05 | 2016-04-05 | Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610205217.5A CN105843798A (en) | 2016-04-05 | 2016-04-05 | Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105843798A (en) | 2016-08-10 |
Family
ID=56596636
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610205217.5A Pending CN105843798A (en) | 2016-04-05 | 2016-04-05 | Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105843798A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222072A (en) * | 2010-04-19 | 2011-10-19 | 腾讯科技(深圳)有限公司 | Method and device for information classification |
US20120239650A1 (en) * | 2011-03-18 | 2012-09-20 | Microsoft Corporation | Unsupervised message clustering |
CN104573070A (en) * | 2015-01-26 | 2015-04-29 | 清华大学 | Text clustering method special for mixed length text sets |
CN104915447A (en) * | 2015-06-30 | 2015-09-16 | 北京奇艺世纪科技有限公司 | Method and device for tracing hot topics and confirming keywords |
Non-Patent Citations (2)
Title |
---|
张金鹏: "基于语义的文本相似度算法研究及应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
曾颖黎: "网络舆情文本分类系统研究与开发", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829858B (en) | Data query method and device and computer readable storage medium | |
CN106528532B (en) | Text error correction method, device and terminal | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN106202211B (en) | Integrated microblog rumor identification method based on microblog types | |
CN103885937B (en) | Method for judging repetition of enterprise Chinese names on basis of core word similarity | |
CN104281653B (en) | A kind of opining mining method for millions scale microblogging text | |
CN101593200A (en) | Chinese Web page classification method based on the keyword frequency analysis | |
CN107544988B (en) | Method and device for acquiring public opinion data | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
CN104268160A (en) | Evaluation object extraction method based on domain dictionary and semantic roles | |
CN103678275A (en) | Two-level text similarity calculation method based on subjective and objective semantics | |
CN105718585B (en) | Document and label word justice correlating method and its device | |
CN102722709A (en) | Method and device for identifying garbage pictures | |
CN104317784A (en) | Cross-platform user identification method and cross-platform user identification system | |
CN102279890A (en) | Sentiment word extracting and collecting method based on micro blog | |
CN106980651B (en) | Crawling seed list updating method and device based on knowledge graph | |
CN103927309A (en) | Method and device for marking information labels for business objects | |
CN104077417A (en) | Figure tag recommendation method and system in social network | |
CN105512333A (en) | Product comment theme searching method based on emotional tendency | |
CN112149422B (en) | Dynamic enterprise news monitoring method based on natural language | |
CN103886077A (en) | Short text clustering method and system | |
CN105512300B (en) | information filtering method and system | |
CN106202038A (en) | Synonym method for digging based on iteration and device | |
CN105068986A (en) | Method for filtering comment spam based on bidirectional iteration and automatically constructed and updated corpus | |
CN101673263B (en) | Method for searching video content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160810 |