CN105843798A - Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages - Google Patents

Info

Publication number
CN105843798A
Authority
CN
China
Prior art keywords
long
information
designated
divide
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610205217.5A
Other languages
Chinese (zh)
Inventor
张庆祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Dingzhong Intelligent Technology Co Ltd
Original Assignee
Jiangsu Dingzhong Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Dingzhong Intelligent Technology Co Ltd filed Critical Jiangsu Dingzhong Intelligent Technology Co Ltd
Priority to CN201610205217.5A priority Critical patent/CN105843798A/en
Publication of CN105843798A publication Critical patent/CN105843798A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an internet information acquisition and fusion method based on a divide-and-conquer strategy of long and short messages. The method comprises steps as follows: an acquisition rule r of a user is acquired; searching is performed at search interfaces provided by different media according to the acquisition rule r, and the acquired total message set is marked as S; long messages and short messages in message set S are distinguished, and fusion calculation is performed; messages in sets LS and SS are fused again, and u categories are obtained. The messages acquired from the internet are distinguished into the long messages and the short messages, the messages are represented and fusion calculation is performed on the messages according to the divide-and-conquer strategy, finally, key words of the categories are extracted respectively, long and short texts are further fused, and the problems of sparseness in extraction of short message characteristics and poor effect when long and short texts are calculated together are effectively solved.

Description

An Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages
Technical field
The invention belongs to the field of network information processing and specifically relates to an Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages.
Background
With the rapid development of the Internet, media of all kinds, including news sites, blogs, forums, microblogs and WeChat, have become important places where people express their views. Acquiring and fusing Internet information is the foundation of all kinds of Internet information processing tools.
Messages from different Internet media differ greatly in length: some contain a few words or a few dozen words, others hundreds or thousands. Messages of different lengths also call for different processing techniques. For long messages, word weights can be extracted, so the conventional fusion technique is mostly the vector space model (VSM). Short messages are usually fused by string similarity calculation or by the Jaccard method. Existing methods for acquiring and fusing Internet information do not combine the two approaches organically and handle only one of the two cases, so the fusion of the acquired information is unsatisfactory.
Summary of the invention
The technical problem to be solved by the invention is to overcome the deficiencies of the prior art by providing a new Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages. The method divides the acquired messages into two types, long and short, and applies a different representation model and a different similarity calculation method to each, so that the acquired Internet information is fused effectively.
The technical problem is solved by the following technical scheme. The invention is an Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, characterized in that it comprises the following steps:
(1) An acquisition rule r of the user is obtained. The preferred operating steps are as follows:
(1-1) the topic created by the user is read;
(1-2) an acquisition rule r of the topic is obtained.
(2) A search is performed with acquisition rule r at the search interfaces provided by different media, and the total message set obtained is denoted S. The preferred operating steps are as follows:
(2-1) the message sets obtained with acquisition rule r at m different media are denoted S1, S2, ..., Sm respectively;
(2-2) the union of S1, S2, ..., Sm is taken, and the result is denoted S.
(3) The messages in set S are divided into long messages and short messages, and fusion calculation is performed on each group separately. The preferred operating steps are as follows:
(3-1) messages in set S longer than 140 characters are treated as long messages, giving the long message set LS; all other messages are treated as short messages, giving the short message set SS;
(3-2) the long message set LS is represented with the VSM model, and the similarity of long messages is calculated with cosine distance, giving p categories, denoted LS1, LS2, ..., LSp respectively;
(3-3) each message in the short message set SS is segmented into words and stop words are filtered out, so that each message is represented as a set of words; the similarity of short messages is then calculated with the Jaccard method, giving q categories, denoted SS1, SS2, ..., SSq respectively.
(4) The messages in the long message set LS and the short message set SS are fused again, giving u categories. The preferred operating steps are as follows:
(4-1) the TF method is used to calculate the weight of the words in each category LSi (1 ≤ i ≤ p) of set LS, and 20 words are selected as the feature words of LSi; the feature word set is denoted LSi-FW;
(4-2) the TF method is used to calculate the weight of the words in each category SSj (1 ≤ j ≤ q) of set SS, and 20 words are selected as the feature words of SSj; the feature word set is denoted SSj-FW;
(4-3) the Jaccard method is used to calculate the similarity of LSi-FW and SSj-FW, finally giving u categories.
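For reference, the two similarity measures named in steps (3) and (4) are usually defined as follows. The patent text does not write these formulas out, so the notation is an assumption introduced here: w_{a,k} is the weight of word k in the VSM vector of message a, and A and B are word sets.

```latex
\mathrm{sim}_{\cos}(a, b) =
  \frac{\sum_{k} w_{a,k}\, w_{b,k}}
       {\sqrt{\sum_{k} w_{a,k}^{2}}\;\sqrt{\sum_{k} w_{b,k}^{2}}},
\qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```

Both measures take values between 0 and 1 when the weights are non-negative, which is why a single threshold per step is enough to decide whether two items belong to the same category.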
The method of the invention divides the messages acquired from the Internet into two types, long messages and short messages, represents them and performs fusion calculation on them according to a divide-and-conquer strategy, and finally extracts the keywords of each category to fuse the long and short texts further. This effectively solves the problems that the features extracted from short messages are sparse and that calculating long and short texts together gives poor results.
Brief description of the drawings
Fig. 1 is a flow chart of the Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to the invention;
Fig. 2 is a flow chart of the step of the method in which the messages in set S are divided into long messages and short messages and fusion calculation is performed on each group;
Fig. 3 is a flow chart of the step of the method in which the messages in sets LS and SS are fused again to obtain u categories.
Detailed description of the embodiments
The method of the invention is described in further detail below with reference to the drawings and embodiments, so that those skilled in the art can better understand the invention; the description does not limit the claims of the invention.
Embodiment 1, with reference to Figs. 1-3: an Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, whose concrete steps include: (1) an acquisition rule r of the user is obtained; (2) a search is performed with acquisition rule r at the search interfaces provided by different media, and the total message set obtained is denoted S; (3) the messages in set S are divided into long messages and short messages, and fusion calculation is performed on each group separately; (4) the messages in sets LS and SS are fused again, giving u categories.
Step (1) reads the topic created by the user. A topic may correspond to several acquisition rules; each acquisition rule r of the topic is taken in turn, and the subsequent acquisition and fusion proceed in the same way for every rule.
Step (2) applies each acquisition rule r to m different media: a search is performed through the search interface that each medium provides, and the messages on a specified number of result pages are obtained. The message sets obtained from the different media are S1, S2, ..., Sm respectively. The message sets obtained from the m media are then merged into one large set, denoted S.
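The merge in step (2) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `merge_media_results` is hypothetical, messages are assumed to be plain strings, and two messages are assumed to be "the same" only when their text is identical, which the patent does not specify.

```python
from typing import Iterable


def merge_media_results(per_media: Iterable[Iterable[str]]) -> list[str]:
    """Return the union of S1..Sm as a list, keeping first-seen order.

    Assumption: duplicates are detected by exact text equality.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for results in per_media:          # one result collection per medium
        for message in results:
            if message not in seen:    # skip messages already in S
                seen.add(message)
                merged.append(message)
    return merged


# Two hypothetical media that share one message.
S1 = ["flood warning issued", "river level rising"]
S2 = ["river level rising", "match postponed"]
S = merge_media_results([S1, S2])
```

A list with stable order is used instead of a bare `set` only so that later steps behave deterministically; the result is still the set union described above.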
Step (3) divides set S into two classes: a long message set and a short message set. The length threshold follows the 140-character limit that microblogs impose on the length of input content. The division yields the long message set LS and the short message set SS.
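The division by length can be sketched in a few lines. The function name `split_by_length` is hypothetical, and message length is assumed to be the number of Unicode characters, which is what Python's `len` returns for a `str`.

```python
LONG_THRESHOLD = 140  # characters; the microblog limit cited in the description


def split_by_length(messages, threshold=LONG_THRESHOLD):
    """Split messages into (LS, SS): longer than the threshold, or not."""
    long_set = [m for m in messages if len(m) > threshold]
    short_set = [m for m in messages if len(m) <= threshold]
    return long_set, short_set


# A message of exactly 140 characters counts as short.
LS, SS = split_by_length(["x" * 140, "y" * 141])
```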
The long message set LS is represented with the VSM model. Suppose the long message LS-Infor1 contains ls1 words; then LS-Infor1 can be expressed as {<word1, weight1>, <word2, weight2>, ..., <wordls1, weightls1>}, where wordi (1 ≤ i ≤ ls1) is a word and weighti (1 ≤ i ≤ ls1) is the weight of that word. When the messages in the long message set LS are fused, similarity is calculated with the cosine distance method. The similarity threshold LS-f is determined by experiment: LS-f = 0.7. This gives p categories, denoted LS1, LS2, ..., LSp respectively.
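A minimal sketch of this long-message step follows. Several details are assumptions because the text leaves them open: word weights are raw term counts, words are obtained by whitespace splitting (a stand-in for a real word segmenter), a message joins a category when its similarity to that category's first message is at least LS-f, and categories are built in a single greedy pass. All function names are hypothetical.

```python
import math
from collections import Counter

LS_F = 0.7  # similarity threshold for long messages, from the description


def to_vector(text):
    """VSM representation: word -> weight (assumed here to be the raw count)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity of two sparse word-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_long(messages, threshold=LS_F):
    """Greedy single-pass grouping of long messages into categories."""
    clusters = []  # each entry: (vector of the first member, member list)
    for msg in messages:
        vec = to_vector(msg)
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(msg)
                break
        else:  # no existing category was similar enough
            clusters.append((vec, [msg]))
    return [members for _, members in clusters]
```

In practice the weights would more likely be TF-IDF values and the grouping a proper clustering algorithm; the sketch only shows how the vector representation, the cosine measure and the threshold fit together.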
Each message in the short message set SS is segmented into words and stop words are filtered out. Suppose the short message SS-Infor1 contains ss1 words; then SS-Infor1 can be expressed as {<word1>, <word2>, ..., <wordss1>}, where wordi (1 ≤ i ≤ ss1) is a word. When the messages in the short message set SS are fused, similarity is calculated with the Jaccard method. The similarity threshold SS-f is determined by experiment: SS-f = 0.8. This gives q categories, denoted SS1, SS2, ..., SSq respectively.
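The short-message step can be sketched in the same style. The stop-word list, the whitespace tokenization (a stand-in for a real word segmenter), the comparison against the first member of each category and the greedy single pass are all assumptions made for illustration; only the word-set representation, the Jaccard measure and the threshold 0.8 come from the description.

```python
SS_F = 0.8  # similarity threshold for short messages, from the description
STOP_WORDS = frozenset({"the", "a", "of"})  # hypothetical stop-word list


def to_word_set(text, stop_words=STOP_WORDS):
    """Represent a short message as the set of its non-stop words."""
    return frozenset(w for w in text.split() if w not in stop_words)


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_short(messages, threshold=SS_F):
    """Greedy single-pass grouping of short messages into categories."""
    clusters = []  # each entry: (word set of the first member, member list)
    for msg in messages:
        words = to_word_set(msg)
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(msg)
                break
        else:  # no existing category was similar enough
            clusters.append((words, [msg]))
    return [members for _, members in clusters]
```

Because short messages contain few words, a single differing word moves the Jaccard score a long way, which is consistent with the description choosing a separate threshold for short messages.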
Step (4) uses the TF (term frequency) method to calculate the weight of the words in each category LSi (1 ≤ i ≤ p) of set LS and selects 20 words (a number determined by experiment) as the feature words of LSi; the feature word set is denoted LSi-FW. The TF (term frequency) method is likewise used to calculate the weight of the words in each category SSj (1 ≤ j ≤ q) of set SS, and 20 words (a number determined by experiment) are selected as the feature words of SSj; the feature word set is denoted SSj-FW. The Jaccard method is then used to calculate the similarity of the category feature word sets LSi-FW and SSj-FW, and the final fusion gives u categories. The similarity threshold LS-SS-f is determined by experiment: LS-SS-f = 0.7.
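A possible reading of step (4) is sketched below. The description fixes the TF weighting, the 20 feature words and the threshold 0.7, but not how categories are combined once their similarity is known. The sketch assumes that the 20 words with the highest term frequency are chosen, that each short category is merged into its most similar long category when the similarity reaches the threshold, and that it otherwise stays a category of its own. Whitespace splitting again stands in for word segmentation, and all names are hypothetical.

```python
from collections import Counter

TOP_K = 20      # number of feature words per category, from the description
LS_SS_F = 0.7   # similarity threshold for the final fusion, from the description


def feature_words(category, k=TOP_K):
    """Feature word set of a category: its k most frequent words (TF weight)."""
    counts = Counter(w for msg in category for w in msg.split())
    return frozenset(w for w, _ in counts.most_common(k))


def jaccard(a, b):
    """Jaccard similarity of two word sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuse(long_cats, short_cats, threshold=LS_SS_F):
    """Merge short categories into long ones whose feature words are similar."""
    fused = [list(c) for c in long_cats]
    long_fw = [feature_words(c) for c in long_cats]
    for cat in short_cats:
        fw = feature_words(cat)
        best, best_sim = None, 0.0
        for i, lfw in enumerate(long_fw):
            sim = jaccard(fw, lfw)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            fused[best].extend(cat)   # same topic: join the long category
        else:
            fused.append(list(cat))   # no match: keep as its own category
    return fused


long_cats = [["flood rain river flood", "rain river dam"]]
short_cats = [["flood rain river dam"], ["football match score"]]
result = fuse(long_cats, short_cats)  # u = len(result)
```

Comparing fixed-size feature word sets instead of raw messages puts long and short categories on the same footing, which is the point of the second fusion pass.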

Claims (5)

1. An Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, characterized in that it comprises the following steps:
(1) an acquisition rule r of the user is obtained;
(2) a search is performed with acquisition rule r at the search interfaces provided by different media, and the total message set obtained is denoted S;
(3) the messages in set S are divided into long messages and short messages, and fusion calculation is performed on each group separately;
(4) the messages in the long message set LS and the short message set SS are fused again, giving u categories.
2. The Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that in step (1): the topic created by the user is read first; an acquisition rule r of the topic is then obtained.
3. The Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that in step (2): the message sets obtained with acquisition rule r at m different media are first denoted S1, S2, ..., Sm respectively; the union of S1, S2, ..., Sm is then taken, and the result is denoted S.
4. The Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that step (3) specifically comprises the following steps:
A. messages in set S longer than 140 characters are treated as long messages, giving the long message set LS; all other messages are treated as short messages, giving the short message set SS;
B. the long message set LS is represented with the VSM model, and the similarity of long messages is calculated with cosine distance, giving p categories, denoted LS1, LS2, ..., LSp respectively;
C. each message in the short message set SS is segmented into words and stop words are filtered out, so that each message is represented as a set of words; the similarity of short messages is then calculated with the Jaccard method, giving q categories, denoted SS1, SS2, ..., SSq respectively.
5. The Internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to any one of claims 1 to 4, characterized in that step (4) specifically comprises the following steps:
A. the TF method is used to calculate the weight of the words in each category LSi (1 ≤ i ≤ p) of the long message set LS, and 20 words are selected as the feature words of LSi; the feature word set is denoted LSi-FW;
B. the TF method is used to calculate the weight of the words in each category SSj (1 ≤ j ≤ q) of the short message set SS, and 20 words are selected as the feature words of SSj; the feature word set is denoted SSj-FW;
C. the Jaccard method is used to calculate the similarity of LSi-FW and SSj-FW, finally giving u categories.
CN201610205217.5A 2016-04-05 2016-04-05 Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages Pending CN105843798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610205217.5A CN105843798A (en) 2016-04-05 2016-04-05 Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610205217.5A CN105843798A (en) 2016-04-05 2016-04-05 Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages

Publications (1)

Publication Number Publication Date
CN105843798A true CN105843798A (en) 2016-08-10

Family

ID=56596636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610205217.5A Pending CN105843798A (en) 2016-04-05 2016-04-05 Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages

Country Status (1)

Country Link
CN (1) CN105843798A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222072A (en) * 2010-04-19 2011-10-19 腾讯科技(深圳)有限公司 Method and device for information classification
US20120239650A1 (en) * 2011-03-18 2012-09-20 Microsoft Corporation Unsupervised message clustering
CN104573070A (en) * 2015-01-26 2015-04-29 清华大学 Text clustering method special for mixed length text sets
CN104915447A (en) * 2015-06-30 2015-09-16 北京奇艺世纪科技有限公司 Method and device for tracing hot topics and confirming keywords

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张金鹏: "基于语义的文本相似度算法研究及应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
曾颖黎: "网络舆情文本分类系统研究与开发", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160810
