CN105843798A

CN105843798A - Internet information acquisition and fusion method based on divide-and-conquer strategy of long and short messages

Info

Publication number: CN105843798A
Application number: CN201610205217.5A
Authority: CN
Inventors: 张庆祥
Original assignee: Jiangsu Dingzhong Intelligent Technology Co Ltd
Current assignee: Jiangsu Dingzhong Intelligent Technology Co Ltd
Priority date: 2016-04-05
Filing date: 2016-04-05
Publication date: 2016-08-10

Abstract

The invention relates to an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages. The method comprises the following steps: a collection rule r of a user is obtained; searches are performed through the search interfaces provided by different media according to rule r, and the total set of acquired messages is denoted S; the messages in S are divided into a long message set LS and a short message set SS, and fusion calculation is performed on each set separately; the messages in LS and SS are then fused again to obtain u categories. Messages acquired from the internet are thus divided into long and short messages, each type is represented and fused under the divide-and-conquer strategy, and the keywords of each category are finally extracted to fuse long and short texts further. This effectively addresses the sparseness of features extracted from short messages and the poor results obtained when long and short texts are processed together.

Description

An internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages

Technical field

The invention belongs to the field of network information processing and specifically concerns an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages.

Background art

With the rapid development of the Internet, all kinds of media have become important places for people to express their views, including news sites, blogs, forums, microblogs, and WeChat. The collection and fusion of internet information is the foundation of all kinds of internet information processing tools.

Messages on different internet media vary greatly in length: some have a few words, some dozens, and others hundreds or thousands. Messages of different lengths also call for quite different processing techniques. For long messages, word weights can be extracted, so conventional fusion techniques are mostly based on the vector space model (VSM). Short messages, by contrast, are usually fused using string similarity or the Jaccard coefficient. Existing methods for collecting and fusing internet information do not combine the two approaches; each handles only one of these cases, so the fusion of the collected information is unsatisfactory.

Summary of the invention

The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing a new internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages. The method divides the collected data into long and short types and applies a different representation model and a different similarity calculation to each, so that the collected internet information is fused effectively.

This technical problem is solved by the following technical solution. The invention is an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, characterized in that it comprises the following steps:

(1) Obtain a collection rule r of the user. The preferred procedure is as follows:

(1-1) read the topic created by the user;

(1-2) obtain a collection rule r of the topic.

(2) Search with collection rule r through the search interfaces provided by different media, and denote the total set of acquired messages as S. The preferred procedure is as follows:

(2-1) denote the message sets obtained with collection rule r from m different media as S1, S2, ..., Sm respectively;

(2-2) take the union of S1, S2, ..., Sm and denote the result as S.

(3) Divide the messages in set S into long messages and short messages, and perform fusion calculation on each group separately. The preferred procedure is as follows:

(3-1) treat messages in set S longer than 140 characters as long messages, giving the long message set LS, and treat all other messages as short messages, giving the short message set SS;

(3-2) represent the long message set LS with the VSM model, then calculate the similarity of long messages using cosine similarity, obtaining p categories denoted LS1, LS2, ..., LSp;

(3-3) segment each message in the short message set SS into words, filter out stop words, and represent each message as a set of words, then calculate the similarity of short messages using the Jaccard coefficient, obtaining q categories denoted SS1, SS2, ..., SSq.

(4) Fuse the messages in the long message set LS and the short message set SS again, obtaining u categories. The preferred procedure is as follows:

(4-1) use the TF method to calculate the weights of the words in each category LSi (1≤i≤p) of set LS, and select 20 words as the feature words of LSi; the feature word set is denoted LSi-FW;

(4-2) use the TF method to calculate the weights of the words in each category SSj (1≤j≤q) of set SS, and select 20 words as the feature words of SSj; the feature word set is denoted SSj-FW;

(4-3) calculate the similarity of LSi-FW and SSj-FW using the Jaccard coefficient, finally obtaining u categories.
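The source names the two similarity measures without printing their formulas. For reference, the standard textbook definitions are given here, where a and b are weight vectors over the same vocabulary and A and B are word sets; the symbols are chosen for illustration and do not appear in the source:

```latex
\[
\mathrm{sim}_{\cos}(\mathbf{a},\mathbf{b})
  = \frac{\sum_{k} a_k b_k}{\sqrt{\sum_{k} a_k^{2}}\,\sqrt{\sum_{k} b_k^{2}}},
\qquad
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
\]
```

Both measures lie between 0 and 1 when the weights are non-negative, so they can be compared against a fixed threshold.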

The method divides the information collected from the Internet into long messages and short messages, represents and fuses each type under a divide-and-conquer strategy, and finally extracts the keywords of each category to fuse long and short texts further. This effectively addresses the sparseness of features extracted from short messages and the poor results obtained when long and short texts are processed together.

Description of the drawings

Fig. 1 is a flow chart of the internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages;

Fig. 2 is a flow chart of dividing the messages in set S into long messages and short messages and performing fusion calculation on each group separately;

Fig. 3 is a flow chart of fusing the messages in sets LS and SS again to obtain u categories.

Detailed description of the invention

The method of the invention is described in further detail below with reference to the drawings and embodiments, so that those skilled in the art can better understand the invention; the description does not limit the scope of the claims.

Embodiment 1. With reference to Figs. 1-3, an internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages comprises the following steps: (1) obtain a collection rule r of the user; (2) search with collection rule r through the search interfaces provided by different media, and denote the total set of acquired messages as S; (3) divide the messages in set S into long messages and short messages, and perform fusion calculation on each group separately; (4) fuse the messages in sets LS and SS again, obtaining u categories.

In step (1), the topic created by the user is read. A topic may correspond to multiple collection rules; each collection rule r of the topic is taken in turn, and information is subsequently collected and fused in the same way for each.

In step (2), for each collection rule r and each of the m different media, a search is performed through the search interface provided by the medium, and the message set for a specified number of result pages is obtained. The message sets obtained from the different media are S1, S2, ..., Sm. These m sets are then merged into one large set, denoted S.
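The merge in step (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `merge_media_results` is hypothetical, the sample messages are invented, and treating two messages as duplicates only when their text is identical is an assumption, since the source does not say how duplicates are recognized.

```python
from typing import Iterable, List


def merge_media_results(per_media_results: Iterable[Iterable[str]]) -> List[str]:
    """Return the union S of S1..Sm, keeping first-seen order."""
    seen = set()
    merged = []
    for results in per_media_results:
        for message in results:
            # Assumption: exact text equality identifies a duplicate.
            if message not in seen:
                seen.add(message)
                merged.append(message)
    return merged


# Invented sample data standing in for the results from two media.
S1 = ["flood warning issued", "city marathon postponed"]
S2 = ["city marathon postponed", "new subway line opens"]
S = merge_media_results([S1, S2])
```

A list is returned instead of a `set` so that the order in which messages were first collected is preserved.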

In step (3), set S is divided into two classes, a long message set and a short message set. The length boundary follows the 140-character limit that microblogs place on input content. The division yields the long message set LS and the short message set SS.
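A minimal sketch of this division, assuming each message is a plain string and that length is counted in characters with `len`. The function and constant names are hypothetical.

```python
from typing import List, Tuple

LONG_MESSAGE_LIMIT = 140  # boundary stated in the source: longer than 140 characters is "long"


def split_by_length(messages: List[str], limit: int = LONG_MESSAGE_LIMIT) -> Tuple[List[str], List[str]]:
    """Split S into the long message set LS and the short message set SS."""
    long_set = [m for m in messages if len(m) > limit]
    short_set = [m for m in messages if len(m) <= limit]
    return long_set, short_set


# Messages of exactly 140 characters count as short; 141 characters count as long.
LS, SS = split_by_length(["a" * 140, "b" * 141, "short post"])
```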

The long message set LS is represented with the VSM model. Assuming that the long message LS-Infor1 contains ls1 words, LS-Infor1 can be expressed as {<word1, weight1>, <word2, weight2>, ..., <wordls1, weightls1>}, where wordi (1 ≤ i ≤ ls1) is a word and weighti (1 ≤ i ≤ ls1) is the weight of that word. During fusion within the long message set LS, similarity is calculated using cosine similarity. The similarity threshold LS-f is determined by experiment: LS-f = 0.7. This yields p categories, denoted LS1, LS2, ..., LSp.
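The long-message fusion can be sketched as below. The source fixes the representation (word and weight pairs), the measure (cosine) and the threshold (0.7), but not the weighting scheme or the grouping procedure. This sketch therefore makes several assumptions: raw term counts are used as weights, words are split on whitespace in place of a real word segmenter, and a message joins the first category whose first member is at least as similar as the threshold. All function names are hypothetical, and the sample texts are kept short for readability even though real long messages exceed 140 characters.

```python
import math
from collections import Counter
from typing import Dict, List

LS_F = 0.7  # similarity threshold LS-f from the source


def to_vector(text: str) -> Dict[str, int]:
    """Assumed weighting: raw term counts over whitespace-split words."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_long(messages: List[str], threshold: float = LS_F) -> List[List[str]]:
    """Single-pass grouping (an assumed procedure, not stated in the source)."""
    representatives: List[Dict[str, int]] = []
    categories: List[List[str]] = []
    for text in messages:
        vector = to_vector(text)
        for i, rep in enumerate(representatives):
            if cosine_similarity(vector, rep) >= threshold:
                categories[i].append(text)
                break
        else:
            # No existing category is similar enough: start a new one.
            representatives.append(vector)
            categories.append([text])
    return categories


long_categories = cluster_long([
    "the river flood closed the bridge",
    "the river flood closed the road",
    "team wins the final match",
])
```

In this example the first two texts share enough weighted words to clear the 0.7 threshold, while the third starts its own category.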

Each message in the short message set SS is segmented into words and stop words are filtered out. Assuming that the short message SS-Infor1 contains ss1 words, SS-Infor1 can be expressed as {<word1>, <word2>, ..., <wordss1>}, where wordi (1 ≤ i ≤ ss1) is a word. During fusion within the short message set SS, similarity is calculated using the Jaccard coefficient. The similarity threshold SS-f is determined by experiment: SS-f = 0.8. This yields q categories, denoted SS1, SS2, ..., SSq.
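A sketch of the short-message fusion under stated assumptions: whitespace splitting stands in for word segmentation, the tiny `STOP_WORDS` list is invented for illustration, and the single-pass grouping against the first member of each category is an assumed procedure, since the source gives only the measure (Jaccard) and the threshold (0.8). All names are hypothetical.

```python
from typing import FrozenSet, List

SS_F = 0.8  # similarity threshold SS-f from the source
STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "to", "in"})  # illustrative list only


def to_word_set(text: str) -> FrozenSet[str]:
    """Segment on whitespace (an assumption) and drop stop words."""
    return frozenset(w for w in text.lower().split() if w not in STOP_WORDS)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def cluster_short(messages: List[str], threshold: float = SS_F) -> List[List[str]]:
    """Single-pass grouping (an assumed procedure, not stated in the source)."""
    representatives: List[FrozenSet[str]] = []
    categories: List[List[str]] = []
    for text in messages:
        words = to_word_set(text)
        for i, rep in enumerate(representatives):
            if jaccard(words, rep) >= threshold:
                categories[i].append(text)
                break
        else:
            # No existing category is similar enough: start a new one.
            representatives.append(words)
            categories.append([text])
    return categories


short_categories = cluster_short([
    "the flood closed the bridge",
    "flood closed bridge",
    "marathon postponed",
])
```

Because short messages contain few words, a set of words without weights is used here, which matches the representation described in the source.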

In step (4), the TF (term frequency) method is used to calculate the weights of the words in each category LSi (1 ≤ i ≤ p) of set LS, and 20 words (a number determined by experiment) are selected as the feature words of LSi; the feature word set is denoted LSi-FW. Likewise, the TF method is used to calculate the weights of the words in each category SSj (1 ≤ j ≤ q) of set SS, and 20 words (determined by experiment) are selected as the feature words of SSj; the feature word set is denoted SSj-FW. The similarity of the category feature word sets LSi-FW and SSj-FW is then calculated using the Jaccard coefficient, and the final fusion yields u categories. The similarity threshold LS-SS-f is determined by experiment: LS-SS-f = 0.7.
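The cross-fusion of step (4) can be sketched as follows. The source specifies TF weighting, 20 feature words per category, the Jaccard measure and the 0.7 threshold, but not how matching categories are combined. This sketch assumes that each short category is merged into its most similar long category when the similarity reaches the threshold and otherwise remains a category of its own. It also assumes that the 20 selected words are the ones with the highest term frequency, and it splits on whitespace without stop-word filtering for brevity. All names are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List

LS_SS_F = 0.7            # similarity threshold LS-SS-f from the source
NUM_FEATURE_WORDS = 20   # number of feature words per category from the source


def feature_words(category: List[str], top_n: int = NUM_FEATURE_WORDS) -> FrozenSet[str]:
    """Pick the top_n most frequent words of a category (assumed reading of the TF step)."""
    counts = Counter(word for text in category for word in text.lower().split())
    return frozenset(word for word, _ in counts.most_common(top_n))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def fuse(long_categories: List[List[str]], short_categories: List[List[str]],
         threshold: float = LS_SS_F) -> List[List[str]]:
    """Merge each short category into its best-matching long category (assumed rule)."""
    fused = [list(category) for category in long_categories]
    long_features = [feature_words(category) for category in long_categories]
    for short in short_categories:
        short_features = feature_words(short)
        best_index, best_score = None, 0.0
        for i, features in enumerate(long_features):
            score = jaccard(short_features, features)
            if score > best_score:
                best_index, best_score = i, score
        if best_index is not None and best_score >= threshold:
            fused[best_index].extend(short)
        else:
            fused.append(list(short))
    return fused


fused_categories = fuse(
    [["flood closed bridge", "flood closed bridge road"]],
    [["flood closed bridge road"], ["marathon postponed"]],
)
u = len(fused_categories)
```

Comparing fixed-size feature word sets, not whole texts, puts long and short categories on the same footing, which is the stated purpose of this step.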

Claims

1. An internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages, characterized in that it comprises the following steps:

(1) obtaining a collection rule r of the user;

(2) searching with collection rule r through the search interfaces provided by different media, the total set of acquired messages being denoted S;

(3) dividing the messages in set S into long messages and short messages, and performing fusion calculation on each group separately;

(4) fusing the messages in the long message set LS and the short message set SS again, obtaining u categories.

2. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that in step (1): the topic created by the user is first read; a collection rule r of the topic is then obtained.

3. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that in step (2): the message sets obtained with collection rule r from m different media are first denoted S1, S2, ..., Sm respectively; the union of S1, S2, ..., Sm is then taken, and the result is denoted S.

4. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to claim 1, characterized in that step (3) specifically comprises the following steps:

A. treating messages in set S longer than 140 characters as long messages, giving the long message set LS, and treating all other messages as short messages, giving the short message set SS;

B. representing the long message set LS with the VSM model, then calculating the similarity of long messages using cosine similarity, obtaining p categories denoted LS1, LS2, ..., LSp;

C. segmenting each message in the short message set SS into words, filtering out stop words, and representing each message as a set of words, then calculating the similarity of short messages using the Jaccard coefficient, obtaining q categories denoted SS1, SS2, ..., SSq.

5. The internet information acquisition and fusion method based on a divide-and-conquer strategy for long and short messages according to any one of claims 1-4, characterized in that step (4) specifically comprises the following steps:

A. using the TF method to calculate the weights of the words in each category LSi (1≤i≤p) of the long message set LS, and selecting 20 words as the feature words of LSi, the feature word set being denoted LSi-FW;

B. using the TF method to calculate the weights of the words in each category SSj (1≤j≤q) of the short message set SS, and selecting 20 words as the feature words of SSj, the feature word set being denoted SSj-FW;

C. calculating the similarity of LSi-FW and SSj-FW using the Jaccard coefficient, finally obtaining u categories.