CN105117466A

CN105117466A - Internet information screening system and method

Info

Publication number: CN105117466A
Application number: CN201510536772.1A
Authority: CN
Inventors: 杨裕芬
Original assignee: Hubei Haobai Information Service Branch of China Telecom Corp Ltd
Current assignee: Hubei Haobai Information Service Branch of China Telecom Corp Ltd
Priority date: 2015-08-27
Filing date: 2015-08-27
Publication date: 2015-12-02

Abstract

The invention relates to an Internet information screening method and belongs to the technical field of computer networks. Internet documents are collected to form a document library, and the documents in the library are preprocessed; the preprocessing comprises classification, word segmentation and denoising, and frequency statistics. Weights are calculated for the preprocessed documents to obtain the class center vector of each class of documents. Frequency statistics are computed for the to-be-classified documents, and a similarity result between the to-be-classified documents and the documents in the document library is obtained. The to-be-classified documents are then screened against a set threshold to obtain target documents; finally, promotional content is loaded into the screened target documents and the document data is transmitted to the Internet. The invention addresses the classification and screening of massive volumes of hot-event information documents for a specific category, and improves processing speed, greatly improving the execution speed and efficiency of the system while maintaining accuracy.

Description

An Internet information screening system and method

Technical field

The present invention relates to the technical field of computer networks, and in particular to an Internet information screening system and method.

Background technology

With the development of the times, information on the network is growing explosively. The news media of portal websites are no longer the main source of Internet content: social platforms, microblogs, WeChat, Moments, forums and the like all produce massive amounts of information every day. An e-commerce operator wants to extract valuable hot network event information from this mass of information in a timely and effective manner, edit and process it into its own content with marketing characteristics, and publish it quickly through its own Internet channels, so as to increase customer stickiness and the marketing success rate. Publishing valuable event-marketing information in time places ever higher demands on the timeliness of event information. Presenting the most valuable event information to users promptly, with marketing attributes added, has become an essential requirement for e-commerce, and quickly screening hot-event documents has become a problem that e-commerce websites urgently need to solve. The screening of event information documents is in essence a two-class document classification problem, but it differs markedly from traditional document classification in two respects. First, the boundary between document classes is ill-defined: the classification criterion is a manually determined judgment of the value of a so-called hot event. Second, with the development of the information industry, and especially the explosive growth of the Internet, the hot-event data to be analyzed is massive in scale.

Summary of the invention

The technical problem to be solved by the present invention is to provide an Internet information screening system and method that addresses the deficiencies of the prior art.

The technical solution by which the present invention solves the above technical problem is as follows:

An Internet information screening system comprises a communication unit, a preprocessing unit, a weight calculation unit, a policy unit, a threshold screening unit and an execution unit;

The communication unit is configured to collect Internet documents to form a document library and to pass the document information in the document library to the preprocessing unit; it is also configured to collect to-be-classified documents from the Internet and to pass the to-be-classified document information to the classifier unit;

The preprocessing unit is configured to perform sorting, word segmentation and denoising, and frequency statistics on the documents in the document library. Sorting a document means separating it into its document ID, document content and document attribute. Word segmentation and denoising means segmenting and denoising the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency statistics means merging the values that share the same document attribute and counting the frequency of each distinct value in the value list of that document attribute;

The weight calculation unit is configured to calculate weights for the preprocessed documents. Specifically, it calculates the weight of each word in each document, takes the K words with the largest weights as the feature words of that document, merges the feature words of all documents to form the feature word space, and maps the document weight results into the feature word space to obtain the class center vector of each class of documents;

The classifier unit is configured to perform frequency statistics on the to-be-classified documents to obtain the term frequencies of each document, to calculate the weights of each document in the feature word space, and, using the class center vectors of the documents in the document library calculated by the weight calculation unit, to classify the documents according to a feedback mechanism. The document ID is output as the value and the class to which the document belongs is output as the key, giving the similarity result between the to-be-classified documents and the documents in the document library;

The policy unit is configured to set the similarity threshold used to screen the to-be-classified documents against the documents in the document library;

The threshold screening unit is configured to screen the to-be-classified documents according to the threshold set by the policy unit, to obtain the target documents.

The beneficial effects of the present invention are as follows. After the website server collects event document information published through news outlets, WeChat, websites and microblogs, the invention overcomes the performance limits of screening on large-scale hot-event message data sets and solves the problem of classifying and screening massive hot-event information documents for a specific category. It improves processing speed, greatly improving the execution speed and efficiency of the system while maintaining accuracy. It also meets the timeliness requirements of e-commerce marketing for hot events: valuable information can be screened out of large-scale hot-event data and published in time, which improves the timeliness and diversity of e-commerce website marketing and increases website transaction volume and revenue.

Further, the system of the present invention also comprises an execution unit, which is configured to load product promotion content into the target documents screened by the threshold screening unit and to pass the loaded document data to the communication unit.

Further, the weight calculation unit uses an improved TF-IDF algorithm, TFIDF = TF × log(m/(m+k) × N), where TF is the term frequency of a given feature term, m is the document frequency of the feature term within the domain, k is its document frequency outside the domain, and N is the total number of documents.

Further, the feedback mechanism used by the classifier takes the distance between a to-be-classified document and the class center vectors of the documents in the document library as its basis, and updates the class center vectors during classification. The class center vector is updated by the formula c_i = a·c_i + b·w_i, where c_i is the class center vector of the i-th class, w_i is the vector of a document of the i-th class, a and b are feedback factors, and a + b = 1.

An Internet information screening method comprises the following steps:

(1) collecting Internet documents to form a document library;

(2) performing sorting, word segmentation and denoising, and frequency statistics on the documents in the document library. Sorting a document means separating it into its document ID, document content and document attribute. Word segmentation and denoising means segmenting and denoising the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency statistics means merging the values that share the same document attribute and counting the frequency of each distinct value in the value list of that document attribute;

(3) calculating weights for the preprocessed documents: specifically, calculating the weight of each word in each document, taking the K words with the largest weights as the feature words of that document, merging the feature words of all documents to form the feature word space, and mapping the document weight results into the feature word space to obtain the class center vector of each class of documents;

(4) collecting to-be-classified documents from the Internet and performing frequency statistics on them to obtain the term frequencies of each document; calculating the weights of each document in the feature word space V; parsing each to-be-classified document vector V_D in the feature word space V into the document ID and the document's feature vector w_D = (w_1, w_2, ..., w_m); then, using the class center vectors of the documents in the document library from step (3), classifying according to the feedback mechanism, with the document ID output as the value and the class to which the document belongs output as the key, to obtain the similarity result between the to-be-classified documents and the documents in the document library;

(5) setting the similarity threshold used to screen the to-be-classified documents against the documents in the document library;

(6) screening the to-be-classified documents according to the set threshold to obtain the target documents.

Further, the method also comprises the step of loading product promotion content into the target documents obtained by threshold screening and sending the loaded document data to the Internet.

Further, the weight calculation in step (3) uses an improved TF-IDF algorithm, whose formula is:

TF-IDF = TF × log(m/(m+k) × N)

where TF is the term frequency of a given feature term, m is the document frequency of the feature term within the domain, k is its document frequency outside the domain, and N is the total number of documents.

Further, the feedback mechanism in step (4) takes the distance between a to-be-classified document and the class center vectors of the documents in the document library as its basis, and updates the class center vectors during classification. The class center vector is updated by the formula c_i = a·c_i + b·w_i, where c_i is the class center vector of the i-th class, w_i is the vector of a document of the i-th class, a and b are feedback factors, and a + b = 1.

Further, the frequency statistics in step (4) use a MapReduce-based Rocchio algorithm.

Brief description of the drawings

Fig. 1 is a schematic diagram of the system of the present invention;

Fig. 2 is a flow chart of the method of the present invention.

Embodiment

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.

As shown in Fig. 1, an Internet information screening system comprises a communication unit, a preprocessing unit, a weight calculation unit, a policy unit, a threshold screening unit and an execution unit;

The weight calculation unit is configured to calculate weights for the preprocessed documents. Specifically, it calculates the weight of each word in each document, takes the K words with the largest weights as the feature words of that document, merges the feature words of all documents to form the feature word space, and maps the document weight results into the feature word space, finally obtaining the class center vector of each class of documents;

The classifier unit is configured to perform frequency statistics on the to-be-classified documents to obtain the term frequencies of each document, to calculate the weights of each document in the feature word space, and, using the class center vectors of the documents in the document library calculated by the weight calculation unit, to classify the documents according to a feedback mechanism. The document ID is output as the value and the class to which the document belongs is output as the key, finally giving the similarity result between the to-be-classified documents and the documents in the document library;

The system of the present invention also comprises an execution unit, which is configured to load product promotion content into the target documents screened by the threshold screening unit and to pass the loaded document data to the communication unit.

The weight calculation unit uses an improved TF-IDF algorithm, TFIDF = TF × log(m/(m+k) × N), where TF is the term frequency of a given feature term, m is the document frequency of the feature term within the domain, k is its document frequency outside the domain, and N is the total number of documents.

The feedback mechanism used by the classifier takes the distance between a to-be-classified document and the class center vectors of the documents in the document library as its basis, and updates the class center vectors during classification. The class center vector is updated by the formula c_i = a·c_i + b·w_i, where c_i is the class center vector of the i-th class, w_i is the vector of a document of the i-th class, a and b are feedback factors, and a + b = 1.

A method of screening Internet information using the above Internet information screening system comprises the following steps:

(1) Internet documents are collected to form a document library;

(2) The documents in the document library are subjected to sorting, word segmentation and denoising, and frequency statistics. Sorting a document means separating it into its document ID, document content and document attribute. Word segmentation and denoising means segmenting and denoising the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key;

The pseudocode is as follows:

Input: D

Output: <key, value>

(ID, C, A) = parse(D)
T = segment(C)
for term in T do
    key = make_pair(ID, A)
    value = term
    output(key, value)
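
The Map step above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the record layout (`id`, `content`, `attr`), the regex tokenizer and the tiny stop-word list are all assumptions made for the example, and a real deployment on Chinese text would need a proper word segmenter.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Hypothetical stop-word list standing in for the "denoising" step.
STOP_WORDS = {"the", "a", "of", "and"}


def parse(document: Dict[str, str]) -> Tuple[str, str, str]:
    """Separate a raw record into (ID, content, attribute)."""
    return document["id"], document["content"], document["attr"]


def segment(content: str) -> List[str]:
    """Assumed tokenizer: lowercase word split, then stop-word removal."""
    return [t for t in re.findall(r"\w+", content.lower()) if t not in STOP_WORDS]


def map_terms(document: Dict[str, str]) -> Iterator[Tuple[Tuple[str, str], str]]:
    """Emit one ((ID, attribute), term) pair per remaining term."""
    doc_id, content, attr = parse(document)
    for term in segment(content):
        yield (doc_id, attr), term


doc = {"id": "d1", "content": "The sale of the phone and the phone case", "attr": "R"}
pairs = list(map_terms(doc))
for key, value in pairs:
    print(key, value)
```

Repeated terms are emitted once per occurrence, so that the later Reduce step can count them.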

Frequency statistics means merging the values that share the same document attribute and counting the frequency of each distinct value in the value list of that document attribute. A MapReduce-based Rocchio algorithm is used; the Reduce process is as follows:

Before the result of Map is passed to the Reduce process, the values with the same key are merged to form <key, value_list>. For each key, the frequency of each distinct value in its value_list is counted (the values here are terms). Finally, the key is output unchanged as the Reduce key, and the distinct values in its value_list, together with their frequencies, are output as the Reduce value.

The pseudocode is as follows:

Input: <key, value_list>

Output: <key, value>

for each distinct term in value_list do
    freq = count(term, value_list)
    list.add(make_pair(term, freq))
value = list
output(key, value)
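
A minimal Python sketch of this shuffle-and-Reduce step is given below. It runs in a single process with no MapReduce framework; the helper names `shuffle` and `reduce_frequencies` and the sample pairs are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (document ID, document attribute)


def shuffle(pairs: Iterable[Tuple[Key, str]]) -> Dict[Key, List[str]]:
    """Group values by key, as the framework does between Map and Reduce."""
    grouped: Dict[Key, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_frequencies(key: Key, value_list: List[str]) -> Tuple[Key, List[Tuple[str, int]]]:
    """Count each distinct term once and emit (term, freq) pairs under the same key."""
    return key, list(Counter(value_list).items())


pairs = [
    (("d1", "R"), "sale"),
    (("d1", "R"), "phone"),
    (("d1", "R"), "phone"),
    (("d2", "NR"), "weather"),
]
result = [reduce_frequencies(k, v) for k, v in shuffle(pairs).items()]
for key, term_freqs in result:
    print(key, term_freqs)
```

Using `Counter` counts every distinct term exactly once, which avoids emitting a duplicate (term, freq) pair for each repeated occurrence.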

The following data form is finally obtained:

[<ID_i, A_i>, (<term_i1, freq_i1>, <term_i2, freq_i2>, ..., <term_im, freq_im>)], i = 1, 2, ..., n

The MapReduce-based Rocchio relevance feedback algorithm is markedly more accurate in classification screening than the traditional Rocchio algorithm; it outperforms the KNN algorithm and is only slightly inferior to the SVM algorithm. In processing speed, however, the MapReduce-based Rocchio relevance feedback algorithm takes full advantage of MapReduce in big-data processing: its execution speed and efficiency are far higher than those of the other algorithms and, within a certain range, continue to increase as the cluster grows.

(3) Weights are calculated for the preprocessed documents. Specifically, the weight of each word in each document is calculated, the K words with the largest weights are taken as the feature words of that document, the feature words of all documents are merged to form the feature word space, and the document weight results are mapped into the feature word space to obtain each document's feature vector and class. The weight calculation uses a feature selection method based on an improved TF-IDF algorithm. TF-IDF is a feature representation method commonly used in information retrieval. Its basic idea is that if a term has a high term frequency (TF) in a given document but few documents contain the term, that is, the term's document frequency (DF) is low, then the term discriminates well between classes and should be given a higher weight. That is:

W = TF \times \frac{1}{DF} = TF \times IDF, \qquad (1)

where IDF denotes the inverse document frequency. The basic definitions of TF and IDF are as follows:

TF_{ij} = \frac{freq_{ij}}{Maxfreq_{j}}, \qquad (2)

IDF_{i} = \log \frac{N}{n_{i}}, \qquad (3)

In these formulas, freq_ij is the number of times the i-th word occurs in the j-th document, and Maxfreq_j is the number of occurrences of the most frequent word in the j-th document; N is the total number of documents, and n_i is the number of documents containing the i-th word.
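
As a small numerical illustration of formulas (1) to (3), the following Python sketch computes the classic weight. The function names and the sample numbers are chosen only for this example, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def tf(freq_ij: int, max_freq_j: int) -> float:
    """Formula (2): term count normalized by the most frequent term in the document."""
    return freq_ij / max_freq_j


def idf(n_docs: int, n_i: int) -> float:
    """Formula (3): log of total documents over documents containing the term."""
    return math.log(n_docs / n_i)


def tfidf_weight(freq_ij: int, max_freq_j: int, n_docs: int, n_i: int) -> float:
    """Formula (1): W = TF x IDF."""
    return tf(freq_ij, max_freq_j) * idf(n_docs, n_i)


# A term seen 2 times in a document whose most frequent term occurs 4 times,
# appearing in 10 of 100 documents overall.
w = tfidf_weight(freq_ij=2, max_freq_j=4, n_docs=100, n_i=10)
print(round(w, 4))
```

A rarer term (smaller n_i) with the same in-document frequency receives a larger weight, which is the behavior the paragraph above describes.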

However, this traditional TF-IDF algorithm has shortcomings when applied to document classification. In fact, if a term occurs frequently in one class of documents, the term represents the features of that class well and should be given a higher weight. Zhang Yufang et al. therefore proposed an improved TF-IDF algorithm for document classification that modifies the IDF formula; for a given document class C, IDF is:

IDF_{i} = \log \frac{N \cdot m_{ci}}{n_{i}}, \qquad (4)

where m_ci is the number of documents in class C that contain the i-th word. If k_i is the number of documents outside class C that contain the i-th term, then n_i = m_ci + k_i and the formula becomes:

IDF_{i} = \log \frac{N \cdot m_{ci}}{m_{ci} + k_{i}}, \qquad (5)

The formula shows that IDF_i is positively correlated with m_ci and negatively correlated with k_i, so it embodies the intended improvement.
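
The effect of the improved IDF can be checked numerically. The sketch below assumes the reading IDF_i = log(N · m_ci / (m_ci + k_i)), which is what formula (4) gives once n_i is written as m_ci + k_i and which matches the form TF × log(m/(m+k) × N) stated elsewhere in this document. The function name, the sample counts and the use of the natural logarithm are assumptions.

```python
import math


def improved_idf(n_docs: int, m_ci: int, k_i: int) -> float:
    """Improved IDF for class C: log(N * m_ci / (m_ci + k_i)).

    m_ci: documents in class C containing the term.
    k_i:  documents outside class C containing the term.
    """
    return math.log(n_docs * m_ci / (m_ci + k_i))


N = 100
concentrated = improved_idf(N, m_ci=8, k_i=2)  # term mostly inside class C
scattered = improved_idf(N, m_ci=2, k_i=8)     # term mostly outside class C
print(round(concentrated, 4), round(scattered, 4))
```

Both terms occur in 10 documents, so the classic IDF of formula (3) would score them identically; the improved form gives the class-concentrated term the higher value.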

The weight calculation unit calculates the weight w_ij of each word in each document, takes the K words with the largest weights as the feature words of that document, and merges the feature words of all documents to form the feature word space. This is the dimensional space of the vector space model (VSM) and is denoted V:

V = [cTerm_1, cTerm_2, ..., cTerm_m], (6)

The document weight results obtained earlier are then mapped into the feature word space, which gives the feature vectors of the documents:

[<ID_i, A_i>, (w_i1, w_i2, ..., w_im)], i = 1, 2, ..., n, (7)

where A_i takes the value R or NR. Substituting the above result into formula (7) and setting α = 0, β = 1, γ = 1 gives the class center vector c_R of the relevant class and the class center vector c_NR of the irrelevant class, that is:

c_R = (c_R1, c_R2, ..., c_Rm)

c_NR = (c_NR1, c_NR2, ..., c_NRm)

From formula (7), c_R = -c_NR.
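
The construction of the feature word space and of the two class centers can be sketched as follows. The text does not print the Rocchio formula it relies on, so this sketch assumes the standard form c = α·c_0 + β·mean(own class) − γ·mean(other class); with α = 0, β = 1, γ = 1 this reduces to a difference of class means and reproduces the stated property c_R = −c_NR. All function names and sample weights are hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_terms(weights: Dict[str, float], k: int) -> List[str]:
    """The K highest-weighted terms of one document (ties broken alphabetically)."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def build_feature_space(doc_weights: List[Dict[str, float]], k: int) -> List[str]:
    """Union of every document's top-K terms, in a fixed order (the space V)."""
    space = set()
    for weights in doc_weights:
        space.update(top_k_terms(weights, k))
    return sorted(space)


def to_vector(weights: Dict[str, float], space: List[str]) -> List[float]:
    """Map a document's term weights onto the feature word space."""
    return [weights.get(term, 0.0) for term in space]


def mean_vector(vectors: List[List[float]]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def class_centers(rel: List[List[float]], irr: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Assumed Rocchio form with alpha=0, beta=1, gamma=1."""
    m_r, m_nr = mean_vector(rel), mean_vector(irr)
    c_r = [a - b for a, b in zip(m_r, m_nr)]
    c_nr = [b - a for a, b in zip(m_r, m_nr)]
    return c_r, c_nr


docs = [
    ({"sale": 0.9, "phone": 0.5, "today": 0.1}, "R"),
    ({"rain": 0.8, "today": 0.3, "sale": 0.1}, "NR"),
]
space = build_feature_space([w for w, _ in docs], k=2)
rel = [to_vector(w, space) for w, a in docs if a == "R"]
irr = [to_vector(w, space) for w, a in docs if a == "NR"]
c_r, c_nr = class_centers(rel, irr)
print(space)
print(c_r, c_nr)
```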

(4) To-be-classified documents are collected from the Internet, and frequency statistics are performed on them using the MapReduce-based document term frequency algorithm, giving the term frequency result for each document. The weights of each document in the feature word space V are then calculated, giving:

[<ID_i, Non>, (w_i1, w_i2, ..., w_im)], i = 1, 2, ..., n

where Non indicates that the documents above are all still to be classified. Classification is concentrated in the Map stage: it is based on the distance between the target document and the class center vectors, and the class center vectors are continually updated during classification by formula (8) to achieve the effect of feedback.

c_i = a·c_i + b·w_i, (8)

where c_i is the class center vector of the i-th class, w_i is the vector of the document just assigned to the i-th class, a and b are feedback factors, and a + b = 1.

Map process: each to-be-classified document vector V_D produced above is parsed into the document ID and the document's feature vector w_D = (w_1, w_2, ..., w_m). The document is then classified with the feedback mechanism, based on the class center vectors c_R and c_NR described above. Finally, the document ID is output as the value and the class to which the document belongs is output as the key. The pseudocode is as follows:

Input: VD, cR, cNR

Output: <key, value>

(ID, wD) = parse(VD)
value = ID
sR = cos_similarity(wD, cR)
sNR = cos_similarity(wD, cNR)
if sR > sNR then
    key = "R"
    cR = a * cR + b * wD
else
    key = "NR"
    cNR = a * cNR + b * wD
output(key, value)
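
The classification step with feedback can be sketched in Python as below. This is a single-process illustration under stated assumptions: the feedback factors a = 0.9 and b = 0.1 and the sample vectors are invented for the example, and how the updated centers would be shared between parallel mappers is not addressed by the text and is not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def cos_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_with_feedback(
    docs: List[Tuple[str, List[float]]],
    c_r: Sequence[float],
    c_nr: Sequence[float],
    a: float = 0.9,  # assumed feedback factors, a + b = 1
    b: float = 0.1,
) -> Tuple[List[Tuple[str, str]], List[float], List[float]]:
    """Label each document R or NR and nudge the winning center toward it."""
    c_r, c_nr = list(c_r), list(c_nr)
    output = []
    for doc_id, w_d in docs:
        if cos_similarity(w_d, c_r) > cos_similarity(w_d, c_nr):
            key = "R"
            c_r = [a * c + b * w for c, w in zip(c_r, w_d)]    # formula (8)
        else:
            key = "NR"
            c_nr = [a * c + b * w for c, w in zip(c_nr, w_d)]  # formula (8)
        output.append((key, doc_id))
    return output, c_r, c_nr


docs = [("d1", [0.8, 0.2]), ("d2", [0.1, 0.9])]
labels, c_r, c_nr = classify_with_feedback(docs, c_r=[1.0, 0.0], c_nr=[0.0, 1.0])
print(labels)
```

Because the centers change after each document, the result can depend on the order in which documents are processed; this is inherent in the feedback update.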

Reduce process: as before, between the Map process and the Reduce process the <key, value> pairs with the same key are combined into <key, value_list>, which is lightly processed and output essentially unchanged. The pseudocode is as follows:

Input: <key, value_list>

Output: <key, value>

for doc in value_list do
    list.add(doc)
value = list
output(key, value)

After the above processing, the final classification result is obtained as follows:

R: Doc_R1, Doc_R2, ..., Doc_Rp

NR: Doc_NR1, Doc_NR2, ..., Doc_NRq

(5) The similarity threshold used to screen the to-be-classified documents against the documents in the document library is set;

(6) The to-be-classified documents are screened according to the set threshold to obtain the target documents;

(7) Product promotion content is loaded into the target documents obtained by threshold screening, and the loaded document data is sent to the Internet.
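
Steps (5) to (7) can be illustrated with the short Python sketch below. The text does not say whether a document whose similarity equals the threshold is kept, nor how promotion content is attached, so the `>=` comparison, the text concatenation, the threshold of 0.7 and all sample data are assumptions made for the example.

```python
from typing import Dict, List


def screen(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Keep the IDs of documents whose similarity reaches the threshold (assumed >=)."""
    return [doc_id for doc_id, score in similarities.items() if score >= threshold]


def attach_promotion(documents: Dict[str, str], target_ids: List[str], promo: str) -> Dict[str, str]:
    """Append promotion content to each target document (assumed plain-text append)."""
    return {doc_id: documents[doc_id] + "\n\n" + promo for doc_id in target_ids}


similarities = {"d1": 0.92, "d2": 0.40, "d3": 0.75}
documents = {
    "d1": "Phone launch draws crowds",
    "d2": "Local weather report",
    "d3": "Holiday sales surge",
}
targets = screen(similarities, threshold=0.7)
published = attach_promotion(documents, targets, "Promo: 10% off today")
print(targets)
```

Raising the threshold trades recall for precision: fewer documents are published, but those that are published sit closer to the relevant class.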

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. An Internet information screening system, characterized in that it comprises a communication unit, a preprocessing unit, a weight calculation unit, a policy unit, a threshold screening unit and an execution unit;

The preprocessing unit is configured to preprocess the documents in the document library, including sorting, word segmentation and denoising, and frequency statistics. Sorting a document means separating it into its document ID, document content and document attribute. Word segmentation and denoising means segmenting and denoising the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency statistics means merging the values that share the same document attribute and counting the frequency of each distinct value in the value list of that document attribute;

The classifier unit is configured to perform frequency statistics on the to-be-classified documents to obtain the term frequencies of each document, to calculate the weights of each document in the feature word space, and, using the class center vectors of the documents in the document library calculated by the weight calculation unit, to classify the to-be-classified documents according to a feedback mechanism, obtaining the similarity result between the to-be-classified documents and the documents in the document library;

2. The Internet information screening system according to claim 1, characterized in that it also comprises an execution unit, which is configured to load product promotion content into the target documents screened by the threshold screening unit and to pass the loaded document data to the Internet through the communication unit.

3. The Internet information screening system according to claim 1, characterized in that the weight calculation unit uses an improved TF-IDF algorithm, TFIDF = TF × log(m/(m+k) × N), where TF is the term frequency of a given feature term, m is the document frequency of the feature term within the domain, k is its document frequency outside the domain, and N is the total number of documents.

4. The Internet information screening system according to claim 1, characterized in that the feedback mechanism used by the classifier takes the distance between a to-be-classified document and the class center vectors of the documents in the document library as its basis, and updates the class center vectors during classification; the class center vector is updated by the formula c_i = a·c_i + b·w_i, where c_i is the class center vector of the i-th class, w_i is the vector of a document of the i-th class, a and b are feedback factors, and a + b = 1.

5. The Internet information screening system according to claim 1, characterized in that the frequency statistics of the classifier unit use a MapReduce-based Rocchio algorithm.

6. An Internet information screening method, characterized in that it comprises the following steps:

(1) collecting Internet documents to form a document library;

(2) preprocessing the documents in the document library, including sorting, word segmentation and denoising, and frequency statistics. Sorting a document means separating it into its document ID, document content and document attribute. Word segmentation and denoising means segmenting and denoising the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency statistics means merging the values that share the same document attribute and counting the frequency of each distinct value in the value list of that document attribute;

(3) calculating weights for the preprocessed documents: specifically, calculating the weight of each word in each document, taking the K words with the largest weights as the feature words of that document, merging the feature words of all documents to form the feature word space, and mapping the document weight results into the feature word space to obtain the class center vector of each class of documents;

(4) collecting to-be-classified documents from the Internet and performing frequency statistics on them to obtain the term frequencies of each document; calculating the weights of each document in the feature word space; and, using the class center vectors of the documents in the document library from step (3), classifying the to-be-classified documents according to a feedback mechanism, obtaining the similarity result between the to-be-classified documents and the documents in the document library;

7. The Internet information screening method according to claim 6, characterized in that it also comprises the step of loading product promotion content into the target documents obtained by threshold screening and sending the loaded document data to the Internet.

8. The Internet information screening method according to claim 6, characterized in that the weight calculation in step (3) uses an improved TF-IDF algorithm, TFIDF = TF × log(m/(m+k) × N), where TF is the term frequency of a given feature term, m is the document frequency of the feature term within the domain, k is its document frequency outside the domain, and N is the total number of documents.

9. The Internet information screening method according to claim 6, characterized in that the feedback mechanism in step (4) takes the distance between a to-be-classified document and the class center vectors of the documents in the document library as its basis, and updates the class center vectors during classification; the class center vector is updated by the formula c_i = a·c_i + b·w_i, where c_i is the class center vector of the i-th class, w_i is the vector of a document of the i-th class, a and b are feedback factors, and a + b = 1.

10. The Internet information screening method according to any one of claims 6-9, characterized in that the frequency statistics in step (4) use a MapReduce-based Rocchio algorithm.