CN105117466A - Internet information screening system and method - Google Patents

Publication number
CN105117466A
Authority
CN
China
Prior art keywords
document
frequency
sorted
documents
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510536772.1A
Other languages
Chinese (zh)
Inventor
杨裕芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Haobai Information Service Branch of China Telecom Corp Ltd
Original Assignee
Hubei Haobai Information Service Branch of China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Haobai Information Service Branch of China Telecom Corp Ltd
Priority to CN201510536772.1A
Publication of CN105117466A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to an Internet information screening method and belongs to the technical field of computer networks. Internet documents are collected to form a document library, and the documents in the library are preprocessed; the preprocessing comprises sorting, word segmentation and denoising, and frequency counting. Weights are calculated for the preprocessed documents to obtain the class center vector of each document class. Frequency counting is then performed on the documents to be classified, yielding a similarity result between the documents to be classified and the documents in the library. The documents to be classified are screened against a set threshold to obtain the target documents; finally, promotional content is loaded into the screened target documents and the document data is sent to the Internet. The invention addresses the classification and screening of massive volumes of hot-event information documents for a specific category, and it improves processing speed, substantially raising the execution speed and efficiency of the system while maintaining accuracy.

Description

An Internet information screening system and method
Technical field
The present invention relates to the technical field of computer networks, and in particular to an Internet information screening system and method.
Background
As the times have progressed, information on the network has grown explosively. The news media of portal websites are no longer the main source of Internet content; social platforms, Weibo, WeChat, WeChat Moments, forums, and the like all generate massive amounts of information every day. An operator wants to extract valuable hot network event information from this mass of information promptly and effectively, edit and process it into its own content with marketing characteristics, and publish it quickly through its own Internet channels, thereby increasing customer stickiness and the marketing success rate. The timeliness requirements for publishing valuable event marketing information keep rising. Presenting the most valuable event information to users promptly, after marketing attributes have been added to it, has become an unavoidable requirement for e-commerce, and screening hot-event documents quickly has become a problem that e-commerce websites urgently need to solve. In essence, screening event information documents reduces to a two-class document classification problem, but it differs markedly from traditional document classification in two respects. First, the boundary between document classes is ill-defined: the classification criterion is the value of a so-called hot event, which is judged by people. Second, with the development of the information industry, and especially the explosive growth of the Internet, the hot-event data to be analyzed is massive in scale.
Summary of the invention
The technical problem to be solved by the present invention is to provide an Internet information screening system and method that address the deficiencies of the prior art.
The technical solution by which the present invention solves the above technical problem is as follows:
An Internet information screening system comprises a communication unit, a preprocessing unit, a weight calculation unit, a policy unit, a threshold screening unit, and an execution unit;
The communication unit collects Internet documents to form a document library and passes the document information in the library to the preprocessing unit; it also collects Internet documents to be classified and passes their information to the classifier unit;
The preprocessing unit sorts the documents in the document library, performs word segmentation and denoising, and counts term frequencies. Sorting a document means separating it into its document ID, document content, and document attribute. Word segmentation and denoising are applied to the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency counting means merging the values that share the same document attribute and counting the frequency of each distinct value in the value queue of that attribute;
The weight calculation unit calculates weights for the preprocessed documents. Specifically, it calculates the weight of each word in each document, takes the K words with the largest weights as the feature words of that document, merges the feature words of all documents to form the feature word space, maps the document weight results into the feature word space, and obtains the class center vector of each document class;
The classifier unit counts term frequencies for the documents to be classified, obtaining the term-frequency statistics of each document, and calculates the weights of each document in the feature word space. Using the class center vectors that the weight calculation unit computed for the documents in the library, it classifies the documents according to a feedback mechanism, outputs the document ID as the value and the document's class as the key, and obtains the similarity result between the documents to be classified and the documents in the library;
The policy unit sets the similarity threshold used to screen the documents to be classified against the documents in the document library;
The threshold screening unit screens the documents to be classified according to the threshold set by the policy unit and obtains the target documents.
The beneficial effects of the invention are as follows. After the website server collects event documents published through news-oriented WeChat accounts, websites, and Weibo, the screening overcomes the performance limits encountered when running on large-scale hot-event data sets. It solves the problem of classifying and screening massive volumes of hot-event information documents for a specific category, and it improves processing speed, substantially raising the execution speed and efficiency of the system while maintaining accuracy. It meets the timeliness that e-commerce marketing requires of hot events: valuable information can be screened out of large-scale hot-event data and published promptly, which improves the timeliness and diversity of the website's marketing and increases its transaction volume and revenue.
Further, the system of the present invention also comprises an execution unit, which loads product promotion content into the target documents selected by the threshold screening unit and passes the loaded document data to the communication unit.
Further, the weight calculation unit uses an improved TF-IDF algorithm, $TF\text{-}IDF = TF \times \log\left(\frac{m}{m+k} \times N\right)$, where TF is the term frequency of a feature term, m is the document frequency of the term within the domain, k is its document frequency outside the domain, and N is the total number of documents.
Further, the feedback mechanism used by the classifier takes the distance between a document to be classified and the class center vectors of the documents in the library as the basis for classification and updates the class center vectors during classification. The update follows the formula $c_i = a \cdot c_i + b \cdot w_i$, where $c_i$ is the class center vector of class i, $w_i$ is the document vector of class i, and a and b are feedback coefficients with a + b = 1.
An Internet information screening method comprises the following steps:
(1) Internet documents are collected to form a document library;
(2) The documents in the document library are sorted, segmented and denoised, and their term frequencies are counted. Sorting a document means separating it into its document ID, document content, and document attribute. Word segmentation and denoising are applied to the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency counting means merging the values that share the same document attribute and counting the frequency of each distinct value in the value queue of that attribute;
(3) Weights are calculated for the preprocessed documents. Specifically, the weight of each word in each document is calculated, the K words with the largest weights are taken as the feature words of that document, the feature words of all documents are merged to form the feature word space, and the document weight results are mapped into the feature word space to obtain the class center vector of each document class;
(4) Internet documents to be classified are collected and their term frequencies are counted, giving the term-frequency statistics of each document. The weights of each document in the feature word space V are calculated, and each document vector VD to be classified in V is parsed into the document ID and the document feature vector wD = (w1, w2, …, wm). Using the class center vectors obtained in step (3) for the documents in the library, the documents are classified according to a feedback mechanism; the document ID is output as the value and the document's class as the key, giving the similarity result between the documents to be classified and the documents in the library;
(5) The similarity threshold used to screen the documents to be classified against the documents in the document library is set;
(6) The documents to be classified are screened according to the set threshold to obtain the target documents.
Further, the method also comprises the step of loading product promotion content into the target documents selected by the threshold screening unit and sending the loaded document data to the Internet.
Further, the weight calculation in step (3) uses an improved TF-IDF algorithm, with the formula:
$$TF\text{-}IDF = TF \times \log\left(\frac{m}{m+k} \times N\right)$$
where TF is the term frequency of a feature term, m is the document frequency of the term within the domain, k is its document frequency outside the domain, and N is the total number of documents.
Further, the feedback mechanism in step (4) takes the distance between a document to be classified and the class center vectors of the documents in the library as the basis for classification and updates the class center vectors during classification. The update follows the formula $c_i = a \cdot c_i + b \cdot w_i$, where $c_i$ is the class center vector of class i, $w_i$ is the document vector of class i, and a and b are feedback coefficients with a + b = 1.
Further, the frequency counting in step (4) uses a Rocchio algorithm based on MapReduce.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system of the present invention;
Fig. 2 is a flowchart of the method of the present invention.
Detailed description
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples serve only to explain the present invention and are not intended to limit its scope.
As shown in Fig. 1, an Internet information screening system comprises a communication unit, a preprocessing unit, a weight calculation unit, a policy unit, a threshold screening unit, and an execution unit;
The communication unit collects Internet documents to form a document library and passes the document information in the library to the preprocessing unit; it also collects Internet documents to be classified and passes their information to the classifier unit;
The preprocessing unit sorts the documents in the document library, performs word segmentation and denoising, and counts term frequencies. Sorting a document means separating it into its document ID, document content, and document attribute. Word segmentation and denoising are applied to the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency counting means merging the values that share the same document attribute and counting the frequency of each distinct value in the value queue of that attribute;
The weight calculation unit calculates weights for the preprocessed documents. Specifically, it calculates the weight of each word in each document, takes the K words with the largest weights as the feature words of that document, merges the feature words of all documents to form the feature word space, maps the document weight results into the feature word space, and finally obtains the class center vector of each document class;
The classifier unit counts term frequencies for the documents to be classified, obtaining the term-frequency statistics of each document, and calculates the weights of each document in the feature word space. Using the class center vectors that the weight calculation unit computed for the documents in the library, it classifies the documents according to a feedback mechanism, outputs the document ID as the value and the document's class as the key, and finally obtains the similarity result between the documents to be classified and the documents in the library;
The policy unit sets the similarity threshold used to screen the documents to be classified against the documents in the document library;
The threshold screening unit screens the documents to be classified according to the threshold set by the policy unit and obtains the target documents.
The system of the present invention also comprises an execution unit, which loads product promotion content into the target documents selected by the threshold screening unit and passes the loaded document data to the communication unit.
The weight calculation unit uses an improved TF-IDF algorithm, $TF\text{-}IDF = TF \times \log\left(\frac{m}{m+k} \times N\right)$, where TF is the term frequency of a feature term, m is the document frequency of the term within the domain, k is its document frequency outside the domain, and N is the total number of documents.
The feedback mechanism used by the classifier takes the distance between a document to be classified and the class center vectors of the documents in the library as the basis for classification and updates the class center vectors during classification. The update follows the formula $c_i = a \cdot c_i + b \cdot w_i$, where $c_i$ is the class center vector of class i, $w_i$ is the document vector of class i, and a and b are feedback coefficients with a + b = 1.
A method of screening Internet information with the above Internet information screening system comprises the following steps:
(1) Internet documents are collected to form a document library;
(2) The documents in the document library are sorted, segmented and denoised, and their term frequencies are counted. Sorting a document means separating it into its document ID, document content, and document attribute. Word segmentation and denoising are applied to the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key;
The pseudocode is as follows:
Input: D
Output: <key, value>
(ID, C, A) = parse(D)
T = segment(C)
for term in T do
    key = make_pair(ID, A)
    value = term
    output(key, value)
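The Map step above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the dictionary layout of a document, the `STOP_WORDS` set, and the regex-based `segment` function are all assumptions standing in for a real word segmenter and denoising step.

```python
import re

# Hypothetical stop-word list standing in for the "denoising" step.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def segment(content):
    """Toy segmenter: lowercase word tokens with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", content.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def map_document(doc):
    """Emit one ((ID, A), term) pair per term, mirroring the Map pseudocode."""
    key = (doc["id"], doc["attribute"])
    return [(key, term) for term in segment(doc["content"])]

pairs = map_document({"id": 1, "content": "The sale of the phone", "attribute": "R"})
```

Each emitted pair carries the document ID and attribute as the key, so a later grouping step can collect all terms that belong to one document.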
Frequency counting on the documents means merging the values that share the same document attribute and counting the frequency of each distinct value in the value queue of that attribute. It uses the Rocchio algorithm based on MapReduce, specifically in the Reduce process:
Before the Map results are passed to the Reduce process, the values that share a key are merged to form <key, value_list>. For each key, the frequency of each distinct value in its value_list is counted (each value here is a term). Finally, the key is output unchanged as the key of Reduce, and the distinct values in its value_list, together with their frequencies, are output as the value of Reduce.
The pseudocode is as follows:
Input: <key, value_list>
Output: <key, value>
for term in distinct(value_list) do
    freq = count(term, value_list)
    list.add(make_pair(term, freq))
value = list
output(key, value)
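A rough Python equivalent of this Reduce step is shown below. The `shuffle` helper is an assumption that imitates the grouping a MapReduce framework performs between Map and Reduce; `collections.Counter` does the per-term counting in one pass.

```python
from collections import Counter, defaultdict

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_frequency(key, value_list):
    """Count each distinct term once and emit (key, [(term, freq), ...])."""
    counts = Counter(value_list)
    return key, sorted(counts.items())

pairs = [((1, "R"), "sale"), ((1, "R"), "phone"), ((1, "R"), "sale")]
result = [reduce_frequency(k, v) for k, v in shuffle(pairs).items()]
```

Sorting the counted terms is only for a deterministic output order in this sketch; the source does not specify an order.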
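This finally yields data of the following form:
$$[\langle ID_i, A_i\rangle, (\langle term_{i1}, freq_{i1}\rangle, \langle term_{i2}, freq_{i2}\rangle, \ldots, \langle term_{im}, freq_{im}\rangle)], \quad i = 1, 2, \ldots, n$$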
Compared with the traditional Rocchio algorithm, the MapReduce-based Rocchio relevance feedback algorithm is noticeably more accurate in classification and screening; it outperforms the KNN algorithm and is only slightly inferior to the SVM algorithm. In processing speed, however, the MapReduce-based Rocchio relevance feedback algorithm takes full advantage of MapReduce in big-data processing: its execution speed and efficiency are far higher than those of the other algorithms and, within a certain range, keep improving as the cluster grows.
(3) Weights are calculated for the preprocessed documents. Specifically, the weight of each word in each document is calculated, the K words with the largest weights are taken as the feature words of that document, the feature words of all documents are merged to form the feature word space, and the document weight results are mapped into the feature word space to obtain the feature vector and class of each document. The weight calculation uses a feature selection method based on an improved TF-IDF algorithm. TF-IDF is a feature representation method commonly used in information retrieval. Its basic idea is that if a term has a high term frequency (TF) in a given document but appears in few documents, that is, its document frequency (DF) is low, then the term discriminates well between classes and should be given a higher weight. That is:
$$W = TF \times \frac{1}{DF} = TF \times IDF, \tag{1}$$
where IDF denotes the inverse document frequency. TF and IDF are defined as follows:
$$TF_{ij} = \frac{freq_{ij}}{Maxfreq_j}, \tag{2}$$
$$IDF_i = \log\frac{N}{n_i}, \tag{3}$$
where $freq_{ij}$ is the number of times the i-th word occurs in the j-th document, $Maxfreq_j$ is the number of occurrences of the most frequent word in the j-th document, $N$ is the total number of documents, and $n_i$ is the number of documents that contain the i-th word.
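The basic TF and IDF definitions in formulas (2) and (3) can be computed as in the sketch below. The function names and the representation of a document as a term-to-count dictionary are illustrative assumptions, and the natural logarithm is used because the source does not state a base.

```python
import math

def term_frequency(freqs):
    """Formula (2): normalise each count by the highest count in the document."""
    max_freq = max(freqs.values())
    return {term: f / max_freq for term, f in freqs.items()}

def inverse_document_frequency(docs):
    """Formula (3): log(N / n_i), where n_i counts documents containing term i."""
    n_docs = len(docs)
    doc_freq = {}
    for freqs in docs:
        for term in freqs:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}

docs = [{"sale": 2, "phone": 1}, {"sale": 1, "weather": 3}]
tf = term_frequency(docs[0])
idf = inverse_document_frequency(docs)
```

In this toy corpus "sale" appears in every document, so its IDF is zero, while "phone" and "weather" each appear in one of two documents and receive log 2.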
However, this traditional TF-IDF algorithm has shortcomings when applied to document classification. In fact, if a term occurs frequently within one class of documents, the term represents the features of that class well and should be given a higher weight. Zhang Yufang et al. therefore proposed an improved TF-IDF algorithm for document classification that modifies the IDF formula. For a given document class C, IDF is:
$$IDF_i = \log\frac{N \cdot m_{ci}}{n_i}, \tag{4}$$
where $m_{ci}$ is the number of documents in class C that contain the i-th word. If $k_i$ is the number of documents outside class C that contain the i-th term, so that $n_i = m_{ci} + k_i$, the formula becomes:
$$IDF_i = \log\frac{N \cdot m_{ci}}{m_{ci} + k_i}, \tag{5}$$
The formula shows that $IDF_i$ is positively correlated with $m_{ci}$ and negatively correlated with $k_i$, so it reflects the idea behind the improvement.
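The behaviour of the improved IDF can be checked numerically. The sketch below assumes the reading IDF = log(N · m / (m + k)), which matches the TF × log(m/(m+k) × N) form stated in the summary; the function name, the natural logarithm, and the error raised for m = 0 are assumptions of this example.

```python
import math

def improved_idf(n_total, m_in_class, k_outside):
    """Improved IDF for one term and one class: log(N * m / (m + k))."""
    if m_in_class <= 0:
        # Assumption: the formula is only evaluated for terms present in the class.
        raise ValueError("m_in_class must be positive")
    return math.log(n_total * m_in_class / (m_in_class + k_outside))

# A term found only inside the class versus one spread across other classes.
concentrated = improved_idf(100, 20, 0)
spread = improved_idf(100, 20, 60)
```

With N and m fixed, raising k from 0 to 60 lowers the weight from log 100 to log 25, which is the negative correlation with k described above.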
The weight calculation unit calculates the weight $w_{ij}$ of each word in each document, takes the K words with the largest weights as the feature words of that document, and merges the feature words of all documents to form the feature word space, which is the dimensional space of the vector space model (VSM) and is denoted V:
$$V = [cTerm_1, cTerm_2, \ldots, cTerm_m], \tag{6}$$
The document weight results obtained earlier are then mapped into the feature word space to give the feature vector of each document:
$$[\langle ID_i, A_i\rangle, (w_{i1}, w_{i2}, \ldots, w_{im})], \quad i = 1, 2, \ldots, n, \tag{7}$$
where the value of $A_i$ is R or NR. Substituting the above result into formula (7) and setting α = 0, β = 1, γ = 1 gives the class center vector $c_R$ of the relevant class and the class center vector $c_{NR}$ of the irrelevant class:
$$c_R = (c_{R1}, c_{R2}, \ldots, c_{Rm})$$
$$c_{NR} = (c_{NR1}, c_{NR2}, \ldots, c_{NRm})$$
From formula (7), $c_R = -c_{NR}$.
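Step (3) as a whole can be illustrated with the sketch below. The text cites α, β, and γ but does not reproduce the Rocchio expression itself, so this example assumes the textbook Rocchio form with α = 0, in which each class center is β times the mean vector of its own class minus γ times the mean vector of the other class. All function names and the sample weights are hypothetical.

```python
def top_k_terms(weights, k):
    """Return the k terms with the largest weights in one document."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def build_feature_space(doc_weights, k):
    """Merge the per-document top-k terms into one ordered feature word space V."""
    space = []
    for weights in doc_weights:
        for term in top_k_terms(weights, k):
            if term not in space:
                space.append(term)
    return space

def to_vector(weights, space):
    """Map a document's term weights onto V; absent terms get weight 0."""
    return [weights.get(term, 0.0) for term in space]

def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rocchio_centers(relevant, irrelevant, beta=1.0, gamma=1.0):
    """Class centers under the assumed Rocchio form with alpha = 0."""
    mean_r = mean_vector(relevant)
    mean_nr = mean_vector(irrelevant)
    c_r = [beta * r - gamma * nr for r, nr in zip(mean_r, mean_nr)]
    c_nr = [beta * nr - gamma * r for r, nr in zip(mean_r, mean_nr)]
    return c_r, c_nr

doc_weights = [
    {"sale": 0.9, "phone": 0.7, "today": 0.1},    # labelled R
    {"weather": 0.8, "sale": 0.3, "rain": 0.6},   # labelled NR
]
space = build_feature_space(doc_weights, k=2)
vectors = [to_vector(w, space) for w in doc_weights]
c_r, c_nr = rocchio_centers([vectors[0]], [vectors[1]])
```

Under this assumed form with β = γ = 1 the two centers are exact negatives of each other, which agrees with the relation between the two class center vectors stated in the text.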
(4) Internet documents to be classified are collected and their term frequencies are counted with the MapReduce-based document word-frequency counting algorithm, giving the word-frequency statistics of each document. The weights of each document in the feature word space V are then calculated, giving:
$$[\langle ID_i, Non\rangle, (w_{i1}, w_{i2}, \ldots, w_{im})], \quad i = 1, 2, \ldots, n$$
where Non indicates that the documents are all still to be classified. The classification takes place mainly in the Map stage. It is based on the distance between the target document and the class center vectors, and the class center vectors are updated continually during classification according to formula (8) to achieve the feedback effect.
$$c_i = a \cdot c_i + b \cdot w_i, \tag{8}$$
where $c_i$ is the class center vector of class i, $w_i$ is the vector of the document just assigned to class i, and a and b are feedback coefficients with a + b = 1.
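The feedback update in formula (8) is a weighted blend of the old center and the newly assigned document. A small sketch follows; the default coefficient values are assumptions, since the source only requires a + b = 1.

```python
def update_center(center, doc_vector, a=0.9, b=0.1):
    """Formula (8): c_i = a * c_i + b * w_i, with a + b = 1."""
    if abs(a + b - 1.0) > 1e-9:
        raise ValueError("feedback coefficients must satisfy a + b = 1")
    return [a * c + b * w for c, w in zip(center, doc_vector)]

center = update_center([1.0, 0.0], [0.0, 1.0], a=0.5, b=0.5)
```

A larger a keeps the center close to its previous value, while a larger b lets each newly classified document pull the center more strongly.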
Map process: each document vector VD to be classified, obtained as above, is parsed into the document ID and the document feature vector wD = (w1, w2, …, wm). The document is then classified with the feedback mechanism, based on the class center vectors cR and cNR described above. Finally, the document ID is output as the value and the document's class as the key. The pseudocode is as follows:
Input: VD, cR, cNR
Output: <key, value>
(ID, wD) = parse(VD)
value = ID
sR = cos_similarity(wD, cR)
sNR = cos_similarity(wD, cNR)
if sR > sNR then
    key = "R"
    cR = a * cR + b * wD
else
    key = "NR"
    cNR = a * cNR + b * wD
output(key, value)
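The classification Map step can be sketched in Python as below. The sequential loop, the `(ID, vector)` tuple layout, and the default feedback coefficients are assumptions of this example; how updated centers are shared between parallel mappers is not described in the source and is not modelled here.

```python
import math

def cos_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def classify_map(docs, c_r, c_nr, a=0.9, b=0.1):
    """Assign each (ID, vector) to R or NR and feed it back into that center."""
    output = []
    for doc_id, w in docs:
        if cos_similarity(w, c_r) > cos_similarity(w, c_nr):
            key = "R"
            c_r = [a * c + b * x for c, x in zip(c_r, w)]
        else:
            key = "NR"
            c_nr = [a * c + b * x for c, x in zip(c_nr, w)]
        output.append((key, doc_id))
    return output, c_r, c_nr

docs = [(101, [1.0, 0.1]), (102, [0.0, 1.0])]
output, c_r, c_nr = classify_map(docs, c_r=[1.0, 0.0], c_nr=[0.0, 1.0])
```

Document 101 lies close to the R center and document 102 close to the NR center, so they receive the keys "R" and "NR" respectively, and each nudges its own center.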
Reduce process: as before, between the Map process and the Reduce process the <key, value> pairs that share a key are merged into <key, value_list>, which is output essentially unchanged after light processing. The pseudocode is as follows:
Input: <key, value_list>
Output: <key, value>
for doc in value_list do
    list.add(doc)
value = list
output(key, value)
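Because this Reduce step only collects document IDs per class, a Python sketch is short. The function name and the use of a dictionary as the output container are assumptions.

```python
from collections import defaultdict

def reduce_classes(pairs):
    """Collect the document IDs emitted under each class key."""
    grouped = defaultdict(list)
    for key, doc_id in pairs:
        grouped[key].append(doc_id)
    return dict(grouped)

classes = reduce_classes([("R", 101), ("NR", 102), ("R", 103)])
```

The result maps each class label to the list of document IDs assigned to it.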
After the above processing, the final classification result is obtained as follows:
R: $Doc_{R1}, Doc_{R2}, \ldots, Doc_{Rp}$
NR: $Doc_{NR1}, Doc_{NR2}, \ldots, Doc_{NRq}$
(5) The similarity threshold used to screen the documents to be classified against the documents in the document library is set;
(6) The documents to be classified are screened according to the set threshold to obtain the target documents;
(7) Product promotion content is loaded into the target documents selected by the threshold screening unit, and the loaded document data is sent to the Internet.
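The threshold setting and screening in steps (5) and (6) might look like the sketch below. The source does not say whether the comparison is strict or which similarity score is compared, so the `>=` test and the `(ID, score)` input layout are assumptions.

```python
def screen(similarities, threshold):
    """Keep the IDs of documents whose similarity meets the threshold."""
    return [doc_id for doc_id, score in similarities if score >= threshold]

targets = screen([(101, 0.92), (102, 0.35), (103, 0.80)], threshold=0.8)
```

Raising the threshold narrows the set of target documents that go on to receive promotional content in step (7).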
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. An Internet information screening system, characterized in that it comprises a communication unit, a preprocessing unit, a weight calculation unit, a policy unit, a threshold screening unit, and an execution unit;
The communication unit collects Internet documents to form a document library and passes the document information in the library to the preprocessing unit; it also collects Internet documents to be classified and passes their information to the classifier unit;
The preprocessing unit preprocesses the documents in the document library, which comprises sorting the documents, performing word segmentation and denoising, and counting term frequencies. Sorting a document means separating it into its document ID, document content, and document attribute. Word segmentation and denoising are applied to the sorted document; each resulting term is output as the value, and the document attribute to which the term belongs is output as the key. Frequency counting means merging the values that share the same document attribute and counting the frequency of each distinct value in the value queue of that attribute;
The weight calculation unit calculates weights for the preprocessed documents. Specifically, it calculates the weight of each word in each document, takes the K words with the largest weights as the feature words of that document, merges the feature words of all documents to form the feature word space, maps the document weight results into the feature word space, and obtains the class center vector of each document class;
The classifier unit counts term frequencies for the documents to be classified, obtaining the term-frequency statistics of each document, calculates the weights of each document in the feature word space, and, using the class center vectors that the weight calculation unit computed for the documents in the library, classifies the documents to be classified according to a feedback mechanism to obtain the similarity result between the documents to be classified and the documents in the library;
The policy unit sets the similarity threshold used to screen the documents to be classified against the documents in the document library;
The threshold screening unit screens the documents to be classified according to the threshold set by the policy unit and obtains the target documents.
2. The Internet information screening system according to claim 1, characterized in that it further comprises an execution unit, which loads product promotion content into the target documents selected by the threshold screening unit and passes the loaded document data to the Internet through the communication unit.
3. The Internet information screening system according to claim 1, characterized in that the weight calculation unit uses an improved TF-IDF algorithm, $TF\text{-}IDF = TF \times \log\left(\frac{m}{m+k} \times N\right)$, where TF is the term frequency of a feature term, m is the document frequency of the term within the domain, k is its document frequency outside the domain, and N is the total number of documents.
4. The Internet information screening system according to claim 1, characterized in that the feedback mechanism used by the classifier takes the distance between a document to be classified and the class center vectors of the documents in the library as the basis for classification and updates the class center vectors during classification; the update follows the formula $c_i = a \cdot c_i + b \cdot w_i$, where $c_i$ is the class center vector of class i, $w_i$ is the document vector of class i, and a and b are feedback coefficients with a + b = 1.
5. The Internet information screening system according to claim 1, characterized in that the frequency counting of the classifier unit uses a Rocchio algorithm based on MapReduce.
6. An internet information screening method, characterized in that it comprises the following steps:
(1) collecting documents from the internet to form a document library;
(2) preprocessing the documents in the document library, including document categorization, word segmentation with denoising, and term-frequency statistics; document categorization means organizing each document by document number, document content, and document attribute; word segmentation with denoising means segmenting and denoising the categorized documents, outputting each resulting term as the value and the document attribute to which the term belongs as the key; term-frequency statistics means merging the values that share the same document attribute and counting the frequency of each distinct value in the value queue of that document attribute;
(3) calculating weights for the preprocessed documents, specifically: calculating the weight of each word in each document, taking the K words with the largest weights as the feature words of that document, merging the feature words of all documents to form the feature word space, and mapping the resulting document weights into the feature word space to obtain the class center vector of each class of documents;
(4) collecting documents to be classified from the internet, performing term-frequency statistics on them to obtain term-frequency statistics for each document, calculating the weight of each document in the feature word space, and, using the class center vectors of the documents in the document library from step (3), classifying the documents to be classified according to a feedback mechanism, obtaining the similarity between each document to be classified and the documents in the document library;
(5) setting the similarity threshold used to screen the documents to be classified against the documents in the document library;
(6) screening the documents to be classified according to the set threshold, obtaining the target documents.
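As a rough illustration of the key/value scheme described in step (2), and not as part of the claims, the map and reduce phases can be simulated in plain Python. Whitespace splitting stands in for a real word segmenter, and the stop-word list, field names, and sample documents are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical noise list; a real system would use a proper stop-word table.
STOP_WORDS = {"the", "a", "of"}


def map_phase(documents):
    """Emit (document attribute, term) pairs after segmentation and denoising."""
    for doc in documents:
        for term in doc["content"].lower().split():
            if term not in STOP_WORDS:
                # key = document attribute, value = term
                yield doc["attribute"], term


def reduce_phase(pairs):
    """Merge values that share a key and count the frequency of each term."""
    grouped = defaultdict(Counter)
    for attribute, term in pairs:
        grouped[attribute][term] += 1
    return grouped


docs = [
    {"id": 1, "attribute": "sports", "content": "the match of the season"},
    {"id": 2, "attribute": "sports", "content": "a match report"},
    {"id": 3, "attribute": "finance", "content": "the market report"},
]
freq = reduce_phase(map_phase(docs))
```

In an actual MapReduce deployment the grouping by key would be done by the framework's shuffle stage; the dictionary of counters here only mimics that behavior on a single machine.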
7. The internet information screening method according to claim 6, characterized in that it further comprises the step of loading commodity sales promotion content into the target documents obtained after threshold screening, and sending the loaded document data to the internet.
8. The internet information screening method according to claim 6, characterized in that the weight calculation in step (3) uses the improved algorithm TFIDF = TF × log(m/(m+k) × N), where TF is the term frequency of a given feature term, m is the document frequency of that feature term within the domain, k is the document frequency of that feature term outside the domain, and N is the total number of documents.
9. The internet information screening method according to claim 6, characterized in that the feedback mechanism in step (4) updates the class center vectors during classification, based on the distance between a document to be classified and the class center vectors of the documents in the document library; the class center vector is updated according to the formula c_i = a·c_i + b·w_i, where c_i is the class center vector of the i-th class, w_i is the document vector of the i-th class, and a and b are feedback factors with a + b = 1.
10. The internet information screening method according to any one of claims 6-9, characterized in that the term-frequency statistics in step (4) use a MapReduce-based Rocchio algorithm.
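The classification, feedback, and threshold screening described in the claims could be sketched roughly as below. This is an illustrative reading only: the claims do not name a similarity measure, so cosine similarity is an assumption, as are the function names, the default feedback factor a = 0.9, and the sample vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def classify_with_feedback(doc_vec, centers, a=0.9):
    """Assign doc_vec to the most similar class, then update that class center.

    The update follows c_i = a * c_i + b * w_i with a + b = 1.
    """
    b = 1.0 - a
    label = max(centers, key=lambda name: cosine(doc_vec, centers[name]))
    similarity = cosine(doc_vec, centers[label])
    centers[label] = [a * c + b * w for c, w in zip(centers[label], doc_vec)]
    return label, similarity


def screen(doc_vectors, centers, threshold):
    """Keep the documents whose similarity reaches the configured threshold."""
    targets = []
    for doc_id, vec in doc_vectors.items():
        label, similarity = classify_with_feedback(vec, centers)
        if similarity >= threshold:
            targets.append((doc_id, label))
    return targets


# Hypothetical two-dimensional feature word space with two classes.
centers = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
incoming = {"d1": [1.0, 0.0], "d2": [1.0, 1.0]}
targets = screen(incoming, centers, threshold=0.9)
```

Here the threshold plays the role of the value set by the policy unit: d1 matches the "sports" center exactly and is kept, while d2 sits halfway between the two centers and falls below the threshold.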
CN201510536772.1A 2015-08-27 2015-08-27 Internet information screening system and method Pending CN105117466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510536772.1A CN105117466A (en) 2015-08-27 2015-08-27 Internet information screening system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510536772.1A CN105117466A (en) 2015-08-27 2015-08-27 Internet information screening system and method

Publications (1)

Publication Number Publication Date
CN105117466A true CN105117466A (en) 2015-12-02

Family

ID=54665456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510536772.1A Pending CN105117466A (en) 2015-08-27 2015-08-27 Internet information screening system and method

Country Status (1)

Country Link
CN (1) CN105117466A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102998A (en) * 2016-02-22 2017-08-29 阿里巴巴集团控股有限公司 A kind of String distance computational methods and device
CN107330592A (en) * 2017-06-20 2017-11-07 北京因果树网络科技有限公司 A kind of screening technique, device and the computing device of target Enterprise Object
CN111221959A (en) * 2019-09-27 2020-06-02 武汉创想外码科技有限公司 WNLP text traceability model
CN111815413A (en) * 2020-07-09 2020-10-23 湖南数客星球信息技术有限公司 Big data commodity prediction system and method based on hot event
CN113127611A (en) * 2019-12-31 2021-07-16 北京中关村科金技术有限公司 Method and device for processing question corpus and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110106807A1 (en) * 2009-10-30 2011-05-05 Janya, Inc Systems and methods for information integration through context-based entity disambiguation
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN104331498A (en) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 Method for automatically classifying webpage content visited by Internet users
CN104536830A (en) * 2015-01-09 2015-04-22 哈尔滨工程大学 KNN text classification method based on MapReduce

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HU ZONGZHEN 等: "A fuzzy approach to clustering of text documents based on MapReduce", 《CONFERENCE ON COMPUTATIONAL AND INFORMATION SCIENCES(ICCIS),2013 FIFTH INTERNATIONAL CONFERENCE ON》 *
仲梓源 等: "大数据背景下新闻筛选的分布式算法研究", 《人民网》 *
仲梓源: "基于遗传与反馈的分布式文本分类研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
曾志生 等: "《精准营销 如何精确地找到客户并实现有效销售》", 31 January 2007 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102998A (en) * 2016-02-22 2017-08-29 阿里巴巴集团控股有限公司 A kind of String distance computational methods and device
US11256756B2 (en) 2016-02-22 2022-02-22 Advanced New Technologies Co., Ltd. Character string distance calculation method and device
CN107330592A (en) * 2017-06-20 2017-11-07 北京因果树网络科技有限公司 A kind of screening technique, device and the computing device of target Enterprise Object
CN111221959A (en) * 2019-09-27 2020-06-02 武汉创想外码科技有限公司 WNLP text traceability model
CN113127611A (en) * 2019-12-31 2021-07-16 北京中关村科金技术有限公司 Method and device for processing question corpus and storage medium
CN111815413A (en) * 2020-07-09 2020-10-23 湖南数客星球信息技术有限公司 Big data commodity prediction system and method based on hot event

Similar Documents

Publication Publication Date Title
CN101408883B (en) Method for collecting network public feelings viewpoint
CN103678670B (en) Micro-blog hot word and hot topic mining system and method
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN105488092B (en) A kind of time-sensitive and adaptive sub-topic online test method and system
CN103500175B (en) A kind of method based on sentiment analysis on-line checking microblog hot event
US9967321B2 (en) Meme discovery system
CN101295381B (en) Junk mail detecting method
CN103745000A (en) Hot topic detection method of Chinese micro-blogs
CN104391835A (en) Method and device for selecting feature words in texts
CN103390051A (en) Topic detection and tracking method based on microblog data
CN105117466A (en) Internet information screening system and method
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
CN106557558A (en) A kind of data analysing method and device
CN103886108A (en) Feature selection and weight calculation method of imbalance text set
CN107577724A (en) A kind of big data processing method
CN107679069A (en) Method is found based on a kind of special group of news data and related commentary information
Demirbaga HTwitt: a hadoop-based platform for analysis and visualization of streaming Twitter data
CN106502990A (en) A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
Kotlerman et al. Clustering small-sized collections of short texts
CN101594313A (en) A kind of spam judgement, classification, filter method and system based on potential semantic indexing
Daouadi et al. Organization vs. Individual: Twitter User Classification.
Campbell et al. Content+ context networks for user classification in twitter
Minab et al. A new sentiment classification method based on hybrid classification in Twitter
Li et al. LSSL-SSD: Social spammer detection with laplacian score and semi-supervised learning
CN102799666B (en) Method for automatically categorizing texts of network news based on frequent term set

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151202