CN105718587A - Network content resource evaluation method and evaluation system

Info

Publication number: CN105718587A
Application number: CN201610052315.XA
Authority: CN
Inventors: 王薇; 龙思薇; 刘珊; 马涛
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-01-26
Filing date: 2016-01-26
Publication date: 2016-06-29

Abstract

The invention discloses a network content resource evaluation method and evaluation system. The method comprises the following steps: a web crawler module crawls network information data in a targeted manner, and the obtained data is stored in a database; the crawled data is deduplicated, filtered, parsed and classified; and text analysis or index calculation is performed according to the type of the data, with the result stored in the database so that the analysis result can be called directly from the database for display. By crawling in a targeted manner, the web crawler module obtains web communication data, community website data, video website data, public opinion data and mobile Internet data. The data then undergoes deduplication and filtering, parsing and conversion, and classification, and text analysis or index calculation according to data type yields the trend and direction of public opinion. The data acquisition range is wide, the targeting is precise, and information is obtained and fed back in a timely manner.

Description

A network content resource evaluation method and evaluation system

Technical field

The present invention relates to the technical field of Internet information processing, and in particular to a network content resource evaluation method and evaluation system.

Background art

With the development of Internet technology, the Internet has become a widely used medium that extends into every field of society. It has increasingly become an important channel for disseminating information, has changed how people produce, live, communicate and think, and has had a profound effect on every field of social life and on human survival and development. Network information content resources are the various information resources on the Internet and mobile networks. Analyzing and evaluating these resources reveals trends in the spread of network information, so that developments in online public opinion can be grasped in time. Existing evaluation of network information content, however, suffers from problems such as a small amount of information and delayed information, and cannot objectively reflect developments in online public opinion.

Summary of the invention

The invention provides a network content resource evaluation method and evaluation system to solve the problems of the prior art, such as a small amount of information and delayed information.

To solve the above problems, the present invention provides a network content resource evaluation method comprising the following steps:

crawling network information data in a targeted manner using a web crawler module, and storing the obtained network information data in a database;

deduplicating and parsing the crawled network information data, and classifying the network information data;

performing text analysis or index calculation according to the type of the network information data, and storing the result in the database so that the analysis result can be called directly from the database and displayed.

The network content resource evaluation method provided by the invention further includes the following technical steps:

Further, a general-purpose web crawler module built on the Scrapy framework crawls web data, and a WeChat data acquisition module captures mobile client data through a proxy server; the obtained network information data is stored in a MongoDB database.

Further, capturing mobile client data through the proxy server includes: configuring a proxy for the mobile client's network connection, so that data is sent to the client via the proxy server; the client uses simulated key presses to operate the mobile client automatically; and the proxy server captures packets during data transfer, then filters and parses the data to obtain the required data.

Further, the index calculation includes communication index calculation, viewership index calculation and public opinion index calculation.

Further, the text analysis includes sentiment polarity (positive/negative) judgment, word frequency statistics, associated word statistics, text clustering and text classification.

In a second aspect, the present invention provides a network content resource evaluation system, including:

a web crawler module for crawling network information data in a targeted manner;

a database for storing the network information data crawled by the web crawler module;

a data processing module for deduplicating and filtering, parsing and converting, and classifying the network information data;

an index calculation module and a text analysis module for performing index calculation or text analysis according to the type of the network information data, and storing the result in the database so that the analysis result can be called directly from the database and displayed.

The network content resource evaluation system provided by the invention further includes the following technical features:

Further, the web crawler module includes a web crawler submodule and a WeChat data capture module; the web crawler submodule is a general-purpose web crawler module built on the Scrapy framework, and the WeChat data capture module captures mobile client data through a proxy server.

Further, the WeChat data acquisition module includes a simulated key press module, a proxy service module and a packet interception module; the client uses the simulated key press module to operate the mobile client automatically, and the proxy service module captures packets through the packet interception module during data transfer, then filters and parses the data to obtain the required data.

Further, the index calculation module includes a communication index calculation module, a viewership index calculation module and a public opinion index calculation module.

Further, the text analysis module includes a sentiment polarity judgment module, a word frequency statistics module, an associated word statistics module, a text clustering module and a text classification module.

The invention has the following beneficial effects: the web crawler module captures network information data in a targeted manner, obtaining web communication data, community website data, video website data, public opinion data and mobile Internet data respectively; the various kinds of network information data are deduplicated and filtered, parsed and converted, and classified; and text analysis or index calculation is performed according to the type of the data to obtain the trend and direction of public opinion. The data acquisition range is wide, the targeting is precise, and information is obtained and fed back in a timely manner.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the network content resource evaluation system according to an embodiment of the present invention;

Fig. 2 is a workflow diagram of the network content resource evaluation system according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the working state of the WeChat data acquisition module in an embodiment of the present invention;

Fig. 4 is a workflow diagram of the text analysis module in an embodiment of the present invention.

Detailed description of the invention

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present invention and the features of those embodiments may be combined with one another.

The present invention provides a network content resource evaluation method comprising the following steps:

S100: crawling network information data in a targeted manner using a web crawler module, and storing the obtained network information data in a database;

S200: deduplicating and parsing the crawled network information data, and classifying the network information data;

S300: performing text analysis or index calculation according to the type of the network information data, and storing the result in the database so that the analysis result can be called directly from the database and displayed.

In the above method, the network information data crawled by the web crawler module includes: (1) web communication data, which comprises the main news text resources of the major news portals, such as NetEase Entertainment, Sina News, Tencent News and other major portals, together with industry media and professional media; (2) community website data, which comprises comments and analysis data from the major community websites, such as Baidu Tieba and the Douban community; (3) video website data, which comprises basic program information, index information and comment text from the major video websites, such as Youku, iQiyi and Mango TV; (4) public opinion data, which consists mainly of Weibo comment data; and (5) mobile Internet data, which refers mainly to high-impact information resources on mobile clients that cannot be obtained from a PC, such as article text from the WeChat Official Accounts Platform. In the network content resource evaluation method of the present invention, the web crawler module captures network information data in a targeted manner, obtaining web communication data, community website data, video website data, public opinion data and mobile Internet data respectively; the various kinds of network information data are deduplicated and filtered, parsed and converted, and classified; and text analysis or index calculation is performed according to the type of the data to obtain the trend and direction of public opinion. The data acquisition range is wide, the targeting is precise, and information is obtained and fed back in a timely manner.

Crawler technology is used to store the above resources in a MongoDB database, and the crawled data then undergoes preliminary screening. Because the crawl scope is wide and the crawl volume is large, a great deal of duplicate and redundant data is inevitable, so the data must be filtered and deduplicated to keep junk data from affecting the evaluation result. After this preliminary processing the data is classified. Text-type data takes part in text analysis; it mainly comprises the comment text, news text and analysis text of each data source. Index-type data takes part in index calculation; it mainly comprises individual websites' own evaluation indices for content, such as the Douban index and the Baidu index, together with video website comment counts, like counts, news read counts, WeChat article repost counts and so on.
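The preliminary screening and classification described above can be sketched as follows. This is a minimal illustration only: the record shape (a dict with a `kind` field), the kind names, and the use of a content hash for duplicate detection are assumptions made for the example and are not specified in the patent.

```python
import hashlib

# Hypothetical kind labels; the patent names the categories but not any field names.
TEXT_KINDS = {"comment", "news", "analysis"}
INDEX_KINDS = {"douban_index", "baidu_index", "comment_count",
               "like_count", "read_count", "repost_count"}


def dedupe(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        # Hash a canonical representation of the record (assumed flat dict).
        key = hashlib.sha1(repr(sorted(rec.items())).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def classify(record):
    """Route a record to text analysis or index calculation by its kind."""
    kind = record.get("kind")
    if kind in TEXT_KINDS:
        return "text"
    if kind in INDEX_KINDS:
        return "index"
    return "unknown"


records = [
    {"kind": "comment", "value": "great show"},
    {"kind": "comment", "value": "great show"},   # duplicate
    {"kind": "like_count", "value": "1200"},
]
cleaned = dedupe(records)
routes = [classify(r) for r in cleaned]
```

Here `routes` holds one route per surviving record, so downstream code can send text-type records to the text analysis step and index-type records to the index calculation step.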

S110: a general-purpose web crawler module built on the Scrapy framework crawls web data, and a WeChat data acquisition module captures mobile client data through a proxy server; the obtained network information data is stored in a MongoDB database.

S111: capturing mobile client data through the proxy server includes: configuring a proxy for the mobile client's network connection, so that data is sent to the client via the proxy server; the client uses simulated key presses to operate the mobile client automatically; and the proxy server captures packets during data transfer, then filters and parses the data to obtain the required data.
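The filtering stage of step S111 might look like the sketch below. It covers only the selection of captured traffic, not the proxy itself. The flow shape (dicts with `url` and `body` keys) and the host name `articles.example.com` are hypothetical placeholders; the patent does not describe the proxy's data structures or the target host.

```python
from urllib.parse import urlparse


def select_article_flows(flows, target_host):
    """Keep only captured responses that come from the target host and carry a body.

    flows: iterable of dicts with 'url' and 'body' keys (assumed shape of
    whatever the proxy server records while relaying traffic).
    """
    selected = []
    for flow in flows:
        host = urlparse(flow["url"]).hostname or ""
        if host == target_host and flow.get("body"):
            selected.append({"url": flow["url"], "body": flow["body"]})
    return selected


captured = [
    {"url": "https://articles.example.com/s?id=1", "body": "<html>article 1</html>"},
    {"url": "https://ads.example.net/pixel", "body": "gif"},
    {"url": "https://articles.example.com/s?id=2", "body": ""},
]
kept = select_article_flows(captured, "articles.example.com")
```

Only the first flow survives: the second comes from another host and the third has an empty body. The kept bodies would then be passed to the parsing step.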

S310: the index calculation includes communication index calculation, viewership index calculation and public opinion index calculation. The index calculation mainly includes: (1) web communication index calculation, obtained by counting the volume of news reports on the relevant content resource; (2) the WeChat index, obtained from the like counts and read counts of WeChat official accounts; (3) public opinion index calculation, obtained from the post counts and member counts of the relevant Tieba forums, the Douban index, and video website comment counts; and (4) the viewership index, obtained from the playback volume and like counts of each video website.

S320: the text analysis includes sentiment polarity judgment, word frequency statistics, associated word statistics, text clustering and text classification. Sentiment polarity judgment analyzes whether a text is positive or negative; this attribute mainly identifies the attitude of a user comment: if the user's attitude is positive and affirmative, the comment is considered positive, and otherwise it is considered negative. Word frequency statistics analyzes the word frequencies of a text, listing and counting the words that occur most often. Associated word statistics analyzes the associated words of a text, listing and counting the associated word pairs that occur most often. Text clustering groups the texts into 15 classes and provides a number of phrases describing each class. Text classification assigns each text to its corresponding category.

The present invention further provides a network content resource evaluation system, including: a web crawler module for crawling network information data in a targeted manner; a database for storing the network information data crawled by the web crawler module; a data processing module for deduplicating and filtering, parsing and converting, and classifying the network information data; and an index calculation module and a text analysis module for performing index calculation or text analysis according to the type of the network information data, and storing the result in the database so that the analysis result can be called directly from the database and displayed.

The network content resource evaluation system provided by the invention further includes the following technical features:

The web crawler module includes a web crawler submodule and a WeChat data capture module; the web crawler submodule is a general-purpose web crawler module built on the Scrapy framework, and the WeChat data capture module captures mobile client data through a proxy server. The WeChat data acquisition module includes a simulated key press module, a proxy service module and a packet interception module; the client uses the simulated key press module to operate the mobile client automatically, and the proxy service module captures packets through the packet interception module during data transfer, then filters and parses the data to obtain the required data. The index calculation module includes a communication index calculation module, a viewership index calculation module and a public opinion index calculation module. The text analysis module includes an associated word analysis module, a word frequency analysis module, a text sentiment analysis module, and text clustering and classification modules.

Specifically, the web crawler submodule of the web crawler module is a Scrapy-based web crawler whose algorithms involve breadth-first and depth-first graph search. To handle AJAX, it uses the WebKit-based ghost module to simulate a browser and execute the JavaScript. The submodule is mainly used to crawl web data: following certain rules, it selects suitable URLs starting from an entry URL and begins a directed crawl, keeping page parsing accurately targeted during the crawl so as to improve capture efficiency as far as possible. Finally, the crawled data is stored in the MongoDB database.

The WeChat data capture module captures mobile client data through a proxy server. The mobile client follows multiple WeChat official accounts and is operated automatically, sending its requests through the proxy server. The proxy server program stays in a listening state, captures packets in transit and parses them, processes the specified data according to rules, and stores it in the MongoDB database. When a proxy server is used for data acquisition, the amount of data obtained depends mainly on the number of client requests. One mobile client follows 100 accounts. Because the automation program runs relatively slowly, a single phone is relatively inefficient; however, the listening program on the server can process data from multiple clients at the same time, so running several mobile clients simultaneously greatly improves data acquisition efficiency. Each mobile client acquires about 100 items of data per day.

The main flow for obtaining packets through the proxy server is as follows: simulation app software is installed on the mobile device and runs automatically, requesting the articles of WeChat official accounts; the requests are sent to the site through the proxy server, and the returned data also passes through the proxy server; during data transfer, the packets exchanged with the requested site (the WeChat server) are identified by the requested URL, and these packets are parsed and stored in the database.
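The breadth-first, rule-directed crawl mentioned above can be illustrated with a small standalone sketch. It is not the Scrapy-based crawler of the embodiment: the `get_links` and `allow` callables, the `max_pages` cap, and the in-memory link graph are all assumptions introduced so the example runs without network access.

```python
from collections import deque


def bfs_crawl(entry_url, get_links, allow, max_pages=100):
    """Breadth-first traversal from an entry URL.

    get_links(url) -> iterable of URLs found on that page (assumed helper).
    allow(url)     -> True if the URL matches the crawl rules (assumed helper).
    Returns the URLs in the order they were visited.
    """
    visited = []
    seen = {entry_url}
    queue = deque([entry_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Skip URLs already queued and URLs the rules exclude.
            if link not in seen and allow(link):
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real pages.
graph = {"a": ["b", "c", "x"], "b": ["d"], "c": ["d"], "d": []}
order = bfs_crawl("a", lambda u: graph.get(u, []), lambda u: u != "x")
```

Replacing `queue.popleft()` with `queue.pop()` would turn the same loop into a depth-first traversal, the other strategy the description names.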

The data processing module further includes a data deduplication and filtering module, a data parsing and conversion module, and a data classification module. The deduplication and filtering module deduplicates data according to specific rules. Some text data contains HTML character fields that would interfere with the subsequent text analysis programs, so this module also filters characters: it removes duplicate data and filters out useless characters. Data is read from MongoDB on the server and is finally stored in different collections of the local database. Because the crawler obtains a large volume of data, the module uses time nodes as the basis for duplicate detection, so that data already checked is not checked again; this reduces repeated matching and improves deduplication efficiency.

The data parsing and conversion module and the data classification module are mainly used to parse and classify data. Data captured from different websites is partly of similar types and partly of different types, so it has to be classified. Data captured from video websites contains a large amount of numeric data, but after capture it is all stored as text, with text mixed into the values, so the numeric data must be extracted and converted, and the data must be classified according to its capture source. Classification mainly separates text-type data from numeric-type data, and the numeric-type data is filtered and converted; for example, some numeric values contain unit text such as 亿 (hundred million) or 万 (ten thousand) and need specific conversion. Packets intercepted from the mobile client are stored as whole packets, so the required data, such as the article text, like count, read count and publication date, must be parsed out of the HTML page; the data parsing and conversion module calls the BeautifulSoup package to parse the HTML page.
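The conversion of numeric values that carry unit text can be sketched as below. The function name `parse_count`, the regular expression, and the decision to return `None` when no number is found are assumptions for illustration; the patent only states that values containing units such as 亿 and 万 need conversion.

```python
import re
from decimal import Decimal

# Multipliers for the two unit characters named in the description.
_UNITS = {"万": Decimal(10_000), "亿": Decimal(100_000_000)}
_PATTERN = re.compile(r"([0-9][0-9,]*(?:\.[0-9]+)?)\s*([万亿]?)")


def parse_count(raw):
    """Extract the first number from a text field and apply its unit, if any.

    Returns an int, or None when the text contains no number.
    """
    match = _PATTERN.search(raw)
    if match is None:
        return None
    number = Decimal(match.group(1).replace(",", ""))
    multiplier = _UNITS.get(match.group(2), Decimal(1))
    return int(number * multiplier)


plays = parse_count("1.2亿")
views = parse_count("播放 3.5万次")
plain = parse_count("1,234")
```

`Decimal` is used instead of `float` so that values such as 1.2亿 convert without floating-point rounding.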

The index calculation module includes a communication index calculation module, a viewership index calculation module and a public opinion index calculation module. The communication index calculation module classifies the crawled data by data source, including broadcast media, portals, print media and professional media, counts the report volume of the websites from each source, and obtains historical data. Different weights are assigned to the statistics of websites from different sources, and the final media index is calculated. Because indices must be calculated at the same level, normalization is required in the later calculation; that is, weighting is used to bring the indices to a comparable level. For WeChat data, the result is obtained by counting WeChat shares, likes and posts; because these three kinds of data are not of the same order of magnitude, weighted normalization is required. The communication index mainly reflects how widely a content resource has spread, which depends mainly on the report volume and read volume for the resource, so this calculation mainly involves news website articles and WeChat Official Accounts Platform articles. The viewership index calculation module mainly compiles statistics on video website data, and its result is directly related to the playback volume of each website. The playback algorithm has to accommodate the playback volumes of each website in order to produce a suitable result, so a Bayesian estimation algorithm is adopted here, as follows:
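As an aside on the weighted normalization mentioned above, the following minimal sketch shows one way to combine metrics of different magnitudes. The patent does not state which normalization is used; min-max scaling, the metric names and the equal weights here are assumptions chosen for illustration.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1]; constant lists map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_index(metrics, weights):
    """Combine several metrics into one index per item.

    metrics: dict mapping metric name -> list of raw values (one per item).
    weights: dict mapping metric name -> weight (hypothetical values).
    """
    names = list(metrics)
    count = len(metrics[names[0]])
    normalized = {name: min_max(metrics[name]) for name in names}
    total_weight = sum(weights[name] for name in names)
    return [
        sum(weights[name] * normalized[name][i] for name in names) / total_weight
        for i in range(count)
    ]


# Three articles measured by shares, likes and posts on very different scales.
metrics = {"shares": [0, 50, 100], "likes": [0, 500, 1000], "posts": [1, 2, 3]}
weights = {"shares": 1.0, "likes": 1.0, "posts": 1.0}
scores = weighted_index(metrics, weights)
```

After scaling, each metric contributes on the same 0 to 1 range, so the weights alone decide how much each source counts.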

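$$\text{index} = \frac{v}{v + M}\, v_{\text{type}} + \frac{M}{v_{\text{site}} + M}\, \bar{v}$$

Here $v$ is the program's average playback volume per episode, $v_{\text{type}}$ is the program's average playback volume per episode on websites of the same type, $v_{\text{site}}$ is the program's average playback volume per episode on the same website, and $\bar{v}$ is the average playback volume per episode of all programs in the library.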

In this formula, M is a user-defined value that controls the overall order of magnitude of the index for the website. With this algorithm, when a program is popular the index reflects the program's own average playback volume per episode, and when a program is not popular the index gradually tends toward the average playback volume of all programs. The algorithm of the public opinion index calculation module is based mainly on weighted normalization. Because different websites have different influence and different index levels, weighting brings each data item to a unified level before the index is calculated. The public opinion index here mainly covers the video comment index, the Douban index, the Tieba index, the Baidu index, the Weibo index and so on.
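The behavior of the Bayesian estimate can be checked numerically with the sketch below. The function and parameter names are assumptions; the four playback inputs follow the four quantities named in the formula, and the sample numbers are invented purely to show the two limiting cases.

```python
def viewership_index(v, v_type, v_site, v_all, m):
    """Bayesian-style viewership index.

    v      : program's average plays per episode
    v_type : program's average plays per episode on same-type websites
    v_site : program's average plays per episode on the same website
    v_all  : average plays per episode over all programs in the library
    m      : user-defined constant controlling the index magnitude
    """
    return v / (v + m) * v_type + m / (v_site + m) * v_all


M = 1000.0
LIBRARY_AVERAGE = 10_000.0

# Popular program: plays far exceed M, so the program's own figure dominates.
hot = viewership_index(1e9, 1e9, 1e9, LIBRARY_AVERAGE, M)

# Unpopular program: plays far below M, so the library average dominates.
cold = viewership_index(1.0, 1.0, 1.0, LIBRARY_AVERAGE, M)
```

With these inputs `hot` stays close to the program's own playback figure, while `cold` is pulled toward the library average, matching the behavior described in the text.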

The text analysis module includes a sentiment polarity judgment module, a word frequency statistics module, an associated word statistics module, a text clustering module and a text classification module.

The input of the sentiment polarity judgment module is the text to be analyzed, and its output is a numerical judgment between 0 and 1. In this program, a text is judged positive if the result is greater than 0.6, negative if it is less than 0.4, and neutral otherwise. The module mainly uses the Python text analysis package snownlp. The first step is to fetch the data to be analyzed from the database and feed it into the analysis system; the analysis result is then processed to determine whether the text is positive or negative, and the outcome is stored in the final database system.

The input of the word frequency statistics module is the text to be analyzed, and its output is the keywords and their frequencies. Ten keywords are output: the module selects and lists the ten keywords that occur most often. It mainly uses the Python text analysis package jieba. The first step is to fetch the data to be analyzed from the database and feed it into the analysis system. To ensure accurate results, the program must be supplied with stop words, which are frequently occurring words with no substantive meaning (including some words that should not appear among the keywords). After a certain number of keywords has been selected, word frequencies are counted, and the ten keywords with the highest frequency are stored in the database.

The output of the associated word statistics module is the associated words and their frequencies. Eight associated words are output: the module selects and lists the eight that occur most often. After a certain number of keywords has been selected, the keywords are associated; the module joins the keywords in pairs and stores in the database the eight associated word pairs that occur most often.

The output of the text clustering module is 15 clusters, each with a related description. The data to be analyzed is fed into the analysis system, text features are extracted from the input text and collected together, the features are clustered, and the final clustering result is analyzed to obtain descriptions of the final 15 classes. The module mainly uses the K-means algorithm and an algorithm for finding the centers of point clusters.

The text classification module fetches from the database the text used for model training, extracts features from all three classes of text, trains the model with a specific training method, and finally uses the trained model to classify texts; the final result is stored in the database.
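The thresholding and counting steps above can be sketched with the standard library alone. The sketch assumes that a sentiment score in the range 0 to 1 and a list of tokens have already been produced elsewhere (in the embodiment, by snownlp and jieba respectively); the function names, and the choice to count a keyword pair once per text in which both keywords appear, are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def polarity(score, positive_above=0.6, negative_below=0.4):
    """Map a 0-1 sentiment score to a label using the thresholds in the text."""
    if score > positive_above:
        return "positive"
    if score < negative_below:
        return "negative"
    return "neutral"


def top_keywords(tokens, stopwords, k=10):
    """Return the k most frequent tokens that are not stop words."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(k)


def top_associated(keyword_lists, k=8):
    """Return the k keyword pairs that co-occur in the most texts."""
    pairs = Counter()
    for keywords in keyword_lists:
        # Sort so that (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(set(keywords)), 2):
            pairs[pair] += 1
    return pairs.most_common(k)


labels = [polarity(s) for s in (0.7, 0.5, 0.3)]
keywords = top_keywords(["plot", "the", "plot", "cast"], {"the"}, k=1)
associated = top_associated([["plot", "cast"], ["cast", "plot", "music"]], k=1)
```

The defaults of 10 keywords and 8 associated pairs mirror the counts given in the description; both can be overridden through `k`.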

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A network content resource evaluation method, characterized by comprising the following steps:

deduplicating, filtering and parsing the crawled network information data, and classifying the network information data;

2. The network content resource evaluation method according to claim 1, characterized in that a general-purpose web crawler module built on the Scrapy framework crawls web data, and a WeChat data acquisition module captures mobile client data through a proxy server; the obtained network information data is stored in a MongoDB database.

3. The network content resource evaluation method according to claim 2, characterized in that capturing mobile client data through the proxy server includes: configuring a proxy for the mobile client's network connection, so that data is sent to the client via the proxy server; the client uses simulated key presses to operate the mobile client automatically; and the proxy server captures packets during data transfer, then filters and parses the data to obtain the required data.

4. The network content resource evaluation method according to claim 1, characterized in that the index calculation includes communication index calculation, viewership index calculation and public opinion index calculation.

5. The network content resource evaluation method according to claim 1, characterized in that the text analysis includes sentiment polarity judgment, word frequency statistics, associated word statistics, text clustering and text classification.

6. A network content resource evaluation system, characterized by including:

7. The network content resource evaluation system according to claim 6, characterized in that the web crawler module includes a web crawler submodule and a WeChat data capture module, the web crawler submodule is a general-purpose web crawler module built on the Scrapy framework, and the WeChat data capture module captures mobile client data through a proxy server.

8. The network content resource evaluation system according to claim 7, characterized in that the WeChat data acquisition module includes a simulated key press module, a proxy service module and a packet interception module; the client uses the simulated key press module to operate the mobile client automatically, and the proxy service module captures packets through the packet interception module during data transfer, then filters and parses the data to obtain the required data.

9. The network content resource evaluation system according to claim 6, characterized in that the index calculation module includes a communication index calculation module, a viewership index calculation module and a public opinion index calculation module.

10. The network content resource evaluation system according to claim 6, characterized in that the text analysis module includes a sentiment polarity judgment module, a word frequency statistics module, an associated word statistics module, a text clustering module and a text classification module.