CN113378023A - Visual system for mining and comparing public opinion and news information of people - Google Patents


Publication number
CN113378023A
Authority
CN
China
Prior art keywords
event
website
news
data
civil
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110565938.8A
Other languages
Chinese (zh)
Other versions
CN113378023B (en)
Inventor
王德志
邓帅杰
罗琛
王德宇
王凯琳
陈超
李泽荃
李永飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Institute of Science and Technology
Original Assignee
North China Institute of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Institute of Science and Technology
Priority to CN202110565938.8A
Publication of CN113378023A
Application granted
Publication of CN113378023B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a visual system for mining and comparing civil public opinion and news information. The system comprises: a data acquisition module configured with a website crawler pool constructed based on a fuzzy comprehensive evaluation method, the website crawler pool being configured to crawl news event data from the home page of a news website and civil event data from the home page of a civil website; a data analysis module configured to determine a hot event from the news event data and the civil event data based on a word frequency statistical method or a preset text similarity model, to call the website crawler pool to crawl the civil website according to the hot keywords of the hot event so as to obtain public opinion information on the hot event, and to obtain an emotional tendency value of the Internet regarding the hot event from that public opinion information based on a preset sentiment analysis model; and a visualization module configured to draw an emotion change graph of the Internet regarding the hot event according to the emotional tendency value.

Description

Visual system for mining and comparing public opinion and news information of people
Technical Field
The application relates to the technical field of internet, in particular to a visual system for mining and comparing public sentiment and news information of people.
Background
With the rapid growth in the volume of information on the Internet, in the big data era of information explosion it is difficult for ordinary people to take in the vast content of a platform and screen out the information they need. Meanwhile, the development of Internet public opinion often influences policy promotion, corporate image, or personal life. However, the opinions expressed by governments, enterprises, and the public are individualized and diversified, so the traditional single mode of information acquisition faces unprecedented challenges, and it is difficult to accurately acquire public opinion information from the Internet.
Therefore, there is a need to provide an improved solution to the above-mentioned deficiencies of the prior art.
Disclosure of Invention
An object of the application is to provide a visual system for mining and comparing civil public opinion and news information, so as to solve or alleviate the problems existing in the prior art.
In order to achieve the above purpose, the present application provides the following technical solutions:
The application provides a visual system for mining and comparing civil public opinion and news information, comprising: a data acquisition module, a data analysis module, and a visualization module. The data acquisition module is configured with a website crawler pool constructed based on a fuzzy comprehensive evaluation method, and the website crawler pool is configured to crawl news event data from the home page of a news website and civil event data from the home page of a civil website. The data analysis module is configured to determine a hot event from the news event data and the civil event data based on a word frequency statistical method or a preset text similarity model; to call the website crawler pool to crawl the civil website according to the hot keywords of the hot event so as to obtain public opinion information on the hot event; and to obtain an emotional tendency value of the Internet regarding the hot event from the public opinion information on the hot event based on a preset sentiment analysis model. The visualization module is configured to draw an emotion change graph of the Internet regarding the hot event according to the emotional tendency value.
Preferably, a plurality of news website crawlers and a plurality of civil website crawlers, each constructed based on the fuzzy comprehensive evaluation method, are configured in the website crawler pool; each news website crawler can crawl the reading amount and the click amount of each news event on the home page of the news website, and crawl the home page of the news website according to the reading amount and the click amount of each news event to obtain the news event data; each civil website crawler can crawl the reading amount and the click amount of each civil event on the home page of the civil website, and crawl the home page of the civil website according to the reading amount and the click amount of each civil event to obtain the civil event data.
Preferably, the data analysis module is further configured to count the high-frequency keywords in the news event data and the civil event data respectively based on a word frequency statistical method, and determine that the news event corresponding to the same high-frequency keywords in the news event data and the civil event data is a hot event.
Preferably, the data analysis module is further configured to calculate a similarity between each news event in the news event data and the civil event data based on a preset text similarity model, and determine that the news event with the highest similarity to the civil event data is a hot event.
Preferably, the data analysis module is further configured to calculate the similarity between each news event in the news event data and each civil event in the civil event data based on a preset text similarity model, so as to obtain the similarity between each news event in the news event data and the civil event data.
Preferably, the visualization module is further configured to draw a similarity variation graph of the news website and the civil website according to the similarity of each news event in the news event data and the civil event data so as to determine whether public opinion information of the internet is concentrated.
Preferably, the data analysis module is further configured to obtain the emotional intensity of the positive emotion and the negative emotion implied in the public sentiment information of the internet about the hot event according to the public sentiment information of the hot event based on a preset sentiment analysis model, so as to obtain the sentiment tendency value.
Preferably, the data analysis module is further configured to obtain, based on a preset text classification model, an event type ratio of the Internet for the news event data and the civil event data, wherein the event type ratio represents the proportion of each type of public opinion event on the news website and the civil website; correspondingly, the visualization module is further configured to draw an event type graph of the news website and the civil website according to the event type ratio.
Preferably, the visual system for mining and comparing civil public opinion and news information further comprises: a model training module configured to construct, based on a deep learning method, the sentiment analysis model and the text classification model from public opinion information sample data and text classification sample data acquired in advance, respectively.
Preferably, the model training module is further configured to, based on a deep learning method, perform word segmentation on the public opinion information sample data using the jieba library, and vectorize the segmented public opinion information sample data by the TF-IDF method, so as to construct the sentiment analysis model.
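The word frequency approach to determining a hot event described above can be illustrated with a minimal sketch. It assumes the texts have already been segmented into tokens (for example by jieba), and the function names, the cutoff `k`, and the sample data are hypothetical and not taken from the application.

```python
from collections import Counter


def top_keywords(token_lists, k):
    """Return the set of the k most frequent tokens across tokenized texts."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word for word, _ in counts.most_common(k)}


def shared_hot_keywords(news_tokens, civil_tokens, k):
    """High-frequency keywords that appear in both sources."""
    return top_keywords(news_tokens, k) & top_keywords(civil_tokens, k)


# Hypothetical, already-segmented event texts.
news = [["flood", "rescue", "city"], ["flood", "warning"], ["budget", "city"]]
civil = [["flood", "water", "supply"], ["water", "flood"], ["bus", "route"]]

shared = shared_hot_keywords(news, civil, k=2)

# News events that contain a shared high-frequency keyword are treated as hot.
hot_event_indexes = [i for i, toks in enumerate(news) if shared & set(toks)]
```

With this sample data, only "flood" is a high-frequency keyword in both sources, so the first two news events would be treated as hot events.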
Advantageous effects:
According to the technical solution, the data acquisition module crawls news event data and civil event data through the website crawler pool constructed based on the fuzzy comprehensive evaluation method, so that the crawled data are screened and the efficiency and accuracy of data crawling are effectively improved. The data analysis module determines a hot event based on a word frequency statistical method or a text similarity model and calls the website crawler pool to crawl public opinion information on the hot event, obtaining the public opinion information about the hot event on the Internet, so that governments, enterprises, or individuals can obtain comprehensive public opinion comments on the hot event from the Internet, which provides guidance for improving policy promotion, corporate image, or personal life. The emotional tendency value of the Internet regarding the hot event is obtained based on the sentiment analysis model, and the visualization module draws the emotion change graph of the hot event, which helps to judge the public opinion tendency of the hot event on the Internet in advance and effectively guides policy making or public opinion guidance.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. Wherein:
FIG. 1 is a schematic structural diagram of a system for mining and comparing public opinion and news information according to some embodiments of the present application;
FIG. 2 is a word cloud of some hot keywords provided according to some embodiments of the present application;
FIG. 3 is a graph of similarity changes between news websites and civil websites according to some embodiments of the present application;
FIG. 4 is a graph of Internet emotional changes with respect to a hot event provided according to some embodiments of the present application;
FIG. 5 is a graph of emotional changes of a news website and a civil website with respect to the same event, provided according to some embodiments of the present application.
Description of reference numerals:
101-a data acquisition module; 102-a data analysis module; 103-a visualization module; 104-model training module.
Detailed Description
The present application will be described in detail below with reference to the embodiments with reference to the attached drawings. The various examples are provided by way of explanation of the application and are not limiting of the application. In fact, it will be apparent to those skilled in the art that modifications and variations can be made in the present application without departing from the scope or spirit of the application. For instance, features illustrated or described as part of one embodiment, can be used with another embodiment to yield a still further embodiment. It is therefore intended that the present application cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
In the description of the present application, the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description of the present application but do not require that the present application must be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present application. The terms "connected," "connected," and "disposed" as used herein are intended to be broadly construed, and may include, for example, fixed and removable connections; can be directly connected or indirectly connected through intermediate components; the connection may be a wired electrical connection, a wireless electrical connection, or a wireless communication signal connection, and a person skilled in the art can understand the specific meaning of the above terms according to specific situations.
FIG. 1 is a schematic structural diagram of a system for mining and comparing public opinion and news information according to some embodiments of the present application. As shown in FIG. 1, the visual system for mining and comparing civil public opinion and news information comprises: a data acquisition module 101, a data analysis module 102, and a visualization module 103. The data acquisition module 101 is configured with a website crawler pool constructed based on a fuzzy comprehensive evaluation method, and the website crawler pool is configured to crawl news event data from the home page of a news website and civil event data from the home page of a civil website. The data analysis module 102 is configured to determine a hot event from the news event data and the civil event data based on a word frequency statistical method or a preset text similarity model; to call the website crawler pool to crawl the civil website according to the hot keywords of the hot event so as to obtain public opinion information on the hot event; and to obtain an emotional tendency value of the Internet regarding the hot event from the public opinion information on the hot event based on a preset sentiment analysis model. The visualization module 103 is configured to draw an emotion change graph of the Internet regarding the hot event according to the emotional tendency value.
In the embodiment of the present application, the news website and the civil website may each be a specialized website or a specialized section of a comprehensive website. For example, the news website may be Guancha (Observer Network), a government website, a commission website, a Chinese policy website, and the like; the civil website may be a national water website, a power grid website, a news website, a weather website, an earthquake website, Taobao, watch, Bilibili (B station), Weibo, and the like.
In the embodiment of the application, the website crawler pool is a collection of website crawlers. A plurality of crawlers are provided for each website, and the crawlers have different functions so as to crawl different types of data (such as military affairs, science and technology, automobiles, emotions, and the like) respectively. Because the operation and content of each website change over time, the similarity between the website crawler pools is recalculated periodically so that the similarity values remain accurate and the crawled data remain accurate and credible.
Specifically, a plurality of news website crawlers and a plurality of civil website crawlers, each constructed based on the fuzzy comprehensive evaluation method, are configured in the website crawler pool; each news website crawler can crawl the reading amount and the click amount of each news event on the home page of the news website, and crawl the home page of the news website according to the reading amount and the click amount of each news event to obtain the news event data; each civil website crawler can crawl the reading amount and the click amount of each civil event on the home page of the civil website, and crawl the home page of the civil website according to the reading amount and the click amount of each civil event to obtain the civil event data.
In the embodiment of the application, the news website crawler crawls the reading amount and the click amount of each news event on the home page of the news website, and then crawls the news events one by one in descending order of reading amount and click amount to obtain the news event data; the civil website crawler crawls the reading amount and the click amount of each civil event on the home page of the civil website, and then crawls the civil events one by one according to the reading amount and the click amount of each civil event to obtain the civil event data.
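As an illustration of crawling events in descending order of reading amount and click amount, the following minimal sketch orders home page events before they are fetched. The application does not state how the two amounts are combined; ranking by their sum is an assumption made here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HomepageEvent:
    title: str
    reads: int   # reading amount shown on the home page
    clicks: int  # click amount shown on the home page


def crawl_order(events):
    """Order events from most to least popular.

    Assumption: popularity is the sum of reads and clicks.
    """
    return sorted(events, key=lambda e: e.reads + e.clicks, reverse=True)


events = [
    HomepageEvent("A", reads=1200, clicks=300),
    HomepageEvent("B", reads=5000, clicks=900),
    HomepageEvent("C", reads=800, clicks=1000),
]
order = [e.title for e in crawl_order(events)]
```

Here the totals are 1500, 5900, and 1800, so the events would be crawled in the order B, C, A.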
In some optional embodiments, crawling with the website crawler pool constructed based on the fuzzy comprehensive evaluation method proceeds as follows.
First, similarity calculation is performed among the plurality of constructed website crawler pools within a preset period based on a word vector cosine algorithm, to obtain the similarity among the plurality of website crawler pools. Specifically, the website crawlers corresponding to the plurality of constructed website crawler pools crawl within the preset period to obtain the website text data corresponding to each website crawler pool; then, based on the word vector cosine algorithm, similarity is calculated pairwise between the website text data corresponding to the website crawler pools, to obtain the similarity between the website crawler pools.
In the embodiment of the application, when similarity calculation is performed among a plurality of constructed website crawler pools within a preset period, website homepage data are obtained by regularly crawling website homepages of websites corresponding to the website crawler pools, and then all the obtained website homepage data are spliced to obtain corresponding website text data.
In the embodiment of the application, the similarity among the website crawler pools is calculated through a word vector cosine algorithm, and the text correlation analysis among the websites to be crawled corresponding to the website crawler pools is realized. Specifically, the similarity calculation between the crawler pools of the websites is realized through the relevance of sentence components in the website text data between the websites to be crawled.
In the embodiment of the application, a model for calculating the similarity between the web crawler pools based on a word vector cosine algorithm is shown as formula (1), wherein the formula (1) is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (1)$$
wherein $\cos\theta$ represents the similarity between the website crawler pools; $A$ and $B$ represent the word vectors in the website text data of the two websites to be crawled, respectively; $A_i$ represents the $i$-th component of the word vector $A$, $i$ being a positive integer; $B_i$ represents the $i$-th component of the word vector $B$; and $n$ represents the dimension of the word vectors, $n$ being a positive integer. For example, if the word vector $A$ is (3, 5, 7, 8), then $A$ is a 4-dimensional vector, i.e., $n = 4$, $A_1 = 3$, $A_2 = 5$, $A_3 = 7$, $A_4 = 8$, with $1 \le i \le 4$ and $i$ a positive integer.
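A minimal sketch of formula (1) in plain Python follows. The vector A is the example from the text; the vector B is a hypothetical second vector chosen only for illustration, and returning 0.0 for a zero vector is a choice made here, not something the application specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length word vectors."""
    if len(a) != len(b):
        raise ValueError("word vectors must have the same dimension n")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_a * norm_b)


A = (3, 5, 7, 8)  # example vector from the text, n = 4
B = (8, 7, 5, 3)  # hypothetical second vector

similarity = cosine_similarity(A, B)
```

For these two vectors the numerator is 24 + 35 + 35 + 24 = 118 and each squared norm is 147, so the similarity is 118/147, about 0.80.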
In implementation, the website text data of the two websites to be crawled are each segmented into words with jieba; the segmented text is then vectorized with the TfidfVectorizer class in sklearn to obtain TF-IDF (term frequency-inverse document frequency) vectors; finally, the relevance of the two websites to be crawled is calculated with the cosine_similarity function.
Second, a reference website is selected according to the access request, and the websites to be crawled are screened according to the similarity between the website crawler pool corresponding to the reference website and the other website crawler pools. Specifically, after the reference website is selected according to the access request, the websites to be crawled are screened in descending order of the similarity between their website crawler pools and the website crawler pool corresponding to the reference website.
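The TF-IDF and cosine similarity steps can be sketched with scikit-learn as follows. This is only an illustrative sketch: it assumes scikit-learn is installed, and it uses short pre-segmented placeholder strings in place of real website text, so the jieba segmentation step is not run here (for Chinese text, something like `" ".join(jieba.lcut(text))` could produce such space-separated strings).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical website text data, already segmented and joined with spaces.
site_texts = [
    "flood rescue city rain",        # site 0
    "rain flood warning city",       # site 1
    "stock market earnings report",  # site 2
]

# Turn each website's text into a TF-IDF vector (one row per website).
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(site_texts)

# Pairwise cosine similarity between all websites (3 x 3 matrix).
similarity = cosine_similarity(tfidf)
```

In this sample, sites 0 and 1 share most of their terms and so score higher against each other than either does against site 2, which shares no terms with them.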
In the embodiment of the application, the similarity between two website crawler pools within a preset period is the similarity between the two corresponding websites within that period; the similarities between the website crawler pools are stored in a database as the similarity table of the websites to be crawled for the preset period. A corresponding reference website is selected according to key information (such as search keywords) in the access request of the target user. For example, if the target user searches for videos, the reference website may be iQIYI, Youku, Bilibili, and the like. The websites to be crawled are then screened according to the similarity table of the websites to be crawled, which improves crawling efficiency and reduces resource consumption.
In the embodiment of the application, the similarity of the websites to be crawled is ranked from high to low according to the similarity table of the websites to be crawled, the websites to be crawled with high similarity are first crawled by the website crawler pool corresponding to the reference website, and then the websites to be crawled with low similarity are crawled, so that the crawling efficiency is effectively improved, and the resource consumption is reduced.
In an application scene, when a website to be crawled is screened, in response to the fact that the similarity between a website crawler pool corresponding to the website to be crawled and a website crawler pool corresponding to a reference website is lower than a preset similarity threshold value, the website to be crawled is abandoned.
In the embodiment of the application, if the similarity between the website to be crawled and the reference website is lower than a preset similarity threshold, it indicates that topics between the website to be crawled and the reference website are inconsistent in a preset period, search information of a target user does not exist in the website to be crawled basically, and corresponding data cannot be obtained when the website to be crawled is crawled, so that crawling of the website to be crawled can be directly abandoned.
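The screening step (ranking candidate websites by similarity to the reference website and abandoning those below the threshold) can be sketched as follows. The site names, similarity values, and threshold are hypothetical, and the similarity table is represented simply as a dictionary.

```python
def screen_sites(similarity_to_reference, threshold):
    """Return the sites to crawl, most similar first.

    Sites whose similarity to the reference website is lower than
    the threshold are abandoned.
    """
    kept = [
        (site, score)
        for site, score in similarity_to_reference.items()
        if score >= threshold
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [site for site, _ in kept]


# Hypothetical similarity table row for one reference website.
table = {"site_a": 0.82, "site_b": 0.15, "site_c": 0.47}
to_crawl = screen_sites(table, threshold=0.3)
```

With a threshold of 0.3, `site_b` is abandoned and the remaining sites are crawled in the order `site_a`, `site_c`.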
Third, the crawling recommendation value of the website crawler pool corresponding to each screened website to be crawled is calculated based on the fuzzy comprehensive evaluation method, so that the website crawler pool corresponding to the reference website crawls the websites to be crawled according to those crawling recommendation values.
In the embodiment of the application, when the crawling recommendation value of the screened website to be crawled is calculated based on a fuzzy comprehensive evaluation method, the crawling recommendation value of a website crawler pool corresponding to the screened website to be crawled is calculated based on the fuzzy comprehensive evaluation method according to the crawling weight of the crawling influence factor of the screened website to be crawled, wherein the crawling influence factor represents an influence parameter when the website to be crawled is crawled; the crawling weight represents the level of influence of the crawling influence factor on the crawling recommendation value.
Specifically, the crawling influence factors include: website popularity, historical request failure rate, user score, website anti-crawling strength, website tolerance capability, and website crawling risk. The website popularity represents the amount of valuable information covered by the website to be crawled; the historical request failure rate represents the probability that crawling the website to be crawled fails; the user score represents the degree of satisfaction with the crawling result of the website to be crawled; the website anti-crawling strength represents how difficult the website to be crawled is to crawl; the website tolerance capability represents the access volume that the website to be crawled can bear; and the website crawling risk represents whether crawling the website to be crawled is permitted.
In the embodiment of the application, when the screened crawling recommendation value of the website to be crawled is calculated based on a fuzzy comprehensive evaluation method, the website popularity, the historical request failure rate, the user score, the website anti-crawling strength, the website tolerance capability and the website crawling risk are respectively scored.
In the embodiment of the application, in the process of calculating the screened crawling recommendation value of the website to be crawled, the website popularity, the user score and the website tolerance capability are positive crawling influence factors. The higher the website popularity score is, the higher the general attention of the society to the website in a preset period is, the more active the information flow of the website is, the more contents with research value are, and therefore the website popularity score has crawling value. The user score is determined according to the satisfaction degree of the crawling result of the user for performing historical crawling on the website, and the higher the satisfaction degree is, the higher the user score is, and the more valuable information can be obtained by crawling on the website. The higher the site tolerance, the better the architecture of the website is, the larger the access amount that can be borne, and the less likely it is to cause trouble to other users when crawling the website.
In the process of calculating the screened crawling recommendation value of the website to be crawled, the higher the historical request failure rate is, and the higher the website crawling strength and the website crawling risk are negative crawling influence factors. The higher the historical request failure rate is, the worse the operation and maintenance conditions of the website in a preset period is, when the website is crawled, the higher the possibility of crawling failure is, the more the crawling failure times are, the more the resource waste is caused, and when the crawling recommendation value of the screened website to be crawled is calculated, the score of the historical request failure rate of the website is reduced along with the increase of the historical request failure rate. The higher the protection strength of the site is, the more south the site is crawled, namely the score of the anti-crawling strength of the site is reduced along with the enhancement of the protection strength of the site. The higher the website crawling risk is, the less suitable the website is for crawling, and the higher the risk born by crawling the website is.
In the embodiment of the application, the crawling weight reflects how strongly each crawling influence factor affects the crawling recommendation value when the website and its corresponding website crawler pool are evaluated. For example, when a scholarship is awarded, a student may be scored on two factors, "achievement" and "enthusiasm for participating in extracurricular activities". If "achievement" is considered more important, its weight may be set to 0.8 and the weight of "enthusiasm for participating in extracurricular activities" to 0.2; the final score of the student is then 0.8 times the "achievement" score plus 0.2 times the "enthusiasm for participating in extracurricular activities" score.
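The scholarship example can be written out as a weighted sum. The sketch below is only an illustration: the function name `weighted_score` and the two student scores (90 and 70) are assumptions and do not come from the patent; only the weights 0.8 and 0.2 do.

```python
def weighted_score(scores, weights):
    """Return the weighted sum of factor scores (weights are expected to sum to 1)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same factors")
    return sum(weights[name] * scores[name] for name in scores)


# Weights taken from the example in the text.
weights = {"achievement": 0.8, "extracurricular_enthusiasm": 0.2}
# Hypothetical student scores, chosen only for illustration.
scores = {"achievement": 90, "extracurricular_enthusiasm": 70}

total = weighted_score(scores, weights)
print(total)
```

The same pattern carries over to the six crawling influence factors, with the crawling weights taking the place of 0.8 and 0.2.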
In the embodiment of the application, the website popularity, the historical request failure rate, the user score, the site anti-crawling strength, the site tolerance capability and the site crawling risk are denoted by u1, u2, u3, u4, u5 and u6 respectively, and the corresponding crawling weights are denoted by a1, a2, a3, a4, a5 and a6 respectively.
Then the factor set for crawling influence factors is:
U={u1,u2,u3,u4,u5,u6}
where the score of the website popularity is u1 = x1, x1 ∈ (0, 100]; the score of the historical request failure rate is u2 = 100 - x2, x2 ∈ (0, 100]; the score of the user score is u3 = x3, x3 ∈ (0, 100]; the score of the site anti-crawling strength is u4 = 100 - x4, x4 ∈ (0, 100]; the score of the site tolerance capability is u5 = x5, x5 ∈ (0, 100]; and the score of the site crawling risk is u6 = 100 - x6, x6 ∈ (0, 100]. Here x1 reflects the ranking of the website popularity: the higher the ranking, the larger the value of x1. x2 reflects the actual historical request failure rate: the higher the failure rate, the larger the value of x2. x3 reflects the actual value of the user scoring feedback: the higher the feedback, the larger the value of x3. x4 reflects the site anti-crawling strength: the higher the strength, the larger the value of x4. x5 reflects the site tolerance capability: the higher the capability, the larger the value of x5. x6 reflects the site crawling risk: the higher the risk, the larger the value of x6.
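The mapping from the raw values x1 to x6 to the factor scores u1 to u6 can be sketched as follows. The dictionary keys and the function name `factor_scores` are hypothetical labels introduced here for readability; the rule itself (positive factors keep x, negative factors use 100 - x) is the one stated in the text.

```python
# Hypothetical labels for the six factors; the patent only names them u1..u6.
POSITIVE = {"popularity", "user_score", "tolerance"}        # u1, u3, u5
NEGATIVE = {"failure_rate", "anti_crawl_strength", "risk"}  # u2, u4, u6


def factor_scores(x):
    """Map raw values x_i in (0, 100] to factor scores u_i."""
    u = {}
    for name, value in x.items():
        if not 0 < value <= 100:
            raise ValueError(f"{name} must lie in (0, 100], got {value}")
        if name in POSITIVE:
            u[name] = value          # u = x for positive factors
        elif name in NEGATIVE:
            u[name] = 100 - value    # u = 100 - x for negative factors
        else:
            raise KeyError(f"unknown factor: {name}")
    return u


# Made-up raw values for one website.
u = factor_scores({
    "popularity": 75, "failure_rate": 20, "user_score": 88,
    "anti_crawl_strength": 60, "tolerance": 90, "risk": 35,
})
print(u)
```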
Then, the weight set of the crawling weights of the crawling influence factors is determined based on the Analytic Hierarchy Process (AHP) as follows:
A={a1,a2,a3,a4,a5,a6}
The judgment matrix constructed for the factor set U is as follows:
Figure BDA0003080985460000091
where the judgment matrix reflects the relative importance of each pair of factors in the factor set.
The set of weights is then:
A={0.1638,0.1464,0.3557,0.0752,0.1744,0.0845}
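How a weight set is derived from a pairwise judgment matrix can be sketched with the row geometric-mean approximation that is commonly used with AHP. This is an assumption: the patent does not state which AHP solution method it uses, and its 6x6 judgment matrix is an equation image that is not reproduced in this text, so the 3x3 matrix below is purely hypothetical.

```python
import math


def ahp_weights(judgment):
    """Approximate AHP weights with the row geometric-mean method."""
    n = len(judgment)
    if any(len(row) != n for row in judgment):
        raise ValueError("judgment matrix must be square")
    # Geometric mean of each row, then normalize so the weights sum to 1.
    means = [math.prod(row) ** (1.0 / n) for row in judgment]
    total = sum(means)
    return [g / total for g in means]


# Hypothetical 3x3 reciprocal judgment matrix (NOT the matrix of the patent).
example = [
    [1.0, 2.0, 4.0],
    [1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 2, 1.0],
]
weights = ahp_weights(example)
print([round(w, 4) for w in weights])

# The weight set given in the text sums to 1, as a normalized weight set should.
patent_weights = [0.1638, 0.1464, 0.3557, 0.0752, 0.1744, 0.0845]
print(round(sum(patent_weights), 4))
```

For a perfectly consistent matrix such as the one above, the geometric-mean method and the principal-eigenvector method give the same weights; for inconsistent matrices they can differ slightly.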
Then, an alternative set is established:
V={very recommended, recommended, general, not recommended, very not recommended}
The website to be crawled is then evaluated on each crawling influence factor to obtain the crawling recommendation value of the website to be crawled. Specifically, single-factor evaluation is carried out on each crawling influence factor to obtain its single-factor evaluation result, and the crawling recommendation value of the website to be crawled is then calculated from these single-factor evaluation results based on a fuzzy comprehensive evaluation method.
Here the site anti-crawling strength u4 is taken as an example. Suppose m users (m is a positive integer) score the site anti-crawling strength u4 of the website to be crawled, giving m values of u4, of which s1 values fall in the interval (80, 100], s2 values in the interval (60, 80], s3 values in the interval (40, 60], s4 values in the interval (20, 40], and s5 values in the interval (0, 20], where the sum of s1, s2, s3, s4 and s5 is equal to m, and s1, s2, s3, s4 and s5 are all positive integers.
The single-factor evaluation result of the site anti-crawling strength u4 of the website to be crawled is:
Y4q={y41,y42,y43,y44,y45}
where
Figure BDA0003080985460000101
Single-factor evaluation is carried out on each of the six crawling influence factors of the website to be crawled to obtain the single-factor evaluation matrix Y of the website to be crawled. The single-factor evaluation matrix Y is as follows:
Y=[Y1j,Y2j,Y3j,Y4j,Y5j,Y6j]T
where j denotes the membership interval of the score of the crawling influence factor,
namely:
Figure BDA0003080985460000102
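A single-factor evaluation of this kind can be sketched by counting how many of the m scores fall into each of the five intervals. The sketch assumes that each entry of the evaluation is the fraction s_q / m, which is a common choice in fuzzy comprehensive evaluation; the patent's own formula is an equation image that is not reproduced in this text, so this is an assumption rather than the patent's definition. The score list is made up.

```python
# Intervals (low, high], ordered to match s1..s5 in the text.
INTERVALS = [(80, 100), (60, 80), (40, 60), (20, 40), (0, 20)]


def single_factor_evaluation(scores):
    """Return the fraction of scores that falls into each interval (assumed y_q = s_q / m)."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(not 0 < s <= 100 for s in scores):
        raise ValueError("scores must lie in (0, 100]")
    m = len(scores)
    counts = [sum(1 for s in scores if low < s <= high) for low, high in INTERVALS]
    return [c / m for c in counts]


# Hypothetical scores given by m = 10 users for one factor.
u4_scores = [90, 85, 70, 65, 50, 30, 10, 95, 75, 55]
row = single_factor_evaluation(u4_scores)
print(row)

# One row per crawling influence factor gives the evaluation matrix Y
# (the same made-up scores are reused here only to keep the sketch short).
Y = [single_factor_evaluation(u4_scores) for _ in range(6)]
```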
An intermediate variable matrix B = A·Y is constructed and solved based on the exponential model of the fuzzy comprehensive evaluation method, as shown in formula (2). Formula (2) is as follows:
Figure BDA0003080985460000103
where aw denotes the weight of the w-th crawling influence factor,
Figure BDA0003080985460000104
denotes the single-factor evaluation result of the w-th crawling influence factor under its corresponding weight aw, and bj denotes the intermediate variable value in the j-th membership interval.
The intermediate variable matrix B is then normalized based on a normalization model, as shown in formula (3). Formula (3) is as follows:
Figure BDA0003080985460000105
A membership degree set Q corresponding to the alternative set V is constructed, letting
Figure BDA0003080985460000106
where
Figure BDA0003080985460000107
corresponding to the "very recommended" in the alternative set V,
Figure BDA0003080985460000108
corresponding to "recommended" in the alternative set V,
Figure BDA0003080985460000109
corresponding to "general" in the alternative set V;
Figure BDA00030809854600001010
corresponding to "not recommended" in alternative set V;
Figure BDA00030809854600001011
corresponding to "very not recommended" in the alternative set V.
From the membership degree set Q, the membership degree of the website to be crawled to the element "very recommended" in the alternative set V is taken:
Figure BDA0003080985460000111
and the membership degree of the website to be crawled to the element "recommended" in the alternative set V:
Figure BDA0003080985460000112
The crawling recommendation value T of the website to be crawled is as follows:
T=Q1+Q2
The larger the recommendation value T, the more strongly crawling of the website to be crawled is recommended.
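The chain from the weight set A and the evaluation matrix Y to the recommendation value T can be sketched as below. Formulas (2) and (3) are equation images that are not reproduced in this text, so the sketch substitutes two assumptions: the weighted-sum operator b_j = sum over w of a_w * y_wj in place of the patent's "exponential model", and division by the sum of B as the normalization. Only the weight set A and the final step T = Q1 + Q2 come from the text; the matrix Y is made up.

```python
# Weight set A as given in the text.
A = [0.1638, 0.1464, 0.3557, 0.0752, 0.1744, 0.0845]


def recommendation_value(weights, Y):
    """Return (Q, T): normalized membership degrees and T = Q1 + Q2."""
    if len(weights) != len(Y):
        raise ValueError("one row of Y is required per weight")
    columns = len(Y[0])
    # ASSUMPTION: weighted-sum operator, standing in for the patent's formula (2).
    B = [sum(a * row[j] for a, row in zip(weights, Y)) for j in range(columns)]
    # ASSUMPTION: normalize by the sum of B, standing in for the patent's formula (3).
    total = sum(B)
    Q = [b / total for b in B]
    # From the text: T is the membership to "very recommended" plus "recommended".
    return Q, Q[0] + Q[1]


# Hypothetical evaluation matrix: every factor has the same evaluation row.
Y = [[0.5, 0.3, 0.1, 0.1, 0.0] for _ in A]
Q, T = recommendation_value(A, Y)
print([round(q, 4) for q in Q], round(T, 4))
```

Because T only adds the two most favourable membership degrees, it lies between 0 and 1, and websites can be compared directly by T.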
In the embodiment of the application, the crawling recommendation value indicates whether the corresponding website to be crawled is suitable for crawling: the higher the crawling recommendation value, the more suitable the website is for crawling, and the better the result obtained after crawling satisfies the access request of the target object. Specifically, the crawling recommendation value of the website crawler pool corresponding to each screened website to be crawled is calculated based on a fuzzy comprehensive evaluation method, so that the website crawler pool corresponding to the reference website crawls the websites to be crawled in order of these crawling recommendation values. This further improves the crawling efficiency of the website crawler pool, further reduces resource consumption, and makes the crawling result more accurate and more reliable.
In the embodiment of the application, after the access request of the target user is obtained, when the crawler runs, data in a website crawler pool of the crawler is automatically read for crawling, a plurality of crawlers crawl under the control of a queue, and data obtained by crawling is cleaned under the control of the queue.
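Queue-controlled crawling and cleaning can be sketched with the standard `queue` and `threading` modules. The patent does not describe its queue mechanism in detail, so everything here is an assumption made for illustration: the function name `run_crawlers`, the `fetch` and `clean` callables passed in, and the fake pages used in place of real HTTP requests.

```python
import queue
import threading


def run_crawlers(urls, fetch, clean, workers=3):
    """Crawl urls with several worker threads fed from one task queue, then clean the results."""
    tasks, raw = queue.Queue(), queue.Queue()
    for url in urls:
        tasks.put(url)

    def worker():
        # Each crawler keeps taking tasks until the task queue is empty.
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            raw.put(fetch(url))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Cleaning also drains a queue, here after all crawlers have finished.
    results = []
    while not raw.empty():
        results.append(clean(raw.get()))
    return results


# Fake fetch function standing in for a real request.
pages = run_crawlers(["a", "b", "c"], lambda u: f"  page {u}  ", str.strip)
print(sorted(pages))
```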
In the embodiment of the application, during crawling the crawlers do not follow a breadth-first or depth-first order; instead, they crawl the websites to be crawled in order of recommendation value, crawling websites with higher recommendation values first. In addition, the filtering provided by each website itself can be applied first and the scheme of the application applied afterwards; that is, an initial search is performed by calling the search box of each website, and crawling is then carried out within the initial search results, which improves crawling accuracy and crawling efficiency.
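Crawling in order of recommendation value amounts to a priority queue keyed on T. A minimal sketch, assuming the recommendation values have already been computed; the site names and values are invented and `crawl_order` is a hypothetical helper name:

```python
import heapq


def crawl_order(recommendations):
    """Yield site names from the highest to the lowest crawling recommendation value."""
    # heapq is a min-heap, so the values are negated to pop the largest first.
    heap = [(-value, name) for name, value in recommendations.items()]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        yield name


# Hypothetical recommendation values T for three websites.
order = list(crawl_order({"site_a": 0.42, "site_b": 0.91, "site_c": 0.70}))
print(order)
```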
In some optional embodiments, the data analysis module 102 is further configured to count the high-frequency keywords in the news event data and the civil event data respectively based on a word frequency statistical method, and determine that the news event corresponding to the same high-frequency keyword in the news event data and the civil event data is a hot event.
In the embodiment of the application, the high-frequency keywords in the news event data and in the civil event data are counted separately based on a word frequency statistical method. If the high-frequency keywords of the two are consistent, news events and civil events on the Internet share a common focus. The data analysis module 102 then calls the website crawler pool and crawls the civil websites according to the hot keywords of the hot event to obtain the public opinion information of the hot event. This effectively improves the operating efficiency of the crawler, as well as the efficiency and accuracy of obtaining the public opinion information of hot events.
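The word-frequency step can be sketched with `collections.Counter`: count tokens on each side, take the top k, and keep the keywords that appear in both lists. The token lists below are invented English stand-ins for segmented text, and the cut-off k is an assumption, since the patent does not say how many keywords count as "high-frequency".

```python
from collections import Counter


def top_keywords(token_lists, k):
    """Return the k most frequent tokens across all documents."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(k)]


def shared_hot_keywords(news_tokens, civil_tokens, k=2):
    """Return high-frequency keywords that occur on both the news side and the civil side."""
    civil_top = set(top_keywords(civil_tokens, k))
    return [word for word in top_keywords(news_tokens, k) if word in civil_top]


# Made-up, already segmented documents.
news = [["flood", "rescue", "city"], ["flood", "relief"], ["election", "city"]]
civil = [["flood", "price"], ["flood", "housing"], ["flood", "price"]]

shared = shared_hot_keywords(news, civil, k=2)
print(shared)
```

An empty result corresponds to the case described next in the text, where the two sides have no keyword in common and a text similarity model is used instead.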
If the high-frequency keywords in the news event data are inconsistent with the high-frequency keywords in the civil event data, it is indicated that the attention points of the news event and the civil event on the internet are different, and at the moment, the hot event needs to be determined based on a preset text similarity model. Specifically, the data analysis module 102 is further configured to calculate a similarity between each news event in the news event data and the civil event data based on a preset text similarity model, and determine that the news event with the highest similarity to the civil event data is a hot event.
In a specific example, the data analysis module 102 is further configured to calculate, based on a preset text similarity model, a similarity between each news event in the news event data and each civil event in the civil event data, respectively, so as to obtain a similarity between each news event in the news event data and the civil event data.
In the embodiment of the application, after the news event data of each news website and the civil event data of each civil website are obtained, similarity calculation is performed one by one between each news event in the news event data and each civil event in the civil event data to obtain the similarity between each news event and all civil events in the civil event data, and a similarity table of the news event data and the civil event data is constructed from these similarities. From the similarity table, the news event with the highest similarity to the civil event data can be determined, which indicates that this news event is a hot event of civil discussion on the Internet.
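A similarity table of this shape can be sketched with a bag-of-words cosine similarity. Both choices are assumptions: the patent only speaks of a "preset text similarity model" without naming it, and it does not say how the pairwise values are aggregated into a similarity between one news event and the civil event data as a whole, so the mean over all civil events is used here. The events are invented.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def similarity_table(news_events, civil_events):
    """Pairwise similarity of every news event with every civil event."""
    return {n: {c: cosine(nt, ct) for c, ct in civil_events.items()}
            for n, nt in news_events.items()}


def hottest_news_event(table):
    """News event with the highest mean similarity to the civil events (assumed aggregation)."""
    return max(table, key=lambda n: sum(table[n].values()) / len(table[n]))


# Made-up, already segmented events.
news_events = {"n1": ["flood", "rescue"], "n2": ["stock", "market"]}
civil_events = {"c1": ["flood", "rescue", "help"], "c2": ["flood", "damage"]}

table = similarity_table(news_events, civil_events)
hot = hottest_news_event(table)
print(hot)
```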
In an application scenario, the visualization module 103 is further configured to draw a similarity variation graph (as shown in fig. 3) between the news website and the civil website according to the similarity between each news event in the news event data and the civil event data to determine whether the public opinion information of the internet is concentrated. Therefore, the similarity change of the news event and the civil event is visually displayed through the visualization module 103, and whether the civil discussion on the internet is concentrated or not is more vividly and accurately understood. As can be seen from fig. 3, the similarity between news websites and civil websites varies based on time series.
In some optional embodiments, the data analysis module 102 is further configured to obtain, based on the preset emotion analysis model, the emotional intensity of the positive emotion and the negative emotion included in the public opinion information of the hotspot event of the internet according to the public opinion information of the hotspot event, so as to obtain an emotional tendency value. Therefore, the emotional changes of the public to the hot events and the tendency, trend and strength of the civil discussion on the internet can be seen more intuitively and accurately, the judgment of the public opinion trend of the hot events on the internet in advance is facilitated, and the policy making or public opinion guidance is effectively guided.
In some optional embodiments, the data analysis module 102 is further configured to obtain, based on a preset text classification model, an event type ratio of the Internet with respect to the news event data and the civil event data according to the news event data and the civil event data, wherein the event type ratio represents the proportion of each type of public opinion event in the news websites and the civil websites; correspondingly, the visualization module 103 is further configured to draw an event type graph of the news websites and the civil websites according to the event type ratio. By classifying and comparing the news event data and the civil event data in this way, it can be effectively determined whether the points of attention of the news websites and the civil websites are consistent, which improves the understanding of the focus of civil discussion on the Internet and helps to guide policy making or public opinion guidance.
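Once each event has been assigned a type by the text classification model, the event type ratio is a normalized count. A minimal sketch, in which the label names and the two label lists are invented and stand in for classifier output:

```python
from collections import Counter


def event_type_ratio(labels):
    """Return the share of each event type among the predicted labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical classifier output for the two kinds of website.
news_ratio = event_type_ratio(["society", "society", "finance", "sports"])
civil_ratio = event_type_ratio(["society", "housing", "housing", "housing"])
print(news_ratio, civil_ratio)
```

Plotting the two dictionaries side by side gives the kind of comparison the event type graph is meant to provide.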
In the embodiment of the application, the visualization system for mining and comparing civil public opinion and news information further includes a model training module 104, configured to construct, based on a deep learning method, the emotion analysis model and the text classification model from public opinion information sample data and text classification sample data acquired in advance.
The public opinion information sample data can be public opinion evaluation data from civil websites, such as microblog comment data, divided into a negative evaluation category and a positive evaluation category according to the content of the evaluation. Training on such data helps the emotion analysis model predict positive and negative emotions accurately, which in turn improves the accuracy of the predicted emotional intensity.
Specifically, the model training module 104 is further configured to, based on a deep learning method, perform word segmentation on the public opinion information sample data using the jieba library and perform text vectorization on the segmented public opinion information sample data using the TF-IDF method, so as to construct the emotion analysis model. The public opinion information sample data is first segmented with the jieba library; stop words are then removed according to a stop-word list to improve the accuracy and training efficiency of the emotion analysis model; the segmented data is then vectorized with TF-IDF; finally, the model is trained on the vectorized text by calling the multinomial naive Bayes classifier (naive Bayes with a multinomial prior distribution) in sklearn.
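The training steps can be illustrated with a small, self-contained sketch. It is not the patent's implementation: the patent uses jieba for segmentation and sklearn for the classifier, whereas the sketch below reimplements stop-word removal, a smoothed TF-IDF and a multinomial naive Bayes with the standard library only so that it runs without third-party packages. It takes already segmented tokens (in practice these could come from `jieba.lcut`), and the stop-word list, the training sentences and all function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "is", "a"}  # hypothetical stop-word list


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one document; terms unseen in training are dropped."""
    tf = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def train_nb(docs, labels, alpha=1.0):
    """Multinomial naive Bayes over TF-IDF weights with additive smoothing."""
    docs = [remove_stop_words(d) for d in docs]
    idf = fit_idf(docs)
    weight = defaultdict(Counter)  # class -> term -> summed TF-IDF weight
    for doc, label in zip(docs, labels):
        weight[label].update(tfidf(doc, idf))
    class_docs = Counter(labels)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(weight[c].values()) + alpha * len(idf)
        log_like[c] = {t: math.log((weight[c][t] + alpha) / total) for t in idf}
    return idf, log_prior, log_like


def predict(model, doc):
    idf, log_prior, log_like = model
    features = tfidf(remove_stop_words(doc), idf)
    scores = {c: log_prior[c] + sum(w * log_like[c][t] for t, w in features.items())
              for c in log_prior}
    return max(scores, key=scores.get)


# Made-up, already segmented training comments.
train_docs = [["service", "is", "great"], ["great", "policy"],
              ["the", "traffic", "is", "terrible"], ["terrible", "delay"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_nb(train_docs, train_labels)
print(predict(model, ["the", "service", "is", "great"]))
```

With sklearn available, the same steps are usually expressed with `TfidfVectorizer` followed by `MultinomialNB`, which is closer to what the text describes.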
In the embodiment of the application, the text classification sample data can be taken from open-source natural language processing data sets available on the Internet so as to cover different types of text content, which speeds up the training of the text classification model and improves its prediction accuracy. The training process of the text classification model is similar to that of the emotion analysis model and is not repeated here.
In the embodiment of the application, the data acquisition module 101 crawls news event data and civil event data through the website crawler pool constructed based on the fuzzy comprehensive evaluation method, so that the crawled data is screened and the efficiency and accuracy of data crawling are effectively improved. The data analysis module 102 determines a hot event based on a word frequency statistical method or a text similarity model and calls the website crawler pool to crawl the public opinion information of the hot event, obtaining the public opinion information about the hot event on the Internet, so that governments, enterprises or individuals can obtain comprehensive public opinion comments about the hot event from the Internet, which provides guidance for improving policy promotion, enterprise image or personal life. The emotional tendency value of the Internet about the hot event is obtained based on the emotion analysis model, and the visualization module 103 draws the emotion change graph of the hot event (as shown in fig. 4 and fig. 5), so that the public opinion trend of the hot event on the Internet can be judged in advance and policy making or public opinion guidance can be effectively guided. As can be seen from fig. 4 and fig. 5, the Internet emotional tendency value for the hot event changes over the time series.
In the embodiment of the application, the news data can also be examined region by region through the emotion analysis model. When regional news appears (that is, the title or content contains the name of a region or city, such as Beijing or Wuhan), the comment data of each region on the news is obtained by the crawlers of the civil websites, and emotion values are predicted with the emotion analysis model corresponding to each region, so that the regional news and the emotion value of each region towards it can be obtained; an example is how various regions react to news that Shanghai is inviting new-energy talent to settle there. In addition to the per-region data, the data of all regions can be combined into one large data set, the region labels removed, and a model retrained on it to obtain a universal emotion analysis model. A universal text classification model and a universal similarity model can be obtained in the same way.
In the embodiment of the application, the crawler acquires the geographic tag by detecting whether a positioning tag exists in the crawled content (that is, whether the user has shared his or her position), so as to acquire regional data (news events, civil events, public opinion information and the like of each region). In addition, the parameters of the regional crawlers (the crawlers for the various regions) can be adjusted so that they are not restricted to data carrying a positioning tag: when no positioning data exists, the geographic information filled in by the user is used as the geographic value, which greatly expands the amount of data crawled at the cost of reduced data accuracy.
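The choice between strict and relaxed region detection can be sketched as a small function with a switch. The record layout is an assumption: the field names `location_tag` and `profile_region` and the parameter `require_location_tag` are hypothetical and are not named in the patent.

```python
def region_of(post, require_location_tag=False):
    """Return the region of a crawled post, or None if it cannot be determined.

    With require_location_tag=True only posts carrying a positioning tag are used;
    otherwise the region the user filled in is used as a less accurate fallback.
    """
    tag = post.get("location_tag")
    if tag:
        return tag
    if require_location_tag:
        return None
    return post.get("profile_region")


# Made-up crawled records.
tagged = {"text": "...", "location_tag": "Beijing", "profile_region": "Wuhan"}
untagged = {"text": "...", "profile_region": "Wuhan"}

print(region_of(tagged), region_of(untagged), region_of(untagged, require_location_tag=True))
```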
In the embodiment of the application, after data are crawled, data cleaning needs to be carried out on the crawled data to delete repeated information and correct existing errors, so that the data are kept consistent, analysis and visual processing are conveniently carried out on the crawled data, and the data processing efficiency is improved.
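A minimal cleaning pass might normalize whitespace and drop empty and repeated records while keeping the original order. This is only a sketch of the deduplication part; the patent does not specify its cleaning rules, and error correction is not covered here. The function name `clean_records` is hypothetical.

```python
def clean_records(records):
    """Normalize whitespace, then drop empty and duplicate records, preserving order."""
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Made-up crawled snippets with a duplicate and an empty entry.
cleaned = clean_records(["flood  relief", "flood relief", "", "housing price\n"])
print(cleaned)
```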
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A visualization system for mining and comparing civil public opinion and news information, characterized by comprising: a data acquisition module, a data analysis module and a visualization module;
the data acquisition module is configured with a website crawler pool constructed based on a fuzzy comprehensive evaluation method, and the website crawler pool is configured to crawl news event data of a news website home page and crawl civil event data of a civil website home page;
the data analysis module is configured to determine a hot event according to the news event data and the civil event data based on a word frequency statistical method or a preset text similarity model; calling the website crawler pool, and crawling the public sentiment information of the hot event in the civil website according to the hot keyword of the hot event to obtain the public sentiment information of the hot event; acquiring an emotional tendency value of the Internet about the hot event according to public sentiment information of the hot event based on a preset emotional analysis model;
and the visualization module is configured to draw an emotion change graph of the Internet about the hotspot event according to the emotional tendency value.
2. The system for mining, comparing and visualizing civil public opinion and news information according to claim 1, wherein a plurality of news web crawlers constructed based on a fuzzy comprehensive evaluation method and a plurality of civil web crawlers constructed based on a fuzzy comprehensive evaluation method are configured in the web crawler pool;
the news website crawler can crawl the reading amount and the click amount of each news event of the news website home page, and crawl the news website home page according to the reading amount and the click amount of each news event to obtain the news event data;
the civil website crawler can crawl the reading amount and the click amount of each civil event of the civil website home page, and crawl the civil website home page according to the reading amount and the click amount of each civil event to obtain the civil event data.
3. The system of claim 1, wherein the data analysis module is further configured to count the high-frequency keywords in the news event data and the civil event data respectively based on a word frequency statistical method, and determine that the news event corresponding to the same high-frequency keyword in the news event data and the civil event data is a hot event.
4. The system of claim 1, wherein the data analysis module is further configured to calculate a similarity between each news event in the news event data and the civil event data based on a preset text similarity model, and determine the news event with the highest similarity to the civil event data as the hotspot event.
5. The system of claim 4, wherein the data analysis module is further configured to calculate a similarity between each news event in the news event data and each civil event in the civil event data based on a preset text similarity model, so as to obtain the similarity between each news event in the news event data and the civil event data.
6. The system of claim 4, wherein the visualization module is further configured to draw a similarity variation graph between the news website and the civil website according to a similarity between each news event in the news event data and the civil event data to determine whether internet public opinion information is concentrated.
7. The system of claim 1, wherein the data analysis module is further configured to obtain emotional intensity of positive and negative emotions implied in the public sentiment information of the hot event from the internet based on a preset sentiment analysis model, so as to obtain the sentiment tendency value.
8. The system of any one of claims 1 to 7, wherein the data analysis module is further configured to obtain an event type ratio of the internet with respect to the news event data and the civil event data according to the news event data and the civil event data based on a preset text classification model, wherein the event type ratio represents a type ratio of public opinion events in the news website and the civil website;
correspondingly,
the visualization module is further configured to draw an event type graph of the news website and the civil website according to the event type ratio.
9. The system of claim 8, wherein the system further comprises: a model training module configured to correspondingly construct the emotion analysis model and the text classification model according to public opinion information sample data and text classification sample data which are acquired in advance based on a deep learning method.
10. The system of claim 9, wherein the model training module is further configured to construct the emotion analysis model by performing word segmentation on the public opinion information sample data using a jieba library based on a deep learning method and performing text vectorization on the word-segmented public opinion information sample data using a TF-IDF method.
CN202110565938.8A 2021-05-24 2021-05-24 Civil public opinion and news information mining comparison visualization system Active CN113378023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110565938.8A CN113378023B (en) 2021-05-24 2021-05-24 Civil public opinion and news information mining comparison visualization system

Publications (2)

Publication Number Publication Date
CN113378023A true CN113378023A (en) 2021-09-10
CN113378023B CN113378023B (en) 2023-05-23

Family

ID=77571782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110565938.8A Active CN113378023B (en) 2021-05-24 2021-05-24 Civil public opinion and news information mining comparison visualization system

Country Status (1)

Country Link
CN (1) CN113378023B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701729A (en) * 2023-08-01 2023-09-05 贵州融云信息技术有限公司 Network public opinion detection system and detection method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130290232A1 (en) * 2012-04-30 2013-10-31 Mikalai Tsytsarau Identifying news events that cause a shift in sentiment
CN109145216A (en) * 2018-08-29 2019-01-04 中国平安保险(集团)股份有限公司 Network public-opinion monitoring method, device and storage medium
CN110188933A (en) * 2019-05-21 2019-08-30 湖北经济学院 A kind of School Network public sentiment monitoring and pre-warning method and system
CN110232109A (en) * 2019-05-17 2019-09-13 深圳市兴海物联科技有限公司 A kind of Internet public opinion analysis method and system
CN110413863A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 A kind of public sentiment news duplicate removal and method for pushing based on deep learning
CN110516067A (en) * 2019-08-23 2019-11-29 北京工商大学 Public sentiment monitoring method, system and storage medium based on topic detection
CN111324801A (en) * 2020-02-17 2020-06-23 昆明理工大学 Hot event discovery method in judicial field based on hot words
CN111461553A (en) * 2020-04-02 2020-07-28 上饶市中科院云计算中心大数据研究院 System and method for monitoring and analyzing public sentiment in scenic spot

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭荣荣: ""线上学习"舆情分析与在线教学提升策略", 《中国传媒大学学报(自然科学版)》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701729A (en) * 2023-08-01 2023-09-05 贵州融云信息技术有限公司 Network public opinion detection system and detection method
CN116701729B (en) * 2023-08-01 2023-10-31 贵州融云信息技术有限公司 Network public opinion detection system and detection method

Also Published As

Publication number Publication date
CN113378023B (en) 2023-05-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant