CN111353085A

CN111353085A - Cloud mining network public opinion analysis method based on feature model

Info

Publication number: CN111353085A
Application number: CN201811584557.9A
Authority: CN
Inventors: 李志强
Original assignee: Guangzhou Youtai Security Technology Co ltd
Current assignee: Guangzhou Youtai Security Technology Co ltd
Priority date: 2018-12-24
Filing date: 2018-12-24
Publication date: 2020-06-30

Abstract

The invention discloses a cloud mining network public opinion analysis method based on a feature model. The system comprises five components: a cloud computing resource pool, system monitoring and load measurement, a cloud computing resource scheduling service, a multi-platform public opinion publishing service, and a user interaction interface. A corresponding user interface is established for each public opinion publishing mode. The interface provides user registration and login, public opinion monitoring configuration and management, and public opinion pushing, and is used for authorized user access, viewing of the latest public opinion information, and personalized configuration of public opinion monitoring. The system has high operating efficiency and low cost, and is suitable for the technical field of electronic information.

Description

Cloud mining network public opinion analysis method based on feature model

Technical Field

The invention belongs to the technical field of electronic information and relates to computer and network products and to application systems for industry and enterprise informatization, in particular to a cloud mining network public opinion analysis system based on a feature model.

Background

With the increasingly important role of the network in the social life of China, governments and related enterprises and public institutions also pay more and more attention to the monitoring and early warning of network public sentiment, and public sentiment analysis and monitoring become research fields with great strategic and practical significance. Because the information amount on the network is huge, the collection and processing of massive information on the network are difficult to deal with only by a manual method, and therefore, an automatic network public opinion analysis system needs to be established by means of information technology and related subject professional knowledge.

Because the Internet is globally interconnected, the amount of data that can be obtained from it is difficult to estimate, and extracting useful information from it cannot be accomplished by manual processing alone. Network public opinion monitoring must therefore be tightly combined with data mining technology to make public opinion monitoring automatic and intelligent. In applying data mining to public opinion monitoring, how to find key public opinion information in the Internet, the largest data set in the world, and especially how to model that information according to the characteristics of different public opinion monitoring projects in order to provide accurate service, has become a hotspot of data mining research. Web data mining is a specialized technique for data mining in the Internet environment: it uses data mining techniques to find potential, useful patterns or information in Internet data. Web mining research covers many fields, including database technology, information acquisition technology, statistics, machine learning in artificial intelligence, and neural networks. The integration and combined application of these technologies can move Web data mining toward maturity.

With the development of distributed processing, parallel processing, and grid computing, the integration and commercial application of these technologies have become an industry focus, and the concept of cloud computing has emerged. Cloud computing can be seen as a fusion of grid computing and virtualization technologies: using the distributed processing capability of grid computing, IT resources are organized into a resource pool, and mature server virtualization and storage virtualization technologies are added so that a user can monitor and allocate the resources in real time. In a remote data center, thousands of computers and servers are connected into a computer cloud, and a user accesses the data center through a desktop computer, a notebook, a mobile phone, or a similar device and operates on it as needed. Cloud computing differs from the traditional single-computer-centered mode of computing in that it distributes computation and data over a large number of distributed computers. Users can, for example, search the Internet from a mobile phone or a computer.

Currently, many IT companies are developing cloud computing products. Starting in 2003, Google published papers at top conferences and in top journals of computer systems research for several consecutive years, disclosing its internal distributed data processing methods and showing the core cloud computing technology it uses. According to these papers, Google uses a cloud computing infrastructure model that includes four systems that are independent of each other yet work closely together: the Google File System built on clusters, the Map/Reduce programming model proposed for the characteristics of Google applications, the distributed lock mechanism Chubby, and BigTable, a large-scale distributed database with a simplified data model developed by Google. Yahoo participates in the development of the Hadoop cloud computing platform, tests and deploys Hadoop, and uses Hadoop software internally; it has built the largest Hadoop cluster in the world, comprising 10,000 Linux nodes. Many Yahoo applications are now built on this cloud computing platform. The largest Hadoop platform is used to compute the page link graph for web search and to process massive data. The hardware company Dell provides the DCS (Dell Cloud Computing Solution) offering, which helps users build a cloud computing platform; the solution can reduce data center operation and maintenance costs, improve computing speed, and simplify data center management, and it scales well.

At present, a mature cloud mining technology combining a Web data mining technology and a cloud computing architecture does not exist, and the existing related public opinion monitoring system has the following problems:

(1) No public opinion monitoring demand modeling or intelligent matching technology is provided, so the accuracy of Internet information mining is low.

(2) The system is not high in usability and personalization degree, and the use cost of a user is high.

(3) The system operates less efficiently due to limitations in the system architecture.

(4) The intelligent correlation processing of public opinion monitoring information, public opinion trend analysis, public opinion automatic early warning and public opinion hotspot finding and tracking capability are weak.

Disclosure of Invention

In order to solve the technical problems, the invention provides a cloud mining network public opinion monitoring system based on a feature model. The technical purpose realized by the system is mainly embodied in the following aspects:

(1) Public opinion monitoring requirements are modeled: a feature model describing the public opinion monitoring requirements is proposed and introduced into the system, and the accuracy of Internet information mining is ensured through a matching and filtering algorithm between the feature model and public opinion information and a self-learning update algorithm for the feature model.

(2) Services are provided to users in the software as a service (SaaS) mode. Delivering software services to users over the Internet is the latest trend in software development. Users can subscribe to the public opinion monitoring services provided by the system as needed, which reduces their IT cost.

(3) A distributed cloud mining architecture is adopted: a large number of online data mining servers and database servers are distributed across different geographic locations and serve as the computing and storage resources of the system. Using the cloud computing resource scheduling service, the system dynamically allocates server resources in the cloud computing architecture according to the different requirements of users, which improves the operating efficiency of the data mining applications and meets users' actual needs.

(4) Using automatic clustering based on a similarity algorithm, the system can automatically classify the massive uncategorized public opinion information collected every day, group documents with similar content into one category, and automatically generate subject terms for that category.

(5) Public opinion trend analysis based on an intelligent training sequence mode is realized, public opinion change trend distribution is described through continuous time monitoring data of public opinion attention hotspots, and a detection feature model is automatically trained and updated through the change of the public opinion hotspots, so that the feature model can be consistent with the public opinion monitoring hotspots, and valuable information can be better screened out from mass information.

(6) The system can automatically discover network public opinion hotspots and analyze and track important hot news. For network public opinion triggered by emergencies, it can identify outbreak points and events in time, and it can automatically track and tally network public opinion information according to the number of news articles and the propagation chain of articles across major websites and communities.

(7) Automatic early warning of network public opinion is provided on demand for the monitored information categories. Early warning levels can be divided into, for example, high, medium, low, and safe, according to user requirements. The user can view the various kinds of early warning information, such as the number and percentage of early warning articles for each information category in the overall early warning distribution chart.

The technical scheme is as follows:

A cloud mining network public opinion monitoring system based on a feature model is mainly composed of the following five functional parts:

(1) Cloud computing resource pool

This part comprises computing and storage resources distributed across different geographic locations and consists of a large number of online data mining servers and database servers. Under the cloud computing framework, virtualization technology and a scheduling strategy are used to provide the computing and storage resources that users need dynamically and transparently, according to their different requirements. Resources that the current users and applications are not using are dynamically reclaimed and supplied to other users. Inexpensive computing and storage resources are thus delivered to users much as a power plant supplies electricity, making large-scale parallel computing and massive data operations possible for ordinary users.

(2) System monitoring and load measurement

This part provides monitoring and measurement of computing and storage resources in the cloud computing framework. The main monitoring and measurement indicators are: the resource load state of the data mining servers, the resource load state of the database servers, the amount of computing and storage resources requested by data mining applications, and the amount of computing and storage resources requested by users.

(3) Cloud computing resource scheduling service

This part dynamically allocates server resources in the cloud computing framework so that the data mining applications run efficiently and users' actual needs are met. When the resource request volume is small, the data mining applications run and respond to users on a small number of servers. When the request volume increases, the computing capacity of the current data mining servers is usually the first bottleneck of the system. The cloud computing platform detects through the system monitoring and load measurement part that the current computing load is too high, automatically requests new computing servers from the cloud computing resource pool, and adds them to the current operating environment, so that the computing capacity of the environment grows linearly in cluster mode to meet the resource requests of the data mining applications. When the resource requests increase further, storage capacity as well as computing capacity becomes a bottleneck; in particular, when the concurrency and coordination overhead of adding more data mining servers becomes too high, the database server resources are dynamically expanded to satisfy the massive resource requests. When the resource requests of the data mining applications decrease, the reverse happens: data mining and database server resources are gradually reclaimed into the resource pool.
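The scale-up and scale-down behaviour described above can be sketched as a single scheduling decision. This is a minimal illustration, not the patented scheduler: the `Cluster` type, the `rebalance` function, the load thresholds `high` and `low`, and the `max_mining` cap (standing in for the point at which coordination overhead makes further compute servers unattractive) are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    mining_servers: int = 1   # data mining (compute) servers currently in use
    db_servers: int = 1       # database (storage) servers currently in use


def rebalance(cluster, cpu_load, storage_load, pool_free,
              high=0.8, low=0.3, max_mining=8):
    """Make one scheduling decision; return (new_cluster, new_pool_free).

    Loads are fractions in [0, 1]; all thresholds are hypothetical values.
    """
    mining, db = cluster.mining_servers, cluster.db_servers
    if pool_free > 0 and cpu_load > high and mining < max_mining:
        # Compute is the first bottleneck: take a mining server from the pool.
        mining += 1
        pool_free -= 1
    elif pool_free > 0 and (cpu_load > high or storage_load > high):
        # Compute scaling is exhausted or storage is saturated: add a database server.
        db += 1
        pool_free -= 1
    elif cpu_load < low and storage_load < low:
        # Demand dropped: return one server to the pool, keeping at least one of each kind.
        if db > 1:
            db -= 1
            pool_free += 1
        elif mining > 1:
            mining -= 1
            pool_free += 1
    return Cluster(mining, db), pool_free
```

Calling `rebalance` periodically with measurements from the monitoring part would grow the cluster one server at a time under load and shrink it again when demand falls.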

(4) Multi-platform public opinion publishing service

This part pushes the network public opinion monitoring information obtained by data mining to the user through several different publishing modes. The main push modes are: WEB page browsing, WAP page browsing, RSS subscription, Email push, MMS/SMS subscription, mobile client software, etc.

The network public opinion monitoring information is published through the multi-platform public opinion publishing service, so that the public opinion push realizes seamless connection and seamless coverage, users can acquire public opinion information in various modes at any time and any place, and the requirements of the users on public opinion monitoring can be met to the maximum extent.

(5) User interaction interface

This part provides users with interfaces for the different public opinion publishing modes. A corresponding user interface is established for each publishing mode. The interface provides user registration and login, public opinion monitoring configuration and management, and public opinion pushing, and is used for authorized user access, viewing of the latest public opinion information, and personalized configuration of public opinion monitoring.

The schematic block diagram of the data mining server and the database server is shown in FIG. 2. The technology adopted mainly covers the following four aspects:

(1) Internet information collection module

This technology collects and stores Internet information. It resembles the web crawler used in a search engine but differs from it in an important way. A web crawler starts from one or more initial web page addresses, continuously extracts all link addresses from the current page, and fetches further pages until a stopping condition is met; its defining characteristic is that it fetches as many pages as possible. The technology described here performs limited page fetching under preset fetch instructions and only fetches pages relevant to the user's public opinion monitoring requirements, so data collection aims at precision rather than breadth. Each configured fetch instruction is therefore equivalent to a vertical search in a specific field.
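The contrast with a general crawler can be sketched as follows. This is an illustrative sketch only: the `focused_crawl` function, the keyword test for relevance, and the `max_depth` limit are assumptions, and `fetch` is a caller-supplied stand-in for real page retrieval (it returns the page text and its outgoing links).

```python
from collections import deque


def focused_crawl(fetch, seeds, keywords, max_depth=2):
    """Breadth-first crawl that stores and expands only on-topic pages.

    fetch(url) -> (text, links) is supplied by the caller.
    """
    seen = set(seeds)
    kept = []
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)
        if not any(k in text for k in keywords):
            continue  # off-topic page: neither stored nor expanded
        kept.append(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return kept


# Toy in-memory "web" used in place of network access.
web = {
    "a": ("flood warning in city", ["b", "c"]),
    "b": ("sports news", ["d"]),
    "c": ("flood relief update", ["d"]),
    "d": ("flood aftermath", []),
}
pages = focused_crawl(lambda u: web[u], ["a"], ["flood"], max_depth=2)
```

A general crawler would fetch and expand page `b` as well; here it is dropped as soon as it fails the relevance test, so its links are never followed on its account.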

(2) Intelligent extraction module for webpage content

This module structures the web pages captured by the Internet information collection module: it converts unstructured web page content into data with a semantic structure that a computer can identify and process, and extracts the portion of the data that has public opinion monitoring value. With existing techniques, a computer cannot directly identify and understand the information and meaning carried by web page data, so further processing of that information is impossible. This technology overcomes the difficulty by using attribute markup to help the computer identify the information structure. Once this step is complete, the accuracy and speed of the computer can be used to process massive amounts of information.

(3) Public opinion monitoring feature modeling module

This technology collects the requirement features that users specify for different public opinion monitoring items and builds a feature model of each monitoring item from those features. The feature model serves as the basis for providing public opinion monitoring services to the user. In the system, a formalized public opinion monitoring requirement is called a public opinion monitoring feature model. The system performs data mining based on the feature model and thereby extracts, from massive data, the information that meets the user's public opinion monitoring requirements.

(4) Data mining and knowledge discovery module

Based on the feature model of the monitoring item, this technology intelligently screens out useful information that meets the user's monitoring requirements from the structured data produced by the web page content intelligent extraction technology. Because the data mining is driven by the feature model, and the feature model is an abstract representation of the user's actual monitoring requirements, the public opinion information the system recommends is intended to be the valuable information the user needs, which realizes intelligent discovery of public opinion information.

Compared with the prior art, the beneficial effects of the invention are embodied in the following six aspects:

In distributed cloud computing architecture design, a large number of online data mining servers and database servers are distributed across different geographic locations and serve as the computing and storage resources of the system. Using the cloud computing resource scheduling service, the system dynamically allocates server resources in the cloud computing architecture according to the different requirements of users, which improves the operating efficiency of the data mining applications and meets users' actual needs. The system monitoring and load measurement module in the cloud computing architecture monitors and measures indicators such as the resource load state of the data mining servers, the resource load state of the database servers, the amount of computing and storage resources requested by data mining applications, and the amount of computing and storage resources requested by users. These real-time data are the basis on which the cloud computing resource scheduling service allocates system resources.

In Internet information collection and web page content intelligent extraction, the collection technology uses web page fetching to capture either information from the whole network or pages from specific information sources, according to the user's public opinion monitoring requirements, and stores the captured pages for subsequent processing. The web page content intelligent extraction technology structures the captured pages, converts unstructured web page content into semantically structured data that a computer can identify and process, and extracts the portion with public opinion monitoring value. Once this step is complete, the accuracy and speed of the computer can be used to mine massive amounts of information.

In public opinion monitoring feature modeling, the system abstracts and quantifies the user's public opinion monitoring requirements into a monitoring feature model that a computer can process. The feature model consists of a monitoring information source sequence and a monitoring feature label sequence and serves as the basis for information collection and data mining, so that the user receives precise public opinion monitoring service. The feature model can be updated in two ways, active and passive. In the active mode, the user sets and maintains the monitoring information sources and monitoring feature labels directly in order to establish and update the feature model; this mode establishes and updates the model quickly and suits users with clear monitoring requirements. In the passive mode, the user does not need to set or maintain anything; the system determines and updates the monitoring feature model through a feature training mechanism. This mode can uncover the user's latent monitoring requirements and suits users whose requirements are uncertain. The two modes can be combined: the user first sets an initial feature model in the active mode, and the system then corrects and updates it in the passive mode, so that the feature model moves closer to the user's actual monitoring requirements, continuously tracks changes in those requirements, and stays consistent with the user's current needs.

In feature-model-based data mining analysis and presentation, the data mining analysis technology uses an autonomous information filtering and screening mechanism to select, according to the feature model of the monitoring item, useful information that meets the user's monitoring requirements from the structured data produced by the web page content intelligent extraction technology. Because the data mining is driven by the feature model, and the feature model is an abstract representation of the user's actual monitoring requirements, the public opinion information the system recommends is intended to be the valuable information the user needs, which realizes intelligent discovery of public opinion information. The mined information can be provided to the user through several forms of analysis and presentation: discovering the hotspots of network public opinion through clustering, ranking public opinion hotspots by how often they appear on websites of different importance, describing the distribution of public opinion trends from continuous time-series monitoring data of the hotspots, issuing public opinion early warnings based on those trends, analyzing the degree of correlation among hotspots, and so on.

In the SaaS-based service delivery mode, the system uses software service technology so that the user does not need to build a public opinion monitoring hardware system. By using the public opinion monitoring service on the cloud mining network public opinion monitoring platform provided by this project as needed, the user can obtain the required public opinion monitoring information without restrictions of time or place.

In the aspect of a multi-platform public opinion publishing mode, the system utilizes various information transmission means to enable a user to obtain public opinion monitoring information by utilizing the most convenient information acquisition platform as far as possible. The main issuing modes are as follows: WEB page browsing, WAP page browsing, RSS subscription, Email push, MMS/SMS subscription, mobile client software, etc.

Drawings

FIG. 1 is a system architecture diagram;

FIG. 2 is a schematic block diagram of a public opinion monitoring server;

FIG. 3 is a flow chart of a method for establishing and updating a public opinion monitoring demand characteristic model;

FIG. 4 is a flow chart of public opinion information duplication removal technology;

FIG. 5 is a cloud mining system platform architecture diagram;

FIG. 6 is a flow chart of a service model implementation combining SaaS and metacomputing;

FIG. 7 is a basic flowchart of an automatic discovery method for network public opinion hotspot information.

Detailed Description

The technical solutions of the present invention will be described in further detail with reference to the accompanying drawings and the detailed description.

1) Public opinion monitoring demand feature model and public opinion information matching and filtering technology

The public opinion monitoring demand characteristic model is a data record set of demand characteristics and attention degree thereof extracted from the public opinion monitoring demands of users, and is a data simulation of the public opinion demand characteristics. The characteristic model can be identified and processed by a computer, and accurate services adaptive to the public sentiment monitoring requirements of different users can be provided for the different users.

A. Defining a public opinion monitoring demand characteristic model:

A feature sequence is defined as $I(S, T) = \{[(s_1, r_1), (s_2, r_2), \ldots, (s_p, r_p)], [(t_1, w_1), (t_2, w_2), \ldots, (t_q, w_q)]\}$, where $(s, r)$ represents an information source unit, $s$ is a monitoring information source, and $r$ is the network rank corresponding to the information source; $(t, w)$ represents a monitoring feature unit, $t$ is a feature label, and $w$ is the corresponding importance degree. Normalizing $r$ and $w$ yields $I(S, T) = \{[(s_1, x_1), (s_2, x_2), \ldots, (s_p, x_p)], [(t_1, y_1), (t_2, y_2), \ldots, (t_q, y_q)]\}$, where $x_i$ and $y_j$ are the normalized values of $r_i$ and $w_j$.
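As a rough illustration of the feature model as a data structure, the sketch below stores the normalized source and label sequences as dictionaries. The normalization formulas are not reproduced in this text, so the choices here are assumptions made only for illustration: label importances are divided by their sum, and source ranks are inverted before being divided by their sum so that a better-ranked source receives a larger value. The names `FeatureModel`, `normalize`, and `build_model` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeatureModel:
    sources: dict   # information source s -> normalized value x
    tags: dict      # feature label t -> normalized importance y


def normalize(values):
    """Scale non-negative values so they sum to 1 (assumed normalization)."""
    total = sum(values.values())
    if total == 0:
        return {k: 0.0 for k in values}
    return {k: v / total for k, v in values.items()}


def build_model(source_ranks, tag_weights):
    """Build a model from raw ranks r (1 = best) and raw importances w."""
    # Assumption: a smaller rank number means more influence, so invert it.
    inverted = {s: 1.0 / r for s, r in source_ranks.items()}
    return FeatureModel(normalize(inverted), normalize(tag_weights))


model = build_model(
    {"news.example": 1, "forum.example": 3},
    {"flood": 3.0, "rescue": 1.0},
)
```

With these inputs both sequences happen to normalize to 0.75 and 0.25, which makes the effect of the two assumed normalizations easy to check by hand.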

B. Establishing and updating a public opinion monitoring demand characteristic model:

The public opinion monitoring demand characteristic model is established actively by the user. The user first registers an information monitoring service account and, under that account, creates monitoring items as needed. For each monitoring item, the user selects suitable monitoring information sources or information source types according to the public opinion monitoring requirements, and the system automatically incorporates the selected information source units into the characteristic model of the monitoring item. The user then sets a series of related feature labels for the monitoring item, and the system initializes weights for these labels to form the initial monitoring feature units of the monitoring item.

The public opinion monitoring demand characteristic model is updated automatically by the system:

Step 1: When a user browses public opinion information, that information is returned to the server side through the user interaction interface;

Step 2: The server side performs public opinion hotspot discovery processing on the text content of the public opinion information (see innovation point 5) to obtain a series of hotspot keywords and their occurrence frequencies, and stores these data in a database as the user's browsing history;

Step 3: A characteristic model update interval is set. When the interval elapses, the system counts the user attention frequency of each feature label in the characteristic model and the frequency of each hotspot keyword in the user's browsing history. When the attention frequency of a feature label falls below a threshold, the label is deleted from the characteristic model; when the frequency of a hotspot keyword in the browsing history rises above a threshold, the keyword is incorporated into the characteristic model as a feature label. The importance value corresponding to a feature label c is determined by a formula in which m is the user's attention frequency for c.
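The passive update in Step 3 can be sketched as follows. The importance formula itself is not reproduced in this text, so this sketch assumes, purely for illustration, that the importance of a label is its attention frequency divided by the total over all retained labels. The function name `update_tags` and the two threshold parameters are hypothetical.

```python
def update_tags(tag_attention, hot_keyword_freq, drop_below, add_above):
    """Return updated {label: importance} after one update interval.

    tag_attention:    {label: user attention frequency m} for current labels
    hot_keyword_freq: {keyword: frequency} from the user's browsing history
    """
    # Delete labels whose attention frequency fell below the threshold.
    kept = {t: m for t, m in tag_attention.items() if m >= drop_below}
    # Promote frequent hotspot keywords to feature labels.
    for keyword, freq in hot_keyword_freq.items():
        if freq > add_above and keyword not in kept:
            kept[keyword] = freq
    # Assumed importance: share of the total attention frequency.
    total = sum(kept.values())
    return {t: m / total for t, m in kept.items()} if total else {}


updated = update_tags(
    {"flood": 8, "traffic": 1},
    {"rescue": 6, "weather": 2},
    drop_below=2,
    add_above=5,
)
```

In this example `traffic` is dropped for lack of attention, `rescue` is promoted from the browsing history, and `weather` stays out because it does not exceed the promotion threshold.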

The flow chart of the method for establishing and updating the public opinion monitoring demand characteristic model is shown in figure 3:

C. Matching and filtering algorithm for public opinion information based on the feature model:

(a) Public opinion information duplication removal technology

Some collected public opinion information, although it comes from different sources, has similar or even identical content. Such items are duplicates and should be filtered and merged.

The public opinion hotspot discovery technology is used to extract the hotspot keywords and corresponding frequencies from the text content of every piece of public opinion information, giving a feature sequence $K_i = \{(k_1, f_1), (k_2, f_2), \ldots, (k_n, f_n)\}$, where $K_i$ is the feature sequence of the text content of public opinion information $i$, $k$ is a hotspot keyword, and $f$ is the corresponding frequency. A similarity calculation based on the vector space model is performed on the feature sequences to obtain the similarity values between them. A similarity merging threshold is set, and pieces of public opinion information are merged when their similarity value exceeds the threshold.
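A minimal sketch of this deduplication step is given below. It uses cosine similarity, a common vector space model measure; the text does not say which measure is intended, so that choice is an assumption, as are the function names, the default threshold of 0.9, and the greedy strategy of comparing each item against the first member of every existing group.

```python
import math


def cosine(a, b):
    """Cosine similarity of two {keyword: frequency} feature sequences."""
    dot = sum(f * b.get(k, 0) for k, f in a.items())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_duplicates(items, threshold=0.9):
    """Group items whose similarity to a group's first member exceeds threshold.

    items: list of (item_id, {keyword: frequency}); returns a list of id groups.
    """
    groups = []           # ids merged together
    representatives = []  # feature sequence of the first item in each group
    for item_id, features in items:
        for index, rep in enumerate(representatives):
            if cosine(features, rep) > threshold:
                groups[index].append(item_id)
                break
        else:
            representatives.append(features)
            groups.append([item_id])
    return groups


merged = merge_duplicates([
    ("a", {"flood": 3, "city": 1}),
    ("b", {"flood": 6, "city": 2}),   # same keyword proportions as "a"
    ("c", {"football": 4}),
])
```

Items `a` and `b` differ only in length, not in keyword proportions, so they fall into one group while `c` forms its own.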

The public opinion information duplication removal technical process is shown in figure 4:

(b) Public opinion matching algorithm based on feature model

The algorithm is based on a public opinion monitoring demand characteristic model, calculates the relevance weight of each piece of public opinion information based on the characteristic model, and obtains the public opinion information relevance sequence based on the monitoring demand characteristic, thereby realizing the personalized processing process of presenting the public opinion information which best meets the monitoring demand of a user to the user.

A feature word library is defined as a table of correspondences between feature words and feature classes; the set of feature words belonging to a feature class $c$ is denoted $K(c)$.

Step 1: Each piece of information r in the public opinion information set is processed one by one as follows:

Chinese word segmentation is performed on the text content of r to obtain a number of keywords, and the keywords that belong to some feature class in the feature word library are collected into the feature word set K(r) = {key_1, key_2, ..., key_n}.

For each feature class c associated with the monitoring feature model, a weight vector W(r, c) = (w_1, w_2, ..., w_n) is calculated for K(r), wherein

<math><mrow><msub><mi>w</mi><mi>i</mi></msub><mo>=</mo><mfenced open='{' close=''><mtable><mtr><mtd><msub><mi>y</mi><mi>j</mi></msub><mo>,</mo></mtd><mtd><msub><mi>t</mi><mi>i</mi></msub><mo>&Element;</mo><mi>K</mi><mrow><mo>(</mo><msub><mi>c</mi><mi>j</mi></msub><mo>)</mo></mrow></mtd></mtr><mtr><mtd><mn>0</mn><mo>,</mo></mtd><mtd><msub><mi>t</mi><mi>i</mi></msub><mo>&NotElement;</mo><mi>K</mi><mrow><mo>(</mo><msub><mi>c</mi><mi>j</mi></msub><mo>)</mo></mrow></mtd></mtr></mtable></mfenced><mrow><mo>(</mo><msub><mi>t</mi><mi>i</mi></msub><mo>&Element;</mo><mi>K</mi><mrow><mo>(</mo><msub><mi>r</mi><mi>i</mi></msub><mo>)</mo></mrow><mo>)</mo></mrow><mspace width='2em'/><mrow><mo>(</mo><mn>3</mn><mo>)</mo></mrow></mrow></math>

Similarly, the importance vector (u_1, u_2, ..., u_n) of the keywords in K(r) is obtained and normalized to give X(r) = (x_1, x_2, ..., x_n). If the w_i in W(r, c) are not all zero, a vector similarity calculation on X(r) and W(r, c) yields sim(r, c), which indicates the degree of correlation between the public opinion information r and the feature class c; if the w_i are all zero, sim(r, c) = 0.

Step 2: After all r have been processed as in step 1, the correlation sequence of each piece of public opinion information with the feature classes related to the monitoring feature model is obtained: Sim(r, C) = (sim(r, c_1), sim(r, c_2), ..., sim(r, c_m)).

Step 3: The correlation sequence Sim(r, C) obtained in step 2 is processed comprehensively to obtain sim(r), the similarity between r and the feature model. Doing this for every r yields the set of feature model correlations Sim(R) = {sim(r_1), sim(r_2), ..., sim(r_N)}.

Step 4: Public opinion information originates from different websites on the internet, and different websites have different influence. A website's influence is determined by its ranking on the internet: the closer its ranking is to the top, the greater its influence. For public opinion information r, an importance score based on its source websites can be calculated as follows:

<math><mrow><mi>score</mi><mrow><mo>(</mo><msub><mi>r</mi><mi>i</mi></msub><mo>,</mo><mi>S</mi><mo>)</mo></mrow><mo>=</mo><mn>1</mn><mo>-</mo><munderover><mi>Π</mi><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>k</mi></munderover><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mfrac><mn>1</mn><mrow><mi>k</mi><mo>·</mo><msub><mi>n</mi><mi>i</mi></msub></mrow></mfrac><mo>)</mo></mrow><mspace width='2em'/><mrow><mo>(</mo><mn>6</mn><mo>)</mo></mrow></mrow></math>

wherein k is the number of source websites of the public opinion information, and n_i is the network ranking of the i-th source website. This equation is intended to express that the more websites r is derived from, and the higher those source websites rank on the network, the greater its importance score.

Step 5: Since sim(r) and score(r, S) are both normalized values, they can be combined in a certain ratio to obtain the corrected relevance of r: sim'(r) = k_1 · sim(r) + k_2 · score(r, S), where k_1 + k_2 = 1.

Step 6: The corrected relevance values are sorted from large to small to obtain a ranking of the public opinion information consistent with the monitoring features. A relevance threshold is set, and public opinion information above the threshold is fed back to the user.
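Steps 1 to 6 can be sketched in Python as below. Several details are assumptions made only for illustration, because the text does not fix them: y in equation (3) is read as a per-class weight, the importance vector is normalized by dividing by its sum, cosine similarity stands in for the vector similarity of step 1, a weighted average over classes stands in for the comprehensive processing of step 3, and the values k1 = 0.7, k2 = 0.3 and threshold = 0.3 are hypothetical. All function names are illustrative.

```python
import math

def weight_vector(keywords, class_words, class_weight):
    """Equation (3): w_i = class_weight if t_i is in K(c), otherwise 0."""
    return [class_weight if t in class_words else 0.0 for t in keywords]

def cosine(x, w):
    """Assumed vector similarity for sim(r, c); 0 if either vector is all zero."""
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    return sum(a * b for a, b in zip(x, w)) / norm if norm else 0.0

def model_similarity(keywords, importance, feature_model):
    """Steps 1-3: sim(r, c) per class, then an assumed weighted average as sim(r).

    feature_model maps class name -> (set of feature words K(c), class weight y).
    """
    total = sum(importance)
    x = [u / total for u in importance] if total else [0.0] * len(importance)
    sims, weights = [], []
    for words, y in feature_model.values():
        sims.append(cosine(x, weight_vector(keywords, words, y)))
        weights.append(y)
    weight_sum = sum(weights)
    return sum(s * y for s, y in zip(sims, weights)) / weight_sum if weight_sum else 0.0

def source_score(rankings):
    """Equation (6): 1 - prod_{i=1..k} (1 - 1 / (k * n_i))."""
    k = len(rankings)
    product = 1.0
    for n in rankings:
        product *= 1.0 - 1.0 / (k * n)
    return 1.0 - product if k else 0.0

def rank_opinions(items, feature_model, k1=0.7, k2=0.3, threshold=0.3):
    """Steps 5-6: corrected relevance k1*sim + k2*score, sorted descending, thresholded."""
    scored = []
    for name, keywords, importance, rankings in items:
        sim = model_similarity(keywords, importance, feature_model)
        scored.append((name, k1 * sim + k2 * source_score(rankings)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [pair for pair in scored if pair[1] > threshold]

feature_model = {"disaster": ({"earthquake", "rescue", "flood"}, 1.0)}
items = [
    # (id, K(r), importance of each keyword, rankings of the source websites)
    ("r1", ["earthquake", "rescue", "stock"], [3, 1, 1], [1, 5]),
    ("r2", ["football", "league"], [4, 2], [2]),
    ("r3", ["flood", "weather"], [1, 1], [50]),
]
for name, relevance in rank_opinions(items, feature_model):
    print(name, round(relevance, 3))
```

In this sample, r2 shares no feature word with the model, so its relevance comes only from its source score and falls below the threshold; r1 and r3 are returned in that order.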

In the deduplication test, 130 pieces of public opinion information were captured for each monitoring item. Over multiple rounds of information capture, an average of 98.8 pieces remained after deduplication; the deduplication coverage rate reached 88.9% and the deduplication accuracy reached 96.67%.

In the public opinion matching test, 30 relevant pieces had to be screened out of 130 pieces of information and fed back to the user. Over multiple rounds of information capture, relevant information meeting the monitoring requirements accounted for 73.3% of all public opinion information pushed to the user, whereas information pushed directly to the user without matching and screening met the monitoring requirements only 36.6% of the time.

2) Distributed cloud mining system architecture

The system is built on a cloud computing architecture. It transparently provides user interface services to all kinds of end users and open interface support to application programs: users can access the system's user interface from various terminals, or use the system's monitoring services indirectly through other applications that call its open interface. In either case, users do not need to know how the system is implemented or worry about whether its computing and storage capacity is sufficient. They only need to choose a service mode for the information service they require, deploy it to the system as a task for execution, and finally obtain the data mining result. In addition, the internal modules of the data mining platform provide services through the user interface and the open interface; the modules are layered by abstraction level from low to high, including the algorithm module and the task module. Services exposed through the open interface are all externally visible, whereas functions that require high-level permissions, such as system management, can be called only through the system user interface to ensure system security. The user interface can call the open interface directly to provide the externally callable services.

As shown in Fig. 5, each layer, from bottom to top, transparently provides services to the layer above it. The bottom layer is the application program interface provided by the cloud computing platform, and the top layer consists of the user interface and the open interface. By calling the open interface, users can share data sets, invoke data mining algorithms, and conveniently integrate them into their own applications, which makes the platform open. The services provided by the intermediate layers are described below from bottom to top:

Algorithm layer: this layer uses the unified data source provided by the lower layer to implement the calling and management interfaces of the various algorithms.

Data cleansing algorithm invocation service: a calling interface for preprocessing data sets that contain noisy data before a data mining algorithm is executed; the cleaned data are stored, through the data layer, in the storage space provided by the cloud computing platform for use by subsequent data mining services.

Data mining algorithm invocation service: a unified calling interface for data mining on previously cleaned data or on other data that do not need cleaning.

Visualization algorithm invocation service: presents data mining results as tables, graphs and the like.

Algorithm registration and deregistration service: the algorithm management module manages the various algorithm modules as plug-ins.

Application layer: this layer abstracts the operations of the two lower layers, describes the data and algorithms involved in the whole data mining process, together with their relationships and order, as tasks, and provides calling and maintenance interfaces with the application as the unit.

Application invocation service: provides a calling interface for registered applications.

Application registration and deregistration service: the application management module manages the various task definition files as plug-ins.

User layer: this layer provides user authentication and authorization functions.

User registration, authentication and authorization service: provides a user identity authentication and authorization interface, wherein the authorization information serves as a pass for calling each service of the lower layers so as to ensure the security of the platform. The user management interface is also provided by this service.

All services in every layer use XML as the communication language and are called internally as Web services based on representational state transfer (REST), which better supports the scalability of each layer. They are finally exposed externally through the open interface, which greatly enhances the openness and usability of the system and is not available in prior data mining platform architectures.

3) Providing services for users by combining software as a service (SaaS) mode with cloud computing architecture

Software as a service (SaaS) is becoming familiar to more and more small and medium-sized enterprises. For many of them, leasing software services is the best way to apply advanced technology: it reduces the cost of owning software, shortens the informatization construction period, and greatly reduces operation and maintenance costs. The emergence of SaaS has overturned the operating mode of traditional software.

The SaaS mode is realized by a three-layer structure:

Presentation layer: SaaS is a business model in which users use software remotely by renting it, which resolves the problems of investment and maintenance. From the user's perspective, SaaS is a software rental service mode;

Interface layer: SaaS is a unified interface mode that lets users and applications call software modules remotely through standard interfaces to realize service interaction;

Application implementation layer: SaaS is a software capability; software design must emphasize configurability and resource sharing so that one software system can conveniently serve multiple users.

Introducing the SaaS mode into a network public opinion monitoring system under a cloud computing architecture has distinct application advantages. As a mode of providing information services to users through server resources on the internet, SaaS places certain quality requirements on the computing, storage and bandwidth resources of the network, and these requirements grow as the number of users receiving SaaS services increases. For current SaaS-mode network public opinion monitoring systems built on a traditional server architecture, upgrading hardware to meet growing user demand sharply increases cost and correspondingly reduces profit. Combining the SaaS software service mode with a cloud computing architecture and applying it to network public opinion monitoring services effectively overcomes this defect.

The cloud computing architecture provides a simple and efficient mechanism for managing network resources: it can distribute computing tasks, rebalance workloads and allocate resources dynamically. This helps the SaaS service mode provide very large amounts of resources to massive numbers of users, so that the design of the public opinion monitoring system is not limited by network resources such as servers and bandwidth and can concentrate on optimizing the software design, providing the most effective service and the best user experience.

In a specific SaaS mode implementation, a user submits an access request to the system through the interactive interface. The resource scheduling service estimates the resource consumption of the request according to the geographic location and the access mode from which the user initiated it, and automatically schedules a cloud computing server resource that is closest to the user's geographic location, connects fastest for the user's access mode, and matches the estimated resource consumption, so as to provide the corresponding information monitoring service to the user. The implementation flow of this service mode is shown in Fig. 6.
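The scheduling rule described above can be illustrated with a small Python sketch. It is a hypothetical simplification: the `Server` fields, the lexicographic ordering (distance first, then latency), and the reading of "matching resource consumption" as "having at least the required free capacity" are all assumptions, not details given in the text.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    distance_km: float     # distance to the requesting user (assumed metric)
    latency_ms: float      # latency for the user's access mode (assumed metric)
    free_capacity: float   # spare resource units available on the server

def pick_server(servers, required_capacity):
    """Choose the nearest, then fastest, server that can absorb the request."""
    candidates = [s for s in servers if s.free_capacity >= required_capacity]
    if not candidates:
        return None  # no server can currently satisfy the request
    return min(candidates, key=lambda s: (s.distance_km, s.latency_ms))

servers = [
    Server("node-a", distance_km=10, latency_ms=20, free_capacity=2),
    Server("node-b", distance_km=1900, latency_ms=45, free_capacity=50),
    Server("node-c", distance_km=10, latency_ms=15, free_capacity=8),
]
chosen = pick_server(servers, required_capacity=5)
print(chosen.name if chosen else "no capacity")
```

A production scheduler would more likely combine these criteria into a weighted cost, but the ordering used here keeps the three conditions from the text easy to see.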

4) Multi-dimensional related public opinion presentation mode

The system uses automatic clustering based on a similarity algorithm to automatically classify the massive, uncategorized public opinion information collected every day: information with similar content of concern is grouped into one category, hot subject terms are automatically generated for each category, and the relevance among the hot subject terms is analyzed, presenting users with the internal relations among the hot subject terms and thereby revealing the internal relations of the public opinion information.

The fuzzy C-means clustering algorithm (FCM) is used for clustering in this project.

In the FCM algorithm, the clustering loss function defined by the membership functions can be written as:

$$J = \sum_{j=1}^{c} \sum_{i=1}^{n} [\mu_j(x_i)]^b \, \|x_i - m_j\|^2$$

wherein m_j is the center of cluster j, μ_j(x_i) is the membership of sample x_i in cluster j, and b > 1 is a constant that controls the fuzziness of the clustering result. The memberships of a sample in all clusters are required to sum to 1, i.e.

$$\sum_{j=1}^{c} \mu_j(x_i) = 1, \quad i = 1, 2, \ldots, n$$

Under this constraint, the minimum of the loss function is sought; the necessary condition is that the partial derivatives of J with respect to m_j and μ_j(x_i) are 0, which gives:

$$m_j = \frac{\sum_{i=1}^{n} [\mu_j(x_i)]^b \, x_i}{\sum_{i=1}^{n} [\mu_j(x_i)]^b}, \quad j = 1, 2, \ldots, c$$

$$\mu_j(x_i) = \frac{\left(1 / \|x_i - m_j\|^2\right)^{1/(b-1)}}{\sum_{k=1}^{c} \left(1 / \|x_i - m_k\|^2\right)^{1/(b-1)}}$$

Solving these two equations by an iterative method is the FCM algorithm. When the algorithm converges, the cluster centers and the membership values of each sample in each class are obtained, which completes the fuzzy clustering division. The algorithm is flexible and efficient, and is suitable for the text clustering requirements of the project.
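A minimal pure-Python sketch of the iteration is given below. It assumes the standard FCM update rules (each center is the mean of the samples weighted by membership raised to the power b, and each membership is proportional to the inverse squared distance raised to 1/(b-1)), random initial memberships, and a stopping tolerance on the change in membership. The function name `fcm` and its parameters are illustrative and are not taken from the original.

```python
import random

def fcm(points, c, b=2.0, iterations=100, tol=1e-6, seed=0):
    """Fuzzy C-means on a list of equal-length numeric tuples.

    Returns (centers, memberships), where memberships[i][j] is the degree
    to which sample i belongs to cluster j.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iterations):
        # Center update: mean of the samples weighted by membership ** b.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** b for i in range(n)]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)))
        # Membership update from squared distances to the centers.
        shift = 0.0
        for i, p in enumerate(points):
            d2 = [sum((p[d] - m[d]) ** 2 for d in range(dim)) for m in centers]
            if min(d2) == 0.0:
                # Sample coincides with a center: assign it there entirely.
                new = [1.0 if v == 0.0 else 0.0 for v in d2]
            else:
                new = [(1.0 / v) ** (1.0 / (b - 1.0)) for v in d2]
            s = sum(new)
            new = [v / s for v in new]
            shift = max(shift, max(abs(a - o) for a, o in zip(new, u[i])))
            u[i] = new
        if shift < tol:
            break
    return centers, u

points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
centers, memberships = fcm(points, c=2)
print(sorted(round(m[0], 2) for m in centers))
```

For text clustering, each point would be the feature-word weight vector of one page; one-dimensional points are used here only to keep the example short.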

When clustering is carried out, a Chinese word segmentation tool is first used to segment the page text content, converting it into the corresponding feature words and their weight sequences for use by the clustering algorithm. The project can use widely applied Chinese word segmentation components such as CSW, Lucene, ICTCLAS and massive word segmentation; these components are mature and efficient and provide the technical basis for Chinese word segmentation in the project.

The public opinion hotspot subject terms obtained through clustering are subjected to source-based relevance analysis to obtain the degree of relevance among the subject terms. The specific analysis process is as follows:

Let the page sequence to which subject term K belongs be P = {(p_1, w_1), (p_2, w_2), ..., (p_n, w_n)}, where p is a page to which the term belongs and w is the term's weight on that page. A correlation calculation based on the Vector Space Model (VSM) is carried out between the subject terms to obtain the correlation values among them.

5) Automatic discovery of network public opinion hotspots

The system automatically discovers the public opinion hotspots by the following steps:

A. A Chinese word segmentation tool is used to segment the webpage text content obtained by the webpage content intelligent extraction technology, and the segmentation results are stored in four one-dimensional string arrays, Segment1[n] to Segment4[n], whose elements are the keyword lists of 1 to 4 keyword units respectively. Each word produced by the word segmentation tool is one keyword unit, and an N-unit keyword is the combination of N consecutive keyword units in forward text order.

B. The segmentation results are accumulated into a total word frequency table. To count word frequencies, the number of times each word appears must be calculated, so a mapping is created in which each distinct word corresponds to a value giving its number of occurrences. The program uses the TreeMap data structure, a key-value mapping in which the key is a word and the value is its word frequency. With this data structure, it is possible to look up whether a word exists and to obtain its word frequency.

The procedure is as follows: (1) read each element of the arrays Segment1[n] to Segment4[n] in sequence; (2) look the element up in TreeMap1 to TreeMap4, adding 1 to its word frequency if it exists, and otherwise adding the word with its word frequency set to 1. Operations 1 and 2 are repeated for all news texts in sequence to obtain the total word frequency table.

Once the number of files reaches a certain order of magnitude, the TreeMap1 to TreeMap4 mappings may contain too many entries. It was found that a considerable share of the words in the word frequency table appear only once, and that words appearing no more than 4 times exceed 2/3 of the total number of words. The keywords of hot events generally have high word frequencies, so removing low-frequency words does not greatly affect the final result. Therefore, a lower-limit word frequency threshold, frequency min (set to 4), is set in the program, and when the length of TreeMap1 to TreeMap4 reaches a preset value, TreeMap max, all entries whose word frequency is less than frequency min are automatically cleared.

C. The total word frequency table is sorted. The TreeMap1 to TreeMap4 structures obtained after word frequency counting and filtering are key-value mappings that can be sorted only by the pinyin order of the words, not by word frequency. TreeMap1 to TreeMap4 are therefore converted into the sortable data structures ArrayList1 to ArrayList4, which can sort their objects by a user-defined method. A Compare method is defined for ArrayList1 to ArrayList4 (it returns 0 when the word frequency of one word is greater than that of the other, and 1 otherwise), and their sorting function then yields word lists sorted by word frequency. The word frequencies are sorted in descending order, so the top-ranked words have the highest word frequencies.

D. Stop words are removed from the total word frequency table. Stop words are words that appear frequently but carry no practical meaning; they also include punctuation marks left over after word segmentation. The stop word library is stored in a txt text file, each line of which represents one stop word. Each word in the total word frequency table is looked up in the stop word library; if it is found there, it is deleted, and otherwise it is kept. This yields a filtered keyword list.

E. Finally, the keywords of multiple units are merged. The specific method is as follows:

Each word1 in TreeMap1 is scanned in turn and substring-matched against each word in TreeMap2. If a corresponding word2 is matched, their word frequencies are compared: if the word frequency of word2 is approximately the same as that of word1, i.e., their ratio is greater than a given threshold (set here to 0.9), word2 is added to the result list. Otherwise, their ranks are compared (the ranks being taken from the corresponding ArrayList): if rank(word2) is greater than rank(word1), word2 is added to the result list; if not, no word in TreeMap2 can replace word1, and word1 is added to the result list. Finally, if word1 is not a substring of any word in TreeMap2, word1 is regarded as pure background noise and is removed from the result list. TreeMapResult, TreeMap3 and TreeMap4 are then merged by the same method, finally yielding a multi-unit keyword list that combines keywords of the various unit lengths.
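Steps A to E can be sketched compactly in Python, with plain dictionaries standing in for the TreeMap structures and sorted lists for the ArrayList structures. This is an illustrative sketch under stated assumptions: the input is already segmented into tokens, ties in frequency are broken by string order, a key function replaces the Compare method, and "rank(word2) is greater than rank(word1)" is read as word2 standing nearer the top of its own frequency list than word1 does in its list. The function names are hypothetical.

```python
def keyword_units(tokens, max_n=4):
    """Step A: n-unit keywords (n = 1..max_n) built from n consecutive tokens."""
    return {n: ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            for n in range(1, max_n + 1)}

def accumulate(table, words, max_entries, min_freq=4):
    """Step B: count words; once the table is full, drop entries rarer than min_freq."""
    for word in words:
        table[word] = table.get(word, 0) + 1
        if len(table) >= max_entries:
            for rare in [w for w, f in table.items() if f < min_freq]:
                del table[rare]

def ranked(table, stop_words=frozenset()):
    """Steps C-D: words by descending frequency, with stop words removed."""
    ordered = sorted(table.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered if word not in stop_words]

def merge_units(shorter, longer, stop_words=frozenset(), ratio=0.9):
    """Step E: merge n-unit keywords with (n+1)-unit keywords that contain them."""
    rank_s = {w: i for i, w in enumerate(ranked(shorter, stop_words))}
    rank_l = {w: i for i, w in enumerate(ranked(longer, stop_words))}
    result = {}
    for w1, r1 in rank_s.items():
        matches = [w2 for w2 in rank_l if w1 in w2]
        replaced = False
        for w2 in matches:
            # Replace w1 when the frequencies are close or w2 ranks higher.
            if longer[w2] / shorter[w1] > ratio or rank_l[w2] < r1:
                result[w2] = longer[w2]
                replaced = True
        if matches and not replaced:
            result[w1] = shorter[w1]
        # No match at all: w1 is treated as background noise and dropped.
    return result

texts = [["汶川", "地震", "救援"], ["汶川", "地震", "的", "消息"], ["地震", "预警"]]
tables = {n: {} for n in range(1, 5)}
for tokens in texts:
    for n, words in keyword_units(tokens).items():
        accumulate(tables[n], words, max_entries=100000)
merged = merge_units(tables[1], tables[2], stop_words={"的"})
print(merged)
```

Applying `merge_units` again to the result and the 3-unit table, and then the 4-unit table, mirrors the final merging pass described in step E.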

Fig. 7 is a basic flow diagram of the method:

Table 1 shows the public opinion hotspot processing results obtained in a test of the method, using news from the Tencent News website on May 12, 2008 as the test sample. Between 00:00 and 06:00 on May 12, the number of news reports is small and the word frequencies are low, but attention to hand-foot-and-mouth disease is still high. From 06:00 to 12:00, the number of news reports and the word frequencies increase, and hand-foot-and-mouth disease still ranks at the top, showing that the disease remained the focus of attention that day. From 12:00 to 18:00, earthquake-related keywords all appear, and information such as the place, magnitude and time of the earthquake can be seen from the keywords. From 18:00 to 24:00, the keywords are almost entirely earthquake-related and nearly all reports in this period concern the earthquake, from which the severity of the event can be seen.

Table 1 public opinion hotspot automatic discovery test result

6) Public opinion trend analysis based on intelligent training sequence mode

Relying on the mass storage resources provided by the cloud computing architecture, the system stores the public opinion monitoring and analysis data of different users on cloud servers over the long term in time order, forming a historical database of public opinion information for each user. From this long-term monitoring data, the distribution of the trend of public opinion focus over continuous time can be obtained. The system displays the distribution of public opinion hotspots to the user in chart form, so that the user can intuitively understand how network public opinion is developing, which provides clear and reliable data support for decision making by relevant departments.

According to the variation trend distribution of the public opinion hotspots, the system automatically trains and updates the monitoring feature model, so that the feature model can be consistent with the public opinion monitoring hotspots, and valuable information can be better screened from mass information.

7) Automatic early warning for network public opinion according to need

Network public opinion early warning means taking necessary and effective action to resolve and respond to a crisis in the period from the first signs of a crisis event until the crisis begins to cause appreciable loss. The early warning process for network public opinion mainly comprises the following steps:

A. Formulate a crisis early warning scheme. For the various kinds of crisis events, detailed judgment criteria and early warning schemes are prepared in advance, so that once a crisis occurs it can be handled in an orderly way with targeted measures.

B. Follow the development of the situation closely. Maintain first-hand, timely knowledge of the situation and strengthen monitoring.

C. Transmit and communicate information in a timely manner. Keep in close contact with the departments involved in the public opinion crisis, and establish and apply an information communication mechanism.

The significance of network public opinion early warning lies in discovering the first signs of a crisis as early as possible, judging the trend and scale of a possible real crisis as early as possible, and notifying all relevant departments so that they can prepare together. Crisis early warning capability is mainly reflected in whether potential crises can be detected keenly among the massive volume of daily online statements, and in whether the time difference between detection and the possible outbreak of the crisis can be judged accurately. The larger the time difference, the more time relevant departments have to prepare, gaining valuable time for an effective response to the crisis in the next stage.

The system provides graded early warning for the hotspots of concern that it automatically discovers in network public opinion. Users can set different early warning levels as required, and the system can also issue public opinion early warnings automatically according to public opinion trend change rules and early warning rules. When early warning information appears, the system pushes it to the relevant users at the first opportunity through the various release modes. The early warning information is displayed in chart form: the number and percentage of articles for each type of early warning information can be seen in an early warning distribution map, and data such as the propagation trend, propagation site statistics and positive and negative information statistics of each type of early warning information within a given time period can be viewed.

The above embodiments are only preferred embodiments of the system of the present invention, and the scope of the present invention is not limited thereto, and any simple changes or equivalent substitutions of technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention are within the scope of the present invention.

Claims

1. A cloud mining analysis network public opinion method based on a feature model, characterized by comprising five components:

cloud computing resource pool: comprising computing and storage resources distributed at different geographic positions, and consisting of a data mining server and a database server;

system monitoring and load measurement: providing monitoring and measurement of computing and storage resources in a cloud computing framework;

cloud computing resource scheduling service: the method is used for dynamically allocating the use of server resources in the cloud computing framework;

multi-platform public opinion publishing service: the network public opinion monitoring information obtained by data mining processing is pushed to the user in more than one release mode;

a user interaction interface: providing users with interfaces for the different public opinion publishing modes;

and establishing a corresponding user interface aiming at different public opinion publishing modes, wherein the interface provides functions of user registration and login, public opinion monitoring configuration and management and public opinion pushing, and is used for authorized access of the user, checking up latest public opinion information and personalized configuration of public opinion monitoring.

2. The cloud mining internet public opinion monitoring system based on feature model as claimed in claim 1, wherein the data mining server and the database server comprise:

the internet information collection module: the collection and storage of the internet information are realized;

the intelligent extraction module of the webpage content: performing structural processing on the webpage captured by the internet information collection module, converting the content of the unstructured webpage into data with a semantic structure which can be identified and processed by a computer, and extracting a data part with public opinion monitoring value;

public opinion monitoring characteristic modeling module: the public opinion monitoring system is used for collecting the demand characteristics of users on different public opinion monitoring items, and establishing a characteristic model of the monitoring items according to the characteristics, wherein the characteristic model is used as a basis for carrying out public opinion monitoring service on the users;

the data mining and knowledge discovery module: according to the characteristic model of the monitoring item, useful information meeting the monitoring requirement of the user is intelligently screened out from the structured data obtained by the webpage content intelligent extraction technology.

3. The cloud mining internet public opinion monitoring system based on feature model as claimed in claim 1, wherein the main monitoring and measurement indexes of the system monitoring and load measurement are: the resource load state of the data mining server, the resource load state of the database server, the amount of requests for computing and storage resources from data mining related application programs, and the amount of requests for computing and storage resources from users.

4. The cloud mining network public opinion monitoring system based on feature model as claimed in claim 1, wherein the main push modes of the multi-platform public opinion publishing service are: WEB page browsing, WAP page browsing, RSS subscription, email.

5. The cloud mining internet public opinion monitoring system based on feature model as claimed in claim 1, wherein the user interaction interface establishes a user interface corresponding to different public opinion publishing modes, and the interface provides user registration and login, public opinion monitoring configuration and management, and public opinion push functions for user authorized access, viewing latest public opinion information, and personalized configuration for public opinion monitoring.