CN114637903A - Public opinion data acquisition system for directional target data expansion - Google Patents
- Publication number
- CN114637903A (application CN202210258764.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- network
- acquisition
- information
- account
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a public opinion data acquisition system for directional target data expansion. The system expands data using specific target data as seeds and designs a scheduling strategy to realize distributed, sustainable data acquisition tasks. It covers: manually collecting target accounts and labels for directional information acquisition; expanding the seed list based on the social network; expanding the seed list based on similar features; building a distributed iterative data acquisition framework; preprocessing such as data association, deduplication and structuring; and acquisition program scheduling and performance optimization strategies. On the basis of directional data acquisition, the invention integrates feature matching and network analysis technologies, and realizes automatic, sustainable and iterative distributed information acquisition by expanding the seed data and mining its potential relationships, thereby meeting practical requirements such as social media analysis and relationship graph construction.
Description
Technical Field
The application relates to the field of data acquisition engineering, in particular to a public opinion data acquisition system for directional target data expansion.
Background
With the popularization and development of the internet, a large number of group discussions and information feedback behaviors are carried out through network applications, and more and more international netizens exchange information over the network in various ways. As group interaction and discussion move online, the associated risks and the difficulty of control gradually increase. Therefore, a public opinion big data acquisition platform needs to be built to strengthen the dynamic tracking and analysis of social conditions and public opinion, and to grasp relevant public opinion trends in a timely manner.
The data acquisition system is an important underlying foundation for public opinion management and analysis. A real-time, sustainable and efficient distributed data acquisition system helps collect the various statements published on the network in real time and serves subsequent tasks such as data analysis and the presentation of results. The difficulty of public opinion data collection lies mainly in two points: the design of the acquisition system and the design of the acquisition strategy. For the acquisition system, the core research and development question is how to acquire data efficiently, continuously and in real time while keeping task scheduling and visual management convenient. For the acquisition strategy, the core question is how to determine acquisition targets within a wide range of data sources, and how to expand the acquisition target list from the existing data by deeply mining associated data and high-value data.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide a data expansion and information acquisition system for directional targets. The system addresses system stability and the simplicity of scheduling and management by designing a distributed acquisition framework, and addresses the mining of associated data and high-value data in directional data acquisition tasks by developing a data expansion technology and a data processing scheme based on contents, relationships and features.
In order to achieve the above technical object, the present application provides a public opinion data collection system for directional target data expansion, including:
the data acquisition target orientation module is used for acquiring text data, map data, audio data and video data corresponding to the seed account number by selecting the target object as the seed account number to generate a seed information entity library;
the data expansion technology module is used for constructing a data social relation network according to the field characteristics, the content characteristics and the relation characteristics of the data of the seed information entity library, excavating key information nodes and potential community structures of the data social relation network, and marking the key information nodes and the potential community structures as high-value data to form an expanded data list;
the distributed data acquisition module is used for designing a priority strategy and custom configuration based on an extended data list and building a distributed data acquisition model with high extension degree, wherein the distributed data acquisition model is used for carrying out distributed acquisition on data to meet the flexibility requirement of data acquisition;
the data processing module is used for carrying out unified streaming preprocessing on various acquired open-source website data and providing data support for monitoring and analyzing services;
and the task scheduling and monitoring module is used for optimizing the data acquisition task workflow by constructing a big data distributed task scheduling framework, and acquiring the running logs in real time and uninterruptedly to realize monitoring and management of the acquisition process.
Preferably, the data acquisition target orientation module comprises:
the specific area account number directional acquisition unit is used for selecting a specific area as a target object and serving as a seed account number, wherein the specific area is used for representing a characteristic area to which data to be acquired belongs;
and the specific content directional acquisition unit is used for acquiring, according to the seed account number, overseas portal website data, overseas social network data and search engine data by means of verification code cracking and request parameter cracking, based on an HTML (HyperText Markup Language) parsing technology, a JSON (JavaScript Object Notation) acquisition technology and an interface acquisition technology, so as to obtain text data, map data, audio data and video data.
Preferably, the data expansion technology module comprises:
the first data expansion unit based on field features is used for acquiring a first related account according to the associated features of the seed account, wherein the first related account is used for expanding the seed account, and the associated features represent friend information, same-city user information and regional location information of the seed account;
the second data expansion unit based on the similar content is used for retrieving a second related account according to the text, the label and the key word of the release content of the seed account, wherein the second related account is used for expanding the seed account;
and the third data expansion unit based on the social relationship network is used for mining the relationship among entities, characters, organizations and events according to the community relationship structure and the information transmission network structure to which the seed account belongs, and identifying potential key characters and communities.
Preferably, the second data expansion unit includes:
the feature construction subunit is used for eliminating stop words from the content text data, acquiring text content features based on a TF-IDF keyword extraction algorithm, importing the text content features into a Word2Vec Word vector model, and generating text feature vectors, wherein the text feature vectors are used for retrieving a second related account;
and the characteristic matching subunit is used for selecting different content characteristic matching strategies according to the retrieval data volume, wherein the content characteristic matching strategies comprise keyword characteristic matching and characteristic vector content matching.
Preferably, the third data expansion unit includes:
the first network construction subunit is used for constructing a first related account network according to the association information of the seed account;
the second network construction subunit is used for constructing a second related account network by carrying out comparison analysis on the similarity of keywords, topics, regions, pictures and videos of the community interpersonal structure;
the third network construction subunit is used for constructing a third related account network by acquiring important position websites of the information transmission network structure and related user information of the important position websites, wherein the related user information is used for expressing user information with network interaction behavior with the seed accounts;
and the social relationship network constructing subunit is used for constructing a social relationship network according to the first related account network, the second related account network and the third related account network.
Preferably, the third data expansion unit further includes:
the key node mining subunit is used for identifying key characters by acquiring the activity degree, the central position, the importance degree, the influence degree and the total relationship number of the network nodes according to the social relationship network;
the potential community mining subunit is used for identifying the communities with the close relation by acquiring the network modularity index of the related account network according to the social relation network, wherein the modularity index is maximized by calling Clauset and Louvain algorithms in the process of identifying the communities with the close relation;
and the data iteration expansion subunit is used for generating an expanded data list according to the network nodes corresponding to the key characters and the community nodes of the community by setting a key node threshold and a community scale threshold.
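The iterative expansion step described by these subunits can be illustrated with a minimal sketch. It is not the patented implementation: degree centrality stands in for the activity, centrality and influence indicators named above, the community partition is assumed to have been computed already, and the names `degree_centrality`, `expand_seed_list`, `key_node_threshold` and `community_size_threshold` are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def expand_seed_list(edges, communities, seeds,
                     key_node_threshold, community_size_threshold):
    """Build an expanded account list from key nodes and large communities.

    A node counts as a key node when its centrality reaches
    key_node_threshold; a community contributes its members when its
    size reaches community_size_threshold. Existing seeds are excluded.
    """
    scores = degree_centrality(edges)
    key_nodes = {n for n, s in scores.items() if s >= key_node_threshold}
    community_nodes = set()
    for members in communities:
        if len(members) >= community_size_threshold:
            community_nodes.update(members)
    return sorted((key_nodes | community_nodes) - set(seeds))
```

The returned list could then be fed back as new acquisition targets for the next iteration; how many iterations to run is left open by the text.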
Preferably, the distributed data acquisition module comprises:
the multi-channel data acquisition unit is used for realizing multi-channel acquisition of data by setting multi-path and multi-mode;
the data distribution unit is used for building a data distribution service based on Flume, receiving all data in the acquisition process and allowing the downstream to be offline within a certain time to realize efficient and stable data transmission, wherein the bottom layer of the data distribution service is a distributed cluster structure;
and the data source website change emergency strategy unit is used for automatically identifying the corresponding websites and sections when the page of a target website changes, fully collecting the news data of the home page, and extracting structured news data fields by means of a meta-search-based engine for automatically parsing key page information.
Preferably, the data processing module comprises:
the data extraction unit is used for extracting webpage information, full-text data structuring, multimedia information and biological characteristic information to generate basic data;
the data cleaning unit is used for improving the value density of the basic data through junk information filtering, data deduplication and format cleaning;
the data association unit is used for associating the acquired data with the basic data according to the characteristics of people, places, objects, things, organizations, relationships and behaviors;
the data comparison unit is used for realizing clue discovery and alarm triggering on the basic data subjected to information association through data comparison, wherein the data comparison comprises structured comparison, keyword comparison and binary comparison;
the data identification unit is used for identifying the language, the region, the position and the service attribute of the basic data subjected to data comparison by relying on the local basic library and the service knowledge library, and providing support for upper-layer application;
and the data merging and distributing unit is used for merging the association relationship and distributing the data of the basic data subjected to the attribute identification.
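As an illustration of the data cleaning unit, the following minimal sketch combines format cleaning, junk filtering (here only empty records) and deduplication by content hash. The record layout (`id`, `text`) and the function names are assumptions made for the example; the text does not specify the actual cleaning rules.

```python
import hashlib
import re


def normalize(text):
    """Format cleaning: collapse whitespace, trim, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records):
    """Drop empty records and records whose normalized text was already seen.

    Each record is a dict with a "text" field (hypothetical layout).
    The first occurrence of each distinct text is kept.
    """
    seen = set()
    kept = []
    for record in records:
        body = normalize(record.get("text", ""))
        if not body:
            continue  # treat empty content as junk
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append({**record, "text": body})
    return kept
```

Hashing the normalized text only catches exact duplicates; near-duplicate detection would need a different technique and is outside this sketch.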
Preferably, the data identification unit includes:
the general identification subunit is used for carrying out data identification according to specific meanings contained in the data, wherein the specific meanings are determined by self definition of the data or by preprocessing correlation and comparison results;
and the service identification subunit is used for forming a label with a clear service meaning according to different knowledge bases and carrying out service identification on the data, wherein the service identification is used for supporting the formation and the model analysis of a service resource base.
Preferably, the data merging and distributing unit includes:
the incidence relation merging subunit is used for merging the incidence relation with the existing data and identifying the time span and the times of the incidence relation;
and the data distribution subunit is used for generating or updating a basic knowledge base of the public opinion data acquisition system according to the basic data subjected to attribute identification, wherein the basic knowledge base comprises a business resource base, a basic resource base and a business entity base.
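The association relationship merging described above (merging with existing data and recording the time span and number of occurrences) might look like the following sketch. The in-memory `store` dictionary and the names `merge_relation` and `time_span_days` are hypothetical; a real system would presumably persist this in the knowledge base.

```python
from datetime import datetime


def merge_relation(store, source, target, seen_at):
    """Merge one observed relation into the store.

    The store maps (source, target) to the first and last observation
    times and the number of observations.
    """
    key = (source, target)
    entry = store.get(key)
    if entry is None:
        entry = {"first_seen": seen_at, "last_seen": seen_at, "count": 1}
        store[key] = entry
    else:
        entry["first_seen"] = min(entry["first_seen"], seen_at)
        entry["last_seen"] = max(entry["last_seen"], seen_at)
        entry["count"] += 1
    return entry


def time_span_days(entry):
    """Time span of a relation in whole days."""
    return (entry["last_seen"] - entry["first_seen"]).days
```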
The invention discloses the following technical effects:
the data expansion technology based on the three paths can improve the breadth and depth of data capture on the premise of directional capture, and ensures that high-value associated data can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an overall step flow of the present invention;
FIG. 2 is a directional data acquisition system design;
FIG. 3 is a task scheduling policy;
FIG. 4 is a data expansion technique flow based on similar content;
FIG. 5 is a data expansion technical process based on a social relationship network.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1 to 5, the present invention provides a public opinion data collection system for directional target data expansion, comprising:
the data acquisition target orientation module is used for acquiring text data, map data, audio data and video data corresponding to the seed account number by selecting the target object as the seed account number to generate a seed information entity library;
the data expansion technology module is used for constructing a data social relationship network according to field characteristics, content characteristics and relationship characteristics of data of the seed information entity library, and forming an expanded data list by mining key information nodes and potential community structures of the data social relationship network and marking the key information nodes and the potential community structures as high-value data;
the distributed data acquisition module is used for designing a priority strategy and custom configuration based on an extended data list and building a distributed data acquisition model with high extension degree, wherein the distributed data acquisition model is used for carrying out distributed acquisition on data to meet the flexibility requirement of data acquisition;
the data processing module is used for collecting various open-source website data, performing uniform streaming preprocessing and providing data support for monitoring and analyzing services;
and the task scheduling and monitoring module is used for optimizing the data acquisition task workflow by constructing a big data distributed task scheduling framework, and acquiring the running logs in real time and uninterruptedly to realize monitoring and management of the acquisition process.
Further preferably, the data acquisition target orientation module comprises:
the specific area account number directional acquisition unit is used for selecting a specific area as a target object and serving as a seed account number, wherein the specific area is used for representing a characteristic area to which data to be acquired belongs;
and the specific content directional acquisition unit is used for acquiring, according to the seed account number, overseas portal website data, overseas social network data and search engine data by means of verification code cracking and request parameter cracking, based on an HTML (HyperText Markup Language) parsing technology, a JSON (JavaScript Object Notation) acquisition technology and an interface acquisition technology, so as to obtain text data, map data, audio data and video data.
Further preferably, the data expansion technology module comprises:
the first data expansion unit based on field features is used for acquiring a first related account according to the associated features of the seed account, wherein the first related account is used for expanding the seed account, and the associated features represent friend information, same-city user information and regional location information of the seed account;
the second data expansion unit based on the similar content is used for retrieving a second related account according to the text, the label and the key word of the release content of the seed account, wherein the second related account is used for expanding the seed account;
and the third data expansion unit based on the social relationship network is used for mining the relationship among entities, characters, organizations and events according to the community relationship structure and the information transmission network structure to which the seed account belongs, and identifying potential key characters and communities.
Further preferably, the second data expansion unit includes:
the feature construction subunit is used for eliminating stop words from the content text data, acquiring text content features based on a TF-IDF keyword extraction algorithm, importing the text content features into a Word2Vec Word vector model, and generating text feature vectors, wherein the text feature vectors are used for retrieving a second related account;
and the characteristic matching subunit is used for selecting different content characteristic matching strategies according to the retrieval data volume, wherein the content characteristic matching strategies comprise keyword characteristic matching and characteristic vector content matching.
Further preferably, the third data expansion unit includes:
the first network construction subunit is used for constructing a first related account network according to the associated information of the seed account;
the second network construction subunit is used for constructing a second related account network by carrying out comparison analysis on the similarity of keywords, topics, regions, pictures and videos of the community interpersonal structure;
the third network construction subunit is used for constructing a third related account network by acquiring important position websites of the information transmission network structure and related user information of the important position websites, wherein the related user information is used for expressing user information with network interaction behavior with the seed accounts;
and the social relationship network constructing subunit is used for constructing a social relationship network according to the first related account network, the second related account network and the third related account network.
Further preferably, the third data expansion unit further includes:
the key node mining subunit is used for identifying key characters by acquiring the activity degree, the central position, the importance degree, the influence degree and the total relation number of the network nodes according to the social relation network;
the potential community mining subunit is used for identifying the communities with the close relationship by acquiring the network modularity index of the related account network according to the social relationship network, wherein in the process of identifying the communities with the close relationship, the modularity index is maximized by calling Clauset and Louvain algorithms;
and the data iteration expansion subunit is used for generating an expanded data list according to the network nodes corresponding to the key characters and the community nodes of the community by setting a key node threshold and a community scale threshold.
Further preferably, the distributed data acquisition module includes:
the multi-channel data acquisition unit is used for realizing multi-channel acquisition of data by setting multi-path and multi-mode;
the data distribution unit is used for building a data distribution service based on Flume, receiving all data in the acquisition process and allowing the downstream to be offline within a certain time to realize efficient and stable data transmission, wherein the bottom layer of the data distribution service is a distributed cluster structure;
and the data source website change emergency strategy unit is used for automatically identifying the corresponding websites and sections when the page of a target website changes, fully collecting the news data of the home page, and extracting structured news data fields by means of a meta-search-based engine for automatically parsing key page information.
Further preferably, the data processing module includes:
the data extraction unit is used for extracting webpage information, full-text data structuralization, multimedia information and biological characteristic information to generate basic data;
the data cleaning unit is used for improving the value density of the basic data through junk information filtering, data deduplication and format cleaning;
the data association unit is used for associating the acquired data with the basic data according to the characteristics of people, places, objects, things, organizations, relationships and behaviors;
the data comparison unit is used for realizing clue discovery and alarm triggering on the basic data subjected to information association through data comparison, wherein the data comparison comprises structured comparison, keyword comparison and binary comparison;
the data identification unit is used for identifying the language, the region, the position and the service attribute of the basic data subjected to data comparison by relying on the local basic library and the service knowledge library, and providing support for upper-layer application;
and the data merging and distributing unit is used for merging the association relationship and distributing the data of the basic data subjected to the attribute identification.
Further preferably, the data identification unit includes:
the general identification subunit is used for carrying out data identification according to specific meanings contained in the data, wherein the specific meanings are determined by self definition of the data or by preprocessing correlation and comparison results;
and the service identification subunit is used for forming a label with a clear service meaning according to different knowledge bases and carrying out service identification on the data, wherein the service identification is used for supporting the formation and the model analysis of a service resource base.
Further preferably, the data merging and distributing unit includes:
the incidence relation merging subunit is used for merging the incidence relation with the existing data and identifying the time span and the times of the incidence relation;
and the data distribution subunit is used for generating or updating a basic knowledge base of the public opinion data acquisition system according to the basic data subjected to attribute identification, wherein the basic knowledge base comprises a business resource base, a basic resource base and a business entity base.
Example 1: the public opinion data acquisition system for directional target data expansion includes:
1.1 data acquisition target orientation: manually collecting accounts in a specific area, such as an agenda, a candidate, an organization group or a public figure, as seed accounts, and collecting their account information for the first time to construct a seed information entity library;
1.2 data expansion technology: on the basis of the seed information entity library, searching for related accounts according to the field characteristics, the content characteristics and the relationship characteristics of the data, and constructing an expanded data list;
1.3 distributed acquisition framework and strategy: obtaining the full acquisition target list after the data expansion process, building a highly extensible distributed data acquisition system, and designing a priority strategy and custom configuration to meet the flexibility requirement of data acquisition;
1.4 data acquisition and processing: performing unified streaming preprocessing on the collected data of various open-source websites, so that the data formats of different sources are relatively unified and the association identifiers are clear, which reduces the subsequent data storage and processing load to a certain extent, facilitates more complex processing, and provides necessary support for monitoring and analysis services;
1.5 task scheduling and monitoring: in order to better manage resources and optimize the life cycle of the system, an intelligent, visual monitoring platform for resource scheduling is set up; the data acquisition task workflow is optimized by means of a big data distributed task scheduling framework, and running logs are collected in real time and without interruption to realize monitoring and management of the acquisition process;
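The text does not detail the priority strategy of step 1.3. As one possible reading, the sketch below orders acquisition targets with a heap so that higher-priority targets are collected first and ties are served in insertion order. The class name `AcquisitionQueue` and the numeric priorities are assumptions for illustration only.

```python
import heapq
import itertools


class AcquisitionQueue:
    """Priority queue of acquisition targets (higher priority pops first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def push(self, target, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), target))

    def pop(self):
        _, _, target = heapq.heappop(self._heap)
        return target

    def __len__(self):
        return len(self._heap)
```

In a distributed deployment the same ordering would typically live in a shared queue service rather than in process memory; that choice is not specified by the text.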
1.1 Directional data acquisition technology, comprising the following steps:
2.1 directional collection of accounts in a specific area: manually gathering account information for a specific area, such as an agenda, election candidates, organization groups and public figures, as the seed account list;
2.2 content-specific directional acquisition: based on three acquisition technologies (HTML parsing, JSON acquisition and interface acquisition) and two cracking techniques (verification code cracking and request parameter cracking), data are acquired from three types of sources (overseas portal websites, overseas social networks and search engines), and four data types are supported (text, map, audio and video);
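Two of the three acquisition technologies named in step 2.2, HTML parsing and JSON acquisition, can be sketched with the Python standard library alone. The page structure (a title and anchor links) and the JSON layout (a top-level `posts` array with `id` and `text` fields) are assumptions made for the example, not formats given in the text.

```python
import json
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and all anchor hrefs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html(page):
    """HTML parsing: return the title and links of a page."""
    parser = TitleAndLinkParser()
    parser.feed(page)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


def parse_json_posts(payload):
    """JSON acquisition: extract id and text from a hypothetical payload."""
    data = json.loads(payload)
    return [{"id": item["id"], "text": item.get("text", "")}
            for item in data.get("posts", [])]
```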
1.2 data expansion technology, comprising the following steps:
3.1 data expansion based on field characteristics: on the basis of a seed information entity library, searching related accounts according to data characteristic fields such as friend information, same-city users, region positions and the like in account data, and constructing an extended data list;
3.2 data expansion based on similar contents: on the basis of the seed information entity library, searching for related accounts according to similarity indexes of the content texts, labels and keywords published by an account, and constructing an extended data list;
3.3 data expansion based on social relationship network: on the basis of a seed information entity library, mining the relation among entities, characters, organizations and events based on a community relationship structure and an information transmission network structure, identifying potential key characters and communities, and constructing an extended data list;
3.2 data expansion technique based on similar content, the concrete steps include:
4.1 feature construction: after stop words are removed from all content text data, the important features of the text content are extracted with a TF-IDF keyword extraction algorithm and imported into a Word2Vec word vector model to compute text feature vectors;
4.2 feature matching: selecting different content feature matching strategies according to the quantity of retrieval data, wherein the content feature matching strategies comprise keyword feature matching and feature vector content matching;
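Steps 4.1 and 4.2 can be sketched as follows. This is a simplified illustration that uses TF-IDF weight vectors directly in place of Word2Vec embeddings, so that it runs with the standard library only; the stop-word list, smoothing formula, sample texts and the 0.3 threshold are all assumptions made for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "on", "is"}  # assumed toy list


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def top_keywords(vector, k=3):
    """Keyword features: the k highest-weighted terms (ties broken by name)."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


docs = [
    "election rally in the city centre",   # seed account text
    "city election rally schedule",        # candidate 1
    "recipe for noodle soup",              # candidate 2
]
vectors = tfidf_vectors(docs)
# Feature-vector matching: keep candidates above an assumed threshold.
matches = [i for i in range(1, len(docs))
           if cosine(vectors[0], vectors[i]) >= 0.3]
```

For large retrieval volumes, comparing the `top_keywords` sets first would be the cheaper keyword-matching strategy, with vector matching reserved for the remaining candidates.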
3.3 social relationship network-based data expansion technology, comprising the following concrete steps:
5.1 network construction technology based on relational data: constructing a related account network according to the relationship information of key people, such as accounts followed, followers, same-city users, friends, interactions and groups (Facebook groups and Telegram groups);
5.2 content-based network construction techniques: the method comprises the steps of carrying out similarity comparison analysis on keywords, topics, regions, pictures and videos, and constructing a related account network;
5.3 network construction technology based on organization data: key websites such as continuous login, high login, PTT, shallot and the like are monitored in real time to acquire information related to organization activity, recording which accounts commented on or interacted with specific accounts, at what times, and in which boards (columns) and topics, and a social relationship network is organized on the basis of this structure;
5.4 mining key nodes based on the social relationship network: calculating coefficient indexes such as activity degree, central position, importance degree, influence degree, total correlation coefficient and the like of the nodes according to the network relation graph constructed by the process, and discovering key characters;
5.5 mining potential communities based on social relationship network: calling algorithms such as Clauset and Louvain based on the modularity index according to the network relationship diagram constructed by the process to maximize the modularity index, so as to discover a community structure with close relationship;
5.6, iteratively expanding data according to the key nodes and the potential communities: setting a key node threshold value and a community scale threshold value, and adding community nodes and key nodes in the threshold value obtained by the process calculation into a data expansion list;
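Steps 5.4 and 5.6 can be sketched as below. The sketch assumes degree centrality as the only node index (the text lists several) and assumes the communities have already been produced by a separate detection step; the function names, sample graph and threshold values are illustrative.

```python
def degree_centrality(edges):
    """Fraction of the other nodes each node is directly linked to."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


def build_expansion_list(edges, communities, node_threshold, min_community_size):
    """Union of key nodes (centrality >= threshold) and the members of
    communities that reach the community scale threshold."""
    centrality = degree_centrality(edges)
    key_nodes = {n for n, score in centrality.items() if score >= node_threshold}
    community_nodes = set()
    for members in communities:
        if len(members) >= min_community_size:
            community_nodes |= set(members)
    return sorted(key_nodes | community_nodes)


# Hypothetical relationship graph and pre-computed communities
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b"), ("c", "d")]
communities = [{"hub", "a", "b"}, {"c", "d"}]
expansion = build_expansion_list(edges, communities,
                                 node_threshold=0.6, min_community_size=3)
```

Feeding the resulting list back in as new seed accounts gives the iterative expansion described in 5.6.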
1.4 data acquisition processing flow, which comprises the following specific steps:
6.1 data extraction: the data extraction mainly comprises webpage information extraction, full-text data structured extraction, multimedia information extraction and biological characteristic information extraction, and is beneficial to long-term storage and use of data;
6.2 data cleaning: the data cleaning comprises junk information filtering, data duplicate removal and format cleaning, and the value density of the data is improved;
6.3 data association: the data association is used for carrying out information association on the acquired data according to characteristics of people, places, objects, things, organizations (relations), behaviors and the like, and the closed source data and the third party data need to be combined;
6.4 data comparison: the data comparison comprises structured comparison, keyword comparison and binary comparison, and meets the service requirements of clue discovery, touch alarm and the like;
6.5 data identification: the data identification is based on the basic database and the business knowledge base of the platform, and attributes such as language, region, position, business and the like are identified for the data, so that support is provided for upper-layer application. The data identification is divided into a universal identification and a service identification, wherein the universal identification is an explicit specific meaning contained in the data and is generally determined by self definition of the data or by preprocessing correlation, comparison results and the like; the service identification is a label with definite service meaning formed according to different knowledge bases, service identification is carried out on data, and formation and model analysis of a service resource base are supported;
6.6 data merge and distribution: after five steps of preprocessing, association relation merging and data distribution are needed. Merging the incidence relation refers to merging the incidence relation with the existing data and identifying the time span and the times of the incidence relation; the data distribution means that the preprocessed data are distributed to a service resource library, a basic resource library and a service entity library as required, and the basic knowledge library is updated and maintained;
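Two of the steps above, data deduplication (6.2) and association relation merging with time span and count (6.6), can be sketched as follows. The record layout, the whitespace and case normalization, and the use of ISO date strings are assumptions made for this example.

```python
import hashlib


def deduplicate(records):
    """Drop records whose normalized text has already been seen."""
    seen, unique = set(), []
    for rec in records:
        normalized = " ".join(rec["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def merge_association(store, source, target, timestamp):
    """Merge one observed relation into the store, tracking its time span
    (first_seen, last_seen) and the number of times it was observed."""
    key = (source, target)
    entry = store.get(key)
    if entry is None:
        store[key] = {"first_seen": timestamp, "last_seen": timestamp, "count": 1}
    else:
        entry["first_seen"] = min(entry["first_seen"], timestamp)
        entry["last_seen"] = max(entry["last_seen"], timestamp)
        entry["count"] += 1
    return store[key]


records = [
    {"id": 1, "text": "Hello   World"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "another post"},
]
unique_records = deduplicate(records)

relations = {}
merge_association(relations, "acct_a", "acct_b", "2022-03-02")
merge_association(relations, "acct_a", "acct_b", "2022-01-15")
merged = merge_association(relations, "acct_a", "acct_b", "2022-03-10")
```

ISO-formatted date strings compare correctly as plain strings, which keeps the sketch free of date parsing.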
1.3 distributed data capture framework and technical strategy, the method comprises the following steps:
7.1 multi-path and multi-mode data acquisition channels: the acquisition system configuration center holds hundreds of configurations across multiple versions and underpins the operation of multiple services in multiple environments; the proxy IP management service filters, cleans and protects proxy IPs so that the crawler always has efficient and stable IPs available, which is the basis of data acquisition; the dynamic page capture service (Chrome cluster) lets developers concentrate on acquisition logic instead of cracking complicated JS, and is an important means of efficient and stable acquisition; the verification code recognition service, with an average recognition success rate of 95 percent, handles the verification codes encountered during acquisition;
7.2 stable data distribution service: a stable data flow is maintained as the basis for updating the back-end data platform and data warehouse. The data distribution service is built on Flume and receives all data produced during acquisition, including logs; its bottom layer is a distributed cluster with automatic load balancing, designed to process more than one billion records per day; it is highly fault tolerant, tolerates downstream services being offline for a certain time without affecting the service, and transmits data efficiently and stably;
7.3 data source website change emergency strategy: when a target website page changes, the system automatically identifies the corresponding website and section, then starts a contingency plan that fully collects the news data of the home page and extracts structured news data fields using a meta-search-based engine for automatic analysis of key page information, so that updated data continues to be supplemented;
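The text does not state how a page change is detected in 7.3. One possible approach, shown here purely as an assumption, is to fingerprint the sequence of HTML start tags and compare it with the fingerprint stored from the previous crawl; the class and function names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the sequence of start tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structure_fingerprint(html):
    """Hash of the page's tag sequence; text changes do not affect it."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("/".join(collector.tags).encode("utf-8")).hexdigest()


def layout_changed(previous_html, current_html):
    return structure_fingerprint(previous_html) != structure_fingerprint(current_html)


old_page = "<html><body><div><h1>Title A</h1><p>text</p></div></body></html>"
new_text = "<html><body><div><h1>Title B</h1><p>other</p></div></body></html>"
new_layout = "<html><body><article><h2>Title B</h2></article></body></html>"
```

Under this assumption, `layout_changed(old_page, new_layout)` would be the trigger for the full home-page collection plan, while ordinary content updates such as `new_text` would not trigger it.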
example 2: as shown in fig. 1, the main technical links of the public opinion data acquisition system for directional target data expansion of the present invention include: the system comprises four modules of data expansion, data acquisition, data processing and task scheduling, wherein the sequence of each module and the relation between the modules are covered in the framework shown in the figure.
The following is a detailed description of each part:
1. directional target data acquisition: accounts for a specific region, such as an agenda, candidates, organization groups, public figures and the like, are collected manually as seed accounts; collection tasks covering four types of data (text, map, audio and video) are then carried out against three types of source (domestic and foreign portal websites, foreign social networks and search engines), based on collection technologies such as HTML (HyperText Markup Language) parsing, JSON (JavaScript Object Notation) collection and interface collection.
2. Expanding data: on the basis of a seed information entity library, searching related accounts according to field characteristics, content characteristics and relationship characteristics of data, and constructing an extended data list;
in the data expanding step, the following substeps are specifically included:
2.1, data expansion based on field features: extended acquisition is performed on the seed list. On the one hand, all friend information of each initial account is acquired (these friends are not necessarily in the specific region and need secondary processing); on the other hand, same-city friend information of the seed list is acquired (same-city friends can be determined to be in the region), and extended acquisition is performed according to the acquired information. The position information in the 'residence place' module of the acquired friend information is used to identify which specific region an account belongs to. Multiple data expansion iterations are performed on the accounts screened for the specific region according to these steps, and the related data are incorporated into the data expansion list. For an account whose region cannot be identified accurately, relation analysis can be carried out through a knowledge graph, and a region clustering algorithm can give a region identification result in the form of a probability;
2.2, data expansion based on similar contents: on the basis of a seed information entity library, content features are constructed from the texts, labels and keywords published by an account. First, stop words are removed from all content text data, and the important features of the text content are extracted with a TF-IDF keyword extraction algorithm. All text data of the entity library and part of the new text data are imported into a Word2Vec word vector model to obtain a pre-trained model. A similarity calculation function based on cosine similarity between text feature vectors is constructed. The content features obtained in the first step are imported into the Word2Vec model to obtain feature vectors. The feature vectors of the seed data and the new data are passed to the similarity calculation function for feature matching, and a similarity threshold is set to screen out high-similarity data, which are incorporated into the extended acquisition list;
2.3, data expansion based on the social relationship network: on the basis of a seed information entity library, a related account network is constructed according to the relationship information of key accounts, such as accounts followed, followers, same-city users, friends, interactions and groups (Facebook groups and Telegram groups); alternatively, key websites such as continuous login, high login, PTT, shallot and the like are monitored in real time to acquire information related to organization activity, recording which accounts commented on or interacted with specific accounts, at what times, and in which boards (columns) and topics, and a social relationship network is organized on the basis of this structure. Coefficient indexes such as the activity degree, central position, importance degree, influence degree and total correlation coefficient of the nodes are calculated from the network relation graph constructed in this way, and key characters are discovered. Algorithms such as Clauset and Louvain are called to maximize the modularity index, so that closely related community structures are found. A key node threshold and a community scale threshold are set, and the key nodes and potential community nodes obtained by calculation are screened and incorporated into the data expansion list;
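The modularity index that the Clauset and Louvain algorithms maximize can be written, for an undirected graph with $m$ edges, as $Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]$, where $L_c$ is the number of edges inside community $c$ and $d_c$ is the sum of the degrees of its nodes. The sketch below only scores a given partition with this formula and does not implement either search algorithm; the sample graph is an assumption.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    membership = {node: i for i, members in enumerate(communities)
                  for node in members}
    internal = [0] * len(communities)      # L_c: edges inside community c
    degree_sum = [0] * len(communities)    # d_c: total degree of community c
    for a, b in edges:
        degree_sum[membership[a]] += 1
        degree_sum[membership[b]] += 1
        if membership[a] == membership[b]:
            internal[membership[a]] += 1
    return sum(internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
               for i in range(len(communities)))


# Two triangles joined by a single bridge edge (hypothetical data)
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_merged = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the graph into its two triangles scores higher than treating it as one community, which is the kind of partition a modularity-maximizing algorithm would prefer.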
3. distributed acquisition framework and strategy: the total collection target list is obtained after the data expansion process; a highly scalable distributed data collection system is built by means of the Scrapy-Redis framework, and a priority strategy and custom configuration are designed to meet the flexibility requirements of data collection;
3.1, the distributed acquisition framework mainly comprises a cluster management node (SpideService), a management node agent service (MasterAgent) and an acquisition link (SpideProxy). The cluster management node is the control center of the whole cluster: it provides data capture interfaces of various granularities to the scheduling services, selects suitable collection nodes for each collection task, distributes the task and returns the processing result. Within it, the request monitoring module receives task requests from upstream acquisition scheduling and hands them to the task management module; the task management module stores tasks and monitors their timeout state; the task distribution module schedules tasks according to their priority, order, required download IP and the like, and selects a suitable acquisition link to execute each task; the task recovery module receives the execution result of each task and returns it to the scheduling service through the task management module; the link management module creates links, monitors them, calculates their connectivity and records the execution capacity of each node. The management node agent service manages a plurality of spiderServices and uses ZooKeeper as the distributed coordination service to elect the primary management node. Each management node is registered in ZooKeeper as an ephemeral node; once the service goes down, its node information is removed automatically, and the scheduling services and collection nodes switch automatically to an available cluster management node. The system achieves high availability of the cluster management node through a one-online, multi-standby (OneOnline-multistandBy) mode, which improves the stability of the acquisition system.
3.2, the distributed acquisition strategy provides multi-path, multi-mode data acquisition channels, a stable data distribution service and an emergency strategy for data source website changes. Multi-path, multi-mode data acquisition channels: the acquisition system configuration center holds hundreds of configurations across multiple versions and underpins the operation of multiple services in multiple environments; the proxy IP management service filters, cleans and protects proxy IPs so that the crawler always has efficient and stable IPs available, which is the basis of data acquisition; the dynamic page capture service (Chrome cluster) lets developers concentrate on acquisition logic instead of cracking complicated JS, and is an important means of efficient and stable acquisition; the verification code recognition service, with an average recognition success rate of 95 percent, handles the verification codes encountered during acquisition. Stable data distribution service: a stable data flow is maintained as the basis for updating the back-end data platform and data warehouse. The data distribution service is built on Flume and receives all data produced during acquisition, including logs; its bottom layer is a distributed cluster with automatic load balancing, designed to process more than one billion records per day; it is highly fault tolerant, tolerates downstream services being offline for a certain time without affecting the service, and transmits data efficiently and stably. Data source website change emergency strategy: when a target website page changes, the system automatically identifies the corresponding website and section, then starts a contingency plan that fully collects the news data of the home page and extracts structured news data fields using a meta-search-based engine for automatic analysis of key page information, so that updated data continues to be supplemented;
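The task distribution and link management modules described in 3.1 can be sketched with a priority queue and a simple link-selection rule. The class name, the convention that a lower number means higher priority, and the success-rate statistic used to rank links are assumptions for this illustration.

```python
import heapq
import itertools


class TaskDispatcher:
    """Hands out tasks by priority, then by submission order."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker preserving submit order

    def submit(self, url, priority):
        # Lower number = higher priority (assumed convention).
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def next_task(self):
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url


def pick_link(links):
    """Choose the acquisition link with the best recorded success rate."""
    return max(links, key=lambda name: links[name]["ok"] / max(1, links[name]["total"]))


dispatcher = TaskDispatcher()
dispatcher.submit("https://example.com/b", priority=2)
dispatcher.submit("https://example.com/a", priority=1)
dispatcher.submit("https://example.com/c", priority=1)
order = [dispatcher.next_task() for _ in range(3)]

links = {"link1": {"ok": 8, "total": 10}, "link2": {"ok": 9, "total": 10}}
best = pick_link(links)
```

In a Scrapy-Redis deployment the queue would live in Redis rather than in process memory; the in-memory heap here only illustrates the ordering rule.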
4. and (3) collected data processing: the collected various open-source website data are subjected to unified streaming preprocessing, so that the data formats of different sources are relatively unified, the associated identifiers are clear, the subsequent data storage processing amount is reduced to a certain extent, and necessary support is provided for monitoring and analyzing services; the processing flow comprises the following steps: data extraction, namely webpage information extraction, full-text data structured extraction, multimedia information extraction and biological characteristic information extraction, is beneficial to long-term storage and use of data; data cleaning, namely junk information filtering, data deduplication and format cleaning, improves the value density of data; data association, namely performing information association on acquired data according to characteristics of people, places, objects, things, organizations (relations), behaviors and the like, wherein closed source data and third-party data need to be combined; data comparison, namely structured comparison, keyword comparison and binary comparison, meets the business requirements of clue discovery, touch network alarm and the like; data identification, namely, carrying out attribute identification on data such as language, region, position, service and the like by depending on a basic database and a service knowledge base of the platform, and providing support for upper-layer application; merging and distributing data, namely merging the incidence relation with the existing data, identifying the time span and the times of the incidence relation, distributing the preprocessed data to a service resource library, a basic resource library and a service entity library as required, and updating and maintaining the basic knowledge library;
5. task scheduling and monitoring: in order to better manage resources and optimize the life cycle of the system, an intelligent and visual monitoring platform for resource scheduling is set up. The data acquisition task workflow is optimized by means of the Airflow big data distributed task scheduling framework, and running logs are acquired continuously in real time using Spark Streaming to realize monitoring and management of the acquisition flow.
The invention discloses a public opinion data acquisition system for directional target data expansion. The method comprises: manually collecting the target accounts and labels for directional information acquisition; expanding the seed list based on the social relationship network; expanding the seed list based on similar features; building a distributed, iterative data acquisition framework; preprocessing such as data association, deduplication and structuring; and acquisition program scheduling and performance optimization strategies. The method integrates feature matching and network analysis technologies on the basis of directional data acquisition, and realizes automatic, sustainable and iterative distributed information acquisition by expanding the seed data and mining its potential relations, thereby meeting practical requirements such as social media analysis and relation map construction.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that these descriptions are illustrative rather than restrictive, and that the scope of the present invention is not limited to the above embodiments. Any person skilled in the art can, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the present invention and are intended to be covered by it. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A public opinion data acquisition system aiming at directional target data expansion is characterized by comprising the following steps:
the data acquisition target orientation module is used for acquiring text data, map data, audio data and video data corresponding to a seed account by selecting a target object as the seed account to generate a seed information entity library;
the data expansion technology module is used for constructing a data social relationship network according to the field characteristics, the content characteristics and the relationship characteristics of the data of the seed information entity library, and forming an expanded data list by mining key information nodes and potential community structures of the data social relationship network and marking the key information nodes and the potential community structures as high-value data;
the distributed data acquisition module is used for designing a priority strategy and custom configuration based on the extended data list and building a distributed data acquisition model with high extension degree, wherein the distributed data acquisition model is used for carrying out distributed acquisition on data to meet the flexibility requirement of data acquisition;
the data processing module is used for collecting various open-source website data, performing uniform streaming preprocessing and providing data support for monitoring and analyzing services;
and the task scheduling and monitoring module is used for optimizing the data acquisition task workflow by constructing a big data distributed task scheduling framework, and acquiring the running logs in real time and uninterruptedly to realize monitoring and management of the acquisition process.
2. The public opinion data collection system aiming at oriented target data expansion according to claim 1, characterized in that:
the data acquisition target orientation module comprises:
the specific area account number directional acquisition unit is used for selecting a specific area as the target object and the seed account number, wherein the specific area is used for representing a characteristic area to which the data to be acquired belongs;
and the specific content directional acquisition unit is used for acquiring, according to the seed account, domestic and overseas portal website data, overseas social network data and search engine data by means of verification code cracking and request parameter cracking, based on an HTML (HyperText Markup Language) parsing technology, a JSON (JavaScript Object Notation) acquisition technology and an interface acquisition technology, so as to obtain the text data, the map data, the audio data and the video data.
3. The public opinion data collection system aiming at oriented target data expansion according to claim 2 is characterized in that:
the data expansion technical module comprises:
the first data expansion unit based on the field characteristics is used for acquiring a first related account according to the related characteristics of the seed account, wherein the first related account is used for expanding the seed account, and the related characteristics are used for expressing friend information, same-city user information and region position information of the seed account;
the second data expansion unit based on similar content is used for retrieving a second related account according to the text, the label and the key word of the release content of the seed account, wherein the second related account is used for expanding the seed account;
and the third data expansion unit based on the social relationship network is used for mining the relationship among entities, characters, organizations and events according to the community relationship structure and the information transmission network structure to which the seed account belongs, and identifying potential key characters and communities.
4. The public opinion data collection system aiming at targeted data expansion according to claim 3, characterized in that:
the second data expansion unit comprises:
the feature construction subunit is used for eliminating stop words from the content text data, acquiring text content features based on a TF-IDF keyword extraction algorithm, importing the text content features into a Word2Vec Word vector model, and generating text feature vectors, wherein the text feature vectors are used for retrieving the second related account;
and the characteristic matching subunit is used for selecting different content characteristic matching strategies according to the retrieval data volume, wherein the content characteristic matching strategies comprise keyword characteristic matching and characteristic vector content matching.
5. The public opinion data collection system aiming at oriented target data expansion according to claim 4, characterized in that:
the third data expansion unit includes:
the first network construction subunit is used for constructing a first related account network according to the associated information of the seed account;
the second network construction subunit is used for constructing a second related account network by carrying out comparison analysis on the similarity of the keywords, the topics, the regions, the pictures and the videos of the community interpersonal structure;
a third network construction subunit based on organization data, configured to construct a third relevant account network by acquiring a key position website of the information dissemination network structure and associated user information of the key position website, where the associated user information is used to indicate user information with a network interaction behavior with the seed account;
and the social relationship network constructing subunit is used for constructing the social relationship network according to the first related account network, the second related account network and the third related account network.
6. The public opinion data collection system aiming at oriented target data expansion according to claim 5, characterized in that:
the third data expansion unit further includes:
the key node mining subunit is used for identifying the key characters by acquiring the activity degree, the central position, the importance degree, the influence degree and the total relation number of the network nodes according to the social relation network;
the potential community mining subunit is used for identifying the communities with the close relation by acquiring the network modularity index of the related account network according to the social relation network, wherein the modularity index is maximized by calling a Clauset algorithm and a Louvain algorithm in the process of identifying the communities with the close relation;
and the data iteration expansion subunit is used for generating the expanded data list according to the network nodes corresponding to the key characters and the community nodes of the community by setting a key node threshold and a community scale threshold.
7. The public opinion data collection system aiming at oriented target data expansion according to claim 6, characterized in that:
the distributed data acquisition module comprises:
the multi-channel data acquisition unit is used for realizing multi-channel acquisition of data by setting multi-path and multi-mode;
the data distribution unit is used for building a data distribution service based on Flume, receiving all data in the acquisition process and allowing the downstream to be offline within a certain time to realize efficient and stable data transmission, wherein the bottom layer of the data distribution service is a distributed cluster structure;
and the data source website change emergency strategy unit is used for automatically identifying corresponding websites and plates when the page of the target website changes, fully collecting news data of a home page, and extracting the structured news data field based on a page key information automatic analysis engine of the meta search.
8. The public opinion data collection system aiming at oriented target data expansion according to claim 7 is characterized in that:
the data processing module comprises:
the data extraction unit is used for extracting webpage information, full-text data structuralization, multimedia information and biological characteristic information to generate basic data;
the data cleaning unit is used for improving the value density of the basic data through junk information filtering, data deduplication and format cleaning;
the data association unit is used for associating the acquired data with the basic data according to the characteristics of people, places, objects, things, organizations, relationships and behaviors;
the data comparison unit is used for realizing clue discovery and touch alarm on the basic data after information association through data comparison, wherein the data comparison comprises structural comparison, keyword comparison and binary comparison;
the data identification unit is used for identifying the language, the region, the position and the service attribute of the basic data after data comparison by relying on a local basic library and a service knowledge library so as to provide support for upper-layer application;
and the data merging and distributing unit is used for merging the association relationship and distributing the data of the basic data subjected to the attribute identification.
9. The public opinion data collection system aiming at targeted data expansion according to claim 8 is characterized in that:
the data identification unit includes:
the general identification subunit is used for carrying out data identification according to specific meanings contained in the data, wherein the specific meanings are determined by self definition of the data or by preprocessing correlation and comparison results;
and the service identification subunit is used for forming a label with a definite service meaning according to different knowledge bases and carrying out service identification on the data, wherein the service identification is used for supporting the formation and the model analysis of a service resource base.
10. The public opinion data collection system aiming at targeted data expansion according to claim 9 is characterized in that:
the data merging and distributing unit comprises:
the incidence relation merging subunit is used for merging the incidence relation with the existing data and identifying the time span and the times of the incidence relation;
and the data distribution subunit is used for generating or updating a basic knowledge base of the public opinion data acquisition system according to the basic data subjected to attribute identification, wherein the basic knowledge base comprises a service resource base, a basic resource base and a service entity base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210258764.5A CN114637903A (en) | 2022-03-16 | 2022-03-16 | Public opinion data acquisition system for directional target data expansion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210258764.5A CN114637903A (en) | 2022-03-16 | 2022-03-16 | Public opinion data acquisition system for directional target data expansion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114637903A (en) | 2022-06-17 |
Family
ID=81949912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210258764.5A Pending CN114637903A (en) | 2022-03-16 | 2022-03-16 | Public opinion data acquisition system for directional target data expansion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114637903A (en) |
- 2022-03-16: CN application CN202210258764.5A, published as CN114637903A (en), status: active, Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115098242A (en) * | 2022-08-24 | 2022-09-23 | 广州市城市排水有限公司 | Real-time acquisition and processing method and system for deep tunnel surveying and mapping data |
CN115098242B (en) * | 2022-08-24 | 2022-11-08 | 广州市城市排水有限公司 | Real-time acquisition and processing method and system for deep tunnel surveying and mapping data |
CN115375923A (en) * | 2022-10-27 | 2022-11-22 | 江西省亿发姆科技发展有限公司 | Crop diagnosis method and device, intelligent diagnosis instrument and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7201730B2 (en) | Intention recommendation method, device, equipment and storage medium | |
Li et al. | A survey on personalized news recommendation technology | |
Bordin et al. | Dspbench: A suite of benchmark applications for distributed data stream processing systems | |
Rehman et al. | Building a data warehouse for twitter stream exploration | |
Amato et al. | Centrality in heterogeneous social networks for lurkers detection: An approach based on hypergraphs | |
Liu et al. | An improved Apriori–based algorithm for friends recommendation in microblog | |
Cenni et al. | Twitter vigilance: A multi-user platform for cross-domain Twitter data analytics, NLP and sentiment analysis | |
CN114637903A (en) | Public opinion data acquisition system for directional target data expansion | |
Lee et al. | An automatic topic ranking approach for event detection on microblogging messages | |
Okewu et al. | Design of a learning analytics system for academic advising in Nigerian universities | |
Li et al. | Granularity decision of microservice splitting in view of maintainability and its innovation effect in government data sharing | |
Anderson et al. | Architectural Implications of Social Media Analytics in Support of Crisis Informatics Research. | |
Peng et al. | Research trends in social media/big data with the emphasis on data collection and data management: A bibliometric analysis | |
Onorati et al. | Semantic visualization of Twitter usage in emergency and crisis situations | |
Dai et al. | Information spread of emergency events: path searching on social networks | |
Kim et al. | TwitterTrends: a spatio-temporal trend detection and related keywords recommendation scheme | |
CN111353085A (en) | Cloud mining network public opinion analysis method based on feature model | |
Zhao et al. | Collecting, managing and analyzing social networking data effectively | |
Bingöl et al. | Topic-based influence computation in social networks under resource constraints | |
Chen et al. | Design of Online Education Information Management System Based on Data Mining Algorithm | |
Saha et al. | Big data and internet of things: a survey | |
Al-Barhamtoshy et al. | A data analytic framework for unstructured text | |
Liu et al. | A preliminary approach of constructing a knowledge graph-based enterprise informationized audit platform | |
Zhao et al. | A system to manage and mine microblogging data | |
Kaufhold et al. | Cross-Media Usage of Social Big Data for Emergency Services and Volunteer Communities: Approaches, Development and Challenges of Multi-Platform Social Media Services |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||