CN112380852A

CN112380852A - Public opinion data processing system

Info

Publication number: CN112380852A
Application number: CN202011264118.7A
Authority: CN
Inventors: 齐中祥
Original assignee: Womin High New Science & Technology Beijing Co ltd
Current assignee: Womin High New Science & Technology Beijing Co ltd
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2021-02-19

Abstract

The invention provides a public opinion data processing system comprising a data standardization module, a data warehouse construction module, and a storage and calculation engine selection module. The data standardization module standardizes the data of the system according to a unified standard. The data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer and an application layer. The ODS layer takes snapshots and backups of the real-time message queues of the various data sources and of the files of offline data; the data detail layer performs real-time dimension retrieval on the backed-up data; the data summary layer builds wide-table summaries of the detail-layer data and the common metrics; and the application layer provides services externally through an RPC framework. Standardization reduces research and development time and effort. Building the data warehouse lowers the threshold for use and unifies data standards, and selecting more suitable components to support the upper layers lets each component be used in the scenario where it performs best, which supports fast responses at the application layer.

Description

Public opinion data processing system

Technical Field

The embodiment of the invention relates to the technical field of radio, in particular to a public opinion data processing system.

Background

With the development of services, the amount of data is increasing day by day. Under this premise, it becomes extremely difficult to accurately find the location of the desired data and quickly provide the correct, consistent, and legible data.

Disclosure of Invention

To solve the above technical problems, an embodiment of the present invention provides a public opinion data processing system. The system applies processing such as format cleaning, feature labeling, and analysis and calculation to the data collected each day. It is based on standardized data naming, establishes a suitable data warehouse model, and selects a calculation engine suited to each scenario, thereby reducing the coupling between modules and improving the reuse rate, the data processing speed and the responsiveness of business analysis. The specific technical solution is as follows:

The public opinion data processing system provided by the embodiment of the invention comprises a data standardization module, a data warehouse construction module, and a storage and calculation engine selection module; the data standardization module is used for standardizing the data of the system according to a unified standard; the data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer and an application layer; the ODS layer is used for taking snapshots and backups of the real-time message queues of the various data sources and of the files of offline data; the data detail layer is used for performing real-time dimension retrieval on the backed-up data; the data summary layer is used for building wide-table summaries of the detail-layer data and the common metrics; the application layer is used for providing services externally through an RPC framework.

Further, Storm is used for real-time data processing in the ODS layer, and Flink is used as an alternative for association and aggregation operations.

Further, for dimension data the data detail layer selects Elasticsearch as storage, with query latency below 10 ms at 1000+ QPS on a single machine; for detail data, HBase is selected as storage.

Further, the data summary layer selects Kudu as storage, which supports streaming input with near-real-time availability and time-series access patterns that vary widely.

Further, the calculation engine selection module uses different calculation engines for different business scenarios, as follows:

for metric aggregation on general topics, Impala is used;

for aggregation requiring full-text retrieval, Elasticsearch is used;

for scenarios with large data volumes and relatively high complexity, Hive and Spark are used as offline calculation engines.

Further, the application layer provides detail data query, multi-dimensional full-text retrieval and OLAP analysis; it provides a caching mechanism for hot data and stores data snapshots that need to be persisted in MongoDB.
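A minimal sketch of the hot-data caching idea, assuming an in-process LRU cache stands in for whatever cache the system actually uses (the source does not name one). The function names `query_detail_store` and `get_detail`, the record layout and the cache size are all hypothetical.

```python
from functools import lru_cache


def query_detail_store(record_id: str) -> dict:
    """Stand-in for a real detail-layer query (hypothetical backend)."""
    return {"id": record_id, "title": f"record {record_id}"}


@lru_cache(maxsize=1024)  # assumed size; least recently used entries are evicted
def get_detail(record_id: str) -> dict:
    """Serve repeated requests for hot records from an in-process cache."""
    return query_detail_store(record_id)


if __name__ == "__main__":
    get_detail("a")  # miss: goes to the backend
    get_detail("a")  # hit: served from the cache
    print(get_detail.cache_info())
```

Because the cached value is a mutable dictionary shared between callers, a production cache would normally return copies or immutable records, and would also need an expiry policy for data that changes.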

Further, the data standardization module comprises:

a general specification module, used for applying general standards to the data of the system;

a table name specification module, used for standardizing the table names of the system;

a field naming specification module, used for standardizing the field names of the system data.

Further, the general specification module includes the following rules:

names are formed from word roots separated by underscores, and each part is a lowercase English word;

table names and field names begin with a letter;

table names and field names do not exceed 64 characters in length;

keywords already defined in the root dictionary are used;

non-standard abbreviations are not used in custom word roots.

The table name specification comprises type, theme, subtopic, meaning, update frequency and suffix.

Further, the field naming specification includes: a naming standard for basic metric word roots, a standard for business modifiers (words describing the business scenario), a date modifier standard and an aggregation modifier standard.

Further, the field naming specification also includes:

naming specification for common metrics: business modifier + basic metric root;

naming specification for date-type metrics: business modifier + basic metric root + date modifier;

naming specification for aggregate metrics: business modifier + basic metric root + aggregation type + date modifier.

The embodiment of the invention provides a public opinion data processing system comprising a data standardization module, a data warehouse construction module, and a storage and calculation engine selection module; the data standardization module is used for standardizing the data of the system according to a unified standard; the data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer and an application layer; the ODS layer is used for taking snapshots and backups of the real-time message queues of the various data sources and of the files of offline data; the data detail layer is used for performing real-time dimension retrieval on the backed-up data; the data summary layer is used for building wide-table summaries of the detail-layer data and the common metrics; the application layer is used for providing services externally through an RPC framework. Standardization reduces research and development time and effort. Building the data warehouse lowers the threshold for use and unifies data standards, and selecting more suitable components to support the upper layers lets each component be used in the scenario where it performs best, which supports fast responses at the application layer.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

The structures, proportions, sizes and the like shown in this specification are only intended to accompany the contents disclosed in the specification so that those skilled in the art can read and understand them; they do not limit the conditions under which the present invention can be implemented and therefore have no substantive technical significance. Any structural modification, change in proportion or adjustment of size that does not affect the functions and purposes of the present invention shall still fall within the scope of the present invention.

Fig. 1 is a schematic block diagram of a public opinion data processing system according to an embodiment of the present invention;

Fig. 2 is a schematic block diagram of a data warehouse of a public opinion data processing system according to an embodiment of the present invention;

Fig. 3 is a schematic block diagram of a computing engine of a public opinion data processing system according to an embodiment of the present invention.

Detailed Description

The present invention is described below by way of particular embodiments; other advantages and features of the invention will become apparent to those skilled in the art from this disclosure. The described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Fig. 1 is a schematic block diagram of a public opinion data processing system according to an embodiment of the present invention. The system comprises a data standardization module, a data warehouse construction module, and a storage and calculation engine selection module; the data standardization module is used for standardizing the data of the system according to a unified standard; the data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer and an application layer; the ODS layer is used for taking snapshots and backups of the real-time message queues of the various data sources and of the files of offline data; the data detail layer is used for performing real-time dimension retrieval on the backed-up data; the data summary layer is used for building wide-table summaries of the detail-layer data and the common metrics; the application layer is used for providing services externally through an RPC framework.

Referring to fig. 2, a schematic block diagram of a data warehouse of a public opinion data processing system according to an embodiment of the present invention is shown, where the warehouse includes the following four layers:

ODS: real-time message queues of the various data sources, and file snapshots of offline data.

Data detail layer: integrated fact data, with real-time retrieval of dimension data.

Data summary layer: wide-table summaries of the detail data and the common metrics.

Application layer: built for different business requirements; provides services externally through an RPC framework.

Fig. 3 is a schematic block diagram of a computing engine of a public opinion data processing system according to an embodiment of the present invention.

ETL: in the early stage of building the stream processing platform, Storm was used for real-time data processing; it performs well in both flexibility and performance. However, because its API is too low-level, common operations in data development, such as association and aggregation, require additional development. Flink is therefore used instead for these operations. Flink's latency is close to Storm's and its throughput is much higher. In addition, Flink's table abstraction and SQL support make the development process more reliable, efficient and easy to maintain.

Data detail: for dimension data, Elasticsearch is selected as storage; at 1000+ QPS on a single machine, query latency is below 10 ms. For detail data, HBase is selected as storage to handle the write-heavy, read-light workload.

Lightly summarized wide table: for general (basic) topic metrics, Kudu is selected as storage, since it supports streaming input with near-real-time availability and time-series access patterns that vary widely.

Calculation engine: different calculation engines are used for different business scenarios. For metric aggregation on general topics, Impala is used and responds quickly. For aggregation that requires full-text retrieval, Elasticsearch is used and responds quickly under limited conditions. These two schemes are mainly used in real-time scenarios. In scenarios with large data volumes and relatively high complexity, aggregation in Impala or Elasticsearch can suffer degraded and unstable performance, so Hive and Spark are used as offline calculation engines for such needs.

Application layer: the application layer is relatively complex because it must meet the requirements of different services. It mainly provides detail data query, multi-dimensional full-text retrieval, OLAP analysis and similar functions. A caching mechanism for hot data reduces load and improves efficiency, and some data snapshots that need to be persisted are stored in MongoDB.
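The engine choice described for the calculation engine can be sketched as a small routing function. This is an illustration under stated assumptions, not the patented implementation: the `Query` class, its two flags and the order of the checks are hypothetical, and the returned strings are labels only, not client calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical description of an aggregation request."""
    needs_full_text: bool = False   # aggregation depends on full-text retrieval
    large_or_complex: bool = False  # large data volume or relatively high complexity


def select_engine(query: Query) -> str:
    """Pick a calculation engine label for the given request."""
    # Assumption: large or complex jobs go offline first, because the text
    # says Impala and Elasticsearch become unstable under such load.
    if query.large_or_complex:
        return "hive/spark"
    if query.needs_full_text:
        return "elasticsearch"
    # Default: metric aggregation on general topics.
    return "impala"


if __name__ == "__main__":
    print(select_engine(Query()))
    print(select_engine(Query(needs_full_text=True)))
    print(select_engine(Query(needs_full_text=True, large_or_complex=True)))
```

Checking the offline case first means a request that is both full-text and very large is treated as an offline job; the source does not state which rule wins in that case.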

In an alternative embodiment of the invention, data standardization underpins the construction of the data warehouse. To avoid duplicated metric construction and poor data quality, the warehouse is built according to a unified standard, which prevents repeated development, confusion and errors.

General specification:

a) Names are formed from word roots separated by underscores, and each part is a lowercase English word.

b) Table names and field names must begin with a letter.

c) Table names and field names must not exceed 64 characters in length.

d) Keywords already defined in the root dictionary are used in preference to new ones.

e) Custom word roots must not use non-standard abbreviations.
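Rules a) to d) can be checked mechanically. The sketch below is one possible reading of them, not part of the original disclosure: the helper names are hypothetical, and digits are rejected because rule a) asks for English words only.

```python
import re

# Assumed reading of rules a) to c): lowercase English words joined by single
# underscores, first character a letter, at most 64 characters in total.
NAME_PATTERN = re.compile(r"[a-z]+(?:_[a-z]+)*")
MAX_NAME_LENGTH = 64


def is_valid_name(name: str) -> bool:
    """Return True if a table or field name satisfies rules a) to c)."""
    return len(name) <= MAX_NAME_LENGTH and NAME_PATTERN.fullmatch(name) is not None


def unknown_roots(name: str, root_dictionary: set) -> list:
    """List the parts of a name that are not in the root dictionary (rule d)."""
    return [part for part in name.split("_") if part not in root_dictionary]


if __name__ == "__main__":
    print(is_valid_name("news_cnt_d"))
    print(is_valid_name("News_cnt"))
    print(unknown_roots("news_cnt_d", {"cnt", "d"}))
```

Rule e) concerns whether an abbreviation is standard, which needs human review or a curated dictionary and is therefore left out of the sketch.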

Table name specification

Table name = type + theme + subtopic + meaning + update frequency + suffix, for example:

dwa_scheme_move_terminate_route
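The composition rule for table names can be sketched as follows. The function name, the treatment of the suffix as optional, and the sample component values are assumptions made for illustration; the source gives only the order of the components.

```python
MAX_NAME_LENGTH = 64  # limit taken from the general specification


def build_table_name(type_: str, theme: str, subtopic: str, meaning: str,
                     update_frequency: str, suffix: str = "") -> str:
    """Join the table-name components in the order given by the specification."""
    parts = [type_, theme, subtopic, meaning, update_frequency, suffix]
    # Empty components are skipped so that an omitted part does not leave "__".
    name = "_".join(part for part in parts if part)
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"table name exceeds {MAX_NAME_LENGTH} characters: {name}")
    return name


if __name__ == "__main__":
    # Hypothetical component values, chosen only to show the ordering.
    print(build_table_name("dwa", "opinion", "news", "summary", "d"))
```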

Field naming specification

a) Basic metric roots, for example:

Name	English full name	Data type	Precision	Root	Example
Count	count	int	0	cnt	100
Proportion	ratio	float	4	ratio	0.8623
…

b) Business modifiers: words that describe the business scenario;

c) Date modifiers, such as:

Name	English full name	Root
Hour	hourly	h
Day	daily	d
Month	monthly	m
…

d) Aggregation modifiers, such as:

Name	English full name	Root
Average	average	avg
Median	median	mid
Top N	top n	tpn
…

e) Naming specification for common metrics: business modifier + basic metric root;

f) Naming specification for date-type metrics: business modifier + basic metric root + date modifier;

g) Naming specification for aggregate metrics: business modifier + basic metric root + aggregation type + date modifier.
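Rules e) to g) can be illustrated with a small builder that looks up roots from the example tables in this section. The function name and the business modifier `comment` are hypothetical, and the dictionaries contain only the sample rows listed here, not a complete root dictionary.

```python
from typing import Optional

# Roots copied from the example tables in this section (not exhaustive).
BASE_ROOTS = {"count": "cnt", "ratio": "ratio"}
DATE_MODIFIERS = {"hourly": "h", "daily": "d", "monthly": "m"}
AGGREGATION_MODIFIERS = {"average": "avg", "median": "mid", "top n": "tpn"}


def build_field_name(business: str, metric: str,
                     aggregation: Optional[str] = None,
                     date: Optional[str] = None) -> str:
    """Compose business modifier + metric root [+ aggregation] [+ date modifier]."""
    parts = [business, BASE_ROOTS[metric]]
    if aggregation is not None:
        parts.append(AGGREGATION_MODIFIERS[aggregation])
    if date is not None:
        parts.append(DATE_MODIFIERS[date])
    return "_".join(parts)


if __name__ == "__main__":
    print(build_field_name("comment", "count"))                      # rule e)
    print(build_field_name("comment", "count", date="daily"))        # rule f)
    print(build_field_name("comment", "count", "average", "daily"))  # rule g)
```

An unknown metric or modifier raises `KeyError`, which mirrors the rule that roots should come from the defined dictionary instead of being invented ad hoc.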

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A public opinion data processing system, characterized by comprising a data standardization module, a data warehouse construction module, and a storage and calculation engine selection module; the data standardization module is used for standardizing the data of the system according to a unified standard; the data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer and an application layer; the ODS layer is used for taking snapshots and backups of the real-time message queues of the various data sources and of the files of offline data; the data detail layer is used for performing real-time dimension retrieval on the backed-up data; the data summary layer is used for building wide-table summaries of the detail-layer data and the common metrics; the application layer is used for providing services externally through an RPC framework.

2. The public opinion data processing system according to claim 1, wherein Storm is used in the ODS layer for real-time data processing, and Flink is used as an alternative for association and aggregation operations.

3. The public opinion data processing system according to claim 1, wherein for dimension data the data detail layer selects Elasticsearch as storage, with query latency below 10 ms at 1000+ QPS on a single machine; and for detail data, HBase is selected as storage.

4. The public opinion data processing system according to claim 1, wherein the data summary layer selects Kudu as storage, which supports streaming input with near-real-time availability and time-series access patterns that vary widely.

5. The public opinion data processing system according to claim 1, wherein the calculation engine selection module uses different calculation engines for different business scenarios, as follows:

for metric aggregation on general topics, Impala is used;

for aggregation requiring full-text retrieval, Elasticsearch is used;

for scenarios with large data volumes and relatively high complexity, Hive and Spark are used as offline calculation engines.

6. The public opinion data processing system according to claim 1, wherein the application layer is used for providing detail data query, multi-dimensional full-text retrieval and OLAP analysis; and the application layer provides a caching mechanism for hot data and stores data snapshots that need to be persisted in MongoDB.

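7. The public opinion data processing system according to claim 1, wherein the data standardization module comprises:

a general specification module, used for applying general standards to the data of the system;

a table name specification module, used for standardizing the table names of the system;

a field naming specification module, used for standardizing the field names of the system data.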

8. The public opinion data processing system according to claim 7, wherein the general specification module comprises:

the table name and field name begin with a letter;

the length of the table name and the field name does not exceed 64 characters;

using the keywords in the defined root dictionary;

nonstandard abbreviations are not used in the self-defined root words;

9. The public opinion data processing system according to claim 7, wherein the field naming specification comprises: a naming standard for basic metric word roots, a standard for business modifiers (words describing the business scenario), a date modifier standard and an aggregation modifier standard.

10. The public opinion data processing system according to claim 9, further comprising: