CN112380852A - Public opinion data processing system - Google Patents

Public opinion data processing system

Info

Publication number
CN112380852A
Authority
CN
China
Prior art keywords
data
layer
module
standard
processing system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011264118.7A
Other languages
Chinese (zh)
Inventor
齐中祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Womin High New Science & Technology Beijing Co ltd
Original Assignee
Womin High New Science & Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Womin High New Science & Technology Beijing Co ltd
Priority to CN202011264118.7A
Publication of CN112380852A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a public opinion data processing system comprising a data standardization module, a data warehouse construction module, and a storage and compute engine selection module. The data standardization module standardizes the data of the system according to a unified standard. The data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer, and an application layer. The ODS layer snapshots and backs up the real-time message queues and offline data files of the various data sources; the data detail layer performs real-time dimension retrieval on the backed-up data; the data summary layer aggregates the common metrics of the detail-layer data into wide tables; and the application layer provides services externally through an RPC framework. Standardization shortens development time and reduces effort. Building the data warehouse lowers the barrier to use and unifies data standards, and a more reasonable architecture for supporting the upper layers lets each component be used in the scenario where it performs best, so that the application layer can respond quickly.

Description

Public opinion data processing system
Technical Field
The embodiment of the invention relates to the technical field of radio, in particular to a public opinion data processing system.
Background
As services grow, the amount of data increases day by day. Under these conditions it becomes extremely difficult to locate the required data accurately and to supply correct, consistent, and readable data quickly.
Disclosure of Invention
To solve the above technical problem, an embodiment of the present invention provides a public opinion data processing system. The system applies processing such as format cleaning, feature labeling, and analytical computation to the data collected each day. It takes standardized data naming as its basis, builds a suitable data warehouse model, and selects a compute engine suited to each scenario, thereby reducing coupling between modules and improving the reuse rate, the data processing speed, and the responsiveness of business analysis. The specific technical solution is as follows:
The public opinion data processing system provided by the embodiment of the invention comprises a data standardization module, a data warehouse construction module, and a storage and compute engine selection module. The data standardization module standardizes the data of the system according to a unified standard. The data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer, and an application layer. The ODS layer snapshots and backs up the real-time message queues and offline data files of the various data sources; the data detail layer performs real-time dimension retrieval on the backed-up data; the data summary layer aggregates the common metrics of the detail-layer data into wide tables; and the application layer provides services externally through an RPC framework.
Further, Storm is used for real-time data processing in the ODS layer, and Flink is used in its place for association (join) and aggregation operations.
Further, for dimension data the data detail layer uses Elasticsearch as storage, with query latency below 10 ms at 1000+ QPS on a single machine; for detail data, HBase is used as storage.
Further, the data summary layer uses Kudu as storage, which supports streaming input with near-real-time availability and time-series workloads with widely varying access patterns.
Further, when the compute engine selection module operates, different compute engines are used to handle different business scenarios, as follows:
for metric aggregation on general topics, Impala is used;
for aggregation that requires full-text retrieval, Elasticsearch is used;
for scenarios with large data volume and complex logic, Hive and Spark are used as offline compute engines.
Further, the application layer provides detail data query, multi-dimensional full-text retrieval, and OLAP analysis; it provides a caching mechanism for hot data, and data snapshots that need to be persisted are stored in MongoDB.
Further, the data standardization module comprises:
a general specification module, used to apply the general specification to the data of the system;
a table name specification module, used to standardize the table names of the system;
a field naming specification module, used to standardize the field names of the system data.
Further, the general specification module specifies that:
names are formed from roots separated by underscores, and each part is a lower-case English word;
table names and field names begin with a letter;
table names and field names do not exceed 64 characters in length;
keywords already defined in the root dictionary are used;
custom roots do not use non-standard abbreviations.
The table name specification comprises type, theme, subtopic, meaning, update frequency, and suffix.
Further, the field naming specification includes: a naming specification for basic metric roots, a specification for business modifiers (words that describe the business scenario), a specification for date modifiers, and a specification for aggregation modifiers.
Further, the field naming specification also includes:
naming specification for common metrics: business modifier + basic metric root;
naming specification for date-type metrics: business modifier + basic metric root + date modifier;
naming specification for aggregation metrics: business modifier + basic metric root + aggregation type + date modifier.
The embodiment of the invention provides a public opinion data processing system comprising a data standardization module, a data warehouse construction module, and a storage and compute engine selection module. The data standardization module standardizes the data of the system according to a unified standard. The data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer, and an application layer. The ODS layer snapshots and backs up the real-time message queues and offline data files of the various data sources; the data detail layer performs real-time dimension retrieval on the backed-up data; the data summary layer aggregates the common metrics of the detail-layer data into wide tables; and the application layer provides services externally through an RPC framework. Standardization shortens development time and reduces effort. Building the data warehouse lowers the barrier to use and unifies data standards; components are selected more reasonably to support the upper layers, and each is used in the scenario where it performs best, so that the application layer can respond quickly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, proportions, and sizes shown in this specification serve only to accompany the disclosed content so that those skilled in the art can understand and read it. They do not limit the conditions under which the invention can be implemented and therefore have no substantive technical significance. Any structural modification, change in proportion, or adjustment of size that does not affect the functions and purposes of the invention still falls within the scope of the invention.
Fig. 1 is a schematic block diagram of a public opinion data processing system according to an embodiment of the present invention;
Fig. 2 is a schematic block diagram of a data warehouse of a public opinion data processing system according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a computing engine of a public opinion data processing system according to an embodiment of the present invention.
Detailed Description
The invention is described below by way of particular embodiments; other advantages and features of the invention will become apparent to those skilled in the art from this disclosure. The described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic block diagram of a public opinion data processing system according to an embodiment of the present invention. The system comprises a data standardization module, a data warehouse construction module, and a storage and compute engine selection module. The data standardization module standardizes the data of the system according to a unified standard. The data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer, and an application layer. The ODS layer snapshots and backs up the real-time message queues and offline data files of the various data sources; the data detail layer performs real-time dimension retrieval on the backed-up data; the data summary layer aggregates the common metrics of the detail-layer data into wide tables; and the application layer provides services externally through an RPC framework.
Fig. 2 is a schematic block diagram of the data warehouse of a public opinion data processing system according to an embodiment of the present invention. The warehouse includes the following four layers:
ODS: the real-time message queues of the various data sources, and file snapshots of offline data.
Data detail layer: integrated fact data, with real-time retrieval of dimension data.
Data summary layer: wide tables that summarize the detail data and the common metrics.
Application layer: built for different business requirements; it provides services externally through an RPC framework.
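The four layers can be pictured as a simple dependency chain. The following Python sketch is illustrative only: the patent names the layers but does not state that each layer reads strictly from the one below it, so the `UPSTREAM` mapping and the `lineage` helper are assumptions introduced here.

```python
from __future__ import annotations

from enum import Enum


class Layer(Enum):
    ODS = "ods"                  # raw snapshots of message queues and offline files
    DETAIL = "detail"            # integrated fact data, dimension lookups
    SUMMARY = "summary"          # wide tables of common metrics
    APPLICATION = "application"  # business-facing services exposed over RPC


# Assumption: each layer reads only from the layer directly below it.
UPSTREAM = {
    Layer.DETAIL: Layer.ODS,
    Layer.SUMMARY: Layer.DETAIL,
    Layer.APPLICATION: Layer.SUMMARY,
}


def lineage(layer: Layer) -> list[Layer]:
    """Return the chain of layers from ODS up to the given layer."""
    chain = [layer]
    while chain[-1] in UPSTREAM:
        chain.append(UPSTREAM[chain[-1]])
    return list(reversed(chain))
```

Under this assumption, `lineage(Layer.SUMMARY)` lists the ODS, detail, and summary layers in that order, which is the path a wide-table metric would travel.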
Fig. 3 is a schematic block diagram of the compute engines of a public opinion data processing system according to an embodiment of the present invention.
ETL: In the early stage of building the stream processing platform, Storm was used for real-time data processing, and it performed well in both flexibility and performance. However, because its API is too low-level, common data development operations such as association (joins) and aggregation require additional development, so Flink is used instead for these operations. Flink's latency is close to Storm's, while its throughput is much higher. In addition, Flink's table abstraction and SQL support make the development process more reliable, efficient, and maintainable.
Data detail: For dimension data, Elasticsearch is chosen as storage; query latency is below 10 ms at 1000+ QPS on a single machine. For detail data, HBase is chosen as storage to handle the write-heavy, read-light workload.
Lightly summarized wide tables: For the metrics of general (basic) topics, Kudu is chosen as storage; it supports streaming input with near-real-time availability and time-series workloads with widely varying access patterns.
Compute engines: Different compute engines are used for different business scenarios. For metric aggregation on general topics, Impala is used and responds quickly. For aggregation that requires full-text retrieval, Elasticsearch is used and responds quickly under limited conditions. These two approaches are mainly used in real-time scenarios. When the data volume is large and the logic is relatively complex, aggregation in Impala or Elasticsearch can suffer degraded and unstable performance, so Hive and Spark are used as offline compute engines for such needs.
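The storage and engine choices above amount to a small decision table. The sketch below restates them in Python; the `Query` fields, the string labels, and the rule that the offline check takes priority over the full-text check are assumptions made for illustration, not details given in the patent.

```python
from dataclasses import dataclass

# Storage choice per kind of data, as described for the detail and summary layers.
STORAGE_BY_DATA_KIND = {
    "dimension": "elasticsearch",    # low-latency dimension lookups
    "detail": "hbase",               # write-heavy, read-light detail data
    "summary_wide_table": "kudu",    # streaming input, time-series access
}


@dataclass(frozen=True)
class Query:
    """Hypothetical description of an aggregation request."""
    needs_full_text: bool = False    # aggregation depends on full-text retrieval
    large_or_complex: bool = False   # large data volume or complex logic


def select_engine(query: Query) -> str:
    """Pick a compute engine following the scenarios in the description."""
    if query.large_or_complex:
        # Real-time engines may be slow or unstable here, so go offline.
        return "hive/spark"
    if query.needs_full_text:
        return "elasticsearch"
    # Default: metric aggregation on general topics.
    return "impala"
```

For example, `select_engine(Query(needs_full_text=True))` returns `"elasticsearch"`, while the same query flagged as `large_or_complex` is routed to the offline engines.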
Application layer: The application layer is relatively complex because it must meet the requirements of different businesses. It mainly provides detail data query, multi-dimensional full-text retrieval, OLAP analysis, and similar functions. A caching mechanism is provided for hot data to reduce load and improve efficiency, and the data snapshots that need to be persisted are stored in MongoDB.
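The patent does not say how the hot-data cache works. One common approach is a least-recently-used (LRU) cache placed in front of the slower store; the sketch below shows that idea under that assumption. The class name `HotDataCache`, the `loader` callback, and the capacity value are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class HotDataCache:
    """Minimal LRU cache in front of a slower backing store (illustrative only)."""

    def __init__(self, loader: Callable[[Hashable], Any], capacity: int = 1024):
        self._loader = loader        # called on a cache miss, e.g. a storage query
        self._capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._items:
            # Hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Miss: load from the backing store and remember the result.
        value = self._loader(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value
```

Repeated requests for the same hot key then reach the backing store only once until the entry is evicted.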
In an alternative embodiment of the invention, data standardization underpins the construction of the data warehouse. To avoid duplicated metrics and poor data quality, the warehouse is built according to a unified standard, which prevents repeated development, confusion, and errors.
General specification:
a) Names are formed from roots separated by underscores, and each part is a lower-case English word.
b) Table names and field names must begin with a letter.
c) Table names and field names must not exceed 64 characters in length.
d) Keywords already defined in the root dictionary are used in preference.
e) Custom roots must not use non-standard abbreviations.
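Rules a) to c) can be checked mechanically, and rule d) can be approximated by reporting roots that are missing from the dictionary. The following Python sketch is one way to do this; the contents of `ROOT_DICTIONARY` and the decision to allow digits after the first letter are assumptions, since the patent does not list the dictionary or address digits.

```python
import re

MAX_NAME_LENGTH = 64

# Lower-case parts joined by single underscores; the first character is a letter.
# Assumption: digits are tolerated after the first character.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

# Hypothetical root dictionary; the patent does not give its contents.
ROOT_DICTIONARY = {"cnt", "ratio", "avg", "mid", "tpn", "h", "d", "m"}


def check_name(name: str) -> list:
    """Return a list of rule violations for a table or field name."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 64 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("not lower-case, underscore-separated, letter-first")
    return problems


def unknown_roots(name: str) -> list:
    """List the parts of a name that are not in the root dictionary."""
    return [part for part in name.split("_") if part and part not in ROOT_DICTIONARY]
```

A name such as `order_cnt_d` passes `check_name`, while `unknown_roots` flags `order` as a custom root that a reviewer would need to confirm is not a non-standard abbreviation (rule e).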
Specification of table names
Table name = type + theme + subtopic + meaning + update frequency + suffix; for example:
(Figure BDA0002775573900000061: table image not reproduced)
For example: dwa_scheme_move_terminate_route
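A table name built from these six components can be assembled as follows. This is a minimal sketch: the dataclass, its field names, the sample values, and the choice to treat the suffix as optional are assumptions, because the table in the original figure is not available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableName:
    """Components of a table name, in the order given by the specification."""
    table_type: str
    theme: str
    subtopic: str
    meaning: str
    update_frequency: str
    suffix: str = ""  # assumption: the suffix may be omitted

    def render(self) -> str:
        parts = [
            self.table_type,
            self.theme,
            self.subtopic,
            self.meaning,
            self.update_frequency,
            self.suffix,
        ]
        # Join the non-empty components with underscores.
        return "_".join(part for part in parts if part)
```

With hypothetical values, `TableName("dwa", "news", "comment", "cnt", "d").render()` yields `dwa_news_comment_cnt_d`.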
Specification of field naming
a) Basic metric roots, for example:
Name | English full name | Data type | Precision | Root | Example
Count | count | int | 0 | cnt | 100
Ratio | ratio | float | 4 | ratio | 0.8623
b) Business modifiers: words that describe the business scenario;
c) Date modifiers, for example:
Name | English full name | Root
Hour | hourly | h
Day | daily | d
Month | monthly | m
d) Aggregation modifiers, for example:
Name | English full name | Root
Average | average | avg
Median | median | mid
Top N | top n | tpn
e) Naming specification for common metrics: business modifier + basic metric root;
f) Naming specification for date-type metrics: business modifier + basic metric root + date modifier;
g) Naming specification for aggregation metrics: business modifier + basic metric root + aggregation type + date modifier.
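The three field naming patterns in e) to g) can be expressed as one small helper. In this sketch the root tables come from the examples above, while the function name `field_name`, the business modifier `comment`, and the decision to reject an aggregation without a date modifier are assumptions made for illustration.

```python
from typing import Optional

# Roots taken from the date and aggregation modifier examples.
DATE_ROOTS = {"hourly": "h", "daily": "d", "monthly": "m"}
AGG_ROOTS = {"average": "avg", "median": "mid", "top n": "tpn"}


def field_name(business: str, root: str,
               aggregation: Optional[str] = None,
               date: Optional[str] = None) -> str:
    """Compose a field name: business + root [+ aggregation] [+ date]."""
    if aggregation is not None and date is None:
        # Pattern g) requires a date modifier alongside the aggregation type.
        raise ValueError("aggregation metrics also need a date modifier")
    parts = [business, root]
    if aggregation is not None:
        parts.append(AGG_ROOTS[aggregation])
    if date is not None:
        parts.append(DATE_ROOTS[date])
    return "_".join(parts)
```

With the hypothetical business modifier `comment`, a common metric is `comment_cnt`, a date-type metric is `comment_cnt_d`, and an aggregation metric is `comment_ratio_avg_m`.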
The embodiment of the invention provides a public opinion data processing system comprising a data standardization module, a data warehouse construction module, and a storage and compute engine selection module. The data standardization module standardizes the data of the system according to a unified standard. The data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer, and an application layer. The ODS layer snapshots and backs up the real-time message queues and offline data files of the various data sources; the data detail layer performs real-time dimension retrieval on the backed-up data; the data summary layer aggregates the common metrics of the detail-layer data into wide tables; and the application layer provides services externally through an RPC framework. Standardization shortens development time and reduces effort. Building the data warehouse lowers the barrier to use and unifies data standards; components are selected more reasonably to support the upper layers, and each is used in the scenario where it performs best, so that the application layer can respond quickly.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. A public opinion data processing system, characterized by comprising a data standardization module, a data warehouse construction module, and a storage and compute engine selection module; the data standardization module is used to standardize the data of the system according to a unified standard; the data warehouse construction module comprises an ODS layer, a data detail layer, a data summary layer, and an application layer; the ODS layer is used to snapshot and back up the real-time message queues and offline data files of the various data sources; the data detail layer is used to perform real-time dimension retrieval on the backed-up data; the data summary layer is used to aggregate the common metrics of the detail-layer data into wide tables; the application layer is used to provide services externally through an RPC framework.
2. The public opinion data processing system according to claim 1, wherein Storm is used in the ODS layer for real-time data processing, and Flink is used in its place for association (join) and aggregation operations.
3. The public opinion data processing system according to claim 1, wherein for dimension data the data detail layer uses Elasticsearch as storage, with query latency below 10 ms at 1000+ QPS on a single machine; for detail data, HBase is used as storage.
4. The public opinion data processing system according to claim 1, wherein the data summary layer uses Kudu as storage, which supports streaming input with near-real-time availability and time-series workloads with widely varying access patterns.
5. The public opinion data processing system according to claim 1, wherein when the compute engine selection module operates, different compute engines are used to handle different business scenarios, as follows:
for metric aggregation on general topics, Impala is used;
for aggregation that requires full-text retrieval, Elasticsearch is used;
for scenarios with large data volume and complex logic, Hive and Spark are used as offline compute engines.
6. The public opinion data processing system according to claim 1, wherein the application layer is used to provide detail data query, multi-dimensional full-text retrieval, and OLAP analysis; a caching mechanism is provided for hot data, and data snapshots that need to be persisted are stored in MongoDB.
7. The public opinion data processing system according to claim 1, wherein the data standardization module comprises:
a general specification module, used to apply the general specification to the data of the system;
a table name specification module, used to standardize the table names of the system;
a field naming specification module, used to standardize the field names of the system data.
8. The public opinion data processing system according to claim 7, wherein the general specification module specifies that:
names are formed from roots separated by underscores, and each part is a lower-case English word;
table names and field names begin with a letter;
table names and field names do not exceed 64 characters in length;
keywords already defined in the root dictionary are used;
custom roots do not use non-standard abbreviations;
and the table name specification comprises type, theme, subtopic, meaning, update frequency, and suffix.
9. The public opinion data processing system according to claim 7, wherein the field naming specification comprises: a naming specification for basic metric roots, a specification for business modifiers (words that describe the business scenario), a specification for date modifiers, and a specification for aggregation modifiers.
10. The public opinion data processing system according to claim 9, further comprising:
a naming specification for common metrics: business modifier + basic metric root;
a naming specification for date-type metrics: business modifier + basic metric root + date modifier;
a naming specification for aggregation metrics: business modifier + basic metric root + aggregation type + date modifier.
CN202011264118.7A 2020-11-12 2020-11-12 Public opinion data processing system Pending CN112380852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011264118.7A CN112380852A (en) 2020-11-12 2020-11-12 Public opinion data processing system

Publications (1)

Publication Number Publication Date
CN112380852A true CN112380852A (en) 2021-02-19

Family

ID=74583491

Country Status (1)

Country Link
CN (1) CN112380852A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283194A1 (en) * 2005-11-12 2007-12-06 Phillip Villella Log collection, structuring and processing
CN106779580A (en) * 2016-11-17 2017-05-31 中知厚德知识产权投资管理(天津)有限公司 Multi-level intellectual property data system
CN110134717A (en) * 2019-05-07 2019-08-16 浙江省科技信息研究院 Research funding system data query system
CN111694810A (en) * 2019-03-12 2020-09-22 阿里巴巴集团控股有限公司 Data warehouse creation method and device, electronic equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
伟伦 等: "美团基于Flink的实时数仓建设实践", pages 1 - 5, Retrieved from the Internet <URL:https://tech.meituan.com/2018/10/18/meishi-data-flink.html> *
马立和 等著: "教育大数据视域下的智慧校园建设与应用研究", 31 July 2020, 机械工业出版社, pages: 294 - 297 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656370A (en) * 2021-08-16 2021-11-16 南方电网数字电网研究院有限公司 Data processing method and device for power measurement system and computer equipment
CN113656370B (en) * 2021-08-16 2024-04-30 南方电网数字电网集团有限公司 Data processing method and device for electric power measurement system and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination