CN115309749A

CN115309749A - Big data experiment system for scientific and technological service

Info

Publication number: CN115309749A
Application number: CN202211030785.8A
Authority: CN
Inventors: 费敏锐; 李晨辉; 周文举; 徐昱琳; 王海宽; 易开祥; 吕泽昊; 沈赟怡
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2022-08-26
Filing date: 2022-08-26
Publication date: 2022-11-08

Abstract

The invention relates to a big data experiment system for scientific and technological services, which comprises a data import module, a data cleaning module, a data warehouse construction module and a data association analysis module. The data import module imports the user behavior logs of a scientific and technological service platform, together with the data of its business database, into a distributed file storage system; the data cleaning module extracts, transforms and loads the data in the distributed file storage system; the data warehouse construction module builds a star model based on the relationship between the fact table and the dimension tables and organizes the data warehouse into layers; the data association analysis module applies an efficient association processing technology to the data warehouse to support multi-dimensional, highly associated data analysis operations in online analytical processing (OLAP). By covering the full data management life cycle of the scientific and technological service platform (acquisition, cleaning, storage and analysis), the invention meets the requirement of sub-second analytical queries in the platform's large-scale OLAP production environment.

Description

Big data experiment system for scientific and technological service

Technical Field

The invention relates to the technical field of scientific and technological services, in particular to the technical field of big data of scientific and technological service platforms, and specifically relates to a big data experiment system for scientific and technological services.

Background

In recent years, China's scientific and technological service industry has entered a new period of steady growth, and scientific and technological service platforms have emerged in many regions. However, in the early stage of development such platforms are limited in influence and in the number of user enterprises, so most of them consist mainly of a platform-based front-end architecture. As the influence of a platform grows through long-term operation and a large number of user enterprises bring in their scientific and technological resources, more and more data is generated, including user behavior data and business system data. Traditional relational databases have difficulty handling the association operations and storage of such large-scale data, so the further development of scientific and technological service platforms depends on big data technology. Meanwhile, the growth of the data and its multi-source, heterogeneous nature make data management more difficult. More importantly, data should flow in an orderly manner through the whole analysis process so that its full life cycle can be clearly understood and used, and the layered construction of a scientific and technological service data warehouse plays an important role in this. A scientific and technological service big data platform should therefore support complex analysis operations, emphasize decision support, provide intuitive query results, and mine the value of the data to provide a decision basis for the operation of the platform.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a big data experiment system for scientific and technological services, which has the advantages of abundant data, relative stability and wider application range.

In order to achieve the above object, the big data experiment system for scientific and technological services of the present invention is as follows:

In view of the above technical problems, the invention discloses a big data experiment system for scientific and technological services, which comprises a data import module, a data cleaning module, a data warehouse construction module and a data association analysis module,

the data import module is connected with the scientific and technological service platform and is used for collecting the user behavior logs of the scientific and technological service platform into the distributed storage system and for importing the data of the business database of the scientific and technological service platform into the distributed storage system;

the data cleaning module is connected with the data import module and is used for extracting, converting and loading import data in the distributed file storage system;

the data warehouse construction module is connected with the data cleaning module and is used for building a star model based on the relationship between the fact table and the dimension tables of the data and for organizing the data warehouse into layers;

and the data association analysis module is connected with the data warehouse construction module and is used for performing association processing on the established data warehouse and performing multi-dimensional and high-association data analysis operation in online analysis processing.

Preferably, the data importing module includes:

the event-tracking (buried point) trigger program part is connected with the data cleaning module and is used for instrumenting key events in the scientific and technological service platform, implanting both a front-end event-tracking trigger unit based on the JS SDK and a back-end event trigger unit based on the Java SDK to track user behavior;

the data acquisition part is connected with the data cleaning module and is used for collecting behavior log data into the distributed file storage system in real time through the Flume tool and loading the behavior log data into the data warehouse in batches;

and the data import part is connected with the data cleaning module and is used for directly importing the relational service database data comprising a scientific and technological resource data table, a user information table and an order information table into a data warehouse through a Sqoop tool.

Preferably, the data warehouse construction module comprises an original data layer, a detailed data layer, a service data layer and an application data layer,

the original data layer stores original data including user behavior log data and system service data;

the detailed data layer stores cleaned data, including data obtained by performing validity checks and duplicate filtering on the original data and data obtained by performing dimension reduction and degeneration on the resource classification table;

the service data layer stores lightly aggregated data and obtains preliminary results from the tables related by business association;

and the application data layer stores related data according to specific required services.

Preferably, the data association analysis module includes:

the correlation calculation engine unit is used for performing association operations on the table data among all layers of the scientific and technological service data warehouse;

the analysis business unit is used for performing correlation optimization based on an OLAP analysis type data warehouse tool engine, and performing related statistical work and data mining tasks according to actual business;

and the service display unit is used for providing lightweight data query and visualization for the data result service and providing a query billboard and a graphical interface.

By adopting the big data experiment system for scientific and technological service, the invention has the following beneficial effects:

1. The invention is built around a scientific and technological service platform and aims to support the development of that platform. The experiment system targets four industries (modern services, marine economy, cultural tourism and international medical care), brings together diversified scientific and technological service resources, and provides nine services for government departments, parks, enterprises and universities: research and development, intellectual property, scientific and technological finance, technology transfer, entrepreneurship incubation, scientific and technological consulting, scientific and technological popularization, inspection, testing and certification, and comprehensive scientific and technological services. It thereby supports the high-quality development of the scientific and technological service platform and of the scientific and technological service industry.

2. The method has rich data sources, is not limited to the business database, collects more valuable user behavior data into the system, and meets the requirements of a scientific and technological service platform on multi-source heterogeneous data. Meanwhile, the data can be orderly circulated in the subsequent processes of cleaning, storing, analyzing and the like, and the sub-second level analysis query can be achieved in the large-scale data OLAP production environment of a science and technology service platform.

3. The scientific and technological service data warehouse is built on the scientific and technological service platform and supports the management decision process of the platform. The data warehouse is managed in integrated layers, is relatively stable, reflects historical changes and is oriented to scientific and technological service topics. It supports complex analysis operations, provides intuitive query results, mines the value of the data and explores the potential needs of users.

Drawings

FIG. 1 is a schematic diagram illustrating an operation process of a big data experiment system for scientific and technical services according to the present invention.

FIG. 2 is a diagram of a data acquisition architecture of a big data experiment system for scientific and technical services according to the present invention.

FIG. 3 is a dimensional modeling diagram of a big data experiment system for scientific and technical services according to the present invention.

FIG. 4 is a diagram of a scientific and technological service data warehouse hierarchy of the big data experiment system for scientific and technological services of the present invention.

Detailed Description

In order to more clearly describe the technical contents of the present invention, the following further description is given in conjunction with specific embodiments.

The big data experiment system for scientific and technological service comprises a data import module, a data cleaning module, a data warehouse construction module and a data association analysis module,

As a preferred embodiment of the present invention, the data importing module includes:

and the data import part is connected with the data cleaning module and is used for directly importing the relational service database data comprising a scientific and technological resource data table, a user information table and an order information table into a data warehouse through an Sqoop tool.

In a preferred embodiment of the present invention, the data warehouse construction module comprises an original data layer, a detail data layer, a service data layer and an application data layer,

As a preferred embodiment of the present invention, the data association analysis module includes:

The invention provides a service recommendation experiment system and method for scientific and technological services, which comprises a data import module, a data processing module, a data cleaning module, a data warehouse construction module and a data association analysis module. To further clarify the technical solution, the invention is explained below with reference to the accompanying drawings. As shown in FIG. 1, the invention is implemented according to the following steps:

First, import the scientific and technological service resource data.

The resource data includes user behavior data and relational business database data, each of which is described in detail below.

1. User behavior data collection

Step 1: Set up a front-end event-tracking (buried point) trigger program based on the JS SDK to collect basic user information, region information, browser information, external link data, order information and the like. The tracked events mainly comprise Launch, pageview, chargeRequest and Event events.

Step 2: Set up a back-end event trigger program based on the Java SDK. For orders generated by the scientific and technological service platform, it sends payment-success and refund-success information to the Nginx server; the events comprise the chargeSuccess and chargeRefund events.

Step 3: Configure Nginx local log storage so that the behavior data of users browsing the scientific and technological service platform is stored in an access.

Step 4: Configure Flume to collect the user behavior logs. As shown in FIG. 2, the behavior logs generated by users are collected by Flume in real time into HDFS, the distributed file storage system.
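As a rough illustration of how a tracked event might travel from an SDK to the Nginx log, the sketch below encodes one event as a URL query string and decodes it again on the collection side. The source does not specify the request format, so the path "/log.gif", the parameter names "en", "uid" and "ts", and both helper functions are assumptions made only for this example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Event names listed in the description (front-end and back-end SDKs).
EVENT_TYPES = {"Launch", "pageview", "chargeRequest", "Event",
               "chargeSuccess", "chargeRefund"}


def build_tracking_url(event: str, user_id: str, timestamp_ms: int) -> str:
    """Encode one tracked event as a request path (hypothetical format)."""
    if event not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event}")
    query = urlencode({"en": event, "uid": user_id, "ts": timestamp_ms})
    return f"/log.gif?{query}"


def parse_tracking_url(url: str) -> dict:
    """Recover the event fields from a logged request path."""
    params = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in params.items()}


url = build_tracking_url("pageview", "u42", 1661500800000)
fields = parse_tracking_url(url)
```

Carrying the event in the query string means the web server's ordinary access log already contains everything the collector needs, which is one plausible reason to route events through Nginx.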

2. Relational business database data import

The relational business database data comprises a scientific and technological resource data table, a user information table and an order information table, which are imported directly into Hive through the Sqoop tool. The metadata formats of these tables are shown in Tables 1, 2 and 3.

Table 1 Scientific and technological resource data table

Field name	Field type	Field description
servi	boolean	Service type, e.g. 0 (supply), 1 (demand)
sid	int	ID of the scientific and technological resource
name	string	Name of the supplied/demanded scientific and technological resource
descri	string	Description of the scientific and technological resource
indu	string	Industry of the scientific and technological resource, e.g. cultural tourism
type	string	Type of the scientific and technological resource, e.g. research and development
issue	string	Release time of the scientific and technological resource
price	double	Price of the scientific and technological resource
ins	string	Name of the organization supplying/demanding the resource

Table 2 User information table

Field name	Field type	Field description
uid	int	User ID
uname	string	User name
uins	string	Organization of the user
city	string	Region of the user
uiph	string	User mobile phone number
passward	string	User password (irreversibly encrypted)
first	boolean	Whether this is the first login, e.g. 0 (no), 1 (yes)
time	long	User creation time (timestamp format)

Table 3 Order information table

Field name	Field type	Field description
oid	string	Order ID
order	string	Name of the order
cua	string	Payment amount
pm	string	Payment mode
uid	int	User ID
uname	string	User name
sid	int	ID of the scientific and technological resource
name	string	Name of the supplied/demanded scientific and technological resource
otime	long	Order creation time
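To make the metadata in Table 3 concrete, the following sketch turns the field list into a Hive-style CREATE TABLE statement. The type mapping (in particular long to BIGINT), the table name "ods_order_info" and the helper hive_ddl are assumptions for illustration; the source only states that the tables are imported into Hive through Sqoop.

```python
# Map the field types used in Tables 1-3 to Hive column types (assumed mapping).
HIVE_TYPES = {"boolean": "BOOLEAN", "int": "INT", "string": "STRING",
              "double": "DOUBLE", "long": "BIGINT"}

# (field name, field type) pairs copied from Table 3.
ORDER_FIELDS = [
    ("oid", "string"), ("order", "string"), ("cua", "string"),
    ("pm", "string"), ("uid", "int"), ("uname", "string"),
    ("sid", "int"), ("name", "string"), ("otime", "long"),
]


def hive_ddl(table: str, fields: list) -> str:
    """Render a CREATE TABLE statement from (name, type) pairs."""
    # Column names are back-quoted because some of them (e.g. "order")
    # collide with SQL keywords.
    cols = ",\n".join(f"  `{name}` {HIVE_TYPES[ftype]}" for name, ftype in fields)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n)"


ddl = hive_ddl("ods_order_info", ORDER_FIELDS)  # hypothetical table name
print(ddl)
```

Generating the statement from the field list keeps the Hive schema in step with the documented metadata; the same helper could be fed the field lists of Tables 1 and 2.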

Second, clean the data.

Step 1: The user behavior log data collected in HDFS is split by separators, and records that do not meet the requirements are filtered out, for example records whose event parameter value does not belong to the 6 event types.

Step 2: Conversion and extraction operations are performed on the relevant data, such as converting IP addresses into regions, converting timestamps into readable time expressions, and extracting browser-related information.

Step 3: The behavior log in HDFS is imported directly into Hive; the format of the Hive metadata is shown in Table 4.

Table 4 user behavior log table

Step 4: Feature engineering is established with a view to subsequent data analysis and mining, including building an implicit rating model of users for scientific and technological service products and performing one-hot encoding on data labels.
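The cleaning steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the tab separator, the three-field record layout (IP, millisecond timestamp, event name) and the helper names clean_line and one_hot are hypothetical, and the IP-to-region conversion is left out because it needs an external lookup database.

```python
from datetime import datetime, timezone
from typing import Optional

# The six event types named in the data collection steps.
VALID_EVENTS = {"Launch", "pageview", "chargeRequest", "Event",
                "chargeSuccess", "chargeRefund"}
SEPARATOR = "\t"  # assumed field separator


def clean_line(line: str) -> Optional[dict]:
    """Split one raw log line; return None if it should be filtered out."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != 3:
        return None  # malformed record
    ip, ts_ms, event = parts
    if event not in VALID_EVENTS or not ts_ms.isdigit():
        return None  # unknown event type or bad timestamp
    when = datetime.fromtimestamp(int(ts_ms) / 1000, tz=timezone.utc)
    return {"ip": ip, "event": event,
            "time": when.strftime("%Y-%m-%d %H:%M:%S")}


def one_hot(label: str, vocabulary: list) -> list:
    """Encode a label as a 0/1 vector over a fixed vocabulary."""
    return [1 if label == item else 0 for item in vocabulary]


raw = ["10.0.0.1\t1661500800000\tpageview",
       "10.0.0.2\t1661500801000\tunknownEvent",
       "broken line"]
cleaned = [rec for rec in map(clean_line, raw) if rec is not None]
```

Only the first sample record survives: the second carries an event outside the six known types and the third cannot be split into the expected fields.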

Third, build the scientific and technological service data warehouse.

The star model is built based on the relationships between the fact table and the dimension tables of the data, as shown in FIG. 3. The fact table is the platform order table, which comprises fields such as order ID, time, region, resource, resource/demand party, order quantity, order amount and payment mode. The information on time, region, resource and resource/demand party can be expanded in detail, so a dimension table is established for each of them, giving a star model in which all dimension tables connect directly to the central fact table. Although this model introduces some data redundancy, it has low complexity, is easy to understand, is inexpensive to maintain and offers high association analysis performance.
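A small star model of this kind can be sketched with an in-memory SQLite database, used here purely as a stand-in for Hive. The table names (fact_order, dim_time, dim_region), the columns and the sample rows are assumptions for illustration, and only two of the four dimensions are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time   (time_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_order (
    order_id  TEXT PRIMARY KEY,
    time_id   INTEGER REFERENCES dim_time(time_id),
    region_id INTEGER REFERENCES dim_region(region_id),
    amount    REAL
);
INSERT INTO dim_time   VALUES (1, '2022-08-26'), (2, '2022-08-27');
INSERT INTO dim_region VALUES (1, 'Shanghai'), (2, 'Hangzhou');
INSERT INTO fact_order VALUES
    ('o1', 1, 1, 100.0), ('o2', 1, 2, 50.0), ('o3', 2, 1, 25.0);
""")

# Every dimension joins to the fact table directly: one hop per dimension.
rows = conn.execute("""
    SELECT r.city, SUM(f.amount)
    FROM fact_order AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.city
    ORDER BY r.city
""").fetchall()
print(rows)
```

Because no dimension table references another dimension table, any analysis query needs at most one join per dimension, which is the low-complexity property the description attributes to the star model.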

FIG. 4 is a layered architecture diagram of the scientific and technological service data warehouse of the invention, which includes an original data layer (ODS), a detailed data layer (DWD), a service data layer (DWS) and an application data layer (ADS). A layered data warehouse architecture gives the platform a clear data structure, isolates the original data, increases the reusability of data and breaks complex problems into simpler ones.

The ODS layer stores original data, including user behavior log data and system business data. Its fields are identical to those in HDFS and in the business database, and the business database data is imported into Hive using the Sqoop tool. Table names in the ODS layer are formed by adding the prefix ODS_ to the original table name.

The DWD layer stores cleaned data, obtained by performing validity checks and duplicate filtering on the ODS layer data and by performing dimension reduction and degeneration on the resource classification table; the specific cleaning process is described in the second step above. Table names in the DWD layer are formed by adding the prefix DWD_ to the original table name.

The DWS layer stores lightly aggregated data. For example, the total order amount and number of orders are obtained from the order table dwd_order_com, the payment amount and number of payments are obtained from the order base table dwd_event_com, and the click records are obtained from the user click table dwd_user_log; finally, a wide table is obtained by aggregating on user_id. Table names in the DWS layer are formed by adding the prefix DWS_ to the original table name.

The ADS layer stores data organized according to specific business requirements. For the total transaction amount, for example, it is sufficient to group the user behavior wide table DWS_user_action of the DWS layer by statistical date and apply the sum function; the final result is exported to the business database to facilitate subsequent data visualization. Table names in the ADS layer are formed by adding the prefix ADS_ to the original table name.
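The flow from the DWD layer through the DWS layer to the ADS layer can be illustrated with plain Python dictionaries. The record fields and sample values are assumptions for illustration; in the described system these steps would be run as Hive queries over the layered tables.

```python
from collections import defaultdict

# DWD-level order records (illustrative values).
dwd_orders = [
    {"user_id": 1, "date": "2022-08-26", "amount": 100.0},
    {"user_id": 1, "date": "2022-08-26", "amount": 50.0},
    {"user_id": 2, "date": "2022-08-27", "amount": 25.0},
]

# DWS: light aggregation per user and day (order count and total amount).
dws_user_action = defaultdict(lambda: {"order_count": 0, "order_amount": 0.0})
for row in dwd_orders:
    key = (row["user_id"], row["date"])
    dws_user_action[key]["order_count"] += 1
    dws_user_action[key]["order_amount"] += row["amount"]

# ADS: total transaction amount grouped by statistical date.
ads_total_by_date = defaultdict(float)
for (_, date), agg in dws_user_action.items():
    ads_total_by_date[date] += agg["order_amount"]

print(dict(ads_total_by_date))
```

The ADS step reads only the already aggregated DWS rows, never the detailed records, which is how layering lets several application tables reuse the same intermediate results.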

The data warehouse is subjected to layered integrated management, is relatively stable, reflects historical changes and is oriented to scientific and technical service subjects, supports complex analysis operation, provides visual query results, simultaneously mines data values and explores potential requirements of users.

Fourth, establish the data association analysis service.

Most of the business analysis of the invention can be completed through the layered construction of the data warehouse. For the complex multi-dimensional association operations on table data among the layers of the scientific and technological service data warehouse, however, the invention relies on the Kylin framework engine. The general flow is as follows: (1) the metrics required for multi-dimensional analysis are pre-computed; (2) operations such as high-dimensional multi-table joins and aggregations are converted into lookups of pre-computed results; (3) the pre-computed results are stored in a distributed storage system for quick access at query time. By trading space for time in this way, the data association analysis service module achieves fast queries and high concurrency, supports multi-dimensional, highly associated OLAP operations such as drill-down, roll-up, slicing and pivoting, and meets the requirement of sub-second analytical queries over massive data.

The analysis business part performs association optimization based on an OLAP analytical data warehouse tool engine and carries out statistical work and data mining tasks according to the actual business. The platform's statistical work includes transaction statistics, active-user retention, and daily/weekly/monthly resource popularity rankings, all of which involve multi-table association queries. The platform's data mining tasks include user portrait generation, click-based recommendation, precise push, resource matching and resource similarity analysis, all of which involve machine learning.

The business display part is based on the BI tool Superset and provides a lightweight data query and visualization scheme for the data results. The operation flow is as follows: (1) log in to Superset; (2) click the data source; (3) select a database; (4) add a MySQL data source; (5) add a database table; (6) edit the table format; (7) draw a chart. It provides visual query dashboards and a graphical interface, and provides a decision basis for the operation of the scientific and technological service platform.
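The pre-computation idea can be sketched in a few lines of Python. This is a simplified illustration of trading space for time, not Kylin's actual implementation: the dimension names, the sample facts and the functions build_cube and query are assumptions, and only a SUM measure with equality filters is supported.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("date", "region", "industry")
facts = [
    {"date": "2022-08-26", "region": "Shanghai",
     "industry": "cultural tourism", "amount": 100.0},
    {"date": "2022-08-26", "region": "Hangzhou",
     "industry": "cultural tourism", "amount": 50.0},
    {"date": "2022-08-27", "region": "Shanghai",
     "industry": "marine economy", "amount": 25.0},
]


def build_cube(rows):
    """Pre-aggregate SUM(amount) for every subset of dimensions (cuboid)."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            cuboid = defaultdict(float)
            for row in rows:
                cuboid[tuple(row[d] for d in dims)] += row["amount"]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, **filters):
    """Answer SUM(amount) for equality filters with one dictionary lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube[dims].get(tuple(filters[d] for d in dims), 0.0)


cube = build_cube(facts)
print(query(cube, region="Shanghai"))
```

With three dimensions the cube holds 2^3 = 8 cuboids, so storage grows quickly with the number of dimensions, but each query becomes a lookup instead of a scan and join over the fact data.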

For a specific implementation of this embodiment, reference may be made to the relevant description in the above embodiments, which is not described herein again.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that, in the description of the present invention, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution device. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, and the corresponding program may be stored in a computer readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

In this specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims

1. A big data experiment system for scientific and technological services is characterized by comprising a data import module, a data cleaning module, a data warehouse construction module and a data association analysis module,

the data import module is connected with the scientific and technological service platform and is used for collecting the user behavior logs of the scientific and technological service platform into the distributed storage system and for importing the data of the business database of the scientific and technological service platform into the distributed storage system;

2. The big data experiment system for scientific and technological services as claimed in claim 1, wherein said data import module includes:

3. The big data experiment system for scientific and technological services as claimed in claim 1, wherein the data warehouse building module includes a raw data layer, a detailed data layer, a service data layer and an application data layer,

4. A big data experiment system for scientific and technological services according to claim 1, wherein the data association analysis module comprises:

the correlation calculation engine unit is used for performing association operations on the table data among the layers of the scientific and technological service data warehouse;