CN117370325B - Data center system based on big data acquisition and analysis - Google Patents
- Publication number
- CN117370325B CN117370325B CN202311356968.3A CN202311356968A CN117370325B CN 117370325 B CN117370325 B CN 117370325B CN 202311356968 A CN202311356968 A CN 202311356968A CN 117370325 B CN117370325 B CN 117370325B
- Authority
- CN
- China
- Prior art keywords
- data
- characters
- integrated
- abnormal
- name
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/2219—Large Object storage; Management thereof
- G06F16/2468—Fuzzy queries
- G06F16/25—Integrating or interfacing systems involving database management systems
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a data center system based on big data acquisition and analysis, in the field of data communication processing. The system comprises: a data acquisition module connected with a plurality of island databases to acquire their data information; a data cleaning module for cleaning the data acquired by the data acquisition module; a data integration module for integrating the cleaned multi-source data; and a server for storing the integrated data. The invention analyzes abnormal data and then matches it, by stepwise narrowing, to the item type to which it originally belonged, so the abnormal data can be matched automatically or filled in manually, ensuring data integrity during integration. The integrated data is sampled by a random method and checked for accuracy in a bidirectional verification mode, ensuring the accuracy of the integrated data; in addition, omissions found during the check are filled in automatically, further increasing accuracy.
Description
Technical Field
The invention relates to the field of data communication processing, in particular to a data center system based on big data acquisition and analysis.
Background
The data center system solves the problem of "data islands": massive data is collected, calculated, stored and processed through data technology, unified to a common standard and caliber to form standard data, and stored to form a big data resource layer that provides efficient services for clients.
For example, a search found that the patent with Chinese patent publication number CN115168474B discloses a method for constructing an Internet of Things center system based on a big data model, comprising the following steps: (S1) selecting the message middleware Kafka as an intermediate bridge between data acquisition and the Internet of Things platform, used to receive device data accessed by the Internet of Things sensing system; (S2) selecting the Flink distributed data processing engine to clean and filter the data of different devices received by Kafka, and to perform rule matching; (S3) selecting the distributed computing engine Spark lot to extract IoTDB data of different devices; (S4) selecting the Atlas tool to construct a metadata management system managing business metadata, technical metadata and operation metadata; and (S5) performing secondary development of AJ-Report for visual display of data reports.
The above patent suffers from the following disadvantage: when island data is entered, staff misoperation can cause position errors, i.e. data is recorded outside the row or column of its correct position, so the data is not stored under its type; during integration and cleaning it is then cleaned out directly as dirty data, leaving the integrated data incomplete.
Therefore, the invention provides a data center system based on big data acquisition and analysis.
Disclosure of Invention
The invention aims to overcome the defects in the prior art by providing a data center system based on big data acquisition and analysis.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
Preferably: in the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
Preferably: in the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
Preferably: in the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
Preferably: in S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
Preferably: the data center system also comprises a data verification module.
Preferably: the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
Preferably: the data center system also comprises a data retrieval module for user retrieval and display of data.
Preferably: the data retrieval module adopts a fuzzy index algorithm based on the data type name, and its working logic specifically comprises the following steps:
C1: the user inputs a part of a data type name containing g characters;
C2: the item type names in the integrated data packet that contain the input character segment are retrieved and returned to the user's client in descending order of matching degree.
In the step C2, the matching degree calculation method comprises the following steps:
C21: if the characters input by the user are completely contained in the item type name of the integrated data packet, a first formula is adopted, where a is the number of input characters identical to characters of the item type name and c is the total number of characters of the item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not completely contained in it, a second formula is adopted, where a is the number of input characters identical to characters of the item type name, b is the number of input characters differing from them, c is the total number of characters of the item type name, and k is a proportionality coefficient preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, a third formula is adopted, where a is the number of identical input characters, b is the number of differing input characters, c is the total number of characters of the item type name, d is the difference between the number of input characters and the total number of characters of the item type name, and k and δ are proportionality coefficients preset by an administrator.
Preferably: the step C21 further includes a step D: after the matching results are fully displayed, the next operation of the user is acquired and the user's input habit is learned, specifically comprising:
D1: acquiring the item type name of the data packet opened by the user, and dividing it equally into three segments;
D2: judging in which of the three segments the characters input by the user are located, and marking the user as "prefix", "middle" or "suffix" according to that position;
D3: matching the user's user name; when the user retrieves next time, the retrieval order is optimized according to the user's mark, i.e. if the user is marked "prefix", the user's input characters are preferentially matched against the first segment of the three-segment item type name.
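Purely as an illustration, the segment-marking idea of D1-D3 can be sketched as follows; the helper name and the fallback label are assumptions, not part of the invention:

```python
# Minimal sketch of D1-D3: split the opened item type name into three equal
# segments and label which segment the user's query fell in; the label can
# then bias the next retrieval. All names here are invented.

def segment_label(opened_name, query):
    """Return 'prefix' / 'middle' / 'suffix' for the segment holding the query."""
    third = max(1, len(opened_name) // 3)     # D1: three equal segments
    segments = [opened_name[:third],
                opened_name[third:2 * third],
                opened_name[2 * third:]]
    for seg, label in zip(segments, ("prefix", "middle", "suffix")):
        if query in seg:                      # D2: locate the input characters
            return label
    return "prefix"                           # fallback when the query spans segments
```

On the next retrieval, results whose match sits in the user's labeled segment would simply be ranked first.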
The beneficial effects of the invention are as follows:
1. The invention analyzes the abnormal data and then matches it, by stepwise narrowing of the range, to the item type to which it originally belonged; the abnormal data can be matched automatically or filled in with manual participation, so data integrity during integration is ensured.
2. The invention extracts the integrated data by a random method and then tests data accuracy in a bidirectional verification mode, ensuring the accuracy of the integrated data; in addition, omissions found during the test are filled in automatically, further increasing accuracy.
3. The invention adopts a fuzzy index algorithm based on the data type name for user retrieval, matching different calculation modes according to the relation between the characters input by the user and the characters of the item type name, so the accuracy of result matching is ensured in each case.
4. According to the user's operation or behavior habit after the results are matched, the user is marked by character segment according to that habit, and the user's next retrieval is matched according to the mark, which improves retrieval accuracy and increases response speed.
Drawings
FIG. 1 is a system architecture diagram of a data center system based on big data acquisition and analysis according to the present invention;
fig. 2 is a logic diagram of the operation of the data center system based on big data acquisition and analysis according to the present invention.
Detailed Description
The technical scheme of the patent is further described in detail below with reference to the specific embodiments.
In the description of this patent, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "disposed" are to be construed broadly, and may be fixedly connected, disposed, detachably connected, disposed, or integrally connected, disposed, for example. The specific meaning of the terms in this patent will be understood by those of ordinary skill in the art as the case may be.
Example 1:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
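A rough, non-authoritative sketch of the stepwise-narrowing rematch (S31b-S33b with cases A1-A2) might look like this; the composition classifier is deliberately simplified and every helper name is invented:

```python
# Sketch of S31b-S33b / A1-A2: classify the character composition of the stray
# value, narrow to type names whose data share that composition, then fill
# directly only when exactly one candidate has a single blank cell.

def composition(value):
    """Classify a value's character make-up (stand-in for the patent's types)."""
    s = str(value)
    if s.isdigit():
        return "digits"
    if s.isalpha():
        return "letters"
    return "mixed"

def rematch(abnormal_value, island_db):
    """island_db: type name -> list of cells, with None marking a blank cell."""
    comp = composition(abnormal_value)
    # S32b: narrow to type names whose existing data share the composition type
    candidates = {
        name: cells for name, cells in island_db.items()
        if any(cell is not None and composition(cell) == comp for cell in cells)
    }
    if not candidates:
        return None                 # zero matches: treated as dirty data
    # S33b / A1: exactly one candidate with one discontinuity -> fill directly
    gapped = [n for n, cells in candidates.items() if cells.count(None) == 1]
    if len(gapped) == 1:
        return gapped[0]
    return "manual"                 # A2: pack and return for traceability filling

db = {"ages": [34, None, 29], "names": ["li", "wang", "zhao"]}
```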
In this embodiment, by analyzing the abnormal data and then matching it, in a stepwise-narrowing form, to the item type to which it originally belonged, the abnormal data can be matched automatically and can also be filled in with manual participation, so data integrity during integration is ensured.
Example 2:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
In this embodiment, the integrated data is extracted by a random method and its accuracy is then checked in a bidirectional verification mode, ensuring the accuracy of the integrated data; in addition, omissions found during the check are filled in automatically, further increasing accuracy.
Example 3:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
The data center system further comprises a data retrieval module for users to retrieve and display data; the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic specifically comprises the following steps:
C1: the user inputs a partial data type name containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are fully contained in an item type name of the integrated data packet, the matching degree is calculated from a, the number of input characters matching characters of the integrated data packet's item type name, and c, the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not fully contained in it, the matching degree is calculated from a, the number of input characters matching characters of the item type name, b, the number of input characters differing from it, and c, the total number of characters of the name, together with a proportionality coefficient k preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, the matching degree is calculated from a, b, and c as defined above, together with d, the difference between the number of input characters and the total number of characters of the name; the proportionality coefficients k and δ are preset by an administrator.
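The formula images for C21-C23 did not survive into this text, so the exact expressions are unknown; the sketch below is only one plausible reading that respects the variable definitions (a, b, c, d) and the three-way case split, with the specific arithmetic, the default coefficient values, and the function name all assumed.

```python
def matching_degree(query, name, k=0.5, delta=0.3):
    """Hypothetical matching degree for cases C21-C23.
    a: input characters also present in the item type name;
    b: input characters absent from it; c: total characters of the name;
    d: number of input characters in excess of the name's length.
    The exact patented formulas are not recoverable; these expressions
    merely illustrate the case split."""
    a = sum(1 for ch in query if ch in name)
    b = len(query) - a
    c = len(name)
    if query in name:            # C21: input fully contained in the name
        return a / c
    if len(query) < c:           # C22: shorter input, not fully contained
        return (a - k * b) / c
    d = len(query) - c           # C23: input longer than the name
    return (a - k * b - delta * d) / c
```

Under any such reading, exact containment scores highest and extra or mismatched characters are penalized, which matches the descending-order return in C2.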
In this embodiment, a fuzzy index algorithm based on data type names is adopted for user retrieval, and different calculation modes are selected according to the relationship between the characters input by the user and the characters of the item type name, ensuring accurate result matching in each case.
Example 4:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, retaining the data that survives cleaning as standard data and treating the remaining data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates it a second time, and merges it into the fusion data packet;
S4: the integrated fusion data packet is uploaded to the server for storage.
In step S2, the data cleaning comprises the following steps:
S21: scanning the acquired data to form m data packets of different sources, and classifying each packet according to the classification scheme of its source data to form a plurality of typed data sets S(m,n), where S(m,n) denotes the data set of the n-th type in the m-th data packet, while acquiring the type name of each data set;
S22: stripping all typed data sets out of the data packets to form the standard data; the remainder is the abnormal data.
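The cleaning step S21-S22 can be sketched as below. The representation of a "data packet" as a dict of type name to raw values, and the rule that a gap (None or a blank string, standing in for spaces, blank lines, and blank cells) marks a value abnormal, are both simplifying assumptions for illustration.

```python
def clean_packets(packets):
    """packets: list of dicts mapping a type name to a list of raw values,
    one dict per source (S21). Returns (standard, abnormal): standard maps
    type name -> clean values; abnormal collects (type name, value) pairs
    that carry gaps (S22)."""
    standard, abnormal = {}, []
    for packet in packets:
        for type_name, values in packet.items():
            for v in values:
                if v is None or (isinstance(v, str) and not v.strip()):
                    abnormal.append((type_name, v))   # gap -> abnormal data
                else:
                    standard.setdefault(type_name, []).append(v)
    return standard, abnormal
```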
In step S3, the data integration method comprises:
S31a: taking the union of the type names of the acquired data sets to obtain the set QA of all type names, QA = {q(1), q(2), …, q(T)}, where q(t) denotes the t-th type name after the union;
S32a: filling the data of the m data packets whose type name matches q(t) into the sub-level of q(t); once every name in QA has been traversed, the data integration is complete.
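The union-then-populate integration of S31a-S32a amounts to a small merge, sketched here; the dict-of-lists packet representation and the function name are assumptions carried over from the cleaning sketch.

```python
def integrate(packets):
    """S31a: union all type names across the m cleaned packets (set QA).
    S32a: populate each name's sub-level with every matching packet's data."""
    qa = set()
    for packet in packets:
        qa |= packet.keys()                    # union of type names
    fused = {name: [] for name in sorted(qa)}  # one sub-level per name in QA
    for packet in packets:
        for name, values in packet.items():
            fused[name].extend(values)         # fill matching data
    return fused
```

A type name present in only one source still gets its own entry in the fused packet, which is what lets the later verification step (B1-B6) compare island data against the fusion by name.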
In step S3, the method for secondarily integrating the abnormal data comprises the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under each matched type name's sub-level and extracting the data types that contain data discontinuity points (spaces, blank lines, or blank cells); analyzing these data types, calculating the matching degree, and assigning the abnormal data to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back once filled.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
B1: after the data integration is finished, randomly selecting one or more data names from the island database and extracting their sub-level data;
B2: searching the integrated data packet for the corresponding data names and extracting their sub-level data;
B3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when it is not completely contained, updating the integrated data packet by supplementing the missing data;
B5: when it is completely contained, randomly selecting one or more data items from the integrated data packet and acquiring the data name to which each belongs;
B6: searching the island database for the selected data and judging whether the data name it belongs to in the island database is consistent with that in the integrated data packet; if consistent, the data is judged to be integrated correctly, and if not, incorrectly.
The data center system further comprises a data retrieval module for users to retrieve and display data; the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic specifically comprises the following steps:
C1: the user inputs a partial data type name containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are fully contained in an item type name of the integrated data packet, the matching degree is calculated from a, the number of input characters matching characters of the integrated data packet's item type name, and c, the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not fully contained in it, the matching degree is calculated from a, the number of input characters matching characters of the item type name, b, the number of input characters differing from it, and c, the total number of characters of the name, together with a proportionality coefficient k preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, the matching degree is calculated from a, b, and c as defined above, together with d, the difference between the number of input characters and the total number of characters of the name; the proportionality coefficients k and δ are preset by an administrator.
In step C21, the method further comprises D: after the matching results are displayed, acquiring the user's next operation and learning the user's input habits, specifically comprising the following steps:
D1: acquiring the item type name of the data packet opened by the user and equally dividing it into three segments;
D2: judging in which of the three segments the characters input by the user are located, and marking the user as prefix, middle, or suffix accordingly;
D3: matching the user's user name and, at the user's next retrieval, optimizing the retrieval order according to the user's mark; that is, if the user is marked 'prefix', the first of the three segments of the item type name is matched preferentially against the user's input characters.
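The habit-learning steps D1-D3 above can be sketched as follows. The three-way split by the first match position, the tie-breaking, and both function names are illustrative assumptions, not the patent's exact procedure.

```python
def mark_position(query, opened_name):
    """D1/D2: split the opened item type name into three equal segments
    and report which segment the query characters start in."""
    third = max(len(opened_name) // 3, 1)
    idx = opened_name.find(query)
    if idx < 0:
        return None
    if idx < third:
        return "prefix"
    if idx < 2 * third:
        return "middle"
    return "suffix"

def rank_results(names, query, mark):
    """D3: names whose query position matches the user's stored mark
    (e.g. 'prefix') are returned ahead of the others."""
    return sorted(names, key=lambda n: 0 if mark_position(query, n) == mark else 1)
```

A user marked 'prefix' who searches "data" would thus see "datalake" ranked ahead of "bigdata", even though both contain the query.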
In this embodiment, the user is marked according to the segment in which the input characters fall, based on the user's operation after a matched result and the user's behavior habits; the user's next retrieval is then matched according to the mark, which increases both index accuracy and response speed.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention, within the scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
Claims (6)
1. A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, retaining the data that survives cleaning as standard data and treating the remaining data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates it a second time, and merges it into the fusion data packet;
S4: the integrated fusion data packet is uploaded to the server for storage;
in the step S2, the data cleaning comprises the following steps:
S21: scanning the acquired data to form m data packets of different sources, and classifying each packet according to the classification scheme of its source data to form a plurality of typed data sets S(m,n), where S(m,n) denotes the data set of the n-th type in the m-th data packet, while acquiring the type name of each data set;
S22: stripping all typed data sets out of the data packets to form the standard data, the remainder being the abnormal data;
in the step S3, the data integration method comprises:
S31a: taking the union of the type names of the acquired data sets to obtain the set QA of all type names, QA = {q(1), q(2), …, q(T)}, where q(t) denotes the t-th type name after the union;
S32a: filling the data of the m data packets whose type name matches q(t) into the sub-level of q(t), the data integration being complete once every name in QA has been traversed;
in the step S3, the method for secondarily integrating the abnormal data comprises the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under each matched type name's sub-level and extracting the data types that contain data discontinuity points (spaces, blank lines, or blank cells); analyzing these data types, calculating the matching degree, and assigning the abnormal data to the type directory with the highest matching degree;
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back once filled.
2. The data center system based on big data acquisition and analysis of claim 1, further comprising a data verification module.
3. The data center system based on big data acquisition and analysis of claim 2, wherein the working logic of the data verification module comprises the following steps:
B1: after the data integration is finished, randomly selecting one or more data names from the island database and extracting their sub-level data;
B2: searching the integrated data packet for the corresponding data names and extracting their sub-level data;
B3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when it is not completely contained, updating the integrated data packet by supplementing the missing data;
B5: when it is completely contained, randomly selecting one or more data items from the integrated data packet and acquiring the data name to which each belongs;
B6: searching the island database for the selected data and judging whether the data name it belongs to in the island database is consistent with that in the integrated data packet; if consistent, the data is judged to be integrated correctly, and if not, incorrectly.
4. The data center system based on big data acquisition and analysis of claim 1, further comprising a data retrieval module for users to retrieve and display data.
5. The data center system based on big data acquisition and analysis of claim 4, wherein the data retrieval module adopts a fuzzy index algorithm based on data type names, the working logic comprising the following steps:
C1: the user inputs a partial data type name containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree;
in the step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are fully contained in an item type name of the integrated data packet, the matching degree is calculated from a, the number of input characters matching characters of the integrated data packet's item type name, and c, the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not fully contained in it, the matching degree is calculated from a, the number of input characters matching characters of the item type name, b, the number of input characters differing from it, and c, the total number of characters of the name, together with a proportionality coefficient k preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, the matching degree is calculated from a, b, and c as defined above, together with d, the difference between the number of input characters and the total number of characters of the name; the proportionality coefficients k and δ are preset by an administrator.
6. The data center system based on big data acquisition and analysis of claim 5, wherein in the step C21, the method further comprises D: after the matching results are displayed, acquiring the user's next operation and learning the user's input habits, specifically comprising the following steps:
D1: acquiring the item type name of the data packet opened by the user and equally dividing it into three segments;
D2: judging in which of the three segments the characters input by the user are located, and marking the user as prefix, middle, or suffix accordingly;
D3: matching the user's user name and, at the user's next retrieval, optimizing the retrieval order according to the user's mark; that is, if the user is marked 'prefix', the first of the three segments of the item type name is matched preferentially against the user's input characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311356968.3A CN117370325B (en) | 2023-10-19 | 2023-10-19 | Data center system based on big data acquisition and analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311356968.3A CN117370325B (en) | 2023-10-19 | 2023-10-19 | Data center system based on big data acquisition and analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117370325A CN117370325A (en) | 2024-01-09 |
CN117370325B true CN117370325B (en) | 2024-05-28 |
Family
ID=89401845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311356968.3A Active CN117370325B (en) | 2023-10-19 | 2023-10-19 | Data center system based on big data acquisition and analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117370325B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019153813A1 (en) * | 2018-02-07 | 2019-08-15 | 华南理工大学 | Full-text fuzzy retrieval method for similar chinese characters in ciphertext domain |
CN112395325A (en) * | 2020-11-27 | 2021-02-23 | 广州光点信息科技有限公司 | Data management method, system, terminal equipment and storage medium |
CN113392646A (en) * | 2021-07-07 | 2021-09-14 | 上海软中信息技术有限公司 | Data relay system, construction method and device |
CN114579539A (en) * | 2022-02-28 | 2022-06-03 | 武汉祁联生态科技有限公司 | Master-slave data sharing mode based on ecological environment big data framework |
CN114764535A (en) * | 2022-05-06 | 2022-07-19 | 广东电网有限责任公司 | Power data processing method, device and equipment for simulation and storage medium |
WO2022179123A1 (en) * | 2021-02-24 | 2022-09-01 | 深圳壹账通智能科技有限公司 | Data update and presentation method and apparatus, and electronic device and storage medium |
CN116483810A (en) * | 2022-07-29 | 2023-07-25 | 四创电子股份有限公司 | Data management method based on public security big data processing technical guidelines |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9819444B2 (en) * | 2014-05-09 | 2017-11-14 | Avago Technologies General Ip (Singapore) Pte. Ltd. | Robust line coding scheme for communication under severe external noises |
US20170068891A1 (en) * | 2015-09-04 | 2017-03-09 | Infotech Soft, Inc. | System for rapid ingestion, semantic modeling and semantic querying over computer clusters |
JP6943128B2 (en) * | 2017-10-06 | 2021-09-29 | 株式会社島津製作所 | Analytical database registration device, analytical data collection system, analytical system and analytical database registration method |
US20230207123A1 (en) * | 2021-12-23 | 2023-06-29 | GE Precision Healthcare LLC | Machine learning approach for detecting data discrepancies during clinical data integration |
Non-Patent Citations (1)
Title |
---|
HSMA: A hierarchical schema matching algorithm for heterogeneous Internet of Things data; Guo Shuai; Guo Zhongwen; Qiu Zhijin; Journal of Computer Research and Development; 2018-11-15 (11); full text *
Also Published As
Publication number | Publication date |
---|---|
CN117370325A (en) | 2024-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110489633B (en) | Intelligent brain service system based on library data | |
CN104252445B (en) | Approximate repetitive file detection method and device | |
Kopparapu | Automatic extraction of usable information from unstructured resumes to aid search | |
RU2680746C2 (en) | Method and device for developing web page quality model | |
WO2012034733A2 (en) | Method and arrangement for handling data sets, data processing program and computer program product | |
CN110336838B (en) | Account abnormity detection method, device, terminal and storage medium | |
US7627551B2 (en) | Retrieving case-based reasoning information from archive records | |
CN108763380A (en) | Brand recognition search method, device, computer equipment and storage medium | |
CN111563176A (en) | Cartoon management system based on inertia big data | |
CN106997350A (en) | A kind of method and device of data processing | |
CN107818175A (en) | A kind of law class case problem intelligently prejudges system and method | |
CN110825805A (en) | Data visualization method and device | |
CN113254572B (en) | Electronic document classification supervision system based on cloud platform | |
CN110807108A (en) | Asian face data automatic collection and cleaning method and system | |
CN117370325B (en) | Data center system based on big data acquisition and analysis | |
CN112786124B (en) | Problem troubleshooting method and device, storage medium and equipment | |
CN110232071A (en) | Search method, device and storage medium, the electronic device of drug data | |
CN113779261A (en) | Knowledge graph quality evaluation method and device, computer equipment and storage medium | |
CN111259223B (en) | News recommendation and text classification method based on emotion analysis model | |
CN113284577A (en) | Medicine prediction method, device, equipment and storage medium | |
CN111223533B (en) | Medical data retrieval method and system | |
JP6961148B1 (en) | Information processing system and information processing method | |
TWI684950B (en) | Species data analysis method, system and computer program product | |
CN115098804B (en) | Webpage search history record intelligent management system based on big data analysis | |
CN117473074B (en) | Judicial case intelligent information matching system and method based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||