CN117370325B - Data center system based on big data acquisition and analysis - Google Patents

Info

Publication number
CN117370325B
Authority
CN
China
Prior art keywords
data
characters
integrated
abnormal
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311356968.3A
Other languages
Chinese (zh)
Other versions
CN117370325A (en)
Inventor
蒋剑辉
戴子君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Shuliang Technology Co ltd
Original Assignee
Hangzhou Shuliang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Shuliang Technology Co ltd
Priority to CN202311356968.3A
Publication of CN117370325A
Application granted
Publication of CN117370325B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Automation & Control Theory (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data center system based on big data acquisition and analysis, which relates to the field of data communication processing. The system comprises: a data acquisition module, connected to a plurality of island databases, which acquires data information from the island databases; a data cleaning module, which cleans the data acquired by the data acquisition module; a data integration module, which integrates the cleaned multi-source data; and a server, which stores the integrated data. The invention analyzes the abnormal data and then matches, by progressively narrowing the range, the item type to which the abnormal data originally belonged; the abnormal data can be matched automatically or filled in manually, so that data integrity during integration is ensured. The integrated data are sampled at random and their accuracy is then checked by bidirectional verification, which ensures the accuracy of the integrated data; in addition, when an omission is found during the check, it can be filled automatically, further increasing accuracy.

Description

Data center system based on big data acquisition and analysis
Technical Field
The invention relates to the field of data communication processing, in particular to a data center system based on big data acquisition and analysis.
Background
The data center system is used to solve the "data island" problem: it collects, computes, stores and processes massive data through data technology, unifies standards and data definitions, consolidates the data into standard data, and stores the standard data to form a big data resource layer that provides efficient service for clients.
For example, a search found Chinese patent publication No. CN115168474B, which discloses a method for constructing an Internet of Things center system based on a big data model, comprising the following steps: (S1) selecting the message middleware Kafka as the bridge between data acquisition and the Internet of Things platform, used to receive device data accessed by the Internet of Things sensing system; (S2) selecting the distributed data processing engine Flink to clean and filter the data of different devices received by Kafka and to perform rule matching; (S3) selecting the distributed computing engine Spark to batch-extract the IoTDB data of different devices; (S4) selecting the Atlas tool to construct a metadata management system that manages business metadata, technical metadata and operation metadata; and (S5) performing secondary development on AJ-Report to display data reports visually.
The above patent has the following disadvantage: when island data are recorded, a staff member's operating error can cause a position deviation, that is, the data are recorded outside the row or column where they belong, so that they are not stored under their type; during data integration and cleaning, such data may be cleaned out directly as dirty data, so that data are lost from the integrated result.
Therefore, the invention provides a data center system based on big data acquisition and analysis.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a data center system based on big data acquisition and analysis.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A data center system based on big data acquisition and analysis, comprising:
a data acquisition module, connected to a plurality of island databases, which acquires data information from the island databases;
a data cleaning module, which cleans the data acquired by the data acquisition module;
a data integration module, which integrates the cleaned multi-source data;
a server, which stores the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from the plurality of island databases;
S2: the data cleaning module cleans the acquired data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, and integrates the abnormal data into the fusion data packet through secondary integration;
S4: the integrated fusion data packet is uploaded to the server for storage.
Preferably: in step S2, the data cleaning comprises the following steps:
S21: the acquired data are scanned to form m data packets from different sources, and the m data packets are classified according to the classification scheme of their source data to form data sets of several types, each data set being identified as the nth type of the mth data packet, while the type name of each data set is acquired;
S22: the data sets of all types are stripped from the data packets to form the standard data, and what remains is the abnormal data.
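Steps S21 and S22 can be illustrated with a minimal Python sketch. The record layout is an assumption made only for this illustration: each data packet is a list of `(type_name, value)` pairs, and a value entered outside any typed row or column carries `None` as its type name. The function name `clean_packet` is hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed layout: a record is (type_name, value); type_name is None when the
# value was entered outside any typed row or column of the island database.
Record = Tuple[Optional[str], Any]


def clean_packet(records: List[Record]) -> Tuple[Dict[str, List[Any]], List[Any]]:
    """Split one data packet into typed data sets and leftover abnormal data."""
    standard: Dict[str, List[Any]] = {}
    abnormal: List[Any] = []
    for type_name, value in records:
        if type_name is None:
            # S22: whatever is left after stripping the typed data sets
            abnormal.append(value)
        else:
            # S21: group values into data sets keyed by their type name
            standard.setdefault(type_name, []).append(value)
    return standard, abnormal


packet = [("name", "Li Lei"), ("phone", "13800000000"), (None, "13900000000")]
standard, abnormal = clean_packet(packet)
print(standard)
print(abnormal)
```

The abnormal list is kept rather than discarded, which is what allows the secondary integration of step S3 to recover misplaced entries.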
Preferably: in step S3, the data integration method comprises:
S31a: the union of the acquired data set type names is taken to obtain the set QA of all type names, whose t-th element is the name of the t-th type after the union is taken;
S32a: for each type name in QA, the data in the m data packets whose type name matches it are filled into the lower-level data of that name, until every name in QA has been traversed, which completes the data integration.
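As a rough sketch of S31a and S32a, assuming each cleaned data packet is a dictionary from type name to a list of values (a representation chosen here for illustration, not stated in the source), the union and fill steps might look like this. The function name `integrate` is hypothetical.

```python
from typing import Any, Dict, List

Packet = Dict[str, List[Any]]


def integrate(packets: List[Packet]) -> Packet:
    """Merge typed data sets from several packets into one fusion packet."""
    # S31a: union of all type names, kept in first-seen order (the set QA)
    qa: List[str] = []
    for packet in packets:
        for name in packet:
            if name not in qa:
                qa.append(name)

    # S32a: for each name in QA, fill in the matching data from every packet
    fused: Packet = {name: [] for name in qa}
    for name in qa:
        for packet in packets:
            fused[name].extend(packet.get(name, []))
    return fused


packets = [
    {"name": ["A"], "phone": ["1"]},
    {"name": ["B"], "email": ["b@example.com"]},
]
fused = integrate(packets)
print(fused)
```

A type name that appears in only one island database still receives an entry in the fusion packet, since QA is a union rather than an intersection.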
Preferably: in step S3, the method for secondary integration of the abnormal data comprises the following steps:
S31b: the island database to which the abnormal data belong is determined, and the data composition type of the abnormal data is analyzed, the composition type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: the type names in the island database whose data composition type is the same as that of the abnormal data are matched; if the number of matched types is zero, the abnormal data are directly judged to be dirty data and deleted;
S33b: all data in the sub-level of each such type name are traversed, and the data types containing data discontinuities (spaces, blank lines or blank cells) are extracted; these data types are analyzed, the matching degree is calculated, and the abnormal data are matched to the type directory with the highest matching degree.
Preferably: in S33b, the following cases exist:
A1: if the number of data types with data discontinuities is 1, the abnormal data are filled in directly;
A2: if the number of data types with data discontinuities is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back after filling.
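The narrowing procedure of S31b to S33b with cases A1 and A2 can be sketched as follows. Several details are assumptions for illustration only: "characters" is read as non-alphanumeric symbols, a type is considered to match when any of its non-blank cells has the same composition as the abnormal value, and a gap is a `None` or blank cell. The names `composition`, `is_gap` and `place_abnormal` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple


def composition(value: str) -> FrozenSet[str]:
    """S31b: classify which kinds of characters a value is made of."""
    kinds = set()
    for ch in value:
        if ch.isspace():
            continue
        if ch.isdigit():
            kinds.add("digit")
        elif "\u4e00" <= ch <= "\u9fff":  # common CJK ideograph range
            kinds.add("hanzi")
        elif ch.isalpha():
            kinds.add("letter")
        else:
            kinds.add("symbol")
    return frozenset(kinds)


def is_gap(cell: Optional[str]) -> bool:
    return cell is None or str(cell).strip() == ""


def place_abnormal(value: str, source: Dict[str, List[Optional[str]]]) -> Tuple[str, List[str]]:
    target = composition(value)
    # S32b: keep only type names whose data have the same composition
    candidates = [
        name for name, cells in source.items()
        if any(not is_gap(c) and composition(str(c)) == target for c in cells)
    ]
    if not candidates:
        return "dirty", []  # no matching type: treat as dirty data
    # S33b: among those, keep the types that actually have a gap to fill
    gapped = [n for n in candidates if any(is_gap(c) for c in source[n])]
    if len(gapped) == 1:
        # A1: exactly one candidate, fill its first gap directly
        cells = source[gapped[0]]
        cells[next(i for i, c in enumerate(cells) if is_gap(c))] = value
        return "filled", gapped
    # A2: zero or several candidates, hand back for manual tracing
    return "manual", gapped


source = {
    "name": ["Li Lei", "Han Mei"],
    "phone": ["13800000000", None],
    "id": ["0001", "0002"],
}
print(place_abnormal("13900000000", source))
print(source["phone"])
```

In this example both `phone` and `id` share the all-digit composition, but only `phone` has a gap, so the value is filled automatically under case A1.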
Preferably: the data center system further comprises a data verification module.
Preferably: the working logic of the data verification module comprises the following steps:
B1: after data integration is finished, one or more data type names are randomly selected from an island database, and the sub-level data of those names are extracted;
B2: the corresponding data names are looked up in the integrated data packet, and their sub-level data are extracted;
B3: it is judged whether the sub-level data in B2 completely contain the sub-level data in B1;
B4: if not completely contained, the missing data are supplemented into the integrated data packet;
B5: if completely contained, one or more data items are randomly selected from the integrated data packet, and the data name to which each belongs is acquired;
B6: the selected data are looked up in the island database, and it is judged whether the data name to which they belong in the island database is consistent with that in the integrated data; if consistent, the data are judged to be integrated correctly, and if not, incorrectly.
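The bidirectional check in B1 to B6 can be sketched in Python under a few assumptions: island databases and the integrated packet are dictionaries from data name to a list of values, one name and one value are sampled per run, and the result is reported as a status string. The function name `verify` and the status strings are hypothetical.

```python
import random
from typing import Dict, List

Store = Dict[str, List[str]]


def verify(islands: List[Store], fused: Store, rng: random.Random) -> str:
    # B1: pick a random island database and a random data name in it
    island = rng.choice(islands)
    name = rng.choice(sorted(island))
    source_values = island[name]

    # B2: look up the same name in the integrated packet
    fused_values = fused.setdefault(name, [])

    # B3/B4: forward check, supplementing anything the packet is missing
    missing = [v for v in source_values if v not in fused_values]
    if missing:
        fused_values.extend(missing)
        return "supplemented"

    # B5: reverse check, sampling a value from the integrated packet
    non_empty = [n for n in sorted(fused) if fused[n]]
    if not non_empty:
        return "empty"  # nothing to sample
    picked_name = rng.choice(non_empty)
    value = rng.choice(fused[picked_name])

    # B6: the value must sit under the same name in some island database
    found = any(value in isl.get(picked_name, []) for isl in islands)
    return "correct" if found else "incorrect"


islands = [{"phone": ["1", "2"]}]
fused = {"phone": ["1"]}
rng = random.Random(0)
first = verify(islands, fused, rng)
second = verify(islands, fused, rng)
print(first, second, fused)
```

The first call finds that "2" is missing and supplements it; the second call passes the forward check and proceeds to the reverse check.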
Preferably: the data center system further comprises a data retrieval module, with which a user retrieves and displays the data.
Preferably: the data retrieval module adopts a fuzzy index algorithm based on data type names, and the working logic of the data retrieval module specifically comprises the following steps:
C1: the user inputs part of a data type name, containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are completely contained in the item type name of the integrated data packet, a formula in a and c is adopted, where a is the number of input characters that are the same as characters of the item type name, and c is the total number of characters of the item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not completely contained in the item type name, a formula in a, b, c and k is adopted, where a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, and k is a proportionality coefficient preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, a formula in a, b, c, d, k and δ is adopted, where a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, d is the difference between the number of input characters and the total number of characters of the item type name, and k and δ are proportionality coefficients preset by an administrator.
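The case selection in C21 to C23 and the quantities a, b, c and d can be sketched as below. The scoring formulas themselves are not reproduced in this text, so the sketch stops at choosing the case and computing the inputs. Two points are assumptions made here: "same characters" is counted as the multiset overlap between the input and the name, and an input of equal length that is not contained in the name is handled under C22. The function name `match_inputs` is hypothetical.

```python
from collections import Counter
from typing import Dict, Union


def match_inputs(query: str, name: str) -> Dict[str, Union[str, int]]:
    """Pick the matching-degree case and compute the counts it relies on."""
    overlap = Counter(query) & Counter(name)  # per-character minimum counts
    a = sum(overlap.values())                 # input characters also in the name
    b = len(query) - a                        # input characters not in the name
    c = len(name)                             # total characters of the name

    if query in name:
        # C21: input fully contained in the item type name
        return {"case": "C21", "a": a, "c": c}
    if len(query) > len(name):
        # C23: input longer than the name; d is the length difference
        return {"case": "C23", "a": a, "b": b, "c": c, "d": len(query) - len(name)}
    # C22: input not longer than the name and not fully contained
    return {"case": "C22", "a": a, "b": b, "c": c}


print(match_inputs("order", "order_date"))
print(match_inputs("ordxr", "order_date"))
print(match_inputs("order_date_x2", "order_date"))
```

Whatever formula is applied to these counts, candidate names would then be sorted by the resulting matching degree in descending order, as step C2 requires.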
Preferably: step C21 further includes step D: after the matching results are fully displayed, the user's next operation is acquired and the user's input habit is learned, specifically comprising the following steps:
D1: the item type name of the data packet opened by the user is acquired and divided equally into three segments;
D2: the position of the characters input by the user within the three segments is determined, and the user is marked "prefix", "middle" or "suffix" accordingly;
D3: the mark is associated with the user's user name, and on the user's next retrieval the retrieval order is optimized according to the mark; that is, if the user is marked "prefix", item type names in which the input characters fall in the first of the three segments are matched preferentially.
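Steps D1 to D3 can be sketched as follows. The source does not say how to decide the segment when the input spans a segment boundary; this sketch assumes the midpoint of the matched substring decides, and implements "preferential matching" as a stable reordering of candidates. The names `mark_habit` and `rerank` are hypothetical.

```python
from typing import List, Optional


def mark_habit(query: str, name: str) -> Optional[str]:
    """D1/D2: locate the query in the name and label the third it falls in."""
    pos = name.find(query)
    if pos < 0:
        return None
    center = pos + len(query) / 2  # assumed rule: midpoint of the match
    third = len(name) / 3          # D1: three equal segments
    if center < third:
        return "prefix"
    if center < 2 * third:
        return "middle"
    return "suffix"


def rerank(query: str, names: List[str], mark: str) -> List[str]:
    """D3: names where the query falls in the user's marked segment go first."""
    # sorted() is stable, so the existing order is kept within each group
    return sorted(names, key=lambda n: mark_habit(query, n) != mark)


opened = "order_date_utc"
print(mark_habit("order", opened))
print(mark_habit("date", opened))
print(mark_habit("utc", opened))
print(rerank("date", ["order_date_utc", "date_created", "update_date"], "prefix"))
```

Because the reordering is stable, it can be applied on top of a list already sorted by matching degree without disturbing the order inside each group.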
The beneficial effects of the invention are as follows:
1. The invention analyzes the abnormal data and then matches, by progressively narrowing the range, the item type to which the abnormal data originally belonged; the abnormal data can be matched automatically or filled in with manual participation, so that data integrity during integration is ensured.
2. The invention samples the integrated data at random and then checks data accuracy by bidirectional verification, thereby ensuring the accuracy of the integrated data; in addition, when an omission is found during the check, it can be filled automatically, further increasing accuracy.
3. For user retrieval, the invention adopts a fuzzy index algorithm based on data type names and selects a different calculation mode according to the relationship between the characters input by the user and the characters of the item type name, thereby ensuring accurate result matching in each case.
4. The invention marks each user by character segment according to the user's operations or behavior habits after the matching results are shown, and then matches the user's next retrieval according to the mark, which can improve index accuracy and increase response speed.
Drawings
FIG. 1 is a system architecture diagram of a data center system based on big data acquisition and analysis according to the present invention;
FIG. 2 is a logic diagram of the operation of the data center system based on big data acquisition and analysis according to the present invention.
Detailed Description
The technical scheme of the patent is further described in detail below with reference to the specific embodiments.
In the description of this patent, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "disposed" are to be construed broadly; for example, a connection or arrangement may be fixed, detachable, or integral. The specific meaning of these terms in this patent will be understood by those of ordinary skill in the art according to the circumstances.
Example 1:
A data center system based on big data acquisition and analysis, comprising:
a data acquisition module, connected to a plurality of island databases, which acquires data information from the island databases;
a data cleaning module, which cleans the data acquired by the data acquisition module;
a data integration module, which integrates the cleaned multi-source data;
a server, which stores the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from the plurality of island databases;
S2: the data cleaning module cleans the acquired data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, and integrates the abnormal data into the fusion data packet through secondary integration;
S4: the integrated fusion data packet is uploaded to the server for storage.
In step S2, the data cleaning comprises the following steps:
S21: the acquired data are scanned to form m data packets from different sources, and the m data packets are classified according to the classification scheme of their source data to form data sets of several types, each data set being identified as the nth type of the mth data packet, while the type name of each data set is acquired;
S22: the data sets of all types are stripped from the data packets to form the standard data, and what remains is the abnormal data.
In step S3, the data integration method comprises:
S31a: the union of the acquired data set type names is taken to obtain the set QA of all type names, whose t-th element is the name of the t-th type after the union is taken;
S32a: for each type name in QA, the data in the m data packets whose type name matches it are filled into the lower-level data of that name, until every name in QA has been traversed, which completes the data integration.
In step S3, the method for secondary integration of the abnormal data comprises the following steps:
S31b: the island database to which the abnormal data belong is determined, and the data composition type of the abnormal data is analyzed, the composition type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: the type names in the island database whose data composition type is the same as that of the abnormal data are matched; if the number of matched types is zero, the abnormal data are directly judged to be dirty data and deleted;
S33b: all data in the sub-level of each such type name are traversed, and the data types containing data discontinuities (spaces, blank lines or blank cells) are extracted; these data types are analyzed, the matching degree is calculated, and the abnormal data are matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuities is 1, the abnormal data are filled in directly;
A2: if the number of data types with data discontinuities is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back after filling.
In this embodiment, the abnormal data are analyzed and then matched, by progressively narrowing the range, to the item type to which they originally belonged; the abnormal data can be matched automatically or filled in manually, so that data integrity during integration is ensured.
Example 2:
A data center system based on big data acquisition and analysis, comprising:
a data acquisition module, connected to a plurality of island databases, which acquires data information from the island databases;
a data cleaning module, which cleans the data acquired by the data acquisition module;
a data integration module, which integrates the cleaned multi-source data;
a server, which stores the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from the plurality of island databases;
S2: the data cleaning module cleans the acquired data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, and integrates the abnormal data into the fusion data packet through secondary integration;
S4: the integrated fusion data packet is uploaded to the server for storage.
In step S2, the data cleaning comprises the following steps:
S21: the acquired data are scanned to form m data packets from different sources, and the m data packets are classified according to the classification scheme of their source data to form data sets of several types, each data set being identified as the nth type of the mth data packet, while the type name of each data set is acquired;
S22: the data sets of all types are stripped from the data packets to form the standard data, and what remains is the abnormal data.
In step S3, the data integration method comprises:
S31a: the union of the acquired data set type names is taken to obtain the set QA of all type names, whose t-th element is the name of the t-th type after the union is taken;
S32a: for each type name in QA, the data in the m data packets whose type name matches it are filled into the lower-level data of that name, until every name in QA has been traversed, which completes the data integration.
In step S3, the method for secondary integration of the abnormal data comprises the following steps:
S31b: the island database to which the abnormal data belong is determined, and the data composition type of the abnormal data is analyzed, the composition type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: the type names in the island database whose data composition type is the same as that of the abnormal data are matched; if the number of matched types is zero, the abnormal data are directly judged to be dirty data and deleted;
S33b: all data in the sub-level of each such type name are traversed, and the data types containing data discontinuities (spaces, blank lines or blank cells) are extracted; these data types are analyzed, the matching degree is calculated, and the abnormal data are matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuities is 1, the abnormal data are filled in directly;
A2: if the number of data types with data discontinuities is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back after filling.
The data center system further comprises a data verification module, and the working logic of the data verification module comprises the following steps:
B1: after data integration is finished, one or more data type names are randomly selected from an island database, and the sub-level data of those names are extracted;
B2: the corresponding data names are looked up in the integrated data packet, and their sub-level data are extracted;
B3: it is judged whether the sub-level data in B2 completely contain the sub-level data in B1;
B4: if not completely contained, the missing data are supplemented into the integrated data packet;
B5: if completely contained, one or more data items are randomly selected from the integrated data packet, and the data name to which each belongs is acquired;
B6: the selected data are looked up in the island database, and it is judged whether the data name to which they belong in the island database is consistent with that in the integrated data; if consistent, the data are judged to be integrated correctly, and if not, incorrectly.
In this embodiment, the integrated data are sampled at random and their accuracy is then checked by bidirectional verification, which ensures the accuracy of the integrated data; in addition, when an omission is found during the check, it can be filled automatically, further increasing accuracy.
Example 3:
a data center system based on big data acquisition analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
s2: the data cleaning module cleans the collected data, takes the cleaned and stored data as standard data and takes the cleaned data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
s31a: according to the type name of the acquired data set And the union is taken to obtain a set QA of all types of names,WhereinRepresenting the name of the t type after the union is taken;
S32a: combine m data packets with Name matching data populates intoUntil traversing in the lower level data of (a)And then completing data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
s31b: determining an island database to which the abnormal data belong, and analyzing the data composition type of the abnormal data, wherein the type comprises numbers, letters, characters, chinese characters or any combination;
S32b: matching a plurality of type names with the same data composition type as the abnormal data composition type in the island database; if the number of the matched types is zero, directly judging that the abnormal data is dirty data, and deleting;
s33b: traversing all data under each type name sub-level, and extracting data types of data non-continuous points in which spaces, blank lines and blank cells exist; and analyzing the data types, calculating the matching degree, and matching the abnormal data to the type directory with the highest matching degree.
In S33b, the following cases exist:
a1: if the number of data types with data non-continuous points is 1, the abnormal data is directly filled;
A2: if the number of the data types with the data non-continuous points is not 1, packing the data types and the abnormal data, and returning to the entry personnel of the island database for traceability filling, and returning after filling.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
The data center system also comprises a data retrieval module for users to retrieve and display data, wherein the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic specifically comprises the following steps:
C1: The user inputs part of a data type name, containing g characters;
C2: Item type names containing the character segment are retrieved from the integrated data packet and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: If the characters input by the user are completely contained in the item type name of the integrated data packet, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name of the integrated data packet, and c is the total number of characters of that item type name;
C22: If the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not completely contained in the item type name of the integrated data packet, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, and k is a proportionality coefficient preset by an administrator;
C23: If the number of characters input by the user is greater than the number of characters of the item type name, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, d is the difference between the number of input characters and the total number of characters of the item type name, and k and delta are proportionality coefficients preset by an administrator.
In this embodiment, user retrieval adopts a fuzzy index algorithm based on data type names, and a different calculation is applied depending on the relationship between the characters input by the user and the characters of the item type name, so that results are matched accurately in each case.
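The case selection in steps C21 to C23 can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch stops at choosing the case and computing the inputs a, b, c and d. The function name `match_features` is hypothetical, and counting a as a multiset overlap of characters is an assumption about how "the same characters" are counted. A query of equal length that is not contained in the name falls under none of the three cases as stated, so it is reported separately.

```python
from collections import Counter


def match_features(query: str, name: str) -> dict:
    """Pick the case (C21/C22/C23) and compute the formula inputs a, b, c, d."""
    g, c = len(query), len(name)
    # a: input characters that also occur in the name (multiset overlap, an assumption)
    a = sum((Counter(query) & Counter(name)).values())
    b = g - a  # input characters with no counterpart in the name
    if query in name:
        case = "C21"  # input fully contained in the item type name
    elif g < c:
        case = "C22"  # shorter than the name and not contained
    elif g > c:
        case = "C23"  # longer than the name
    else:
        case = "unspecified"  # equal length, not contained: not covered by the text
    return {"case": case, "a": a, "b": b, "c": c, "d": abs(g - c)}
```

The administrator-preset coefficients k and delta would then be combined with these values by whichever formula applies to the selected case.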
Example 4:
A data center system based on big data acquisition and analysis, comprising:
a data acquisition module, connected to a plurality of island databases, for acquiring data information from the island databases;
a data cleaning module for cleaning the data acquired by the data acquisition module;
a data integration module for integrating the cleaned multi-source data;
a server for storing the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the acquired data, taking the data retained by cleaning as standard data and the remaining data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, performs secondary integration on the abnormal data, and merges it into the fusion data packet;
S4: the integrated fusion data packet is uploaded to the server for storage.
In step S2, the data cleaning includes the following steps:
S21: Scan the acquired data to form m data packets from different sources, and classify each of the m data packets according to the classification scheme of its source data to form data sets of a plurality of types, each data set being the n-th type of the m-th data packet; at the same time, acquire the type name of each data set;
S22: Strip all typed data sets from the data packets to form the standard data; the remainder is abnormal data.
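Steps S21 and S22 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`source`, `type`, `value` keys) and the function name `clean` are assumptions, since the text does not specify how the acquired data is represented.

```python
from typing import Any


def clean(records: list[dict[str, Any]]) -> tuple[dict, dict]:
    """Split raw records into typed standard data and abnormal leftovers.

    Each record is assumed to look like
    {"source": "db1", "type": "name" or None, "value": ...}.
    """
    standard: dict[str, dict[str, list]] = {}  # source -> type name -> values
    abnormal: dict[str, list] = {}             # source -> values with no type
    for rec in records:
        source, type_name = rec["source"], rec.get("type")
        if type_name:
            # S21: the record belongs to a typed data set of its source packet
            standard.setdefault(source, {}).setdefault(type_name, []).append(rec["value"])
        else:
            # S22: whatever remains after stripping the typed sets is abnormal
            abnormal.setdefault(source, []).append(rec["value"])
    return standard, abnormal
```

Keeping the source with each abnormal value matters later, because secondary integration starts by determining which island database the abnormal data came from.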
In step S3, the data integration method includes:
S31a: Take the union of the acquired data set type names to obtain the set QA of all type names, whose t-th element represents the t-th type name after the union is taken;
S32a: Fill the data in the m data packets that matches each type name in QA into the lower-level data of that type name, until all type names in QA have been traversed, whereupon data integration is complete.
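Steps S31a and S32a amount to a union of type names followed by a merge. The sketch below assumes the standard data is held as a mapping from source to type name to values; that layout and the name `integrate` are illustrative assumptions.

```python
def integrate(standard: dict[str, dict[str, list]]) -> dict[str, list]:
    """Merge per-source typed data sets into one fusion data packet."""
    # S31a: union of the type names across all m data packets (the set QA)
    all_names = sorted(set().union(*(sets.keys() for sets in standard.values())))
    fused: dict[str, list] = {name: [] for name in all_names}
    # S32a: fill each type name's lower level with the matching data of every packet
    for sets in standard.values():
        for name, values in sets.items():
            fused[name].extend(values)
    return fused
```

For example, two sources that both hold a `name` type end up sharing one `name` entry in the fused packet, while a type present in only one source is carried over unchanged.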
In step S3, the method for secondary integration of the abnormal data includes the following steps:
S31b: Determine the island database to which the abnormal data belongs, and analyze the data composition type of the abnormal data, wherein the type comprises numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: In that island database, match the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, directly judge the abnormal data to be dirty data and delete it;
S33b: Traverse all data at the sub-level of each matched type name, and extract the data types in which data discontinuities exist, namely spaces, blank lines or blank cells; analyze these data types, calculate the matching degree, and match the abnormal data to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: If the number of data types with data discontinuities is 1, the abnormal data is filled in directly;
A2: If the number of data types with data discontinuities is not 1, the data types and the abnormal data are packaged and returned to the data entry personnel of the island database for traced-back filling, and are sent back once filled.
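The decision logic of S31b to S33b with cases A1 and A2 can be sketched as below. Several points are assumptions: values are plain strings, `None` or a blank string marks a data discontinuity, "characters" is read as non-alphanumeric symbols, and Chinese characters are detected by the basic CJK Unified Ideographs range. The matching-degree calculation in S33b is not specified in the text and is omitted. The function names are hypothetical.

```python
from typing import Optional


def composition(value: str) -> frozenset:
    """S31b: classify which kinds of characters a value is made of."""
    kinds = set()
    for ch in value:
        if ch.isdigit():
            kinds.add("digit")
        elif "\u4e00" <= ch <= "\u9fff":  # basic CJK range, an assumption
            kinds.add("chinese")
        elif ch.isalpha():
            kinds.add("letter")
        elif not ch.isspace():
            kinds.add("symbol")
    return frozenset(kinds)


def place_abnormal(value: str, source_sets: dict) -> tuple:
    """Return (decision, type_name) for one abnormal value.

    source_sets maps each type name of the originating island database
    to its list of values; None or a blank string marks a gap.
    """
    kind = composition(value)
    # S32b: type names whose existing data has the same composition type
    candidates = [name for name, col in source_sets.items()
                  if any(v and composition(v) == kind for v in col)]
    if not candidates:
        return ("dirty", None)  # no matching type: delete as dirty data
    # S33b: keep only the candidates that actually contain a gap
    gapped = [name for name in candidates
              if any(v is None or not str(v).strip() for v in source_sets[name])]
    if len(gapped) == 1:
        return ("fill", gapped[0])  # A1: a single gap, fill directly
    return ("return_to_entry_staff", None)  # A2: zero or several gaps


def describe(decision: tuple) -> Optional[str]:
    """Convenience accessor for the target type name, if any."""
    return decision[1]
```

A numeric value whose source has exactly one numeric column with a blank cell is filled directly, whereas a value that could fit several gapped columns, or none, is sent back to the data entry personnel.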
The data center system also comprises a data verification module, whose working logic comprises the following steps:
B1: After data integration is complete, randomly select one or more data type names from the island database and extract the sub-level data under each name;
B2: Look up the corresponding data name in the integrated data packet and extract its sub-level data;
B3: Determine whether the sub-level data from B2 completely contains the sub-level data from B1;
B4: If it is not completely contained, update the integrated data packet by supplementing the missing data;
B5: If it is completely contained, randomly select one or more data items from the integrated data packet and obtain the data name to which each belongs;
B6: Search for the selected data in the island database and determine whether the data name it belongs to there is consistent with that in the integrated data packet; if so, the data is judged to be correctly integrated, and if not, incorrectly integrated.
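The two-way spot check in B1 to B6 can be sketched as follows. It assumes a single island database and an integrated packet that are both mappings from data name to a list of values; a real deployment would compare against each source in turn. The function names are hypothetical, and the random generator is passed in so that runs are reproducible.

```python
import random


def supplement_missing(island: dict, fused: dict, name: str) -> list:
    """B2-B4: make fused[name] contain every value of island[name]."""
    target = fused.setdefault(name, [])
    missing = [v for v in island[name] if v not in target]
    target.extend(missing)  # B4: supplement what the packet lacks
    return missing


def placement_ok(island: dict, name: str, value) -> bool:
    """B6: is the value filed under the same data name in the island database?"""
    return value in island.get(name, [])


def verify(island: dict, fused: dict, rng: random.Random) -> str:
    name = rng.choice(sorted(island))  # B1: random data name from the source
    if supplement_missing(island, fused, name):
        return "supplemented"
    pairs = [(n, v) for n, vals in fused.items() for v in vals]
    if not pairs:
        return "correct"  # nothing to sample from an empty packet
    picked_name, picked_value = rng.choice(pairs)  # B5: random item and its name
    return "correct" if placement_ok(island, picked_name, picked_value) else "incorrect"
```

The first pass checks that nothing from the source was lost; only once coverage holds does the second pass sample from the integrated side to check that data was filed under the right name.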
The data center system also comprises a data retrieval module for users to retrieve and display data, wherein the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic specifically comprises the following steps:
C1: The user inputs part of a data type name, containing g characters;
C2: Item type names containing the character segment are retrieved from the integrated data packet and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: If the characters input by the user are completely contained in the item type name of the integrated data packet, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name of the integrated data packet, and c is the total number of characters of that item type name;
C22: If the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not completely contained in the item type name of the integrated data packet, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, and k is a proportionality coefficient preset by an administrator;
C23: If the number of characters input by the user is greater than the number of characters of the item type name, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, d is the difference between the number of input characters and the total number of characters of the item type name, and k and delta are proportionality coefficients preset by an administrator.
Step C21 further includes step D: after the matching results have been fully displayed, the user's next operation is acquired and the user's input habit is learned, specifically comprising the following steps:
D1: Acquire the item type name of the data packet opened by the user, and divide the item type name equally into three sections;
D2: Determine in which of the three sections the characters input by the user are located, and mark the user as "prefix", "middle" or "suffix" accordingly;
D3: Match the user's user name and, at the user's next retrieval, optimize the retrieval order according to the user's mark; that is, if the user is marked "prefix", preferentially match the user's input characters against the first of the three sections of each item type name.
In this embodiment, the user is marked by character section according to the operations or behaviour habits the user exhibits after results are matched, and the user's next retrieval is matched according to that mark, which can improve retrieval accuracy and response speed.
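Steps D1 to D3 can be sketched as follows. The text does not say how to treat an input that straddles two sections, so this sketch assigns the section by the midpoint of the matched span; that rule and the names `position_mark` and `rank` are assumptions.

```python
from typing import Optional

MARKS = ("prefix", "middle", "suffix")


def position_mark(query: str, name: str) -> Optional[str]:
    """D1-D2: find which third of the item type name the input falls in."""
    start = name.find(query)
    if start < 0 or not query:
        return None  # input does not occur in this name
    centre = start + len(query) / 2  # midpoint of the matched span (assumption)
    third = len(name) / 3            # D1: three equal sections
    return MARKS[min(2, int(centre // third))]


def rank(query: str, names: list, mark: Optional[str]) -> list:
    """D3: list hits whose match position agrees with the user's mark first."""
    hits = [n for n in names if query in n]
    # sorted() is stable, so the original order is kept within each group
    return sorted(hits, key=lambda n: position_mark(query, n) != mark)
```

A user marked "suffix" who types `log` would therefore see `audit_log` ahead of `login_table`, even though both contain the input.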
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (6)

1. A data center system based on big data acquisition and analysis, comprising:
a data acquisition module, connected to a plurality of island databases, for acquiring data information from the island databases;
a data cleaning module for cleaning the data acquired by the data acquisition module;
a data integration module for integrating the cleaned multi-source data;
a server for storing the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the acquired data, taking the data retained by cleaning as standard data and the remaining data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, performs secondary integration on the abnormal data, and merges it into the fusion data packet;
S4: the integrated fusion data packet is uploaded to the server for storage;
in step S2, the data cleaning includes the following steps:
S21: scanning the acquired data to form m data packets from different sources, and classifying each of the m data packets according to the classification scheme of its source data to form data sets of a plurality of types, each data set being the n-th type of the m-th data packet, while acquiring the type name of each data set;
S22: stripping all typed data sets from the data packets to form the standard data, the remainder being abnormal data;
in step S3, the data integration method includes:
S31a: taking the union of the acquired data set type names to obtain the set QA of all type names, whose t-th element represents the t-th type name after the union is taken;
S32a: filling the data in the m data packets that matches each type name in QA into the lower-level data of that type name, until all type names in QA have been traversed, whereupon data integration is complete;
in step S3, the method for secondary integration of the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, wherein the type comprises numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: in that island database, matching the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, directly judging the abnormal data to be dirty data and deleting it;
S33b: traversing all data at the sub-level of each matched type name, and extracting the data types in which data discontinuities exist, namely spaces, blank lines or blank cells; analyzing these data types, calculating the matching degree, and matching the abnormal data to the type directory with the highest matching degree;
in S33b, the following cases exist:
A1: if the number of data types with data discontinuities is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuities is not 1, the data types and the abnormal data are packaged and returned to the data entry personnel of the island database for traced-back filling, and are sent back once filled.
2. The data center system based on big data acquisition and analysis of claim 1, further comprising a data verification module.
3. The data center system based on big data acquisition and analysis of claim 2, wherein the working logic of the data verification module comprises the following steps:
B1: after data integration is complete, randomly selecting one or more data type names from the island database and extracting the sub-level data under each name;
B2: looking up the corresponding data name in the integrated data packet and extracting its sub-level data;
B3: determining whether the sub-level data from B2 completely contains the sub-level data from B1;
B4: if it is not completely contained, updating the integrated data packet by supplementing the missing data;
B5: if it is completely contained, randomly selecting one or more data items from the integrated data packet and obtaining the data name to which each belongs;
B6: searching for the selected data in the island database and determining whether the data name it belongs to there is consistent with that in the integrated data packet; if so, the data is judged to be correctly integrated, and if not, incorrectly integrated.
4. The data center system based on big data acquisition and analysis of claim 1, further comprising a data retrieval module for users to retrieve and display data.
5. The data center system based on big data acquisition and analysis of claim 4, wherein the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic comprises the following steps:
C1: the user inputs part of a data type name, containing g characters;
C2: item type names containing the character segment are retrieved from the integrated data packet and returned to the user's client in descending order of matching degree;
in step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are completely contained in the item type name of the integrated data packet, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name of the integrated data packet, and c is the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not completely contained in the item type name of the integrated data packet, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, and k is a proportionality coefficient preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, a formula is adopted in which a is the number of input characters that are the same as characters of the item type name, b is the number of input characters that differ from the characters of the item type name, c is the total number of characters of the item type name, d is the difference between the number of input characters and the total number of characters of the item type name, and k and delta are proportionality coefficients preset by an administrator.
6. The data center system based on big data acquisition and analysis of claim 5, wherein step C21 further includes step D: after the matching results have been fully displayed, the user's next operation is acquired and the user's input habit is learned, specifically comprising the following steps:
D1: acquiring the item type name of the data packet opened by the user, and dividing the item type name equally into three sections;
D2: determining in which of the three sections the characters input by the user are located, and marking the user as "prefix", "middle" or "suffix" accordingly;
D3: matching the user's user name and, at the user's next retrieval, optimizing the retrieval order according to the user's mark; that is, if the user is marked "prefix", preferentially matching the user's input characters against the first of the three sections of each item type name.
CN202311356968.3A 2023-10-19 2023-10-19 Data center system based on big data acquisition and analysis Active CN117370325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311356968.3A CN117370325B (en) 2023-10-19 2023-10-19 Data center system based on big data acquisition and analysis

Publications (2)

Publication Number Publication Date
CN117370325A CN117370325A (en) 2024-01-09
CN117370325B CN117370325B (en) 2024-05-28

Family

ID=89401845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311356968.3A Active CN117370325B (en) 2023-10-19 2023-10-19 Data center system based on big data acquisition and analysis

Country Status (1)

Country Link
CN (1) CN117370325B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019153813A1 (en) * 2018-02-07 2019-08-15 华南理工大学 Full-text fuzzy retrieval method for similar chinese characters in ciphertext domain
CN112395325A (en) * 2020-11-27 2021-02-23 广州光点信息科技有限公司 Data management method, system, terminal equipment and storage medium
CN113392646A (en) * 2021-07-07 2021-09-14 上海软中信息技术有限公司 Data relay system, construction method and device
CN114579539A (en) * 2022-02-28 2022-06-03 武汉祁联生态科技有限公司 Master-slave data sharing mode based on ecological environment big data framework
CN114764535A (en) * 2022-05-06 2022-07-19 广东电网有限责任公司 Power data processing method, device and equipment for simulation and storage medium
WO2022179123A1 (en) * 2021-02-24 2022-09-01 深圳壹账通智能科技有限公司 Data update and presentation method and apparatus, and electronic device and storage medium
CN116483810A (en) * 2022-07-29 2023-07-25 四创电子股份有限公司 Data management method based on public security big data processing technical guidelines

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9819444B2 (en) * 2014-05-09 2017-11-14 Avago Technologies General Ip (Singapore) Pte. Ltd. Robust line coding scheme for communication under severe external noises
US20170068891A1 (en) * 2015-09-04 2017-03-09 Infotech Soft, Inc. System for rapid ingestion, semantic modeling and semantic querying over computer clusters
JP6943128B2 (en) * 2017-10-06 2021-09-29 株式会社島津製作所 Analytical database registration device, analytical data collection system, analytical system and analytical database registration method
US20230207123A1 (en) * 2021-12-23 2023-06-29 GE Precision Healthcare LLC Machine learning approach for detecting data discrepancies during clinical data integration

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HSMA:面向物联网异构数据的模式分层匹配算法;郭帅;郭忠文;仇志金;;计算机研究与发展;20181115(11);全文 *

Also Published As

Publication number Publication date
CN117370325A (en) 2024-01-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant