CN117370325B - Data center system based on big data acquisition and analysis - Google Patents
- Publication number
- CN117370325B CN117370325B CN202311356968.3A CN202311356968A CN117370325B CN 117370325 B CN117370325 B CN 117370325B CN 202311356968 A CN202311356968 A CN 202311356968A CN 117370325 B CN117370325 B CN 117370325B
- Authority
- CN
- China
- Prior art keywords
- data
- characters
- integrated
- abnormal
- name
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/2219—Large Object storage; Management thereof
- G06F16/2468—Fuzzy queries
- G06F16/25—Integrating or interfacing systems involving database management systems
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a data center system based on big data acquisition and analysis, in the field of data communication processing. The system comprises: a data acquisition module connected with a plurality of island databases to acquire their data information; a data cleaning module for cleaning the data acquired by the data acquisition module; a data integration module for integrating the cleaned multi-source data; and a server for storing the integrated data. The invention analyzes abnormal data and then matches it, by stepwise narrowing, to the item type to which it originally belonged, so the abnormal data can be matched automatically or filled in manually, ensuring data integrity during integration. The integrated data is sampled by a random method and checked for accuracy in a bidirectional verification mode, ensuring the accuracy of the integrated data; in addition, omissions found during the check are filled in automatically, further increasing accuracy.
Description
Technical Field
The invention relates to the field of data communication processing, in particular to a data center system based on big data acquisition and analysis.
Background
The data center system solves the problem of "data islands": massive data is collected, calculated, stored and processed through data technology, unified to a common standard and caliber to form standard data, and stored to form a big data resource layer that provides efficient services for clients.
For example, a search found that the patent with Chinese patent publication number CN115168474B discloses a method for constructing an Internet of Things center system based on a big data model, comprising the following steps: (S1) selecting the message middleware Kafka as an intermediate bridge between data acquisition and the Internet of Things platform, used to receive device data accessed by the Internet of Things sensing system; (S2) selecting the Flink distributed data processing engine to clean and filter the data of different devices received by Kafka, and to perform rule matching; (S3) selecting the distributed computing engine Spark lot to extract IoTDB data of different devices; (S4) selecting the Atlas tool to construct a metadata management system managing business metadata, technical metadata and operation metadata; and (S5) performing secondary development of AJ-Report for visual display of data reports.
The above patent suffers from the following disadvantage: when island data is entered, staff misoperation can cause position errors, i.e. data is recorded outside the row or column of its correct position, so the data is not stored under its type; during integration and cleaning it is then cleaned out directly as dirty data, leaving the integrated data incomplete.
Therefore, the invention provides a data center system based on big data acquisition and analysis.
Disclosure of Invention
The invention aims to overcome the defects in the prior art by providing a data center system based on big data acquisition and analysis.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
Preferably: in the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
Preferably: in the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
Preferably: in the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
Preferably: in S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
Preferably: the data center system also comprises a data verification module.
Preferably: the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
Preferably: the data center system also comprises a data retrieval module for user retrieval and display of data.
Preferably: the data retrieval module adopts a fuzzy index algorithm based on the data type name, and its working logic specifically comprises the following steps:
C1: the user inputs a part of a data type name containing g characters;
C2: the item type names in the integrated data packet that contain the input character segment are retrieved and returned to the user's client in descending order of matching degree.
In the step C2, the matching degree calculation method comprises the following steps:
C21: if the characters input by the user are completely contained in the item type name of the integrated data packet, a first formula is adopted, where a is the number of input characters identical to characters of the item type name and c is the total number of characters of the item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not completely contained in it, a second formula is adopted, where a is the number of input characters identical to characters of the item type name, b is the number of input characters differing from them, c is the total number of characters of the item type name, and k is a proportionality coefficient preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, a third formula is adopted, where a is the number of identical input characters, b is the number of differing input characters, c is the total number of characters of the item type name, d is the difference between the number of input characters and the total number of characters of the item type name, and k and δ are proportionality coefficients preset by an administrator.
Preferably: the step C21 further includes a step D: after the matching results are fully displayed, the next operation of the user is acquired and the user's input habit is learned, specifically comprising:
D1: acquiring the item type name of the data packet opened by the user, and dividing it equally into three segments;
D2: judging in which of the three segments the characters input by the user are located, and marking the user as "prefix", "middle" or "suffix" according to that position;
D3: matching the user's user name; when the user retrieves next time, the retrieval order is optimized according to the user's mark, i.e. if the user is marked "prefix", the user's input characters are preferentially matched against the first segment of the three-segment item type name.
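Purely as an illustration, the segment-marking idea of D1-D3 can be sketched as follows; the helper name and the fallback label are assumptions, not part of the invention:

```python
# Minimal sketch of D1-D3: split the opened item type name into three equal
# segments and label which segment the user's query fell in; the label can
# then bias the next retrieval. All names here are invented.

def segment_label(opened_name, query):
    """Return 'prefix' / 'middle' / 'suffix' for the segment holding the query."""
    third = max(1, len(opened_name) // 3)     # D1: three equal segments
    segments = [opened_name[:third],
                opened_name[third:2 * third],
                opened_name[2 * third:]]
    for seg, label in zip(segments, ("prefix", "middle", "suffix")):
        if query in seg:                      # D2: locate the input characters
            return label
    return "prefix"                           # fallback when the query spans segments
```

On the next retrieval, results whose match sits in the user's labeled segment would simply be ranked first.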
The beneficial effects of the invention are as follows:
1. The invention analyzes the abnormal data and then matches it, by stepwise narrowing of the range, to the item type to which it originally belonged; the abnormal data can be matched automatically or filled in with manual participation, so data integrity during integration is ensured.
2. The invention extracts the integrated data by a random method and then tests data accuracy in a bidirectional verification mode, ensuring the accuracy of the integrated data; in addition, omissions found during the test are filled in automatically, further increasing accuracy.
3. The invention adopts a fuzzy index algorithm based on the data type name for user retrieval, matching different calculation modes according to the relation between the characters input by the user and the characters of the item type name, so the accuracy of result matching is ensured in each case.
4. According to the user's operation or behavior habit after the results are matched, the user is marked by character segment according to that habit, and the user's next retrieval is matched according to the mark, which improves retrieval accuracy and increases response speed.
Drawings
FIG. 1 is a system architecture diagram of a data center system based on big data acquisition and analysis according to the present invention;
fig. 2 is a logic diagram of the operation of the data center system based on big data acquisition and analysis according to the present invention.
Detailed Description
The technical scheme of the patent is further described in detail below with reference to the specific embodiments.
In the description of this patent, it should be noted that, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "disposed" are to be construed broadly, and may be fixedly connected, disposed, detachably connected, disposed, or integrally connected, disposed, for example. The specific meaning of the terms in this patent will be understood by those of ordinary skill in the art as the case may be.
Example 1:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
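A rough, non-authoritative sketch of the stepwise-narrowing rematch (S31b-S33b with cases A1-A2) might look like this; the composition classifier is deliberately simplified and every helper name is invented:

```python
# Sketch of S31b-S33b / A1-A2: classify the character composition of the stray
# value, narrow to type names whose data share that composition, then fill
# directly only when exactly one candidate has a single blank cell.

def composition(value):
    """Classify a value's character make-up (stand-in for the patent's types)."""
    s = str(value)
    if s.isdigit():
        return "digits"
    if s.isalpha():
        return "letters"
    return "mixed"

def rematch(abnormal_value, island_db):
    """island_db: type name -> list of cells, with None marking a blank cell."""
    comp = composition(abnormal_value)
    # S32b: narrow to type names whose existing data share the composition type
    candidates = {
        name: cells for name, cells in island_db.items()
        if any(cell is not None and composition(cell) == comp for cell in cells)
    }
    if not candidates:
        return None                 # zero matches: treated as dirty data
    # S33b / A1: exactly one candidate with one discontinuity -> fill directly
    gapped = [n for n, cells in candidates.items() if cells.count(None) == 1]
    if len(gapped) == 1:
        return gapped[0]
    return "manual"                 # A2: pack and return for traceability filling

db = {"ages": [34, None, 29], "names": ["li", "wang", "zhao"]}
```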
In this embodiment, by analyzing the abnormal data and then matching it, in a stepwise-narrowing form, to the item type to which it originally belonged, the abnormal data can be matched automatically and can also be filled in with manual participation, so data integrity during integration is ensured.
Example 2:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
In this embodiment, the integrated data is extracted by a random method and its accuracy is then checked in a bidirectional verification mode, ensuring the accuracy of the integrated data; in addition, omissions found during the check are filled in automatically, further increasing accuracy.
Example 3:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
s1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, taking the data retained after cleaning as standard data and the data removed by cleaning as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates the abnormal data for the second time and integrates the abnormal data into the fusion data packet;
s4: and uploading the integrated fusion data packet to a server for storage.
In the step S2, the data cleaning includes the following steps:
S21: scanning the data according to the acquired data to form m data packets with different sources, classifying the m data packets according to the classification form of the source data to form a plurality of types of data sets WhereinData sets of the nth type representing the mth data packet, while acquiring a type name of each data set;
S22: and stripping all types of data sets in the data packet to form standard data, and the rest is abnormal data.
In the step S3, the data integration method includes:
S31a: taking the union of the type names of all acquired data sets to obtain a set QA of all type names, QA = {Q1, Q2, ..., Qt, ...}, where Qt denotes the tth type name after the union is taken;
S32a: filling the data in the m data packets whose name matches Qt into the sub-level data of Qt, until QA has been fully traversed, thereby completing the data integration.
In the step S3, the method for secondarily integrating the abnormal data includes the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under the sub-level of each matched type name, and extracting the data types in which data discontinuity points, namely spaces, blank lines or blank cells, exist; the data types are analyzed, the matching degree is calculated, and the abnormal data is matched to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packed and returned to the entry personnel of the island database for traceability filling, and the data is returned after being filled in.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
b1: after the data integration is finished, randomly selecting one or more types of data names from the island database, and extracting sub-level data of the data names;
b2: searching a data name corresponding to the integrated data packet from the integrated data packet, and extracting sub-level data of the data name;
b3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when the data packet is not completely contained, updating and supplementing missing data into the integrated data packet;
B5: when the data is completely contained, randomly selecting one or more data from the integrated data packet, and acquiring the data name to which the data belongs;
B6: searching selected data in the island database, judging whether the name of the data belonging to the island database is consistent with the integrated data, if so, judging that the data is integrated correctly, and if not, judging that the data is integrated incorrectly.
The data center system further comprises a data retrieval module for users to retrieve and display data; the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic specifically comprises the following steps:
C1: the user inputs a partial data type name containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are fully contained in an item type name of the integrated data packet, the matching degree is calculated from a, the number of input characters matching characters of the integrated data packet's item type name, and c, the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not fully contained in it, the matching degree is calculated from a, the number of input characters matching characters of the item type name, b, the number of input characters differing from it, and c, the total number of characters of the name, together with a proportionality coefficient k preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, the matching degree is calculated from a, b, and c as defined above, together with d, the difference between the number of input characters and the total number of characters of the name; the proportionality coefficients k and δ are preset by an administrator.
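The formula images for C21-C23 did not survive into this text, so the exact expressions are unknown; the sketch below is only one plausible reading that respects the variable definitions (a, b, c, d) and the three-way case split, with the specific arithmetic, the default coefficient values, and the function name all assumed.

```python
def matching_degree(query, name, k=0.5, delta=0.3):
    """Hypothetical matching degree for cases C21-C23.
    a: input characters also present in the item type name;
    b: input characters absent from it; c: total characters of the name;
    d: number of input characters in excess of the name's length.
    The exact patented formulas are not recoverable; these expressions
    merely illustrate the case split."""
    a = sum(1 for ch in query if ch in name)
    b = len(query) - a
    c = len(name)
    if query in name:            # C21: input fully contained in the name
        return a / c
    if len(query) < c:           # C22: shorter input, not fully contained
        return (a - k * b) / c
    d = len(query) - c           # C23: input longer than the name
    return (a - k * b - delta * d) / c
```

Under any such reading, exact containment scores highest and extra or mismatched characters are penalized, which matches the descending-order return in C2.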
In this embodiment, a fuzzy index algorithm based on data type names is adopted for user retrieval, and different calculation modes are selected according to the relationship between the characters input by the user and the characters of the item type name, ensuring accurate result matching in each case.
Example 4:
A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, retaining the data that survives cleaning as standard data and treating the remaining data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates it a second time, and merges it into the fusion data packet;
S4: the integrated fusion data packet is uploaded to the server for storage.
In step S2, the data cleaning comprises the following steps:
S21: scanning the acquired data to form m data packets of different sources, and classifying each packet according to the classification scheme of its source data to form a plurality of typed data sets S(m,n), where S(m,n) denotes the data set of the n-th type in the m-th data packet, while acquiring the type name of each data set;
S22: stripping all typed data sets out of the data packets to form the standard data; the remainder is the abnormal data.
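The cleaning step S21-S22 can be sketched as below. The representation of a "data packet" as a dict of type name to raw values, and the rule that a gap (None or a blank string, standing in for spaces, blank lines, and blank cells) marks a value abnormal, are both simplifying assumptions for illustration.

```python
def clean_packets(packets):
    """packets: list of dicts mapping a type name to a list of raw values,
    one dict per source (S21). Returns (standard, abnormal): standard maps
    type name -> clean values; abnormal collects (type name, value) pairs
    that carry gaps (S22)."""
    standard, abnormal = {}, []
    for packet in packets:
        for type_name, values in packet.items():
            for v in values:
                if v is None or (isinstance(v, str) and not v.strip()):
                    abnormal.append((type_name, v))   # gap -> abnormal data
                else:
                    standard.setdefault(type_name, []).append(v)
    return standard, abnormal
```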
In step S3, the data integration method comprises:
S31a: taking the union of the type names of the acquired data sets to obtain the set QA of all type names, QA = {q(1), q(2), …, q(T)}, where q(t) denotes the t-th type name after the union;
S32a: filling the data of the m data packets whose type name matches q(t) into the sub-level of q(t); once every name in QA has been traversed, the data integration is complete.
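The union-then-populate integration of S31a-S32a amounts to a small merge, sketched here; the dict-of-lists packet representation and the function name are assumptions carried over from the cleaning sketch.

```python
def integrate(packets):
    """S31a: union all type names across the m cleaned packets (set QA).
    S32a: populate each name's sub-level with every matching packet's data."""
    qa = set()
    for packet in packets:
        qa |= packet.keys()                    # union of type names
    fused = {name: [] for name in sorted(qa)}  # one sub-level per name in QA
    for packet in packets:
        for name, values in packet.items():
            fused[name].extend(values)         # fill matching data
    return fused
```

A type name present in only one source still gets its own entry in the fused packet, which is what lets the later verification step (B1-B6) compare island data against the fusion by name.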
In step S3, the method for secondarily integrating the abnormal data comprises the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under each matched type name's sub-level and extracting the data types that contain data discontinuity points (spaces, blank lines, or blank cells); analyzing these data types, calculating the matching degree, and assigning the abnormal data to the type directory with the highest matching degree.
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back once filled.
The data center system also comprises a data verification module, and the working logic of the data verification module comprises the following steps:
B1: after the data integration is finished, randomly selecting one or more data names from the island database and extracting their sub-level data;
B2: searching the integrated data packet for the corresponding data names and extracting their sub-level data;
B3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when it is not completely contained, updating the integrated data packet by supplementing the missing data;
B5: when it is completely contained, randomly selecting one or more data items from the integrated data packet and acquiring the data name to which each belongs;
B6: searching the island database for the selected data and judging whether the data name it belongs to in the island database is consistent with that in the integrated data packet; if consistent, the data is judged to be integrated correctly, and if not, incorrectly.
The data center system further comprises a data retrieval module for users to retrieve and display data; the data retrieval module adopts a fuzzy index algorithm based on data type names, and its working logic specifically comprises the following steps:
C1: the user inputs a partial data type name containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree.
In step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are fully contained in an item type name of the integrated data packet, the matching degree is calculated from a, the number of input characters matching characters of the integrated data packet's item type name, and c, the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not fully contained in it, the matching degree is calculated from a, the number of input characters matching characters of the item type name, b, the number of input characters differing from it, and c, the total number of characters of the name, together with a proportionality coefficient k preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, the matching degree is calculated from a, b, and c as defined above, together with d, the difference between the number of input characters and the total number of characters of the name; the proportionality coefficients k and δ are preset by an administrator.
In step C21, the method further comprises D: after the matching results are displayed, acquiring the user's next operation and learning the user's input habits, specifically comprising the following steps:
D1: acquiring the item type name of the data packet opened by the user and equally dividing it into three segments;
D2: judging in which of the three segments the characters input by the user are located, and marking the user as prefix, middle, or suffix accordingly;
D3: matching the user's user name and, at the user's next retrieval, optimizing the retrieval order according to the user's mark; that is, if the user is marked 'prefix', the first of the three segments of the item type name is matched preferentially against the user's input characters.
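The habit-learning steps D1-D3 above can be sketched as follows. The three-way split by the first match position, the tie-breaking, and both function names are illustrative assumptions, not the patent's exact procedure.

```python
def mark_position(query, opened_name):
    """D1/D2: split the opened item type name into three equal segments
    and report which segment the query characters start in."""
    third = max(len(opened_name) // 3, 1)
    idx = opened_name.find(query)
    if idx < 0:
        return None
    if idx < third:
        return "prefix"
    if idx < 2 * third:
        return "middle"
    return "suffix"

def rank_results(names, query, mark):
    """D3: names whose query position matches the user's stored mark
    (e.g. 'prefix') are returned ahead of the others."""
    return sorted(names, key=lambda n: 0 if mark_position(query, n) == mark else 1)
```

A user marked 'prefix' who searches "data" would thus see "datalake" ranked ahead of "bigdata", even though both contain the query.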
In this embodiment, the user is marked according to the segment in which the input characters fall, based on the user's operation after a matched result and the user's behavior habits; the user's next retrieval is then matched according to the mark, which increases both index accuracy and response speed.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention, within the scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
Claims (6)
1. A data center system based on big data acquisition and analysis, comprising:
the data acquisition module is connected with the island databases and acquires data information of the island databases;
the data cleaning module is used for cleaning the data acquired by the data acquisition module;
The data integration module is used for integrating the cleaned multi-source data;
the server is used for storing the integrated data;
the working logic of the data center system is as follows:
S1: the data acquisition module acquires data information from a plurality of island databases;
S2: the data cleaning module cleans the collected data, retaining the data that survives cleaning as standard data and treating the remaining data as abnormal data;
S3: the data integration module integrates the standard data to form a fusion data packet, analyzes the abnormal data, integrates it a second time, and merges it into the fusion data packet;
S4: the integrated fusion data packet is uploaded to the server for storage;
in the step S2, the data cleaning comprises the following steps:
S21: scanning the acquired data to form m data packets of different sources, and classifying each packet according to the classification scheme of its source data to form a plurality of typed data sets S(m,n), where S(m,n) denotes the data set of the n-th type in the m-th data packet, while acquiring the type name of each data set;
S22: stripping all typed data sets out of the data packets to form the standard data, the remainder being the abnormal data;
in the step S3, the data integration method comprises:
S31a: taking the union of the type names of the acquired data sets to obtain the set QA of all type names, QA = {q(1), q(2), …, q(T)}, where q(t) denotes the t-th type name after the union;
S32a: filling the data of the m data packets whose type name matches q(t) into the sub-level of q(t), the data integration being complete once every name in QA has been traversed;
in the step S3, the method for secondarily integrating the abnormal data comprises the following steps:
S31b: determining the island database to which the abnormal data belongs, and analyzing the data composition type of the abnormal data, the type comprising numbers, letters, characters, Chinese characters, or any combination thereof;
S32b: matching, in that island database, the type names whose data composition type is the same as that of the abnormal data; if the number of matched types is zero, the abnormal data is directly judged to be dirty data and deleted;
S33b: traversing all data under each matched type name's sub-level and extracting the data types that contain data discontinuity points (spaces, blank lines, or blank cells); analyzing these data types, calculating the matching degree, and assigning the abnormal data to the type directory with the highest matching degree;
In S33b, the following cases exist:
A1: if the number of data types with data discontinuity points is 1, the abnormal data is filled in directly;
A2: if the number of data types with data discontinuity points is not 1, the data types and the abnormal data are packaged and returned to the entry personnel of the island database for traceable filling, and sent back once filled.
2. The data center system based on big data acquisition and analysis of claim 1, further comprising a data verification module.
3. The data center system based on big data acquisition and analysis of claim 2, wherein the working logic of the data verification module comprises the following steps:
B1: after the data integration is finished, randomly selecting one or more data names from the island database and extracting their sub-level data;
B2: searching the integrated data packet for the corresponding data names and extracting their sub-level data;
B3: judging whether the sub-level data in B2 completely contains the sub-level data in B1;
B4: when it is not completely contained, updating the integrated data packet by supplementing the missing data;
B5: when it is completely contained, randomly selecting one or more data items from the integrated data packet and acquiring the data name to which each belongs;
B6: searching the island database for the selected data and judging whether the data name it belongs to in the island database is consistent with that in the integrated data packet; if consistent, the data is judged to be integrated correctly, and if not, incorrectly.
4. The data center system based on big data acquisition and analysis of claim 1, further comprising a data retrieval module for users to retrieve and display data.
5. The data center system based on big data acquisition and analysis of claim 4, wherein the data retrieval module adopts a fuzzy index algorithm based on data type names, the working logic comprising the following steps:
C1: the user inputs a partial data type name containing g characters;
C2: the item type names in the integrated data packet that contain the character segment are retrieved and returned to the user's client in descending order of matching degree;
in the step C2, the matching degree is calculated as follows:
C21: if the characters input by the user are fully contained in an item type name of the integrated data packet, the matching degree is calculated from a, the number of input characters matching characters of the integrated data packet's item type name, and c, the total number of characters of that item type name;
C22: if the number of characters input by the user is smaller than the number of characters of the item type name and the input characters are not fully contained in it, the matching degree is calculated from a, the number of input characters matching characters of the item type name, b, the number of input characters differing from it, and c, the total number of characters of the name, together with a proportionality coefficient k preset by an administrator;
C23: if the number of characters input by the user is greater than the number of characters of the item type name, the matching degree is calculated from a, b, and c as defined above, together with d, the difference between the number of input characters and the total number of characters of the name; the proportionality coefficients k and δ are preset by an administrator.
6. The data center system based on big data acquisition and analysis of claim 5, wherein in the step C21, the method further comprises D: after the matching results are displayed, acquiring the user's next operation and learning the user's input habits, specifically comprising the following steps:
D1: acquiring the item type name of the data packet opened by the user and equally dividing it into three segments;
D2: judging in which of the three segments the characters input by the user are located, and marking the user as prefix, middle, or suffix accordingly;
D3: matching the user's user name and, at the user's next retrieval, optimizing the retrieval order according to the user's mark; that is, if the user is marked 'prefix', the first of the three segments of the item type name is matched preferentially against the user's input characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311356968.3A CN117370325B (en) | 2023-10-19 | 2023-10-19 | Data center system based on big data acquisition and analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311356968.3A CN117370325B (en) | 2023-10-19 | 2023-10-19 | Data center system based on big data acquisition and analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117370325A CN117370325A (en) | 2024-01-09 |
CN117370325B true CN117370325B (en) | 2024-05-28 |
Family
ID=89401845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311356968.3A Active CN117370325B (en) | 2023-10-19 | 2023-10-19 | Data center system based on big data acquisition and analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117370325B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019153813A1 (en) * | 2018-02-07 | 2019-08-15 | 华南理工大学 | Full-text fuzzy retrieval method for similar chinese characters in ciphertext domain |
CN112395325A (en) * | 2020-11-27 | 2021-02-23 | 广州光点信息科技有限公司 | Data management method, system, terminal equipment and storage medium |
CN113392646A (en) * | 2021-07-07 | 2021-09-14 | 上海软中信息技术有限公司 | Data relay system, construction method and device |
CN114579539A (en) * | 2022-02-28 | 2022-06-03 | 武汉祁联生态科技有限公司 | Master-slave data sharing mode based on ecological environment big data framework |
CN114764535A (en) * | 2022-05-06 | 2022-07-19 | 广东电网有限责任公司 | Power data processing method, device and equipment for simulation and storage medium |
WO2022179123A1 (en) * | 2021-02-24 | 2022-09-01 | 深圳壹账通智能科技有限公司 | Data update and presentation method and apparatus, and electronic device and storage medium |
CN116483810A (en) * | 2022-07-29 | 2023-07-25 | 四创电子股份有限公司 | Data management method based on public security big data processing technical guidelines |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9819444B2 (en) * | 2014-05-09 | 2017-11-14 | Avago Technologies General Ip (Singapore) Pte. Ltd. | Robust line coding scheme for communication under severe external noises |
US20170068891A1 (en) * | 2015-09-04 | 2017-03-09 | Infotech Soft, Inc. | System for rapid ingestion, semantic modeling and semantic querying over computer clusters |
JP6943128B2 (en) * | 2017-10-06 | 2021-09-29 | 株式会社島津製作所 | Analytical database registration device, analytical data collection system, analytical system and analytical database registration method |
US20230207123A1 (en) * | 2021-12-23 | 2023-06-29 | GE Precision Healthcare LLC | Machine learning approach for detecting data discrepancies during clinical data integration |
Non-Patent Citations (1)
Title |
---|
HSMA: A hierarchical schema matching algorithm for heterogeneous Internet of Things data; Guo Shuai; Guo Zhongwen; Qiu Zhijin; Journal of Computer Research and Development; 2018-11-15 (11); full text *
Also Published As
Publication number | Publication date |
---|---|
CN117370325A (en) | 2024-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110489633B (en) | Intelligent brain service system based on library data | |
CN104252445B (en) | Approximate repetitive file detection method and device | |
Kopparapu | Automatic extraction of usable information from unstructured resumes to aid search | |
RU2680746C2 (en) | Method and device for developing web page quality model | |
WO2012034733A2 (en) | Method and arrangement for handling data sets, data processing program and computer program product | |
CN110336838B (en) | Account abnormity detection method, device, terminal and storage medium | |
US7627551B2 (en) | Retrieving case-based reasoning information from archive records | |
CN108763380A (en) | Brand recognition search method, device, computer equipment and storage medium | |
CN111563176A (en) | Cartoon management system based on inertia big data | |
CN106997350A (en) | A kind of method and device of data processing | |
CN107818175A (en) | A kind of law class case problem intelligently prejudges system and method | |
CN110825805A (en) | Data visualization method and device | |
CN113254572B (en) | Electronic document classification supervision system based on cloud platform | |
CN110807108A (en) | Asian face data automatic collection and cleaning method and system | |
CN117370325B (en) | Data center system based on big data acquisition and analysis | |
CN112786124B (en) | Problem troubleshooting method and device, storage medium and equipment | |
CN110232071A (en) | Search method, device and storage medium, the electronic device of drug data | |
CN113779261A (en) | Knowledge graph quality evaluation method and device, computer equipment and storage medium | |
CN111259223B (en) | News recommendation and text classification method based on emotion analysis model | |
CN113284577A (en) | Medicine prediction method, device, equipment and storage medium | |
CN111223533B (en) | Medical data retrieval method and system | |
JP6961148B1 (en) | Information processing system and information processing method | |
TWI684950B (en) | Species data analysis method, system and computer program product | |
CN115098804B (en) | Webpage search history record intelligent management system based on big data analysis | |
CN117473074B (en) | Judicial case intelligent information matching system and method based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||