CN114610833A - Data center unstructured data access method - Google Patents

Data center unstructured data access method

Info

Publication number
CN114610833A
Authority
CN
China
Prior art keywords
data
module
unstructured
backup
structured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210070479.0A
Other languages
Chinese (zh)
Inventor
马海鑫
张伟
谢虎
谢型浪
余杰文
宋学清
韩吉安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Power Grid Digital Grid Research Institute Co Ltd
Original Assignee
Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern Power Grid Digital Grid Research Institute Co Ltd
Priority to CN202210070479.0A
Publication of CN114610833A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data center unstructured data access method. A control module sends an acquisition instruction to a data acquisition module, the acquired data is analyzed and processed by a data analysis module, and the data is then verified by a data quality inspection and verification module. The beneficial effects are as follows: data is acquired through the data acquisition module and then checked through the data quality inspection and verification module, which improves data processing efficiency; the data is further processed through the data conversion module and the data compression module, which improves the stability of the operation flow; and data received in real time is processed and then pushed directly to the client through the data pushing module, which realizes real-time updating of the data, meets the client's real-time requirements, and improves user experience and friendliness.

Description

Data center unstructured data access method
Technical Field
The invention belongs to the field of unstructured data processing, and particularly relates to an unstructured data access method for a data center.
Background
Unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in the two-dimensional logical tables of a database. It includes office documents in all formats, text, pictures, XML, HTML, various reports, images, and audio/video information. Data in a computerized information system is divided into structured data and unstructured data. Unstructured data comes in many formats and follows many standards, and it is technically harder to standardize and understand than structured information. Its storage, retrieval, distribution and utilization therefore require more intelligent IT technologies, such as mass storage, intelligent retrieval, knowledge mining, content protection, and value-added development and utilization of information. Unstructured data can be obtained almost anywhere: from the e-mail, chat records and collected survey results within a company, from comments on personal websites, from comments in a customer relationship management system, or from text fields in the applications a person uses.
Disclosure of Invention
The object of the present invention is to provide a data center unstructured data access method that solves the problems mentioned in the background art.
In order to solve the above problems, the present invention provides the following technical solution:
A data center unstructured data access method comprises the following steps:
S1, sending an acquisition instruction to the data acquisition module through the control module, and analyzing and processing the data acquired by the data acquisition module through the data analysis module;
S2, verifying the data through the data quality inspection and verification module; when the data passes verification, converting the data type through the data conversion module; when the data fails verification, cleaning the unqualified data through the data cleaning module;
S3, compressing the data processed by the data conversion module through the data compression module, then centrally storing the data through the data storage module, and backing up the data through the data backup module;
S4, securely protecting and managing the data through the data security management module, and then pushing the data directly to the client through the data pushing module, so that the data is updated in real time.
Preferably, the data acquisition module determines, from multi-source file servers, the target file server corresponding to the acquisition task, obtains the access path information of the target file server from the changed structured data, accesses the target file server based on the access path information to acquire the unstructured data stored on it, collects the unstructured data into a visual analysis system, provides a unified view, and organizes the data into final service themes presented on a display screen.
Preferably, the data conversion module uses semi-structured data as a transition and completes the format conversion of process data by converting unstructured data step by step, first into semi-structured data and then into structured data. In the conversion from unstructured data to semi-structured data, WORD documents are used as the data source, different data extraction strategies are constructed for different card types and formats, and different XML templates are selected at output time so that the data is output in a specified XML document format. In the conversion from semi-structured data to structured data, the conversion is completed by parsing the XML document and establishing a mapping relation between the XML document and the data information.
Preferably, when the data compression module performs data compression, the first sampled frame of data is used as the base sample, and the second frame is compared with the first frame by the compression comparison module to obtain the variables that changed between the two frames and the change value of each such variable. Likewise, when the Nth and (N-1)th sampled frames are processed, the two frames are compared to obtain the changed variables; each changed variable is mapped into a fixed memory space by a hash algorithm, and the change difference value corresponding to the variable is stored in the corresponding memory space.
Preferably, the data storage module identifies the unstructured data and generates a primary label, stores the unstructured data in blocks based on the primary label, generates a secondary label based on mined features, searches each block storage area based on the secondary label and generates a mapping, and stores the mapping relation in a second storage area.
Preferably, the data backup module calculates the performance weight of each production server from the performance indexes of that server. When one of the production servers receives a backup task for backing up data in the shared storage to the back-end server, a backup process is created for the task and a corresponding backup strategy is generated. The backup strategy is decomposed into a plurality of sub-processes, which are distributed among the production servers according to their performance weights. The sub-processes on each production server are executed according to their respective backup strategies, each backing up its part of the data in the shared storage to the back-end server.
Preferably, the data security management module encapsulates the communication protocol between the user terminal and the file transmission module of the message service module, so as to improve the storage security of the information.
Preferably, the data pushing module processes the data received in real time and then pushes it directly to the client, so that the data is updated in real time, the client's real-time requirements are met, and user experience and friendliness are improved.
Preferably, the data analysis module, the data acquisition module, the data quality inspection and verification module, the data conversion module, the data compression module, the data storage module, the data backup module, the data security management module, the data pushing module and the data cleaning module are all electrically connected with the control module.
The invention has the following beneficial effects: data is acquired through the data acquisition module and then checked through the data quality inspection and verification module, which improves data processing efficiency; the data is further processed through the data conversion module and the data compression module, which improves the stability of the operation flow; and data received in real time is processed and then pushed directly to the client through the data pushing module, which realizes real-time updating of the data, meets the client's real-time requirements, and improves user experience and friendliness.
Drawings
For ease of illustration, the invention is described in detail by the following detailed description and the accompanying drawings.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a block diagram of the present invention.
Detailed Description
As shown in FIGS. 1-2, the following technical solution is adopted in the specific embodiment of the present invention:
Example:
A data center unstructured data access method comprises the following steps:
S1, sending an acquisition instruction to the data acquisition module through the control module, and analyzing and processing the data acquired by the data acquisition module through the data analysis module;
S2, verifying the data through the data quality inspection and verification module; when the data passes verification, converting the data type through the data conversion module; when the data fails verification, cleaning the unqualified data through the data cleaning module;
S3, compressing the data processed by the data conversion module through the data compression module, then centrally storing the data through the data storage module, and backing up the data through the data backup module;
S4, securely protecting and managing the data through the data security management module, and then pushing the data directly to the client through the data pushing module, so that the data is updated in real time.
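As an illustration only, the order of steps S1 to S4 can be sketched in Python. Everything named here (the `Pipeline` class, its methods, the sample records, the use of `zlib`, and the rule that an empty record is unqualified) is a hypothetical stand-in chosen for the sketch and is not specified by the method; the security management part of S4 is omitted.

```python
from dataclasses import dataclass, field
import zlib


@dataclass
class Pipeline:
    """Hypothetical control module that drives steps S1 to S4 in order."""
    store: list = field(default_factory=list)   # stands in for the data storage module
    backup: list = field(default_factory=list)  # stands in for the data backup module
    pushed: list = field(default_factory=list)  # stands in for the client

    def acquire(self):
        # S1: acquisition; fixed sample records replace a real file server.
        return ["  report A ", "", "memo B"]

    def analyze(self, record):
        # S1: analysis; in this sketch it only trims whitespace.
        return record.strip()

    def is_qualified(self, record):
        # S2: quality check; an empty record counts as unqualified (assumed rule).
        return bool(record)

    def run(self):
        cleaned = 0
        for raw in self.acquire():
            record = self.analyze(raw)
            if not self.is_qualified(record):
                cleaned += 1  # S2: the cleaning step simply drops the record here
                continue
            converted = {"text": record}                      # S2: type conversion
            packed = zlib.compress(repr(converted).encode())  # S3: compression
            self.store.append(packed)                         # S3: central storage
            self.backup.append(packed)                        # S3: backup
            self.pushed.append(converted)                     # S4: push to the client
        return cleaned


pipeline = Pipeline()
dropped = pipeline.run()
```

The point of the sketch is the branching in S2: a record either continues to conversion, compression, storage, backup and push, or is handed to cleaning and goes no further.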
The data acquisition module determines, from multi-source file servers, the target file server corresponding to the acquisition task, obtains the access path information of the target file server from the changed structured data, accesses the target file server based on the access path information to acquire the unstructured data stored on it, collects the data into a visual analysis system, provides a unified view, and organizes the data into final service themes presented on a display screen.
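A minimal sketch of the lookup-then-fetch part of this acquisition step, assuming that the structured data is a list of records with hypothetical `server` and `access_path` fields and that a local directory stands in for the target file server. The visual analysis system is not modelled.

```python
import tempfile
from pathlib import Path


def collect(task, path_records):
    """Return {file name: bytes} for the file server named by the task."""
    # Find the structured record that holds the access path of the target server.
    record = next(r for r in path_records if r["server"] == task["server"])
    root = Path(record["access_path"])
    # Read every file under the access path as opaque, unstructured bytes.
    return {p.name: p.read_bytes() for p in sorted(root.iterdir()) if p.is_file()}


# A temporary directory plays the role of the file server "fs-01".
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("line inspection notes")
    records = [{"server": "fs-01", "access_path": tmp}]
    files = collect({"server": "fs-01"}, records)
```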
The data conversion module uses semi-structured data as a transition and completes the format conversion of process data by converting unstructured data step by step, first into semi-structured data and then into structured data. In the conversion from unstructured data to semi-structured data, WORD documents are used as the data source, different data extraction strategies are constructed for different card types and formats, and different XML templates are selected at output time so that the data is output in a specified XML document format. In the conversion from semi-structured data to structured data, the conversion is completed by parsing the XML document and establishing a mapping relation between the XML document and the data information.
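The two-stage conversion can be sketched as follows, assuming the text of the WORD document has already been extracted into a string. The card type `process_card`, the regular expressions used as an extraction strategy, the XML tag names, and the tag-to-column mapping are all hypothetical examples, not part of the method as described.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction strategies, one per card type.
STRATEGIES = {
    "process_card": {"part": r"Part:\s*(\S+)", "step": r"Step:\s*(\d+)"},
}
# Hypothetical mapping from XML tags to structured column names and types.
MAPPING = {"part": ("part_no", str), "step": ("step_no", int)}


def to_xml(text, card_type):
    """Stage 1: unstructured text to a semi-structured XML document."""
    root = ET.Element("card", {"type": card_type})
    for tag, pattern in STRATEGIES[card_type].items():
        match = re.search(pattern, text)
        if match:
            ET.SubElement(root, tag).text = match.group(1)
    return ET.tostring(root, encoding="unicode")


def to_row(xml_text):
    """Stage 2: parse the XML and map each element to a typed column."""
    row = {}
    for child in ET.fromstring(xml_text):
        column, cast = MAPPING[child.tag]
        row[column] = cast(child.text)
    return row


xml_doc = to_xml("Part: A-17 Step: 3", "process_card")
row = to_row(xml_doc)
```

Keeping the XML document as an intermediate form means a new card type only needs a new extraction strategy, while the second stage stays unchanged as long as the tags are covered by the mapping.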
When the data compression module compresses data, the first sampled frame of data is used as the base sample, and the second frame is compared with the first frame by the compression comparison module to obtain the variables that changed between the two frames and the change value of each such variable. Likewise, when the Nth and (N-1)th sampled frames are processed, the two frames are compared to obtain the changed variables; each changed variable is mapped into a fixed memory space by a hash algorithm, and the change difference value corresponding to the variable is stored in the corresponding memory space.
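The frame-difference scheme can be sketched as below. The sketch assumes that each frame is a dictionary of numeric variables with the same keys, uses CRC32 as the hash, fixes the memory space at 64 slots, and resolves hash collisions by linear probing; none of these choices is stated in the description.

```python
import zlib

SLOTS = 64  # size of the fixed memory space (an assumed value)


def slot_of(name):
    # Stable hash of the variable name, reduced to a slot index.
    return zlib.crc32(name.encode()) % SLOTS


def put(memory, name, change):
    # Linear probing keeps two colliding variables from overwriting each other.
    i = slot_of(name)
    for _ in range(SLOTS):
        if memory[i] is None or memory[i][0] == name:
            memory[i] = (name, change)
            return
        i = (i + 1) % SLOTS
    raise MemoryError("fixed memory space is full")


def frame_delta(prev, curr):
    """Store only the variables that changed between two frames."""
    memory = [None] * SLOTS
    for name, value in curr.items():
        change = value - prev[name]
        if change != 0:
            put(memory, name, change)
    return memory


def apply_delta(prev, memory):
    """Rebuild the next frame from the previous frame and the stored changes."""
    frame = dict(prev)
    for entry in memory:
        if entry is not None:
            name, change = entry
            frame[name] += change
    return frame


frame1 = {"voltage": 220.0, "current": 5.0, "temp": 40.0}
frame2 = {"voltage": 221.5, "current": 5.0, "temp": 39.0}
delta = frame_delta(frame1, frame2)
restored = apply_delta(frame1, delta)
```

Only `voltage` and `temp` occupy slots here, because `current` did not change; the saving grows with the share of variables that stay constant between frames.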
The data storage module identifies the unstructured data and generates a primary label, stores the unstructured data in blocks based on the primary label, generates a secondary label based on mined features, searches each block storage area based on the secondary label and generates a mapping, and stores the mapping relation in a second storage area.
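One possible reading of this labelled storage scheme, as a sketch: the primary label is taken from the file extension and the secondary labels are the longer words in the text. Both rules, and the `LabelStore` class itself, are assumptions made only for the illustration.

```python
def primary_label(name):
    # Assumed rule: the file extension identifies the kind of data.
    return name.rsplit(".", 1)[-1].lower()


def secondary_labels(text):
    # Assumed mined feature: distinct lowercase words longer than three letters.
    return {w for w in text.lower().split() if len(w) > 3}


class LabelStore:
    def __init__(self):
        self.blocks = {}   # block storage areas: primary label -> {name: text}
        self.mapping = {}  # second storage area: secondary label -> locations

    def put(self, name, text):
        # Store the item in the block selected by its primary label.
        self.blocks.setdefault(primary_label(name), {})[name] = text

    def index(self, label):
        # Search every block for items carrying the secondary label,
        # then keep the resulting mapping in the second storage area.
        hits = {
            (block, name)
            for block, items in self.blocks.items()
            for name, text in items.items()
            if label in secondary_labels(text)
        }
        self.mapping[label] = hits
        return sorted(hits)


store = LabelStore()
store.put("a.txt", "transformer fault report")
store.put("b.log", "breaker fault trace")
store.put("c.txt", "meeting notes")
fault_locations = store.index("fault")
```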
The data backup module calculates the performance weight of each production server from the performance indexes of that server. When one of the production servers receives a backup task for backing up data in the shared storage to the back-end server, a backup process is created for the task and a corresponding backup strategy is generated. The backup strategy is divided into a plurality of sub-processes, which are distributed among the production servers according to their performance weights. The sub-processes on each production server are executed according to their respective backup strategies, each backing up its part of the data in the shared storage to the back-end server.
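The weighted distribution of sub-processes can be sketched as follows. The description does not say how a performance weight is derived from the performance indexes, so the sketch assumes a single score per server (higher meaning more spare capacity) and normalises the scores into weights; the server names and chunk names are hypothetical.

```python
from fractions import Fraction


def performance_weights(scores):
    """scores: {server: integer score}; returns weights that sum to 1."""
    total = sum(scores.values())
    return {server: Fraction(score, total) for server, score in scores.items()}


def distribute(subprocesses, weights):
    """Assign sub-processes to servers in proportion to their weights."""
    plan = {server: [] for server in weights}
    for task in subprocesses:
        # Pick the server whose load, after taking this task, would be
        # lowest relative to its weight. Fractions keep comparisons exact.
        server = min(plan, key=lambda s: (len(plan[s]) + 1) / weights[s])
        plan[server].append(task)
    return plan


weights = performance_weights({"p1": 6, "p2": 3, "p3": 1})
plan = distribute([f"chunk-{i}" for i in range(10)], weights)
counts = {server: len(tasks) for server, tasks in plan.items()}
```

With these scores, ten sub-processes are split 6, 3 and 1, so the strongest server carries the largest share of the backup.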
The data security management module encapsulates the communication protocol between the user terminal and the file transmission module of the message service module, which improves the storage security of the information.
The data pushing module processes the data received in real time and then pushes it directly to the client, so that the data is updated in real time, the client's real-time requirements are met, and user experience and friendliness are improved.
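The push behaviour can be illustrated with a small publish-subscribe sketch. The `PushModule` class, its callback-based clients, and the whitespace trimming that stands in for processing are assumptions; a real deployment would push over a network connection.

```python
class PushModule:
    """Hypothetical push module: clients register a callback and receive
    each message as soon as it has been processed, without polling."""

    def __init__(self):
        self.clients = []

    def subscribe(self, callback):
        self.clients.append(callback)

    def on_data(self, raw):
        message = {"value": raw.strip()}  # stand-in for the processing step
        for deliver in self.clients:
            deliver(message)              # push directly to every client
        return message


inbox = []
push = PushModule()
push.subscribe(inbox.append)
push.on_data(" 42 ")
```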
The data analysis module, the data acquisition module, the data quality inspection and verification module, the data conversion module, the data compression module, the data storage module, the data backup module, the data security management module, the data pushing module and the data cleaning module are all electrically connected with the control module.
While there have been shown and described what are at present considered to be the fundamental principles of the invention and its essential features and advantages, it will be understood by those skilled in the art that the invention is not limited by the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims (9)

1. A data center unstructured data access method, characterized by comprising the following steps:
S1, sending an acquisition instruction to the data acquisition module through the control module, and analyzing and processing the data acquired by the data acquisition module through the data analysis module;
S2, verifying the data through the data quality inspection and verification module; when the data passes verification, converting the data type through the data conversion module; when the data fails verification, cleaning the unqualified data through the data cleaning module;
S3, compressing the data processed by the data conversion module through the data compression module, then centrally storing the data through the data storage module, and backing up the data through the data backup module;
S4, securely protecting and managing the data through the data security management module, and then pushing the data directly to the client through the data pushing module, so that the data is updated in real time.
2. The data center unstructured data access method according to claim 1, wherein the data acquisition module determines a target file server corresponding to the acquisition task from a source file server, obtains access path information of the target file server from the changed structured data, accesses the target file server based on the access path information to acquire the unstructured data stored on the target file server, collects the unstructured data into a visual analysis system, provides a unified view, and organizes the data into final service themes presented on a display screen.
3. The data center unstructured data access method according to claim 1, wherein the data conversion module uses semi-structured data as a transition and completes the format conversion of process data by converting unstructured data step by step, first into semi-structured data and then into structured data; in the conversion from unstructured data to semi-structured data, WORD documents are used as the data source, different data extraction strategies are constructed for different card types and formats, and different XML templates are selected at output time so that the data is output in a specified XML document format; in the conversion from semi-structured data to structured data, the conversion is completed by parsing the XML document and establishing a mapping relation between the XML document and the data information.
4. The data center unstructured data access method according to claim 1, wherein, when the data compression module performs data compression, the first sampled frame of data is used as the base sample, and the second frame is compared with the first frame by the compression comparison module to obtain the variables that changed between the two frames and the change value of each such variable; likewise, when the Nth and (N-1)th sampled frames are processed, the two frames are compared to obtain the changed variables, each changed variable is mapped into a fixed memory space by a hash algorithm, and the change difference value corresponding to the variable is stored in the corresponding memory space.
5. The data center unstructured data access method according to claim 1, wherein the data storage module identifies the unstructured data and generates a primary label, stores the unstructured data in blocks based on the primary label, generates a secondary label based on mined features, searches each block storage area based on the secondary label and generates a mapping, and stores the mapping relation in a second storage area.
6. The data center unstructured data access method according to claim 1, wherein the data backup module calculates the performance weight of each production server from the performance indexes of that server; when one of the production servers receives a backup task for backing up data in the shared storage to a back-end server, a backup process is created for the task and a corresponding backup strategy is generated; the backup strategy is decomposed into a plurality of sub-processes, which are distributed among the production servers according to their performance weights; and the sub-processes on each production server are executed according to their respective backup strategies, each backing up its part of the data in the shared storage to the back-end server.
7. The data center unstructured data access method according to claim 1, wherein the data security management module encapsulates the communication protocol between the user terminal and the file transmission module of the message service module, so as to improve the storage security of the information.
8. The data center unstructured data access method according to claim 1, wherein the data pushing module processes the data received in real time and then pushes it directly to the client, so that the data is updated in real time, the client's real-time requirements are met, and user experience and friendliness are improved.
9. The data center unstructured data access method according to claim 1, wherein the data analysis module, the data acquisition module, the data quality inspection and verification module, the data conversion module, the data compression module, the data storage module, the data backup module, the data security management module, the data pushing module and the data cleaning module are all electrically connected with the control module.
CN202210070479.0A 2022-01-21 2022-01-21 Data center unstructured data access method Pending CN114610833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210070479.0A CN114610833A (en) 2022-01-21 2022-01-21 Data center unstructured data access method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210070479.0A CN114610833A (en) 2022-01-21 2022-01-21 Data center unstructured data access method

Publications (1)

Publication Number Publication Date
CN114610833A (en) 2022-06-10

Family

ID=81858240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210070479.0A Pending CN114610833A (en) 2022-01-21 2022-01-21 Data center unstructured data access method

Country Status (1)

Country Link
CN (1) CN114610833A (en)

Similar Documents

Publication Publication Date Title
US11392550B2 (en) System and method for investigating large amounts of data
US20200402058A1 (en) Systems and methods for real-time processing of data streams
CN110866040B (en) User portrait generation method, device and system
Siddiqui et al. Pseudo-cache-based IoT small files management framework in HDFS cluster
CN112632129A (en) Code stream data management method, device and storage medium
US11256659B1 (en) Systems and methods for aggregating and displaying data from multiple data sources
CN112988770A (en) Method and device for updating serial number, electronic equipment and storage medium
CN114090529A (en) Log management method, device, system and storage medium
US20230153357A1 (en) Method of processing an observation information, electronic device and storage medium
CN114610833A (en) Data center unstructured data access method
CN113641769B (en) Data processing method and device
CN114691769A (en) Unstructured data processing method and device of power monitoring system
US20200167326A1 (en) System and method for acting on potentially incomplete data
CN112306992A (en) Big data platform based on internet
CN112597207B (en) Metadata management system
CN116610531B (en) Method for collecting data embedded points and requesting image uploading data based on code probe
CN112187623B (en) Information release management system
Chen et al. Internet Engineering Task Force C. Yang, Ed. Internet-Draft Y. Liu&Y. Wang&SY. Pan, Ed. Intended status: Standards Track South China University of Technology Expires: November 28, 2020 C. Chen Inspur
CN117472995A (en) Log data processing method and device and electronic equipment
CN117472693A (en) Buried point data processing method, system, equipment and storage medium based on data lake
CN117541165A (en) Comprehensive management method for case and zone
CN117171394A (en) Data dynamic processing method and device for network collaborative manufacturing platform
CN117094467A (en) Data auditing method and device, storage medium and electronic equipment
CN116304352A (en) Message pushing method, device, equipment and storage medium
CN114936823A (en) Time efficiency subsection monitoring early warning method, device, equipment and storage medium for distribution center

Legal Events

Date Code Title Description
PB01 Publication