CN113434492A - Data detection method and device, storage medium and electronic device - Google Patents

Info

Publication number
CN113434492A
Authority
CN
China
Prior art keywords
data
file
storage space
time
processing time
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110688742.8A
Other languages
Chinese (zh)
Inventor
董壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202110688742.8A priority Critical patent/CN113434492A/en
Publication of CN113434492A publication Critical patent/CN113434492A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1737 Details of further file system functions for reducing power consumption or coping with limited storage space, e.g. in mobile devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data detection method, a data detection device, a storage medium and an electronic device, wherein the method comprises the following steps: scanning at least one storage space in the data cluster for storing data; sequentially reading the latest processing time of each file in the directory of the storage space, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file; and if the difference value between the latest processing time and the initial running time of any file exceeds a threshold value, deleting the metadata information of the file.

Description

Data detection method and device, storage medium and electronic device
Technical Field
The invention relates to the field of data processing, in particular to a data detection method, a data detection device, a data storage medium and an electronic device.
Background
At present, as service scenarios multiply, the number of extract, transform and load (ETL) tasks keeps growing. The space utilization of the big data cluster already exceeds sixty percent, and the pressure on the NameNode is gradually increasing. If the cluster is not cleaned up in time, servers will have to be added to provide more storage space, and more memory will have to be allocated to the NameNode to relieve the pressure.
To address this, a big data cluster administrator can manually check data storage and data usage table by table with Shell commands, compile the results into a spreadsheet (Excel), match a responsible person to each entry by database name and table name (a match that is often inaccurate), and notify that person to process the data. The responsible person then reports the processed data back to the administrator, who checks it. This approach consumes a great deal of time and effort, so data detection is inefficient.
No effective solution has yet been proposed for the technical problem of inefficient data detection in the related art.
Disclosure of Invention
The embodiment of the invention provides a data detection method, a data detection device, a storage medium and an electronic device, and at least solves the technical problem of low efficiency of data detection.
According to an embodiment of the present invention, a method for detecting data is provided. The method can comprise the following steps: scanning at least one storage space in the data cluster for storing data; sequentially reading the latest processing time of each file in the directory of the storage space, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file; and if the difference value between the latest processing time and the initial running time of any file exceeds a threshold value, deleting the metadata information of the file.
In one exemplary embodiment, when at least one storage space for storing data in a data cluster is scanned, the amount of data stored in the storage space is obtained, and the occupied space of the storage space is determined based on the amount of data stored in the storage space.
In an exemplary embodiment, where the storage space is a data table, determining the occupied space of the storage space based on the amount of data stored in the storage space includes: traversing an internal table path of the data table, and determining a hierarchy of the data table based on the internal table path; counting the number of files under a table-level directory corresponding to the hierarchy of the data table; and counting the occupied space under the table-level directory of the data table based on the number of the files under the table-level directory.
In one exemplary embodiment, if the difference between the latest processing time and the initial running time of any file exceeds the threshold, the storage space used to store the file is determined to be a useless area, and prompt information is sent to the front-end operator.
In one exemplary embodiment, the useless area is marked, and the metadata set is queried for metadata information associated with the useless area; if such information exists, the step of deleting the metadata information associated with the useless area is performed.
In an exemplary embodiment, after the step of deleting the metadata information associated with the useless area is performed, the metadata information of the undeleted files in the metadata set is traversed, and if an undeleted file does not exist in the directory of the storage space, it is further determined whether the metadata information of that file needs to be deleted.
In an exemplary embodiment, after the scan of any one storage space is completed, the scan of the other storage spaces begins and continues until the records stored in the predetermined time period have been scanned, and the cached information of the subsequent records of the storage space is written into the history table.
According to another embodiment of the invention, a device for detecting data is also provided. The apparatus may include: the scanning unit is used for scanning at least one storage space for storing data in the data cluster; the reading unit is used for sequentially reading the latest processing time of each file in the directory of the storage space, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file; and the deleting unit is used for deleting the metadata information of any file when the difference value between the latest processing time and the initial running time of the file exceeds a threshold value.
According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium. The computer-readable storage medium may include a stored program, wherein the program is executed by a processor to perform the above-mentioned data detection method.
According to another aspect of the embodiment of the invention, an electronic device is also provided. The electronic device may comprise a memory in which a computer program is stored and a processor arranged to execute the above mentioned method of detecting data by means of the computer program.
In the embodiment of the invention, at least one storage space for storing data in a data cluster is scanned; the latest processing time of each file in the directory of the storage space is read in sequence, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file; and if the difference between the latest processing time and the initial running time of any file exceeds a threshold, the metadata information of the file is deleted. That is, the application scans at least one storage space used for storing data in the data cluster and, when the difference between the latest processing time and the initial running time of any file in the storage space exceeds the threshold, automatically deletes the metadata information of that file. This relieves data storage pressure and removes the need to check storage and usage table by table with HDFS Shell commands and to process the data by hand. The time and labor costs of data detection therefore fall, which solves the technical problem of inefficient data detection and achieves the technical effect of improving detection efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a computer terminal of a data detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of detecting data according to an embodiment of the present invention;
FIG. 3 is an interaction diagram of a method for scanning big data cluster garbage and tablespaces based on Java + Spring Boot + MySQL according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the task presentation of a program schedule according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating usage of storage resources according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a garbage table presentation, according to an embodiment of the present invention;
FIG. 7 is a schematic illustration of a cold data presentation, according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the Hive table growth Top 10 according to an embodiment of the invention;
FIG. 9 is a schematic diagram of a Hive spreadsheet Top10 according to an embodiment of the invention;
FIG. 10 is a diagram illustrating a garbage cleaning result, in accordance with an embodiment of the present invention;
FIG. 11 is a block diagram of a data detection apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method provided by the embodiment of the application can be executed in a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of a hardware structure of the computer terminal of the data detection method according to the embodiment of the present invention. As shown in FIG. 1, the computer terminal may include one or more (only one shown in FIG. 1) processors 102 (the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and in an exemplary embodiment, may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration with functionality equivalent to or greater than that shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the data detection method in the embodiment of the present invention. The processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, thereby implementing the above-mentioned method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet wirelessly.
In this embodiment, a data detection method is provided, which is applied to the computer terminal described above, and fig. 2 is a flowchart of a data detection method according to an embodiment of the present invention. The method may comprise the steps of:
Step S202, at least one storage space for storing data in the data cluster is scanned.
In the technical solution provided in step S202 of the present invention, the data cluster may be a Hadoop Distributed File System (HDFS) cluster that provides at least one storage space for storing data. The storage space may be a data table or a comparable storage unit, for example a table of the data warehouse tool Hive, a table of the columnar distributed store Kudu, a table of the distributed, column-oriented open source database HBase, or a collection of the enterprise-level search application server Solr.
In this embodiment, the program may be built with Java + Spring Boot + MySQL. Scanning at least one storage space for storing data in the data cluster may mean scanning the storage spaces in a given order to obtain the amount of data stored in each, for example: Hive tables, Solr, HBase tables, then Kudu tables.
Optionally, this embodiment may scan at least one storage space for storing data in the data cluster on a schedule, as needed.
Optionally, before at least one storage space for storing data in the data cluster is scanned, the current record table (md_t_table_space), which stores the current scan results, may be cleared, and the current day's records may be deleted from the history record table (md_t_table_space_his), which stores the historical scan results. This ensures that the tables contain no duplicate data when the program runs several times on the same day. The current record table and the history record table may be MySQL tables with identical structures. The current record table stores the scan results of the current time period and can be displayed on the front end through an enterprise instant messaging platform. The history record table stores the scan results of historical time periods, so that changes in the data can be analyzed along different dimensions, for example by chart or by database, and can likewise be displayed on the front end through the enterprise instant messaging platform.
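The reset of the two record tables before a scan can be sketched as follows. This is a minimal illustration in Python that uses the standard-library sqlite3 module as a stand-in for MySQL; the column names (db_name, table_name, size_mb, scan_date) and the function names are assumptions made for the example, since the text only states that the two tables have identical structures.

```python
import sqlite3
from datetime import date

# Hypothetical shared column layout; the source does not list the columns.
COLUMNS = "db_name TEXT, table_name TEXT, size_mb REAL, scan_date TEXT"


def create_tables(conn):
    """Create the current and history tables with identical structures."""
    for name in ("md_t_table_space", "md_t_table_space_his"):
        conn.execute(f"CREATE TABLE IF NOT EXISTS {name} ({COLUMNS})")


def reset_scan_tables(conn, run_date):
    """Clear the current table and remove only today's history rows.

    Running this before every scan keeps a same-day re-run from
    leaving duplicate rows in either table.
    """
    with conn:  # commit both deletes together, roll back on error
        conn.execute("DELETE FROM md_t_table_space")
        conn.execute(
            "DELETE FROM md_t_table_space_his WHERE scan_date = ?",
            (run_date.isoformat(),),
        )
```

Rows from earlier days stay in the history table, so trend charts built on it are unaffected by a re-run.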
Step S204, the latest processing time of each file in the directory of the storage space is read in sequence.
In the technical solution provided by step S204 of the present invention, after determining the occupied space of the storage space based on the amount of data stored in the storage space, the latest processing time of each file in the directory of the storage space may be sequentially read, where the latest processing time includes at least one of: the last modification time and the last access time of the file.
In this embodiment, the directory of the storage space has files therein, the directory may be a table-level directory, and all files in the directory may be traversed to sequentially read the latest processing time of each file in the directory, where the latest processing time may include the latest modification time and the latest access time of the file. The latest modification time is the file modification time closest to the current time, and the latest access time is the file access time closest to the current time, that is, the latest browsing time.
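The traversal described above can be sketched in Python. The example walks a local directory with os.walk and os.stat as a stand-in for an HDFS table-level directory; the FileTimes record and the read_file_times function are hypothetical names introduced for illustration, and a real deployment would read the same two timestamps through an HDFS client instead.

```python
import os
from datetime import datetime, timezone
from typing import List, NamedTuple


class FileTimes(NamedTuple):
    """Latest modification and access time of one file (hypothetical record)."""
    path: str
    modified: datetime
    accessed: datetime


def read_file_times(directory: str) -> List[FileTimes]:
    """Visit every file under the directory and read its two timestamps."""
    results = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):  # fixed order, so files are read in sequence
            path = os.path.join(root, name)
            info = os.stat(path)
            results.append(
                FileTimes(
                    path=path,
                    modified=datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                    accessed=datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
                )
            )
    return results
```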
Step S206, if the difference between the latest processing time and the initial running time of any file exceeds the threshold, the metadata information of the file is deleted.
In the technical solution provided by step S206 of the present invention, after the latest processing time of each file in the directory of the storage space is read in sequence, the difference between the latest processing time of each file and the initial running time is obtained, where the initial running time is the time at which the program started running, and the difference may be measured in days. The embodiment may determine whether this difference exceeds a threshold for each file in the directory of the storage space, for example, whether the difference between the latest modification time and the initial running time of each file exceeds the threshold, and whether the difference between the latest access time and the initial running time of each file exceeds the threshold.
If there is a file whose difference between the latest processing time and the initial running time exceeds a threshold, for example, if there is a file whose difference between the latest modification time and the initial running time exceeds a threshold and whose difference between the latest access time and the initial running time exceeds a threshold, the metadata information of the file may be deleted, and the metadata information of the file may be table metadata information.
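The threshold comparison can be sketched as follows. This is an illustrative Python example and not the patented implementation: the function names are hypothetical, the threshold value is left to the caller because the text does not fix one, and the check requires both timestamps to exceed the threshold, which follows the example given in the paragraph above.

```python
from datetime import datetime


def days_since(event_time: datetime, run_start: datetime) -> int:
    """Whole days between an event and the program's initial running time."""
    return (run_start - event_time).days


def exceeds_threshold(modified: datetime, accessed: datetime,
                      run_start: datetime, threshold_days: int) -> bool:
    """True when both the last modification and the last access are
    older than the threshold, measured from the initial running time."""
    return (
        days_since(modified, run_start) > threshold_days
        and days_since(accessed, run_start) > threshold_days
    )
```

A variant that treats either timestamp alone as sufficient would replace `and` with `or`; the text allows the latest processing time to be at least one of the two.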
Optionally, the embodiment may provide the scan results of the storage space to the front end for display through the smart home open platform, prompting developers to pay attention to and clean up useless tables and tables whose data volume grows abnormally. This saves the cluster administrator's time and effort, frees HDFS storage space in the big data cluster, relieves the pressure on the NameNode, and reduces server hardware cost.
In the technical solution provided in the foregoing steps S202 to S206 of the present application, at least one storage space for storing data in a data cluster is scanned; the latest processing time of each file in the directory of the storage space is read in sequence, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file; and if the difference between the latest processing time and the initial running time of any file exceeds a threshold, the metadata information of the file is deleted. That is, the application scans at least one storage space used for storing data in the data cluster and, when the difference between the latest processing time and the initial running time of any file in the storage space exceeds the threshold, automatically deletes the metadata information of that file. This relieves data storage pressure and removes the need to check storage and usage table by table with HDFS Shell commands and to process the data by hand. The time and labor costs of data detection therefore fall, which solves the technical problem of inefficient data detection and achieves the technical effect of improving detection efficiency.
The above-described method of this embodiment is further described below.
As an optional implementation manner, in step S202, when scanning at least one storage space for storing data in the data cluster, the method further includes: acquiring the quantity of data stored in a storage space; and determining the occupied space of the storage space based on the quantity of the data stored in the storage space.
The embodiment can scan at least one storage space for storing data in the data cluster on a schedule, as needed, and obtain the amount of data stored in the storage space. Because the stored data occupies space in the storage space, the embodiment can determine the current occupied space of the storage space based on the amount of data stored in it, determine the usage of the storage space from the occupied space, and provide that usage to the front end for display through the smart home open platform, which saves the cluster administrator's time and effort.
As an optional implementation manner, where the storage space is a data table, determining the occupied space of the storage space based on the amount of data stored in the storage space includes: traversing an internal table path of the data table, and determining a hierarchy of the data table based on the internal table path; counting the number of files under a table-level directory corresponding to the hierarchy of the data table; and counting the occupied space under the table-level directory of the data table based on the number of the files under the table-level directory.
In this embodiment, the storage space may be a data table, such as a Hive table, that has an internal storage path, that is, an internal table path such as /user/hive/warehouse/, from which the database name and the table name can be split. When the occupied space of the storage space is determined based on the amount of data stored in it, the internal table path of the data table can be traversed and the hierarchy of the data table determined from that path. The hierarchy is the path hierarchy, which distinguishes a non-partitioned table, a first-level partitioned table and a second-level partitioned table. The embodiment can count the number of files under the table-level directory corresponding to the hierarchy, compute the occupied space of the table-level directory from those files, and convert the occupied space into megabytes (MB).
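The path splitting and space accounting can be sketched as follows. This Python example works on a plain list of (path, size in bytes) pairs rather than on a live cluster. It assumes every table sits under a `<database>.db/<table>` directory below /user/hive/warehouse and that one extra directory level means one partition level; the function names and the shape of the returned summary are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

WAREHOUSE_ROOT = PurePosixPath("/user/hive/warehouse")
BYTES_PER_MB = 1024 * 1024


def parse_table_path(file_path):
    """Split a file path into (database, table, partition_levels).

    Assumed layout: <root>/<database>.db/<table>/[partition dirs]/<file>.
    """
    parts = PurePosixPath(file_path).relative_to(WAREHOUSE_ROOT).parts
    if len(parts) < 3:
        raise ValueError(f"not a file inside a table directory: {file_path}")
    db_dir, table = parts[0], parts[1]
    database = db_dir[:-3] if db_dir.endswith(".db") else db_dir
    # 0 = non-partitioned, 1 = first-level partition, 2 = second-level partition
    partition_levels = len(parts) - 3
    return database, table, partition_levels


def summarize_tables(files):
    """Count files and sum sizes per table, reporting the size in MB."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0, "partition_levels": 0})
    for path, size_bytes in files:
        database, table, levels = parse_table_path(path)
        entry = totals[(database, table)]
        entry["files"] += 1
        entry["bytes"] += size_bytes
        entry["partition_levels"] = max(entry["partition_levels"], levels)
    return {
        key: {
            "files": e["files"],
            "size_mb": e["bytes"] / BYTES_PER_MB,
            "partition_levels": e["partition_levels"],
        }
        for key, e in totals.items()
    }
```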
Optionally, before traversing the internal table path of the data table, the metadata information of the storage space stored by the smart home open platform may be obtained in advance, and the metadata information is loaded into the metadata set (Map).
As an alternative embodiment, if the difference between the latest processing time and the initial running time of any one file exceeds the threshold, the storage space for saving the file is determined to be a useless area, and a prompt message is triggered to the front-end operator.
In this embodiment, if there is a file whose difference between the latest processing time and the initial running time exceeds the threshold, the storage space storing that file is determined to be a useless area, that is, a useless storage space such as a useless table. Prompt information may then be sent to the front-end operator, for example by triggering the enterprise instant messaging application to send a message stating that the storage space storing the file is a useless area. The front-end operator may be the creator of the storage space.
As an alternative implementation, the useless areas are marked, whether metadata information associated with the useless areas exists in the metadata set is queried, and if yes, the step of deleting the metadata information associated with the useless areas is executed.
In this embodiment, the useless area may be marked, for example by setting its flag bit (Flag) to 1. That is, if there is a file whose difference between the latest processing time and the initial running time exceeds the threshold, the flag bit of the storage space storing that file may be set to 1. Conversely, if there is no such file, the flag bit of the storage space may be set to 0 to indicate that the storage space is a useful area.
When marking the useless area, whether metadata information related to the useless area exists in the metadata set can be inquired, if yes, the metadata information related to the useless area is deleted from the metadata set, and the table ID, the table Chinese description and the partition number can be set.
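The marking and lookup can be sketched as follows, under the assumption that the metadata set is an in-memory map keyed by (database, table). The key shape, the field names table_id, description and partition_count, and the function name are hypothetical; the text only says that the table ID, the description and the partition number can be set.

```python
def build_scan_record(table_key, is_useless, metadata):
    """Build one scan record, flagging useless areas.

    When a useless area has a matching entry in the metadata map, the
    entry is removed from the map and its fields are copied to the record.
    """
    database, table = table_key
    record = {"database": database, "table": table, "flag": 1 if is_useless else 0}
    if is_useless:
        meta = metadata.pop(table_key, None)  # delete from the set if present
        if meta is not None:
            record["table_id"] = meta["table_id"]
            record["description"] = meta["description"]
            record["partition_count"] = meta["partition_count"]
    return record
```

Entries that remain in the map after all directories have been processed are the ones that were never matched to a useless area.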
As an optional implementation, after the step of deleting the metadata information associated with the useless area is performed, the method further includes: and traversing the metadata information of the undeleted files in the metadata set, and if the undeleted files are files which do not exist in the directory of the storage space, continuously judging whether the metadata information of the undeleted files is the information which needs to be deleted.
After the step of deleting the metadata information associated with the useless area is performed, the metadata information of the undeleted files in the metadata set may be traversed to determine whether an undeleted file is one that does not exist in the directory of the storage space, for example a file of a table that is registered in the smart home open platform but is not found under the internal table path (/user/hive/warehouse), such as an external table identified in the metadata or an internal table that does not follow the specification.
If the undeleted file does not exist in the directory of the storage space, it is further determined whether its metadata information needs to be deleted. For example, the number of files under the table-level directory corresponding to the hierarchy of the data table is counted, and the occupied space under the table-level directory of the data table is counted based on that number of files. If the difference between the latest processing time and the initial running time of any file exceeds the threshold, the storage space storing the file is determined to be a useless area and prompt information is sent to the front-end operator; the useless area is marked, the metadata set is queried for metadata information associated with the useless area, and if such information exists, the step of deleting it is performed.
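Selecting the entries for this second pass can be sketched as follows. This Python example assumes the metadata set is a map keyed by (database, table) and that the tables found under the internal table path are available as a set of the same keys; both shapes and the function name are hypothetical.

```python
def leftover_metadata(metadata, scanned_tables):
    """Return metadata keys with no matching directory in the first scan.

    These are the candidates (for example external tables) whose files
    must be located and checked separately before any deletion.
    """
    return sorted(key for key in metadata if key not in scanned_tables)
```

Each returned key would then go through the same file counting, threshold check and marking as the tables found under the internal table path.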
As an optional implementation manner, after scanning any one storage space is completed, scanning other storage spaces is started until the records stored in the predetermined time period are scanned, and the cache information of the subsequent records of the storage space is written into the history table.
In this embodiment, after any storage space in the data cluster is scanned, scanning of other storage spaces may be started, for example, after the Hive table is scanned, scanning of the Solr collection may be started until the records stored in the predetermined time period are scanned, and the cache information of the subsequent records in the storage space is written into the history table.
For example, after the Hive tables are scanned, the Solr collections can be scanned to obtain the amount of data they store: the information of each collection directory under the /solr directory is obtained and stored, according to the field names, into the current record table (md_t_table_space), which can be the current-day record table. The HBase tables are then scanned to obtain the amount of data they store: the information of the /hbase/data directory is obtained, the table space name and the table name are split out of each path, the space occupied by each table is obtained, and the result is stored into the current record table. Finally, the Kudu tables are scanned to obtain the amount of data they store. Kudu is an independent columnar storage system that sits alongside HDFS and has no concept of a directory, so the space occupied by a table can be obtained from the monitoring information on the management platform (Cloudera Manager, CM). Optionally, all Kudu table names are obtained through the application programming interface (API) of the Kudu client; for each table, the chart information on Cloudera Manager is accessed, the data point values of the most recent period (for example, 20 minutes) on the monitoring chart are obtained, and their average is taken as the space occupied by the table; this information is then stored in the current record table.
At this point, the current record table, which stores the current scanning result, has been filled with the required information. All records of the current record table are then queried and cached to obtain cache information, and the cache information is written into the history record table (md_t_table_space_his), which stores the historical scanning results.
In this embodiment, the program may be released on the smart home open platform, and the timed task is implemented by a scheduler, for example the Cron scheduler of the Spring Boot framework. The task may be executed at ten a.m. every Tuesday, Thursday and Sunday, with each execution taking about 20 minutes; this is not limited herein.
Optionally, in this embodiment, based on the scanning result obtained by scanning at least one storage space for storing data in the data cluster and determining whether the storage space is a useless area, the following information is calculated from the scanning result by using SQL: the usage of storage resources, useless tables, cold data (data with low access heat), the Top 10 fastest-growing Hive tables, the Top 10 largest Hive tables, and the like.
According to this embodiment, a Java + Spring Boot + MySQL program can scan the space occupied by all data tables (Hive, Kudu and HBase) in the big data cluster and identify whether each table is still useful. The scanning result is output to the current record table and displayed by the front end of the smart home open platform to the manager of each field, so that developers are urged to pay attention to and clean up useless tables and tables whose data volume grows abnormally. This releases storage space in the HDFS of the big data cluster, relieves the pressure on the NameNode, and reduces server hardware cost, thereby solving the technical problem of low data detection efficiency and achieving the technical effect of improving it.
In order to better understand the process of the data detection method, the following describes a flow of the data detection implementation method with reference to an optional embodiment, but the flow is not limited to the technical solution of the embodiment of the present invention.
As service scenarios multiply, there are more and more ETL tasks. The space utilization of the big data cluster's HDFS has exceeded sixty percent, and the pressure on the NameNode is gradually increasing. If the cluster is not cleaned up in time, servers will later have to be added to provide more storage space, and more memory will have to be allocated to the NameNode to relieve the pressure.
The administrator of the big data cluster can only check the space occupation and data usage of the tables in the cluster manually, through HDFS Shell commands. After checking the tables one by one, the administrator collects the results into Excel, matches each table to a responsible person according to the database name and table name (a match that is often inaccurate), and notifies that person to handle it. The responsible person reports the processing result back to the administrator when finished, and the administrator verifies it. This is extremely time-consuming and labor-intensive.
In this embodiment, a Java + Spring Boot + MySQL program can automatically scan the table space and table usage on a schedule, as needed, and provide the results for front-end display through the smart home open platform, which saves the cluster administrator's time and energy.
The programming concept of this embodiment is described below.
In this embodiment, the program is written with Java + Spring Boot + MySQL and is combined with the smart home open platform, which provides accurate matching of metadata to the responsible person, timed task scheduling, and front-end display.
The program of this embodiment requires two MySQL tables in total: md_t_table_space (storing the current scanning result) and md_t_table_space_his (storing the historical scanning results). The two tables have identical structures.
FIG. 3 is an interaction diagram of a method for scanning useless tables and table space in a big data cluster based on Java + Spring Boot + MySQL according to an embodiment of the present invention. As shown in FIG. 3, the method may include the following steps:
Step S301, emptying the current-day table and deleting the current day's records from the history table.
In this embodiment, the current-day table (md_t_table_space) is cleared first, and the current day's records in the history table (md_t_table_space_his) are deleted. This ensures that the tables contain no duplicate data when the program runs more than once on the same day. The scanning order may be: Hive tables, Solr collections, HBase tables, Kudu tables.
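The effect of step S301 can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module in place of MySQL, and the column names (`table_name`, `size_mb`, `scan_date`) are hypothetical, since the document does not give the table structure.

```python
import sqlite3

def reset_for_today(conn, today):
    """Empty the current table and drop today's rows from the history
    table, so that a second run on the same day cannot duplicate data."""
    conn.execute("DELETE FROM md_t_table_space")
    conn.execute("DELETE FROM md_t_table_space_his WHERE scan_date = ?", (today,))
    conn.commit()

conn = sqlite3.connect(":memory:")
for name in ("md_t_table_space", "md_t_table_space_his"):
    # Both tables share one structure; the columns are assumed.
    conn.execute(f"CREATE TABLE {name} (table_name TEXT, size_mb REAL, scan_date TEXT)")
conn.execute("INSERT INTO md_t_table_space VALUES ('t1', 10.0, '2021-06-21')")
conn.executemany(
    "INSERT INTO md_t_table_space_his VALUES (?, ?, ?)",
    [("t1", 9.0, "2021-06-20"), ("t1", 10.0, "2021-06-21")],
)

reset_for_today(conn, "2021-06-21")
current_rows = conn.execute("SELECT COUNT(*) FROM md_t_table_space").fetchone()[0]
history_dates = conn.execute("SELECT scan_date FROM md_t_table_space_his").fetchall()
```

After the reset, only history rows from earlier days remain, so rerunning the scan on the same day rewrites that day's snapshot instead of appending to it.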
Step S302, obtaining the metadata of the intelligent home open platform.
This embodiment can acquire the Hive table metadata information stored by the smart home open platform.
Step S303, traversing the Hive internal table path, performing the relevant processing and information acquisition, and then deleting the table from the Map.
Step S304, writing the acquired information into MySQL.
The embodiment may load the obtained platform metadata information into the Map.
S1, traversing the internal table path /user/hive/warehouse/, obtaining each table-level path, and splitting the database name and the table name out of the table-level path.
S2, distinguishing non-partitioned tables, first-level partitioned tables and second-level partitioned tables by the path hierarchy.
S3, counting the number of files under the table-level directory.
S4, computing the space occupied by the table-level directory (including replicas) based on the number of files, and converting it into MB.
S5, acquiring the latest modification time of the files under the table-level directory.
S6, traversing all the files under the table-level directory and obtaining the latest access time of each file, the latest of which may be used as the latest access time of the table.
S7, calculating the number of days from the latest modification time to the program running time and from the latest access time to the program running time; if both exceed 90 days, setting the Flag of the Hive table to 1, which marks it as a useless table, and triggering enterprise WeChat to send alarm information to the creator of the table; otherwise, setting the Flag to 0.
S8, if the internal table can be found in the Map, setting the ID of the table, the Chinese description of the table and the number of partitions; the table is then deleted from the Map.
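Steps S1 to S4 can be sketched in Python on an in-memory file listing. This is a hedged illustration, not the embodiment's Java code: the listing format `(path, size_bytes, replication)`, the function names, and the assumptions that every database directory ends in `.db` and that partition directories contain `=` are all introduced here for the example.

```python
from collections import defaultdict

WAREHOUSE = "/user/hive/warehouse/"
LEVELS = {0: "non-partitioned", 1: "first-level partitioned", 2: "second-level partitioned"}

def parse_file_path(path):
    """S1/S2: split a file path into (database, table, partition depth)."""
    parts = path[len(WAREHOUSE):].split("/")
    database = parts[0][:-3] if parts[0].endswith(".db") else parts[0]
    table = parts[1]
    # Directories between the table directory and the file are partitions.
    depth = sum(1 for part in parts[2:-1] if "=" in part)
    return database, table, depth

def summarize(files):
    """S3/S4: count files and sum occupied space (replicas included) per table."""
    stats = defaultdict(lambda: {"files": 0, "bytes": 0, "depth": 0})
    for path, size_bytes, replication in files:
        database, table, depth = parse_file_path(path)
        entry = stats[(database, table)]
        entry["files"] += 1
        entry["bytes"] += size_bytes * replication
        entry["depth"] = max(entry["depth"], depth)
    return {
        key: {
            "files": value["files"],
            "size_mb": value["bytes"] / (1024 * 1024),
            "kind": LEVELS.get(value["depth"], "deeper partitioning"),
        }
        for key, value in stats.items()
    }

# Hypothetical listing; table names are made up for the example.
listing = [
    ("/user/hive/warehouse/di_fw.db/orders/dt=2021-06-20/part-0", 1048576, 3),
    ("/user/hive/warehouse/di_fw.db/orders/dt=2021-06-21/part-0", 2097152, 3),
    ("/user/hive/warehouse/di_fw.db/dim_city/part-0", 524288, 3),
]
summary = summarize(listing)
```

In a real cluster the listing would come from the HDFS client rather than a hard-coded list; the aggregation logic stays the same.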
Step S305, traversing the remaining tables in the Map, and performing relevant processing and information acquisition.
The tables remaining in the Map in this embodiment are tables that are registered in the smart home open platform but do not exist in the /user/hive/warehouse directory (external tables identified in the metadata and internal tables that do not meet the specification). The logic for traversing the remaining tables in the Map is consistent with steps S2 to S7 above.
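The way the Map shrinks during the internal table scan, leaving only the tables handled in step S305, can be illustrated with a small Python sketch. The dictionary keys, the metadata fields and the table names are hypothetical; the embodiment's actual Map structure is not given in the document.

```python
def match_and_remove(metadata_map, scanned_tables):
    """Look up each scanned internal table in the metadata map and
    remove it once matched; unmatched scans map to None."""
    matched = {}
    for key in scanned_tables:
        matched[key] = metadata_map.pop(key, None)
    return matched

# Platform metadata keyed by (database, table); values are illustrative.
metadata_map = {
    ("di_fw", "orders"): {"id": 1, "description": "order facts"},
    ("di_fw", "ext_logs"): {"id": 2, "description": "external log table"},
}
# Tables found while traversing the internal table path.
scanned = [("di_fw", "orders"), ("di_fw", "tmp_unknown")]

matched = match_and_remove(metadata_map, scanned)
remaining = sorted(metadata_map)
```

Whatever is left in `metadata_map` after the traversal corresponds to the remaining tables, here the external table `ext_logs`.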
Step S306, writing the acquired information into Mysql.
Step S307, traversing the Solr collection path, performing the relevant processing and information acquisition, and writing the information into MySQL.
After the Hive tables are scanned, the data volume of the Solr collections is scanned: the information of each collection directory under the /solr directory is obtained and stored into MySQL according to the corresponding field names.
Step S308, traversing the HBase path, performing the relevant processing and information acquisition, and writing the information into MySQL.
In this embodiment, after the Solr collections are scanned, the data volume of the HBase tables is scanned: the information of the /hbase/data directory is obtained, the table space name and the table name are split out of each path, the space occupied by each table is obtained, and the result is stored into MySQL.
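Splitting the table space name and the table name out of an HBase path can be sketched as below. The sketch assumes the common on-disk layout `/hbase/data/<namespace>/<table>/...`; the function name and the sample table name are hypothetical, and the code is Python rather than the embodiment's Java.

```python
def parse_hbase_table_path(path, root="/hbase/data/"):
    """Return (table space name, table name) for a path under the root."""
    if not path.startswith(root):
        raise ValueError(f"not under {root}: {path}")
    parts = [part for part in path[len(root):].split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"no table component in: {path}")
    return parts[0], parts[1]

namespace, table = parse_hbase_table_path("/hbase/data/default/device_events")
```

Deeper paths, such as region directories below the table directory, resolve to the same pair because only the first two components are used.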
Step S309, obtaining the list of all Kudu tables.
In this embodiment, after the HBase tables are scanned, the data volume of the Kudu tables is scanned. Because Kudu is an independent columnar storage system that sits alongside HDFS and has no concept of a directory, the space occupied by a table can be obtained from the monitoring information on Cloudera Manager.
Step S310, obtaining, from the CM chart, the storage space occupied by each Kudu table over the last 20 minutes, and taking the average.
Optionally, this embodiment first obtains all Kudu table names through the Kudu Client API, then, for each table, accesses the chart information on Cloudera Manager, obtains the data point values of the last 20 minutes on the monitoring chart, and takes their average as the space occupied by the table.
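The averaging in step S310 can be sketched in Python as follows. Fetching the points from Cloudera Manager is deliberately left out; the sketch assumes the monitoring points have already been retrieved as `(timestamp, value)` pairs, and the function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta

def average_recent(points, now, window_minutes=20):
    """Average the values whose timestamp falls in the last window.
    Returns None when the window contains no data points."""
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [value for timestamp, value in points if cutoff <= timestamp <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)

now = datetime(2021, 6, 21, 10, 20)
points = [
    (datetime(2021, 6, 21, 9, 55), 900.0),   # outside the 20-minute window
    (datetime(2021, 6, 21, 10, 5), 1000.0),
    (datetime(2021, 6, 21, 10, 15), 1100.0),
]
occupied = average_recent(points, now)
```

Averaging over a short window smooths out momentary fluctuations in the chart, at the cost of lagging behind sudden growth.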
Step S311, storing the acquired information into MySQL.
In this embodiment, after the md_t_table_space table in MySQL has been filled with the required information, all records in the table are queried and cached to obtain cache information, and the cache information is then written into the md_t_table_space_his table.
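Copying the current scanning result into the history table can be sketched as follows, again with Python's sqlite3 module standing in for MySQL. The column names and sample rows are hypothetical; only the two table names come from the document.

```python
import sqlite3

def archive_current(conn):
    """Query and cache all current records, then append them to history."""
    cached = conn.execute(
        "SELECT table_name, size_mb, flag, scan_date FROM md_t_table_space"
    ).fetchall()
    conn.executemany("INSERT INTO md_t_table_space_his VALUES (?, ?, ?, ?)", cached)
    conn.commit()
    return len(cached)

conn = sqlite3.connect(":memory:")
schema = "(table_name TEXT, size_mb REAL, flag INTEGER, scan_date TEXT)"  # assumed columns
conn.execute("CREATE TABLE md_t_table_space " + schema)
conn.execute("CREATE TABLE md_t_table_space_his " + schema)
conn.executemany(
    "INSERT INTO md_t_table_space VALUES (?, ?, ?, ?)",
    [("orders", 9.0, 0, "2021-06-21"), ("dim_city", 1.5, 0, "2021-06-21")],
)

copied = archive_current(conn)
history = conn.execute(
    "SELECT table_name FROM md_t_table_space_his ORDER BY table_name"
).fetchall()
```

Because the two tables have identical structures, the cached rows can be inserted without any column mapping.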
Step S312, processing the table data and displaying it at the front end of the smart home open platform.
In this embodiment, the program of the method is released on the smart home open platform. The timed task can be implemented through the Cron scheduler of Spring Boot and is executed at ten a.m. every Tuesday, Thursday and Sunday; each execution takes about 20 minutes.
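The schedule can be made concrete with a small Python check. Both the Spring-style cron string and the reading of the schedule as Tuesday, Thursday and Sunday at 10:00 are assumptions made for illustration; the document does not give the actual cron expression used by the program.

```python
from datetime import datetime

# Assumed six-field Spring cron expression (second minute hour day month weekday).
SPRING_CRON = "0 0 10 * * TUE,THU,SUN"
RUN_WEEKDAYS = {1, 3, 6}  # datetime.weekday(): Monday is 0, Sunday is 6

def is_scheduled(moment):
    """Return True when the moment is exactly 10:00:00 on a run day."""
    return (
        moment.weekday() in RUN_WEEKDAYS
        and (moment.hour, moment.minute, moment.second) == (10, 0, 0)
    )

tuesday_run = is_scheduled(datetime(2021, 6, 22, 10, 0))    # a Tuesday
wednesday_run = is_scheduled(datetime(2021, 6, 23, 10, 0))  # a Wednesday
```

In a Spring Boot application the same schedule would normally be expressed declaratively through the cron string rather than checked by hand.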
Based on the scanning result, the following information can be calculated by using SQL: the usage of storage resources, useless tables, cold data, the Top 10 fastest-growing Hive tables, the Top 10 largest Hive tables, and the like.
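Queries of this kind can be sketched with Python's sqlite3 module standing in for MySQL. The column names, the sample rows and the definition of growth as the size difference between two snapshots are assumptions; the document does not give the SQL actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE md_t_table_space_his "
    "(table_name TEXT, size_mb REAL, flag INTEGER, scan_date TEXT)"  # assumed columns
)
conn.executemany(
    "INSERT INTO md_t_table_space_his VALUES (?, ?, ?, ?)",
    [
        ("orders", 100.0, 0, "2021-06-20"), ("orders", 400.0, 0, "2021-06-22"),
        ("dim_city", 50.0, 0, "2021-06-20"), ("dim_city", 60.0, 0, "2021-06-22"),
        ("tmp_old", 900.0, 1, "2021-06-20"), ("tmp_old", 900.0, 1, "2021-06-22"),
    ],
)

# Top 10 largest tables in the latest snapshot.
largest = conn.execute(
    "SELECT table_name, size_mb FROM md_t_table_space_his "
    "WHERE scan_date = ? ORDER BY size_mb DESC LIMIT 10",
    ("2021-06-22",),
).fetchall()

# Top 10 fastest-growing tables between two snapshots.
growth = conn.execute(
    "SELECT cur.table_name, cur.size_mb - prev.size_mb AS growth_mb "
    "FROM md_t_table_space_his AS cur "
    "JOIN md_t_table_space_his AS prev ON prev.table_name = cur.table_name "
    "WHERE cur.scan_date = ? AND prev.scan_date = ? "
    "ORDER BY growth_mb DESC LIMIT 10",
    ("2021-06-22", "2021-06-20"),
).fetchall()

# Useless tables are those whose flag was set to 1 by the scan.
useless = conn.execute(
    "SELECT table_name FROM md_t_table_space_his WHERE scan_date = ? AND flag = 1",
    ("2021-06-22",),
).fetchall()
```

Keeping one snapshot per scan day in the history table is what makes the growth ranking possible; the current table alone only supports the size ranking.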
FIG. 4 is a diagram illustrating the presentation of the program's scheduled task according to an embodiment of the present invention. As shown in FIG. 4, information on the system timed task and on the table space scan may be displayed in the front-end interface.
FIG. 5 is a diagram illustrating the usage of storage resources according to an embodiment of the invention. As shown in FIG. 5, the presentation of storage resource usage in the front-end interface may include: user group, storage type, target data source, storage path, maximum capacity (GB), currently used, and operation. The user group may include: data acquisition-after-sale, data acquisition-supply chain planning field, data acquisition-overseas field; the storage type may be Hive; the target data source may be di_fw, di_gy, dh_gy, dl_hw_seq and dl_hw_txt; the storage path may be /user/hive/warehouse/di_fw.db, /user/hive/warehouse/dl_hw_seq.db, /user/hive/warehouse/dl_hw_txt.db; the maximum capacity may be 10240 or 1024; the current usage may be 4981, 33, 59, 159, 147; the operation may be a view.
FIG. 6 is a schematic diagram of a useless table presentation according to an embodiment of the present invention. As shown in FIG. 6, useless tables may be displayed in the front-end interface together with their table name, last access time, last update time, occupied space, and operation, where a useless table is one whose data has been neither updated nor accessed within 3 months.
FIG. 7 is a diagram of a cold data presentation according to an embodiment of the present invention. As shown in FIG. 7, the table name, last access time, last update time and occupied space of the cold data may be displayed in the front-end interface.
FIG. 8 is a schematic diagram of the Hive table growth Top 10 according to an embodiment of the invention. As shown in FIG. 8, the ten fastest-growing Hive tables can be displayed in the front-end interface, together with the storage space occupied by each of them.
FIG. 9 is a schematic diagram of the Hive large table Top 10 according to an embodiment of the invention. As shown in FIG. 9, the ten largest Hive tables can be displayed in the front-end interface, together with the storage space occupied by each of them.
FIG. 10 is a diagram illustrating a useless table cleaning result according to an embodiment of the present invention. As shown in FIG. 10, the front-end interface can show that the storage space occupied by Hive tables fell from 187962 GB to 155156 GB; allowing for the normal daily data increment, approximately 40000 GB of space was cleared over two weeks.
According to this embodiment, the program is developed with Java + Spring Boot + MySQL and combined with the smart home open platform, so that the following can be realized: scanning of the table space occupied by Hive, HBase, Solr and Kudu and of table usage, matching of metadata to the responsible person, timed scheduling of the scanning task, visual display on the front-end page, enterprise WeChat alarms to the responsible person, chart permission control, and the like. This greatly saves the time and energy of cluster administrators and tenant administrators, makes cluster management more automatic and intelligent, effectively optimizes the cluster storage space, relieves the pressure of file management on the NameNode, and reduces server hardware cost, thereby solving the technical problem of low data detection efficiency and achieving the technical effect of improving it.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Fig. 11 is a block diagram of a data detection apparatus according to an embodiment of the present invention. As shown in fig. 11, the data detecting device 110 may include: a scanning unit 111, a reading unit 112, and a deleting unit 113.
The scanning unit 111 is configured to scan at least one storage space for storing data in the data cluster.
The reading unit 112 is configured to sequentially read the latest processing time of each file in a directory of the storage space, where the latest processing time includes at least one of: the last modification time and the last access time of the file.
The deleting unit 113 is configured to delete the metadata information of any file when the difference between the latest processing time and the initial running time of the file exceeds a threshold.
Optionally, the apparatus further comprises: an acquisition unit, configured to acquire the quantity of data stored in a storage space when at least one storage space for storing data in the data cluster is scanned; and a first determining unit, configured to determine the occupied space of the storage space based on the quantity of data stored in the storage space.
Optionally, the first determining unit includes: a determining module, configured to traverse an internal table path of the data table when the storage space is a data table, and to determine the hierarchy of the data table based on the internal table path; a first statistical module, configured to count the number of files under the table-level directory corresponding to the hierarchy of the data table; and a second statistical module, configured to compute the space occupied by the data table under the table-level directory based on the number of files under the table-level directory.
Optionally, the apparatus further comprises: a second determining unit, configured to determine that the storage space storing a file is a useless area, and to trigger prompt information to a front-end operator, when the difference between the latest processing time and the initial running time of that file exceeds a threshold.
Optionally, the apparatus further comprises: a query unit, configured to mark the useless area, query whether metadata information associated with the useless area exists in the metadata set, and, if so, execute the step of deleting the metadata information associated with the useless area.
Optionally, the apparatus further comprises: a traversing unit, configured to traverse the metadata information of the undeleted files in the metadata set after the step of deleting the metadata information associated with the useless area is executed, and, if an undeleted file does not exist in the directory of the storage space, to continue judging whether the metadata information of that file needs to be deleted.
Optionally, the apparatus further comprises: a scanning unit, configured to start scanning other storage spaces after any one storage space has been scanned, until the records stored in the predetermined time period have been scanned, and to write the cache information of the subsequent records of the storage space into the history record table.
In the data detection apparatus of this embodiment, the scanning unit 111 scans at least one storage space for storing data in a data cluster; the reading unit 112 sequentially reads the latest processing time of each file in the directory of the storage space, where the latest processing time includes at least one of the following: the last modification time and the last access time of the file; and the deleting unit 113 deletes the metadata information of any file when the difference between the latest processing time and the initial running time of that file exceeds a threshold. That is, this application scans at least one storage space for storing data in the data cluster and, when the difference between the latest processing time and the initial running time of any file in the storage space exceeds the threshold, automatically deletes the metadata information of that file. This relieves data storage pressure and avoids having to check storage and data usage one by one with HDFS Shell commands and process the data manually, which reduces the time and labor cost of data detection, solves the technical problem of low data detection efficiency, and achieves the technical effect of improving it.
Embodiments of the present invention also provide a computer-readable storage medium. The computer-readable storage medium includes a stored program, wherein the program, when executed by a processor, performs a method of detecting data of an embodiment of the present invention.
Alternatively, in the present embodiment, the storage medium may be configured to store program codes for performing the following steps:
s1, scanning at least one storage space for storing data in the data cluster, and acquiring the quantity of data stored in the storage space;
s2, determining the occupied space of the storage space based on the data quantity stored in the storage space;
s3, reading the latest processing time of each file in the directory of the storage space in sequence, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file;
s4, if the difference between the latest processing time and the initial running time of any file exceeds the threshold, the metadata information of the file is deleted.
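Steps s1 to s4 can be tied together in one short Python sketch that works on in-memory records. It is a simplified illustration under stated assumptions: the record fields, the function name `detect`, the 90-day threshold and the use of the later of the two timestamps as the latest processing time are all choices made here, not details confirmed by the document.

```python
from datetime import datetime, timedelta

def detect(files, metadata, run_time, threshold=timedelta(days=90)):
    """s1/s2: total the stored data; s3: read each file's latest
    processing time; s4: delete metadata of files past the threshold."""
    occupied_bytes = sum(record["size_bytes"] for record in files)
    deleted = []
    for record in files:
        # Assumption: the later of the two timestamps is the latest processing time.
        latest = max(record["last_modified"], record["last_accessed"])
        if run_time - latest > threshold:
            if metadata.pop(record["name"], None) is not None:
                deleted.append(record["name"])
    return occupied_bytes, deleted

run_time = datetime(2021, 6, 21, 10, 0)
files = [
    {"name": "old.orc", "size_bytes": 2048,
     "last_modified": datetime(2020, 12, 1), "last_accessed": datetime(2021, 1, 15)},
    {"name": "new.orc", "size_bytes": 1024,
     "last_modified": datetime(2021, 6, 1), "last_accessed": datetime(2021, 6, 20)},
]
metadata = {"old.orc": {"owner": "alice"}, "new.orc": {"owner": "bob"}}

occupied_bytes, deleted = detect(files, metadata, run_time)
```

Only the metadata entry is removed in this sketch; what happens to the underlying file is outside the four steps listed above.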
Embodiments of the present invention also provide an electronic device, including a memory in which a computer program is stored and a processor configured to run the computer program to perform the method for detecting data of the embodiments of the present invention.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, scanning at least one storage space for storing data in the data cluster, and acquiring the quantity of data stored in the storage space;
s2, determining the occupied space of the storage space based on the data quantity stored in the storage space;
s3, reading the latest processing time of each file in the directory of the storage space in sequence, wherein the latest processing time comprises at least one of the following: the last modification time and the last access time of the file;
s4, if the difference between the latest processing time and the initial running time of any file exceeds the threshold, the metadata information of the file is deleted.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for detecting data, comprising:
scanning at least one storage space in the data cluster for storing data;
sequentially reading the latest processing time of each file in the directory of the storage space, wherein the latest processing time comprises at least one of the following: a last modification time and a last access time of the file;
and if the difference value between the latest processing time and the initial running time of any file exceeds a threshold value, deleting the metadata information of the file.
2. The method of claim 1, wherein when scanning at least one storage space in a data cluster for storing data, the method further comprises:
acquiring the quantity of data stored in the storage space;
and determining the occupied space of the storage space based on the data quantity stored in the storage space.
3. The method of claim 2, wherein, in the case that the storage space is a data table, determining the occupation space of the storage space based on the amount of data stored in the storage space comprises:
traversing an internal table path of the data table and determining a hierarchy of the data table based on the internal table path;
counting the number of files under a table-level directory corresponding to the hierarchy of the data table;
and counting the occupied space of the data table under the table-level directory based on the number of the files under the table-level directory.
4. The method of claim 1, wherein, if the difference between the latest processing time and the initial running time of any file exceeds the threshold, the storage space used to store the file is determined to be a useless area and prompt information is triggered to a front-end operator.
5. The method of claim 4, wherein the useless area is marked and a query is made as to whether metadata information associated with the useless area exists in the metadata set, and if so, the step of deleting the metadata information associated with the useless area is performed.
6. The method of claim 5, wherein after the step of deleting the metadata information associated with the useless area is performed, the method further comprises:
traversing metadata information of undeleted files in the metadata set;
and if the undeleted file is a file which does not exist in the directory of the storage space, continuously judging whether the metadata information of the undeleted file is the information which needs to be deleted.
7. The method according to any one of claims 1 to 6, wherein after scanning of any one of the storage spaces is completed, scanning of other storage spaces is started until the records stored in the predetermined time period are scanned, and the cache information of the subsequent records of the storage spaces is written into the history record table.
8. An apparatus for detecting data, comprising:
the scanning unit is used for scanning at least one storage space for storing data in the data cluster;
a reading unit, configured to read a latest processing time of each file in a directory of the storage space in sequence, where the latest processing time includes at least one of: a last modification time and a last access time of the file;
and the deleting unit is used for deleting the metadata information of any file when the difference value between the latest processing time and the initial running time of the file exceeds a threshold value.
9. A computer-readable storage medium, comprising a stored program, wherein the program, when executed by a processor, performs the method of any of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202110688742.8A 2021-06-21 2021-06-21 Data detection method and device, storage medium and electronic device Pending CN113434492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110688742.8A CN113434492A (en) 2021-06-21 2021-06-21 Data detection method and device, storage medium and electronic device


Publications (1)

Publication Number Publication Date
CN113434492A true CN113434492A (en) 2021-09-24

Family

ID=77756967


Country Status (1)

Country Link
CN (1) CN113434492A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103353892A (en) * 2013-07-05 2013-10-16 北京东方网信科技股份有限公司 Method and system for data cleaning suitable for mass storage
US9792298B1 (en) * 2010-05-03 2017-10-17 Panzura, Inc. Managing metadata and data storage for a cloud controller in a distributed filesystem
CN108090118A (en) * 2017-11-07 2018-05-29 清华大学 The acquisition methods and system of file system metadata
CN108363643A (en) * 2018-03-27 2018-08-03 东北大学 A kind of HDFS copy management methods based on file access temperature
CN109460411A (en) * 2018-11-13 2019-03-12 杭州数梦工场科技有限公司 A kind of data aging method based on hive, device and equipment
CN110413587A (en) * 2019-06-26 2019-11-05 苏州浪潮智能科技有限公司 A kind of method and apparatus of aging history data
CN110659281A (en) * 2019-08-14 2020-01-07 中国平安财产保险股份有限公司 Hive-based data processing method and device, computer equipment and storage medium
CN111666260A (en) * 2019-03-08 2020-09-15 杭州海康威视数字技术股份有限公司 Data processing method and device
CN112269781A (en) * 2020-11-13 2021-01-26 网易(杭州)网络有限公司 Data life cycle management method, device, medium and electronic equipment
CN112559483A (en) * 2020-12-22 2021-03-26 赛尔网络有限公司 HDFS-based data management method and device, electronic equipment and medium
CN112988722A (en) * 2021-02-05 2021-06-18 新华三大数据技术有限公司 Hive partition table data cleaning method and device and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792298B1 (en) * 2010-05-03 2017-10-17 Panzura, Inc. Managing metadata and data storage for a cloud controller in a distributed filesystem
CN103353892A (en) * 2013-07-05 2013-10-16 北京东方网信科技股份有限公司 Method and system for data cleaning suitable for mass storage
CN108090118A (en) * 2017-11-07 2018-05-29 清华大学 Method and system for acquiring file system metadata
CN108363643A (en) * 2018-03-27 2018-08-03 东北大学 HDFS replica management method based on file access heat
CN109460411A (en) * 2018-11-13 2019-03-12 杭州数梦工场科技有限公司 Hive-based data aging method, device and equipment
CN111666260A (en) * 2019-03-08 2020-09-15 杭州海康威视数字技术股份有限公司 Data processing method and device
CN110413587A (en) * 2019-06-26 2019-11-05 苏州浪潮智能科技有限公司 Method and apparatus for aging historical data
CN110659281A (en) * 2019-08-14 2020-01-07 中国平安财产保险股份有限公司 Hive-based data processing method and device, computer equipment and storage medium
CN112269781A (en) * 2020-11-13 2021-01-26 网易(杭州)网络有限公司 Data life cycle management method, device, medium and electronic equipment
CN112559483A (en) * 2020-12-22 2021-03-26 赛尔网络有限公司 HDFS-based data management method and device, electronic equipment and medium
CN112988722A (en) * 2021-02-05 2021-06-18 新华三大数据技术有限公司 Hive partition table data cleaning method and device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
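Zeng Yanmei (曾艳梅) et al.: "Research and implementation of a metadata-based joint query method for static and dynamic data" (一种基于元数据静动态数据联合查询方法的研究与实现), Computer Applications and Software (《计算机应用与软件》), vol. 32, no. 01, 15 January 2015 (2015-01-15), pages 59 - 63 *
Hu Maosheng (胡茂胜): "Research on seamless integration technology for distributed heterogeneous spatial data based on the data center model" (基于数据中心模式的分布式异构空间数据无缝集成技术研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库·信息科技辑》), no. 10, 15 October 2009 (2009-10-15), pages 1 - 137 *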

Similar Documents

Publication Publication Date Title
US11636083B2 (en) Data processing method and apparatus, storage medium and electronic device
CN111061752B (en) Data processing method and device and electronic equipment
CN112506870B (en) Data warehouse increment updating method and device and computer equipment
CN110147470B (en) Cross-machine-room data comparison system and method
CN111061758B (en) Data storage method, device and storage medium
CN107656807A (en) Automatic elastic scaling method and device for virtual resources
CN111400288A (en) Data quality inspection method and system
CN110955704A (en) Data management method, device, equipment and storage medium
CN111324604A (en) Database table processing method and device, electronic equipment and storage medium
CN111694505B (en) Data storage management method, device and computer readable storage medium
CN110716938A (en) Data aggregation method and device, storage medium and electronic device
CN113779426A (en) Data storage method and device, terminal equipment and storage medium
CN109753505B (en) Method and system for creating temporary storage unit in big data storage system
CN113434492A (en) Data detection method and device, storage medium and electronic device
CN110019870B (en) Image retrieval method and system based on memory image cluster
CN108509639B (en) Table information management method, device and readable storage medium
CN104317820B (en) Statistical method and device for report forms
CN106993026B (en) Method and device for detecting and downloading newly added files of FTP server
CN114417200A (en) Network data acquisition method and device and electronic equipment
CN110321358A (en) Method and device for user data reorganization
CN108206933B (en) Video data acquisition method and device based on video cloud storage system
CN110674214A (en) Big data synchronization method and device, computer equipment and storage medium
CN110674190B (en) Statistical method and device for file system tasks and server
US20130304735A1 (en) Records management system
CN116010340A (en) Data table management method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination