CN112231292A

CN112231292A - File processing method and device, storage medium and computer equipment

Info

Publication number: CN112231292A
Application number: CN202010930089.7A
Authority: CN
Inventors: 郑艳涛; 周一帆; 庞少强
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Priority date: 2019-02-15
Filing date: 2019-02-15
Publication date: 2021-01-15
Also published as: CN109902067B; CN109902067A

Abstract

The invention provides a file processing method, a file processing device, a storage medium and computer equipment, wherein a file is a file in a data warehouse tool, the data warehouse tool comprises a node of a target type, and the method comprises the steps of acquiring a mirror image file generated by the node of the target type; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; and combining the original files according to the information of the original files and a preset rule. The invention can realize the automatic identification of the files generated during the operation of the data warehouse tool and can combine the files in time.

Description

File processing method and device, storage medium and computer equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a file processing method and apparatus, a storage medium, and a computer device.

Background

A data warehouse tool can map structured data files to database tables, provide simple SQL (Structured Query Language) query functionality, and convert SQL statements into distributed computing tasks for execution.

Data warehouse tools generally run on the Hadoop distributed file system, and a large number of small files are generated during operation. Small files may be generated when a data source is imported into the data warehouse tool, or when offline computation is performed by reading data tables of the data warehouse tool. Usually, each file occupies one computing process or thread during computation, so a large number of small files consumes more computing resources. It is therefore necessary to process the files generated during operation of the data warehouse tool.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the invention aims to provide a file processing method, a file processing device, a storage medium and computer equipment, which can automatically identify files generated when a data warehouse tool operates and can combine the files in time.

To achieve the above object, an embodiment of a first aspect of the present invention provides a file processing method, where the file is a file in a data warehouse tool, and the data warehouse tool includes a node of a target type, and the method includes: acquiring a mirror image file generated by the node of the target type; analyzing the image file by combining the directory information of the data warehouse tool to obtain the information of the original file to which the image file belongs; and combining the original files according to the information of the original files by combining a preset rule.

In the file processing method provided by the embodiment of the first aspect of the present invention, a mirror image file generated by a node of a target type is obtained; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

To achieve the above object, an embodiment of a second aspect of the present invention provides a file processing apparatus, where the file is a file in a data warehouse tool, and the data warehouse tool includes a node of a target type, and the apparatus includes: an acquisition module, configured to acquire the mirror image file generated by the node of the target type; an analysis module, configured to analyze the image file in combination with the directory information of the data warehouse tool to obtain the information of the original file to which the image file belongs; and a merging processing module, configured to merge the original files according to the information of the original files in combination with a preset rule.

The file processing apparatus provided by the embodiment of the second aspect of the present invention obtains the image file generated by the node of the target type; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

To achieve the above object, an embodiment of a third aspect of the present invention provides a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to execute a file processing method, the method being the file processing method provided by the embodiment of the first aspect of the present invention.

In a non-transitory computer-readable storage medium according to a third embodiment of the present invention, an image file generated by a node of a target type is obtained; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

To achieve the above object, an embodiment of a fourth aspect of the present invention provides a computer program product, where instructions of the computer program product, when executed by a processor, execute a file processing method, where the file is a file in a data warehouse tool, and the data warehouse tool includes a node of a target type, and the method includes: acquiring a mirror image file generated by the node of the target type; analyzing the image file in combination with the directory information of the data warehouse tool to obtain the information of the original file to which the image file belongs; and merging the original files according to the information of the original files in combination with a preset rule.

In a computer program product according to a fourth aspect of the present invention, an image file generated by a node of a target type is obtained; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

The fifth aspect of the present invention further provides a computer device, which includes a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is disposed inside a space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing: acquiring a mirror image file generated by the node of the target type; analyzing the image file by combining the directory information of the data warehouse tool to obtain the information of the original file to which the image file belongs; and combining the original files according to the information of the original files by combining a preset rule.

In the computer device according to the fifth aspect of the present invention, the image file generated by the target type node is obtained; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart illustrating a file processing method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an application scenario according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a file processing apparatus according to another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

Fig. 1 is a flowchart illustrating a file processing method according to an embodiment of the present invention.

The present embodiment is described using an example in which the file processing method is configured in a file processing apparatus.

The file processing method in this embodiment may be configured in a file processing apparatus, and the file processing apparatus may be configured in a server, or may also be configured in an electronic device, which is not limited in this embodiment of the present application.

The present embodiment takes the case where the file processing method is configured in the electronic device as an example.

The file is a file in a data warehouse tool, the data warehouse tool comprises a node of a target type, and the node of the target type may be a NameNode node serving as the metadata management center.

It should be noted that the execution main body in the embodiment of the present application may be, for example, a Central Processing Unit (CPU) in a server or an electronic device in terms of hardware, and may be, for example, a related background service in the server or the electronic device in terms of software, which is not limited to this.

In order to solve the above technical problem, an embodiment of the present invention provides a file processing method, where an image file generated by a node of a target type is obtained; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

Referring to fig. 1, the method includes:

S101: Acquire the image file generated by the node of the target type.

The node of the target type may be a NameNode node of the metadata management center.

In the specific execution process of the embodiment of the invention, in order to realize automatic identification of the file generated during the operation of the data warehouse tool, the mirror image file generated by the node of the target type can be stored in the local storage device during the operation of the data warehouse tool.

In the embodiment of the invention, the mirror image file generated by the NameNode node of the metadata management center can be directly acquired in the local storage equipment, and then the mirror image file is analyzed.

The image file generated by the NameNode node of the metadata management center corresponds to original files, and an original file is a file generated when the data warehouse tool operates.

The image file may be analyzed in real time, or the image file may be analyzed at certain time intervals, which is not limited in this respect.
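As a non-limiting illustration of S101, the following Python sketch picks the most recent image file from a local storage directory. The function name `latest_image_file`, the directory argument, and the `fsimage_` file-name prefix are assumptions made for this sketch and are not specified by the embodiment.

```python
from pathlib import Path
from typing import Optional


def latest_image_file(storage_dir: str, prefix: str = "fsimage_") -> Optional[Path]:
    """Return the most recently modified image file in storage_dir, or None.

    Assumption: image files are stored locally under names starting with
    `prefix`; checksum sidecar files ending in ".md5" are skipped.
    """
    candidates = [
        p
        for p in Path(storage_dir).glob(prefix + "*")
        if p.is_file() and p.suffix != ".md5"
    ]
    if not candidates:
        return None
    # The newest file by modification time is treated as the current image.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

A periodic analysis job could call this function at each interval and analyze the returned file only when it differs from the one analyzed previously.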

S102: Analyze the mirror image file in combination with the directory information of the data warehouse tool to obtain the information of the original file to which the mirror image file belongs.

The original files are the files corresponding to each of a plurality of database tables and each of a plurality of partitions of the data warehouse tool.

The directory information describes the specific organizational structure of each database table and each partition.

The information includes the number of original files and the size of the storage space they occupy.

In the specific execution process of the embodiment of the invention, the directory information of the data warehouse tool can be combined to determine the information of the first original file corresponding to each database table and the information of the second original file corresponding to each partition in the plurality of database tables and the plurality of partitions of the data warehouse tool.

The original file corresponding to the database table may be referred to as a first original file, and the original file corresponding to each partition may be referred to as a second original file.

By combining the directory information of the data warehouse tool, the information of the first original file corresponding to each database table and the information of the second original file corresponding to each partition in the plurality of database tables and the plurality of partitions of the data warehouse tool are determined, so that the original files can be positioned timely and accurately, and the information of the original files corresponding to the database tables can be acquired timely.
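The grouping described for S102 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the image file is assumed to have already been converted into (path, size) records, for example by an offline image viewing tool; the warehouse root `/user/hive/warehouse` and the layout `<root>/<db>/<table>/[<key>=<value>/...]/<file>` are assumptions; and the names `DirStats` and `collect_stats` are hypothetical. Files located directly in a table directory are counted as first original files, and files located in a partition directory are counted as second original files.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class DirStats:
    file_count: int = 0   # number of original files in the directory
    total_bytes: int = 0  # storage space occupied by those files


def collect_stats(
    records: Iterable[Tuple[str, int]],
    warehouse_root: str = "/user/hive/warehouse",
) -> Tuple[Dict[str, DirStats], Dict[str, DirStats]]:
    """Group (path, size) records into per-table and per-partition statistics."""
    root = warehouse_root.rstrip("/") + "/"
    tables: Dict[str, DirStats] = defaultdict(DirStats)
    partitions: Dict[str, DirStats] = defaultdict(DirStats)
    for path, size in records:
        if not path.startswith(root):
            continue  # not under the warehouse directory
        parts = path[len(root):].split("/")
        if len(parts) < 3:
            continue  # no <db>/<table>/<file> structure
        sub_dirs = parts[2:-1]
        if not sub_dirs:
            # File sits directly in the table directory: first original file.
            key, target = "/".join(parts[:2]), tables
        elif all("=" in d for d in sub_dirs):
            # File sits in a key=value partition directory: second original file.
            key, target = "/".join(parts[:-1]), partitions
        else:
            continue  # layout not recognised by this sketch
        target[key].file_count += 1
        target[key].total_bytes += size
    return dict(tables), dict(partitions)
```

The two returned dictionaries carry exactly the information named above, the number of files and the occupied storage space, keyed by table directory and by partition directory respectively.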

In the embodiment of the invention, the information of the original file to which the mirror image file belongs is obtained by analyzing the mirror image file in combination with the directory information of the data warehouse tool, rather than by directly accessing the NameNode node of the metadata management center through a preset interface. This simplifies the acquisition of the information of the original file and improves file processing efficiency. Moreover, because the mirror image file is analyzed instead of the NameNode node being called remotely through the preset interface, no extra access pressure is placed on the NameNode node, and adverse effects on the stability of the production environment can be avoided.

S103: Merge the original files according to the information of the original files in combination with a preset rule.

Optionally, in some embodiments, merging the original files according to the information of the original files by combining a preset rule, includes: determining a first average value of the size of the storage space occupied by the first original files corresponding to each database table according to the information of each first original file, determining a second average value of the size of the storage space occupied by the second original files corresponding to each partition, and combining the original files according to the first average value, the second average value, the number of the first original files and the number of the second original files and a preset rule.

Optionally, in some embodiments, merging the original files according to the first average value, the second average value, the number of the first original files, and the number of the second original files in combination with a preset rule includes: when the first average value or the second average value is smaller than or equal to a first preset threshold value, merging the first original file or the second original file; and/or when the number of the first original files or the number of the second original files is larger than a second preset threshold value, merging the first original files or the second original files.

The first preset threshold and the second preset threshold may be set by a user according to a requirement, or may also be preset by a factory program of the electronic device, which is not limited to this.

By setting a first preset threshold and a second preset threshold, when the first average value or the second average value is smaller than or equal to the first preset threshold, merging the first original file or the second original file; and/or when the number of the first original files or the number of the second original files is larger than a second preset threshold, merging the first original files or the second original files, and setting a reasonable judgment condition for merging, so that the files generated when the data warehouse tool operates can be automatically identified, and the files can be merged in time.
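The judgment condition described above can be sketched in Python as follows. The function names and the threshold parameter names are hypothetical, and the guard that skips directories holding fewer than two files is an assumption added for this sketch rather than part of the embodiment. The sketch implements the "or" branch of the "and/or" condition.

```python
def average_size(total_bytes: int, file_count: int) -> float:
    """Average storage space per original file (0.0 for an empty directory)."""
    return total_bytes / file_count if file_count else 0.0


def should_merge(
    total_bytes: int,
    file_count: int,
    first_threshold_bytes: int,
    second_threshold_count: int,
) -> bool:
    """Decide whether the original files of one table or partition are merged.

    Merge when the average size is smaller than or equal to the first preset
    threshold, or when the number of files is larger than the second preset
    threshold.
    """
    if file_count < 2:
        return False  # assumption: a single file leaves nothing to merge
    avg = average_size(total_bytes, file_count)
    return avg <= first_threshold_bytes or file_count > second_threshold_count
```

The same function applies to a database table (first average value, number of first original files) and to a partition (second average value, number of second original files).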

In the embodiment, the mirror image file generated by the node of the target type is obtained; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

As an example, referring to fig. 2, fig. 2 is a schematic view of an application scenario according to an embodiment of the present invention. The NameNode node of the metadata management center periodically records metadata information, including file information, on a local disk (the disk is the local storage device in the invention), and the stored content can be called an image file (Image). An analysis process is started periodically: a background service obtains the image file generated by the NameNode node, analyzes it, and obtains the information of the original files to which the image file belongs for each Hive directory (corresponding to a single Hive table or a single partition), where the information can be, for example, the number of files and the size of the files. The result of the analysis is stored in any storage system for query. The stored Hive directory information is then queried and evaluated against preset rules to obtain the Hive tables or partitions to be merged, where the preset rules comprise: 1) the average value of the sizes of the storage spaces occupied by the original files is less than or equal to a first preset threshold; 2) the number of the original files is larger than or equal to a second preset threshold. If these conditions are met, the directory is considered to contain too many small files, and they need to be merged.
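The store-then-query part of this scenario can be illustrated with the following Python sketch. An in-memory SQLite database stands in for "any storage system"; the table name `dir_stats`, the column names, and the function `find_merge_candidates` are hypothetical. The sketch requires both rule 1) and rule 2) to hold, which is one possible reading of the scenario.

```python
import sqlite3
from typing import Iterable, List, Tuple


def find_merge_candidates(
    stats: Iterable[Tuple[str, int, int]],
    first_threshold_bytes: int,
    second_threshold_count: int,
) -> List[str]:
    """Store per-directory statistics, then query the directories to merge.

    Each row of `stats` is (hive_directory, file_count, total_bytes).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE dir_stats ("
            "hive_dir TEXT PRIMARY KEY, file_count INTEGER, total_bytes INTEGER)"
        )
        conn.executemany("INSERT INTO dir_stats VALUES (?, ?, ?)", stats)
        rows = conn.execute(
            "SELECT hive_dir FROM dir_stats "
            "WHERE file_count > 0 "
            "AND file_count >= ? "                        # rule 2)
            "AND total_bytes * 1.0 / file_count <= ? "    # rule 1)
            "ORDER BY hive_dir",
            (second_threshold_count, first_threshold_bytes),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Keeping the statistics in a queryable store separates the periodic analysis of the image file from the later decision about which Hive tables or partitions to merge.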

In the specific implementation process of the embodiment of the invention, considering that data is a business-sensitive resource, the merging process can be completed automatically by the electronic device, or can be confirmed through intervention of related personnel, who issue an instruction to merge the corresponding Hive table or partition. Specifically, the merging can be computed through MapReduce/Spark or Hive.

Fig. 3 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present invention.
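The embodiment does not specify the statement used when merging through Hive. As one hedged possibility, the Python sketch below builds a Hive `ALTER TABLE ... CONCATENATE` statement, which Hive supports only for certain file formats such as ORC and RCFile. The function name and its arguments are hypothetical, and the inputs are assumed to be trusted because no quoting or escaping is performed.

```python
from typing import Optional


def build_concatenate_statement(table: str, partition_dir: Optional[str] = None) -> str:
    """Build a HiveQL statement that merges the files of a table or partition.

    `partition_dir` is a key=value path such as "dt=2019-02-15/hour=01".
    Assumption: identifiers and values are trusted and need no escaping.
    """
    statement = f"ALTER TABLE {table}"
    if partition_dir:
        specs = []
        for segment in partition_dir.strip("/").split("/"):
            key, _, value = segment.partition("=")
            specs.append(f"{key}='{value}'")
        statement += " PARTITION (" + ", ".join(specs) + ")"
    return statement + " CONCATENATE"
```

The generated statement can be shown to an operator for confirmation before it is submitted, which matches the option of manual intervention described above.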

The file is a file in a data warehouse tool that includes nodes of the target type.

Referring to fig. 3, the apparatus 300 includes:

an obtaining module 301, configured to obtain an image file generated by a node of a target type;

the analysis module 302 is configured to analyze the mirror image file to obtain information of an original file to which the mirror image file belongs, in combination with directory information of the data warehouse tool;

and a merging processing module 303, configured to merge the original files according to the information of the original files and by combining a preset rule.

Optionally, in some embodiments, referring to fig. 4, the apparatus 300 further comprises:

and the storage module 304 is used for storing the image file generated by the node of the target type into the local storage device in the operation process of the data warehouse tool.

Optionally, in some embodiments, the information includes: the number and the size of the occupied storage space, the parsing module 302 is specifically configured to:

and determining information of a first original file corresponding to each database table and information of a second original file corresponding to each partition in a plurality of database tables and a plurality of partitions of the data warehouse tool by combining directory information of the data warehouse tool.

Optionally, in some embodiments, the merging processing module 303 is specifically configured to:

determining a first average value of the size of the storage space occupied by the first original file corresponding to each database table according to the information of each first original file, and determining a second average value of the size of the storage space occupied by the second original file corresponding to each partition;

and combining the original files according to the first average value, the second average value, the number of the first original files and the number of the second original files by combining a preset rule.

when the first average value or the second average value is smaller than or equal to a first preset threshold value, merging the first original file or the second original file; and/or

and when the number of the first original files or the number of the second original files is larger than a second preset threshold value, merging the first original files or the second original files.

It should be noted that the foregoing explanations of the file processing method embodiments in the embodiments of fig. 1-2 are also applicable to the file processing apparatus 300 of this embodiment, and the implementation principles thereof are similar and will not be described herein again.
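The cooperation of the modules of the apparatus 300 can be sketched as follows. This is only an illustrative Python skeleton: the class name, the callable signatures, and the `Stats` type are assumptions, and each module is represented by a plain function supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical type: directory -> (file_count, total_bytes)
Stats = Dict[str, Tuple[int, int]]


@dataclass
class FileProcessingApparatus:
    acquire: Callable[[], str]            # acquisition module 301: returns image file path
    parse: Callable[[str], Stats]         # analysis module 302: image file -> statistics
    merge: Callable[[Stats], List[str]]   # merging processing module 303: returns merged directories

    def run(self) -> List[str]:
        """Run the three modules in the order of steps S101 to S103."""
        image_path = self.acquire()
        stats = self.parse(image_path)
        return self.merge(stats)
```

Passing the modules in as callables keeps each one replaceable, for example to swap the merge step between an automatic mode and a mode that waits for confirmation.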

The computer device may be a mobile phone, a tablet computer, etc.

Referring to fig. 5, the computer device 50 of the present embodiment includes: a housing 501, a processor 502, a memory 503, a circuit board 504 and a power supply circuit 505, wherein the circuit board 504 is arranged inside a space enclosed by the housing 501, and the processor 502 and the memory 503 are arranged on the circuit board 504; the power supply circuit 505 is used for supplying power to each circuit or device of the computer device 50; the memory 503 is used to store executable program code; the processor 502 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 503, so as to execute:

acquiring a mirror image file generated by a node of a target type;

analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs;

and combining the original files according to the information of the original files and a preset rule.

It should be noted that the foregoing explanation on the file processing method embodiment in the embodiments of fig. 1 to fig. 2 also applies to the computer device 50 of this embodiment, and the implementation principle is similar, and is not described herein again.

The computer device in the embodiment acquires the image file generated by the node of the target type; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

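To achieve the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to execute a file processing method, the file being a file in a data warehouse tool, the data warehouse tool including nodes of a target type, the method including:

acquiring a mirror image file generated by a node of a target type;

analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs;

and combining the original files according to the information of the original files and a preset rule.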

The non-transitory computer readable storage medium in this embodiment obtains an image file generated by a node of a target type; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

To achieve the above embodiments, the present invention further provides a computer program product, wherein when instructions in the computer program product are executed by a processor, a file processing method is performed, where a file is a file in a data warehouse tool, the data warehouse tool includes a node of a target type, and the method includes:

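acquiring a mirror image file generated by a node of a target type;

analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs;

and combining the original files according to the information of the original files and a preset rule.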

The computer program product in the embodiment obtains the image file generated by the node of the target type; analyzing the mirror image file by combining directory information of the data warehouse tool to obtain information of an original file to which the mirror image file belongs; according to the information of the original files, the original files are combined according to the preset rules, the files generated when the data warehouse tool operates can be automatically identified, and the files are combined in time.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method of file processing, wherein the file is a file in a data warehouse tool, the data warehouse tool comprising nodes of a target type, the method comprising the steps of:

acquiring a mirror image file generated by the node of the target type;

analyzing the image file by combining the directory information of the data warehouse tool to obtain the information of the original file to which the image file belongs;

and combining the original files according to the information of the original files by combining a preset rule.

2. The file processing method according to claim 1, wherein before the obtaining the image file corresponding to the node of the target type, further comprising:

and storing the image file generated by the node of the target type into a local storage device in the operation process of the data warehouse tool.

3. The file processing method of claim 1, wherein the information comprises: the quantity and the size of the occupied storage space, in combination with the directory information of the data warehouse tool, the mirror image file is analyzed to obtain the information of the original file to which the mirror image file belongs, and the method comprises the following steps:

4. The file processing method according to claim 3, wherein the merging the original files according to the information of the original files and with a preset rule comprises:

and combining preset rules to merge the original files according to the first average value, the second average value, the number of the first original files and the number of the second original files.

5. The file processing method according to claim 4, wherein the merging the original files according to the first average value, the second average value, the number of the first original files, and the number of the second original files in combination with a preset rule includes:

6. An apparatus for processing a file, the file being a file in a data warehouse tool, the data warehouse tool comprising nodes of a target type, the apparatus comprising:

the acquisition module is used for acquiring the mirror image file generated by the node of the target type;

the analysis module is used for analyzing the image file to obtain the information of the original file to which the image file belongs by combining the directory information of the data warehouse tool;

and the merging processing module is used for merging the original files by combining a preset rule according to the information of the original files.

7. The file processing apparatus according to claim 6, further comprising:

and the storage module is used for storing the image file generated by the target type node into a local storage device in the operation process of the data warehouse tool.

8. The file processing apparatus according to claim 6, wherein the information includes: the quantity and the size of the occupied storage space, and the analysis module is specifically configured to:

9. The file processing apparatus according to claim 8, wherein the merging processing module is specifically configured to:

10. The file processing apparatus according to claim 9, wherein the merging processing module is specifically configured to:

11. A non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing a file processing method according to any one of claims 1 to 5.

12. A computer device comprising a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is disposed inside a space enclosed by the housing, the processor and the memory being disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing:

acquiring a mirror image file generated by the node of the target type;