WO2018054200A1 - Method and device for reading file - Google Patents

Method and device for reading file

Info

Publication number
WO2018054200A1
WO2018054200A1 (PCT/CN2017/099554, CN2017099554W)
Authority
WO
WIPO (PCT)
Prior art keywords
file
data
content
reading
processing
Prior art date
Application number
PCT/CN2017/099554
Other languages
French (fr)
Chinese (zh)
Inventor
米维聪
徐超
罗海英
Original Assignee
上海泓智信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海泓智信息科技有限公司 filed Critical 上海泓智信息科技有限公司
Publication of WO2018054200A1 publication Critical patent/WO2018054200A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/172 Caching, prefetching or hoarding of files

Definitions

  • the present invention relates to the field of big data, and in particular to a file reading method and apparatus.
  • the embodiments of the invention provide a file reading method and device, so as to at least solve the technical problems caused by relatively large files.
  • FIG. 1 is a flow chart of a file reading method according to an embodiment of the present invention.
  • FIG. 2 is a flow chart of an optional file reading method according to an embodiment of the present invention.
  • FIG. 3 is a flow chart of an alternative method of reading predetermined length data according to an embodiment of the present invention.
  • FIG. 4 is a flow chart of an alternative method of reading predetermined length data, in accordance with an embodiment of the present invention.
  • FIG. 1 is a flow chart of a file reading method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102, the reading step: read a predetermined length of data from the file, in a stream manner, according to the size of the buffer area.
  • Step S104, the caching step: place the read data in the buffer area for caching.
  • Step S106, the pre-processing step: pre-process the buffered data according to the pre-configured pre-processing requirements to obtain the content of the data.
  • Step S108, the importing step: save the content of the data to the data platform.
  • Step S110, the loop step: cyclically perform the reading step, the caching step, the pre-processing step, and the importing step in sequence until the reading of the file is complete.
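The five steps above can be sketched in Python (illustrative only, not part of the specification; `preprocess` and `save_to_platform` are assumed callbacks standing in for the pre-configured pre-processing requirements and the data platform):

```python
BUFFER_SIZE = 1024 * 1024  # size of the buffer area (1 MB, illustrative)

def read_file(path, preprocess, save_to_platform):
    """Loop: read -> cache -> pre-process -> import, until the file is consumed."""
    with open(path, "rb") as stream:              # S102: read the file as a stream
        while True:
            buffer = stream.read(BUFFER_SIZE)     # S104: data placed in the buffer area
            if not buffer:
                break                             # file fully read: loop ends
            content = preprocess(buffer)          # S106: pre-process to obtain the content
            save_to_platform(content)             # S108: import into the data platform
            # S110: loop back to the reading step
```

Because only one buffer-sized chunk is resident at a time, this pattern reads a file far larger than the process address space.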
  • In this way, fixed-length data is read from the file as a byte stream and placed in the buffer area for caching; the cached data is then read from the buffer area byte by byte.
  • The data content of the file is parsed according to the length information of the file, the data type of the file content, special characters, endianness, codec, and so on; this completes the pre-processing of the data.
  • The parsed data is saved to the data platform, where data processing, data storage, query and retrieval, and analysis, mining, and display operations can be performed.
  • The reading step, caching step, pre-processing step, and importing step described above are performed cyclically until the reading of the large file is complete.
  • FIG. 2 shows the flow of the optional implementation. As shown in FIG. 2, the foregoing method may further include the following steps:
  • Step S202: the file is split into multiple parts.
  • The splitting can be performed according to the processing capabilities of the different distributed services; that is, the file can be split into multiple parts in proportion to the processing capabilities of the multiple distributed services, and each part is assigned to the corresponding distributed service for processing.
  • For example, if the processing power of the first distributed service is twice that of the second distributed service, the part of the file assigned to the first distributed service can be twice the size of the part assigned to the second distributed service.
  • With this splitting method the parts differ in size, each corresponding to the processing power of its distributed service.
  • Alternatively, the file can be split into equal-sized parts, and a corresponding number of parts can then be allocated to each service according to its processing power.
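The capacity-proportional split can be sketched as follows (an illustrative interpretation; the relative weights standing in for "processing power" are assumptions):

```python
def split_by_capacity(total_size, capacities):
    """Split a file of total_size bytes into byte ranges proportional to each
    distributed service's processing power (capacities are relative weights)."""
    total_weight = sum(capacities)
    offsets, start = [], 0
    for i, w in enumerate(capacities):
        # the last part absorbs rounding so the parts cover the whole file
        size = total_size - start if i == len(capacities) - 1 \
               else total_size * w // total_weight
        offsets.append((start, start + size))
        start += size
    return offsets
```

With weights `[2, 1]` (the first service twice as powerful as the second), a 300-byte file is split into a 200-byte part and a 100-byte part, matching the example above.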
  • Each distributed service can correspond to a separate cache area, and the cache areas corresponding to the multiple distributed services are determined according to the resources of the servers where the distributed services are located.
  • Each of the plurality of nodes corresponds to at least two cache blocks. While the content in the first of the two cache blocks is being processed, content is read from the file according to the size of the second cache block and placed in the second cache block; while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block and placed in the first cache block. The first cache block and the second cache block may be the same size or different sizes.
  • That is, a node may be allocated multiple cache blocks. For example, two cache blocks A and B are allocated to node 1: while node 1 processes the content in cache block A, cache block B can be used to receive newly read content,
  • and the amount of file content read may be determined according to the size of cache block B.
  • Conversely, while the content in cache block B is being processed, cache block A can be used to receive newly read content,
  • and the amount of file content read may be determined according to the size of cache block A. The read content must fit within cache block A: if the size of the read file content exceeded cache block A, the block could not receive it and the content could not be placed there.
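The alternating use of the two cache blocks can be sketched sequentially (a real implementation would overlap reading and processing, for example on separate threads; this sketch only shows the ping-pong role swap between blocks A and B):

```python
def double_buffered_read(stream, process, block_size=4096):
    """Ping-pong between two cache blocks: fill one block while the other's
    contents are 'in use', then swap their roles (sequential sketch)."""
    blocks = [None, None]                        # cache block A and cache block B
    active = 0
    blocks[active] = stream.read(block_size)     # prime cache block A
    while blocks[active]:
        spare = 1 - active
        blocks[spare] = stream.read(block_size)  # fill B while A is being processed
        process(blocks[active])                  # process the active block's content
        active = spare                           # swap: B becomes the active block
```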
  • The size of the buffer area to be saved is recorded, and the size of the cached file content is allocated appropriately according to it.
  • The buffer area to be saved may be any free cache block in the node.
  • A processing node may include two buffer areas: one for receiving data sent by the read node (that is, the buffer area to be processed), and one in which the node stores data after processing the content of the cache blocks
  • (that is, the buffer area to be saved). Once the data has been stored in the system, its processing is complete, that is, the import has succeeded; if the data is not processed successfully, the import fails. An import failure can be caused by a device failure.
  • The read node reads the file data into the buffer area to be processed, and the file content in that buffer area may be allocated across multiple cache blocks.
  • The embodiment of the present invention may further provide that the size and/or number of cache blocks in the buffer area are allocated according to the resource conditions of the nodes where the cache blocks are located, where the allocation is periodic or is performed when a predetermined condition is met. Since the processing speed of each of the plurality of nodes differs, some nodes may have finished processing the content in their cache blocks while other nodes have not.
  • The size of a cache block in the buffer area may be determined according to the size of the free cache blocks in a node; that is, when cache blocks are allocated, the allocation follows the size and number of free cache blocks fed back by each node.
  • For example, when node 1 feeds back that cache block A is 10 MB and cache block B is 20 MB, the buffer area allocates a 10 MB cache block and a 20 MB cache block to node 1.
  • The allocation of cache blocks may be periodic; for example, cache blocks may be allocated once per preset time period, such as every 5 seconds.
  • Alternatively, the size and number of cache blocks allocated in the buffer area may follow a preset condition, which can take multiple forms; for example, the size and number of cache blocks may be determined according to the processing speed of each node.
  • When the size and number of cache blocks are determined according to the processing speed of each node, a feedback mechanism may be set up:
  • each node may feed back to the buffer area, at a certain time interval, the number and size of the cache blocks it has processed.
  • The time interval may be preset, for example 4 seconds; that is, every 4 seconds each node feeds back to the buffer area the number of cache blocks it has processed during that period.
  • A response unit may be disposed in the buffer area to process the data fed back by each node. From this feedback the response unit can learn the number of cache blocks processed by each node and the number and size of each node's free cache blocks, and determine the speed at which each node processes cache blocks. Based on the processing speed and the idle state of the cache blocks in each node as determined by the response unit, the buffer area can reset the number and size of the cache blocks it sends to each node.
  • In this way the buffer area can adjust, in a timely manner, the size and number of cache blocks allocated to each node: more cache blocks are reallocated to nodes with faster processing speeds and fewer to nodes with slower processing speeds, so that the number of cache blocks sent to each node is more reasonable and resource utilization is improved.
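One possible reading of this feedback-driven reallocation (the proportional rule is an assumption; the specification only requires that faster nodes receive more blocks and slower nodes fewer):

```python
def reallocate_blocks(total_blocks, processed_counts):
    """Re-divide a pool of cache blocks among nodes in proportion to how many
    blocks each node reported processing in the last feedback interval."""
    total = sum(processed_counts) or len(processed_counts)  # avoid division by zero
    shares = [total_blocks * c // total for c in processed_counts]
    # hand any rounding remainder to the fastest node
    shares[processed_counts.index(max(processed_counts))] += total_blocks - sum(shares)
    return shares
```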
  • The file to be read may be split by the read node, and the parts obtained by the split are distributed by the read node to the cache blocks corresponding to at least one of the plurality of nodes for processing.
  • The file to be processed can be placed in the buffer area, and the read node can then split it across different cache blocks.
  • The read node may be a node disposed in the buffer area that controls the number and content of the cache blocks processed by the other nodes; it may split the file to be processed into multiple cache blocks and send each cache block to the corresponding node.
  • The above-mentioned read node may thus be responsible for the total file content to be processed, while each of the other nodes processes the cache blocks allocated to it by the read node; the implementation follows a master/sub manner, in which the read node controls each of the sub-nodes as they process the contents of their corresponding cache blocks.
  • The file to be read is split by the read node, and the read node distributes the parts according to the resource situation of each of the plurality of nodes.
  • A threshold may be a preset size for the file to be processed; for example, 200 MB of pending file content may be set to be read at a time. The read node determines the size of the file to be read, and if the size exceeds the threshold, the read node splits the file.
  • The content of the file to be processed may be split across different cache blocks, where the split may be determined according to the size and number of free cache blocks of each of the plurality of nodes; after splitting, the resulting cache blocks are distributed to the corresponding nodes.
  • The splitting of the file to be read includes: determining a split point for splitting the file; determining whether a part obtained by splitting at the split point contains incomplete content at its end or beginning; and, in the case of incomplete content, moving the split point so that the part obtained by the split is complete.
  • Determining whether a part obtained by splitting at the split point contains incomplete content at its end or beginning includes: determining whether the content at the split point is structured data or unstructured data. If it is structured data, it is judged whether the part ends or begins with a complete record; if it is unstructured data, it is judged whether the part ends or begins with a complete file.
  • If the part is complete, the split point need not be moved; if, for unstructured data, the part obtained by the split is not a complete file at its end or beginning, the split point can be moved to the end or the beginning of the unstructured data.
  • Whether data is structured or unstructured may be determined according to the content of the file to be processed.
  • Structured data may be defined in advance, for example data stored in a database;
  • a file in the database can be pre-divided into corresponding structures and can be split at any point. Unstructured data must be obtained as a complete unit, for example pictures and videos, and does not split well. When the read node splits the file to be processed, the same piece of unstructured data should be kept together as far as possible, which improves the node's processing speed.
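For structured, record-delimited data, moving a tentative split point to the nearest record boundary can be sketched as follows (the newline delimiter is an assumption for illustration; any record separator would work the same way):

```python
def adjust_split_point(data, split_point, delimiter=b"\n"):
    """Move a tentative split point forward to the end of the current record,
    so that neither part ends or begins with an incomplete record."""
    end = data.find(delimiter, split_point)
    if end == -1:
        return len(data)           # no later boundary: keep the remainder together
    return end + len(delimiter)    # split just after the record delimiter
```

For unstructured data (a picture, a video) there is no interior boundary, so the whole unit is kept together, as the text above requires.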
  • The parts of the split file are distributed by the read node to at least one of the multiple nodes.
  • Processing via cache blocks includes: distributing each part of the split file, via the read node, to a node group, where each node group includes at least one node and the node group as a whole places the received part into cache blocks for processing.
  • When the split file is distributed to node groups, each node group may include several nodes.
  • The cache blocks, or the cache block corresponding to each node, may form a ring queue that is accessed via a write pointer and a read pointer. After a slot in the ring queue is written, writing to it again is forbidden until it has been read.
  • As a ring queue, the file to be processed may be allocated into a ring of cache blocks; that is, the cache blocks form a ring queue, and while a node processes the file content stored in a cache block,
  • the split file content is written into the corresponding cache block via the write pointer, and the file content in the cache block is consumed via the read pointer.
  • Once a node has written file content into a cache block, writing to that block again is forbidden until the content has been read, so that the node can process the file content in the cache block without interference.
  • Multiple nodes may be set up at allocation time, each with corresponding cache blocks. The number of nodes may be set to correspond one-to-one with the number of cache blocks, that is, each node has one cache block; alternatively, one node may control several consecutive cache blocks in the ring queue, with multiple nodes each controlling a region of the ring.
  • The number of nodes can be fixed; after multiple cache blocks are created they can be added to the ring queue, while the region assigned to each node can change, that is, the number of cache blocks controlled by a node can change in real time.
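A minimal ring queue of cache blocks with the write-pointer/read-pointer discipline described above (slot granularity and Python types are illustrative, not part of the specification):

```python
class RingQueue:
    """Ring queue of cache blocks: a slot written via the write pointer may
    not be overwritten until the read pointer has consumed it."""
    def __init__(self, slots):
        self.buf = [None] * slots
        self.write = 0   # write pointer
        self.read = 0    # read pointer
        self.count = 0   # number of filled slots

    def put(self, block):
        if self.count == len(self.buf):
            return False                          # full: writing again is forbidden
        self.buf[self.write] = block
        self.write = (self.write + 1) % len(self.buf)
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None                           # empty: nothing to read
        block = self.buf[self.read]
        self.buf[self.read] = None                # slot freed for the next write
        self.read = (self.read + 1) % len(self.buf)
        self.count -= 1
        return block
```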
  • The read node may number each cache block, record the number and the start and/or end position of the block, and record whether its file content was imported successfully.
  • The import operation may include the operation of splitting the file content into cache blocks, the operation of the nodes processing the cache block contents, and the operation of storing the processed content; once the entire file has been imported successfully, the related cache records can be deleted. A breakpoint mechanism can be set up so that, when a device fails while sending data, it is known which file data was imported successfully and which failed. After the device fault is repaired, the data whose import failed can be re-cached and sent to the processing nodes, while data that was imported successfully need not be imported again.
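The breakpoint/re-import bookkeeping can be sketched as follows (the checkpoint structure and part numbering are assumptions; the specification only requires recording block numbers, start/end positions, and import success):

```python
def import_with_checkpoint(parts, do_import, checkpoint):
    """Record per-part import status (number, start/end offsets, success flag)
    so that after a device failure only the failed parts are re-imported."""
    for number, (start, end, payload) in enumerate(parts):
        if checkpoint.get(number) == "ok":
            continue                  # already imported successfully: do not repeat
        try:
            do_import(payload)
            checkpoint[number] = "ok"
        except Exception:
            # remember the failed range so it can be re-cached after repair
            checkpoint[number] = ("failed", start, end)
```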
  • The size of the ring queue can be adjusted according to the resources corresponding to the nodes; after a node finishes processing the contents of a cache block, the read node can determine the size of the next split cache block according to the file content.
  • each distributed service has a separate cache area, and the cache area is configured by the cache manager of the server where the distributed service is located.
  • the configuration of the resource may be static configuration or dynamic configuration. Dynamic configuration can be configured based on the current load of the distributed service and the remaining processing power.
  • Since each distributed service has a separate cache area determined by the resources of the server where it is located, and each distributed service has a different processing capability, the size of the buffer corresponding to each distributed service also differs. The file is split into multiple parts that are not necessarily equal in size, and each distributed service processes its split part according to its own processing capability.
  • Step S402: configure the size of the buffer area.
  • Step S404: configure a backup buffer area for the buffer area, where the size of the backup buffer area is the same as the size of the buffer area and the backup buffer area serves as a backup of the buffer area.
  • Step S302: obtain the size of the file.
  • Step S304: if the size of the file exceeds the threshold, read data of the predetermined length from the file, as a file stream, according to the size of the buffer area.
  • For example, suppose the size of the file is 100 MB
  • and the threshold for the file size that can be processed at once is 10 MB. Since the file far exceeds the size that can be processed, it is read as a file stream. Assuming the buffer area is 1 MB, 1 MB of the original file's content is read from the stream each time.
  • Different files have different encoding modes;
  • if a file is decoded with the wrong encoding, garbled characters will appear, and handling the encoding correctly solves the problem of garbled Chinese characters.
  • When the data in the buffer area is parsed, it must be parsed according to the encoding of the original file; that is, the metadata information of the original file is obtained. For example, for a given file, the length of the file content may be 50 bytes,
  • the data type may be an integer, and a special character "$" may be located at the 34th byte of the file.
  • The length of the file, the data type of the file, and the special-character information are all metadata information.
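Metadata-driven parsing of a cached chunk can be sketched as follows (the field layout, a 4-byte integer at the start and a "$" at byte 34, follows the example above; the metadata dictionary format is an assumption):

```python
import struct

def parse_with_metadata(buffer, metadata):
    """Parse a cached chunk using the original file's metadata: total length,
    integer data type, endianness, and the offset of a special character."""
    length = metadata["length"]                      # e.g. 50 bytes
    endian = "<" if metadata["endian"] == "little" else ">"
    # the special character (e.g. "$") sits at a known byte offset
    marker = buffer[metadata["marker_offset"]:metadata["marker_offset"] + 1]
    # decode a 4-byte integer field at the start, honouring endianness
    (value,) = struct.unpack(endian + "i", buffer[:4])
    return {"length": length, "marker": marker, "value": value}
```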
  • The method further includes: setting a breakpoint in at least one of the reading step, the caching step, the pre-processing step, and the importing step, where the breakpoint is used to record information about the step's execution, and the recorded information is used for task recovery.
  • A breakpoint is set in each execution step, and the program running in the background executes multiple tasks concurrently, so when a failure occurs in an execution step, the breakpoint records information related to the error.
  • The breakpoint may record, for example, the time at which the error occurred, the cause of the error, the location of the error, and the state of the background program when the error occurred.
  • an apparatus embodiment for file reading is provided.
  • FIG. 5 is a schematic structural diagram of a file reading apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes a reading module 501, a cache module 503, a preprocessing module 505, an import module 507, and a loop module 509.
  • the reading module 501 is configured to perform the reading step of reading a predetermined length of data from the file, in a stream manner, according to the size of the buffer area.
  • the cache module 503 is configured to perform the caching step of placing the read data in the buffer area for caching.
  • the pre-processing module 505 is configured to perform a pre-processing step of pre-processing the buffered data according to the pre-configured pre-processing requirements to obtain the content of the data.
  • the import module 507 is configured to perform an importing step to save the content of the data to the data platform.
  • the looping module 509 is configured to cyclically perform the reading step, the caching step, the pre-processing step, and the importing step in sequence until the reading of the file is complete.
  • In this way, fixed-length data is read from the file as a byte stream and placed in the buffer area for caching; the cached data is then read from the buffer area byte by byte.
  • The data content of the file is parsed according to the length information of the file, the data type of the file content, special characters, endianness, codec, and so on; this completes the pre-processing of the data.
  • The parsed data is saved to the data platform, where data processing, data storage, query and retrieval, and analysis, mining, and display operations can be performed.
  • The reading step, caching step, pre-processing step, and importing step described above are performed cyclically until the reading of the large file is complete.
  • the foregoing apparatus further includes:
  • a splitting module 511, configured to split the file into multiple parts;
  • a processing module 513, configured to perform the reading step, caching step, pre-processing step, and importing step on the multiple parts of the file using multiple distributed services, and to save the content corresponding to the multiple parts to the data platform; or,
  • alternatively, configured to perform the reading step, caching step, and pre-processing step on the multiple parts of the file using multiple distributed services to obtain the content corresponding to each part, then merge the obtained contents and import the merged content to the data platform.
  • The splitting can be performed according to the processing capabilities of the different distributed services; that is, the file can be split into multiple parts in proportion to the processing capabilities of the multiple distributed services, and each part is assigned to the corresponding distributed service for processing.
  • For example, if the processing power of the first distributed service is twice that of the second distributed service, the part of the file assigned to the first distributed service can be twice the size of the part assigned to the second distributed service.
  • With this splitting method the parts differ in size, each corresponding to the processing power of its distributed service.
  • Alternatively, the file can be split into equal-sized parts, and a corresponding number of parts can then be allocated to each service according to its processing power.
  • For example, the original file is divided into four parts, recorded as a, b, c, and d, and there are four distributed services, A, B, C, and D.
  • Distributed service A performs the reading step, caching step, pre-processing step, and importing step on file part a,
  • yielding the parsed content A' of part a; similarly, parsing parts b, c, and d yields
  • the contents B', C', and D', and finally the parsed contents A', B', C', and D' are saved to the data platform.
  • In the alternative, the original file is likewise split into the four parts a, b, c, and d, and the four distributed services A, B,
  • C, and D perform the reading step, caching step, and pre-processing step on parts a, b, c, and d, obtaining the four contents A', B', C', and D'; the four parts are merged into one content A'B'C'D', the importing step is performed on the merged content, and it is imported to the data platform.
  • Each distributed service can correspond to a separate cache area, and the cache areas corresponding to the multiple distributed services are determined according to the resources of the servers where the distributed services are located.
  • each distributed service has a separate cache area, and the cache area is configured by the cache manager of the server where the distributed service is located.
  • the configuration of the resource may be static configuration or dynamic configuration. Dynamic configuration can be configured based on the current load of the distributed service and the remaining processing power.
  • the foregoing apparatus further includes:
  • the first configuration module 515 is configured to configure a size of the buffer area.
  • the second configuration module 517 is configured to configure a backup buffer area for the buffer area, where the size of the backup buffer area is the same as the size of the buffer area and the backup buffer area serves as a backup of the buffer area.
  • The size and number of buffer areas can be configured automatically according to memory usage; for example, two buffer areas are configured.
  • The two buffer areas have the same size and can be used during parsing of the file to handle garbled-character problems; multiple buffer areas that are recycled can also be configured.
  • the reading module 501 includes:
  • the first reading module 5011 is configured to acquire a size of the file.
  • For example, suppose the size of the file is 100 MB
  • and the threshold for the file size that can be processed at once is 10 MB. Since the file far exceeds the size that can be processed, it is read as a file stream. Assuming the buffer area is 1 MB, 1 MB of the original file's content is read from the stream each time.
  • the pre-processing module is configured to pre-process the cached data according to the metadata information, where the pre-processing module 505 includes:
  • an information obtaining module 5051, configured to read the data in the buffer area and obtain the content of the data according to the metadata information, where the metadata information is used to parse the content of the data and includes at least one of the following: length information, data type, special characters, endianness, and codec.
  • the foregoing apparatus further includes:
  • A breakpoint is set in each execution step, and the program running in the background executes multiple tasks concurrently, so when a failure occurs in an execution step, the breakpoint records information related to the error.
  • The breakpoint may record, for example, the time at which the error occurred, the cause of the error, the location of the error, and the state of the background program when the error occurred.
  • the disclosed technical contents may be implemented in other manners.
  • The device embodiments described above are merely illustrative.
  • The division into units may be a division by logical function; in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • An integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, can be stored in a computer readable storage medium.
  • The technical solution of the present invention, in essence or in the part that contributes over the prior art, or the whole or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server or network device, etc.) to perform all or part of the steps of the various embodiments of the present invention.
  • The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Abstract

Disclosed in the present invention are a method and a device for reading a file. The method comprises: a reading step, reading, in a stream manner, data of a predetermined length from a file on the basis of the size of a buffer; a buffering step, placing the read data in the buffer for buffering; a preprocessing step, preprocessing, according to preconfigured preprocessing requirements, the buffered data to obtain the content of the data; an importing step, storing the content of the data into a data platform; and cyclically executing the reading step, buffering step, preprocessing step and importing step in sequence to accomplish the reading of the file. The present invention solves the technical problem caused by a large file.

Description

文件读取方法和装置File reading method and device 技术领域Technical field
本发明涉及大数据领域,具体而言,涉及一种文件读取方法和装置。The present invention relates to the field of big data, and in particular to a file reading method and apparatus.
背景技术Background technique
Today's society is developing rapidly: technology is advanced, information flows freely, people communicate ever more closely, and life grows ever more convenient. Big data is a product of this high-tech era. Big data has led to the creation of very large files, and reading a large file is problematic compared with processing the smaller files of the past.
For example, some industries routinely have to handle huge files of more than a dozen gigabytes or even tens of terabytes, while a 32-bit process has a virtual address space of only 4 GB, so such a file obviously cannot be loaded into memory in one pass.
As another example, if a file is large, reading its contents into a database is also problematic.
No effective solution has yet been proposed for the above problems caused by large files.
Summary of the invention
Embodiments of the present invention provide a file reading method and device, so as to solve at least the technical problems caused by large files.
According to one aspect of the embodiments of the present invention, a file reading method is provided, including: a reading step of reading, in a streaming manner, data of a predetermined length from a file according to the size of a buffer area; a buffering step of placing the read data in the buffer area; a preprocessing step of preprocessing the buffered data according to preconfigured preprocessing requirements to obtain the content of the data; an importing step of saving the content of the data to a data platform; and cyclically executing the reading step, the buffering step, the preprocessing step, and the importing step in sequence to complete the reading of the file.
According to another aspect of the embodiments of the present invention, a file reading device is further provided, including: a reading module configured to perform the reading step of reading, in a streaming manner, data of a predetermined length from a file according to the size of a buffer area; a buffering module configured to perform the buffering step of placing the read data in the buffer area; a preprocessing module configured to perform the preprocessing step of preprocessing the buffered data according to preconfigured preprocessing requirements to obtain the content of the data; and an importing module configured to perform the importing step of saving the content of the data to a data platform.
In the embodiments of the present invention, a big-data file is read in a distributed manner: data of a predetermined length is read as a stream, placed in a buffer area, and preprocessed to obtain the content of the data, and the content is finally saved to a data platform. This achieves the goal of quickly loading a big-data file into memory, thereby solving the technical problems caused by large files.
Brief description of the drawings
The drawings described herein are provided for a further understanding of the invention and constitute a part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a file reading method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an optional file reading method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an optional method of reading data of a predetermined length according to an embodiment of the present invention;
FIG. 4 is a flowchart of an optional method performed before reading data of a predetermined length according to an embodiment of the present invention; and
FIG. 5 is a schematic structural diagram of a file reading device according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, an embodiment of a file reading method is provided.
FIG. 1 shows a file reading method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102, the reading step: read data of a predetermined length from a file in a streaming manner according to the size of the buffer area.
Step S104, the buffering step: place the read data in the buffer area.
Step S106, the preprocessing step: preprocess the buffered data according to preconfigured preprocessing requirements to obtain the content of the data.
Step S108, the importing step: save the content of the data to the data platform.
Step S110: cyclically execute the reading step, the buffering step, the preprocessing step, and the importing step in sequence to complete the reading of the file.
In an optional embodiment, data of a fixed length is read from the file as a byte stream according to the size of the buffer area and placed in the buffer area. The buffered data is then read byte by byte, and the data content of the file is parsed according to the file's length information, the data types of its content, and information such as special characters, byte order, and encoding/decoding scheme; this completes the preprocessing of the data. Finally, the parsed data is saved to the data platform, where it can be processed, stored, queried, retrieved, analyzed, mined, and displayed. The reading step, the buffering step, the preprocessing step, and the importing step are executed cyclically until the reading of the large file is complete.
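The read–buffer–preprocess–import loop of steps S102–S110 can be sketched in Python as follows. The buffer size, the `preprocess` routine, and the list standing in for the data platform are illustrative assumptions, not part of the embodiment:

```python
import io

BUFFER_SIZE = 4096  # hypothetical buffer-area size


def preprocess(chunk: bytes) -> str:
    # Stand-in for the preconfigured preprocessing requirements
    # (here: simply decode the bytes; a real parser would also handle
    # byte order, special characters, record lengths, and so on).
    return chunk.decode("utf-8", errors="replace")


def read_file(stream, platform: list) -> None:
    while True:
        data = stream.read(BUFFER_SIZE)   # reading step (S102)
        if not data:
            break                         # end of file: loop finished (S110)
        buffered = data                   # buffering step (S104)
        content = preprocess(buffered)    # preprocessing step (S106)
        platform.append(content)          # importing step (S108)
```

With a 9,000-byte input and a 4,096-byte buffer area, for example, the loop executes three times, so the file is never held in memory in its entirety.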
In this embodiment, a big-data file is read in a distributed manner: data of a predetermined length is read as a stream, placed in the buffer area, and preprocessed to obtain the content of the data, and the content is finally saved to the data platform. This achieves the goal of quickly loading a big-data file into memory, thereby solving the technical problems caused by large files.
Considering that the file itself is large, parallel processing can be used to speed up processing; that is, the file can be split into multiple parts that are then processed separately. FIG. 2 shows the flow of this optional implementation. As shown in FIG. 2, the above method may further include the following steps:
Step S202: split the file into multiple parts.
Step S204: have multiple distributed services perform the reading step, the buffering step, the preprocessing step, and the importing step on the respective parts of the file, and save the content corresponding to the parts to the data platform; or have multiple distributed services perform the reading step, the buffering step, and the preprocessing step on the respective parts of the file to obtain the content corresponding to the parts, then merge the obtained content and import the merged content into the data platform.
When splitting, the file can be split according to the processing capacities of the different distributed services; that is, the file can be split into multiple parts according to the respective processing capacities of the distributed services, and the parts assigned to the corresponding services for processing. For example, if the first distributed service has twice the processing capacity of the second, the file part given to the first service can be twice the size of the part given to the second. With this approach the split parts differ in size, in proportion to the processing capacity of each distributed service. Alternatively, the file can be split into parts of equal size, and each service assigned a number of parts matching its processing capacity.
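The capacity-proportional split described above can be sketched as follows; the capacity weights are hypothetical inputs (e.g. reported throughput per service):

```python
def split_by_capacity(total_size: int, capacities: list) -> list:
    """Split a file of total_size bytes into parts proportional to
    each distributed service's processing capacity."""
    weight_sum = sum(capacities)
    sizes = [total_size * c // weight_sum for c in capacities]
    sizes[-1] += total_size - sum(sizes)  # fold the rounding remainder into the last part
    return sizes
```

So a 300-byte file split for two services with capacities 2 and 1 yields parts of 200 and 100 bytes, matching the two-to-one example above.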
For example, the original file is split into four parts, denoted a, b, c, and d, and there are four distributed services, A, B, C, and D. Distributed service A performs the reading step, the buffering step, the preprocessing step, and the importing step on part a, yielding the parsed content A'; likewise, parsing parts b, c, and d yields B', C', and D'. Finally, the parsed contents A', B', C', and D' are saved to the data platform. As another example, the original file is again split into four parts a, b, c, and d for the four distributed services A, B, C, and D. After services A, B, C, and D perform the reading step, the buffering step, and the preprocessing step on parts a, b, c, and d, the contents A', B', C', and D' of the four parts are obtained and merged into a whole A'B'C'D', and the merged content is imported into the data platform in the importing step.
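The second variant, parsing the parts in parallel and merging before importing, can be sketched with a thread pool; `parse_part` is a hypothetical stand-in for the per-service preprocessing:

```python
from concurrent.futures import ThreadPoolExecutor


def parse_part(part: bytes) -> str:
    # Hypothetical per-service preprocessing of one file part.
    return part.decode("utf-8").upper()


def parse_and_merge(parts: list) -> str:
    # Services A..D each parse their own part; Executor.map preserves
    # input order, so the merged result A'B'C'D' follows the original
    # file order regardless of which service finishes first.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        results = list(pool.map(parse_part, parts))
    return "".join(results)
```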
When multiple distributed services are used, each distributed service may have its own independent buffer area so that processing is faster, and the buffer area for each distributed service is determined according to the resources of the server on which that service runs.
In an optional embodiment, each of the multiple nodes corresponds to at least two cache blocks. While the content in the first of the two cache blocks is being processed, content is read from the file according to the size of the second cache block and placed in the second cache block; while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block and placed in the first cache block. The first and second cache blocks may be the same size or different sizes.
A node may be allocated multiple cache blocks. For example, node 1 is allocated two cache blocks, A and B. While node 1 is processing the content in cache block A, content can be read into cache block B, the amount read being determined by the size of block B; optionally, while node 1 is processing the content in cache block B, content can be read into cache block A, the amount read being determined by the size of block A. The reason is that after processing, the processed content must be placed in a cache block: if its size exceeded that of block A, block A could not hold it all and the content could not be placed. To allocate cache blocks sensibly, therefore, the size of the buffer to be written must be obtained before processing, and the size of the cached data allocated according to that buffer's size; the buffer to be written may be any free cache block on the node.
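The two-block alternation amounts to double buffering; a minimal sketch, using a background thread to model filling one block while the other is processed (the block size and helper names are illustrative):

```python
import io
import threading


def double_buffered_read(stream, block_size, process):
    """While the content of one cache block is being processed, the
    next block is filled from the file in the background."""
    front = stream.read(block_size)
    while front:
        holder = {}
        filler = threading.Thread(
            target=lambda: holder.update(back=stream.read(block_size)))
        filler.start()          # fill the other block ...
        process(front)          # ... while this block is processed
        filler.join()
        front = holder["back"]
```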
In an optional embodiment, a processing node may include two buffer areas: one receives the data sent by the reading node (the pending buffer), and the other holds data that the node has processed and that is waiting to be stored in the system (the to-be-saved buffer). Only after the data has been stored in the system is its processing considered complete, i.e., the import has succeeded; if the data was not processed successfully, the import has failed, which may be caused by a device fault. The reading node reads the file data in the buffer area and may divide the pending file in the buffer area into multiple cache blocks.
Optionally, each node corresponds to at least three cache blocks: while the content in one of the three blocks is being processed, newly read content is placed in the other two. When cache blocks are allocated, each node may be assigned at least three blocks, and the blocks can serve different operations such as reading and processing file content. For example, node 2 is allocated three cache blocks A, B, and C. The node can process the content in block A while pending file content from the buffer area is placed into blocks B and C; alternatively, pending data blocks allocated from the buffer area can be placed into blocks A and B.
An embodiment of the present invention may further include: the size and/or number of cache blocks in the buffer area are allocated according to the resources of the node where the blocks reside, the allocation being either periodic or triggered when a predetermined condition is met. In this implementation, because the nodes process at different speeds, some nodes may have finished processing the content of their cache blocks while others have not.
In an optional embodiment, the nodes process their multiple cache blocks in different orders, so the idle states of the blocks differ. For example, node 1 has three cache blocks A, B, and C and has finished processing blocks A and B; node 1 then has two cache blocks to which pending data blocks can be assigned. If multiple cache blocks on multiple nodes are idle, allocation can be made according to the free-block information reported by the nodes.
In an optional embodiment, the sizes of the cache blocks in the buffer area can be determined according to the sizes of the free cache blocks on the nodes; that is, when allocating blocks to the nodes, the sizes and numbers of the blocks can match the sizes and numbers of the free blocks each node reports. For example, if node 1 reports a free cache block A of 10 MB and a free cache block B of 20 MB, the buffer area can allocate one 10 MB block and one 20 MB block to node 1.
In another optional embodiment, cache blocks can be allocated periodically, i.e., once per preset time interval, for example every 5 seconds. In yet another optional embodiment, the size and number of cache blocks can be allocated according to preset conditions, which may cover various cases, for example determining the size and number of blocks according to the processing speed of each node.
For the above embodiment, in which the size and number of cache blocks are determined according to each node's processing speed, a feedback mechanism can be set up: while processing its allocated blocks, each node reports to the buffer area, at a fixed interval, the number and size of the blocks it has processed. The interval may be preset, e.g. 4 seconds, meaning that every 4 seconds each node reports how many blocks it processed in that period. A response unit can be set up in the buffer area to handle the data fed back by the nodes. Through this unit, the buffer area learns how many blocks each node has processed as well as the number and size of each node's free blocks, and determines each node's processing speed. Based on the processing speeds determined by the response unit and the idle states of the blocks on each node, the buffer area can reset the number and size of the cache blocks sent to each node.
Through this feedback mechanism, the buffer area can promptly adjust the size and number of cache blocks allocated to each node, giving more blocks to faster nodes and fewer to slower ones, so that the number of blocks sent to each node is more reasonable and resources are used more efficiently.
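A minimal sketch of that reallocation decision, assuming the response unit has already collected per-interval throughput reports from each node:

```python
def reallocate_blocks(total_blocks: int, speeds: dict) -> dict:
    """Give faster nodes more cache blocks, in proportion to the
    throughput each node reported in the last feedback interval."""
    total_speed = sum(speeds.values())
    alloc = {node: total_blocks * s // total_speed
             for node, s in speeds.items()}
    leftover = total_blocks - sum(alloc.values())
    # hand rounding leftovers to the fastest nodes first
    for node in sorted(speeds, key=speeds.get, reverse=True):
        if leftover == 0:
            break
        alloc[node] += 1
        leftover -= 1
    return alloc
```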
Optionally, the file to be read can be split by a single reading node, which distributes the split parts to the cache blocks of at least one of the multiple nodes for processing. After the pending file is read, it can be placed in the buffer area, and the reading node can then split it into different cache blocks. The reading node may be a node set up in the buffer area that controls the number and content of the cache blocks processed by every other node; it can split the pending file into multiple cache blocks and send each block to the corresponding node.
The reading node may handle the overall pending file, while each of the other nodes processes the content of the cache blocks the reading node assigns to it; this can be realized in a master-worker fashion, with one master reading node controlling the worker nodes that process the content of their respective cache blocks.
In another optional embodiment, when the size of the file to be read exceeds a threshold, the file is split by a reading node, which distributes the split parts according to the resources of each of the multiple nodes. The threshold may be a preset pending-file size; for example, with a setting of reading 200 MB of pending data at a time, the reading node judges the size of the file to be read, and if it exceeds the threshold, the reading node splits the file.
Specifically, when splitting, the content of the pending file can be divided into different cache blocks, the split being determined by the size and number of the free cache blocks on each of the nodes; after splitting, one or more of the resulting blocks can be distributed to the corresponding node.
Splitting the file to be read includes: determining the split points at which the file is divided; judging whether a part produced by a split point contains incomplete content at its end or beginning; and, if it does, moving the split point so that the content of the resulting part is complete.
In an optional embodiment, when determining the split points, multiple split points may be determined according to the differing resources of the nodes receiving the parts. When splitting, the positions of the split points can be determined first, based on the content of the pending file; because the content to be processed differs, the determined positions of the split points will also differ. For example, if the pending file includes text and pictures, a split point can be placed at the beginning or end of a piece of text, or at the beginning or end of a picture.
In an optional embodiment, if there are multiple texts or pictures, the pending texts or pictures can be grouped together when determining the split points, with the split points placed at the beginning and end of the group. In this way, a node processes a relatively complete file, which improves processing efficiency.
In another optional embodiment, judging whether a part produced by a split point contains incomplete content at its end or beginning includes: judging whether the content at the split point is structured or unstructured data; if structured, judging whether the part ends or begins with a complete record; if unstructured, judging whether the part ends or begins with a complete file. If a part of unstructured data is judged to end or begin with a complete file, the split point need not be moved; if not, the split point can be moved to the end or beginning of the unstructured data.
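For structured data, the adjustment can be sketched as moving each rough split point forward to the next record boundary. Newline-separated records are an assumption here; the real separator would come from the preconfigured preprocessing requirements:

```python
def adjusted_split_offsets(data: bytes, n_parts: int,
                           sep: bytes = b"\n") -> list:
    """Compute roughly equal split offsets, then move each split
    point forward past the next record separator so that no record
    is cut in half at a part boundary."""
    rough = [len(data) * i // n_parts for i in range(1, n_parts)]
    points = []
    for p in rough:
        nxt = data.find(sep, p)
        points.append(len(data) if nxt == -1 else nxt + 1)
    return [0] + points + [len(data)]
```

Each pair of adjacent offsets then delimits one part whose last record is complete.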
Structured and unstructured data can be identified from the content of the pending file. In this embodiment, structured data may be defined in advance, for example structured data stored in a database, where files can be pre-divided into corresponding structures and split at any time; unstructured data may be complete objects such as pictures or videos, which do not split well. When the reading node splits the pending file, identical unstructured data should therefore be kept together as much as possible, which improves the speed at which the nodes process the file.
Optionally, distributing the split file to the cache blocks of at least one of the multiple nodes for processing includes: the reading node distributes each part of the split file to a node group, each node group including at least one node, and the node group as a whole places the received part in cache blocks for processing. In this embodiment, after the reading node has split the pending file, it distributes the parts to the node groups, each of which may include several nodes.
In another optional embodiment, the buffer area, or the cache blocks of each node, may form a ring queue accessed through a write pointer and a read pointer; a slot in the ring queue that has been written but not yet read must not be written again. With a ring queue, the pending file is assigned to a circular sequence of cache blocks, i.e., the cache blocks form a ring; when processing the content stored in the blocks, a node writes the split file into the corresponding block through the write pointer and processes the block's content through the read pointer. After a node has written a file into a cache block, writing to that block again is forbidden while its content is being read, so that the node can process the block's content without interference.
In another optional embodiment, the entire buffer area may form one large ring queue accessed through a write pointer and a read pointer. When writing, the write pointer must not pass the current position of the read pointer, because the content beyond the read pointer has not yet been read and must not be overwritten; likewise, the read pointer must not pass the write pointer, because the address space beyond the write pointer has not yet received new data and therefore holds invalid data that must not be read. After the pending file is reassigned into multiple cache blocks, the ring queue is reallocated, and the size and number of the blocks in the queue can change accordingly.
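A single-threaded sketch of such a ring queue: `put` refuses to overwrite a slot that has been written but not yet read, and `get` refuses to read past the write pointer. The class and method names are illustrative:

```python
class RingQueue:
    """Fixed-capacity ring queue accessed through a write pointer
    and a read pointer."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.write = 0
        self.read = 0
        self.unread = 0  # slots written but not yet read

    def put(self, item) -> bool:
        if self.unread == len(self.slots):
            return False  # write pointer would pass the read pointer
        self.slots[self.write] = item
        self.write = (self.write + 1) % len(self.slots)
        self.unread += 1
        return True

    def get(self):
        if self.unread == 0:
            return None  # no valid data beyond the write pointer
        item = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        self.unread -= 1
        return item
```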
Optionally, when the ring queue of the buffer area is allocated, multiple nodes can be set up, each with its corresponding cache blocks. The number of nodes may correspond one-to-one with the number of blocks, i.e., one block per node, or a single node may control several adjacent blocks in the queue, so that regions of the ring queue are controlled by multiple nodes. The number of nodes can be fixed; after multiple cache blocks are reallocated, blocks can be added to the ring queue without changing the nodes of the corresponding regions, i.e., the number of blocks a node controls can vary in real time.
另一种可选的实施方式,读取节点可以对缓存块进行编号,并将编号及缓存块起始和/或结束位置记录下来,还可以记录文件内容是否成功导入,该导入操作可以包括缓存区拆分文件内容的操作、节点处理缓存块中的内容的操作、以及将处理后的文件存储起来的操作,直到整个文件导入成功,才可以删除相关缓存记录。可以设置断点机制,这样,可以在设备发送数据出现故障时,知道哪些文件数据导入成功,哪些文件数据导入失败,在设备故障修复后,可以将导入失败的数据重新缓存并发送给处理节点,同时可以使得已经成功导入的数据不需要重复导入。In another optional implementation manner, the reading node may number the cache block, record the number and the start and/or end position of the cache block, and record whether the file content is successfully imported. The import operation may include caching. The operation of splitting the contents of the file, the operation of the node processing the contents of the cache block, and the operation of storing the processed file until the entire file is successfully imported can delete the related cache record. You can set the breakpoint mechanism. In this way, you can know which file data is imported successfully and which file data import fails when the device sends data failure. After the device fault is repaired, the imported data can be re-cached and sent to the processing node. At the same time, data that has been successfully imported does not need to be repeatedly imported.
可选的,环状队列的大小可以根据节点对应的资源进行调整,在节点处理了缓存 块中的内容后,读取节点可以根据不同文件内容确定拆分的缓存块的大小。Optionally, the size of the ring queue can be adjusted according to the resource corresponding to the node, and the cache is processed at the node. After the contents of the block, the read node can determine the size of the split cache block according to different file contents.
其中,对于缓存队列的长度,可以先判断其是否为定长的结构化数据,若判断出是定长的结构化数据,可以按单条数据的整数倍确定队列长度;若非定长的结构,可以按照系统设定的长度确定队列长度,该系统设置的队列长度可以是用户根据实际情况自主设置,这里不做限定。For the length of the cache queue, it may be first determined whether it is structured data of a fixed length. If it is determined that the structured data is fixed length, the queue length may be determined by an integer multiple of a single data; if the structure is not fixed length, The length of the queue is determined according to the length of the system. The queue length set by the system can be set by the user according to the actual situation.
As an optional embodiment, each distributed service has an independent buffer area, whose resources are configured by the cache manager of the server on which the distributed service runs.

It should be noted that resource configuration may be static or dynamic. Dynamic configuration can be based on the current load and the remaining processing capacity of the distributed service.

As an optional embodiment, since each distributed service has an independent buffer area determined by the resources of its host server, the processing capacity of each distributed service differs, and so does the size of each service's buffer area. When the file is split into multiple parts, the parts are not necessarily equal in size; each distributed service processes its share of the split file according to its own processing capacity.

Optionally, as shown in FIG. 4, before data of a predetermined length is read from the file as a stream according to the size of the buffer area, the method further includes the following steps:

Step S402: configure the size of the buffer area.

Step S404: configure a backup buffer area for the buffer area, where the backup buffer area has the same size as the buffer area and serves as a backup of the buffer area.

As an optional embodiment, the size and number of buffer areas can be configured automatically according to memory usage. Two buffer areas of the same size may be configured and used to handle the garbled characters that can appear when the file is parsed; multiple buffer areas may also be configured and used in rotation.
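One way a second, equal-sized buffer could help with garbled characters is by carrying the incomplete trailing bytes of a multi-byte character over to the next read. The sketch below shows that idea using Python's incremental decoder; this is an assumed interpretation of the two-buffer scheme, not the patent's implementation:

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks read into alternating buffers.

    Incomplete trailing bytes of a multi-byte character are carried
    over to the next chunk (the role the spare buffer could play),
    so characters split across buffer boundaries do not come out
    garbled.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any tail
    return "".join(pieces)
```

Decoding each chunk in isolation would raise an error (or emit replacement characters) whenever a chunk boundary falls inside a multi-byte character; the incremental decoder buffers the partial bytes instead.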
The method shown in FIG. 1 can be applied to the reading of all files, but it may also be applied only to the processing of large files. FIG. 3 shows such an optional implementation: it is a flowchart of a method for reading data of a predetermined length from a file as a stream according to the size of the buffer area. As shown in FIG. 3, the method includes the following steps:

Step S302: obtain the size of the file.

Step S304: if the size of the file exceeds a threshold, read data of a predetermined length from the file as a stream according to the size of the buffer area.

As an optional embodiment, suppose the file size is 100 MB and the threshold for directly processable files is 10 MB. Since the size of the file far exceeds the processable size, the file is read as a stream. Assuming the buffer area is 1 MB, 1 MB of the original file content is read on each pass.
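A minimal sketch of this size check and chunked stream reading follows; the 10 MB threshold and 1 MB buffer from the example are passed in as parameters, and the function name is hypothetical:

```python
import os

def read_in_chunks(path, threshold=10 * 2**20, buffer_size=2**20):
    """Yield the file's bytes.

    Files at or below the threshold are yielded whole; larger files
    are streamed in buffer-sized chunks so the whole file never has
    to sit in memory at once.
    """
    size = os.path.getsize(path)      # Step S302: obtain file size
    with open(path, "rb") as f:
        if size <= threshold:         # small file: read directly
            yield f.read()
            return
        while True:                   # Step S304: stream by buffer size
            chunk = f.read(buffer_size)
            if not chunk:
                break
            yield chunk
```

With a 100 MB file, a 10 MB threshold, and a 1 MB buffer, this yields one hundred 1 MB chunks instead of a single 100 MB read.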
Optionally, preprocessing the cached data according to metadata information includes: reading from the buffer area byte by byte and obtaining the content of the data according to the metadata information, where the metadata information is used to parse the content of the data and includes at least one of the following: length information, data type, special characters, byte order, and encoding/decoding information.

As an optional embodiment, different files use different encodings. When a file is parsed to obtain its content, using the wrong encoding produces garbled characters. To avoid garbled Chinese characters, the data in the buffer area must be parsed according to the encoding of the original file, that is, by first obtaining the metadata information of the original file. For example, a file may be 50 bytes long, the data type in the file may be integer, and the special character "$" may be located at the 34th byte of the file. The file length, the data type, and the special-character information are all metadata information.
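A hedged sketch of metadata-driven parsing along these lines, assuming a metadata record that carries the encoding, the special separator character, and per-field data types; the dictionary keys and function name are invented for illustration:

```python
def parse_with_metadata(raw, meta):
    """Parse a cached byte buffer using metadata about the original
    file: its text encoding, its special separator character, and
    the data type of each field (all keys here are illustrative)."""
    text = raw.decode(meta["encoding"])        # decode with the file's
                                               # own encoding to avoid
                                               # garbled characters
    fields = text.split(meta["special_char"])  # e.g. "$"-separated
    # Cast each field to its declared data type.
    return [cast(value) for cast, value in zip(meta["types"], fields)]

meta = {"encoding": "utf-8", "special_char": "$", "types": [str, int]}
```

Decoding with the wrong encoding at the first step is exactly where mojibake would be introduced, which is why the encoding belongs in the metadata rather than being guessed per chunk.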
Optionally, the method further includes: setting a breakpoint in at least one of the reading step, the caching step, the preprocessing step, and the importing step, where the breakpoint is used to record information when a step fails, and the recorded information is used for task recovery.

As an optional embodiment, a breakpoint is set in each execution step. Because the program running in the background executes multiple tasks concurrently, when a failure occurs in a step, the breakpoint records information about the error. For example, if an error such as a buffer-area overflow occurs while the data is being parsed during preprocessing, the breakpoint records the time of the error, its cause, its location, and the state the background program was in when the error occurred. When the task is recovered, the relevant information can be obtained directly from the breakpoint and execution can resume from the failed step, without re-executing all the steps, which saves execution time.
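The breakpoint-and-resume behavior described above might be sketched as follows. The step names mirror the four steps of the method, while the record fields and handler interface are assumptions for illustration:

```python
STEPS = ["read", "cache", "preprocess", "import"]

def run_with_breakpoints(handlers, checkpoint=None):
    """Run the pipeline steps in order, skipping steps already
    completed per the checkpoint; on failure, return a breakpoint
    record (failed step and cause) so a later run can resume there
    instead of re-executing everything."""
    start = STEPS.index(checkpoint) + 1 if checkpoint else 0
    for step in STEPS[start:]:
        try:
            handlers[step]()
        except Exception as err:
            # A fuller record would also include timestamp and
            # program state, as the embodiment describes.
            return {"failed_step": step, "reason": str(err)}
        checkpoint = step
    return {"completed": checkpoint}
```

A first run that fails in preprocessing returns the breakpoint record; a second run passes the last successful step back as `checkpoint` and picks up from the failed step.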
Embodiment 2

According to an embodiment of the present invention, an embodiment of a file reading apparatus is provided.

FIG. 5 is a schematic structural diagram of a file reading apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes a reading module 501, a caching module 503, a preprocessing module 505, an importing module 507, and a looping module 509.

The reading module 501 is configured to perform the reading step: reading data of a predetermined length from the file as a stream according to the size of the buffer area.

The caching module 503 is configured to perform the caching step: placing the read data in the buffer area for caching.

The preprocessing module 505 is configured to perform the preprocessing step: preprocessing the cached data according to pre-configured preprocessing requirements to obtain the content of the data.

The importing module 507 is configured to perform the importing step: saving the content of the data to the data platform.

The looping module 509 is configured to execute the reading step, the caching step, the preprocessing step, and the importing step in a loop to complete the reading of the file.

As an optional embodiment, data of a fixed length is read from the file as a byte stream according to the size of the buffer area and placed in the buffer area for caching. The cached data is then read from the buffer area byte by byte, and the data content of the file is parsed according to the file's length information, the data type of the file content, and information such as special characters, byte order, and encoding; this completes the preprocessing of the data. Finally, the parsed data is saved to the data platform, where it can be processed, stored, queried, retrieved, analyzed, mined, and displayed. The reading step, the caching step, the preprocessing step, and the importing step are executed in a loop until the reading of the large file is completed.
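The read/cache/preprocess/import loop can be sketched in a few lines. This is an illustration under assumed interfaces (a byte stream, a preprocessing callable, and a list standing in for the data platform), not the patent's code:

```python
def read_file(stream, buffer_size, preprocess, platform):
    """Loop: read a fixed-length chunk as a byte stream, cache it in
    a buffer, preprocess the buffered bytes into content, and import
    the content into the data platform, until the file is exhausted."""
    while True:
        data = stream.read(buffer_size)      # reading step
        if not data:                         # file exhausted: loop ends
            break
        cache = bytearray(data)              # caching step (buffer area)
        content = preprocess(bytes(cache))   # preprocessing step
        platform.append(content)             # importing step
    return platform
```

Each pass through the loop handles exactly one buffer's worth of data, so memory use stays bounded by `buffer_size` regardless of the file's total size.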
In this embodiment, a large data file is read in a distributed manner: data of a predetermined length is read as a stream, placed in the buffer area, and preprocessed to obtain the content of the data, and the content is finally saved to the data platform. This achieves the goal of quickly loading a large data file into memory, thereby solving the technical problem caused by files that are relatively large.

Optionally, as shown in FIG. 5, the apparatus further includes:

a splitting module 511, configured to split the file into multiple parts; and

a processing module 513, configured to perform the reading step, the caching step, the preprocessing step, and the importing step on the multiple parts of the file respectively through multiple distributed services, saving the content corresponding to the multiple parts to the data platform; or to perform the reading step, the caching step, and the preprocessing step on the multiple parts of the file respectively through the multiple distributed services to obtain the content corresponding to the multiple parts, then merge the obtained content and import the merged content into the data platform.

When splitting, the file can be split according to the processing capacities of the different distributed services; that is, the file can be split into multiple parts according to the respective processing capacities of the multiple distributed services, and the parts can be assigned to the corresponding services for processing. For example, if the processing capacity of the first distributed service is twice that of the second, the file part given to the first service can be twice the size of the part given to the second. With this splitting method the resulting parts differ in size, in proportion to the processing capacities of the distributed services. Alternatively, the file can be split into parts of equal size, and each service can be assigned a number of parts according to its processing capacity.
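A sketch of capacity-proportional splitting, where part sizes follow the ratio of the services' processing capacities; this is a simplified model, as the patent does not prescribe a particular formula:

```python
def split_by_capacity(total_size, capacities):
    """Split a file of total_size bytes into parts proportional to
    each distributed service's processing capacity. The last part
    absorbs any rounding remainder so the sizes sum exactly."""
    whole = sum(capacities)
    sizes = [total_size * c // whole for c in capacities]
    sizes[-1] += total_size - sum(sizes)  # fix integer-division loss
    return sizes
```

With capacities of 2 and 1, a 300-byte file splits into 200 and 100 bytes, matching the "twice the processing capacity, twice the file size" example above.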
As an optional embodiment, suppose the original file is split into four parts, denoted a, b, c, and d, and there are four distributed services, A, B, C, and D. Distributed service A performs the reading step, the caching step, the preprocessing step, and the importing step on part a; after execution, the parsed content A' of part a is obtained. Likewise, parsing parts b, c, and d yields B', C', and D', and finally the parsed contents A', B', C', and D' are saved to the data platform. In another example, the original file is again split into four parts a, b, c, and d, with four distributed services A, B, C, and D. After services A, B, C, and D perform the reading step, the caching step, and the preprocessing step on parts a, b, c, and d, the contents A', B', C', and D' of the four parts are obtained; these four parts are merged into a single whole A'B'C'D', the importing step is performed on the merged content, and it is imported onto the data platform.

When multiple distributed services are used, to speed up distributed processing, each distributed service may correspond to an independent buffer area, and the buffer areas corresponding to the multiple distributed services are determined according to the resources of the servers on which the distributed services are located.

As an optional embodiment, each distributed service has an independent buffer area, whose resources are configured by the cache manager of the server on which the distributed service runs.

It should be noted that resource configuration may be static or dynamic. Dynamic configuration can be based on the current load and the remaining processing capacity of the distributed service.

As an optional embodiment, since each distributed service has an independent buffer area determined by the resources of its host server, the processing capacity of each distributed service differs, and so does the size of each service's buffer area. When the file is split into multiple parts, the parts are not necessarily equal in size; each distributed service processes its share of the split file according to its own processing capacity.
Optionally, as shown in FIG. 5, the apparatus further includes:

a first configuration module 515, configured to configure the size of the buffer area; and

a second configuration module 517, configured to configure a backup buffer area for the buffer area, where the backup buffer area has the same size as the buffer area and serves as a backup of the buffer area.

As an optional embodiment, the size and number of buffer areas can be configured automatically according to memory usage. Two buffer areas of the same size may be configured and used to handle the garbled characters that can appear when the file is parsed; multiple buffer areas may also be configured and used in rotation.

Optionally, as shown in FIG. 5, the reading module 501 includes:

a first reading module 5011, configured to obtain the size of the file; and

a second reading module 5013, configured to read data of a predetermined length from the file as a stream according to the size of the buffer area when the size of the file exceeds a threshold.

As an optional embodiment, suppose the file size is 100 MB and the threshold for directly processable files is 10 MB. Since the size of the file far exceeds the processable size, the file is read as a stream. Assuming the buffer area is 1 MB, 1 MB of the original file content is read on each pass.

Optionally, as shown in FIG. 5, the preprocessing module 505 is configured to preprocess the cached data according to metadata information, and includes:

an information obtaining module 5051, configured to read from the buffer area byte by byte and obtain the content of the data according to the metadata information, where the metadata information is used to parse the content of the data and includes at least one of the following: length information, data type, special characters, byte order, and encoding/decoding information.

As an optional embodiment, different files use different encodings. When a file is parsed to obtain its content, using the wrong encoding produces garbled characters. To avoid garbled Chinese characters, the data in the buffer area must be parsed according to the encoding of the original file, that is, by first obtaining the metadata information of the original file. For example, a file may be 50 bytes long, the data type in the file may be integer, and the special character "$" may be located at the 34th byte of the file. The file length, the data type, and the special-character information are all metadata information.

Optionally, as shown in FIG. 5, the apparatus further includes:

a breakpoint module 519, configured to set a breakpoint in at least one of the reading step, the caching step, the preprocessing step, and the importing step, where the breakpoint is used to record information when a step fails, and the recorded information is used for task recovery.

As an optional embodiment, a breakpoint is set in each execution step. Because the program running in the background executes multiple tasks concurrently, when a failure occurs in a step, the breakpoint records information about the error. For example, if an error such as a buffer-area overflow occurs while the data is being parsed during preprocessing, the breakpoint records the time of the error, its cause, its location, and the state the background program was in when the error occurred. When the task is recovered, the relevant information can be obtained directly from the breakpoint and execution can resume from the failed step, without re-executing all the steps, which saves execution time.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function; in actual implementation there may be other ways of dividing them: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are merely preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (23)

  1. A file reading method, comprising:
    a reading step of reading data of a predetermined length from a file as a stream according to the size of a buffer area;
    a caching step of placing the read data in the buffer area for caching;
    a preprocessing step of preprocessing the cached data according to pre-configured preprocessing requirements to obtain the content of the data;
    an importing step of saving the content of the data to a data platform; and
    performing the reading step, the caching step, the preprocessing step, and the importing step in a loop to complete the reading of the file.
  2. The method according to claim 1, wherein:
    the file is split into multiple parts; and
    the reading step, the caching step, the preprocessing step, and the importing step are performed on the multiple parts of the file respectively through multiple distributed services, and the content corresponding to the multiple parts is saved to the data platform; or
    the reading step, the caching step, and the preprocessing step are performed on the multiple parts of the file respectively through the multiple distributed services to obtain the content corresponding to the multiple parts, the obtained content is then merged, and the merged content is imported into the data platform.
  3. The method according to claim 2, wherein the buffer area comprises multiple cache blocks distributed over multiple nodes, and each node is configured to process the content in the cache block corresponding to that node.
  4. The method according to claim 3, wherein each of the multiple nodes corresponds to at least two cache blocks, wherein:
    while the content in a first cache block of the at least two cache blocks is being processed, content is read from the file according to the size of a second cache block of the at least two cache blocks and the read content is placed in the second cache block; and while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block of the at least two cache blocks and the read content is placed in the first cache block, wherein the first cache block and the second cache block are of the same size or of different sizes.
  5. The method according to claim 4, wherein each node corresponds to at least three cache blocks, and while the content in one of the three cache blocks is being processed, the read content is placed in the other two of the three cache blocks.
  6. The method according to claim 3, wherein the method further comprises:
    allocating the size of the cache blocks and/or the number of cache blocks in the buffer area according to the resource situation of the nodes where the cache blocks are located, wherein the allocation is periodic, or the allocation is performed when a predetermined condition is satisfied.
  7. The method according to any one of claims 2 to 6, wherein splitting the file into multiple parts comprises:
    splitting the file through one reading node; and
    distributing, through the reading node, the split file to the cache block corresponding to at least one of the multiple nodes for processing.
  8. The method according to claim 7, wherein:
    when the size of the file exceeds a threshold, the file to be read is split through one reading node; and
    the split file is distributed by the reading node according to the resource situation of each of the multiple nodes.
  9. The method according to claim 7, wherein splitting the file comprises:
    determining a split point at which the file is to be split;
    judging whether the part obtained by splitting at the split point includes incomplete content at its end or beginning; and
    if incomplete content is included, moving the split point so that the content included in the split part is complete.
  10. 根据权利要求9所述的方法,其中,判断根据所述拆分点拆分得到的部分在结束处或者开始处是否包括不完整的内容包括:The method according to claim 9, wherein judging whether the portion obtained by splitting according to the split point includes incomplete content at the end or at the beginning comprises:
    判断所述拆分点处的内容是否为结构化数据或者非结构化数据;Determining whether the content at the split point is structured data or unstructured data;
    如果为结构化数据则判断根据所述拆分点拆分得到的部分在所述结束处或者所述开始处是否为完整的记录; If it is structured data, it is judged whether the portion obtained by splitting according to the split point is a complete record at the end or at the beginning;
    如果为非结构化数据则判断根据所述拆分点拆分得到的部分在所述结束处或所述开始处是否为完整的文件。If it is unstructured data, it is judged whether the portion obtained by the split point split is a complete file at the end or at the beginning.
  11. The method according to claim 7, wherein distributing, by the reading node, the parts of the split file to cache blocks corresponding to at least one of the plurality of nodes for processing comprises:
    distributing, by the reading node, each part of the split file to a node group, wherein each node group includes at least one node, and the node group as a whole places the received part in a cache block for processing.
  12. The method according to claim 3, wherein the buffer area or the cache block corresponding to each of the nodes is a ring queue, the ring queue is accessed via a write pointer and a read pointer, and a slot in the ring queue that has been written but not yet read is prohibited from being written again.
  13. The method according to claim 12, wherein the size of the ring queue is adjusted according to the resources corresponding to the node.
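For illustration only, the ring queue of claims 12 and 13 — a write pointer, a read pointer, and a prohibition on overwriting unread slots — can be sketched as a blocking single-producer/single-consumer structure. The slot count and the choice to block (rather than drop) are assumptions of the sketch:

```python
import threading

class RingQueue:
    """Fixed number of cache slots accessed via write/read pointers.
    A slot that has been written but not yet read cannot be written
    again; put() blocks until the reader frees a slot."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.slots = slots
        self.write_ptr = 0
        self.read_ptr = 0
        self.unread = 0                    # written-but-unread slots
        self.cond = threading.Condition()

    def put(self, item):
        with self.cond:
            while self.unread == self.slots:   # every slot holds unread data
                self.cond.wait()               # block instead of overwriting
            self.buf[self.write_ptr] = item
            self.write_ptr = (self.write_ptr + 1) % self.slots
            self.unread += 1
            self.cond.notify_all()

    def get(self):
        with self.cond:
            while self.unread == 0:
                self.cond.wait()
            item = self.buf[self.read_ptr]
            self.buf[self.read_ptr] = None     # slot may be written again
            self.read_ptr = (self.read_ptr + 1) % self.slots
            self.unread -= 1
            self.cond.notify_all()
            return item
```

Resizing the queue per node (claim 13) would then amount to choosing `slots` from the node's available memory.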
  14. The method according to claim 2, wherein the plurality of distributed services respectively correspond to independent buffer areas, and the buffer area corresponding to each of the plurality of distributed services is determined according to the resources of the server on which that distributed service is located.
  15. The method according to claim 2, wherein the file is split into a plurality of parts according to the respective processing capabilities of the plurality of distributed services, and the plurality of parts are allocated to the corresponding distributed services for processing.
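As an illustrative sketch of claim 15 (the proportional weighting scheme is an assumption; the claim does not prescribe a formula), parts can be sized in proportion to each service's processing capability:

```python
def split_by_capability(file_size, capabilities):
    """Divide a file of file_size bytes into byte ranges proportional
    to each distributed service's capability weight."""
    total = sum(capabilities)
    ranges, start = [], 0
    for i, cap in enumerate(capabilities):
        # The last part absorbs rounding so the ranges cover the file.
        end = file_size if i == len(capabilities) - 1 \
              else start + file_size * cap // total
        ranges.append((start, end))
        start = end
    return ranges
```

A service with twice the capability weight thus receives a part roughly twice as large, which keeps the services finishing at about the same time.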
  16. The method according to any one of claims 1 to 15, wherein reading data of a predetermined length from the file in a streaming manner according to the size of the buffer area comprises:
    obtaining the size of the file;
    in a case where the size of the file exceeds a threshold, reading the data of the predetermined length from the file as a stream file according to the size of the buffer area.
  17. The method according to any one of claims 1 to 15, wherein, before reading data of a predetermined length from the file as a stream file according to the size of the buffer area, the method further comprises:
    configuring the size of the buffer area;
    configuring a spare buffer area for the buffer area, wherein the size of the spare buffer area is identical to the size of the buffer area, and the spare buffer area serves as a backup of the buffer area.
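The buffer-plus-spare configuration of claim 17 is, in effect, double buffering: while one buffer is being drained, its equal-sized backup can accept the next read. A minimal sketch for illustration (the class and method names are assumptions):

```python
class DoubleBuffer:
    """A buffer area plus a spare of identical size; swap() exchanges
    the active buffer for the spare so reading can continue while the
    previously active buffer is still being processed."""

    def __init__(self, size):
        self.size = size
        self.active = bytearray(size)
        self.spare = bytearray(size)   # backup, same size as the buffer area

    def swap(self):
        self.active, self.spare = self.spare, self.active
        return self.active
```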
  18. The method according to claim 1, wherein preprocessing the cached data according to metadata information comprises:
    reading from the buffer area byte by byte, and obtaining the content of the data according to the metadata information, wherein the metadata information is used for content parsing of the data, and the metadata information includes at least one of: length information, data type, byte order, special characters, encoding/decoding scheme, and terminator information.
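Purely for illustration, a byte-level parse driven by such metadata (byte order, field types, terminator) might look like the following. The field layout and metadata keys are invented for the example and are not defined by the application:

```python
import struct

def parse_records(buf, metadata):
    """Walk a byte buffer and decode fixed-length records according to
    metadata: (name, struct format) field pairs, a byte order, and a
    terminator that must follow each record."""
    order = "<" if metadata["byte_order"] == "little" else ">"
    rec_fmt = order + "".join(fmt for _, fmt in metadata["fields"])
    rec_len = struct.calcsize(rec_fmt)
    term = metadata["terminator"]
    out, pos = [], 0
    while pos + rec_len <= len(buf):
        values = struct.unpack_from(rec_fmt, buf, pos)
        out.append(dict(zip((name for name, _ in metadata["fields"]), values)))
        pos += rec_len
        if buf[pos:pos + len(term)] != term:
            break                      # malformed record boundary: stop
        pos += len(term)
    return out
```

In practice the same metadata would also select the character encoding and flag special characters; the sketch covers only length, type, byte order, and terminator.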
  19. The method according to any one of claims 1 to 15, further comprising:
    setting a breakpoint in at least one of the reading step, the caching step, the preprocessing step, and the importing step, wherein the breakpoint is used to record information in a case where execution of the step fails, and the recorded information is used for task recovery.
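As an illustrative sketch of claim 19's breakpoints (the JSON checkpoint format and function names are assumptions), a failed step can be recorded so a later run resumes at that step rather than restarting the whole pipeline:

```python
import json
import os

def run_with_breakpoints(steps, checkpoint_path):
    """Run (name, fn) pipeline steps in order; on failure, record which
    step failed so a later invocation can resume from it."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["failed_step"]   # resume where we stopped
    for i in range(start, len(steps)):
        name, fn = steps[i]
        try:
            fn()
        except Exception as exc:
            with open(checkpoint_path, "w") as f:
                json.dump({"failed_step": i, "step": name,
                           "error": str(exc)}, f)
            raise
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)    # clean run: no recovery needed
```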
  20. A file reading device, comprising:
    a reading module, configured to perform a reading step of reading data of a predetermined length from a file in a streaming manner according to the size of a buffer area;
    a caching module, configured to perform a caching step of placing the read data in the buffer area for caching;
    a preprocessing module, configured to perform a preprocessing step of preprocessing the cached data according to a preconfigured preprocessing requirement to obtain the content of the data;
    an importing module, configured to perform an importing step of saving the content of the data to a data platform;
    a looping module, configured to cyclically execute the reading step, the caching step, the preprocessing step, and the importing step in sequence to complete reading of the file.
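The read/cache/preprocess/import loop that the modules of claim 20 carry out can be sketched, for illustration only, as a single streaming function (the callback-based interface is an assumption of the sketch, not the claimed device structure):

```python
def read_file_in_chunks(path, buffer_size, preprocess, import_to_platform):
    """Loop: read a buffer-sized chunk as a stream, cache it, preprocess
    it into content, import the content; repeat until the file is done."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(buffer_size)    # reading step
            if not chunk:
                break                           # whole file consumed
            cached = bytearray(chunk)           # caching step (buffer area)
            content = preprocess(cached)        # preprocessing step
            import_to_platform(content)         # importing step
```

Because only one buffer-sized chunk is ever held at a time, memory use stays bounded regardless of how large the file is, which is the point of the streaming design.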
  21. The device according to claim 20, wherein the device further comprises:
    a splitting module, configured to split the file into a plurality of parts;
    a processing module, configured to perform, via a plurality of distributed services, the reading step, the caching step, the preprocessing step, and the importing step on each of the plurality of parts of the file, respectively, and save the content corresponding to the plurality of parts to the data platform; or configured to perform, via the plurality of distributed services, the reading step, the caching step, and the preprocessing step on each of the plurality of parts of the file, respectively, to obtain the content corresponding to the plurality of parts, then merge the obtained content and import the merged content into the data platform.
  22. A storage medium, wherein the storage medium includes a stored program, and the program, when executed, performs the file reading method according to any one of claims 1 to 19.
  23. A processor, wherein the processor is configured to run a program, and the program, when run, performs the file reading method according to any one of claims 1 to 19.
PCT/CN2017/099554 2016-09-26 2017-08-30 Method and device for reading file WO2018054200A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610851849.9 2016-09-26
CN201610851849.9A CN107870928A (en) 2016-09-26 2016-09-26 File reading and device

Publications (1)

Publication Number Publication Date
WO2018054200A1 true WO2018054200A1 (en) 2018-03-29

Family

ID=61689371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/099554 WO2018054200A1 (en) 2016-09-26 2017-08-30 Method and device for reading file

Country Status (2)

Country Link
CN (1) CN107870928A (en)
WO (1) WO2018054200A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344092A (en) * 2018-09-11 2019-02-15 天津易华录信息技术有限公司 A kind of method and system improving cold storing data reading speed
CN110784756A (en) * 2019-12-31 2020-02-11 珠海亿智电子科技有限公司 File reading method and device, computing equipment and storage medium
CN111552440A (en) * 2020-04-26 2020-08-18 全球能源互联网研究院有限公司 Cloud-edge-end data synchronization method for power internet of things
CN111680474A (en) * 2020-06-08 2020-09-18 中国银行股份有限公司 Method and device for repairing messy codes of files
CN112698877A (en) * 2019-10-21 2021-04-23 上海哔哩哔哩科技有限公司 Data processing method and system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI698740B (en) * 2018-08-27 2020-07-11 宏碁股份有限公司 Deployment method of recovery images and electronic device using the same
CN110750505A (en) * 2019-08-31 2020-02-04 苏州浪潮智能科技有限公司 Large file reading optimization method, device, equipment and storage medium
CN112764908B (en) * 2021-01-26 2024-01-26 北京鼎普科技股份有限公司 Network data acquisition processing method and device and electronic equipment
CN113783939A (en) * 2021-08-20 2021-12-10 奇安信科技集团股份有限公司 File transmission method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615175A (en) * 2009-08-11 2009-12-30 深圳市五巨科技有限公司 A kind of system and method for reading electronic book of mobile terminal
CN102521349A (en) * 2011-12-12 2012-06-27 深圳市创新科信息技术有限公司 Pre-reading method of files
CN103412950A (en) * 2013-08-28 2013-11-27 浙江大学 Method for increasing read-write speed of spatial big data files
CN104331255A (en) * 2014-11-17 2015-02-04 中国科学院声学研究所 Embedded file system-based reading method for streaming data
CN104394229A (en) * 2014-12-09 2015-03-04 浪潮电子信息产业股份有限公司 Large file uploading method based on concurrent transmission mode
CN105761039A (en) * 2016-02-17 2016-07-13 华迪计算机集团有限公司 Method for processing express delivery information big data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0981497A (en) * 1995-09-12 1997-03-28 Toshiba Corp Real-time stream server, storing method for real-time stream data and transfer method therefor
CN101202882B (en) * 2007-07-19 2010-09-15 深圳市同洲电子股份有限公司 Method, system for transmitting medium resource and set-top box
CN101119278A (en) * 2007-09-14 2008-02-06 广东威创日新电子有限公司 Method and system for processing mass data
CN101127578A (en) * 2007-09-14 2008-02-20 广东威创日新电子有限公司 A method and system for processing a magnitude of data
CN103077149A (en) * 2013-01-09 2013-05-01 厦门市美亚柏科信息股份有限公司 Method and system for transmitting data
CN103164538B (en) * 2013-04-11 2016-10-19 深圳市华力特电气股份有限公司 A kind of data analysis method and device
CN105701178B (en) * 2016-01-05 2017-06-09 北京汇商融通信息技术有限公司 Distributed picture storage system


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344092A (en) * 2018-09-11 2019-02-15 天津易华录信息技术有限公司 A kind of method and system improving cold storing data reading speed
CN109344092B (en) * 2018-09-11 2023-06-23 天津易华录信息技术有限公司 Method and system for improving cold storage data reading speed
CN112698877A (en) * 2019-10-21 2021-04-23 上海哔哩哔哩科技有限公司 Data processing method and system
CN112698877B (en) * 2019-10-21 2023-07-14 上海哔哩哔哩科技有限公司 Data processing method and system
CN110784756A (en) * 2019-12-31 2020-02-11 珠海亿智电子科技有限公司 File reading method and device, computing equipment and storage medium
CN110784756B (en) * 2019-12-31 2020-05-29 珠海亿智电子科技有限公司 File reading method and device, computing equipment and storage medium
CN111552440A (en) * 2020-04-26 2020-08-18 全球能源互联网研究院有限公司 Cloud-edge-end data synchronization method for power internet of things
CN111680474A (en) * 2020-06-08 2020-09-18 中国银行股份有限公司 Method and device for repairing messy codes of files
CN111680474B (en) * 2020-06-08 2024-02-23 中国银行股份有限公司 File messy code repairing method and device

Also Published As

Publication number Publication date
CN107870928A (en) 2018-04-03

Similar Documents

Publication Publication Date Title
WO2018054200A1 (en) Method and device for reading file
CN108052675B (en) Log management method, system and computer readable storage medium
CN105872016B (en) The operation method of virtual machine in a kind of desktop cloud
US10649905B2 (en) Method and apparatus for storing data
CN110247984B (en) Service processing method, device and storage medium
US9836516B2 (en) Parallel scanners for log based replication
CN110633378A (en) Graph database construction method supporting super-large scale relational network
CN110716848A (en) Data collection method and device, electronic equipment and storage medium
CN113094430B (en) Data processing method, device, equipment and storage medium
WO2023066182A1 (en) File processing method and apparatus, device, and storage medium
US20170083387A1 (en) High-performance computing framework for cloud computing environments
CN112486913A (en) Log asynchronous storage method and device based on cluster environment
WO2017015059A1 (en) Efficient cache warm up based on user requests
CN109788251B (en) Video processing method, device and storage medium
CN109710502B (en) Log transmission method, device and storage medium
CN114780615A (en) Error code management method and device thereof
CN113051221A (en) Data storage method, device, medium, equipment and distributed file system
CN112433812A (en) Method, system, equipment and computer medium for virtual machine cross-cluster migration
CN112506432A (en) Dynamic and static separated real-time data storage and management method and device for electric power automation system
CN107621994A (en) The method and device that a kind of data snapshot creates
CN112363980A (en) Data processing method and device for distributed system
CN110781137A (en) Directory reading method and device for distributed system, server and storage medium
CN113849686A (en) Video data acquisition method and device, electronic equipment and storage medium
CN115809015A (en) Method for data processing in distributed system and related system
CN111427654A (en) Instruction processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852269

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.07.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17852269

Country of ref document: EP

Kind code of ref document: A1