CN108090087B - File processing method and device - Google Patents
File processing method and device
- Publication number
- CN108090087B CN108090087B CN201611041059.0A CN201611041059A CN108090087B CN 108090087 B CN108090087 B CN 108090087B CN 201611041059 A CN201611041059 A CN 201611041059A CN 108090087 B CN108090087 B CN 108090087B
- Authority
- CN
- China
- Prior art keywords
- file
- node
- cache
- read
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a file processing method and device. The method comprises: reading content of a predetermined length from a file to be read; placing the read content in a cache region for caching, wherein the cache region comprises a plurality of cache blocks, the cache blocks are distributed across a plurality of nodes, and each node processes the content in its corresponding cache blocks; and, after the content in the plurality of cache blocks has been processed, reading content of the predetermined length from the file again and placing it in the plurality of cache blocks, until the whole file has been processed. The invention thereby solves the technical problem of low file-processing efficiency.
Description
Technical Field
The invention relates to the field of information processing, in particular to a file processing method and device.
Background
In the prior art, data stored in a large file is generally processed by a single node operating on cached content. When a large file is imported, however, the read speed and the processing speed can diverge because processors are configured differently; if the configured processor processes the file data slowly, reading and processing the large file becomes inefficient.
No effective solution to this problem of low file-processing efficiency has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a file processing method and device, which at least solve the technical problem of low file processing efficiency.
According to one aspect of the embodiments of the present invention, a file processing method is provided, comprising: reading content of a predetermined length from a file to be read; placing the read content in a cache region for caching, wherein the cache region comprises a plurality of cache blocks, the cache blocks are distributed across a plurality of nodes, and each node processes the content in its corresponding cache blocks; and, after the content in the cache blocks has been processed, reading content of the predetermined length from the file again and placing it in the cache blocks, until the whole file has been processed.
Further, each of the plurality of nodes corresponds to at least two cache blocks. While the content in the first of the two cache blocks is being processed, content is read from the file according to the size of the second cache block and placed in the second cache block; while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block and placed in the first cache block. The first and second cache blocks may be the same size or different sizes.
Further, each node may correspond to at least three cache blocks; while the content in one of the three cache blocks is being processed, newly read content is placed into the other two.
Further, the method further comprises: allocating the size and/or number of cache blocks in the cache region according to the resource condition of the node on which the cache blocks are located, wherein the allocation is performed periodically or when a preset condition is met.
Further, the file to be read may be split by a reading node, and the reading node distributes the split parts to cache blocks corresponding to at least one of the plurality of nodes for processing.
Further, when the size of the file to be read exceeds a threshold, the reading node splits the file and distributes the split parts according to the resource condition of each of the plurality of nodes.
Further, splitting the file to be read comprises: determining a split point; judging whether the part obtained by splitting at that point contains incomplete content at its end or beginning; and, if it does, moving the split point so that each split part contains only complete content.
Further, judging whether a split part contains incomplete content at its end or beginning comprises: determining whether the content at the split point is structured or unstructured data; for structured data, checking whether the split part ends (or begins) with a complete record; for unstructured data, checking whether the split part ends (or begins) with a complete file.
Further, distributing the split file via the reading node comprises: distributing each split part to a node group, wherein each node group comprises at least one node and places the received part, as a whole, in a cache block for processing.
Further, the cache region, or the cache block corresponding to each node, may be a circular queue accessed through a write pointer and a read pointer; an entry that has been written may not be overwritten until it has been read.
Further, the size of the circular queue is adjusted according to the resources of the corresponding node.
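The circular queue described above, accessed via a write pointer and a read pointer with overwriting prohibited until a written slot has been read, can be sketched as follows (a minimal single-threaded illustration; the class and method names are our own, not taken from the patent):

```python
from typing import Optional


class BlockRingBuffer:
    """Fixed-capacity circular queue accessed via a write pointer and a
    read pointer. A slot that has been written but not yet read must not
    be overwritten, so writes fail until the reader catches up."""

    def __init__(self, capacity: int):
        self.slots: list = [None] * capacity
        self.capacity = capacity
        self.write = 0   # next slot to write
        self.read = 0    # next slot to read
        self.count = 0   # slots currently holding unread data

    def put(self, block: bytes) -> bool:
        if self.count == self.capacity:   # full: re-writing prohibited
            return False
        self.slots[self.write] = block
        self.write = (self.write + 1) % self.capacity
        self.count += 1
        return True

    def get(self) -> Optional[bytes]:
        if self.count == 0:               # empty: nothing to process
            return None
        block = self.slots[self.read]
        self.read = (self.read + 1) % self.capacity
        self.count -= 1
        return block
```

Adjusting the queue size per node, as the claim suggests, would amount to choosing `capacity` from the node's resources; a concurrent version would additionally guard `count` with a lock or condition variable.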
According to another aspect of the embodiments of the present invention, a file processing apparatus is also provided, comprising: a first reading unit, configured to read content of a predetermined length from a file to be read; a cache unit, configured to place the read content in a cache region for caching, wherein the cache region comprises a plurality of cache blocks, the cache blocks are distributed across a plurality of nodes, and each node processes the content in its corresponding cache blocks; and a second reading unit, configured to read content of the predetermined length from the file again after the content in the plurality of cache blocks has been processed, and to place it in the plurality of cache blocks, until the whole file has been processed.
Through this embodiment, content of a predetermined length can be read from the file to be read and placed in a cache region comprising a plurality of cache blocks distributed across the nodes, each node processing the content in its own cache blocks. Once the content of the plurality of cache blocks has been processed, content of the predetermined length can be read from the file again and placed in the cache blocks, and the nodes process it in turn, until the whole file has been processed. Because the file content is read continuously into the cache region and handled by a plurality of nodes rather than a single node, the data file is processed by many nodes in parallel, which solves the technical problem of low file-processing efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flowchart of an alternative file processing method according to an embodiment of the invention;
FIG. 2 is a block diagram of an alternative file processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, method and apparatus embodiments for file processing are provided. It should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps may be performed in an order different from that described herein.
Fig. 1 is a flowchart of an alternative file processing method according to an embodiment of the present invention; as shown in Fig. 1, the method comprises the following steps:
Step S102: read content of a predetermined length from the file to be read.
Step S104: place the read content in a cache region for caching, wherein the cache region comprises a plurality of cache blocks, the cache blocks are distributed across a plurality of nodes, and each node processes the content in its corresponding cache blocks.
Step S106: after the content in the cache blocks has been processed, read content of the predetermined length from the file again and place it in the cache blocks, until the whole file has been processed.
By this embodiment, content of a predetermined length can be read from the file to be read and placed in a cache region for caching, wherein the cache region can comprise a plurality of cache blocks distributed across corresponding nodes, and the nodes can process the content in the cache blocks. After the content of the plurality of cache blocks has been processed, content of the predetermined length can be read from the file again and placed in the cache blocks, and the nodes continue processing until the whole file has been processed. By continuously reading the file content into the cache region and having a plurality of nodes process the cache blocks, the data file is handled by many nodes simultaneously rather than by a single node, which solves the technical problem of low file-processing efficiency.
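The read-cache-process loop of steps S102-S106 can be sketched as follows (a minimal illustration that models the "nodes" as worker threads; the chunk length, node count, and function names are our own assumptions, not fixed by the patent):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 64 * 1024   # predetermined read length (illustrative value)
NUM_NODES = 4           # processing "nodes", modeled here as threads


def process_block(block: bytes) -> int:
    """Stand-in for a node's per-block processing; here it just counts bytes."""
    return len(block)


def process_file(path: str) -> int:
    """Repeatedly read a predetermined-length chunk (S102), split it across
    the nodes' cache blocks (S104), and wait until all blocks are processed
    before reading the next chunk (S106), until the file is exhausted."""
    total = 0
    with open(path, "rb") as f, ThreadPoolExecutor(NUM_NODES) as pool:
        while True:
            chunk = f.read(CHUNK_LEN)            # S102: one predetermined-length read
            if not chunk:
                break                             # file fully processed
            step = max(1, len(chunk) // NUM_NODES)
            blocks = [chunk[i:i + step]           # S104: fill the cache blocks
                      for i in range(0, len(chunk), step)]
            total += sum(pool.map(process_block, blocks))  # S106: all blocks done
    return total
```

`pool.map` blocks until every cache block of the current chunk has been processed, mirroring the claim that the next read happens only "after the contents in the cache blocks are processed".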
The embodiments of the invention can be applied to devices that process data files, such as servers. The data file may be stored on a storage medium (e.g., a hard disk) and may be of various types, such as documents, pictures, or videos. A storage medium is a carrier that stores file data and may hold one or more files. Optionally, the file to be processed can be read from the storage medium and loaded into a memory buffer, so that when the read data needs to be sent to a processing node it does not have to be read from the storage medium again, which increases speed.
Optionally, the file content on the storage medium may be read and then processed, where processing can take many forms, such as sorting a document, building a data table, or extracting picture content. Different files contain different content, and the size of each item (e.g., each picture) differs.
In another optional implementation, in the technical solution of step S102, content of a predetermined length is read from the file to be read. The length of each read can be preset, for example 200M at a time, and a user may configure different lengths according to the actual situation. The content may be read after an instruction to process the file content on the storage medium is received. The content to be processed may be all of the file content stored on the medium or only part of it; this is not specifically limited here.
When the content of the file to be processed is read, the read can be performed on the node that stores the file, and after reading the content can be distributed to different processing nodes for parallel processing, which increases processing speed.
In the technical solution of step S104, the read content is placed in a cache region comprising a plurality of cache blocks distributed across a plurality of nodes, each node processing the content of its own cache blocks. A node can also control the caching speed: when receiving file data from the cache region it can limit the number and size of the files it accepts, reducing them when its processing is judged slow and increasing them when its processing is judged fast, thereby setting its cache size. On each processing node, one or more cache blocks can be set up in the cache region, the number being determined by the node's processing capability, and the cache blocks store the read content of the file to be processed. In this embodiment, the read content is processed by the processing nodes.
The reading of the file in step S102 may be performed on one node. To distinguish it from the nodes that perform the processing, this node is referred to in this embodiment as the reading node or storage node; the file being read may reside on the reading or storage node itself or in another storage service. The reading or storage node can also set up a cache region that holds the content read within a preset time and handles that content accordingly. A node may also represent a server.
Optionally, the number of nodes can be set according to the hardware configuration, and cache blocks can be allocated according to each node's file-processing speed: for example, fast node A may be given 2 cache blocks while slower node B is given 1. Alternatively, cache blocks can be allocated dynamically according to processing capability: if a node is judged to process file data quickly, more cache blocks are allocated to it; if it is judged slow, fewer are.
In another alternative embodiment, after the cache blocks are allocated, the file content in each cache block may be processed by its own node, or by a master node (e.g., the reading node or storage node) set up in the cache region.
Cache blocks can be allocated in different ways. Optionally, they can be allocated evenly, each node receiving the same number, for example 2 each; in another alternative embodiment, a node can be given the number of cache blocks it requests, for example 2 pending cache blocks when it asks for 2.
When caching files, the content can first be placed in a master cache region, then split into a plurality of cache blocks, and the same or different numbers of blocks sent to each node.
In another alternative embodiment, the read content of the predetermined length can be divided into different cache blocks before being sent to the nodes. The size of each block may be fixed, e.g. a 200M read divided into 10 blocks of 20M each, or variable, e.g. 10M for block 1 and 30M for block 2.
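The division of one read into fixed- or variable-size cache blocks can be sketched as follows (the function name and the remainder-handling policy are our own illustrative choices):

```python
def split_into_blocks(chunk: bytes, sizes: list) -> list:
    """Split one predetermined-length read into cache blocks whose sizes
    may be fixed (all equal, e.g. ten 20M blocks from a 200M read) or
    variable (e.g. 10M for block 1, 30M for block 2)."""
    blocks, offset = [], 0
    for size in sizes:
        blocks.append(chunk[offset:offset + size])
        offset += size
    if offset < len(chunk):
        # content not covered by the requested sizes goes into a final
        # block (one possible policy; the patent does not specify one)
        blocks.append(chunk[offset:])
    return blocks
```

Each resulting block would then be sent to the node that owns the corresponding cache block.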
Optionally, the cache region can send different cache blocks to the corresponding nodes according to the preset processing tasks, i.e. the size and number of blocks allocated to different nodes may differ; for example, node 1 receives 3 blocks and node 2 receives 2.
After each node receives its cache blocks of the file to be processed from the cache region, it performs the corresponding processing on them. A node may process the content of all of its cache blocks at the same time, or not; and because the nodes' processing capabilities differ, they finish at different times.
In the technical solution of step S106, after the content in the plurality of cache blocks has been processed, content of the predetermined length is read from the file again and placed in the cache blocks, until the whole file has been processed. Once a node finishes the content in its cache blocks it can report this to the cache region, which divides the remaining unprocessed content into cache blocks again and sends them back to the corresponding nodes, so that processing continues without interruption, increasing the speed and efficiency of file processing.
Optionally, each of the plurality of nodes corresponds to at least two cache blocks. While the content in the first of the two cache blocks is being processed, content is read from the file according to the size of the second cache block and placed in the second cache block; while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block and placed in the first cache block. The first and second cache blocks may be the same size or different sizes.
A plurality of cache blocks may be allocated to a node; for example, node 1 may be allocated two cache blocks A and B. While node 1 processes the content in block A, block B can be used to receive newly read content, the amount read being determined by the size of block B. Conversely, while node 1 processes block B, block A receives the next read, sized to block A. Each read must fit the block that receives it: if the content exceeded the block's size, the block could not hold it all and the content could not be placed.
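The alternation between blocks A and B amounts to double buffering. A sequential sketch is given below; in a real implementation the read would run on a separate thread so that it genuinely overlaps processing, and the names here are our own, not the patent's:

```python
import io


def double_buffered_process(f, block_size: int, process) -> list:
    """While the node processes one cache block, the next read fills the
    other block, and the two swap roles; each read is sized to the block
    that receives it. Sequential sketch: no real read/process overlap."""
    results = []
    buffers = [f.read(block_size), None]   # prefill block A
    active = 0                             # index of the block being processed
    while buffers[active]:
        # fill the other block while the active one is "being processed"
        buffers[1 - active] = f.read(block_size)
        results.append(process(buffers[active]))
        active = 1 - active                # swap the roles of A and B
    return results
```

With differently sized blocks, `block_size` would simply vary with `active`, matching the claim that the two blocks "are the same size or different sizes".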
Alternatively, a processing node may hold two cache regions: one receives the data sent by the reading node (a pending cache region), and the other holds processed data waiting to be stored in the system (a storage cache region). Data stored in the system counts as processed, i.e. successfully imported; data not processed successfully counts as a failed import, which may be caused by a device failure. The reading node may read the file data in the cache region and distribute the content to be processed across a plurality of cache blocks.
Optionally, each node corresponds to at least three cache blocks; while the content of one of the three is being processed, newly read content is placed into the other two. When cache blocks are allocated, each node can receive at least three, which can take on different roles, namely reading and processing file content. For example, node 2 may be allocated three blocks A, B and C: while it processes the content of block A, content to be processed can be placed in blocks B and C, or the pending data blocks allocated in the cache region can be placed in blocks A and B.
The embodiment of the invention can also comprise: allocating the size and/or number of cache blocks in the cache region according to the resource condition of the node on which they are located, either periodically or when a preset condition is met. Because the nodes process at different speeds, some will have finished the content in their cache blocks while others have not; the processing order of a node's cache blocks also differs, so their idle states differ. For example, if node 1 has three blocks A, B and C and has finished A and B, it has two blocks free to receive new data blocks of the file. If many blocks across the nodes are idle, allocation can be based on the idle-block information each node reports. Optionally, the size of the blocks carved out of the cache region can match the sizes of the idle blocks each node feeds back: for example, if node 1 reports that block A is 10M and block B is 20M, the cache region can allocate it one 10M block and one 20M block.
In another alternative, allocation may be periodic, i.e. performed once per preset time period, for example every 5 seconds. In another alternative embodiment, the size and number of cache blocks can be allocated according to a preset condition, of which there can be many kinds, for example determining the size and number of blocks from each node's processing speed.
To determine the size and number of cache blocks from each node's processing speed, a feedback mechanism can be provided: each node reports, at a certain interval, the number and size of the cache blocks it has processed. The interval can be preset, for example 4 seconds, i.e. every 4 seconds each node reports to the cache region the number of blocks it processed in that interval. The cache region can contain a response unit that handles this feedback; through it the cache region learns how many blocks each node has processed and how many idle blocks of what size each node has, judges each node's processing speed, and resets the number and size of the blocks sent to each node accordingly. With this feedback mechanism the cache region can adjust allocations in time, giving more blocks to fast nodes and fewer to slow ones, making both the distribution of blocks and the use of resources more reasonable.
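One simple way to realize such a feedback-driven reallocation is to assign the next round of cache blocks in proportion to each node's reported throughput. The sketch below is an illustrative policy of our own; the patent does not prescribe a formula:

```python
def reallocate_blocks(total_blocks: int, processed_counts: dict) -> dict:
    """Given each node's count of blocks processed in the last feedback
    interval, assign the next `total_blocks` proportionally: faster
    nodes receive more blocks, slower nodes fewer. Every node keeps at
    least one block, so rounding may slightly exceed total_blocks."""
    total_done = sum(processed_counts.values()) or 1  # avoid divide-by-zero
    return {node: max(1, round(total_blocks * done / total_done))
            for node, done in processed_counts.items()}
```

For example, a node that processed twice as many blocks as its peer in the last interval would be assigned roughly twice as many blocks in the next one.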
Optionally, the file to be read can be split by a reading node, which distributes the split parts to the cache blocks of at least one of the plurality of nodes for processing. After the file is read it can be placed in the cache region and then split into different cache blocks by the reading node. The reading node may be a node set up in the cache region; it controls the number and content of the cache blocks processed by the other nodes, splits the file into a plurality of blocks, and sends each block to its corresponding node. The reading node handles the file as a whole while each other node handles the content of the blocks assigned to it by the reading node, in a master-and-subordinate arrangement in which one master reading node directs each subordinate node to process its cache blocks.
In another optional implementation, when the size of the file to be read exceeds a threshold, the reading node splits the file and distributes the parts according to the resource condition of each of the plurality of nodes. The threshold may be a preset file size, for example 200M per read; the reading node judges the size of the file and, if it exceeds the threshold, splits it. The split can place the content into different cache blocks, with the split determined by the size and number of idle cache blocks on each node; after splitting, one or more blocks are distributed to the corresponding nodes.
The method for splitting the file to be read comprises: determining split points; judging whether the part obtained by splitting at a split point contains incomplete content at its end or beginning; and, if it does, moving the split point so that each split part contains only complete content.
Optionally, several split points may be determined according to the differing resources of the nodes. The position of a split point can be determined first, based on the content of the file to be processed: because file content differs, the chosen positions differ too. For example, if the file contains text and pictures, a split point may be set at the beginning or end of a given text or a given picture. If there are several texts or pictures, they can be kept together and the split points set at their start and end, so that each node processes a complete file, which improves processing efficiency.
In another optional implementation, judging whether a part split at a given point includes incomplete content at its end or beginning proceeds as follows: judge whether the content at the splitting point is structured or unstructured data; for structured data, judge whether the part ends or begins with a complete record; for unstructured data, judge whether the part ends or begins with a complete file. If a part of unstructured data is judged complete at its end or beginning, the splitting point need not be moved; if it is judged incomplete, the splitting point may be moved to the end or beginning of that unstructured data.
In this embodiment, structured data may be designated in advance — for example, data stored in a database, where the file is already divided into corresponding structures and can be split at any point. Unstructured data is data that must be acquired whole, such as pictures and videos; it does not split well, so when the reading node performs the split, the same piece of unstructured data should be kept together as far as possible, which improves the speed at which nodes process the file.
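A minimal sketch of the structured/unstructured completeness judgment might look like this. The fixed record length for structured data and the 0x00 terminator for unstructured data are illustrative assumptions, not part of the patent.

```python
def is_complete_at(data: bytes, point: int, record_len=None):
    """Judge whether the part ending at `point` is complete.

    Structured data (fixed `record_len`): complete iff `point` falls
    on a record boundary, i.e. is a multiple of the record length.
    Unstructured data: assumed here to end with a terminator byte
    (0x00), an illustrative stand-in for "a complete file".
    """
    if record_len is not None:                  # structured data
        return point % record_len == 0
    return point == 0 or data[point - 1] == 0   # unstructured: ends on terminator
```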
Optionally, distributing the split file through the reading node to a cache block corresponding to at least one of the plurality of nodes for processing includes: distributing each part of the split file to a node group, where each node group includes at least one node and the group as a whole places the received part in a cache block for processing. In this embodiment, after the reading node splits the file to be processed, it may distribute the parts to node groups, each of which may contain one or more nodes.
In another alternative embodiment, the cache region, or the cache block corresponding to each node, may be organized as a circular queue accessed through a write pointer and a read pointer; a slot that has been written may not be written again until it has been read. The cache blocks holding the file to be processed may thus form a ring: a node writes split file content into the corresponding cache block through the write pointer and processes the content in the block through the read pointer. Once content has been written to a cache block, writing to that block again is prohibited until the content has been read, so writes cannot interfere with the processing of the content in the block.
In an alternative embodiment, the entire buffer may form one large circular queue accessed via a write pointer and a read pointer. The write pointer may not pass the current position of the read pointer, because the content beyond the read pointer has not yet been read and must not be overwritten; likewise, the read pointer may not pass the write pointer, because the address space beyond the write pointer holds no new data and is therefore invalid and must not be read. After the file to be processed is redistributed into cache blocks, the ring queue is reallocated, and the size and number of cache blocks in it may change accordingly. Optionally, when the ring queue of the buffer is allocated, each node may be given a corresponding cache block. The mapping may be one-to-one — each node owns one cache block — or one node may control several adjacent cache blocks, so that a region of the ring queue is controlled by a group of nodes. The number of nodes may stay fixed: when cache blocks are reallocated, blocks may be added to the ring queue while the nodes covering a region remain the same, meaning the number of cache blocks a node controls can change in real time.
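The write/read-pointer discipline described above can be sketched as a minimal ring queue. This is an illustration under the stated rules (write may not pass read, read may not pass write), not the patented implementation; the class and method names are assumptions.

```python
class RingQueue:
    """Ring of cache blocks accessed through a write pointer and a
    read pointer.  A block that has been written but not yet read
    cannot be overwritten, and the reader never passes the writer."""

    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks
        self.write = 0   # next block to write
        self.read = 0    # next block to read
        self.count = 0   # blocks written but not yet read

    def put(self, content):
        if self.count == len(self.blocks):
            return False                 # full: write may not pass read
        self.blocks[self.write] = content
        self.write = (self.write + 1) % len(self.blocks)
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None                  # empty: data past write ptr is invalid
        content = self.blocks[self.read]
        self.read = (self.read + 1) % len(self.blocks)
        self.count -= 1
        return content
```

Reading a block frees its slot, after which the writer may reuse it — mirroring the rule that a block written and then read may be written again.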
In another alternative embodiment, the reading node may number each cache block, record the number together with the block's start and/or end position, and record whether the file content was imported successfully. The import operation may include splitting the file content in the cache region, processing the content in a cache block at a node, and storing the processed file; the related cache records are not deleted until the whole file has been imported successfully. A breakpoint mechanism may be provided so that, when a device fault interrupts transmission, it is known which file data was imported successfully and which was not; after the fault is repaired, only the unimported data needs to be cached again and sent to a processing node, while data already imported need not be resent.
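The numbering-and-breakpoint record could be sketched as a small ledger. All names are illustrative assumptions; the patent does not prescribe this data structure.

```python
class ImportLedger:
    """Breakpoint record: number each cache block, note its byte
    range, and mark whether its import succeeded, so that after a
    fault only the failed blocks are re-cached and resent."""

    def __init__(self):
        self.records = {}  # block number -> {"range": (start, end), "ok": bool}

    def register(self, number, start, end):
        self.records[number] = {"range": (start, end), "ok": False}

    def mark_imported(self, number):
        self.records[number]["ok"] = True

    def pending(self):
        """Blocks that must be re-cached and resent after recovery."""
        return [n for n, r in self.records.items() if not r["ok"]]

    def all_done(self):
        """Only once everything imported may cache records be deleted."""
        return all(r["ok"] for r in self.records.values())
```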
Optionally, the size of the ring queue may be adjusted according to the resources corresponding to the nodes: after a node finishes processing the content in its cache block, the reading node may determine the size of the next split cache blocks according to the file content.
The queue length may be chosen by first judging whether the data is fixed-length structured data. If it is, the queue length may be set to an integer multiple of the length of a single record; if the length is not fixed, the queue length may be taken from a system setting, which the user can configure according to the actual situation and which is not limited here.
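The fixed-length rule — queue length as an integer multiple of one record — can be sketched as below. The function name and the fallback default are assumptions standing in for "the length set by the system".

```python
def queue_length(requested, record_len=None, default=4096):
    """Fixed-length structured data: round the requested queue length
    down to an integer multiple of one record, so that no record ever
    straddles the queue boundary.  Variable-length data: fall back to
    the system-configured default."""
    if record_len:
        # never drop below one record; otherwise use the largest multiple
        return max(record_len, (requested // record_len) * record_len)
    return default
```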
Preferably, when the content of the file to be processed is processed, each node may actively apply for processing tasks; data blocks are then sent to each node according to the number it applied for, achieving a reasonable allocation of resources.
In another alternative embodiment, processing the file content (i.e., the file data) of a cache block may fail in several ways. For example, while the file content is being read, a processing node may never receive a piece of data at all — the cache block receives no file data to process, so that data goes unprocessed — or a processing node may fail, which likewise causes a data processing failure. To avoid data loss in transmission, the processing node and the reading node may establish a handshake mechanism: the processing node, when sending data, attaches an inquiry instruction asking whether the reading node has successfully received the file data to be processed, and the reading node, after receiving the data, returns a feedback instruction to confirm successful reception. The processing node then checks whether the feedback instruction arrives within a set time interval: if no feedback arrives within the interval, it judges that the data processing has failed and reprocesses the file once the fault is cleared; if feedback arrives within the interval, it judges that the processing succeeded and no fault occurred.
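The handshake with a feedback timeout might be sketched generically as follows. The roles are abstracted into a sender and an acknowledgement check, and the timeout and retry counts are illustrative assumptions.

```python
import time

def send_with_handshake(send, recv_ack, timeout=1.0, retries=3):
    """Send a data block together with an inquiry, wait up to
    `timeout` seconds for the feedback instruction, and treat a
    missing acknowledgement as a processing failure to be retried.

    `send` transmits the block; `recv_ack` polls for feedback and
    returns True once the feedback instruction has arrived."""
    for _attempt in range(retries):
        send()
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if recv_ack():
                return True        # feedback within the interval: success
            time.sleep(0.01)
    return False                   # no feedback within the interval: failure
```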
The data processing failure may be a node failure, which includes a reading-node failure and a processing-node failure: a reading-node failure means that file data cannot be read when data sent from the cache region is read; a processing-node failure means that the file data read by the reading node cannot be processed, resulting in a data processing failure.
FIG. 2 is a block diagram of an alternative document processing device according to an embodiment of the present invention, including: a first reading unit 21 for reading a content of a predetermined length from a file to be read; the cache unit 23 is configured to place the read content in a cache region for caching, where the cache region includes multiple cache blocks, the multiple cache blocks are distributed over multiple nodes, and a node is used to process the content in the cache block corresponding to the node; a second reading unit 25 for reading again the contents of the predetermined length from the file to be read after the contents in the plurality of cache blocks are processed, and placing the contents in the plurality of cache blocks until the file to be read is processed.
Through the above embodiment, the first reading unit 21 first reads content of a predetermined length from the file to be read, and the cache unit 23 places the read content in a cache region for caching, where the cache region may include a plurality of cache blocks distributed over the corresponding nodes, each node processing the content in its cache blocks. After the contents of the plurality of cache blocks have been processed, the second reading unit 25 again reads content of the predetermined length from the file and places it in the cache blocks, and the nodes process the content again, repeating until the processing of the file to be read is completed. By continuously reading the content of the file, placing it in the cache blocks of the cache region, and having the plurality of nodes process those blocks, the read content is processed in parallel across multiple cache blocks rather than sequentially, which solves the technical problem of low file processing efficiency.
Optionally, the apparatus further comprises: and the allocation unit is used for allocating the size of the cache blocks in the cache region and/or the number of the cache blocks according to the resource condition of the node where the cache blocks are located, wherein the allocation is periodical allocation or allocation meeting a preset condition.
In another alternative embodiment, the apparatus further comprises: the first splitting module is used for splitting the file to be read through a reading node; and the first distribution module is used for distributing the split file to a cache block corresponding to at least one node in the plurality of nodes through the reading node for processing.
Further, the apparatus further comprises: the second splitting module is used for splitting the file to be read through a reading node under the condition that the size of the file to be read exceeds a threshold value; and the second distribution module is used for distributing the split file according to the resource condition of each node in the plurality of nodes through the reading node.
Optionally, the first splitting module includes: the determining module is used for determining splitting points for splitting the file to be read; the judging module is used for judging whether the part obtained by splitting according to the splitting point comprises incomplete content at the ending position or the starting position; and the moving module is used for moving the splitting point under the condition that the incomplete content is included, so that the content included in the split part is complete.
In another optional implementation manner, the determining module includes: the first judgment submodule is used for judging whether the content at the splitting point is structured data or unstructured data; the second judgment submodule is used for judging whether the part obtained by splitting according to the splitting point is a complete record at the ending position or the starting position if the part is structured data; and the third judgment submodule is used for judging whether the part split according to the split point is a complete file at the ending position or the starting position if the part is the unstructured data.
Optionally, the first distribution module includes: and the first distribution submodule is used for respectively distributing each part of the split file to a node group through the reading node, wherein each node group comprises at least one node, and the node group as a whole places the received part in a cache block for processing.
In another alternative embodiment, the buffer block corresponding to the buffer area or each node is a circular queue, the circular queue is accessed by a write pointer and a read pointer, and the buffer in the circular queue is prohibited from being written again before being read after being written.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A file processing method, comprising:
reading the content with a preset length from a file to be read;
the read content is placed in a cache region for caching, wherein the cache region comprises a plurality of cache blocks, the cache blocks are distributed on a plurality of nodes, and the nodes are used for processing the content in the cache blocks corresponding to the nodes;
after the contents in the cache blocks are processed, reading the contents with the preset length from the file to be read again and placing the contents in the cache blocks until the file to be read is processed;
wherein the method further comprises:
splitting a file to be read through a reading node;
distributing the split file to a cache block corresponding to at least one node in the plurality of nodes through the reading node for processing;
splitting the file to be read comprises the following steps:
determining a splitting point for splitting the file to be read;
judging whether the part split according to the splitting point comprises incomplete content at the end or the beginning;
in the case of including incomplete content, the splitting point is moved so that the split portion includes complete content.
2. The method of claim 1, wherein each node in the plurality of nodes corresponds to at least two cache blocks, wherein,
reading contents from the file according to the size of a second cache block in the at least two cache blocks in the process of processing the contents in a first cache block in the two cache blocks, and placing the read contents in the second cache block; and in the process of processing the content in the second cache block, reading the content from the file according to the size of a first cache block in the at least two cache blocks, and placing the read content in the first cache block, wherein the sizes of the first cache block and the second cache block are the same or different.
3. The method of claim 2, wherein each node corresponds to at least three cache blocks, and wherein the contents in one of the three cache blocks are processed while the read contents are placed in the other two of the three cache blocks.
4. The method of claim 1, further comprising:
and allocating the size and/or the number of the cache blocks in the cache region according to the resource condition of the node where the cache blocks are located, wherein the allocation is periodical allocation, or the allocation is allocation meeting a preset condition.
5. The method of claim 1,
under the condition that the size of the file to be read exceeds a threshold value, splitting the file to be read through a reading node;
and distributing the split file according to the resource condition of each node in the plurality of nodes through the reading node.
6. The method of claim 1, wherein determining whether the split portion includes incomplete content at an end or a beginning of the split portion according to the split point comprises:
judging whether the content at the splitting point is structured data or unstructured data;
if the data is structured data, judging whether the part obtained by splitting according to the splitting point is a complete record at the ending position or the starting position;
and if the data is unstructured data, judging whether the part split according to the splitting point is a complete file at the ending position or the starting position.
7. The method of claim 1, wherein distributing, by the reading node, the split file to a cache block corresponding to at least one of the plurality of nodes for processing comprises:
and respectively distributing each part of the split file to a node group through the reading node, wherein each node group comprises at least one node, and the node group integrally places the received part in a cache block for processing.
8. A method according to claim 1, wherein the buffer block corresponding to the buffer or each node is a circular queue, the circular queue is accessed by a write pointer and a read pointer, and a buffer in the circular queue, once written, is prohibited from being written again before it has been read.
9. The method of claim 8, wherein the size of the circular queue is adjusted according to the resource corresponding to the node.
10. A document processing apparatus, characterized by comprising:
the first reading unit is used for reading the content with the preset length from the file to be read;
the cache unit is used for placing the read contents in a cache region for caching, wherein the cache region comprises a plurality of cache blocks, the cache blocks are distributed on a plurality of nodes, and the nodes are used for processing the contents in the cache blocks corresponding to the nodes;
a second reading unit, configured to, after the contents in the cache blocks are processed, read again the contents with the predetermined length from the file to be read and place the contents in the cache blocks until the file to be read is processed;
wherein the apparatus further comprises:
the first splitting module is used for splitting the file to be read through a reading node;
the first distribution module is used for distributing the split file to a cache block corresponding to at least one node in the plurality of nodes through the reading node for processing;
wherein the first splitting module comprises:
the determining module is used for determining a splitting point for splitting the file to be read;
the judging module is used for judging whether the part obtained by splitting according to the splitting point comprises incomplete content at the end or the beginning;
and the moving module is used for moving the splitting point under the condition that incomplete content is included, so that the content included in the split part is complete.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611041059.0A CN108090087B (en) | 2016-11-23 | 2016-11-23 | File processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108090087A CN108090087A (en) | 2018-05-29 |
CN108090087B true CN108090087B (en) | 2020-08-21 |
Family
ID=62170923
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611041059.0A Active CN108090087B (en) | 2016-11-23 | 2016-11-23 | File processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090087B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109191233B (en) * | 2018-07-31 | 2022-09-02 | 上海幻电信息科技有限公司 | Second order-killing request processing method and device and storage medium |
CN109829138A (en) * | 2018-12-15 | 2019-05-31 | 中国平安人寿保险股份有限公司 | File comparison method, device, electronic equipment and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102611680A (en) * | 2011-10-26 | 2012-07-25 | 苏州闻道网络科技有限公司 | Multi-core synchronous audio converting method based on multi-core nodes in local area network (LAN) |
CN102799485A (en) * | 2012-07-12 | 2012-11-28 | 北京恒华伟业科技股份有限公司 | Historical data migration method and device |
CN103019963A (en) * | 2012-12-31 | 2013-04-03 | 华为技术有限公司 | Cache mapping method and storage device |
CN103853826A (en) * | 2014-03-05 | 2014-06-11 | 浪潮通信信息系统有限公司 | Distributed type performance data processing method |
CN105933412A (en) * | 2016-04-20 | 2016-09-07 | 北京云宏信达信息科技有限公司 | File receiving method and device and file transmission system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10242014B2 (en) * | 2015-02-04 | 2019-03-26 | International Business Machines Corporation | Filesystem with isolated independent filesets |
- 2016-11-23: CN 201611041059.0A filed; patent CN108090087B granted, status Active
Also Published As
Publication number | Publication date |
---|---|
CN108090087A (en) | 2018-05-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107911249B (en) | Method, device and equipment for sending command line of network equipment | |
CN101815033B (en) | Method, device and system for load balancing | |
CN104954468A (en) | Resource allocation method and resource allocation device | |
CN103593147A (en) | Data reading method and device | |
CN107436725A (en) | A kind of data are write, read method, apparatus and distributed objects storage cluster | |
CN106817388B (en) | Method and device for acquiring data by virtual machine and host machine and system for accessing data | |
CN105843819B (en) | Data export method and device | |
CN110581784B (en) | Node health check method, device and equipment | |
CN103324533A (en) | distributed data processing method, device and system | |
CN104750690A (en) | Query processing method, device and system | |
CN102790784A (en) | Distributed cache method and system and cache analyzing method and analyzing system | |
CN107920101B (en) | File access method, device and system and electronic equipment | |
CN108090087B (en) | File processing method and device | |
CN108304272B (en) | Data IO request processing method and device | |
CN106657182B (en) | Cloud file processing method and device | |
CN103336670A (en) | Method and device for automatic distribution of data blocks based on data temperature | |
CN104978278B (en) | Data processing method and device | |
CN107181773A (en) | Data storage and data managing method, the equipment of distributed memory system | |
CN105471955A (en) | Writing method of distributed file system, client device and distributed file system | |
CN105264608B (en) | Method, Memory Controller Hub and the central processing unit of data storage | |
CN104780120A (en) | Method and device for transmitting files in local area network | |
KR20170081977A (en) | Distributed file system and method for creating files effectively | |
CN103002038A (en) | Processing method and device for elastic load balancing | |
CN105144073A (en) | Removable storage device identity and configuration information | |
CN108173892B (en) | Cloud mirror image operation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CB02 | Change of applicant information | ||
Address after: 402, room 8, 200336 Tian Shan Road, Shanghai, Changning District Applicant after: Boyan Hongzhi Technology (Shanghai) Co., Ltd Address before: 402, room 8, 200336 Tian Shan Road, Shanghai, Changning District Applicant before: SHANGHAI HONGZHI INFORMATION TECHNOLOGY Co.,Ltd. |