WO2018054200A1 - Method and device for reading file - Google Patents

Method and device for reading file

Info

Publication number
WO2018054200A1
WO2018054200A1 (PCT/CN2017/099554, CN2017099554W)
Authority
WO
WIPO (PCT)
Prior art keywords
file
data
content
reading
processing
Prior art date
Application number
PCT/CN2017/099554
Other languages
French (fr)
Chinese (zh)
Inventor
米维聪
徐超
罗海英
Original Assignee
上海泓智信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海泓智信息科技有限公司 filed Critical 上海泓智信息科技有限公司
Publication of WO2018054200A1 publication Critical patent/WO2018054200A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/172 Caching, prefetching or hoarding of files

Definitions

  • the present invention relates to the field of big data, and in particular to a file reading method and apparatus.
  • the embodiments of the invention provide a file reading method and device, so as to at least solve the technical problems caused by relatively large files.
  • FIG. 1 is a flow chart of a file reading method according to an embodiment of the present invention.
  • FIG. 2 is a flow chart of an optional file reading method according to an embodiment of the present invention.
  • FIG. 3 is a flow chart of an alternative method of reading predetermined length data according to an embodiment of the present invention.
  • FIG. 4 is a flow chart of an alternative method of reading predetermined length data, in accordance with an embodiment of the present invention.
  • FIG. 1 is a flow chart of a file reading method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102, the reading step: read a predetermined length of data from the file, in a stream manner, according to the size of the buffer area.
  • Step S104, the caching step: place the read data in the buffer area for caching.
  • Step S106, the pre-processing step: pre-process the buffered data according to the pre-configured pre-processing requirements to obtain the content of the data.
  • Step S108, the importing step: save the content of the data to the data platform.
  • Step S110, the loop step: cyclically perform the reading step, the caching step, the pre-processing step, and the importing step in sequence until the reading of the file is complete.
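The five steps above can be sketched in Python (illustrative only, not part of the specification; `preprocess` and `save_to_platform` are assumed callbacks standing in for the pre-configured pre-processing requirements and the data platform):

```python
BUFFER_SIZE = 1024 * 1024  # size of the buffer area (1 MB, illustrative)

def read_file(path, preprocess, save_to_platform):
    """Loop: read -> cache -> pre-process -> import, until the file is consumed."""
    with open(path, "rb") as stream:              # S102: read the file as a stream
        while True:
            buffer = stream.read(BUFFER_SIZE)     # S104: data placed in the buffer area
            if not buffer:
                break                             # file fully read: loop ends
            content = preprocess(buffer)          # S106: pre-process to obtain the content
            save_to_platform(content)             # S108: import into the data platform
            # S110: loop back to the reading step
```

Because only one buffer-sized chunk is resident at a time, this pattern reads a file far larger than the process address space.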
  • In this way, fixed-length data is read from the file as a byte stream and placed in the buffer area for caching; the cached data is then read from the buffer area byte by byte.
  • The data content of the file is parsed according to the length information of the file, the data type of the file content, special characters, endianness, codec, and so on; this completes the pre-processing of the data.
  • The parsed data is saved to the data platform, where data processing, data storage, query and retrieval, and analysis, mining, and display operations can be performed.
  • The reading step, caching step, pre-processing step, and importing step described above are performed cyclically until the reading of the large file is complete.
  • FIG. 2 shows the flow of the optional implementation. As shown in FIG. 2, the foregoing method may further include the following steps:
  • Step S202: the file is split into multiple parts.
  • The splitting can be performed according to the processing capabilities of the different distributed services; that is, the file can be split into multiple parts in proportion to the processing capabilities of the multiple distributed services, and each part is assigned to the corresponding distributed service for processing.
  • For example, if the processing power of the first distributed service is twice that of the second distributed service, the part of the file assigned to the first distributed service can be twice the size of the part assigned to the second distributed service.
  • With this splitting method the parts differ in size, each corresponding to the processing power of its distributed service.
  • Alternatively, the file can be split into equal-sized parts, and a corresponding number of parts can then be allocated to each service according to its processing power.
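The capacity-proportional split can be sketched as follows (an illustrative interpretation; the relative weights standing in for "processing power" are assumptions):

```python
def split_by_capacity(total_size, capacities):
    """Split a file of total_size bytes into byte ranges proportional to each
    distributed service's processing power (capacities are relative weights)."""
    total_weight = sum(capacities)
    offsets, start = [], 0
    for i, w in enumerate(capacities):
        # the last part absorbs rounding so the parts cover the whole file
        size = total_size - start if i == len(capacities) - 1 \
               else total_size * w // total_weight
        offsets.append((start, start + size))
        start += size
    return offsets
```

With weights `[2, 1]` (the first service twice as powerful as the second), a 300-byte file is split into a 200-byte part and a 100-byte part, matching the example above.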
  • Each distributed service can correspond to a separate cache area, and the cache areas corresponding to the multiple distributed services are determined according to the resources of the servers where the distributed services are located.
  • Each of the plurality of nodes corresponds to at least two cache blocks. While the content in the first of the two cache blocks is being processed, content is read from the file according to the size of the second cache block and placed in the second cache block; while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block and placed in the first cache block. The first cache block and the second cache block may be the same size or different sizes.
  • That is, a node may be allocated multiple cache blocks. For example, two cache blocks A and B are allocated to node 1: while node 1 processes the content in cache block A, cache block B can be used to receive newly read content,
  • and the amount of file content read may be determined according to the size of cache block B.
  • Conversely, while the content in cache block B is being processed, cache block A can be used to receive newly read content,
  • and the amount of file content read may be determined according to the size of cache block A. The read content must fit within cache block A: if the size of the read file content exceeded cache block A, the block could not receive it and the content could not be placed there.
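The alternating use of the two cache blocks can be sketched sequentially (a real implementation would overlap reading and processing, for example on separate threads; this sketch only shows the ping-pong role swap between blocks A and B):

```python
def double_buffered_read(stream, process, block_size=4096):
    """Ping-pong between two cache blocks: fill one block while the other's
    contents are 'in use', then swap their roles (sequential sketch)."""
    blocks = [None, None]                        # cache block A and cache block B
    active = 0
    blocks[active] = stream.read(block_size)     # prime cache block A
    while blocks[active]:
        spare = 1 - active
        blocks[spare] = stream.read(block_size)  # fill B while A is being processed
        process(blocks[active])                  # process the active block's content
        active = spare                           # swap: B becomes the active block
```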
  • The size of the buffer area to be saved is recorded, and the size of the cached file content is allocated appropriately according to it.
  • The buffer area to be saved may be any free cache block in the node.
  • A processing node may include two buffer areas: one for receiving data sent by the read node (that is, the buffer area to be processed), and one in which the node stores data after processing the content of the cache blocks
  • (that is, the buffer area to be saved). Once the data has been stored in the system, its processing is complete, that is, the import has succeeded; if the data is not processed successfully, the import fails. An import failure can be caused by a device failure.
  • The read node reads the file data into the buffer area to be processed, and the file content in that buffer area may be allocated across multiple cache blocks.
  • The embodiment of the present invention may further provide that the size and/or number of cache blocks in the buffer area are allocated according to the resource conditions of the nodes where the cache blocks are located, where the allocation is periodic or is performed when a predetermined condition is met. Since the processing speed of each of the plurality of nodes differs, some nodes may have finished processing the content in their cache blocks while other nodes have not.
  • The size of a cache block in the buffer area may be determined according to the size of the free cache blocks in a node; that is, when cache blocks are allocated, the allocation follows the size and number of free cache blocks fed back by each node.
  • For example, when node 1 feeds back that cache block A is 10 MB and cache block B is 20 MB, the buffer area allocates a 10 MB cache block and a 20 MB cache block to node 1.
  • The allocation of cache blocks may be periodic; for example, cache blocks may be allocated once per preset time period, such as every 5 seconds.
  • Alternatively, the size and number of cache blocks allocated in the buffer area may follow a preset condition, which can take multiple forms; for example, the size and number of cache blocks may be determined according to the processing speed of each node.
  • When the size and number of cache blocks are determined according to the processing speed of each node, a feedback mechanism may be set up:
  • each node may feed back to the buffer area, at a certain time interval, the number and size of the cache blocks it has processed.
  • The time interval may be preset, for example 4 seconds; that is, every 4 seconds each node feeds back to the buffer area the number of cache blocks it has processed during that period.
  • A response unit may be disposed in the buffer area to process the data fed back by each node. From this feedback the response unit can learn the number of cache blocks processed by each node and the number and size of each node's free cache blocks, and determine the speed at which each node processes cache blocks. Based on the processing speed and the idle state of the cache blocks in each node as determined by the response unit, the buffer area can reset the number and size of the cache blocks it sends to each node.
  • In this way the buffer area can adjust, in a timely manner, the size and number of cache blocks allocated to each node: more cache blocks are reallocated to nodes with faster processing speeds and fewer to nodes with slower processing speeds, so that the number of cache blocks sent to each node is more reasonable and resource utilization is improved.
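One possible reading of this feedback-driven reallocation (the proportional rule is an assumption; the specification only requires that faster nodes receive more blocks and slower nodes fewer):

```python
def reallocate_blocks(total_blocks, processed_counts):
    """Re-divide a pool of cache blocks among nodes in proportion to how many
    blocks each node reported processing in the last feedback interval."""
    total = sum(processed_counts) or len(processed_counts)  # avoid division by zero
    shares = [total_blocks * c // total for c in processed_counts]
    # hand any rounding remainder to the fastest node
    shares[processed_counts.index(max(processed_counts))] += total_blocks - sum(shares)
    return shares
```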
  • The file to be read may be split by the read node, and the parts obtained by the split are distributed by the read node to the cache blocks corresponding to at least one of the plurality of nodes for processing.
  • The file to be processed can be placed in the buffer area, and the read node can then split it across different cache blocks.
  • The read node may be a node disposed in the buffer area that controls the number and content of the cache blocks processed by the other nodes; it may split the file to be processed into multiple cache blocks and send each cache block to the corresponding node.
  • The above-mentioned read node may thus be responsible for the total file content to be processed, while each of the other nodes processes the cache blocks allocated to it by the read node; the implementation follows a master/sub manner, in which the read node controls each of the sub-nodes as they process the contents of their corresponding cache blocks.
  • The file to be read is split by the read node, and the read node distributes the parts according to the resource situation of each of the plurality of nodes.
  • A threshold may be a preset size for the file to be processed; for example, 200 MB of pending file content may be set to be read at a time. The read node determines the size of the file to be read, and if the size exceeds the threshold, the read node splits the file.
  • The content of the file to be processed may be split across different cache blocks, where the split may be determined according to the size and number of free cache blocks of each of the plurality of nodes; after splitting, the resulting cache blocks are distributed to the corresponding nodes.
  • The splitting of the file to be read includes: determining a split point for splitting the file; determining whether a part obtained by splitting at the split point contains incomplete content at its end or beginning; and, in the case of incomplete content, moving the split point so that the part obtained by the split is complete.
  • Determining whether a part obtained by splitting at the split point contains incomplete content at its end or beginning includes: determining whether the content at the split point is structured data or unstructured data. If it is structured data, it is judged whether the part ends or begins with a complete record; if it is unstructured data, it is judged whether the part ends or begins with a complete file.
  • If the part is complete, the split point need not be moved; if, for unstructured data, the part obtained by the split is not a complete file at its end or beginning, the split point can be moved to the end or the beginning of the unstructured data.
  • Whether data is structured or unstructured may be determined according to the content of the file to be processed.
  • Structured data may be defined in advance, for example data stored in a database;
  • a file in the database can be pre-divided into corresponding structures and can be split at any point. Unstructured data must be obtained as a complete unit, for example pictures and videos, and does not split well. When the read node splits the file to be processed, the same piece of unstructured data should be kept together as far as possible, which improves the node's processing speed.
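For structured, record-delimited data, moving a tentative split point to the nearest record boundary can be sketched as follows (the newline delimiter is an assumption for illustration; any record separator would work the same way):

```python
def adjust_split_point(data, split_point, delimiter=b"\n"):
    """Move a tentative split point forward to the end of the current record,
    so that neither part ends or begins with an incomplete record."""
    end = data.find(delimiter, split_point)
    if end == -1:
        return len(data)           # no later boundary: keep the remainder together
    return end + len(delimiter)    # split just after the record delimiter
```

For unstructured data (a picture, a video) there is no interior boundary, so the whole unit is kept together, as the text above requires.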
  • The parts of the split file are distributed by the read node to at least one of the multiple nodes.
  • Processing via cache blocks includes: distributing each part of the split file, via the read node, to a node group, where each node group includes at least one node and the node group as a whole places the received part into cache blocks for processing.
  • When the split file is distributed to node groups, each node group may include several nodes.
  • The cache blocks, or the cache block corresponding to each node, may form a ring queue that is accessed via a write pointer and a read pointer. After a slot in the ring queue is written, writing to it again is forbidden until it has been read.
  • As a ring queue, the file to be processed may be allocated into a ring of cache blocks; that is, the cache blocks form a ring queue, and while a node processes the file content stored in a cache block,
  • the split file content is written into the corresponding cache block via the write pointer, and the file content in the cache block is consumed via the read pointer.
  • Once a node has written file content into a cache block, writing to that block again is forbidden until the content has been read, so that the node can process the file content in the cache block without interference.
  • Multiple nodes may be set up at allocation time, each with corresponding cache blocks. The number of nodes may be set to correspond one-to-one with the number of cache blocks, that is, each node has one cache block; alternatively, one node may control several consecutive cache blocks in the ring queue, with multiple nodes each controlling a region of the ring.
  • The number of nodes can be fixed; after multiple cache blocks are created they can be added to the ring queue, while the region assigned to each node can change, that is, the number of cache blocks controlled by a node can change in real time.
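A minimal ring queue of cache blocks with the write-pointer/read-pointer discipline described above (slot granularity and Python types are illustrative, not part of the specification):

```python
class RingQueue:
    """Ring queue of cache blocks: a slot written via the write pointer may
    not be overwritten until the read pointer has consumed it."""
    def __init__(self, slots):
        self.buf = [None] * slots
        self.write = 0   # write pointer
        self.read = 0    # read pointer
        self.count = 0   # number of filled slots

    def put(self, block):
        if self.count == len(self.buf):
            return False                          # full: writing again is forbidden
        self.buf[self.write] = block
        self.write = (self.write + 1) % len(self.buf)
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None                           # empty: nothing to read
        block = self.buf[self.read]
        self.buf[self.read] = None                # slot freed for the next write
        self.read = (self.read + 1) % len(self.buf)
        self.count -= 1
        return block
```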
  • The read node may number each cache block, record the number and the start and/or end position of the block, and record whether its file content was imported successfully.
  • The import operation may include the operation of splitting the file content into cache blocks, the operation of the nodes processing the cache block contents, and the operation of storing the processed content; once the entire file has been imported successfully, the related cache records can be deleted. A breakpoint mechanism can be set up so that, when a device fails while sending data, it is known which file data was imported successfully and which failed. After the device fault is repaired, the data whose import failed can be re-cached and sent to the processing nodes, while data that was imported successfully need not be imported again.
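The breakpoint/re-import bookkeeping can be sketched as follows (the checkpoint structure and part numbering are assumptions; the specification only requires recording block numbers, start/end positions, and import success):

```python
def import_with_checkpoint(parts, do_import, checkpoint):
    """Record per-part import status (number, start/end offsets, success flag)
    so that after a device failure only the failed parts are re-imported."""
    for number, (start, end, payload) in enumerate(parts):
        if checkpoint.get(number) == "ok":
            continue                  # already imported successfully: do not repeat
        try:
            do_import(payload)
            checkpoint[number] = "ok"
        except Exception:
            # remember the failed range so it can be re-cached after repair
            checkpoint[number] = ("failed", start, end)
```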
  • The size of the ring queue can be adjusted according to the resources corresponding to the nodes; after a node finishes processing the contents of a cache block, the read node can determine the size of the next split cache block according to the file content.
  • each distributed service has a separate cache area, and the cache area is configured by the cache manager of the server where the distributed service is located.
  • the configuration of the resource may be static configuration or dynamic configuration. Dynamic configuration can be configured based on the current load of the distributed service and the remaining processing power.
  • Since each distributed service has a separate cache area determined by the resources of the server where it is located, and each distributed service has a different processing capability, the size of the buffer corresponding to each distributed service also differs. The file is split into multiple parts that are not necessarily equal in size, and each distributed service processes its split part according to its own processing capability.
  • Step S402: configure the size of the buffer area.
  • Step S404: configure a backup buffer area for the buffer area, where the size of the backup buffer area is the same as the size of the buffer area and the backup buffer area serves as a backup of the buffer area.
  • Step S302: obtain the size of the file.
  • Step S304: if the size of the file exceeds the threshold, read data of the predetermined length from the file, as a file stream, according to the size of the buffer area.
  • For example, suppose the size of the file is 100 MB
  • and the threshold for the file size that can be processed at once is 10 MB. Since the file far exceeds the size that can be processed, it is read as a file stream. Assuming the buffer area is 1 MB, 1 MB of the original file's content is read from the stream each time.
  • Different files have different encoding modes;
  • if a file is decoded with the wrong encoding, garbled characters will appear, and handling the encoding correctly solves the problem of garbled Chinese characters.
  • When the data in the buffer area is parsed, it must be parsed according to the encoding of the original file; that is, the metadata information of the original file is obtained. For example, for a given file, the length of the file content may be 50 bytes,
  • the data type may be an integer, and a special character "$" may be located at the 34th byte of the file.
  • The length of the file, the data type of the file, and the special-character information are all metadata information.
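Metadata-driven parsing of a cached chunk can be sketched as follows (the field layout, a 4-byte integer at the start and a "$" at byte 34, follows the example above; the metadata dictionary format is an assumption):

```python
import struct

def parse_with_metadata(buffer, metadata):
    """Parse a cached chunk using the original file's metadata: total length,
    integer data type, endianness, and the offset of a special character."""
    length = metadata["length"]                      # e.g. 50 bytes
    endian = "<" if metadata["endian"] == "little" else ">"
    # the special character (e.g. "$") sits at a known byte offset
    marker = buffer[metadata["marker_offset"]:metadata["marker_offset"] + 1]
    # decode a 4-byte integer field at the start, honouring endianness
    (value,) = struct.unpack(endian + "i", buffer[:4])
    return {"length": length, "marker": marker, "value": value}
```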
  • The method further includes: setting a breakpoint in at least one of the reading step, the caching step, the pre-processing step, and the importing step, where the breakpoint is used to record information about the step's execution, and the recorded information is used for task recovery.
  • A breakpoint is set in each execution step, and the program running in the background executes multiple tasks concurrently, so when a failure occurs in an execution step, the breakpoint records information related to the error.
  • The breakpoint may record, for example, the time at which the error occurred, the cause of the error, the location of the error, and the state of the background program when the error occurred.
  • an apparatus embodiment for file reading is provided.
  • FIG. 5 is a schematic structural diagram of a file reading apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes a reading module 501, a cache module 503, a preprocessing module 505, an import module 507, and a loop module 509.
  • the reading module 501 is configured to perform the reading step of reading a predetermined length of data from the file, in a stream manner, according to the size of the buffer area.
  • the cache module 503 is configured to perform the caching step of placing the read data in the buffer area for caching.
  • the pre-processing module 505 is configured to perform a pre-processing step of pre-processing the buffered data according to the pre-configured pre-processing requirements to obtain the content of the data.
  • the import module 507 is configured to perform an importing step to save the content of the data to the data platform.
  • the looping module 509 is configured to cyclically perform the reading step, the caching step, the pre-processing step, and the importing step in sequence until the reading of the file is complete.
  • In this way, fixed-length data is read from the file as a byte stream and placed in the buffer area for caching; the cached data is then read from the buffer area byte by byte.
  • The data content of the file is parsed according to the length information of the file, the data type of the file content, special characters, endianness, codec, and so on; this completes the pre-processing of the data.
  • The parsed data is saved to the data platform, where data processing, data storage, query and retrieval, and analysis, mining, and display operations can be performed.
  • The reading step, caching step, pre-processing step, and importing step described above are performed cyclically until the reading of the large file is complete.
  • the foregoing apparatus further includes:
  • a splitting module 511, configured to split the file into multiple parts;
  • a processing module 513, configured to perform the reading step, caching step, pre-processing step, and importing step on the multiple parts of the file using multiple distributed services, and to save the content corresponding to the multiple parts to the data platform; or,
  • alternatively, configured to perform the reading step, caching step, and pre-processing step on the multiple parts of the file using multiple distributed services to obtain the content corresponding to each part, then merge the obtained contents and import the merged content to the data platform.
  • The splitting can be performed according to the processing capabilities of the different distributed services; that is, the file can be split into multiple parts in proportion to the processing capabilities of the multiple distributed services, and each part is assigned to the corresponding distributed service for processing.
  • For example, if the processing power of the first distributed service is twice that of the second distributed service, the part of the file assigned to the first distributed service can be twice the size of the part assigned to the second distributed service.
  • With this splitting method the parts differ in size, each corresponding to the processing power of its distributed service.
  • Alternatively, the file can be split into equal-sized parts, and a corresponding number of parts can then be allocated to each service according to its processing power.
  • For example, the original file is divided into four parts, recorded as a, b, c, and d, and there are four distributed services, A, B, C, and D.
  • Distributed service A performs the reading step, caching step, pre-processing step, and importing step on file part a,
  • yielding the parsed content A' of part a; similarly, parsing parts b, c, and d yields
  • the contents B', C', and D', and finally the parsed contents A', B', C', and D' are saved to the data platform.
  • In the alternative, the original file is likewise split into the four parts a, b, c, and d, and the four distributed services A, B,
  • C, and D perform the reading step, caching step, and pre-processing step on parts a, b, c, and d, obtaining the four contents A', B', C', and D'; the four parts are merged into one content A'B'C'D', the importing step is performed on the merged content, and it is imported to the data platform.
  • Each distributed service can correspond to a separate cache area, and the cache areas corresponding to the multiple distributed services are determined according to the resources of the servers where the distributed services are located.
  • each distributed service has a separate cache area, and the cache area is configured by the cache manager of the server where the distributed service is located.
  • the configuration of the resource may be static configuration or dynamic configuration. Dynamic configuration can be configured based on the current load of the distributed service and the remaining processing power.
  • the foregoing apparatus further includes:
  • the first configuration module 515 is configured to configure a size of the buffer area.
  • the second configuration module 517 is configured to configure a backup buffer area for the buffer area, where the size of the backup buffer area is the same as the size of the buffer area and the backup buffer area serves as a backup of the buffer area.
  • The size and number of buffer areas can be configured automatically according to memory usage; for example, two buffer areas are configured.
  • The two buffer areas have the same size and can be used during parsing of the file to handle garbled-character problems; multiple buffer areas that are recycled can also be configured.
  • the reading module 501 includes:
  • the first reading module 5011 is configured to acquire a size of the file.
  • For example, suppose the size of the file is 100 MB
  • and the threshold for the file size that can be processed at once is 10 MB. Since the file far exceeds the size that can be processed, it is read as a file stream. Assuming the buffer area is 1 MB, 1 MB of the original file's content is read from the stream each time.
  • the pre-processing module is configured to pre-process the cached data according to the metadata information, where the pre-processing module 505 includes:
  • an information obtaining module 5051, configured to read the data in the buffer area and obtain the content of the data according to the metadata information, where the metadata information is used to parse the content of the data and includes at least one of the following: length information, data type, special characters, endianness, and codec.
  • the foregoing apparatus further includes:
  • A breakpoint is set in each execution step, and the program running in the background executes multiple tasks concurrently, so when a failure occurs in an execution step, the breakpoint records information related to the error.
  • The breakpoint may record, for example, the time at which the error occurred, the cause of the error, the location of the error, and the state of the background program when the error occurred.
  • the disclosed technical contents may be implemented in other manners.
  • The device embodiments described above are merely illustrative.
  • The division into units may be a division by logical function; in actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • An integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, can be stored in a computer readable storage medium.
  • The technical solution of the present invention, in essence or in the part that contributes over the prior art, or the whole or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server or network device, etc.) to perform all or part of the steps of the various embodiments of the present invention.
  • The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Abstract

Disclosed in the present invention are a method and a device for reading a file. The method comprises: a reading step, reading, in a stream manner, data of a predetermined length from a file on the basis of the size of a buffer; a buffering step, placing the read data in the buffer for buffering; a preprocessing step, preprocessing, according to preconfigured preprocessing requirements, the buffered data to obtain the content of the data; an importing step, storing the content of the data into a data platform; and cyclically executing the reading step, buffering step, preprocessing step and importing step in sequence to accomplish the reading of the file. The present invention solves the technical problem caused by a large file.

Description

文件读取方法和装置File reading method and device 技术领域Technical field
本发明涉及大数据领域,具体而言,涉及一种文件读取方法和装置。The present invention relates to the field of big data, and in particular to a file reading method and apparatus.
背景技术Background technique
Today's society is developing rapidly: technology is advanced, information flows freely, people communicate ever more closely, and life grows ever more convenient. Big data is a product of this high-tech era. Big data has led to the creation of very large files, and reading a large file is problematic compared with processing the smaller files of the past.
For example, some industries routinely have to handle huge files of more than a dozen gigabytes or even tens of terabytes, while a 32-bit process has a virtual address space of only 4 GB, so such a file obviously cannot be loaded into memory in one pass.
As another example, if a file is large, reading its contents into a database is also problematic.
No effective solution has yet been proposed for the above problems caused by large files.
Summary of the invention
Embodiments of the present invention provide a file reading method and device, so as to solve at least the technical problems caused by large files.
According to one aspect of the embodiments of the present invention, a file reading method is provided, including: a reading step of reading, in a streaming manner, data of a predetermined length from a file according to the size of a buffer area; a buffering step of placing the read data in the buffer area; a preprocessing step of preprocessing the buffered data according to preconfigured preprocessing requirements to obtain the content of the data; an importing step of saving the content of the data to a data platform; and cyclically executing the reading step, the buffering step, the preprocessing step, and the importing step in sequence to complete the reading of the file.
According to another aspect of the embodiments of the present invention, a file reading device is further provided, including: a reading module configured to perform the reading step of reading, in a streaming manner, data of a predetermined length from a file according to the size of a buffer area; a buffering module configured to perform the buffering step of placing the read data in the buffer area; a preprocessing module configured to perform the preprocessing step of preprocessing the buffered data according to preconfigured preprocessing requirements to obtain the content of the data; and an importing module configured to perform the importing step of saving the content of the data to a data platform.
In the embodiments of the present invention, a big-data file is read in a distributed manner: data of a predetermined length is read as a stream, placed in a buffer area, and preprocessed to obtain the content of the data, and the content is finally saved to a data platform. This achieves the goal of quickly loading a big-data file into memory, thereby solving the technical problems caused by large files.
Brief description of the drawings
The drawings described herein are provided for a further understanding of the invention and constitute a part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a file reading method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an optional file reading method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an optional method of reading data of a predetermined length according to an embodiment of the present invention;
FIG. 4 is a flowchart of an optional method performed before reading data of a predetermined length according to an embodiment of the present invention; and
FIG. 5 is a schematic structural diagram of a file reading device according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, an embodiment of a file reading method is provided.
FIG. 1 shows a file reading method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102, the reading step: read data of a predetermined length from a file in a streaming manner according to the size of the buffer area.
Step S104, the buffering step: place the read data in the buffer area.
Step S106, the preprocessing step: preprocess the buffered data according to preconfigured preprocessing requirements to obtain the content of the data.
Step S108, the importing step: save the content of the data to the data platform.
Step S110: cyclically execute the reading step, the buffering step, the preprocessing step, and the importing step in sequence to complete the reading of the file.
In an optional embodiment, data of a fixed length is read from the file as a byte stream according to the size of the buffer area and placed in the buffer area. The buffered data is then read byte by byte, and the data content of the file is parsed according to the file's length information, the data types of its content, and information such as special characters, byte order, and encoding/decoding scheme; this completes the preprocessing of the data. Finally, the parsed data is saved to the data platform, where it can be processed, stored, queried, retrieved, analyzed, mined, and displayed. The reading step, the buffering step, the preprocessing step, and the importing step are executed cyclically until the reading of the large file is complete.
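The read–buffer–preprocess–import loop of steps S102–S110 can be sketched in Python as follows. The buffer size, the `preprocess` routine, and the list standing in for the data platform are illustrative assumptions, not part of the embodiment:

```python
import io

BUFFER_SIZE = 4096  # hypothetical buffer-area size


def preprocess(chunk: bytes) -> str:
    # Stand-in for the preconfigured preprocessing requirements
    # (here: simply decode the bytes; a real parser would also handle
    # byte order, special characters, record lengths, and so on).
    return chunk.decode("utf-8", errors="replace")


def read_file(stream, platform: list) -> None:
    while True:
        data = stream.read(BUFFER_SIZE)   # reading step (S102)
        if not data:
            break                         # end of file: loop finished (S110)
        buffered = data                   # buffering step (S104)
        content = preprocess(buffered)    # preprocessing step (S106)
        platform.append(content)          # importing step (S108)
```

With a 9,000-byte input and a 4,096-byte buffer area, for example, the loop executes three times, so the file is never held in memory in its entirety.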
In this embodiment, a big-data file is read in a distributed manner: data of a predetermined length is read as a stream, placed in the buffer area, and preprocessed to obtain the content of the data, and the content is finally saved to the data platform. This achieves the goal of quickly loading a big-data file into memory, thereby solving the technical problems caused by large files.
Considering that the file itself is large, parallel processing can be used to speed up processing; that is, the file can be split into multiple parts that are then processed separately. FIG. 2 shows the flow of this optional implementation. As shown in FIG. 2, the above method may further include the following steps:
Step S202: split the file into multiple parts.
Step S204: have multiple distributed services perform the reading step, the buffering step, the preprocessing step, and the importing step on the respective parts of the file, and save the content corresponding to the parts to the data platform; or have multiple distributed services perform the reading step, the buffering step, and the preprocessing step on the respective parts of the file to obtain the content corresponding to the parts, then merge the obtained content and import the merged content into the data platform.
When splitting, the file can be split according to the processing capacities of the different distributed services; that is, the file can be split into multiple parts according to the respective processing capacities of the distributed services, and the parts assigned to the corresponding services for processing. For example, if the first distributed service has twice the processing capacity of the second, the file part given to the first service can be twice the size of the part given to the second. With this approach the split parts differ in size, in proportion to the processing capacity of each distributed service. Alternatively, the file can be split into parts of equal size, and each service assigned a number of parts matching its processing capacity.
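The capacity-proportional split described above can be sketched as follows; the capacity weights are hypothetical inputs (e.g. reported throughput per service):

```python
def split_by_capacity(total_size: int, capacities: list) -> list:
    """Split a file of total_size bytes into parts proportional to
    each distributed service's processing capacity."""
    weight_sum = sum(capacities)
    sizes = [total_size * c // weight_sum for c in capacities]
    sizes[-1] += total_size - sum(sizes)  # fold the rounding remainder into the last part
    return sizes
```

So a 300-byte file split for two services with capacities 2 and 1 yields parts of 200 and 100 bytes, matching the two-to-one example above.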
For example, the original file is split into four parts, denoted a, b, c, and d, and there are four distributed services, A, B, C, and D. Distributed service A performs the reading step, the buffering step, the preprocessing step, and the importing step on part a, yielding the parsed content A'; likewise, parsing parts b, c, and d yields B', C', and D'. Finally, the parsed contents A', B', C', and D' are saved to the data platform. As another example, the original file is again split into four parts a, b, c, and d for the four distributed services A, B, C, and D. After services A, B, C, and D perform the reading step, the buffering step, and the preprocessing step on parts a, b, c, and d, the contents A', B', C', and D' of the four parts are obtained and merged into a whole A'B'C'D', and the merged content is imported into the data platform in the importing step.
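The second variant, parsing the parts in parallel and merging before importing, can be sketched with a thread pool; `parse_part` is a hypothetical stand-in for the per-service preprocessing:

```python
from concurrent.futures import ThreadPoolExecutor


def parse_part(part: bytes) -> str:
    # Hypothetical per-service preprocessing of one file part.
    return part.decode("utf-8").upper()


def parse_and_merge(parts: list) -> str:
    # Services A..D each parse their own part; Executor.map preserves
    # input order, so the merged result A'B'C'D' follows the original
    # file order regardless of which service finishes first.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        results = list(pool.map(parse_part, parts))
    return "".join(results)
```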
When multiple distributed services are used, each distributed service may have its own independent buffer area so that processing is faster, and the buffer area for each distributed service is determined according to the resources of the server on which that service runs.
In an optional embodiment, each of the multiple nodes corresponds to at least two cache blocks. While the content in the first of the two cache blocks is being processed, content is read from the file according to the size of the second cache block and placed in the second cache block; while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block and placed in the first cache block. The first and second cache blocks may be the same size or different sizes.
A node may be allocated multiple cache blocks. For example, node 1 is allocated two cache blocks, A and B. While node 1 is processing the content in cache block A, content can be read into cache block B, the amount read being determined by the size of block B; optionally, while node 1 is processing the content in cache block B, content can be read into cache block A, the amount read being determined by the size of block A. The reason is that after processing, the processed content must be placed in a cache block: if its size exceeded that of block A, block A could not hold it all and the content could not be placed. To allocate cache blocks sensibly, therefore, the size of the buffer to be written must be obtained before processing, and the size of the cached data allocated according to that buffer's size; the buffer to be written may be any free cache block on the node.
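The two-block alternation amounts to double buffering; a minimal sketch, using a background thread to model filling one block while the other is processed (the block size and helper names are illustrative):

```python
import io
import threading


def double_buffered_read(stream, block_size, process):
    """While the content of one cache block is being processed, the
    next block is filled from the file in the background."""
    front = stream.read(block_size)
    while front:
        holder = {}
        filler = threading.Thread(
            target=lambda: holder.update(back=stream.read(block_size)))
        filler.start()          # fill the other block ...
        process(front)          # ... while this block is processed
        filler.join()
        front = holder["back"]
```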
In an optional embodiment, a processing node may include two buffer areas: one receives the data sent by the reading node (the pending buffer), and the other holds data that the node has processed and that is waiting to be stored in the system (the to-be-saved buffer). Only after the data has been stored in the system is its processing considered complete, i.e., the import has succeeded; if the data was not processed successfully, the import has failed, which may be caused by a device fault. The reading node reads the file data in the buffer area and may divide the pending file in the buffer area into multiple cache blocks.
Optionally, each node corresponds to at least three cache blocks: while the content in one of the three blocks is being processed, newly read content is placed in the other two. When cache blocks are allocated, each node may be assigned at least three blocks, and the blocks can serve different operations such as reading and processing file content. For example, node 2 is allocated three cache blocks A, B, and C. The node can process the content in block A while pending file content from the buffer area is placed into blocks B and C; alternatively, pending data blocks allocated from the buffer area can be placed into blocks A and B.
An embodiment of the present invention may further include: the size and/or number of cache blocks in the buffer area are allocated according to the resources of the node where the blocks reside, the allocation being either periodic or triggered when a predetermined condition is met. In this implementation, because the nodes process at different speeds, some nodes may have finished processing the content of their cache blocks while others have not.
In an optional embodiment, the nodes process their multiple cache blocks in different orders, so the idle states of the blocks differ. For example, node 1 has three cache blocks A, B, and C and has finished processing blocks A and B; node 1 then has two cache blocks to which pending data blocks can be assigned. If multiple cache blocks on multiple nodes are idle, allocation can be made according to the free-block information reported by the nodes.
In an optional embodiment, the sizes of the cache blocks in the buffer area can be determined according to the sizes of the free cache blocks on the nodes; that is, when allocating blocks to the nodes, the sizes and numbers of the blocks can match the sizes and numbers of the free blocks each node reports. For example, if node 1 reports a free cache block A of 10 MB and a free cache block B of 20 MB, the buffer area can allocate one 10 MB block and one 20 MB block to node 1.
In another optional embodiment, cache blocks can be allocated periodically, i.e., once per preset time interval, for example every 5 seconds. In yet another optional embodiment, the size and number of cache blocks can be allocated according to preset conditions, which may cover various cases, for example determining the size and number of blocks according to the processing speed of each node.
For the above embodiment, in which the size and number of cache blocks are determined according to each node's processing speed, a feedback mechanism can be set up: while processing its allocated blocks, each node reports to the buffer area, at a fixed interval, the number and size of the blocks it has processed. The interval may be preset, e.g. 4 seconds, meaning that every 4 seconds each node reports how many blocks it processed in that period. A response unit can be set up in the buffer area to handle the data fed back by the nodes. Through this unit, the buffer area learns how many blocks each node has processed as well as the number and size of each node's free blocks, and determines each node's processing speed. Based on the processing speeds determined by the response unit and the idle states of the blocks on each node, the buffer area can reset the number and size of the cache blocks sent to each node.
Through this feedback mechanism, the buffer area can promptly adjust the size and number of cache blocks allocated to each node, giving more blocks to faster nodes and fewer to slower ones, so that the number of blocks sent to each node is more reasonable and resources are used more efficiently.
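A minimal sketch of that reallocation decision, assuming the response unit has already collected per-interval throughput reports from each node:

```python
def reallocate_blocks(total_blocks: int, speeds: dict) -> dict:
    """Give faster nodes more cache blocks, in proportion to the
    throughput each node reported in the last feedback interval."""
    total_speed = sum(speeds.values())
    alloc = {node: total_blocks * s // total_speed
             for node, s in speeds.items()}
    leftover = total_blocks - sum(alloc.values())
    # hand rounding leftovers to the fastest nodes first
    for node in sorted(speeds, key=speeds.get, reverse=True):
        if leftover == 0:
            break
        alloc[node] += 1
        leftover -= 1
    return alloc
```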
Optionally, the file to be read can be split by a single reading node, which distributes the split parts to the cache blocks of at least one of the multiple nodes for processing. After the pending file is read, it can be placed in the buffer area, and the reading node can then split it into different cache blocks. The reading node may be a node set up in the buffer area that controls the number and content of the cache blocks processed by every other node; it can split the pending file into multiple cache blocks and send each block to the corresponding node.
The reading node may handle the overall pending file, while each of the other nodes processes the content of the cache blocks the reading node assigns to it; this can be realized in a master-worker fashion, with one master reading node controlling the worker nodes that process the content of their respective cache blocks.
In another optional embodiment, when the size of the file to be read exceeds a threshold, the file is split by a reading node, which distributes the split parts according to the resources of each of the multiple nodes. The threshold may be a preset pending-file size; for example, with a setting of reading 200 MB of pending data at a time, the reading node judges the size of the file to be read, and if it exceeds the threshold, the reading node splits the file.
Specifically, when splitting, the content of the pending file can be divided into different cache blocks, the split being determined by the size and number of the free cache blocks on each of the nodes; after splitting, one or more of the resulting blocks can be distributed to the corresponding node.
Splitting the file to be read includes: determining the split points at which the file is divided; judging whether a part produced by a split point contains incomplete content at its end or beginning; and, if it does, moving the split point so that the content of the resulting part is complete.
In an optional embodiment, when determining the split points, multiple split points may be determined according to the differing resources of the nodes receiving the parts. When splitting, the positions of the split points can be determined first, based on the content of the pending file; because the content to be processed differs, the determined positions of the split points will also differ. For example, if the pending file includes text and pictures, a split point can be placed at the beginning or end of a piece of text, or at the beginning or end of a picture.
In an optional embodiment, if there are multiple texts or pictures, the pending texts or pictures can be grouped together when determining the split points, with the split points placed at the beginning and end of the group. In this way, a node processes a relatively complete file, which improves processing efficiency.
In another optional embodiment, judging whether a part produced by a split point contains incomplete content at its end or beginning includes: judging whether the content at the split point is structured or unstructured data; if structured, judging whether the part ends or begins with a complete record; if unstructured, judging whether the part ends or begins with a complete file. If a part of unstructured data is judged to end or begin with a complete file, the split point need not be moved; if not, the split point can be moved to the end or beginning of the unstructured data.
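For structured data, the adjustment can be sketched as moving each rough split point forward to the next record boundary. Newline-separated records are an assumption here; the real separator would come from the preconfigured preprocessing requirements:

```python
def adjusted_split_offsets(data: bytes, n_parts: int,
                           sep: bytes = b"\n") -> list:
    """Compute roughly equal split offsets, then move each split
    point forward past the next record separator so that no record
    is cut in half at a part boundary."""
    rough = [len(data) * i // n_parts for i in range(1, n_parts)]
    points = []
    for p in rough:
        nxt = data.find(sep, p)
        points.append(len(data) if nxt == -1 else nxt + 1)
    return [0] + points + [len(data)]
```

Each pair of adjacent offsets then delimits one part whose last record is complete.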
Structured and unstructured data can be identified from the content of the pending file. In this embodiment, structured data may be defined in advance, for example structured data stored in a database, where files can be pre-divided into corresponding structures and split at any time; unstructured data may be complete objects such as pictures or videos, which do not split well. When the reading node splits the pending file, identical unstructured data should therefore be kept together as much as possible, which improves the speed at which the nodes process the file.
Optionally, distributing the split file to the cache blocks of at least one of the multiple nodes for processing includes: the reading node distributes each part of the split file to a node group, each node group including at least one node, and the node group as a whole places the received part in cache blocks for processing. In this embodiment, after the reading node has split the pending file, it distributes the parts to the node groups, each of which may include several nodes.
In another optional embodiment, the buffer area, or the cache blocks of each node, may form a ring queue accessed through a write pointer and a read pointer; a slot in the ring queue that has been written but not yet read must not be written again. With a ring queue, the pending file is assigned to a circular sequence of cache blocks, i.e., the cache blocks form a ring; when processing the content stored in the blocks, a node writes the split file into the corresponding block through the write pointer and processes the block's content through the read pointer. After a node has written a file into a cache block, writing to that block again is forbidden while its content is being read, so that the node can process the block's content without interference.
In another optional embodiment, the entire buffer area may form one large ring queue accessed through a write pointer and a read pointer. When writing, the write pointer must not pass the current position of the read pointer, because the content beyond the read pointer has not yet been read and must not be overwritten; likewise, the read pointer must not pass the write pointer, because the address space beyond the write pointer has not yet received new data and therefore holds invalid data that must not be read. After the pending file is reassigned into multiple cache blocks, the ring queue is reallocated, and the size and number of the blocks in the queue can change accordingly.
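A single-threaded sketch of such a ring queue: `put` refuses to overwrite a slot that has been written but not yet read, and `get` refuses to read past the write pointer. The class and method names are illustrative:

```python
class RingQueue:
    """Fixed-capacity ring queue accessed through a write pointer
    and a read pointer."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.write = 0
        self.read = 0
        self.unread = 0  # slots written but not yet read

    def put(self, item) -> bool:
        if self.unread == len(self.slots):
            return False  # write pointer would pass the read pointer
        self.slots[self.write] = item
        self.write = (self.write + 1) % len(self.slots)
        self.unread += 1
        return True

    def get(self):
        if self.unread == 0:
            return None  # no valid data beyond the write pointer
        item = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        self.unread -= 1
        return item
```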
Optionally, when the ring queue of the buffer area is allocated, multiple nodes can be set up, each with its corresponding cache blocks. The number of nodes may correspond one-to-one with the number of blocks, i.e., one block per node, or a single node may control several adjacent blocks in the queue, so that regions of the ring queue are controlled by multiple nodes. The number of nodes can be fixed; after multiple cache blocks are reallocated, blocks can be added to the ring queue without changing the nodes of the corresponding regions, i.e., the number of blocks a node controls can vary in real time.
另一种可选的实施方式,读取节点可以对缓存块进行编号,并将编号及缓存块起始和/或结束位置记录下来,还可以记录文件内容是否成功导入,该导入操作可以包括缓存区拆分文件内容的操作、节点处理缓存块中的内容的操作、以及将处理后的文件存储起来的操作,直到整个文件导入成功,才可以删除相关缓存记录。可以设置断点机制,这样,可以在设备发送数据出现故障时,知道哪些文件数据导入成功,哪些文件数据导入失败,在设备故障修复后,可以将导入失败的数据重新缓存并发送给处理节点,同时可以使得已经成功导入的数据不需要重复导入。In another optional implementation manner, the reading node may number the cache block, record the number and the start and/or end position of the cache block, and record whether the file content is successfully imported. The import operation may include caching. The operation of splitting the contents of the file, the operation of the node processing the contents of the cache block, and the operation of storing the processed file until the entire file is successfully imported can delete the related cache record. You can set the breakpoint mechanism. In this way, you can know which file data is imported successfully and which file data import fails when the device sends data failure. After the device fault is repaired, the imported data can be re-cached and sent to the processing node. At the same time, data that has been successfully imported does not need to be repeatedly imported.
可选的,环状队列的大小可以根据节点对应的资源进行调整,在节点处理了缓存 块中的内容后,读取节点可以根据不同文件内容确定拆分的缓存块的大小。Optionally, the size of the ring queue can be adjusted according to the resource corresponding to the node, and the cache is processed at the node. After the contents of the block, the read node can determine the size of the split cache block according to different file contents.
其中,对于缓存队列的长度,可以先判断其是否为定长的结构化数据,若判断出是定长的结构化数据,可以按单条数据的整数倍确定队列长度;若非定长的结构,可以按照系统设定的长度确定队列长度,该系统设置的队列长度可以是用户根据实际情况自主设置,这里不做限定。For the length of the cache queue, it may be first determined whether it is structured data of a fixed length. If it is determined that the structured data is fixed length, the queue length may be determined by an integer multiple of a single data; if the structure is not fixed length, The length of the queue is determined according to the length of the system. The queue length set by the system can be set by the user according to the actual situation.
As an optional embodiment, each distributed service has an independent buffer area, whose resources are configured by the cache manager of the server on which the distributed service runs.

It should be noted that resource configuration may be static or dynamic. Dynamic configuration can be based on the current load and the remaining processing capacity of the distributed service.

As an optional embodiment, since each distributed service has an independent buffer area determined by the resources of its host server, the processing capacity of each distributed service differs, and so does the size of each service's buffer area. When the file is split into multiple parts, the parts are not necessarily equal in size; each distributed service processes its share of the split file according to its own processing capacity.

Optionally, as shown in FIG. 4, before data of a predetermined length is read from the file as a stream according to the size of the buffer area, the method further includes the following steps:

Step S402: configure the size of the buffer area.

Step S404: configure a backup buffer area for the buffer area, where the backup buffer area has the same size as the buffer area and serves as a backup of the buffer area.

As an optional embodiment, the size and number of buffer areas can be configured automatically according to memory usage. Two buffer areas of the same size may be configured and used to handle the garbled characters that can appear when the file is parsed; multiple buffer areas may also be configured and used in rotation.
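One way a second, equal-sized buffer could help with garbled characters is by carrying the incomplete trailing bytes of a multi-byte character over to the next read. The sketch below shows that idea using Python's incremental decoder; this is an assumed interpretation of the two-buffer scheme, not the patent's implementation:

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks read into alternating buffers.

    Incomplete trailing bytes of a multi-byte character are carried
    over to the next chunk (the role the spare buffer could play),
    so characters split across buffer boundaries do not come out
    garbled.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any tail
    return "".join(pieces)
```

Decoding each chunk in isolation would raise an error (or emit replacement characters) whenever a chunk boundary falls inside a multi-byte character; the incremental decoder buffers the partial bytes instead.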
The method shown in FIG. 1 can be applied to the reading of all files, but it may also be applied only to the processing of large files. FIG. 3 shows such an optional implementation: it is a flowchart of a method for reading data of a predetermined length from a file as a stream according to the size of the buffer area. As shown in FIG. 3, the method includes the following steps:

Step S302: obtain the size of the file.

Step S304: if the size of the file exceeds a threshold, read data of a predetermined length from the file as a stream according to the size of the buffer area.

As an optional embodiment, suppose the file size is 100 MB and the threshold for directly processable files is 10 MB. Since the size of the file far exceeds the processable size, the file is read as a stream. Assuming the buffer area is 1 MB, 1 MB of the original file content is read on each pass.
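A minimal sketch of this size check and chunked stream reading follows; the 10 MB threshold and 1 MB buffer from the example are passed in as parameters, and the function name is hypothetical:

```python
import os

def read_in_chunks(path, threshold=10 * 2**20, buffer_size=2**20):
    """Yield the file's bytes.

    Files at or below the threshold are yielded whole; larger files
    are streamed in buffer-sized chunks so the whole file never has
    to sit in memory at once.
    """
    size = os.path.getsize(path)      # Step S302: obtain file size
    with open(path, "rb") as f:
        if size <= threshold:         # small file: read directly
            yield f.read()
            return
        while True:                   # Step S304: stream by buffer size
            chunk = f.read(buffer_size)
            if not chunk:
                break
            yield chunk
```

With a 100 MB file, a 10 MB threshold, and a 1 MB buffer, this yields one hundred 1 MB chunks instead of a single 100 MB read.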
Optionally, preprocessing the cached data according to metadata information includes: reading from the buffer area byte by byte and obtaining the content of the data according to the metadata information, where the metadata information is used to parse the content of the data and includes at least one of the following: length information, data type, special characters, byte order, and encoding/decoding information.

As an optional embodiment, different files use different encodings. When a file is parsed to obtain its content, using the wrong encoding produces garbled characters. To avoid garbled Chinese characters, the data in the buffer area must be parsed according to the encoding of the original file, that is, by first obtaining the metadata information of the original file. For example, a file may be 50 bytes long, the data type in the file may be integer, and the special character "$" may be located at the 34th byte of the file. The file length, the data type, and the special-character information are all metadata information.
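A hedged sketch of metadata-driven parsing along these lines, assuming a metadata record that carries the encoding, the special separator character, and per-field data types; the dictionary keys and function name are invented for illustration:

```python
def parse_with_metadata(raw, meta):
    """Parse a cached byte buffer using metadata about the original
    file: its text encoding, its special separator character, and
    the data type of each field (all keys here are illustrative)."""
    text = raw.decode(meta["encoding"])        # decode with the file's
                                               # own encoding to avoid
                                               # garbled characters
    fields = text.split(meta["special_char"])  # e.g. "$"-separated
    # Cast each field to its declared data type.
    return [cast(value) for cast, value in zip(meta["types"], fields)]

meta = {"encoding": "utf-8", "special_char": "$", "types": [str, int]}
```

Decoding with the wrong encoding at the first step is exactly where mojibake would be introduced, which is why the encoding belongs in the metadata rather than being guessed per chunk.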
Optionally, the method further includes: setting a breakpoint in at least one of the reading step, the caching step, the preprocessing step, and the importing step, where the breakpoint is used to record information when a step fails, and the recorded information is used for task recovery.

As an optional embodiment, a breakpoint is set in each execution step. Because the program running in the background executes multiple tasks concurrently, when a failure occurs in a step, the breakpoint records information about the error. For example, if an error such as a buffer-area overflow occurs while the data is being parsed during preprocessing, the breakpoint records the time of the error, its cause, its location, and the state the background program was in when the error occurred. When the task is recovered, the relevant information can be obtained directly from the breakpoint and execution can resume from the failed step, without re-executing all the steps, which saves execution time.
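The breakpoint-and-resume behavior described above might be sketched as follows. The step names mirror the four steps of the method, while the record fields and handler interface are assumptions for illustration:

```python
STEPS = ["read", "cache", "preprocess", "import"]

def run_with_breakpoints(handlers, checkpoint=None):
    """Run the pipeline steps in order, skipping steps already
    completed per the checkpoint; on failure, return a breakpoint
    record (failed step and cause) so a later run can resume there
    instead of re-executing everything."""
    start = STEPS.index(checkpoint) + 1 if checkpoint else 0
    for step in STEPS[start:]:
        try:
            handlers[step]()
        except Exception as err:
            # A fuller record would also include timestamp and
            # program state, as the embodiment describes.
            return {"failed_step": step, "reason": str(err)}
        checkpoint = step
    return {"completed": checkpoint}
```

A first run that fails in preprocessing returns the breakpoint record; a second run passes the last successful step back as `checkpoint` and picks up from the failed step.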
Embodiment 2

According to an embodiment of the present invention, an embodiment of a file reading apparatus is provided.

FIG. 5 is a schematic structural diagram of a file reading apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes a reading module 501, a caching module 503, a preprocessing module 505, an importing module 507, and a looping module 509.

The reading module 501 is configured to perform the reading step: reading data of a predetermined length from the file as a stream according to the size of the buffer area.

The caching module 503 is configured to perform the caching step: placing the read data in the buffer area for caching.

The preprocessing module 505 is configured to perform the preprocessing step: preprocessing the cached data according to pre-configured preprocessing requirements to obtain the content of the data.

The importing module 507 is configured to perform the importing step: saving the content of the data to the data platform.

The looping module 509 is configured to execute the reading step, the caching step, the preprocessing step, and the importing step in a loop to complete the reading of the file.

As an optional embodiment, data of a fixed length is read from the file as a byte stream according to the size of the buffer area and placed in the buffer area for caching. The cached data is then read from the buffer area byte by byte, and the data content of the file is parsed according to the file's length information, the data type of the file content, and information such as special characters, byte order, and encoding; this completes the preprocessing of the data. Finally, the parsed data is saved to the data platform, where it can be processed, stored, queried, retrieved, analyzed, mined, and displayed. The reading step, the caching step, the preprocessing step, and the importing step are executed in a loop until the reading of the large file is completed.
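The read/cache/preprocess/import loop can be sketched in a few lines. This is an illustration under assumed interfaces (a byte stream, a preprocessing callable, and a list standing in for the data platform), not the patent's code:

```python
def read_file(stream, buffer_size, preprocess, platform):
    """Loop: read a fixed-length chunk as a byte stream, cache it in
    a buffer, preprocess the buffered bytes into content, and import
    the content into the data platform, until the file is exhausted."""
    while True:
        data = stream.read(buffer_size)      # reading step
        if not data:                         # file exhausted: loop ends
            break
        cache = bytearray(data)              # caching step (buffer area)
        content = preprocess(bytes(cache))   # preprocessing step
        platform.append(content)             # importing step
    return platform
```

Each pass through the loop handles exactly one buffer's worth of data, so memory use stays bounded by `buffer_size` regardless of the file's total size.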
In this embodiment, a large data file is read in a distributed manner: data of a predetermined length is read as a stream, placed in the buffer area, and preprocessed to obtain the content of the data, and the content is finally saved to the data platform. This achieves the goal of quickly loading a large data file into memory, thereby solving the technical problem caused by files that are relatively large.

Optionally, as shown in FIG. 5, the apparatus further includes:

a splitting module 511, configured to split the file into multiple parts; and

a processing module 513, configured to perform the reading step, the caching step, the preprocessing step, and the importing step on the multiple parts of the file respectively through multiple distributed services, saving the content corresponding to the multiple parts to the data platform; or to perform the reading step, the caching step, and the preprocessing step on the multiple parts of the file respectively through the multiple distributed services to obtain the content corresponding to the multiple parts, then merge the obtained content and import the merged content into the data platform.

When splitting, the file can be split according to the processing capacities of the different distributed services; that is, the file can be split into multiple parts according to the respective processing capacities of the multiple distributed services, and the parts can be assigned to the corresponding services for processing. For example, if the processing capacity of the first distributed service is twice that of the second, the file part given to the first service can be twice the size of the part given to the second. With this splitting method the resulting parts differ in size, in proportion to the processing capacities of the distributed services. Alternatively, the file can be split into parts of equal size, and each service can be assigned a number of parts according to its processing capacity.
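A sketch of capacity-proportional splitting, where part sizes follow the ratio of the services' processing capacities; this is a simplified model, as the patent does not prescribe a particular formula:

```python
def split_by_capacity(total_size, capacities):
    """Split a file of total_size bytes into parts proportional to
    each distributed service's processing capacity. The last part
    absorbs any rounding remainder so the sizes sum exactly."""
    whole = sum(capacities)
    sizes = [total_size * c // whole for c in capacities]
    sizes[-1] += total_size - sum(sizes)  # fix integer-division loss
    return sizes
```

With capacities of 2 and 1, a 300-byte file splits into 200 and 100 bytes, matching the "twice the processing capacity, twice the file size" example above.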
As an optional embodiment, suppose the original file is split into four parts, denoted a, b, c, and d, and there are four distributed services, A, B, C, and D. Distributed service A performs the reading step, the caching step, the preprocessing step, and the importing step on part a; after execution, the parsed content A' of part a is obtained. Likewise, parsing parts b, c, and d yields B', C', and D', and finally the parsed contents A', B', C', and D' are saved to the data platform. In another example, the original file is again split into four parts a, b, c, and d, with four distributed services A, B, C, and D. After services A, B, C, and D perform the reading step, the caching step, and the preprocessing step on parts a, b, c, and d, the contents A', B', C', and D' of the four parts are obtained; these four parts are merged into a single whole A'B'C'D', the importing step is performed on the merged content, and it is imported onto the data platform.

When multiple distributed services are used, to speed up distributed processing, each distributed service may correspond to an independent buffer area, and the buffer areas corresponding to the multiple distributed services are determined according to the resources of the servers on which the distributed services are located.

As an optional embodiment, each distributed service has an independent buffer area, whose resources are configured by the cache manager of the server on which the distributed service runs.

It should be noted that resource configuration may be static or dynamic. Dynamic configuration can be based on the current load and the remaining processing capacity of the distributed service.

As an optional embodiment, since each distributed service has an independent buffer area determined by the resources of its host server, the processing capacity of each distributed service differs, and so does the size of each service's buffer area. When the file is split into multiple parts, the parts are not necessarily equal in size; each distributed service processes its share of the split file according to its own processing capacity.
Optionally, as shown in FIG. 5, the apparatus further includes:

a first configuration module 515, configured to configure the size of the buffer area; and

a second configuration module 517, configured to configure a backup buffer area for the buffer area, where the backup buffer area has the same size as the buffer area and serves as a backup of the buffer area.

As an optional embodiment, the size and number of buffer areas can be configured automatically according to memory usage. Two buffer areas of the same size may be configured and used to handle the garbled characters that can appear when the file is parsed; multiple buffer areas may also be configured and used in rotation.

Optionally, as shown in FIG. 5, the reading module 501 includes:

a first reading module 5011, configured to obtain the size of the file; and

a second reading module 5013, configured to read data of a predetermined length from the file as a stream according to the size of the buffer area when the size of the file exceeds a threshold.

As an optional embodiment, suppose the file size is 100 MB and the threshold for directly processable files is 10 MB. Since the size of the file far exceeds the processable size, the file is read as a stream. Assuming the buffer area is 1 MB, 1 MB of the original file content is read on each pass.

Optionally, as shown in FIG. 5, the preprocessing module 505 is configured to preprocess the cached data according to metadata information, and includes:

an information obtaining module 5051, configured to read from the buffer area byte by byte and obtain the content of the data according to the metadata information, where the metadata information is used to parse the content of the data and includes at least one of the following: length information, data type, special characters, byte order, and encoding/decoding information.

As an optional embodiment, different files use different encodings. When a file is parsed to obtain its content, using the wrong encoding produces garbled characters. To avoid garbled Chinese characters, the data in the buffer area must be parsed according to the encoding of the original file, that is, by first obtaining the metadata information of the original file. For example, a file may be 50 bytes long, the data type in the file may be integer, and the special character "$" may be located at the 34th byte of the file. The file length, the data type, and the special-character information are all metadata information.

Optionally, as shown in FIG. 5, the apparatus further includes:

a breakpoint module 519, configured to set a breakpoint in at least one of the reading step, the caching step, the preprocessing step, and the importing step, where the breakpoint is used to record information when a step fails, and the recorded information is used for task recovery.

As an optional embodiment, a breakpoint is set in each execution step. Because the program running in the background executes multiple tasks concurrently, when a failure occurs in a step, the breakpoint records information about the error. For example, if an error such as a buffer-area overflow occurs while the data is being parsed during preprocessing, the breakpoint records the time of the error, its cause, its location, and the state the background program was in when the error occurred. When the task is recovered, the relevant information can be obtained directly from the breakpoint and execution can resume from the failed step, without re-executing all the steps, which saves execution time.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function; in actual implementation there may be other ways of dividing them: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are merely preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (23)

  1. A file reading method, comprising:
    a reading step of reading data of a predetermined length from a file as a stream according to the size of a buffer area;
    a caching step of placing the read data in the buffer area for caching;
    a preprocessing step of preprocessing the cached data according to pre-configured preprocessing requirements to obtain the content of the data;
    an importing step of saving the content of the data to a data platform; and
    performing the reading step, the caching step, the preprocessing step, and the importing step in a loop to complete the reading of the file.
  2. The method according to claim 1, wherein:
    the file is split into multiple parts; and
    the reading step, the caching step, the preprocessing step, and the importing step are performed on the multiple parts of the file respectively through multiple distributed services, and the content corresponding to the multiple parts is saved to the data platform; or
    the reading step, the caching step, and the preprocessing step are performed on the multiple parts of the file respectively through the multiple distributed services to obtain the content corresponding to the multiple parts, the obtained content is then merged, and the merged content is imported into the data platform.
  3. The method according to claim 2, wherein the buffer area comprises multiple cache blocks distributed over multiple nodes, and each node is configured to process the content in the cache block corresponding to that node.
  4. The method according to claim 3, wherein each of the multiple nodes corresponds to at least two cache blocks, wherein:
    while the content in a first cache block of the at least two cache blocks is being processed, content is read from the file according to the size of a second cache block of the at least two cache blocks and the read content is placed in the second cache block; and while the content in the second cache block is being processed, content is read from the file according to the size of the first cache block of the at least two cache blocks and the read content is placed in the first cache block, wherein the first cache block and the second cache block are of the same size or of different sizes.
  5. The method according to claim 4, wherein each node corresponds to at least three cache blocks, and while the content in one of the three cache blocks is being processed, the read content is placed in the other two of the three cache blocks.
  6. The method according to claim 3, wherein the method further comprises:
    allocating the size of the cache blocks and/or the number of cache blocks in the buffer area according to the resource situation of the nodes where the cache blocks are located, wherein the allocation is periodic, or the allocation is performed when a predetermined condition is satisfied.
  7. The method according to any one of claims 2 to 6, wherein splitting the file into multiple parts comprises:
    splitting the file through one reading node; and
    distributing, through the reading node, the split file to the cache block corresponding to at least one of the multiple nodes for processing.
  8. The method according to claim 7, wherein:
    when the size of the file exceeds a threshold, the file to be read is split through one reading node; and
    the split file is distributed by the reading node according to the resource situation of each of the multiple nodes.
  9. The method according to claim 7, wherein splitting the file comprises:
    determining a split point at which the file is to be split;
    judging whether the part obtained by splitting at the split point includes incomplete content at its end or beginning; and
    if incomplete content is included, moving the split point so that the content included in the split part is complete.
  10. 根据权利要求9所述的方法,其中,判断根据所述拆分点拆分得到的部分在结束处或者开始处是否包括不完整的内容包括:The method according to claim 9, wherein judging whether the portion obtained by splitting according to the split point includes incomplete content at the end or at the beginning comprises:
    判断所述拆分点处的内容是否为结构化数据或者非结构化数据;Determining whether the content at the split point is structured data or unstructured data;
    如果为结构化数据则判断根据所述拆分点拆分得到的部分在所述结束处或者所述开始处是否为完整的记录; If it is structured data, it is judged whether the portion obtained by splitting according to the split point is a complete record at the end or at the beginning;
    如果为非结构化数据则判断根据所述拆分点拆分得到的部分在所述结束处或所述开始处是否为完整的文件。If it is unstructured data, it is judged whether the portion obtained by the split point split is a complete file at the end or at the beginning.
  11. The method according to claim 7, wherein distributing, by the reading node, the parts of the split file to cache blocks corresponding to at least one of the plurality of nodes for processing comprises:
    distributing, by the reading node, each part of the split file to a node group, wherein each node group includes at least one node, and the node group as a whole places the received part in a cache block for processing.
  12. The method according to claim 3, wherein the buffer area or the cache block corresponding to each of the nodes is a ring queue, the ring queue is accessed via a write pointer and a read pointer, and a slot in the ring queue that has been written but not yet read is prohibited from being written again.
  13. The method according to claim 12, wherein the size of the ring queue is adjusted according to the resources corresponding to the node.
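For illustration only, the ring queue of claims 12 and 13 — a write pointer, a read pointer, and a prohibition on overwriting unread slots — can be sketched as a blocking single-producer/single-consumer structure. The slot count and the choice to block (rather than drop) are assumptions of the sketch:

```python
import threading

class RingQueue:
    """Fixed number of cache slots accessed via write/read pointers.
    A slot that has been written but not yet read cannot be written
    again; put() blocks until the reader frees a slot."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.slots = slots
        self.write_ptr = 0
        self.read_ptr = 0
        self.unread = 0                    # written-but-unread slots
        self.cond = threading.Condition()

    def put(self, item):
        with self.cond:
            while self.unread == self.slots:   # every slot holds unread data
                self.cond.wait()               # block instead of overwriting
            self.buf[self.write_ptr] = item
            self.write_ptr = (self.write_ptr + 1) % self.slots
            self.unread += 1
            self.cond.notify_all()

    def get(self):
        with self.cond:
            while self.unread == 0:
                self.cond.wait()
            item = self.buf[self.read_ptr]
            self.buf[self.read_ptr] = None     # slot may be written again
            self.read_ptr = (self.read_ptr + 1) % self.slots
            self.unread -= 1
            self.cond.notify_all()
            return item
```

Resizing the queue per node (claim 13) would then amount to choosing `slots` from the node's available memory.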
  14. The method according to claim 2, wherein the plurality of distributed services respectively correspond to independent buffer areas, and the buffer area corresponding to each of the plurality of distributed services is determined according to the resources of the server on which that distributed service is located.
  15. The method according to claim 2, wherein the file is split into a plurality of parts according to the respective processing capabilities of the plurality of distributed services, and the plurality of parts are allocated to the corresponding distributed services for processing.
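As an illustrative sketch of claim 15 (the proportional weighting scheme is an assumption; the claim does not prescribe a formula), parts can be sized in proportion to each service's processing capability:

```python
def split_by_capability(file_size, capabilities):
    """Divide a file of file_size bytes into byte ranges proportional
    to each distributed service's capability weight."""
    total = sum(capabilities)
    ranges, start = [], 0
    for i, cap in enumerate(capabilities):
        # The last part absorbs rounding so the ranges cover the file.
        end = file_size if i == len(capabilities) - 1 \
              else start + file_size * cap // total
        ranges.append((start, end))
        start = end
    return ranges
```

A service with twice the capability weight thus receives a part roughly twice as large, which keeps the services finishing at about the same time.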
  16. The method according to any one of claims 1 to 15, wherein reading data of a predetermined length from the file in a streaming manner according to the size of the buffer area comprises:
    obtaining the size of the file;
    in a case where the size of the file exceeds a threshold, reading the data of the predetermined length from the file as a stream file according to the size of the buffer area.
  17. The method according to any one of claims 1 to 15, wherein, before reading data of a predetermined length from the file as a stream file according to the size of the buffer area, the method further comprises:
    configuring the size of the buffer area;
    configuring a spare buffer area for the buffer area, wherein the size of the spare buffer area is identical to the size of the buffer area, and the spare buffer area serves as a backup of the buffer area.
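The buffer-plus-spare configuration of claim 17 is, in effect, double buffering: while one buffer is being drained, its equal-sized backup can accept the next read. A minimal sketch for illustration (the class and method names are assumptions):

```python
class DoubleBuffer:
    """A buffer area plus a spare of identical size; swap() exchanges
    the active buffer for the spare so reading can continue while the
    previously active buffer is still being processed."""

    def __init__(self, size):
        self.size = size
        self.active = bytearray(size)
        self.spare = bytearray(size)   # backup, same size as the buffer area

    def swap(self):
        self.active, self.spare = self.spare, self.active
        return self.active
```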
  18. The method according to claim 1, wherein preprocessing the cached data according to metadata information comprises:
    reading from the buffer area byte by byte, and obtaining the content of the data according to the metadata information, wherein the metadata information is used for content parsing of the data, and the metadata information includes at least one of: length information, data type, byte order, special characters, encoding/decoding scheme, and terminator information.
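Purely for illustration, a byte-level parse driven by such metadata (byte order, field types, terminator) might look like the following. The field layout and metadata keys are invented for the example and are not defined by the application:

```python
import struct

def parse_records(buf, metadata):
    """Walk a byte buffer and decode fixed-length records according to
    metadata: (name, struct format) field pairs, a byte order, and a
    terminator that must follow each record."""
    order = "<" if metadata["byte_order"] == "little" else ">"
    rec_fmt = order + "".join(fmt for _, fmt in metadata["fields"])
    rec_len = struct.calcsize(rec_fmt)
    term = metadata["terminator"]
    out, pos = [], 0
    while pos + rec_len <= len(buf):
        values = struct.unpack_from(rec_fmt, buf, pos)
        out.append(dict(zip((name for name, _ in metadata["fields"]), values)))
        pos += rec_len
        if buf[pos:pos + len(term)] != term:
            break                      # malformed record boundary: stop
        pos += len(term)
    return out
```

In practice the same metadata would also select the character encoding and flag special characters; the sketch covers only length, type, byte order, and terminator.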
  19. The method according to any one of claims 1 to 15, further comprising:
    setting a breakpoint in at least one of the reading step, the caching step, the preprocessing step, and the importing step, wherein the breakpoint is used to record information in a case where execution of the step fails, and the recorded information is used for task recovery.
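As an illustrative sketch of claim 19's breakpoints (the JSON checkpoint format and function names are assumptions), a failed step can be recorded so a later run resumes at that step rather than restarting the whole pipeline:

```python
import json
import os

def run_with_breakpoints(steps, checkpoint_path):
    """Run (name, fn) pipeline steps in order; on failure, record which
    step failed so a later invocation can resume from it."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["failed_step"]   # resume where we stopped
    for i in range(start, len(steps)):
        name, fn = steps[i]
        try:
            fn()
        except Exception as exc:
            with open(checkpoint_path, "w") as f:
                json.dump({"failed_step": i, "step": name,
                           "error": str(exc)}, f)
            raise
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)    # clean run: no recovery needed
```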
  20. A file reading device, comprising:
    a reading module, configured to perform a reading step of reading data of a predetermined length from a file in a streaming manner according to the size of a buffer area;
    a caching module, configured to perform a caching step of placing the read data in the buffer area for caching;
    a preprocessing module, configured to perform a preprocessing step of preprocessing the cached data according to a preconfigured preprocessing requirement to obtain the content of the data;
    an importing module, configured to perform an importing step of saving the content of the data to a data platform;
    a looping module, configured to cyclically execute the reading step, the caching step, the preprocessing step, and the importing step in sequence to complete reading of the file.
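The read/cache/preprocess/import loop that the modules of claim 20 carry out can be sketched, for illustration only, as a single streaming function (the callback-based interface is an assumption of the sketch, not the claimed device structure):

```python
def read_file_in_chunks(path, buffer_size, preprocess, import_to_platform):
    """Loop: read a buffer-sized chunk as a stream, cache it, preprocess
    it into content, import the content; repeat until the file is done."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(buffer_size)    # reading step
            if not chunk:
                break                           # whole file consumed
            cached = bytearray(chunk)           # caching step (buffer area)
            content = preprocess(cached)        # preprocessing step
            import_to_platform(content)         # importing step
```

Because only one buffer-sized chunk is ever held at a time, memory use stays bounded regardless of how large the file is, which is the point of the streaming design.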
  21. The device according to claim 20, wherein the device further comprises:
    a splitting module, configured to split the file into a plurality of parts;
    a processing module, configured to perform, via a plurality of distributed services, the reading step, the caching step, the preprocessing step, and the importing step on each of the plurality of parts of the file, respectively, and save the content corresponding to the plurality of parts to the data platform; or configured to perform, via the plurality of distributed services, the reading step, the caching step, and the preprocessing step on each of the plurality of parts of the file, respectively, to obtain the content corresponding to the plurality of parts, then merge the obtained content and import the merged content into the data platform.
  22. A storage medium, wherein the storage medium includes a stored program, and the program, when executed, performs the file reading method according to any one of claims 1 to 19.
  23. A processor, wherein the processor is configured to run a program, and the program, when run, performs the file reading method according to any one of claims 1 to 19.
PCT/CN2017/099554 2016-09-26 2017-08-30 Method and device for reading file WO2018054200A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610851849.9 2016-09-26
CN201610851849.9A CN107870928A (en) 2016-09-26 2016-09-26 File reading and device

Publications (1)

Publication Number Publication Date
WO2018054200A1 true WO2018054200A1 (en) 2018-03-29

Family

ID=61689371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/099554 WO2018054200A1 (en) 2016-09-26 2017-08-30 Method and device for reading file

Country Status (2)

Country Link
CN (1) CN107870928A (en)
WO (1) WO2018054200A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344092A (en) * 2018-09-11 2019-02-15 天津易华录信息技术有限公司 A kind of method and system improving cold storing data reading speed
CN110784756A (en) * 2019-12-31 2020-02-11 珠海亿智电子科技有限公司 File reading method and device, computing equipment and storage medium
CN111552440A (en) * 2020-04-26 2020-08-18 全球能源互联网研究院有限公司 Cloud-edge-end data synchronization method for power internet of things
CN111680474A (en) * 2020-06-08 2020-09-18 中国银行股份有限公司 Method and device for repairing messy codes of files
CN112698877A (en) * 2019-10-21 2021-04-23 上海哔哩哔哩科技有限公司 Data processing method and system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI698740B (en) * 2018-08-27 2020-07-11 宏碁股份有限公司 Deployment method of recovery images and electronic device using the same
CN110750505A (en) * 2019-08-31 2020-02-04 苏州浪潮智能科技有限公司 Large file reading optimization method, device, equipment and storage medium
CN112764908B (en) * 2021-01-26 2024-01-26 北京鼎普科技股份有限公司 Network data acquisition processing method and device and electronic equipment
CN113783939A (en) * 2021-08-20 2021-12-10 奇安信科技集团股份有限公司 File transmission method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615175A (en) * 2009-08-11 2009-12-30 深圳市五巨科技有限公司 A kind of system and method for reading electronic book of mobile terminal
CN102521349A (en) * 2011-12-12 2012-06-27 深圳市创新科信息技术有限公司 Pre-reading method of files
CN103412950A (en) * 2013-08-28 2013-11-27 浙江大学 Method for increasing read-write speed of spatial big data files
CN104331255A (en) * 2014-11-17 2015-02-04 中国科学院声学研究所 Embedded file system-based reading method for streaming data
CN104394229A (en) * 2014-12-09 2015-03-04 浪潮电子信息产业股份有限公司 Large file uploading method based on concurrent transmission mode
CN105761039A (en) * 2016-02-17 2016-07-13 华迪计算机集团有限公司 Method for processing express delivery information big data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0981497A (en) * 1995-09-12 1997-03-28 Toshiba Corp Real-time stream server, storing method for real-time stream data and transfer method therefor
CN101202882B (en) * 2007-07-19 2010-09-15 深圳市同洲电子股份有限公司 Method, system for transmitting medium resource and set-top box
CN101119278A (en) * 2007-09-14 2008-02-06 广东威创日新电子有限公司 Method and system for processing mass data
CN101127578A (en) * 2007-09-14 2008-02-20 广东威创日新电子有限公司 A method and system for processing a magnitude of data
CN103077149A (en) * 2013-01-09 2013-05-01 厦门市美亚柏科信息股份有限公司 Method and system for transmitting data
CN103164538B (en) * 2013-04-11 2016-10-19 深圳市华力特电气股份有限公司 A kind of data analysis method and device
CN105701178B (en) * 2016-01-05 2017-06-09 北京汇商融通信息技术有限公司 Distributed picture storage system


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344092A (en) * 2018-09-11 2019-02-15 天津易华录信息技术有限公司 A kind of method and system improving cold storing data reading speed
CN109344092B (en) * 2018-09-11 2023-06-23 天津易华录信息技术有限公司 Method and system for improving cold storage data reading speed
CN112698877A (en) * 2019-10-21 2021-04-23 上海哔哩哔哩科技有限公司 Data processing method and system
CN112698877B (en) * 2019-10-21 2023-07-14 上海哔哩哔哩科技有限公司 Data processing method and system
CN110784756A (en) * 2019-12-31 2020-02-11 珠海亿智电子科技有限公司 File reading method and device, computing equipment and storage medium
CN110784756B (en) * 2019-12-31 2020-05-29 珠海亿智电子科技有限公司 File reading method and device, computing equipment and storage medium
CN111552440A (en) * 2020-04-26 2020-08-18 全球能源互联网研究院有限公司 Cloud-edge-end data synchronization method for power internet of things
CN111680474A (en) * 2020-06-08 2020-09-18 中国银行股份有限公司 Method and device for repairing messy codes of files
CN111680474B (en) * 2020-06-08 2024-02-23 中国银行股份有限公司 File messy code repairing method and device

Also Published As

Publication number Publication date
CN107870928A (en) 2018-04-03

Similar Documents

Publication Publication Date Title
WO2018054200A1 (en) Method and device for reading file
CN108052675B (en) Log management method, system and computer readable storage medium
CN105872016B (en) The operation method of virtual machine in a kind of desktop cloud
US10649905B2 (en) Method and apparatus for storing data
CN110247984B (en) Service processing method, device and storage medium
US9836516B2 (en) Parallel scanners for log based replication
CN110633378A (en) Graph database construction method supporting super-large scale relational network
CN110716848A (en) Data collection method and device, electronic equipment and storage medium
CN113094430B (en) Data processing method, device, equipment and storage medium
WO2023066182A1 (en) File processing method and apparatus, device, and storage medium
US20170083387A1 (en) High-performance computing framework for cloud computing environments
CN112486913A (en) Log asynchronous storage method and device based on cluster environment
WO2017015059A1 (en) Efficient cache warm up based on user requests
CN109788251B (en) Video processing method, device and storage medium
CN109710502B (en) Log transmission method, device and storage medium
CN114780615A (en) Error code management method and device thereof
CN113051221A (en) Data storage method, device, medium, equipment and distributed file system
CN112433812A (en) Method, system, equipment and computer medium for virtual machine cross-cluster migration
CN112506432A (en) Dynamic and static separated real-time data storage and management method and device for electric power automation system
CN107621994A (en) The method and device that a kind of data snapshot creates
CN112363980A (en) Data processing method and device for distributed system
CN110781137A (en) Directory reading method and device for distributed system, server and storage medium
CN113849686A (en) Video data acquisition method and device, electronic equipment and storage medium
CN115809015A (en) Method for data processing in distributed system and related system
CN111427654A (en) Instruction processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852269

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.07.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17852269

Country of ref document: EP

Kind code of ref document: A1