CN112667411B - Data processing method and device, electronic equipment and computer storage medium - Google Patents


Info

Publication number
CN112667411B
CN112667411B
Authority
CN
China
Prior art keywords
data
file
source data
message queue
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910985320.XA
Other languages
Chinese (zh)
Other versions
CN112667411A (en)
Inventor
张永曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910985320.XA priority Critical patent/CN112667411B/en
Publication of CN112667411A publication Critical patent/CN112667411A/en
Application granted granted Critical
Publication of CN112667411B publication Critical patent/CN112667411B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a data processing method and device, an electronic device, and a computer storage medium. The method comprises: writing source data in a MapReduce task into a first message queue, the source data comprising a plurality of files; and taking files of the source data out of the first message queue and processing each taken-out file at least once using the Map tasks in the MapReduce task, where each file processed by a Map task is one file of the source data. Based on the first message queue, the files of the source data can thus be randomly distributed to different Map tasks for processing, and the fragment to which a file belongs does not need to be determined before the file is processed; that is, dynamic fragmentation of the source data is realized. Furthermore, because each Map task processes only one file at a time, the difference in data processing time among the Map tasks is reduced, and the execution efficiency of the whole MapReduce task is improved.

Description

Data processing method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of big data, and in particular, to a data processing method and apparatus, an electronic device, and a computer storage medium.
Background
MapReduce is a programming model for parallel operation on large-scale data sets; owing to the MapReduce programming model, people without distributed-programming experience can run programs on large-scale distributed systems. The InputFormat module is an important module in the MapReduce programming model: it is mainly responsible for traversing and fragmenting the source data in a MapReduce task and for parsing the fragmented data with a RecordReader so that the Map tasks can process it. How the InputFormat module fragments the traversed source data in a MapReduce task is therefore an important factor affecting the execution efficiency of the MapReduce task.
In the prior art, there are two main schemes for fragmenting the traversed source data in a MapReduce task. In the first scheme, the traversed data is uniformly fragmented by a UniformSizeInputFormat module, and before the Map tasks process the fragmented data, the corresponding fragments are assigned to the Map tasks in advance; that is, every Map task in the MapReduce task processes files of the same total data volume. However, some Map tasks in the MapReduce task run slower than others because of network or server conditions, which affects the execution efficiency of the whole MapReduce task.
In the second scheme, a DynamicInputFormat module first divides the traversed data into several Chunks, each Chunk containing files of the same total data volume; when the Map tasks start to process data, each Map task randomly obtains one Chunk to process and, after finishing it, randomly obtains the next Chunk.
Disclosure of Invention
Embodiments of the present application are intended to provide a data processing method, apparatus, electronic device, and computer storage medium.
The embodiment of the application provides a data processing method, which comprises the following steps:
writing source data in the MapReduce task into a first message queue; the source data comprises a plurality of files;
taking out the file of the source data from the first message queue, and processing the taken-out file at least once by utilizing each Map task in the MapReduce tasks; and each file processed by each Map task is a file of the source data.
Optionally, the writing source data in the MapReduce task into the first message queue includes:
and traversing source data in the MapReduce task in a multithreading mode, taking out the traversed data, and writing the taken out data into the first message queue when the taken out data meets a first preset condition.
Optionally, the traversing source data in the MapReduce task in a multi-thread manner, and fetching the traversed data includes:
and determining the number of threads for processing the source data, writing the source data into a non-blocking queue, and taking out the data from the non-blocking queue at least once by using each thread.
Optionally, the first preset condition includes: the extracted data is a file of source data.
Optionally, the first preset condition further includes: and the file names of the taken data are successfully matched regularly.
Optionally, the number of Map tasks in the MapReduce task is a specified number.
Optionally, the first message queue is a Redis message queue.
The embodiment of the present application further provides a data processing apparatus, where the apparatus includes: a first processing module and a second processing module, wherein:
a first processing module: the device is used for writing source data in the MapReduce task into a first message queue; the source data comprises a plurality of files;
a second processing module: the file is used for taking out the source data from the first message queue, and each Map task in the MapReduce tasks is utilized to process the taken-out file at least once; and each file processed by each Map task is a file of the source data.
Optionally, the first processing module is configured to traverse source data in the MapReduce task in a multi-thread manner, take out the traversed data, and write the taken out data into the first message queue when the taken out data meets a first preset condition.
Optionally, the first processing module is configured to determine the number of threads that process the source data, write the source data into a non-blocking queue, and take out data from the non-blocking queue at least once by using each thread.
Optionally, the first preset condition includes: the extracted data is a file of source data.
Optionally, the first preset condition further includes: and the file names of the taken data are successfully matched regularly.
Optionally, the number of Map tasks in the MapReduce task is a specified number.
Optionally, the first message queue is a Redis message queue.
An embodiment of the present application further provides an electronic device, including a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor is configured to execute any one of the above data processing methods when the computer program is run.
The embodiment of the present application also provides a computer storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements any one of the above-mentioned data processing methods.
With the data processing method and device, electronic device, and computer storage medium, source data in a MapReduce task is written into a first message queue, the source data comprising a plurality of files; files of the source data are taken out of the first message queue, and each Map task in the MapReduce task processes a taken-out file at least once, where each file processed by a Map task is one file of the source data. Based on the first message queue, the files of the source data can thus be randomly allocated to different Map tasks for processing, and the fragment to which a file belongs does not need to be determined before the file is processed; that is, dynamic fragmentation of the source data is realized. Furthermore, because each Map task processes only one file at a time, the difference in data processing time among the Map tasks is reduced, and the execution efficiency of the whole MapReduce task is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 2 is a system diagram of data processing according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating thread execution according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the examples provided herein are merely illustrative of the present application and are not intended to limit the present application. In addition, the following examples are provided as partial examples for implementing the present application, not all examples for implementing the present application, and the technical solutions described in the examples of the present application may be implemented in any combination without conflict.
It should be noted that, in the embodiments of the present application, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other relevant elements (e.g., steps in a method, or units in a device, where a unit may be part of a circuit, part of a processor, part of a program or software, etc.) in the method or device that includes the element.
For example, the method for processing data provided by the embodiment of the present application includes a series of steps, but the method for processing data provided by the embodiment of the present application is not limited to the described steps, and similarly, the apparatus for processing data provided by the embodiment of the present application includes a series of modules, but the apparatus provided by the embodiment of the present application is not limited to include the modules explicitly described, and may also include modules that are required to be configured to acquire relevant information or perform processing based on the information.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
Embodiments of the application are operational with numerous other general purpose or special purpose computing system environments or configurations, and with terminal and server computing systems. Here, the terminal may be a thin client, a thick client, a hand-held or laptop device, a microprocessor-based system, a set-top box, a programmable consumer electronics, a network personal computer, a small computer system, etc., and the server may be a server computer system, a small computer system, a mainframe computer system, a distributed cloud computing environment including any of the above, etc.
The electronic devices of the terminal, server, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In some embodiments of the present application, a data processing method is provided, which implements dynamic fragmentation of the source data, and each Map task processes only one file at a time, so that a difference between data processing times of the Map tasks is reduced, and execution efficiency of the whole MapReduce task is improved.
Example one
Fig. 1 is a schematic flowchart of a data processing method in an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
s101: writing source data in the MapReduce task into a first message queue; the source data includes a plurality of files.
For the implementation of this step, exemplarily, the source data in the MapReduce task may be traversed in a multi-threaded manner and the traversed data taken out; when the taken-out data meets a first preset condition, the taken-out data is written into the first message queue, and when it does not, the data is not written into the first message queue.
In this embodiment of the application, the source data in the MapReduce task may be data including a single directory, where the directory may include a plurality of files, or may be data including multiple layers of nested directories, where each level of the directory may include a file and/or a next level of directory.
In the related art, the source data in a MapReduce task is traversed in a single-threaded manner; when the number of files in the source data is large or the directory nesting is complex, single-threaded traversal is inefficient. In the embodiment of the present application, the source data in the MapReduce task is traversed in a multi-threaded manner and the traversed source data is obtained, so that source data characterized by a large data volume and multiple layers of nested directories can be traversed more quickly through concurrent processing, improving the execution efficiency of the MapReduce task.
Alternatively, the number of threads for processing the source data of the MapReduce task can be determined, the source data is written into the non-blocking queue, and the data is taken out of the non-blocking queue at least once by each thread.
Here, the non-blocking queue is a queue that enables producers to add data to it and consumers to take data out of it, where a producer is a thread adding data to the queue and a consumer is a thread taking data out of the queue; whether a producer adds data or a consumer takes data out, execution on the non-blocking queue does not block, that is, the non-blocking queue enables multiple threads to process its data concurrently. The number of threads may be set according to actual requirements, or determined according to the number of processor cores on the client server; further, a thread pool may be created according to the determined number of threads to coordinate and schedule the threads.
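The thread-pool-plus-non-blocking-queue arrangement described above can be sketched with standard Java concurrency classes; the class and method names below are illustrative, not from the patent:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: size a fixed thread pool by CPU core count and let every
// thread repeatedly take items out of a shared non-blocking queue.
public class QueueDrain {

    public static int drain(Iterable<String> sourceData, int nThreads) throws InterruptedException {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (String item : sourceData) {
            queue.offer(item);                            // write source data into the non-blocking queue
        }
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                while (queue.poll() != null) {            // non-blocking take
                    processed.incrementAndGet();          // stand-in for real processing
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors(); // thread count from CPU cores
        System.out.println(drain(java.util.List.of("a", "b", "c", "d"), n));
    }
}
```

Every item is taken exactly once regardless of how the threads interleave, which is the property the patent relies on when several threads drain one queue.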
Therefore, based on the thread pool and the non-blocking queue, data in the non-blocking queue can be reasonably coordinated and processed in a multi-thread concurrent mode, and execution efficiency of the MapReduce task is improved.
Optionally, the first preset condition includes: and the retrieved traversed data is a file in the source data.
It can be seen that when the retrieved traversed data is a file in the source data, the basic requirements of subsequent data processing can be satisfied.
Optionally, the first preset condition further includes: and the file names of the extracted traversed data are successfully matched in a regular mode.
Here, the file name regular matching refers to selecting data consistent with a preset file name in a regular matching manner by using the file name as a preset condition, and exemplarily, the regular matching of the file name can be realized by using a regular expression.
Therefore, the traversed data can be screened through the file name regular matching, expected data can be obtained, and the execution efficiency of tasks is further improved through filtering invalid data.
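The file-name screening step can be illustrated with `java.util.regex`; the concrete pattern below is an invented example, since the patent leaves the actual expression to the user:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Minimal sketch of the filename filter: keep only files whose names match
// a preset regular expression (the pattern here is a made-up example).
public class FileNameFilter {

    private static final Pattern EXPECTED = Pattern.compile("^cdr_\\d{8}\\.csv$");

    public static boolean matches(String fileName) {
        return EXPECTED.matcher(fileName).matches();
    }

    public static List<String> filter(List<String> fileNames) {
        // non-matching ("invalid") files are simply dropped
        return fileNames.stream().filter(FileNameFilter::matches).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("cdr_20191016.csv", "tmp.log", "cdr_20191017.csv")));
    }
}
```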
In a specific example, the number of cores of the processor on the client server is obtained through the InputFormat module, where the processor is a Central Processing Unit (CPU); the number of threads for processing the source data in the MapReduce task is determined according to the number of CPU cores, and a thread pool is created according to the determined number of threads, where the thread pool is a fixed-size thread pool (FixedThreadPool) used to coordinate and schedule the threads. At the same time, a non-blocking queue is created, where the non-blocking queue is a ConcurrentLinkedQueue, and semaphores (Semaphore) are created, where the number of semaphores equals the number of threads in the thread pool and the semaphores represent the data-processing state of the non-blocking queue. The source data in the MapReduce task is analyzed and determined to be data containing a single directory; the source data is written into the non-blocking queue, all threads in the thread pool are started to process the data in the non-blocking queue, each thread takes data out of the non-blocking queue at least once, and the taken-out data is written into the first message queue when it is a file whose file name is successfully matched by the regular expression.
S102: taking out the file of the source data from the first message queue, and processing the taken-out file at least once by utilizing each Map task in the MapReduce tasks; and the file processed by each Map task is one file of the source data.
Optionally, the first message queue may be a message queue with high concurrency characteristics, and exemplarily, the first message queue may be a Redis message queue.
Therefore, by utilizing the characteristic of high concurrency of the first message queue, all Map tasks can process data in the first message queue at the same time, and the execution efficiency of the MapReduce task is further improved.
Alternatively, the number of Map tasks in the MapReduce task may be a specified number; the user may set the desired number of Map tasks, for example, through the mapreduce.job.maps parameter in the MapReduce model.
In the related art, the UniformSizeInputFormat module determines the number of fragments, that is, the number of Map tasks, by fixing a fragment count in advance while ensuring that every fragment processes the same data volume; however, because individual file sizes differ, the number of fragments has to be adjusted during the process of equalizing the fragment sizes, so the finally determined number of fragments, that is, the number of Map tasks, may differ considerably from the expected number. The DynamicInputFormat module determines the number of fragments as Min(user-expected number, number of generated Chunks); because the Chunk-generation logic is complex and the final estimate is inaccurate, the finally determined number of Map tasks is also larger than the expected number.
In an actual application scenario, referring to fig. 2, the system diagram of data processing provided in an embodiment of the present application, the InputFormat module, acting as the producer, traverses the source data in the MapReduce task in a multi-threaded manner and takes out the traversed data; when the taken-out data is a file whose file name is successfully matched by the regular expression, the taken-out data is written into the Redis message queue. Meanwhile, the expected number of Map tasks is set through the mapreduce.job.maps parameter in the MapReduce model. While the MapReduce task runs, all Map tasks, acting as consumers, concurrently process the data taken out of the Redis message queue, each Map task processing one file at a time; when all the data in the Redis message queue has been consumed, the MapReduce task is complete.
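The consumer side of fig. 2 can be sketched as follows, with an in-memory ConcurrentLinkedQueue standing in for the Redis message queue and plain threads standing in for Map tasks (both substitutions are simplifications, not the patent's implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Several "Map tasks" each repeatedly take exactly one file per queue
// operation and process it, until the shared queue is drained.
public class MapTaskConsumers {

    public static Map<String, Integer> run(List<String> files, int nMapTasks) throws InterruptedException {
        ConcurrentLinkedQueue<String> redisStandIn = new ConcurrentLinkedQueue<>(files);
        ConcurrentHashMap<String, Integer> perTask = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(nMapTasks);
        for (int i = 0; i < nMapTasks; i++) {
            final String taskName = "map-" + i;
            pool.submit(() -> {
                while (redisStandIn.poll() != null) {          // one file per take
                    perTask.merge(taskName, 1, Integer::sum);  // count files handled by this task
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return perTask;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("f1", "f2", "f3", "f4", "f5"), 2));
    }
}
```

Which task gets which file depends on runtime scheduling, which mirrors the random file-to-task assignment the patent describes; only the total number of processed files is deterministic.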
It can be seen that, based on the Redis message queue, the file of each source data can be randomly allocated to different Map tasks for processing, and before each file is processed, the corresponding fragment does not need to be determined, that is, dynamic fragmentation of the source data is realized, further, each Map task only processes one file at a time, so that the difference between the data processing time of each Map task is reduced, and the execution efficiency of the whole MapReduce task is improved.
In practical applications, the steps S101 to S102 may be implemented based on a Processor in an electronic Device, and the Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is to be understood that the electronic device for implementing the above-described processor function may be other electronic devices, and the embodiments of the present application are not limited in particular.
The embodiment of the application provides a data processing method: source data in a MapReduce task is written into a first message queue, the source data comprising a plurality of files; files of the source data are taken out of the first message queue, and each Map task in the MapReduce task processes a taken-out file at least once, where each file processed by a Map task is one file of the source data. Based on the first message queue, the files of the source data can thus be randomly allocated to different Map tasks for processing, and the fragment to which a file belongs does not need to be determined before the file is processed; that is, dynamic fragmentation of the source data is realized. Furthermore, because each Map task processes only one file at a time, the difference in data processing time among the Map tasks is reduced, and the execution efficiency of the whole MapReduce task is improved.
Example two
In order to further embody the purpose of the present application, a further example is provided on the basis of the first embodiment of the present application.
The second embodiment of the present application provides a specific implementation manner that source data in a MapReduce task is traversed in a multithreading manner, and meanwhile, data that is traversed and meets a first preset condition is written into a first message queue, where the first message queue is a Redis message queue.
First, according to the number of CPU cores on the client server obtained by the InputFormat module, the number of threads is determined to be 20, and a thread pool containing 20 threads is created; at the same time, a non-blocking queue (a ConcurrentLinkedQueue) is created, and semaphores are created, the number of semaphores being equal to the number of threads in the thread pool. After the source data of the MapReduce task, which contains multiple layers of nested directories, is written into the non-blocking queue, the 20 threads in the thread pool are started and process the data in the non-blocking queue concurrently.
Referring to fig. 3, the thread execution flowchart provided in the embodiment of the present application, each thread first performs a semaphore acquire operation and then judges whether the non-blocking queue is empty. When the non-blocking queue is empty, that is, when there is no data in it, the thread performs a release operation, marking its semaphore as released, where the released state indicates that the thread has finished processing data; it then checks the states of the other semaphores, and when all semaphores are in the released state, the data in the non-blocking queue has been fully processed and the thread exits; otherwise, the thread re-executes the semaphore acquire operation. When the non-blocking queue is not empty, the thread performs a pop operation and judges whether the taken-out data is a file or a directory. When the taken-out data is a file, regular-expression matching is performed on its file name: if the match succeeds, the file information is packaged into the JavaBean object FilePair, serialized through Gson, and written into the Redis message queue, after which the pop operation on the non-blocking queue is executed again; if the match fails, the file is discarded and the pop operation is executed again. When the taken-out data is a directory, a list operation is executed on the directory and its result is obtained; each entry in the result that is a directory is written back into the non-blocking queue, and each entry that is a file undergoes the same file-name regular matching, being discarded on failure and written into the Redis message queue on success, after which the pop operation on the non-blocking queue is executed again.
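A single-threaded sketch of the per-thread loop of fig. 3 is given below; a file-suffix check stands in for the regular expression and an in-memory list stands in for the Redis message queue (both are simplifications, and the semaphore bookkeeping is omitted since there is only one worker):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Take an entry from the non-blocking queue: a directory is listed and its
// children pushed back into the queue; a file is kept when its name matches.
public class TraversalWorker {

    public static List<String> traverse(Path root) throws IOException {
        ConcurrentLinkedQueue<Path> queue = new ConcurrentLinkedQueue<>();
        List<String> matched = new ArrayList<>();         // stand-in for the Redis queue
        queue.offer(root);
        Path entry;
        while ((entry = queue.poll()) != null) {          // pop operation
            if (Files.isDirectory(entry)) {
                try (DirectoryStream<Path> ds = Files.newDirectoryStream(entry)) {
                    for (Path child : ds) {
                        queue.offer(child);               // directories go back into the queue
                    }
                }
            } else if (entry.getFileName().toString().endsWith(".csv")) { // stand-in for regex match
                matched.add(entry.getFileName().toString());
            }                                             // non-matching files are discarded
        }
        return matched;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("src");
        Files.createFile(dir.resolve("a.csv"));
        Files.createDirectories(dir.resolve("sub"));
        Files.createFile(dir.resolve("sub").resolve("b.csv"));
        Files.createFile(dir.resolve("sub").resolve("c.log"));
        System.out.println(traverse(dir).size());
    }
}
```

Because directories are pushed back into the same queue they are taken from, arbitrarily deep nesting is handled without recursion, which is what lets several threads share the traversal in the multi-threaded version.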
When all threads have exited, the source data in the MapReduce task has been traversed in a multi-threaded manner, and the traversed data meeting the preset condition has been written into the Redis message queue.
Therefore, the source data of the MapReduce task with the characteristics of high data volume and multilayer nested directories can be traversed more quickly in a concurrent processing mode, and the processing efficiency of the MapReduce task is improved.
EXAMPLE III
In order to further embody the purpose of the present application, further illustration is made on the basis of the first and second embodiments of the present application.
The third embodiment of the present application provides a specific implementation manner of a method for implementing dynamic fragmentation by using a first message queue, and meanwhile, the fragmentation granularity is accurate to a file level, where the first message queue is a Redis message queue.
In a specific application scenario, following the method of the second embodiment, a Redis message queue containing the traversed data that meets the preset condition is first obtained through the InputFormat module. The desired number of Map tasks, that is, the number of fragments, is then set through the mapreduce.job.maps parameter in the MapReduce model, and the data in the Redis message queue is fragmented according to the determined fragment count, each fragment corresponding to a fragment object; in this application scenario, the fragment object is a RedisSplit object. During fragmentation, the size of each fragment is first determined through the getLength() method; in this application scenario the fragments are specified to be of equal size, and the value for each fragment is determined as the total file count divided by the fragment count. The names of the nodes on which each fragment is stored and the storage information of each fragment would then be returned through the getLocations() and getLocationInfo() methods; because in this application scenario the data resides in the Redis message queue rather than on fixed nodes, these two location factors are omitted.
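The equal-size computation above (total file count divided by fragment count) can be illustrated as follows; the patent does not specify how a remainder is distributed, so giving it to the last fragment is an assumption:

```java
import java.util.Arrays;

// Sketch of the split-size arithmetic: every fragment gets totalFiles / numSplits
// files, with any remainder assigned to the last fragment (an assumption).
public class SplitSizing {

    public static long[] splitSizes(long totalFiles, int numSplits) {
        long base = totalFiles / numSplits;
        long[] sizes = new long[numSplits];
        Arrays.fill(sizes, base);
        sizes[numSplits - 1] += totalFiles % numSplits; // remainder to last split
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitSizes(10, 3))); // [3, 3, 4]
    }
}
```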
Further, all RedisSplit objects need to be initialized; that is, the relevant configuration information of the Redis message queue is initialized into every RedisSplit object. After all RedisSplit objects have been initialized, the generated list containing all RedisSplit objects is returned to the MapReduce framework, and the fragmentation process ends.
Further, each fragment in the Redis message queue needs to be parsed into Key/Value pairs for processing by the Map tasks. In this application scenario, the parsing of each fragment is implemented by a RedisRecordReader, which inherits from the parent class RecordReader, that is, it is consistent with the functions of RecordReader. In the parsing process, first, in the initialize() method, a Jedis object is initialized according to the relevant configuration information of the Redis message queue carried in the RedisSplit object, where the Jedis object can operate on the Redis message queue; that is, the Redis message queue becomes operable after initialize(). Second, the nextKeyValue() method takes a queue element out of the Redis message queue non-blockingly through an lpop operation. The queue element taken out by nextKeyValue() is then parsed into a Key and a Value in the getCurrentKey() and getCurrentValue() methods respectively; concretely, the queue element is deserialized through Gson into the JavaBean object FilePair, from which the Key and the Value are obtained, so that the Map task can obtain and process the Key and Value. Further, during RedisRecordReader parsing, the data-processing percentage can be determined through the getProgress() method, where the percentage can be determined by formula (1):
progress = numRecordProcessedByThisMap / (numRecordProcessedByThisMap + llen)    (1)
wherein numRecordProcessedByThisMap represents the number of Key and Value pairs already processed, and llen, the remaining length of the Redis message queue, can be obtained by the llen method.
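Formula (1) can be expressed directly as a small function. The sketch below is a hedged Python rendering; the function and parameter names are illustrative, and the zero-denominator guard is an added assumption for when both the processed count and the queue are empty:

```python
def get_progress(num_records_processed_by_this_map, remaining_queue_len):
    """Progress per formula (1): pairs already processed divided by
    processed pairs plus the remaining Redis queue length (what the
    llen method would return)."""
    total = num_records_processed_by_this_map + remaining_queue_len
    if total == 0:
        return 1.0  # nothing queued and nothing processed: treat as done
    return num_records_processed_by_this_map / total

print(get_progress(30, 70))  # 0.3: 30 pairs done, 70 elements still queued
```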
When all data in the Redis message queue has been fetched, the nextKeyValue() method of the RedisRecordReader returns false, indicating that the Map task has finished processing all data in the Redis message queue.
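The RedisRecordReader behavior described above (a non-blocking pop, Gson-style deserialization into a FilePair, and a false return when the queue is empty) can be simulated with stdlib Python. A `deque` stands in for the Redis list, `json` for Gson, and the `"key"`/`"value"` field names of the serialized FilePair are assumptions:

```python
import json
from collections import deque

class RedisRecordReaderSketch:
    """Simulates the RecordReader contract: next_key_value() pops one
    serialized FilePair from the queue; get_current_key/value expose it."""
    def __init__(self, queue):
        self.queue = queue   # stands in for the Redis list + Jedis client
        self.current = None

    def next_key_value(self):
        if not self.queue:                # non-blocking: empty queue -> False
            return False
        raw = self.queue.popleft()        # analogous to the lpop operation
        self.current = json.loads(raw)    # analogous to Gson deserialization
        return True

    def get_current_key(self):
        return self.current["key"]        # e.g. a file path (assumed field)

    def get_current_value(self):
        return self.current["value"]      # e.g. file contents (assumed field)

q = deque([json.dumps({"key": "/data/a.log", "value": "a"}),
           json.dumps({"key": "/data/b.log", "value": "b"})])
reader = RedisRecordReaderSketch(q)
pairs = []
while reader.next_key_value():
    pairs.append((reader.get_current_key(), reader.get_current_value()))
print(pairs)  # both elements read in queue order; False then ends the loop
```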
It can be seen that, based on the Redis message queue, the files of the source data can be randomly allocated to different Map tasks for processing, and the fragment corresponding to each file does not need to be determined before processing; that is, dynamic fragmentation of the source data is realized. Further, each Map task processes only one file at a time, which reduces the difference between the data processing times of the Map tasks and improves the execution efficiency of the whole MapReduce task.
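The dynamic-allocation effect can be demonstrated with a small simulation in which several Map workers each pop one file at a time from a shared queue, so no worker is bound to a predetermined, possibly unbalanced, set of files. All names here are illustrative Python stand-ins for the Redis queue and the Map tasks:

```python
import queue
import threading
from collections import defaultdict

files = [f"file_{i}" for i in range(10)]
redis_like = queue.Queue()          # stands in for the Redis message queue
for f in files:
    redis_like.put(f)

processed = defaultdict(list)
lock = threading.Lock()

def map_task(name):
    while True:
        try:
            f = redis_like.get_nowait()   # each task pops one file at a time
        except queue.Empty:
            return                        # queue drained: task finishes
        with lock:
            processed[name].append(f)     # "process" the file

tasks = [threading.Thread(target=map_task, args=(f"map{i}",))
         for i in range(3)]
for t in tasks: t.start()
for t in tasks: t.join()

total = sum(len(v) for v in processed.values())
print(total)  # all 10 files processed exactly once across the 3 Map tasks
```

Because `queue.Queue` hands each element to exactly one consumer, every file is processed once, and a task that happens to draw slow files simply claims fewer of them.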
Example four
For the data processing method in the first embodiment of the present application, a data processing apparatus is further provided in the fourth embodiment of the present application.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: a first processing module 400 and a second processing module 401, wherein:
the first processing module 400 is configured to write source data in the MapReduce task into a first message queue, the source data comprising a plurality of files;
the second processing module 401 is configured to take the files of the source data out of the first message queue, and to process each taken-out file at least once by using the Map tasks in the MapReduce task, each file processed by a Map task being one file of the source data.
In an embodiment, the first processing module 400 is configured to traverse the source data in the MapReduce task in a multi-threaded manner, take out the traversed data, and write the taken-out data into the first message queue when the taken-out data meets a first preset condition.
In one embodiment, the first processing module 400 is configured to determine the number of threads to process the source data, write the source data into a non-blocking queue, and fetch data from the non-blocking queue at least once by using each thread.
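The producer behavior in the two embodiments above (several threads draining a non-blocking queue and forwarding qualifying files to the message queue) can be sketched as follows. The regular expression and the use of `queue.Queue` as a stand-in for the Redis message queue are illustrative assumptions:

```python
import queue
import re
import threading

def produce(paths, num_threads, pattern):
    source = queue.Queue()              # the non-blocking queue of traversed paths
    for p in paths:
        source.put(p)
    message_queue = queue.Queue()       # stands in for the Redis message queue

    def worker():
        while True:
            try:
                path = source.get_nowait()   # non-blocking fetch
            except queue.Empty:
                return
            # first preset condition: the item's file name matches the regex
            if re.search(pattern, path):
                message_queue.put(path)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return message_queue

mq = produce(["a.log", "b.txt", "c.log"], num_threads=2, pattern=r"\.log$")
print(mq.qsize())  # 2: only the .log files pass the filter
```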
In one embodiment, the first preset condition includes: the extracted data is a file of source data.
In one embodiment, the first preset condition further includes: the file name of the taken-out data successfully matches a regular expression.
In an embodiment, the number of Map tasks in the MapReduce task is a specified number.
In one embodiment, the first message queue is a Redis message queue.
In practical applications, the first processing module 400 and the second processing module 401 may be implemented by a processor located in an electronic device, and the processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor.
In addition, each functional module in this embodiment may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional module.
Based on this understanding, the technical solution of this embodiment, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Specifically, the computer program instructions corresponding to the data processing method in this embodiment may be stored on a storage medium such as an optical disc, a hard disk, or a USB flash drive. When the computer program instructions corresponding to the data processing method in the storage medium are read and executed by an electronic device, the data processing method of any one of the foregoing embodiments is implemented.
Based on the same technical concept of the foregoing embodiment, referring to fig. 5, it shows an electronic device 50 provided in an embodiment of the present application, which may include: a memory 51 and a processor 52; wherein,
the memory 51 for storing computer programs and data;
the processor 52 is configured to execute the computer program stored in the memory to implement the method for data processing of any one of the foregoing embodiments.
In practical applications, the memory 51 may be a volatile memory (RAM); or a non-volatile memory (non-volatile memory) such as a ROM, a flash memory (flash memory), a Hard Disk Drive (HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor 52.
The processor 52 may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present application may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
The methods disclosed in the method embodiments provided by the present application can be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in various product embodiments provided by the application can be combined arbitrarily to obtain new product embodiments without conflict.
The features disclosed in the various method or apparatus embodiments provided herein may be combined in any combination to arrive at new method or apparatus embodiments without conflict.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method of data processing, the method comprising:
determining the number of threads for processing source data in a MapReduce task, writing the source data into a non-blocking queue, taking out the data from the non-blocking queue at least once by using each thread, and writing the taken out data into a first message queue when the taken out data meets a first preset condition; the source data comprises a plurality of files; the first message queue is a Redis message queue;
taking out the file of the source data from the first message queue, and processing the taken-out file at least once by utilizing each Map task in the MapReduce tasks; and the file processed by each Map task is one file of the source data.
2. The method according to claim 1, wherein the first preset condition comprises:
the extracted data is a file of source data.
3. The method of claim 2, wherein the first preset condition further comprises:
the file name of the taken-out data successfully matches a regular expression.
4. The method according to claim 1, wherein the number of Map tasks in the MapReduce tasks is a specified number.
5. An apparatus for data processing, the apparatus comprising: a first processing module and a second processing module, wherein:
a first processing module, configured to: determine the number of threads for processing source data in a MapReduce task, write the source data into a non-blocking queue, take data out of the non-blocking queue at least once by using each thread, and write the taken-out data into a first message queue when the taken-out data meets a first preset condition; the source data comprises a plurality of files; the first message queue is a Redis message queue;
a second processing module, configured to: take the files of the source data out of the first message queue, and process each taken-out file at least once by using the Map tasks in the MapReduce task; each file processed by a Map task is one file of the source data.
6. An electronic device comprising a processor and a memory for storing a computer program operable on the processor; wherein,
the processor is configured to perform the method of any one of claims 1 to 4 when running the computer program.
7. A computer storage medium on which a computer program is stored, characterized in that the computer program realizes the method of any one of claims 1 to 4 when executed by a processor.
CN201910985320.XA 2019-10-16 2019-10-16 Data processing method and device, electronic equipment and computer storage medium Active CN112667411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910985320.XA CN112667411B (en) 2019-10-16 2019-10-16 Data processing method and device, electronic equipment and computer storage medium


Publications (2)

Publication Number Publication Date
CN112667411A CN112667411A (en) 2021-04-16
CN112667411B true CN112667411B (en) 2022-12-13

Family

ID=75400426


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570572A (en) * 2015-10-12 2017-04-19 中国石油化工股份有限公司 MapReduce-based travel time computation method and device
CN106815254A (en) * 2015-12-01 2017-06-09 阿里巴巴集团控股有限公司 A kind of data processing method and device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant