CN112667411A - Data processing method and device, electronic equipment and computer storage medium - Google Patents

Data processing method and device, electronic equipment and computer storage medium

Info

Publication number
CN112667411A
CN112667411A (application CN201910985320.XA)
Authority
CN
China
Prior art keywords
data
file
source data
message queue
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910985320.XA
Other languages
Chinese (zh)
Other versions
CN112667411B (en)
Inventor
张永曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910985320.XA priority Critical patent/CN112667411B/en
Publication of CN112667411A publication Critical patent/CN112667411A/en
Application granted granted Critical
Publication of CN112667411B publication Critical patent/CN112667411B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a data processing method and device, an electronic device, and a computer storage medium. The method includes: writing source data in a MapReduce task into a first message queue, the source data comprising a plurality of files; and taking files of the source data out of the first message queue, each Map task in the MapReduce task processing a taken-out file at least once, where every file processed by each Map task is a file of the source data. Based on the first message queue, each file of the source data can thus be dynamically allocated to a different Map task for processing, and the fragment to which a file belongs does not need to be determined before the file is processed; that is, dynamic fragmentation of the source data is realized. Further, because each Map task processes only one file at a time, the difference in data processing time among Map tasks is reduced, and the execution efficiency of the whole MapReduce task is improved.

Description

Data processing method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of big data, and in particular, to a data processing method and apparatus, an electronic device, and a computer storage medium.
Background
MapReduce is a programming model for parallel computation over large-scale data sets. Thanks to the MapReduce programming model, developers without distributed-programming experience can run programs on large-scale distributed systems. The InputFormat module is an important module in the MapReduce programming model: it is mainly responsible for traversing and fragmenting the source data in a MapReduce task, and for parsing the fragmented data with a RecordReader so that the Map tasks can process it. How the InputFormat module fragments the traversed source data is therefore an important factor affecting the execution efficiency of a MapReduce task.
In the prior art, there are two main schemes for fragmenting the traversed source data in a MapReduce task. The first scheme fragments the traversed data uniformly through a UniformSizeInputFormat module and, before the Map tasks process the fragmented data, assigns the corresponding fragment to each Map task in advance, so that every Map task in the MapReduce task processes the same volume of data. However, some Map tasks run slower than others because of network or server conditions, which in turn affects the execution efficiency of the whole MapReduce task.
The second scheme uses a DynamicInputFormat module: the traversed data is first divided into a number of chunks, each chunk containing the same volume of files; when the Map tasks start processing, each Map task randomly acquires one chunk to process and, once finished, randomly acquires the next one. However, because each chunk contains multiple files and a Map task takes a single file as its basic processing unit, an unreasonable number of files per chunk leaves a large gap between the slowest and the fastest Map tasks, which likewise affects the execution efficiency of the whole MapReduce task.
Disclosure of Invention
Embodiments of the present application are intended to provide a data processing method, apparatus, electronic device, and computer storage medium.
The embodiment of the application provides a data processing method, which comprises the following steps:
writing source data in the MapReduce task into a first message queue; the source data comprises a plurality of files;
taking out the file of the source data from the first message queue, and processing the taken-out file at least once by utilizing each Map task in the MapReduce tasks; and each file processed by each Map task is a file of the source data.
Optionally, the writing source data in the MapReduce task into the first message queue includes:
traversing source data in the MapReduce task in a multi-threaded manner, taking out the traversed data, and writing the taken-out data into the first message queue when the taken-out data meets a first preset condition.
Optionally, the traversing source data in the MapReduce task in a multi-thread manner, and fetching the traversed data includes:
determining the number of threads for processing the source data, writing the source data into a non-blocking queue, and taking data out of the non-blocking queue at least once with each thread.
Optionally, the first preset condition includes: the extracted data is a file of source data.
Optionally, the first preset condition further includes: the file name of the taken-out data successfully matches a preset regular expression.
Optionally, the number of Map tasks in the MapReduce task is a specified number.
Optionally, the first message queue is a Redis message queue.
The embodiment of the present application further provides a data processing apparatus, where the apparatus includes: a first processing module and a second processing module, wherein:
a first processing module, configured to write source data in the MapReduce task into a first message queue, the source data comprising a plurality of files;
a second processing module, configured to take files of the source data out of the first message queue and process each taken-out file at least once using each Map task in the MapReduce task, where every file processed by each Map task is a file of the source data.
Optionally, the first processing module is configured to traverse source data in the MapReduce task in a multi-thread manner, take out the traversed data, and write the taken out data into the first message queue when the taken out data meets a first preset condition.
Optionally, the first processing module is configured to determine the number of threads that process the source data, write the source data into a non-blocking queue, and take out data from the non-blocking queue at least once by using each thread.
Optionally, the first preset condition includes: the extracted data is a file of source data.
Optionally, the first preset condition further includes: the file name of the taken-out data successfully matches a preset regular expression.
Optionally, the number of Map tasks in the MapReduce task is a specified number.
Optionally, the first message queue is a Redis message queue.
An embodiment of the present application further provides an electronic device, including a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor is configured to execute any one of the above data processing methods when the computer program is executed.
The embodiment of the present application also provides a computer storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements any one of the above-mentioned data processing methods.
In the data processing method and device, electronic device, and computer storage medium above, source data in a MapReduce task is written into a first message queue, the source data comprising a plurality of files; files of the source data are taken out of the first message queue, and each Map task in the MapReduce task processes a taken-out file at least once, where every file processed by each Map task is a file of the source data. Therefore, based on the first message queue, each file of the source data can be randomly allocated to a different Map task for processing, and the fragment to which a file belongs does not need to be determined before the file is processed; that is, dynamic fragmentation of the source data is realized. Further, since each Map task processes only one file at a time, the difference in data processing time among Map tasks is reduced, and the execution efficiency of the whole MapReduce task is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 2 is a system diagram of data processing according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating thread execution according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the examples provided herein are merely illustrative of the present application and are not intended to limit the present application. In addition, the following examples are provided as partial examples for implementing the present application, not all examples for implementing the present application, and the technical solutions described in the examples of the present application may be implemented in any combination without conflict.
It should be noted that in the embodiments of the present application, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "including a/an ..." does not exclude the presence of other elements (e.g., steps in a method or elements in a device, such as portions of circuitry, processors, programs, software, etc.) in the method or device in which the element is included.
For example, the method for processing data provided by the embodiment of the present application includes a series of steps, but the method for processing data provided by the embodiment of the present application is not limited to the described steps, and similarly, the apparatus for processing data provided by the embodiment of the present application includes a series of modules, but the apparatus provided by the embodiment of the present application is not limited to include the modules explicitly described, and may also include modules that are required to be configured to acquire relevant information or perform processing based on the information.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, C" may mean including any one or more elements selected from the group consisting of A, B, and C.
Embodiments of the application are operational with numerous other general purpose or special purpose computing system environments or configurations, and with terminal and server computing systems. Here, the terminal may be a thin client, a thick client, a hand-held or laptop device, a microprocessor-based system, a set-top box, a programmable consumer electronics, a network personal computer, a small computer system, etc., and the server may be a server computer system, a small computer system, a mainframe computer system, a distributed cloud computing environment including any of the above, etc.
The electronic devices of the terminal, server, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In some embodiments of the present application, a data processing method is provided, which implements dynamic fragmentation of the source data, and each Map task processes only one file at a time, so that a difference between data processing times of the Map tasks is reduced, and execution efficiency of the whole MapReduce task is improved.
Example one
Fig. 1 is a schematic flowchart of a data processing method in an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
s101: writing source data in the MapReduce task into a first message queue; the source data includes a plurality of files.
For an implementation of this step, exemplarily, the source data in the MapReduce task may be traversed in a multi-threaded manner and the traversed data taken out; when the taken-out data meets a first preset condition, the taken-out data is written into the first message queue, and when it does not meet the first preset condition, the data is not written into the first message queue.
In this embodiment of the application, the source data in the MapReduce task may be data including a single directory, where the directory may include a plurality of files, or may be data including multiple layers of nested directories, where each level of the directory may include a file and/or a next level of directory.
In the related art, the source data in a MapReduce task is traversed in a single-threaded manner; when the source data contains a large number of files or deeply nested directories, single-threaded traversal is inefficient. In the embodiment of the application, the source data in the MapReduce task is traversed in a multi-threaded manner and the traversed data is obtained, so that source data with a large data volume and multi-layer nested directories can be traversed faster through concurrent processing, improving the execution efficiency of the MapReduce task.
Alternatively, the number of threads for processing the source data of the MapReduce task can be determined, the source data is written into the non-blocking queue, and the data is taken out of the non-blocking queue at least once by each thread.
Here, the non-blocking queue is a queue into which producers add data and from which consumers take data, a producer being a thread that adds data to the queue and a consumer being a thread that takes data out of it; neither adding nor taking out data blocks execution, which means the non-blocking queue lets multiple threads process its data concurrently. The number of threads may be set according to actual requirements or determined from the number of processor cores on the client server; further, a thread pool may be created with the determined number of threads to coordinate and schedule the multiple threads.
Therefore, based on the thread pool and the non-blocking queue, data in the non-blocking queue can be reasonably coordinated and processed in a multi-thread concurrent mode, and execution efficiency of the MapReduce task is improved.
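As an illustrative sketch of the thread pool plus shared queue described above, the following Python simulation traverses nested directories with a pool of worker threads. It is not the patent's Java implementation: the toy in-memory directory tree, all names, and the use of a blocking `queue.Queue` (with join/sentinel termination instead of the patent's semaphores) in place of a `ConcurrentLinkedQueue` are assumptions made here for a self-contained example.

```python
import queue
import threading

# Toy filesystem: a dict maps a directory path to its children;
# any entry not present in the dict is treated as a file. (Hypothetical data.)
FS = {
    "/src": ["/src/a.txt", "/src/sub"],
    "/src/sub": ["/src/sub/b.txt", "/src/sub/c.log"],
}

def traverse(root, num_threads=4):
    """Workers pop entries from a shared queue; directories push their
    children back onto the queue, files are collected as traversal output."""
    work = queue.Queue()
    work.put(root)
    files, lock = [], threading.Lock()

    def worker():
        while True:
            entry = work.get()
            if entry is None:            # sentinel: stop this worker
                work.task_done()
                return
            if entry in FS:              # a directory: enqueue its children
                for child in FS[entry]:
                    work.put(child)
            else:                        # a file: record it
                with lock:
                    files.append(entry)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    work.join()                          # wait until every entry is processed
    for _ in threads:
        work.put(None)                   # release workers blocked on get()
    for t in threads:
        t.join()
    return sorted(files)

print(traverse("/src"))  # → ['/src/a.txt', '/src/sub/b.txt', '/src/sub/c.log']
```

Because each worker both consumes entries and produces new ones (subdirectories), a shared queue is what lets the traversal speed scale with the number of threads rather than with directory nesting depth.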
Optionally, the first preset condition includes: the retrieved traversed data is a file in the source data.
It can be seen that when the retrieved traversed data is a file in the source data, the basic requirements of subsequent data processing can be satisfied.
Optionally, the first preset condition further includes: the file name of the extracted traversed data successfully matches a preset regular expression.
Here, file-name regular matching means selecting, through regular matching with the file name as the preset condition, data whose names are consistent with a preset file-name pattern; exemplarily, the regular matching of file names can be implemented with a regular expression.
Therefore, the traversed data can be screened through file name regular matching, expected data can be obtained, and the execution efficiency of tasks is further improved through filtering invalid data.
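A minimal sketch of the file-name regular matching step above. The concrete pattern is hypothetical — the patent leaves the actual expression to the user's configuration:

```python
import re

# Hypothetical pattern: keep only .txt files with word characters or dashes.
FILENAME_PATTERN = re.compile(r"[\w-]+\.txt")

def passes_filter(path):
    """First-preset-condition sketch: the entry's file name must fully
    match the configured regular expression."""
    name = path.rsplit("/", 1)[-1]       # take the file name component
    return bool(FILENAME_PATTERN.fullmatch(name))

print(passes_filter("data/part-001.txt"))  # True
print(passes_filter("data/part-001.tmp"))  # False
```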
In a specific example, the number of cores of the processor on the client server is obtained through the InputFormat module, where the processor is a Central Processing Unit (CPU). The number of threads for processing the source data in the MapReduce task is determined according to the number of CPU cores, and a thread pool, here a FixedThreadPool, is created with the determined number of threads; the thread pool implements coordination and scheduling of the multiple threads. At the same time, a non-blocking queue is created, here a ConcurrentLinkedQueue, together with semaphores (Semaphore) whose number equals the number of threads in the thread pool; the semaphores represent the data state in the non-blocking queue. The source data in the MapReduce task is analyzed and determined to be data containing a single directory, the source data is written into the non-blocking queue, and all threads in the thread pool are started to process the data in the queue concurrently: each thread takes data out of the non-blocking queue at least once, and when the taken-out data is a file and its file name matches the regular expression, the taken-out data is written into the first message queue.
S102: taking out the file of the source data from the first message queue, and processing the taken-out file at least once by utilizing each Map task in the MapReduce tasks; and each file processed by each Map task is a file of the source data.
Optionally, the first message queue may be a message queue with high concurrency characteristics, and illustratively, the first message queue may be a Redis message queue.
Therefore, by utilizing the characteristic of high concurrency of the first message queue, all Map tasks can process data in the first message queue at the same time, and the execution efficiency of the MapReduce task is further improved.
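The idea that all Map tasks concurrently consume single files from the shared queue can be simulated as follows. Python threads stand in for Map tasks, a local `queue.Queue` stands in for the Redis message queue, and the file names and per-file costs are made-up values for illustration:

```python
import queue
import threading
import time

# Simulated message queue holding per-file work items with uneven "sizes"
# (hypothetical values; in the patent the queue is a Redis queue of files).
msg_queue = queue.Queue()
for name, cost in [("f1", 0.03), ("f2", 0.01), ("f3", 0.01), ("f4", 0.01)]:
    msg_queue.put((name, cost))

assignments = {0: [], 1: []}

def map_task(task_id):
    """Each Map task repeatedly takes exactly one file from the shared
    queue and processes it -- the dynamic-fragmentation idea of the method."""
    while True:
        try:
            name, cost = msg_queue.get_nowait()
        except queue.Empty:
            return                       # queue drained: this Map task is done
        time.sleep(cost)                 # stand-in for per-file processing time
        assignments[task_id].append(name)

threads = [threading.Thread(target=map_task, args=(i,)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()

# Every file is processed exactly once; the split across tasks is dynamic.
processed = sorted(assignments[0] + assignments[1])
print(processed)  # → ['f1', 'f2', 'f3', 'f4']
```

Which task ends up with which file varies from run to run — that is precisely the point: no fragment assignment is fixed before processing begins.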
Alternatively, the number of Map tasks in the MapReduce task may be a specified number; for example, the user may set the desired number of Map tasks through the mapreduce.job.maps parameter.
In the related art, the UniformSizeInputFormat module determines the number of fragments, that is, the number of Map tasks, by fixing a fragment count in advance while requiring every fragment to carry the same volume of data; because individual file sizes differ, the fragment count has to be adjusted while equalizing fragment sizes, so the finally determined number of Map tasks can differ substantially from the expected number. The DynamicInputFormat module determines the fragment count as Min(the number the user expects, the number of generated chunks); since the chunk-generation logic is complex and the resulting estimate inaccurate, the final number of Map tasks likewise deviates from the expected number. In the embodiment of the present application, by contrast, the number of Map tasks can be specified directly by the user and is guaranteed to remain unchanged while the MapReduce task executes, so the finally determined number of Map tasks is consistent with the number the user expects. This allows cluster resources to be used reasonably and lets the user tune and optimize the MapReduce task more effectively on the basis of an accurate Map task count.
In an actual application scenario, referring to fig. 2, a system schematic diagram of data processing provided in an embodiment of the present application, the InputFormat module, acting as the producer, traverses the source data in the MapReduce task in a multi-threaded manner and takes out the traversed data; when the taken-out data is a file and its file name matches the regular expression, the taken-out data is written into the Redis message queue. Meanwhile, the expected number of Map tasks is set through the mapreduce.job.maps parameter in the MapReduce model. While the MapReduce task runs, all Map tasks, acting as consumers, concurrently process data taken out of the Redis message queue, each Map task processing one file at a time; when all data in the Redis message queue has been consumed, execution of the MapReduce task is complete.
It can be seen that, based on the Redis message queue, the file of each source data can be randomly allocated to different Map tasks for processing, and before each file is processed, the corresponding fragment does not need to be determined, that is, dynamic fragmentation of the source data is realized, further, each Map task only processes one file at a time, so that the difference between the data processing time of each Map task is reduced, and the execution efficiency of the whole MapReduce task is improved.
In practical applications, the steps S101 to S102 may be implemented based on a Processor in an electronic Device, and the Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is to be understood that the electronic device for implementing the above-described processor function may be other electronic devices, and the embodiments of the present application are not limited in particular.
The embodiment of the application provides a data processing method, which comprises the steps of writing source data in a MapReduce task into a first message queue; the source data comprises a plurality of files; taking out the file of the source data from the first message queue, and processing the taken-out file at least once by utilizing each Map task in the MapReduce tasks; and each file processed by each Map task is a file of the source data. Therefore, based on the first message queue, the files of each source data can be randomly allocated to different Map tasks for processing, and before each file is processed, the fragments to which each file belongs do not need to be determined, that is, dynamic fragmentation of the source data is realized.
Example two
In order to further embody the purpose of the present application, a further example is provided on the basis of the first embodiment of the present application.
The second embodiment of the present application provides a specific implementation manner that source data in a MapReduce task is traversed in a multithreading manner, and meanwhile, data that meets a first preset condition after traversal is written into a first message queue, where the first message queue is a Redis message queue.
Firstly, according to the number of CPU cores on the client server obtained by the InputFormat module, the number of threads is determined to be 20 and a thread pool containing 20 threads is created; at the same time, a non-blocking queue, a ConcurrentLinkedQueue, is created, together with semaphores whose number equals the number of threads in the thread pool. After the source data of the MapReduce task, which contains multiple layers of nested directories, is written into the non-blocking queue, the 20 threads in the thread pool are started and process the data in the non-blocking queue concurrently.
Referring to fig. 3, a thread execution flowchart provided in the embodiment of the present application, each thread first executes a pop operation on the non-blocking queue. If the non-blocking queue is empty, that is, there is no data in it, the thread executes a release operation, marking its semaphore as the released state, which indicates that its data processing is finished; it then checks the states of the other semaphores. When all semaphores are in the released state, processing of the data in the non-blocking queue is finished and the thread exits; when some semaphore is not yet in the released state, the thread re-acquires its semaphore and re-executes the pop operation. If the non-blocking queue is not empty, the thread judges whether the taken-out data is a file or a directory. When it is determined to be a file, file-name regular matching is performed on it: on success, the file information is packaged into a JavaBean object FilePair, serialized with Gson, and written into the Redis message queue, after which the pop operation on the non-blocking queue is re-executed; on failure, the file is discarded and the pop operation is re-executed. When the taken-out data is determined to be a directory, a list operation is executed on it and the result of the list operation is obtained: each directory in the result is written back into the non-blocking queue, and each file in the result undergoes file-name regular matching — on failure the file is discarded, on success the file is written into the Redis message queue — and in either case the pop operation of the non-blocking queue is then re-executed.
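The FilePair packaging and Gson-style serialization step can be sketched in Python with a dataclass and JSON. The field names of FilePair are assumptions — the patent does not enumerate them:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class FilePair:
    """Sketch of the FilePair bean; the fields here are assumed."""
    path: str
    length: int

def to_queue_element(pair):
    # Gson-style serialization: the object becomes a JSON string that can be
    # pushed onto the Redis message queue and later deserialized by a Map task.
    return json.dumps(asdict(pair))

def from_queue_element(element):
    # Gson-style deserialization back into the bean.
    return FilePair(**json.loads(element))

elem = to_queue_element(FilePair("/src/a.txt", 1024))
print(elem)
print(from_queue_element(elem) == FilePair("/src/a.txt", 1024))  # True
```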
When all threads have exited, the source data in the MapReduce task has been traversed in a multi-threaded manner, and the traversed data meeting the preset conditions has been written into the Redis message queue.
Therefore, source data of a MapReduce task with a large data volume and multi-layer nested directories can be traversed faster through concurrent processing, improving the processing efficiency of the MapReduce task.
EXAMPLE III
In order to further embody the purpose of the present application, further illustration is performed on the basis of the first and second embodiments of the present application.
The third embodiment of the present application provides a specific implementation manner of a method for implementing dynamic fragmentation by using a first message queue, and meanwhile, the fragmentation granularity is accurate to a file level, where the first message queue is a Redis message queue.
In a specific application scenario, following the method of the second embodiment, a Redis message queue containing the traversed data that meets the preset condition is first obtained through the InputFormat module. The desired number of Map tasks, that is, the number of fragments, is then set through the mapreduce.job.maps parameter in the MapReduce model, and the data in the Redis message queue is fragmented according to the determined fragment count, each fragment corresponding to a fragment object; in this application scenario, the fragment object is a RedisSplit object. During fragmentation, the size of each fragment is first determined through the getLength() method; in this application scenario, every fragment is specified to have the same size, whose specific value is determined as the total number of files divided by the number of fragments. The node names where each fragment is stored and the storage information of each fragment on each node are then returned through the getLocations() and getLocationInfo() methods respectively; in this application scenario, to realize dynamic fragmentation, data-locality factors need to be ignored, so both methods return null values.
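The getLength() sizing rule above — total number of files divided by the number of fragments — can be sketched directly. The patent does not specify how a non-divisible remainder is handled, so assigning it to the last fragment is an assumption made here:

```python
def split_sizes(total_files, num_splits):
    """Equal-sized fragments: each gets total_files // num_splits files;
    the remainder (handling assumed) goes to the last fragment."""
    base = total_files // num_splits
    sizes = [base] * num_splits
    sizes[-1] += total_files - base * num_splits  # leftover files, if any
    return sizes

print(split_sizes(10, 4))  # → [2, 2, 2, 4]
```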
Further, all RedisSplit objects need to be initialized; that is, the relevant configuration information of the Redis message queue is initialized into every RedisSplit object. After all RedisSplit objects have been initialized, the generated list containing all RedisSplit objects is returned to the MapReduce framework, and the fragmentation process ends.
Further, each fragment in the Redis message queue needs to be parsed into Key/Value pairs for processing by a Map task. In this application scenario, the parsing of each fragment in the Redis message queue is implemented by a RedisRecordReader, which inherits from the parent class RecordReader and is therefore consistent with the RecordReader's function. In the parsing process, specifically: first, in the initialize() method, a Jedis object is initialized according to the relevant configuration information of the Redis message queue in the RedisSplit object; the Jedis object operates on the Redis message queue, so the queue can be operated on once initialize() has run. Second, in the nextKeyValue() method, an element is taken out of the Redis message queue in a non-blocking manner through a pop operation; the queue element is deserialized with Gson into a JavaBean object FilePair, from which the Key and the Value are obtained through the getCurrentKey() and getCurrentValue() methods respectively, so that the Map task can process the Key/Value pair. Further, during RedisRecordReader parsing, the data-processing percentage can be determined through the getProgress() method, where the percentage is determined by formula (1):
progress = numRecordProcessedByThisMap / (numRecordProcessedByThisMap + llen)    (1)
wherein numRecordProcessedByThisMap represents the number of Key-Value pairs already processed, and llen, the remaining length of the Redis message queue, can be obtained by the llen method.
When all data in the Redis message queue has been taken out, the nextKeyValue() method of the RedisRecordReader returns false, indicating that all data in the Redis message queue has been processed by the Map tasks.
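The RedisRecordReader logic described above can be sketched as follows (a Python stand-in for illustration: a deque plays the role of the Redis message queue, json plays the role of Gson, and the method names mirror the RecordReader interface described in the text; the FilePair field names are hypothetical):

```python
import json
from collections import deque

class RedisRecordReaderSketch:
    """Illustrative stand-in for the RedisRecordReader described above.

    A deque simulates the Redis message queue; each queue element is a
    serialized FilePair from which a Key and a Value are recovered.
    """

    def __init__(self, queue):
        self.queue = queue            # stands in for the Jedis connection
        self.processed = 0            # numRecordProcessedByThisMap
        self.key = None
        self.value = None

    def next_key_value(self):
        # Non-blocking pop of the last element (like a Redis RPOP):
        # returns False once the queue is empty.
        if not self.queue:
            return False
        element = self.queue.pop()
        pair = json.loads(element)    # Gson-style deserialization -> FilePair
        self.key, self.value = pair["key"], pair["value"]
        self.processed += 1
        return True

    def get_progress(self):
        # Formula (1): processed / (processed + remaining queue length)
        remaining = len(self.queue)   # stands in for the llen method
        total = self.processed + remaining
        return self.processed / total if total else 1.0

queue = deque(json.dumps({"key": f"/data/f{i}", "value": i}) for i in range(4))
reader = RedisRecordReaderSketch(queue)
reader.next_key_value()
print(reader.get_progress())  # 0.25
```

Note that progress here is only an estimate: because all Map tasks drain the same shared queue, the remaining queue length reflects work left for every task, not just this one.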
It can be seen that, based on the Redis message queue, the files of the source data can be randomly allocated to different Map tasks for processing, and before each file is processed, its corresponding split does not need to be determined; that is, dynamic sharding of the source data is realized. Further, since each Map task processes only one file at a time, the difference between the data processing times of the Map tasks is reduced, and the execution efficiency of the whole MapReduce task is improved.
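The load-balancing effect can be illustrated with a small simulation (Python, with hypothetical per-file processing costs): because every Map task takes exactly one file at a time from the shared queue, a task that finishes a small file simply takes the next one, instead of being stuck with a fixed, pre-computed split:

```python
from collections import deque

def run_map_tasks(file_costs, num_tasks):
    """Greedy simulation of dynamic sharding: each task repeatedly takes
    one file from the shared queue, so total work stays balanced."""
    queue = deque(file_costs)
    busy_until = [0] * num_tasks                # accumulated time per task
    while queue:
        t = busy_until.index(min(busy_until))   # task that frees up first
        busy_until[t] += queue.popleft()        # it takes exactly one file
    return busy_until

# Hypothetical processing costs of 6 files (arbitrary units).
loads = run_map_tasks([5, 1, 1, 1, 1, 1], num_tasks=2)
print(max(loads) - min(loads))  # 0
```

With a static half-and-half split of the same six files, one task would carry 7 units and the other 3; the dynamic queue evens both out to 5.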
Example four
Aiming at the data processing method in the first embodiment of the application, a data processing device is further provided in the fourth embodiment of the application.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: a first processing module 400 and a second processing module 401, wherein:
the first processing module 400 is configured to write source data in the MapReduce task into a first message queue; the source data comprises a plurality of files;
the second processing module 401 is configured to take the files of the source data out of the first message queue, and to process the taken-out files at least once by using each Map task in the MapReduce task; each file processed by each Map task is a file of the source data.
In an embodiment, the first processing module 400 is configured to traverse source data in the MapReduce task in a multi-threaded manner, fetch the traversed data, and write the fetched data into the first message queue when the fetched data meets a first preset condition.
In one embodiment, the first processing module 400 is configured to determine the number of threads to process the source data, write the source data into a non-blocking queue, and fetch data from the non-blocking queue at least once by using each thread.
In one embodiment, the first preset condition includes: the extracted data is a file of source data.
In an embodiment, the first preset condition further includes: the file name of the taken-out data is successfully matched by a regular expression.
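The write-side behaviour described by the embodiments above (multi-threaded traversal through a non-blocking queue, plus the two filter conditions) can be sketched as follows; the regular expression and path names are hypothetical, and the output queue stands in for the first message queue:

```python
import queue
import re
import threading

def write_source_data(entries, message_queue, num_threads, pattern):
    """Threads drain a non-blocking queue of traversed entries and push
    those satisfying the first preset condition into the message queue."""
    work = queue.Queue()
    for entry in entries:                       # (path, is_file) tuples
        work.put(entry)

    def worker():
        while True:
            try:
                path, is_file = work.get_nowait()   # non-blocking fetch
            except queue.Empty:
                return
            # First preset condition: the entry is a file of the source
            # data AND its file name matches the regular expression.
            if is_file and re.search(pattern, path):
                message_queue.put(path)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

out = queue.Queue()
entries = [("/src/a.log", True), ("/src/b.tmp", True), ("/src/dir", False)]
write_source_data(entries, out, num_threads=2, pattern=r"\.log$")
print(out.qsize())  # 1
```

Only `/src/a.log` passes both conditions: the directory fails the is-a-file test, and `b.tmp` fails the regular-expression match.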
In an embodiment, the number of Map tasks in the MapReduce task is a specified number.
In one embodiment, the first message queue is a Redis message queue.
In practical applications, the first processing module 400 and the second processing module 401 may be implemented by a processor located in an electronic device, and the processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor.
In addition, each functional module in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on this understanding, the essence of the technical solution of this embodiment, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Specifically, the computer program instructions corresponding to the data processing method in this embodiment may be stored on a storage medium such as an optical disk, a hard disk, or a USB flash drive, and when the computer program instructions corresponding to the data processing method in the storage medium are read or executed by an electronic device, the data processing method of any one of the foregoing embodiments is implemented.
Based on the same technical concept of the foregoing embodiment, referring to fig. 5, it shows an electronic device 50 provided in an embodiment of the present application, which may include: a memory 51 and a processor 52; wherein,
the memory 51 for storing computer programs and data;
the processor 52 is configured to execute the computer program stored in the memory to implement any one of the data processing methods of the foregoing embodiments.
In practical applications, the memory 51 may be a volatile memory such as a RAM, or a non-volatile memory such as a ROM, a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD), or a combination of the above types of memories, and it provides instructions and data to the processor 52.
The processor 52 may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present application may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight the differences between the embodiments; the same or similar parts may be referred to each other and, for brevity, are not repeated herein.
The methods disclosed in the method embodiments provided by the present application can be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in various product embodiments provided by the application can be combined arbitrarily to obtain new product embodiments without conflict.
The features disclosed in the various method or apparatus embodiments provided herein may be combined in any combination to arrive at new method or apparatus embodiments without conflict.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of data processing, the method comprising:
writing source data in the MapReduce task into a first message queue; the source data comprises a plurality of files;
taking the files of the source data out of the first message queue, and processing the taken-out files at least once by using each Map task in the MapReduce task; wherein each file processed by each Map task is a file of the source data.
2. The method of claim 1, wherein writing the source data in the MapReduce task into the first message queue comprises:
and traversing source data in the MapReduce task in a multithreading mode, taking out the traversed data, and writing the taken out data into the first message queue when the taken out data meets a first preset condition.
3. The method according to claim 2, wherein the multithreading for traversing the source data in the MapReduce task and fetching the traversed data comprises:
and determining the number of threads for processing the source data, writing the source data into a non-blocking queue, and taking out the data from the non-blocking queue at least once by using each thread.
4. The method according to claim 2, wherein the first preset condition comprises:
the extracted data is a file of source data.
5. The method of claim 4, wherein the first preset condition further comprises:
the file name of the taken-out data is successfully matched by a regular expression.
6. The method according to claim 1, wherein the number of Map tasks in the MapReduce tasks is a specified number.
7. The method of claim 1, wherein the first message queue is a Redis message queue.
8. An apparatus for data processing, the apparatus comprising: a first processing module and a second processing module, wherein:
a first processing module, configured to write source data in the MapReduce task into a first message queue, wherein the source data comprises a plurality of files;
a second processing module, configured to take the files of the source data out of the first message queue, and to process the taken-out files at least once by using each Map task in the MapReduce task; wherein each file processed by each Map task is a file of the source data.
9. An electronic device comprising a processor and a memory for storing a computer program operable on the processor; wherein,
the processor is configured to perform the method of any one of claims 1 to 7 when running the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN201910985320.XA 2019-10-16 2019-10-16 Data processing method and device, electronic equipment and computer storage medium Active CN112667411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910985320.XA CN112667411B (en) 2019-10-16 2019-10-16 Data processing method and device, electronic equipment and computer storage medium


Publications (2)

Publication Number Publication Date
CN112667411A true CN112667411A (en) 2021-04-16
CN112667411B CN112667411B (en) 2022-12-13

Family

ID=75400426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910985320.XA Active CN112667411B (en) 2019-10-16 2019-10-16 Data processing method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112667411B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570572A (en) * 2015-10-12 2017-04-19 中国石油化工股份有限公司 MapReduce-based travel time computation method and device
CN106815254A (en) * 2015-12-01 2017-06-09 阿里巴巴集团控股有限公司 A kind of data processing method and device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant