CN112241872A

CN112241872A - Distributed data calculation analysis method, device, equipment and storage medium

Info

Publication number: CN112241872A
Application number: CN202011087359.9A
Authority: CN
Inventors: 黄培
Original assignee: Shanghai Zhongyan Network Technology Co ltd
Current assignee: Shanghai Zhongyan Network Technology Co ltd
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2021-01-19

Abstract

The application discloses a distributed data calculation analysis method, device, equipment and storage medium. The method comprises: obtaining a project to be analyzed and generating a corresponding analysis model; acquiring data to be calculated and dividing it into a plurality of data blocks; establishing calculation analysis tasks and using them to perform calculation analysis on each data block to obtain a corresponding temporary analysis result; and merging the temporary analysis results to obtain a final data analysis result. The method is built on the Python language and adopts a multi-level approach: it first segments the projects to be analyzed, then segments the data, and after each data block is analyzed merges the temporary analysis results into the final data analysis result. Only a Python application environment needs to be deployed, so deployment is simple and data integration is easy. The application thereby solves the technical problem in the related art that analysis and calculation on massive data mostly depend on a framework that is complex to build and deploy and does not lend itself to data integration.

Description

Distributed data calculation analysis method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a distributed data computation analysis method, apparatus, device, and storage medium.

Background

Distributed data processing is the processing of data using distributed computing techniques. With the rapid expansion of data volume, the data volume faced by internet companies has reached PB level, and traditional centralized data processing has become increasingly unable to meet market demand. A distributed network is composed of several intercommunicating computers, each having its own processor and storage device, and the huge computing tasks originally concentrated on a single node are load-balanced and assigned to the computers in the distributed network for parallel processing.

At present, several frameworks for distributed data analysis exist on the market, such as Spark, Dask, Loky and Celery, but each has disadvantages in specific scenarios. For example, Spark is troublesome to deploy; Dask is powerful, but its distinctive underlying structure makes data integration inconvenient; Loky has little documentation, is relatively low-level, and has a high barrier to use; Celery requires code to be embedded and has low performance.

In the prior art, analysis and calculation on massive data mostly depend on a framework, building and deployment are complex, and data integration is not easy.

No effective solution has yet been proposed for these problems in the related art, namely that analysis and calculation on massive data mostly depend on a framework, that building and deployment are complex, and that data integration is difficult.

Disclosure of Invention

The main purpose of the application is to provide a distributed data calculation analysis method, device, equipment and storage medium, so as to solve the problems in the related art that analysis and calculation on massive data mostly depend on a framework, that building and deployment are complex, and that data integration is not easy.

In order to achieve the above object, in a first aspect, the present application provides a distributed data computation analysis method.

The method according to the application comprises the following steps:

acquiring a project to be analyzed, and generating a corresponding analysis model according to the type of the project to be analyzed;

acquiring data to be calculated of an item to be analyzed;

dividing the data to be calculated into a plurality of data blocks according to a preset division rule corresponding to the analysis model;

establishing a calculation analysis task, and performing calculation analysis on each data block by using the calculation analysis task to obtain a corresponding temporary analysis result;

and combining each temporary analysis result to obtain a final data analysis result.

In one possible implementation manner of the present application, the preset segmentation rule is: evenly dividing the data to be calculated into a plurality of data blocks, wherein the number of data blocks is the same as the number of idle processes of the system.

In one possible implementation of the present application, the number of computational analysis tasks is the same as the number of data blocks.

In a possible implementation manner of the present application, performing calculation analysis on each data block by using the calculation analysis task to obtain a corresponding temporary analysis result specifically includes: performing calculation analysis on the data blocks in parallel, one calculation analysis task per data block, to obtain the corresponding temporary analysis results.

In a possible implementation manner of the present application, merging each temporary analysis result to obtain a final data analysis result includes:

merging the temporary analysis results according to a merge calculation framework corresponding to the analysis model, generating a final result file, and obtaining the final data analysis result.

In one possible implementation of the present application, the method supports stand-alone applications and multi-machine distributed applications.

In a second aspect, the present application further provides a distributed data computation analysis apparatus, including:

the acquisition module is used for acquiring the items to be analyzed and the data to be calculated of the items to be analyzed;

the processing module is used for generating a corresponding analysis model according to the type of the item to be analyzed;

and the merging output module is used for merging each temporary analysis result to obtain a final data analysis result.

In one possible implementation manner of the present application, the processing module includes:

and the parallel processing unit is used for performing calculation analysis on each data block in parallel by using each calculation analysis task to obtain a corresponding temporary analysis result.

In one possible implementation manner of the present application, the merging output module is specifically configured to:

In a third aspect, the present application further provides a distributed data computing and analyzing electronic device, where the electronic device includes:

one or more processors;

a memory; and

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the distributed data computational analysis method of any of the first aspects.

In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, the computer program being loaded by a processor to perform the steps of the distributed data computation analysis method of any one of the first aspect.

In the embodiments of the application, the distributed data calculation analysis method is built on the Python language and adopts a multi-level approach: the projects to be analyzed are segmented first, then the data to be calculated of each project is segmented, each data block is analyzed, and the temporary analysis results are merged to obtain the final data analysis result. This solves the technical problems in the related art that analysis and calculation on massive data mostly depend on a framework, that building and deployment are complex, and that data integration is difficult.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:

FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a distributed data computation analysis method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an embodiment of a distributed data computation analysis apparatus according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an embodiment of a distributed data computation analysis electronic device according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.

Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.

In addition, the term "plurality" shall mean two as well as more than two.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

First, an embodiment of the present application provides a distributed data calculation and analysis method, where an execution subject of the distributed data calculation and analysis method is a distributed data calculation and analysis device, the distributed data calculation and analysis device is applied to a processor, and the distributed data calculation and analysis method includes: acquiring a project to be analyzed, and generating a corresponding analysis model according to the type of the project to be analyzed; acquiring data to be calculated of an item to be analyzed; dividing the data to be calculated into a plurality of data blocks according to a preset division rule corresponding to the analysis model; establishing a calculation analysis task, and performing calculation analysis on each data block by using the calculation analysis task to obtain a corresponding temporary analysis result; and combining each temporary analysis result to obtain a final data analysis result.

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a distributed data calculation and analysis method according to an embodiment of the present application, where the distributed data calculation and analysis method includes:

101. Acquiring the item to be analyzed, and generating a corresponding analysis model according to the type of the item to be analyzed.

In the embodiment of the present application, for example, if the item to be analyzed is a research project whose data to be calculated comprises 5 million records, the corresponding analysis model is a model that produces statistical results for each topic of the research project.
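The source does not say how an analysis model is represented or selected. One way to picture step 101 is a lookup from project type to a small object that bundles the per-block analysis function with the matching merge function. The sketch below assumes exactly that; `AnalysisModel`, `MODEL_REGISTRY`, `build_model`, the `"research"` type key and the record shape (topic mapped to answer) are all hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed record shape: topic id -> answer


def count_answers(block: List[Record]) -> Counter:
    """Analyze one data block: tally every (topic, answer) pair."""
    tally: Counter = Counter()
    for record in block:
        tally.update(record.items())
    return tally


def sum_counters(partials: List[Counter]) -> Counter:
    """Merge temporary results by adding the tallies together."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


@dataclass(frozen=True)
class AnalysisModel:
    """Hypothetical model: how to analyze a block and how to merge results."""
    name: str
    analyze: Callable[[List[Record]], Counter]
    merge: Callable[[List[Counter]], Counter]


# Hypothetical registry: project type -> analysis model.
MODEL_REGISTRY: Dict[str, AnalysisModel] = {
    "research": AnalysisModel("per-topic statistics", count_answers, sum_counters),
}


def build_model(project_type: str) -> AnalysisModel:
    """Return the analysis model registered for the given project type."""
    try:
        return MODEL_REGISTRY[project_type]
    except KeyError:
        raise ValueError(f"no analysis model for project type {project_type!r}") from None
```

Keeping the analysis and merge functions in one object means later steps only need the model, not the project type, to decide how a block is processed and how results are combined.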

102. Acquiring data to be calculated of the item to be analyzed.

In the embodiment of the present application, the data to be calculated of each item to be analyzed is stored centrally in a database, and acquiring the data to be calculated means retrieving it from that database.
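As an illustration of retrieving the data to be calculated from a database, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The source does not name a database product or schema, so the `responses` table, its columns, and the helper names `load_rows` and `make_demo_database` are assumptions made only for this example.

```python
import sqlite3
from typing import List, Tuple


def load_rows(conn: sqlite3.Connection, project_id: int) -> List[Tuple[str, str]]:
    """Fetch every (topic, answer) row stored for one project, in insertion order."""
    cursor = conn.execute(
        "SELECT topic, answer FROM responses WHERE project_id = ? ORDER BY rowid",
        (project_id,),
    )
    return cursor.fetchall()


def make_demo_database() -> sqlite3.Connection:
    """Build a small in-memory database using the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses (project_id INTEGER, topic TEXT, answer TEXT)"
    )
    conn.executemany(
        "INSERT INTO responses VALUES (?, ?, ?)",
        [(1, "q1", "yes"), (1, "q1", "no"), (2, "q1", "yes")],
    )
    conn.commit()
    return conn
```

The query is parameterized by project so that each project's data can be fetched and segmented independently.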

103. Dividing the data to be calculated into a plurality of data blocks according to a preset division rule corresponding to the analysis model.

In the embodiment of the present application, the preset segmentation rule is to divide the data to be calculated evenly into a plurality of data blocks, the number of data blocks being the same as the number of idle processes of the system. For example, if the system has 500 idle processes, the 5 million records are evenly divided into 500 data blocks of 10,000 records each.
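The even-division rule can be sketched as follows. The function name `split_evenly` is hypothetical, and the handling of a record count that is not an exact multiple of the block count (the first few blocks receive one extra record) is an assumption, since the source only gives an example that divides exactly.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(data: Sequence[T], block_count: int) -> List[Sequence[T]]:
    """Split data into block_count contiguous blocks whose sizes differ by at most one."""
    if block_count < 1:
        raise ValueError("block_count must be at least 1")
    base, extra = divmod(len(data), block_count)
    blocks = []
    start = 0
    for index in range(block_count):
        # Assumption: the first `extra` blocks absorb the remainder, one record each.
        size = base + (1 if index < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks
```

With the figures from the text, `split_evenly(range(5_000_000), 500)` yields 500 blocks of 5,000,000 / 500 = 10,000 records each.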

104. Establishing a calculation analysis task, and performing calculation analysis on each data block by using the calculation analysis task to obtain a corresponding temporary analysis result.

In the embodiment of the application, the number of calculation analysis tasks created is the same as the number of data blocks, and therefore the same as the number of idle processes of the system. Each data block may correspond to one calculation analysis task and to one idle process, so the idle processes can invoke the calculation analysis tasks to analyze all data blocks in parallel at the same time and obtain the corresponding temporary analysis results. In the example above, 500 temporary analysis results are obtained.
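A minimal sketch of "one task per data block, run in parallel" using the standard library's `concurrent.futures` is shown below. The source does not say which Python mechanism is used, so the choice of `ProcessPoolExecutor`, the function names `count_answers` and `analyze_blocks`, and the `executor_cls` parameter are assumptions for illustration.

```python
import concurrent.futures as cf
from collections import Counter
from typing import Callable, List, Sequence, Type


def count_answers(block: Sequence[str]) -> Counter:
    """Hypothetical analysis task: tally the answers found in one data block."""
    return Counter(block)


def analyze_blocks(
    blocks: List[Sequence[str]],
    task: Callable[[Sequence[str]], Counter] = count_answers,
    executor_cls: Type[cf.Executor] = cf.ProcessPoolExecutor,
) -> List[Counter]:
    """Run one task per data block in parallel; return results in block order."""
    if not blocks:
        return []
    # Assumption: one worker per block, mirroring "tasks == blocks == idle processes".
    with executor_cls(max_workers=len(blocks)) as pool:
        futures = [pool.submit(task, block) for block in blocks]
        return [future.result() for future in futures]
```

Passing `executor_cls=cf.ThreadPoolExecutor` runs the same code with threads, which is convenient for debugging. With the default process pool, the task must be a picklable top-level function, and on platforms that start workers by spawning, the call should sit under an `if __name__ == "__main__":` guard.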

105. Merging each temporary analysis result to obtain a final data analysis result.

In the embodiment of the application, the 500 temporary analysis results are merged according to the merge calculation framework corresponding to the analysis model, a final result file is generated, and the final data analysis result is obtained. Afterwards, rapid data analysis and data sharing require only loading the final data analysis result from the final result file.
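The merge step and the final result file can be sketched as below for the simple case where each temporary result is a tally. The source does not specify the merge calculation or the file format, so summing `Counter` objects, the JSON format, and the function names are assumptions for illustration only.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable


def merge_results(partials: Iterable[Counter]) -> Counter:
    """Merge temporary analysis results by summing their tallies."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def write_result_file(result: Counter, path: Path) -> None:
    """Persist the final result as JSON (assumed format) so it can be shared."""
    path.write_text(json.dumps(dict(result), sort_keys=True), encoding="utf-8")


def load_result_file(path: Path) -> Dict[str, int]:
    """Load a previously written final result without recomputing anything."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Summing works here because counting is associative; an analysis model whose statistic is not a plain sum (a median, for example) would need its own merge function.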

The method supports both single-machine and multi-machine distributed applications and can perform real-time statistical analysis on project volumes and data volumes at the billion level. It is built on the Python language and adopts a multi-level approach: for a project volume at the ten-million level, it first segments by project to be analyzed, then segments the data to be calculated of each project, evenly dividing that data into a plurality of data blocks, and after each data block is analyzed merges the temporary analysis results to obtain the final data analysis result of each project. Only a Python application environment needs to be deployed, so deployment is simple, data integration is easy, and no other configuration is required. Concurrency can be increased simply by adding service processes, which makes the method easy to scale.
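The two-level segmentation (by project first, then by data block within each project) can be illustrated with a single worker pool that receives one task per (project, block) pair. This is only a sketch under stated assumptions: the name `analyze_projects`, the strided slicing used to form blocks, and the use of `collections.Counter` as the analysis task are all hypothetical, and the source does not describe how work is scheduled across projects.

```python
import concurrent.futures as cf
from collections import Counter
from typing import Dict, Sequence, Type


def analyze_projects(
    projects: Dict[str, Sequence[str]],
    blocks_per_project: int,
    executor_cls: Type[cf.Executor] = cf.ProcessPoolExecutor,
) -> Dict[str, Counter]:
    """Level 1: iterate over projects. Level 2: split each project's data into blocks."""
    if blocks_per_project < 1:
        raise ValueError("blocks_per_project must be at least 1")
    final: Dict[str, Counter] = {name: Counter() for name in projects}
    with executor_cls(max_workers=blocks_per_project) as pool:
        pending = {}
        for name, data in projects.items():  # level 1: one pass per project
            for offset in range(blocks_per_project):  # level 2: one task per block
                # Strided slicing gives blocks whose sizes differ by at most one.
                block = list(data[offset::blocks_per_project])
                pending[pool.submit(Counter, block)] = name
        for future in cf.as_completed(pending):
            # Merge each temporary result into its own project's final result.
            final[pending[future]].update(future.result())
    return final
```

Because every temporary result is tagged with its project, results can be merged as soon as they complete, regardless of the order in which blocks finish.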

In order to better implement the distributed data calculation and analysis method in the embodiment of the present application, on the basis of the distributed data calculation and analysis method, an embodiment of the present application further provides a distributed data calculation and analysis device, as shown in fig. 2, the distributed data calculation and analysis device 200 includes:

an obtaining module 201, configured to obtain an item to be analyzed and data to be calculated of the item to be analyzed;

the processing module 202 is configured to generate a corresponding analysis model according to the type of the item to be analyzed;

and a merging output module 203, configured to merge each temporary analysis result to obtain a final data analysis result.

In some embodiments of the present application, the processing module 202 comprises:

the parallel processing unit 2021 is configured to perform computation analysis on each data block in parallel by using each computation analysis task to obtain a corresponding temporary analysis result.

In some embodiments of the present application, the merge output module 203 is specifically configured to:

Specifically, for a specific process of implementing the functions of each module and unit in the device in the embodiment of the present application, reference may be made to the description of the distributed data calculation and analysis method in the corresponding embodiment of fig. 1, which is not described herein again in detail.
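The module structure of device 200 can be pictured as a few cooperating classes. The sketch below is an illustration only: the class and method names are hypothetical, an in-memory dictionary stands in for the database, a thread pool stands in for the system's idle processes, and placing the data-splitting step inside the processing module is an assumption, since the text above lists only model generation and the parallel processing unit for module 202.

```python
import concurrent.futures as cf
from collections import Counter
from typing import Dict, List, Sequence


class AcquisitionModule:  # corresponds to obtaining module 201
    def __init__(self, store: Dict[str, Sequence[str]]) -> None:
        self._store = store  # hypothetical in-memory stand-in for the database

    def acquire(self, item: str) -> Sequence[str]:
        return self._store[item]


class ParallelProcessingUnit:  # corresponds to parallel processing unit 2021
    def run(self, blocks: List[Sequence[str]]) -> List[Counter]:
        # Threads are a stand-in here; the source describes idle processes.
        with cf.ThreadPoolExecutor(max_workers=max(1, len(blocks))) as pool:
            return list(pool.map(Counter, blocks))


class ProcessingModule:  # corresponds to processing module 202
    def __init__(self, block_count: int) -> None:
        self.block_count = block_count
        self.parallel_unit = ParallelProcessingUnit()

    def process(self, data: Sequence[str]) -> List[Counter]:
        blocks = [data[i::self.block_count] for i in range(self.block_count)]
        return self.parallel_unit.run(blocks)


class MergeOutputModule:  # corresponds to merging output module 203
    def merge(self, partials: List[Counter]) -> Counter:
        return sum(partials, Counter())


class AnalysisDevice:  # corresponds to device 200
    def __init__(self, store: Dict[str, Sequence[str]], block_count: int) -> None:
        self.acquisition = AcquisitionModule(store)
        self.processing = ProcessingModule(block_count)
        self.merge_output = MergeOutputModule()

    def analyze(self, item: str) -> Counter:
        data = self.acquisition.acquire(item)
        return self.merge_output.merge(self.processing.process(data))
```

Each class has a single responsibility, so the storage backend or the parallel executor could be swapped without touching the other modules.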

The embodiment of the present application further provides a distributed data calculation and analysis electronic device, which integrates any one of the distributed data calculation and analysis apparatuses provided in the embodiment of the present application, and the electronic device includes:

one or more processors;

a memory; and

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to perform the steps of the distributed data calculation analysis method in any of the above embodiments of the distributed data calculation analysis method.

The distributed data calculation and analysis electronic device provided by the embodiment of the application integrates any one of the distributed data calculation and analysis devices provided by the embodiment of the application. As shown in fig. 3, it shows a schematic structural diagram of an electronic device according to an embodiment of the present application, specifically:

the electronic device may include components such as a processor 301 of one or more processing cores, memory 302 of one or more computer-readable storage media, a power supply 303, and an input unit 304. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 3 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

The processor 301 is the control center of the electronic device. It connects the various parts of the electronic device through various interfaces and lines, and performs the functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 302 and calling the data stored in the memory 302, thereby monitoring the electronic device as a whole. Optionally, the processor 301 may include one or more processing cores. The processor 301 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. The general-purpose processor may be a microprocessor or any conventional processor. Preferably, the processor 301 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 301.

The memory 302 may be used to store software programs and modules, and the processor 301 executes various functional applications and data processing by running the software programs and modules stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 302 may also include a memory controller to provide the processor 301 with access to the memory 302.

The electronic device further comprises a power supply 303 for supplying power to each component, and preferably, the power supply 303 can be logically connected with the processor 301 through a power management system, so that functions of charging, discharging, power consumption management and the like can be managed through the power management system. The power supply 303 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The electronic device may further include an input unit 304, and the input unit 304 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the electronic device may further include a display unit and the like, which are not described in detail here. Specifically, in this embodiment, the processor 301 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 302 according to the following instructions, and runs the application programs stored in the memory 302, thereby implementing the following functions:

acquiring data to be calculated of an item to be analyzed;

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the distributed data calculation and analysis apparatus, the electronic device and the corresponding units thereof described above may refer to the description of the distributed data calculation and analysis method in the embodiment corresponding to fig. 1, and are not described herein again in detail.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by related hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by the processor 301.

To this end, an embodiment of the present application provides a computer-readable storage medium, which may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like. A computer program is stored on the storage medium and is loaded by a processor to execute the steps of any one of the distributed data calculation and analysis methods provided by the embodiments of the present application. For example, the computer program may be loaded by a processor to perform the following steps:

acquiring data to be calculated of an item to be analyzed;

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A distributed data calculation analysis method, characterized in that the method is built on the Python language and comprises the following steps:

acquiring data to be calculated of the item to be analyzed;

2. The method of claim 1, wherein the preset segmentation rule is: evenly dividing the data to be calculated into a plurality of data blocks, wherein the number of the data blocks is the same as the number of idle processes of the system.

3. The method of claim 1, wherein the number of computational analysis tasks is the same as the number of data blocks.

4. The method of claim 3, wherein performing calculation analysis on each of the data blocks by using the calculation analysis task to obtain a corresponding temporary analysis result specifically comprises: performing calculation analysis on the data blocks in parallel, one calculation analysis task per data block, to obtain the corresponding temporary analysis results.

5. The method of claim 1, wherein said merging each of said temporary analysis results to obtain a final data analysis result comprises:

6. The method of claim 1, wherein the method supports stand-alone applications and multi-machine distributed applications.

7. A distributed data computation analysis apparatus, comprising:

the acquisition module is used for acquiring a project to be analyzed and data to be calculated of the project to be analyzed;

8. The apparatus of claim 7, wherein the processing module comprises:

9. A distributed data computing analysis electronic device, comprising:

one or more processors;

a memory; and

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the distributed data computing analysis method of any of claims 1-6.

10. A computer-readable storage medium, having stored thereon a computer program which is loaded by a processor to perform the steps of the distributed data computing analysis method of any one of claims 1 to 6.