CN113704340B - Data processing method, device, server and storage medium - Google Patents

Data processing method, device, server and storage medium Download PDF

Info

Publication number
CN113704340B
CN113704340B (application CN202111004548.XA)
Authority
CN
China
Prior art keywords
data
processing
distributed
data processing
analyzed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111004548.XA
Other languages
Chinese (zh)
Other versions
CN113704340A (en)
Inventor
王志猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Envision Innovation Intelligent Technology Co Ltd
Envision Digital International Pte Ltd
Original Assignee
Shanghai Envision Innovation Intelligent Technology Co Ltd
Envision Digital International Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Envision Innovation Intelligent Technology Co Ltd, Envision Digital International Pte Ltd filed Critical Shanghai Envision Innovation Intelligent Technology Co Ltd
Priority to CN202111004548.XA priority Critical patent/CN113704340B/en
Publication of CN113704340A publication Critical patent/CN113704340A/en
Priority to PCT/SG2022/050611 priority patent/WO2023033726A2/en
Application granted granted Critical
Publication of CN113704340B publication Critical patent/CN113704340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

The application discloses a data processing method, a data processing device, a server and a storage medium, and belongs to the technical field of computers and the Internet. The method comprises the following steps: obtaining data to be analyzed from a distributed database, wherein the data structure of the data to be analyzed is an RDD; performing format conversion processing on the data to be analyzed to obtain converted data, wherein the data structure of the converted data meets the data structure requirement corresponding to a single-machine algorithm; processing the converted data in parallel with the single-machine algorithm at the data processing nodes contained in the distributed framework to obtain processed data; and storing the processed data in the distributed database. The method and the device decouple the data and the algorithm of the distributed framework, reduce the capability requirements that the development and application of the distributed framework place on algorithm personnel, reduce the learning cost of algorithm personnel, shorten the development life cycle of the distributed framework, and facilitate the popularization and application of the distributed framework.

Description

Data processing method, device, server and storage medium
Technical Field
The embodiment of the application relates to the technical fields of computers and the Internet, in particular to a data processing method, a data processing device, a server and a storage medium.
Background
Conventional algorithm research and applications are typically developed on a single server (also referred to as a "single machine" in the embodiments of the present application). However, as historical business data accumulates and real-time business volume grows, the computing power of one or even several servers becomes increasingly unable to meet the demands of business growth.
To solve this technical problem, algorithm personnel have studied and proposed a distributed algorithm running framework (hereinafter referred to as a "distributed framework"). The distributed framework refers to an algorithm access framework written on top of distributed storage, computation and machine learning frameworks, and it enables algorithms to perform distributed reading, writing and computation. Correspondingly, the single-machine algorithm running framework (hereinafter referred to as a "single-machine framework") used in conventional algorithm research and applications refers to an algorithm access framework written on top of a single machine and a relational database, which enables algorithms to perform single-machine reading, writing and computation. Because an algorithm can perform distributed reading, writing and computation in the distributed framework, the computing power achieved by the distributed framework is far greater than that of the single-machine framework, making it suitable for services with larger data volumes. For example, with the popularization of big data and machine learning in the wind turbine field, a distributed framework can be used to mine useful information from the massive data accumulated during wind turbine operation, enabling monitoring of wind turbine operating states and fault diagnosis and making wind turbine operation more convenient.
However, the development and application of the distributed framework place high capability requirements on algorithm personnel: they must not only have the necessary algorithmic knowledge, but also the engineering skills for developing and applying the distributed framework. This increases the learning time cost of algorithm personnel, prolongs the development life cycle of the distributed framework, and hinders the popularization and application of the distributed framework.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, a server and a storage medium, which can be used for decoupling data and algorithms of a distributed framework and reducing the capability requirements of development and application of the distributed framework on algorithm personnel. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a data processing method, which is applied to a distributed framework, where the method includes:
obtaining data to be analyzed from a distributed database, wherein the data structure of the data to be analyzed is RDD;
performing format conversion processing on the data to be analyzed to obtain converted data, wherein the data structure of the converted data accords with the data structure requirement corresponding to a single machine algorithm;
at the data processing nodes contained in the distributed framework, the single-machine algorithm is adopted to process the converted data in parallel, so as to obtain processed data;
and storing the processed data into the distributed database.
In another aspect, an embodiment of the present application provides a data processing apparatus, provided in a distributed framework, including:
the data acquisition module is used for acquiring data to be analyzed from the distributed database, and the data structure of the data to be analyzed is RDD;
the format conversion module is used for carrying out format conversion processing on the data to be analyzed to obtain converted data, and the data structure of the converted data accords with the data structure requirement corresponding to a single machine algorithm;
the data processing module is used for carrying out parallel processing on the converted data by adopting the single machine algorithm at the data processing nodes contained in the distributed framework to obtain processed data;
and the data storage module is used for storing the processed data into the distributed database.
In yet another aspect, embodiments of the present application provide a server including a processor and a memory, the memory having stored therein a computer program loaded and executed by the processor to implement a data processing method as described above.
In yet another aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described above.
In yet another aspect, embodiments of the present application provide a computer program product which, when executed by a processor, implements the data processing method described above.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
before the data is processed, the distributed framework converts the data structure of the data so that the converted data is adapted to the data structure requirement corresponding to the single-machine algorithm, thereby connecting the single-machine algorithm into the distributed framework. Moreover, because the distributed framework accesses a single-machine algorithm, algorithm personnel can develop the algorithm in a single-machine environment while engineering personnel connect the developed algorithm into the distributed framework. The data and the algorithm of the distributed framework are thus decoupled, the capability requirements that the development and application of the distributed framework place on algorithm personnel are reduced, the learning cost of algorithm personnel is reduced, the development life cycle of the distributed framework is shortened, and the popularization and application of the distributed framework are facilitated. In addition, in the embodiments of the present application, the distributed framework uses the single-machine algorithm to process the converted data in parallel at each data processing node, making full use of the efficient data throughput, scheduling and parallelism of the distributed framework.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a scheduling approach provided by one embodiment of the present application;
FIG. 2 is a schematic diagram of a data processing system provided in one embodiment of the present application;
FIG. 3 is a flow chart of a data processing method provided by one embodiment of the present application;
FIG. 4 is a schematic diagram of a data slicing manner according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing method provided in one embodiment of the present application;
FIG. 6 is a block diagram of a data processing apparatus provided in one embodiment of the present application;
FIG. 7 is a block diagram of a data processing apparatus provided in another embodiment of the present application;
fig. 8 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, related terms according to the embodiments of the present application will be described.
The distributed framework refers to an algorithm access framework written on top of distributed storage, computation and machine learning frameworks, and it enables algorithms to perform distributed reading, writing and computation. The single-machine framework refers to an algorithm access framework written on top of a single machine and a relational database, which enables algorithms to perform single-machine reading, writing and computation. In one example, the data scheduling and computation manners of the distributed framework and the single-machine framework are shown in FIG. 1 and can be summarized as Table one below.
Table one: comparison of the scheduling modes of the distributed framework and the single-machine framework

Item | Distributed framework | Single-machine framework
Data source | Hive/HDFS | MySQL
Data structure during computation | RDD | Pandas DataFrame
Machine learning library | PySpark-adapted machine learning library | Python-adapted machine learning library
With reference now to FIG. 2, a diagram of a data processing system is depicted in accordance with one embodiment of the present application. As shown in fig. 2, the data processing system includes: a distributed framework 10.
The distributed framework 10 refers to an algorithm access framework capable of implementing distributed reading, writing and computing. In the embodiment of the present application, the distributed framework 10 includes a master node device 20 and a server cluster formed by a plurality of servers 30. Each server 30 may perform corresponding calculation according to the data and algorithm issued by the master node device 20, and return the calculation result to the master node device 20.
In one example, as shown in FIG. 2, the distributed framework 10 may call the data stored in the distributed database 40 and calculate the called data, with the calculated result still stored in the distributed database 40. Optionally, the distributed framework 10 and the distributed database 40 are communicatively connected via a network, which may be a wired network or a wireless network.
Alternatively, the distributed database 40 includes any one of the following: Hive (a Hadoop-based data warehouse tool for data extraction, transformation and loading, which can store, query and analyze large-scale data stored in Hadoop) and HDFS (Hadoop Distributed File System). Optionally, the data structure of the data stored in the distributed database 40 is an RDD (Resilient Distributed Dataset).
The data processing method provided by the embodiments of the present application decouples the data and the algorithm of the distributed framework: on the one hand, algorithm personnel develop the algorithm in a single-machine environment; on the other hand, engineering personnel connect the developed algorithm into the distributed framework. Thus, in the embodiments of the present application, the machine learning library used by the distributed framework during computation is the same as the machine learning library used by the single-machine framework during computation; for example, the machine learning library used by the distributed framework during computation includes Python-adapted machine learning libraries (such as scikit-learn, LightGBM, etc.).
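In one illustrative, non-limiting sketch (the function name, column names and model choice below are assumptions for illustration and are not part of the embodiment), a single-machine algorithm written against pandas and scikit-learn contains no distributed-framework code at all, which is what allows it to be reused unchanged on each data processing node:

```python
# Illustrative sketch only: a single-machine algorithm written against pandas and
# scikit-learn, containing no Spark-specific code. The column names ("feature_1",
# "feature_2", "label") and the model choice are assumptions for this example.
import pandas as pd
from sklearn.linear_model import LinearRegression

def stand_alone_algorithm(df: pd.DataFrame) -> pd.DataFrame:
    """Train a simple model on one slice of data and append its predictions."""
    features = df[["feature_1", "feature_2"]]
    label = df["label"]
    model = LinearRegression().fit(features, label)
    result = df.copy()
    result["prediction"] = model.predict(features)
    return result
```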
The following describes the data processing method provided in the embodiments of the present application in conjunction with several examples.
Referring to fig. 3, a flowchart of a data processing method according to an embodiment of the present application is shown. The method may be applied in the distributed framework 10 of the data processing system described above. The method may comprise the following steps.
In step 310, the data to be analyzed is obtained from the distributed database, and the data structure of the data to be analyzed is RDD.
The distributed database refers to a database with distributed storage capability, such as Hive, HDFS, etc. The distributed framework may obtain data to be analyzed from a distributed database, where the data structure of the data to be analyzed is RDD. The embodiment of the application does not limit the time for acquiring the data to be analyzed from the distributed database by the distributed framework. In one example, the distributed framework obtains data to be analyzed from a distributed database when there is a data processing need. In another example, the distributed framework obtains data to be analyzed from the distributed database once every preset time period. In yet another example, the distributed database actively pushes the data to be analyzed to the distributed framework every preset time period.
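As a non-limiting illustration of this step, and assuming a PySpark-based distributed framework with Hive as the distributed database (the table name below is borrowed from the later examples and is only an assumption), the data to be analyzed can be read as follows:

```python
# Illustrative sketch only: reading the data to be analyzed from Hive with PySpark.
# The table name "wens_wtg_10m" is borrowed from the later examples; the query and
# the acquisition timing (on demand, periodic, or pushed) are not restricted here.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("distributed-framework-data-processing")
         .enableHiveSupport()        # allow Spark to read tables from the Hive metastore
         .getOrCreate())

df_to_analyze = spark.sql("SELECT * FROM wens_wtg_10m")   # Spark DataFrame backed by an RDD
rdd_to_analyze = df_to_analyze.rdd                         # the underlying RDD of Row objects
```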
Step 320, performing format conversion processing on the data to be analyzed to obtain converted data, wherein the data structure of the converted data meets the data structure requirement corresponding to the single machine algorithm.
Based on a specific business requirement, the distributed framework can, on the one hand, obtain the data to be analyzed corresponding to that business requirement from the distributed database and, on the other hand, determine the algorithm used to realize the business requirement. In the embodiments of the present application, the algorithm determined by the distributed framework is developed by algorithm personnel in a single-machine environment, and is therefore also called a single-machine algorithm.
Because the single-machine algorithm is developed before it is connected into the distributed framework, the data structure requirement corresponding to the single-machine algorithm is already determined. To enable the distributed framework to process the data to be analyzed with the single-machine algorithm, in the embodiments of the present application the distributed framework performs format conversion processing on the data to be analyzed after obtaining it, so that the data structure of the converted data meets the data structure requirement corresponding to the single-machine algorithm. Subsequently, the distributed framework processes the converted data with the single-machine algorithm.
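Table one above suggests that the data structure requirement of a typical single-machine algorithm is a Pandas DataFrame. Under that assumption, a minimal conversion sketch (illustrative only) is:

```python
# Illustrative sketch only: converting Spark Row objects into the pandas DataFrame
# that a typical single-machine algorithm expects (see Table one above).
import pandas as pd

def rows_to_pandas(rows):
    """Convert an iterable of pyspark.sql.Row objects into a pandas DataFrame."""
    return pd.DataFrame([row.asDict() for row in rows])

# For small data sets the whole Spark DataFrame can also be converted at once,
# e.g. converted_df = df_to_analyze.toPandas(), which collects it onto the driver;
# in the parallel case of step 330 the conversion is instead applied per data slice.
```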
Step 330, at the data processing nodes included in the distributed framework, performing parallel processing on the converted data by adopting the single-machine algorithm to obtain processed data.
As can be seen from the above description, the distributed framework includes a server cluster composed of a plurality of servers. Each server in the server cluster may perform data processing and thus form a data processing node in the distributed framework. Optionally, one server corresponds to one data processing node; alternatively, a plurality of servers correspond to one data processing node, which is not limited in the embodiments of the present application.
The distributed framework can process the converted data in parallel with the single-machine algorithm at the data processing nodes it contains, thereby achieving efficient data throughput and parallelism. Optionally, the distributed framework performs data processing with the single-machine algorithm at all of its data processing nodes; alternatively, it performs data processing with the single-machine algorithm at only some of its data processing nodes, which is not limited in the embodiments of the present application. In practical applications, the number of data processing nodes participating in the computation can be determined according to the size of the converted data, the granularity of the business requirement, and the like.
In one example, the distributed framework includes n data processing nodes, n being an integer greater than 1; step 330 includes: at n data processing nodes, adopting a single machine algorithm to process the converted data in parallel to obtain n processing results; and carrying out data fusion processing on the n processing results to obtain processed data.
Because the distributed framework distributes and collects data through a single iterator, and each of the n data processing nodes generates its own processing result, data fusion processing needs to be performed on the processing results generated by the n data processing nodes to obtain the processed data, so that the iterator can return the processed data to the distributed database. For further description of the data fusion processing performed by the distributed framework, please refer to the following method embodiments, which are not repeated here.
The embodiments of the present application do not limit the single-machine algorithm and the converted data used by each data processing node. Optionally, the n data processing nodes use the same single-machine algorithm but different converted data; alternatively, the n data processing nodes use different single-machine algorithms and different converted data; alternatively, the n data processing nodes use the same converted data but different single-machine algorithms. For further description of the data processing performed by each data processing node, please refer to the following method embodiments, which are not repeated here.
It should be understood that the number of all data processing nodes included in the distributed framework should be greater than or equal to n, and in this example, the description is given only by taking as an example that the service requirement needs to use n data processing nodes among the data processing nodes included in the distributed framework, which does not limit the embodiments of the present application.
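Continuing the earlier sketches (stand_alone_algorithm and rdd_to_analyze are the assumed names introduced there, and PySpark's mapPartitions is used as one possible realization rather than the embodiment's prescribed mechanism), the parallel processing of step 330 can be sketched as:

```python
# Illustrative sketch only: each RDD partition (data slice) is handed to the
# single-machine algorithm; Spark schedules the partitions across the data
# processing nodes in parallel and the results are gathered on the driver.
import pandas as pd

def process_partition(rows):
    """Run the single-machine algorithm on one data slice (one RDD partition)."""
    pdf = pd.DataFrame([row.asDict() for row in rows])   # per-slice format conversion
    if pdf.empty:
        return iter([])                                   # empty slices pass through
    result = stand_alone_algorithm(pdf)                   # unmodified single-machine code
    return iter(result.to_dict("records"))                # records gathered by the iterator

processed_rdd = rdd_to_analyze.mapPartitions(process_partition)
processing_results = processed_rdd.collect()              # the n processing results, to be fused
```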
Step 340, storing the processed data in a distributed database.
After the processed data is obtained, the distributed framework returns the processed data to the distributed database for storage, so that the corresponding data can be conveniently called from the distributed database for data analysis and the like.
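For completeness, a minimal sketch of this step under the same assumptions as the earlier sketches (the target table name "processed_result" is assumed for illustration):

```python
# Illustrative sketch only, continuing the earlier sketches: the fused processing
# results are turned back into a Spark DataFrame and written to the distributed
# database. The target table name "processed_result" is an assumed example.
import pandas as pd

processed_df = spark.createDataFrame(pd.DataFrame(processing_results))
processed_df.write.mode("overwrite").saveAsTable("processed_result")
```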
In summary, according to the technical solution provided by the embodiments of the present application, the distributed framework converts the data structure of the data before the data is processed, so that the converted data is adapted to the data structure requirement corresponding to the single-machine algorithm and the single-machine algorithm is thereby connected into the distributed framework. Moreover, because the distributed framework accesses a single-machine algorithm, algorithm personnel can develop the algorithm in a single-machine environment while engineering personnel connect the developed algorithm into the distributed framework. The data and the algorithm of the distributed framework are thus decoupled, the capability requirements that the development and application of the distributed framework place on algorithm personnel are reduced, the learning cost of algorithm personnel is reduced, the development life cycle of the distributed framework is shortened, and the popularization and application of the distributed framework are facilitated. In addition, in the embodiments of the present application, the distributed framework uses the single-machine algorithm to process the converted data in parallel at each data processing node, making full use of the efficient data throughput, scheduling and parallelism of the distributed framework.
In one example, at the n data processing nodes, the parallel processing is performed on the converted data by using the single-machine algorithm to obtain n processing results, including: performing data segmentation on the converted data according to a target segmentation mode to obtain n data slices; issuing the n data slices to the n data processing nodes; and processing, at an ith data processing node in the n data processing nodes, an ith data slice in the n data slices by using the single-machine algorithm to obtain an ith processing result, wherein the n processing results include the ith processing result and i is a positive integer less than or equal to n.
After obtaining the data to be analyzed from the distributed database, the distributed framework first performs format conversion processing on the data to be analyzed to obtain converted data. Because a plurality of data processing nodes participate in data processing under the distributed framework, in order to reduce the processing overhead of each data processing node and speed up data processing, in the embodiments of the present application the distributed framework performs data segmentation on the converted data to obtain n data slices, and then issues the n data slices to the n data processing nodes contained in the distributed framework. Optionally, the distributed framework issues the single-machine algorithm at the same time as it issues the n data slices. Each data processing node then processes the data slice issued to it with the single-machine algorithm to obtain a processing result.
In the embodiments of the present application, the distributed framework can segment the converted data according to a target segmentation mode. Taking the application of the technical solution provided in the embodiments of the present application to the wind energy field as an example, the data to be analyzed acquired by the distributed framework includes wind turbine related data, and in one example, as shown in fig. 4, the target segmentation mode includes at least one of the following segmentation modes: data segmentation based on wind farm (e.g., as shown in fig. 4(a)), based on wind turbine (e.g., as shown in fig. 4(b)), and based on time (e.g., as shown in fig. 4(c)).
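A minimal sketch of the wind-farm-based segmentation, assuming each record carries a wind farm identifier wf_id (the column name follows the later table examples) and continuing the PySpark sketches above:

```python
# Illustrative sketch only, continuing the PySpark sketches above: the converted
# data is sliced by wind farm so that the records of one wind farm land in the
# same data slice (RDD partition). Slicing by wind turbine (wtg_id) or by time
# would only change the partitioning key.
n = df_to_analyze.select("wf_id").distinct().count()   # one slice per wind farm
sliced_df = df_to_analyze.repartition(n, "wf_id")       # hash-partition the rows by wf_id
sliced_rdd = sliced_df.rdd                              # each partition now acts as a data slice
```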
Taking one data processing node for data processing on one data slice as an example, the distributed framework needs to issue n data slices to n data processing nodes. In one example, the distributed framework may randomly select n data processing nodes from the included data processing nodes, and randomly issue n data slices to the n data processing nodes, so as to improve the issue efficiency of the data slices. In another example, the issuing the n data slices to the n data processing nodes includes: determining the data sizes corresponding to the n data slices respectively; acquiring processing capacities corresponding to the n data processing nodes respectively; and issuing the n data slices to the n data processing nodes based on the data sizes respectively corresponding to the n data slices and the processing capacities respectively corresponding to the n data processing nodes. The data slices are distributed according to the processing capacity of each data processing node and the sizes of the data slices, so that the sizes of the data slices are ensured to be suitable for the processing capacity of the data processing nodes, and a better data processing effect is achieved.
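The embodiment does not prescribe a particular assignment algorithm; purely as an illustration, one possible greedy heuristic that matches larger data slices to nodes with more processing capacity could look like the following (the function and the heuristic itself are assumptions, not taken from the embodiment):

```python
# Illustrative sketch only: a greedy heuristic that assigns the largest remaining
# data slice to the node with the most remaining processing capacity. This is NOT
# taken from the embodiment; it merely shows one possible assignment strategy.
def assign_slices(slice_sizes, node_capacities):
    """Return a mapping {slice_index: node_index} based on size and capacity."""
    remaining = dict(enumerate(node_capacities))
    assignment = {}
    # consider the largest slices first
    for s_idx, size in sorted(enumerate(slice_sizes), key=lambda kv: kv[1], reverse=True):
        n_idx = max(remaining, key=remaining.get)   # node with the most spare capacity
        assignment[s_idx] = n_idx
        remaining[n_idx] -= size
    return assignment

# Example: three slices of 10/40/25 units and three nodes with capacities 100/50/80.
print(assign_slices([10, 40, 25], [100, 50, 80]))
```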
In summary, according to the technical scheme provided by the embodiment of the application, the data to be processed is subjected to data slicing processing to obtain a plurality of data slices, and the data slices are respectively issued to a plurality of data processing nodes included in the distributed framework, so that each data processing node can process the distributed data slices by adopting a single-machine algorithm. The data segmentation processing is performed, so that the data quantity required to be processed by each data processing node is reduced, the processing cost of each data processing node is reduced, and the data processing speed is increased.
Typically, data fusion is performed by splicing (concatenation). Taking the case where the processing results are represented as tables as an example, a plurality of processing results can be fused by row-wise splicing or column-wise splicing.
Illustratively, the n processing results include table two and table three below.
Table two: wens_wtg_10m

wf_id | wtg_id
WF0002 | WF0002WTG0005
WF0002 | WF0002WTG0018
WF0006 | WF0006WTG0011
WF0006 | WF0006WTG0024
WF0006 | WF0006WTG0037
WF0008 | WF0008WTG0007
Table three: wens_wtg_info

wtq_latitude | wtq_longitude
34.31365 | 120.019
42.4992 | 117.8644
If column-wise splicing processing is performed on table two and table three, the splicing result shown in table four below can be obtained.
Table four: column-wise splice results

wf_id | wtg_id | wtq_latitude | wtq_longitude
WF0002 | WF0002WTG0005 | 34.31365 | 120.019
WF0002 | WF0002WTG0018 | 42.4992 | 117.8644
WF0006 | WF0006WTG0011 | NaN | NaN
WF0006 | WF0006WTG0024 | NaN | NaN
WF0006 | WF0006WTG0037 | NaN | NaN
WF0008 | WF0008WTG0007 | NaN | NaN
Illustratively, the n processing results include table five and table six below.
Table five: wens_wtg_10m

wf_id | wtg_id | col1 | col2 | col3
WF0002 | WF0002WTG0005 | val1 | val2 | val3
WF0002 | WF0002WTG0018 | val1 | val2 | val3
WF0006 | WF0006WTG0011 | val1 | val2 | val3
Table six: wens_wtg_info

wtq_latitude | wtq_longitude
34.31365 | 120.019
42.4992 | 117.8644
If row-wise splicing processing is performed on table five and table six above, the splicing result shown in table seven below can be obtained.
Table seven: row-wise splice results

table | wf_id | wtg_id | col1 | col2 | col3
1 | WF0002 | WF0002WTG0005 | val1 | val2 | val3
1 | WF0002 | WF0002WTG0018 | val1 | val2 | val3
1 | WF0006 | WF0006WTG0011 | val1 | val2 | val3
2 | 34.31365 | 120.019 | NaN | NaN | NaN
2 | 42.4992 | 117.8644 | NaN | NaN | NaN
As can be seen from the splicing results shown in table four and table seven, splicing either by columns or by rows generates a number of redundant cells (NaN), which occupy storage space and thus waste the storage resources of the distributed database. In addition, as shown in table seven above, during the row-wise splicing process the data labels (here, the column names) of part of the processing results are discarded, which is not conducive to subsequent standardized management and retrieval.
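The padding behaviour described above can be reproduced with a small pandas sketch (illustrative only, using the first rows of the toy tables above):

```python
import pandas as pd

# Two processing results with different columns (first rows of table two and table three).
t2 = pd.DataFrame({"wf_id": ["WF0002"], "wtg_id": ["WF0002WTG0005"]})
t3 = pd.DataFrame({"wtq_latitude": [34.31365], "wtq_longitude": [120.019]})

# Row-wise splicing: cells that either result lacks are padded with NaN,
# and these redundant cells occupy storage once the result is persisted.
stacked = pd.concat([t2, t3], ignore_index=True)
print(stacked)
```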
Based on this, the embodiment of the application provides a data fusion processing mode, which can be used for solving the technical problems. Next, the data fusion processing method will be described.
In one example, the data fusion processing is performed on the n processing results to obtain the processed data, including: virtualizing the data labels of the n processing results to obtain virtualized labels; converting the n processing results into dictionary form based on the virtualized labels to obtain n dictionary results; and integrating the data of the n dictionary results to obtain the processed data.
The data tag is used to identify data in the processing result, and may generally indicate the meaning of the data. In one example, the representation of the processing result includes a table containing at least one column of data, and the data tag of the processing result includes a column name of the table; alternatively, the presentation form of the processing result includes a table containing at least one line of data; the data tag of the processing result includes the row name of the table.
Because each data processing node processes a different data slice, the processing results obtained by the data processing nodes may differ, and so may their data labels. To facilitate management and reduce the size of the data labels, in the embodiments of the present application the distributed framework may virtualize the data labels of the n processing results to obtain virtualized labels. For example, if the n processing results include table two and table three above, the column name "wf_id" may be virtualized to "c0" and the column name "wtg_id" may be virtualized to "c1", where c0 and c1 are virtualized labels.
Based on the virtualized labels, the distributed framework may convert the n processing results into dictionary form to obtain n dictionary results. For example, if the n processing results include table two and table three above, the dictionary result of table two is as follows:
{‘c0’:‘WF0002’,‘c1’:‘WF0002WTG0005’}
{‘c0’:‘WF0002’,‘c1’:‘WF0002WTG0018’}
{‘c0’:‘WF0006’,‘c1’:‘WF0006WTG0011’}
{‘c0’:‘WF0006’,‘c1’:‘WF0006WTG0024’}
{‘c0’:‘WF0006’,‘c1’:‘WF0006WTG0037’}
{‘c0’:‘WF0008’,‘c1’:‘WF0008WTG0007’}
the dictionary result of table three above is as follows:
{‘c0’:‘34.31365’,‘c1’:‘120.019’}
{‘c0’:‘42.4992’,‘c1’:‘117.8644’}
Then the distributed framework integrates the data of each dictionary result and maps the data to a new data structure, thereby obtaining the processed data. For example, the dictionary results of table two and table three above are integrated to obtain the processed result shown in table eight below.
Table eight: processed results

mata_data | Table_name
{‘c0’:‘WF0002’,‘c1’:‘WF0002WTG0005’} | wens_wtg_10m
{‘c0’:‘WF0002’,‘c1’:‘WF0002WTG0018’} | wens_wtg_10m
{‘c0’:‘WF0006’,‘c1’:‘WF0006WTG0011’} | wens_wtg_10m
{‘c0’:‘WF0006’,‘c1’:‘WF0006WTG0024’} | wens_wtg_10m
{‘c0’:‘WF0006’,‘c1’:‘WF0006WTG0037’} | wens_wtg_10m
{‘c0’:‘WF0008’,‘c1’:‘WF0008WTG0007’} | wens_wtg_10m
{‘c0’:‘34.31365’,‘c1’:‘120.019’} | wens_wtg_info
{‘c0’:‘42.4992’,‘c1’:‘117.8644’} | wens_wtg_info
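A minimal sketch of this fusion procedure, assuming the n processing results arrive as pandas DataFrames paired with a source table name (the helper name fuse_results and the use of JSON strings for the dictionary column are assumptions for illustration):

```python
import json
import pandas as pd

def fuse_results(named_results):
    """named_results: iterable of (table_name, pandas.DataFrame) pairs."""
    records = []
    for table_name, df in named_results:
        # virtualize the data labels: original column names become c0, c1, ...
        virtual = {col: f"c{i}" for i, col in enumerate(df.columns)}
        for row in df.rename(columns=virtual).to_dict("records"):
            # dictionary-ize each row and keep the source table as a separate field,
            # mirroring the two-column layout of table eight (mata_data / Table_name)
            records.append({"mata_data": json.dumps({k: str(v) for k, v in row.items()}),
                            "Table_name": table_name})
    return pd.DataFrame(records)

fused = fuse_results([
    ("wens_wtg_10m", pd.DataFrame({"wf_id": ["WF0002"],
                                   "wtg_id": ["WF0002WTG0005"]})),
    ("wens_wtg_info", pd.DataFrame({"wtq_latitude": [34.31365],
                                    "wtq_longitude": [120.019]})),
])
print(fused)
```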
In summary, according to the technical solution provided by the embodiments of the present application, performing data fusion processing on the processing results output by the data processing nodes ensures that the distributed framework completes the collection of the data processing results through a single iterator. In addition, in the embodiments of the present application, the distributed framework virtualizes the data labels of the processing results to obtain virtualized labels, converts each processing result into dictionary form according to the virtualized labels, and then integrates the data, which facilitates the management of the processing results while reducing the storage space they require and thus saving the storage resources of the distributed database.
Referring to fig. 5, a schematic diagram of a data processing method according to an embodiment of the present application is shown. The method may be applied in the distributed framework 10 of the data processing system described above.
First, the distributed framework obtains the data to be analyzed from Hive. As shown in fig. 5, the data structure of the data to be analyzed is RDD. In order to realize the access of a single-machine algorithm in the distributed framework, the distributed framework needs to perform format conversion processing on the data to be analyzed after acquiring the data to be analyzed, so as to obtain converted data. The data structure of the converted data accords with the data structure requirement corresponding to the single machine algorithm.
In order to reduce the processing overhead of a single data processing node and speed up data processing, the distributed framework also performs data segmentation on converted data to obtain a plurality of data slices. The distributed framework then issues the individual data slices and algorithms to the individual data processing nodes, as shown in FIG. 5.
In the embodiments of the present application, because the distributed framework accesses a single-machine algorithm, the machine learning library that the distributed framework needs to call during data processing is the same as the machine learning library that the single-machine framework needs to call. For example, as shown in fig. 5, the machine learning library that the distributed framework needs to invoke includes Python-adapted libraries such as scikit-learn and LightGBM.
Each data processing node can obtain a corresponding processing result, the distributed framework carries out data fusion processing on the processing result obtained by each data processing node to obtain processed data, and the processed data are stored in a distributed database.
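Putting the steps of fig. 5 together, one possible end-to-end sketch (all table, column and function names are assumptions for illustration, and the placeholder algorithm merely marks rows as processed) might read:

```python
# Illustrative end-to-end sketch of fig. 5 (all names are assumptions): read the
# data to be analyzed from Hive, slice it by wind farm, run a single-machine
# algorithm on each slice in parallel, gather the records, and write them back.
import pandas as pd
from pyspark.sql import SparkSession

def stand_alone_algorithm(pdf: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the algorithm developed in the single-machine environment."""
    out = pdf.copy()
    out["processed"] = True      # stands in for real feature mining / diagnosis logic
    return out

def run_on_slice(rows):
    pdf = pd.DataFrame([r.asDict() for r in rows])        # RDD rows -> pandas DataFrame
    if pdf.empty:
        return iter([])
    return iter(stand_alone_algorithm(pdf).to_dict("records"))

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.sql("SELECT * FROM wens_wtg_10m")             # data to be analyzed (RDD-backed)
sliced = data.repartition(8, "wf_id")                       # data segmentation by wind farm
fused = sliced.rdd.mapPartitions(run_on_slice).collect()    # parallel single-machine processing
result = spark.createDataFrame(pd.DataFrame(fused))
result.write.mode("overwrite").saveAsTable("processed_result")   # store back into Hive
```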
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to FIG. 6, a block diagram of a data processing apparatus according to one embodiment of the present application is shown. The apparatus 600 has functions for implementing the above-described method embodiments, where the functions may be implemented by hardware, or may be implemented by hardware executing corresponding software. The apparatus 600 may be the server described above, or may be provided in the server described above. The apparatus 600 may include: a data acquisition module 610, a format conversion module 620, a data processing module 630, and a data storage module 640.
The data obtaining module 610 is configured to obtain data to be analyzed from the distributed database, where a data structure of the data to be analyzed is RDD.
The format conversion module 620 is configured to perform format conversion processing on the data to be analyzed to obtain converted data, where a data structure of the converted data meets a data structure requirement corresponding to a single machine algorithm.
And the data processing module 630 is configured to process the converted data in parallel by using the single-machine algorithm at a data processing node included in the distributed framework to obtain processed data.
And the data storage module 640 is used for storing the processed data into the distributed database.
In one example, the distributed framework includes n data processing nodes, the n being an integer greater than 1; as shown in fig. 7, the data processing module 630 includes: a data processing unit 632, configured to perform parallel processing on the converted data at the n data processing nodes by using the single machine algorithm, so as to obtain n processing results; and a data fusion unit 634, configured to perform data fusion processing on the n processing results, so as to obtain the processed data.
In one example, as shown in fig. 7, the data fusion unit 634 is configured to: virtualize the data labels of the n processing results to obtain virtualized labels; convert the n processing results into dictionary form based on the virtualized labels to obtain n dictionary results; and integrate the data of the n dictionary results to obtain the processed data.
In one example, the representation of the processing result includes a table containing at least one column of data; the data tag of the processing result comprises a column name of the table; alternatively, the representation of the processing result includes a table containing at least one line of data; the data tag of the processing result includes a row name of the table.
In one example, as shown in fig. 7, the data processing unit 632 is configured to: perform data segmentation on the converted data according to a target segmentation mode to obtain n data slices; issue the n data slices to the n data processing nodes; and process, at an ith data processing node in the n data processing nodes, an ith data slice in the n data slices by using the single-machine algorithm to obtain an ith processing result, wherein the n processing results include the ith processing result and i is a positive integer less than or equal to n.
In one example, the issuing the n data slices to the n data processing nodes includes: determining the data sizes corresponding to the n data slices respectively; obtaining the processing capacities corresponding to the n data processing nodes respectively; and issuing the n data slices to the n data processing nodes based on the data sizes respectively corresponding to the n data slices and the processing capacities respectively corresponding to the n data processing nodes.
In one example, the data to be analyzed includes wind turbine related data; the target segmentation mode includes at least one of the following segmentation modes: data segmentation based on wind farm, data segmentation based on wind turbine, and data segmentation based on time.
In summary, according to the technical solution provided by the embodiments of the present application, the distributed framework converts the data structure of the data before the data is processed, so that the converted data is adapted to the data structure requirement corresponding to the single-machine algorithm and the single-machine algorithm is thereby connected into the distributed framework. Moreover, because the distributed framework accesses a single-machine algorithm, algorithm personnel can develop the algorithm in a single-machine environment while engineering personnel connect the developed algorithm into the distributed framework. The data and the algorithm of the distributed framework are thus decoupled, the capability requirements that the development and application of the distributed framework place on algorithm personnel are reduced, the learning cost of algorithm personnel is reduced, the development life cycle of the distributed framework is shortened, and the popularization and application of the distributed framework are facilitated. In addition, in the embodiments of the present application, the distributed framework uses the single-machine algorithm to process the converted data in parallel at each data processing node, making full use of the efficient data throughput, scheduling and parallelism of the distributed framework.
It should be noted that, in the device provided in the embodiment of the present application, when the functions of the device are implemented, only the division of the above functional modules is used for illustration, in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to fig. 8, a block diagram of a server according to an embodiment of the present application is shown. The server may be used to implement the data processing method provided in the above-described embodiments. Specifically:
The server 800 includes a processing unit 801 (such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), etc.), a system memory 804 including a RAM (Random-Access Memory) 802 and a ROM (Read-Only Memory) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic I/O system (Input/Output System) 806 that facilitates the transfer of information between the devices within the server, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The I/O system 806 includes a display 808 for displaying information and an input device 809, such as a mouse, keyboard, or the like, for user input of information. Wherein the display 808 and the input device 809 are connected to the central processing unit 801 via an input output controller 810 connected to the system bus 805. The I/O system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, the computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc, high density digital video disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the ones described above. The system memory 804 and mass storage device 807 described above may be collectively referred to as memory.
According to embodiments of the present application, the server 800 may also be connected, through a network such as the Internet, to remote computers on the network for operation. That is, the server 800 may be connected to a network 812 through a network interface unit 811 coupled to the system bus 805, or the network interface unit 811 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores a computer program configured to be executed by one or more processors to implement the data processing method described above.
In an embodiment of the present application, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-mentioned data processing method.
In an exemplary embodiment, a computer program product is also provided, which, when being executed by a processor, is adapted to carry out the above-mentioned data processing method.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The foregoing description of the exemplary embodiments of the present application is not intended to limit the invention to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (7)

1. A data processing method, characterized in that it is applied in a distributed framework, said distributed framework comprising n data processing nodes, said n being an integer greater than 1; the method comprises the following steps:
obtaining data to be analyzed from a distributed database, wherein the data structure of the data to be analyzed is a resilient distributed dataset RDD, and the data to be analyzed comprises wind turbine related data;
performing format conversion processing on the data to be analyzed to obtain converted data, wherein the data structure of the converted data accords with the data structure requirement corresponding to a single machine algorithm;
performing data segmentation on the converted data according to a target segmentation mode to obtain n data slices; the target segmentation mode comprises at least one of the following segmentation modes: data segmentation based on wind farm, data segmentation based on wind turbine, and data segmentation based on time;
issuing the n data slices to the n data processing nodes;
processing an ith data slice in the n data slices by adopting the single machine algorithm at an ith data processing node in the n data processing nodes to obtain an ith processing result, wherein i is a positive integer smaller than or equal to n;
carrying out data fusion processing on n processing results obtained by the n data processing nodes to obtain processed data;
and storing the processed data into the distributed database.
2. The method according to claim 1, wherein the performing data fusion processing on the n processing results obtained by the n data processing nodes to obtain processed data includes:
virtualizing the data labels of the n processing results to obtain virtualized labels;
converting the n processing results into dictionary form based on the virtualized labels to obtain n dictionary results;
and integrating the data of the n dictionary results to obtain the processed data.
3. The method according to claim 2, wherein:
the representation of the processing result comprises a table containing at least one column of data; the data tag of the processing result comprises a column name of the table;
or alternatively, the process may be performed,
the representation of the processing result comprises a table containing at least one row of data; the data tag of the processing result includes a row name of the table.
4. The method of claim 1, wherein said issuing said n data slices to said n data processing nodes comprises:
determining the data sizes corresponding to the n data slices respectively;
obtaining the processing capacities corresponding to the n data processing nodes respectively;
and issuing the n data slices to the n data processing nodes based on the data sizes respectively corresponding to the n data slices and the processing capacities respectively corresponding to the n data processing nodes.
5. A data processing apparatus, arranged in a distributed framework comprising n data processing nodes, n being an integer greater than 1; the device comprises:
the data acquisition module is used for acquiring data to be analyzed from the distributed database, wherein the data structure of the data to be analyzed is a resilient distributed dataset RDD, and the data to be analyzed comprises wind turbine related data;
the format conversion module is used for carrying out format conversion processing on the data to be analyzed to obtain converted data, and the data structure of the converted data accords with the data structure requirement corresponding to a single machine algorithm;
the data processing module is used for performing data segmentation on the converted data according to a target segmentation mode to obtain n data slices, wherein the target segmentation mode comprises at least one of the following segmentation modes: data segmentation based on wind farm, data segmentation based on wind turbine, and data segmentation based on time; issuing the n data slices to the n data processing nodes; processing, at an ith data processing node in the n data processing nodes, an ith data slice in the n data slices by using the single-machine algorithm to obtain an ith processing result, wherein i is a positive integer less than or equal to n; and performing data fusion processing on the n processing results obtained by the n data processing nodes to obtain processed data;
and the data storage module is used for storing the processed data into the distributed database.
6. A server comprising a processor and a memory, wherein the memory has stored therein a computer program that is loaded and executed by the processor to implement the data processing method of any of claims 1 to 4.
7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the data processing method according to any one of claims 1 to 4.
CN202111004548.XA 2021-08-30 2021-08-30 Data processing method, device, server and storage medium Active CN113704340B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111004548.XA CN113704340B (en) 2021-08-30 2021-08-30 Data processing method, device, server and storage medium
PCT/SG2022/050611 WO2023033726A2 (en) 2021-08-30 2022-08-26 Method and apparatus for processing data, and server and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111004548.XA CN113704340B (en) 2021-08-30 2021-08-30 Data processing method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN113704340A CN113704340A (en) 2021-11-26
CN113704340B true CN113704340B (en) 2023-07-21

Family

ID=78656821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111004548.XA Active CN113704340B (en) 2021-08-30 2021-08-30 Data processing method, device, server and storage medium

Country Status (2)

Country Link
CN (1) CN113704340B (en)
WO (1) WO2023033726A2 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063842A (en) * 2018-07-06 2018-12-21 无锡雪浪数制科技有限公司 A kind of machine learning platform of compatible many algorithms frame
CN109658006A (en) * 2018-12-30 2019-04-19 广东电网有限责任公司 A kind of large-scale wind power field group auxiliary dispatching method and device
CN110704995A (en) * 2019-11-28 2020-01-17 电子科技大学中山学院 Cable layout method and computer storage medium for multiple types of fans of multi-substation
WO2020224374A1 (en) * 2019-05-05 2020-11-12 腾讯科技(深圳)有限公司 Data replication method and apparatus, and computer device and storage medium
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112241872A (en) * 2020-10-12 2021-01-19 上海众言网络科技有限公司 Distributed data calculation analysis method, device, equipment and storage medium
CN113220427A (en) * 2021-04-15 2021-08-06 远景智能国际私人投资有限公司 Task scheduling method and device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477731B2 (en) * 2013-10-01 2016-10-25 Cloudera, Inc. Background format optimization for enhanced SQL-like queries in Hadoop
CN107609141B (en) * 2017-09-20 2020-07-31 国网上海市电力公司 Method for performing rapid probabilistic modeling on large-scale renewable energy data
CN112487125B (en) * 2020-12-09 2022-08-16 武汉大学 Distributed space object organization method for space-time big data calculation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063842A (en) * 2018-07-06 2018-12-21 无锡雪浪数制科技有限公司 A kind of machine learning platform of compatible many algorithms frame
CN109658006A (en) * 2018-12-30 2019-04-19 广东电网有限责任公司 A kind of large-scale wind power field group auxiliary dispatching method and device
WO2020224374A1 (en) * 2019-05-05 2020-11-12 腾讯科技(深圳)有限公司 Data replication method and apparatus, and computer device and storage medium
CN110704995A (en) * 2019-11-28 2020-01-17 电子科技大学中山学院 Cable layout method and computer storage medium for multiple types of fans of multi-substation
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112241872A (en) * 2020-10-12 2021-01-19 上海众言网络科技有限公司 Distributed data calculation analysis method, device, equipment and storage medium
CN113220427A (en) * 2021-04-15 2021-08-06 远景智能国际私人投资有限公司 Task scheduling method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on transformer fault diagnosis and optimization based on a parallel variable prediction model; Ma Lijie et al.; 82-89 *

Also Published As

Publication number Publication date
CN113704340A (en) 2021-11-26
WO2023033726A2 (en) 2023-03-09
WO2023033726A3 (en) 2023-05-04

Similar Documents

Publication Publication Date Title
CN110442444B (en) Massive remote sensing image-oriented parallel data access method and system
US9886418B2 (en) Matrix operands for linear algebra operations
US11194762B2 (en) Spatial indexing using resilient distributed datasets
TWI709049B (en) Random walk, cluster-based random walk method, device and equipment
US10496659B2 (en) Database grouping set query
CN105843899B (en) A kind of big data automation analytic method for simplifying programming and system
CN114420215A (en) Large-scale biological data clustering method and system based on spanning tree
CN112364041A (en) Data processing method and device, computer equipment and storage medium
Luo et al. Big-data analytics: challenges, key technologies and prospects
CN110888972A (en) Sensitive content identification method and device based on Spark Streaming
CN112905596B (en) Data processing method, device, computer equipment and storage medium
CN113704340B (en) Data processing method, device, server and storage medium
CN116755637B (en) Transaction data storage method, device, equipment and medium
CN116719822B (en) Method and system for storing massive structured data
US20120143793A1 (en) Feature specification via semantic queries
Sabarad et al. Color and texture feature extraction using Apache Hadoop framework
CN112905635A (en) Service processing method, device, equipment and storage medium
CN113760950A (en) Index data query method and device, electronic equipment and storage medium
CN111262727A (en) Service capacity expansion method, device, equipment and storage medium
CN116450872B (en) Spark distributed vector grid turning method, system and equipment
CN116451005B (en) Spark-based distributed grid algebra operation method, system and equipment
CN117115380B (en) Multi-source spatial data processing method and system
CN114185890B (en) Database retrieval method and device, storage medium and electronic equipment
US20230214394A1 (en) Data search method and apparatus, electronic device and storage medium
CN115408491B (en) Text retrieval method and system for historical data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant