WO2023033726A2 - Method and apparatus for processing data, and server and storage medium thereof - Google Patents


Info

Publication number
WO2023033726A2
WO2023033726A2 (PCT/SG2022/050611)
Authority
WO
WIPO (PCT)
Prior art keywords
data
processed
acquiring
distributed
processing
Application number
PCT/SG2022/050611
Other languages
French (fr)
Other versions
WO2023033726A3 (en)
Inventor
Zhimeng WANG
Original Assignee
Envision Digital International Pte. Ltd.
Shanghai Envision Digital Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Envision Digital International Pte. Ltd., Shanghai Envision Digital Co., Ltd. filed Critical Envision Digital International Pte. Ltd.
Publication of WO2023033726A2 publication Critical patent/WO2023033726A2/en
Publication of WO2023033726A3 publication Critical patent/WO2023033726A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/258 Data format conversion from or to a database
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • the present disclosure relates to the field of computer and Internet technologies, and in particular, relates to a method and apparatus for processing data, and a server and a storage medium thereof.
  • distributed framework is an algorithm access framework written based on the distributed storage, computing, and machine learning framework that enables the algorithm to achieve distributed read-write and computing.
  • a single-machine algorithm running framework (hereinafter referred to as "single-machine framework") employed in the traditional algorithm research and application is an algorithm access framework written based on single-machine and relational databases that enables the algorithm to achieve single-machine read-write and computing.
  • the algorithm may achieve distributed read-write and computing
  • computing power that the distributed framework may achieve is far greater than computing power that the single-machine framework may achieve
  • the distributed framework is suitable for the service with a large data volume.
  • the development and application of the distributed framework have high requirements on the ability of the algorithm personnel, which not only requires that the algorithm personnel have necessary knowledge of the algorithm, but also requires that the algorithm personnel have the engineering capability for the development and application of the distributed framework. Therefore, the learning time cost of the algorithm personnel is increased, and the life cycle of the development of the distributed framework is prolonged, which is not conducive to popularizing and applying the distributed framework.
  • Embodiments of the present disclosure provide a method and apparatus for processing data, and a server and a storage medium thereof, which may decouple data and algorithm of a distributed framework, and reduce requirements of development and application of the distributed framework on the ability of algorithm personnel.
  • the technical solutions are as follows:
  • the embodiments of the present disclosure provide a method for processing data, applicable to a distributed framework.
  • the method includes: acquiring to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is resilient distributed dataset (RDD); acquiring converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm; acquiring processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework; and storing the processed data to the distributed database.
  • the embodiments of the present disclosure provide an apparatus for processing data, disposed in a distributed framework.
  • the apparatus includes:
  • a data acquiring module configured to acquire to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is RDD;
  • a format converting module configured to acquire converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm;
  • a data processing module configured to acquire processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework; and a data storage module configured to store the processed data to the distributed database.
  • the embodiments of the present disclosure provide a server.
  • the server includes a processor and a memory configured to store one or more computer programs. The processor, when loading and running the one or more computer programs, is caused to perform the method as described above.
  • the embodiments of the present disclosure provide a computer- readable storage medium, storing one or more computer programs, wherein the one or more computer programs, when loaded and run by a processor of a server, cause the server to perform the method as described above.
  • the embodiments of the present disclosure provide a computer program product, wherein the computer program product, when loaded and run by a processor of a server, causes the server to perform the method as described above.
  • the data structure of the to-be-analyzed data is converted, by the distributed framework, prior to processing the to-be-analyzed data, such that the converted data satisfies the data structure requirement corresponding to the single-machine algorithm to access the single-machine algorithm to the distributed framework.
  • the algorithm personnel develop the algorithm based on single-machine environment, and engineering personnel access the developed algorithm to the distributed framework.
  • the distributed framework processes the converted data by parallel processing using the single-machine algorithm at the data processing nodes, which takes full advantage of efficient data throughput, scheduling, and parallel capabilities of the distributed framework.
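The pipeline summarized above (convert, process in parallel with a single-machine algorithm, collect) can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the partitions are simulated as plain Python lists, and the names `format_convert` and `single_machine_algo` are assumptions introduced here.

```python
# Sketch: convert RDD-style partitions into the structure a single-machine
# algorithm expects, run the algorithm per partition in parallel, and
# collect one processed result per data processing node.
from concurrent.futures import ThreadPoolExecutor

def format_convert(partition):
    """Convert RDD-style row tuples into record dicts (a hypothetical
    data structure requirement of the single-machine algorithm)."""
    return [{"wf_id": wf, "wtg_id": wtg, "value": v} for wf, wtg, v in partition]

def single_machine_algo(records):
    """A stand-in single-machine algorithm: count and average the values."""
    mean = sum(r["value"] for r in records) / len(records)
    return {"count": len(records), "mean": mean}

def process_distributed(partitions):
    converted = [format_convert(p) for p in partitions]   # format conversion
    with ThreadPoolExecutor() as pool:                    # parallel processing
        results = list(pool.map(single_machine_algo, converted))
    return results                                        # to be fused and stored

partitions = [[("wf1", "t1", 2.0), ("wf1", "t2", 4.0)],
              [("wf2", "t1", 6.0)]]
results = process_distributed(partitions)
print(results)  # one processed result per data processing node
```

In an actual Spark-based framework the per-partition step would typically be expressed with `mapPartitions`; the thread pool here merely stands in for the cluster's parallelism.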
  • FIG. 1 is a schematic diagram of a scheduling mode according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a system for processing data according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a method for processing data according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a data slicing mode according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of another method for processing data according to an embodiment of the present disclosure.
  • FIG. 6 is a block diagram of an apparatus for processing data according to an embodiment of the present disclosure.
  • FIG. 7 is a block diagram of another apparatus for processing data according to an embodiment of the present disclosure.
  • FIG. 8 is a structure block diagram of a server according to an embodiment of the present disclosure.
  • a distributed framework is an algorithm access framework written based on the distributed storage, computing, and machine learning framework that enables an algorithm to achieve distributed read-write and computation.
  • a single-machine framework is an algorithm access framework written based on single-machine and relational databases that enables an algorithm to achieve single-machine read-write and computing.
  • modes related to data scheduling and computing of the distributed framework and the single-machine framework are shown in FIG. 1. Based on this, a summary shown in Table 1 may be acquired.
  • FIG. 2 is a schematic diagram of a system for processing data according to an embodiment of the present disclosure.
  • the system includes the distributed framework 10.
  • the distributed framework 10 refers to the algorithm access framework that is capable of achieving the distributed read-write and computation.
  • the distributed framework 10 includes a master node 20 and a server cluster composed of a plurality of servers 30. Each of the plurality of servers 30 may conduct corresponding computation based on data and algorithms sent from the master node 20, and return the computation result to the master node 20.
  • the distributed framework 10 may call data stored in a distributed database 40 and conduct computation on the called data, and computed results are still stored to the distributed database 40.
  • the distributed framework 10 is communicably connected to the distributed database 40 over a network.
  • the network may be a wired network or a wireless network.
  • the distributed database 40 includes any one of: Hive (a data warehouse tool based on Hadoop, configured for data extraction, conversion, and loading, which is a mechanism for storing, querying, and analyzing large-scale data stored in the Hadoop), and a Hadoop Distributed File System (HDFS).
  • the data structure of the data stored in the distributed database 40 is Resilient Distributed Dataset (RDD).
  • the data and the algorithm of the distributed framework are decoupled.
  • algorithm personnel develop the algorithm based on single-machine environment; and in another aspect, engineering personnel access the developed algorithm to the distributed framework. Therefore, in the embodiments of the present disclosure, the machine learning library used by the distributed framework during a process of computing is consistent with the machine learning library used by the single-machine framework during a process of computing.
  • the machine learning libraries used by the distributed framework during the process of computing include machine learning libraries matched with Python (e.g., Python Scikits Learn or pyspark LightGBM).
  • FIG. 3 is a flowchart of a method for processing data according to an embodiment of the present disclosure.
  • the method may be applicable to the distributed framework 10 of the system for processing data as described above.
  • the method may include the following steps.
  • step 310, to-be-analyzed data is acquired from a distributed database, wherein a data structure of the to-be-analyzed data is RDD.
  • the distributed database is a database with a distributed storage capability, such as Hive, HDFS, and the like.
  • the distributed framework may acquire the to-be-analyzed data from the distributed database, and the data structure of the to-be-analyzed data is RDD.
  • a timing at which the distributed framework acquires the to-be-analyzed data from the distributed database is not limited herein.
  • the distributed framework acquires the to-be-analyzed data from the distributed database when the data needs to be processed.
  • the distributed framework acquires the to-be-analyzed data from the distributed database at a preset time interval.
  • the distributed database actively pushes the to-be-analyzed data to the distributed framework at a preset time interval.
  • step 320, converted data is acquired by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm.
  • the distributed framework, in one aspect, may acquire to-be-analyzed data corresponding to the service demand from the distributed database; and in another aspect, may determine an algorithm employed to achieve the service demand.
  • because the algorithm determined by the distributed framework is developed by algorithm personnel based on a single-machine environment, the algorithm is referred to as the single-machine algorithm.
  • the distributed framework processes the to-be-analyzed data by format conversion upon acquiring the to-be-analyzed data, such that the data structure of the converted data satisfies the data structure requirement corresponding to the single-machine algorithm.
  • the distributed framework may also use the single-machine algorithm to process the converted data.
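The format-conversion step can be illustrated with a minimal helper. The column-oriented target structure and the column names are assumptions for illustration; the disclosure only requires that the converted data satisfy whatever structure the single-machine algorithm expects.

```python
def convert_partition(rows, columns):
    """Turn row tuples (the shape RDD data commonly takes) into a
    column-oriented dict, a structure many single-machine libraries accept."""
    table = {name: [] for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            table[name].append(value)
    return table

rows = [("wf1", "t1", 3.5), ("wf1", "t2", 4.1)]
table = convert_partition(rows, ["wf_id", "wtg_id", "power"])
print(table["power"])  # [3.5, 4.1]
```

In pyspark this conversion would typically run inside `mapPartitions`, so each data processing node converts only its own slice.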
  • step 330, processed data is acquired by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework.
  • the distributed framework includes a server cluster including a plurality of servers.
  • Each server in the server cluster may process data, and these servers thus constitute the data processing nodes in the distributed framework.
  • the servers are in one-to-one correspondence to the data processing nodes; or a plurality of servers correspond to one data processing node, which is not limited herein.
  • the distributed framework may process the converted data by parallel processing using the single-machine algorithm at the data processing nodes in the distributed framework, such that data throughput and parallel capabilities are efficient.
  • the distributed framework processes the converted data by parallel processing using the single-machine algorithm at all the data processing nodes in the distributed framework; or, the distributed framework processes the converted data by parallel processing using the single-machine algorithm at a part of the data processing nodes in the distributed framework, which is not limited herein.
  • the number of data processing nodes involved in computation may be determined by a size of the converted data, a fine degree of the service demand, and the like.
  • the distributed framework includes n data processing nodes, and n is an integer greater than 1; and step 330 includes: acquiring n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes, and acquiring the processed data by data fusion on the n processed results.
  • because the distributed framework distributes and collects data by an iterator, and each of the n data processing nodes generates a processed result, it is necessary to acquire the processed data by data fusion on the processed results respectively generated by the n data processing nodes, which ensures that the iterator returns the processed data to the distributed framework.
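The distribute-and-collect flow can be sketched as follows, with the iterator simulated by `map` and the fusion kept as a simple concatenation for illustration; the algorithm (squaring values) and all names are invented stand-ins, not the disclosed method.

```python
def single_machine_algo(data_slice):
    # hypothetical single-machine algorithm: square every value
    return [x * x for x in data_slice]

def process(slices):
    # the framework issues one slice per node and collects n processed results
    node_results = list(map(single_machine_algo, slices))
    # data fusion: merge the n results so the iterator returns one dataset
    return [y for result in node_results for y in result]

processed = process([[1, 2], [3], [4, 5]])   # n = 3 data processing nodes
print(processed)
```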
  • the single-machine algorithm and the converted data used by each of the data processing nodes are not limited in the embodiments of the present disclosure.
  • the n data processing nodes use the same single-machine algorithm, and the n data processing nodes use different converted data; or, the n data processing nodes use different single-machine algorithms, and the n data processing nodes use different converted data; or, the n data processing nodes use the same converted data, and the n data processing nodes use the same single-machine algorithm.
  • for the parallel processing performed by each of the data processing nodes, reference may be made to the following method embodiments, which is not repeated herein.
  • the number of data processing nodes in the distributed framework may be greater than or equal to n.
  • description is given only using a scenario where n data processing nodes among the data processing nodes in the distributed framework need to be used as an example.
  • the embodiments of the present disclosure are not limited thereto.
  • step 340, the processed data is stored to the distributed database.
  • Upon acquiring the processed data, the distributed framework returns the processed data to the distributed database for storage, to facilitate calling corresponding data from the distributed database for data analysis and the like subsequently.
  • the data structure of the to-be-analyzed data is converted prior to processing the to-be-analyzed data by the distributed framework, such that the converted data satisfies the data structure requirement corresponding to the single-machine algorithm to access the single-machine algorithm to the distributed framework.
  • the algorithm personnel develop the algorithm based on the single -machine environment, and the engineering personnel access the developed algorithm to the distributed framework.
  • the distributed framework processes the converted data by parallel processing using the single-machine algorithm at the data processing nodes, which takes full advantage of efficient data throughput, scheduling, and parallel capabilities of the distributed framework.
  • acquiring the n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes includes: acquiring n data slices by data slicing on the converted data in a target slicing mode; issuing the n data slices to the n data processing nodes; and acquiring an i-th processed result by processing an i-th data slice in the n data slices using the single-machine algorithm at the i-th data processing node in the n data processing nodes, wherein the n processed results include the i-th processed result, i being an integer less than or equal to n.
  • Upon acquiring the to-be-analyzed data from the distributed database, the distributed framework first acquires the converted data by format conversion on the to-be-analyzed data. Because the plurality of data processing nodes in the distributed framework participate in processing data, in the embodiments of the present disclosure, the distributed framework acquires the n data slices by data slicing on the converted data, and respectively issues the n data slices to the n data processing nodes in the distributed framework, so as to alleviate a processing overhead of each of the data processing nodes and accelerate the speed of data processing. Optionally, when the distributed framework issues the n data slices, the single-machine algorithms are issued simultaneously. At each of the data processing nodes, the distributed framework acquires the processed result by processing the data slice issued to the data processing node using the single-machine algorithm.
  • the distributed framework may slice the converted data based on the target slicing mode.
  • the to-be-analyzed data acquired by the distributed framework includes data of a fan.
  • the target slicing modes include at least one of: data slicing based on a wind field (for example, as illustrated in FIG. 4(a)); data slicing based on a fan (for example, as illustrated in FIG. 4(b)); and data slicing based on time (for example, as illustrated in FIG. 4(c)).
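The three slicing modes amount to grouping the records by different keys. A small grouping helper can illustrate this; the record fields (`wf_id`, `wtg_id`, `day`) are hypothetical names introduced here for the wind-field, fan, and time attributes.

```python
from collections import defaultdict

def slice_by(records, key):
    """Group records into data slices by a key function
    (wind field, fan, or time)."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)

records = [
    {"wf_id": "wf1", "wtg_id": "t1", "day": "2022-01-01", "power": 1.0},
    {"wf_id": "wf1", "wtg_id": "t2", "day": "2022-01-01", "power": 2.0},
    {"wf_id": "wf2", "wtg_id": "t1", "day": "2022-01-02", "power": 3.0},
]

by_wind_field = slice_by(records, lambda r: r["wf_id"])                 # FIG. 4(a)
by_fan = slice_by(records, lambda r: (r["wf_id"], r["wtg_id"]))         # FIG. 4(b)
by_time = slice_by(records, lambda r: r["day"])                         # FIG. 4(c)
print(sorted(by_wind_field))  # ['wf1', 'wf2']
```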
  • the distributed framework needs to issue the n data slices to the n data processing nodes.
  • the distributed framework may randomly select n data processing nodes from the included data processing nodes, and randomly issue the n data slices to the n data processing nodes, such that an efficiency in issuing the data slices is improved.
  • issuing the n data slices to the n data processing nodes includes: determining data sizes respectively corresponding to the n data slices; acquiring processing capabilities respectively corresponding to the n data processing nodes; and issuing the n data slices to the n data processing nodes, based on the data sizes respectively corresponding to the n data slices and the processing capabilities respectively corresponding to the n data processing nodes.
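One way to realize this size/capability matching is a greedy assignment: largest slices first, each to the node where it adds the least relative load. The disclosure does not fix a particular strategy, so this sketch is only one plausible reading; the node names and capacity numbers are invented.

```python
def issue_slices(slice_sizes, node_capacities):
    """Assign each data slice (by index) to a data processing node,
    balancing size against the node's relative processing capability."""
    load = {node: 0.0 for node in node_capacities}
    assignment = {}
    # issue the largest slices first
    for idx in sorted(range(len(slice_sizes)), key=lambda i: -slice_sizes[i]):
        # pick the node whose relative load would stay lowest
        node = min(load,
                   key=lambda n: (load[n] + slice_sizes[idx]) / node_capacities[n])
        assignment[idx] = node
        load[node] += slice_sizes[idx]
    return assignment

sizes = [100, 40, 60]                                   # data sizes of the slices
caps = {"node_a": 2.0, "node_b": 1.0, "node_c": 1.0}    # relative processing power
plan = issue_slices(sizes, caps)
print(plan)  # slice index -> node
```

With these numbers, the largest slice lands on the strongest node and the rest spread across the remaining nodes.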
  • the technical solution according to the embodiments of the present disclosure acquires the plurality of data slices by data slicing on the to-be-analyzed data, and then respectively issues the plurality of data slices to the plurality of data processing nodes in the distributed framework, which ensures that each of the data processing nodes processes the issued data slice using the single-machine algorithm. Due to data slicing, a data volume required to be processed by each of the data processing nodes is reduced. In this way, the processing overhead of each of data processing nodes is alleviated, and the speed of data processing is accelerated.
  • the data fusion includes splicing.
  • description is given below using a scenario where forms of the processed results include a table as an example.
  • the data fusion may be performed on the plurality of processed results by splicing the processed results in rows or in columns.
  • n processed results include results listed in Table 2 and Table 3.
  • redundant bits are generated in the process of splicing in rows or columns, and these redundant bits occupy the storage space, which thus causes a waste of a storage resource of the distributed database. Moreover, as illustrated in Table 7, in the process of splicing in columns, data labels (column names herein) of a part of the processed results are abandoned, which is not conducive to subsequent standard management and retrieval.
  • the embodiments of the present disclosure provide a method for data fusion, which may solve the above problem. Description is given to the method for data fusion below.
  • acquiring the processed data by data fusion on the n processed results includes: acquiring virtualized labels by virtualizing data labels of the n processed results; acquiring n dictionary results by dictionary processing on the n processed results, based on the virtualized labels; and acquiring the processed data by data integration on the n dictionary results.
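These three fusion steps can be sketched as follows, using the c0, c1, ... virtual-label scheme described in the disclosure. The table shapes and field names are invented for illustration, and the dictionary/integration representation here is one plausible reading of the text, not the claimed implementation.

```python
def virtualize_labels(table):
    """Replace concrete column names with positional virtual labels c0, c1, ..."""
    names = list(table.keys())
    mapping = {name: f"c{i}" for i, name in enumerate(names)}
    virtual = {mapping[name]: values for name, values in table.items()}
    return virtual, mapping

def fuse(results):
    """Dictionary processing + data integration: each result becomes row
    dicts keyed by virtual labels, so tables with different column names
    can be integrated under one schema without redundant padding."""
    fused = []
    for table in results:
        virtual, _ = virtualize_labels(table)
        n_rows = len(next(iter(virtual.values())))
        for i in range(n_rows):
            fused.append({label: col[i] for label, col in virtual.items()})
    return fused

# two processed results with different column names (hypothetical data)
r1 = {"wf_id": ["wf1", "wf1"], "power": [1.0, 2.0]}
r2 = {"wtg_id": ["t9"], "temp": [30.5]}
processed = fuse([r1, r2])
print(processed)
```

Because both results map onto the same virtual labels, no empty padding cells are needed, which matches the stated goal of saving storage in the distributed database.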
  • The data labels are configured to identify data of the processed results, and generally indicate meanings of the data.
  • forms of the processed results include a table including one or more columns of data, and the data labels of the processed results include names of one or more columns of the table; or forms of the processed results include a table including one or more rows of data, and the data labels of the processed results include names of one or more rows of the table.
  • the distributed framework may acquire the virtualized labels by virtualizing the data labels of the n processed results.
  • the n processed results include results listed in Table 2 and Table 3; a column name "wf id" may be virtualized to "c0", and a column name "wtg id" may be virtualized to "c1", wherein c0 and c1 are the virtualized labels.
  • the distributed framework may acquire the n dictionary results by dictionary processing on the n processed results.
  • n processed results include results listed in Table 2 and Table 3
  • a result of Table 2 upon dictionary processing is as follows:
  • a result of Table 3 upon dictionary processing is as follows:
  • the distributed framework acquires the processed data by data integration on the dictionary results, mapping the dictionary results to a new data structure.
  • processed results shown in Table 8 may be acquired by data integration on the results listed in Table 2 and Table 3 upon dictionary processing.
  • the distributed framework may collect the processed results by an iterator. Moreover, in the embodiments of the present disclosure, the distributed framework acquires the virtualized labels by virtualizing the data labels of the processed results, then acquires the dictionary results by dictionary processing on the processed results, based on the virtualized labels, and then acquires the processed data by data integration on the dictionary results. In this way, storage space required for the processed results is reduced, and the storage resource of the distributed database is saved, while facilitating management of the processed results.
  • FIG. 5 is a schematic diagram of another method for processing data according to an embodiment of the present disclosure.
  • the method may be applicable to the distributed framework 10 of the system for processing data as described above.
  • the distributed framework acquires to-be-analyzed data from Hive.
  • a data structure of the to-be-analyzed data is RDD.
  • the distributed framework needs to acquire converted data by format conversion on the to-be-analyzed data upon acquiring the to-be-analyzed data.
  • a data structure of the converted data satisfies a data structure requirement corresponding to the single-machine algorithm.
  • the distributed framework acquires a plurality of data slices by data slicing on the converted data. Afterwards, referring to FIG. 5, the distributed framework issues the data slices and the algorithms to the data processing nodes.
  • a machine learning library that the distributed framework needs to call is consistent with a machine learning library that a single-machine framework needs to call, in a process of data processing.
  • the machine learning libraries that the distributed framework needs to call include machine learning libraries matched with Python, such as Scikits Learn and pyspark LightGBM.
  • Each of the data processing nodes may acquire a corresponding processed result.
  • the distributed framework acquires the processed data by data fusion on the processed results acquired by the data processing nodes, and stores the processed data to the distributed database.
  • The apparatus is applicable to the method embodiments of the present disclosure.
  • FIG. 6 is a block diagram of an apparatus 600 for processing data according to an embodiment of the present disclosure.
  • the apparatus 600 has a function of achieving the above method embodiment, and the function may be implemented by hardware, or may be realized by corresponding software executed by hardware.
  • the apparatus 600 may be the server described above, or may be disposed in the server described above.
  • the apparatus 600 may include: a data acquiring module 610, a format converting module 620, a data processing module 630, and a data storage module 640.
  • The data acquiring module 610 is configured to acquire to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is RDD.
  • The format converting module 620 is configured to acquire converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm.
  • the data processing module 630 is configured to acquire processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in a distributed framework.
  • the data storage module 640 is configured to store the processed data to the distributed database.
  • the distributed framework includes n data processing nodes, wherein n is an integer greater than 1.
  • the data processing module 630 includes: a data processing unit 632, configured to acquire n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes; and a data fusion unit 634, configured to acquire the processed data by data fusion on the n processed results.
  • the data fusion unit 634 is configured to: acquire virtualized labels by virtualizing data labels of the n processed results; acquire n dictionary results by dictionary processing on the n processed results based on the virtualized labels; and acquire the processed data by data integration on the n dictionary results.
  • forms of the processed results include a table at least including one or more columns of data; and the data labels of the processed results include names of one or more columns of the table.
  • forms of the processed results include a table at least including one or more rows of data; and the data labels of the processed results include names of one or more rows of the table.
  • the data processing unit 632 is configured to: acquire n data slices by data slicing on the converted data in a target slicing mode; issue the n data slices to the n data processing nodes; and acquire an i-th processed result by processing an i-th data slice in the n data slices using the single-machine algorithm at the i-th data processing node in the n data processing nodes, wherein the n processed results include the i-th processed result, i being an integer less than or equal to n.
  • issuing the n data slices to the n data processing nodes includes: determining data sizes respectively corresponding to the n data slices; acquiring processing capabilities respectively corresponding to the n data processing nodes; and issuing the n data slices to the n data processing nodes based on the data sizes respectively corresponding to the n data slices and the processing capabilities respectively corresponding to the n data processing nodes.
  • the to-be-analyzed data includes data of a fan; and the target slicing modes include at least one of: data slicing based on a wind field, data slicing based on a fan, and data slicing based on time.
  • the technical solution according to the embodiments of the present disclosure converts, by the distributed framework, the data structure of the to-be-analyzed data prior to processing the to-be-analyzed data, such that the converted data satisfies the data structure requirement corresponding to the single-machine algorithm to access the single-machine algorithm to the distributed framework.
  • the distributed framework processes the converted data by parallel processing using the single-machine algorithm at the data processing nodes, which takes full advantage of efficient data throughput, scheduling, and parallel capabilities of the distributed framework.
  • FIG. 8 is a structure block diagram of a server according to an embodiment of the present disclosure.
  • the server may be configured to perform the method described above.
  • the server 800 includes a processing unit 801 including a central processing unit (CPU), a graphics processing unit (GPU), or a field programmable gate array (FPGA), a system memory 804 including a random-access memory (RAM) 802 or a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 to the central processing unit 801.
  • the server 800 further includes an input/output (I/O) system 806 that facilitates information transfer between devices in the server, and a mass storage device 807 configured to store an operating system 813, an application program 814, and a program module 815.
  • the I/O system 806 includes a display 808 configured to display information, and an input device 809 such as a mouse, a keyboard, and the like, configured for users to input information.
  • the display 808 and the input device 809 are both connected to the central processing unit 801 by an input and output controller 810 connected to the system bus 805.
  • the I/O system 806 may further include the input and output controller 810 to receive and process input from a plurality of other devices such as the keyboard, the mouse, an electronic stylus, and the like.
  • the input and output controller 810 may further provide outputs to other output devices such as a display screen, a printer, and the like.
  • the mass storage device 807 is connected to the central processing unit 801 by a mass storage controller (not shown) connected to the system bus 805.
  • the mass storage device 807 and related computer-readable medium provide a nonvolatile storage. That is, the mass storage device 807 may include computer-readable medium (not shown) such as a hard disk, a compact disc read-only memory (CD-ROM), or a driver.
  • the computer-readable medium may include computer storage medium and communication medium.
  • the computer storage medium includes volatile and nonvolatile, and removable and unremovable medium, which is implemented by any method or technology configured to store information such as a computer-readable instruction, a data structure, a program module, or other data.
  • the computer storage medium includes a RAM, a ROM, an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other solid-state memories, a CD-ROM, a digital video disc (DVD), or other optical storages, a magnetic cassette, a magnetic tape, a disk storage or other magnetic storage devices.
  • a person skilled in the art knows that the computer storage medium is not limited to the above.
  • the system memory 804 and the mass storage device 807 are collectively referred to as a memory.
  • the server 800 may be connected to a remote computer for running over a network such as the Internet. That is, the server 800 may be connected to the network 812 by a network interface unit 811 connected to the system bus 805. Alternatively, the network interface unit 811 may be connected to other types of networks or remote computer systems (not shown).
  • the memory further includes one or more computer programs stored in the memory.
  • the one or more computer programs are configured to be loaded and run by one or more processors to perform the method described above.
  • An embodiment of the present disclosure provides a computer-readable storage medium storing one or more computer programs, wherein the one-or-more computer programs, when loaded and run by a processor of a server, cause the server to perform the method described above.
  • An exemplary embodiment provides a computer program product, wherein the computer program product, when loaded and run by a processor of a server, causes the server to perform the method as described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

Disclosed are a method and apparatus for processing data, and a server and a storage medium thereof. The method includes: acquiring to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is RDD; acquiring converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm; acquiring processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in a distributed framework; and storing the processed data in the distributed database.

Description

METHOD AND APPARATUS FOR PROCESSING DATA, AND SERVER AND STORAGE MEDIUM THEREOF
TECHNICAL FIELD
[0001] The present disclosure relates to the field of computer and Internet technologies, and in particular, relates to a method and apparatus for processing data, and a server and a storage medium thereof.
BACKGROUND
[0002] Traditional algorithm research and application are generally based on a single server (which is referred to as "single-machine" in the embodiments of the present disclosure). However, with an accumulation of the service data volume and a growth of the real-time service data volume, it is difficult for computing power of one or even more servers to satisfy the requirement of the service growth.
[0003] To solve this technical problem, algorithm personnel have researched and proposed a distributed algorithm running framework (hereinafter referred to as "distributed framework"). The distributed framework is an algorithm access framework written based on the distributed storage, computing, and machine learning framework that enables the algorithm to achieve distributed read-write and computing. Correspondingly, a single-machine algorithm running framework (hereinafter referred to as "single-machine framework") employed in the traditional algorithm research and application is an algorithm access framework written based on single-machine and relational databases that enables the algorithm to achieve single-machine read-write and computing. Obviously, because in the distributed framework the algorithm may achieve distributed read-write and computing, the computing power that the distributed framework may achieve is far greater than the computing power that the single-machine framework may achieve, and the distributed framework is suitable for services with a large data volume. For example, with the popularization of big data and machine learning in the field of fans, it is convenient to mine effective information, by employing the distributed framework, from massive information accumulated during an operation of the fan to detect an operation state of the fan and to diagnose faults of the fan. [0004] However, the development and application of the distributed framework have high requirements on the ability of the algorithm personnel, which not only require that the algorithm personnel have the necessary knowledge of the algorithm, but also require that the algorithm personnel have the engineering capability of developing and applying the distributed framework.
Therefore, the learning time cost of the algorithm personnel is increased, and the life cycle of the development of the distributed framework is prolonged, which is not conducive to popularizing and applying the distributed framework.
SUMMARY
[0005] Embodiments of the present disclosure provide a method and apparatus for processing data, and a server and a storage medium thereof, which may decouple data and algorithm of a distributed framework, and reduce requirements of development and application of the distributed framework on the ability of algorithm personnel. The technical solutions are as follows:
[0006] In one aspect, the embodiments of the present disclosure provide a method for processing data, applicable to a distributed framework. The method includes:
[0007] acquiring to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is RDD;
[0008] acquiring converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm;
[0009] acquiring processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework; and
[0010] storing the processed data in the distributed database.
[0011] In another aspect, the embodiments of the present disclosure provide an apparatus for processing data, disposed in a distributed framework. The apparatus includes:
[0012] a data acquiring module, configured to acquire to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is RDD;
[0013] a format converting module, configured to acquire converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm;
[0014] a data processing module, configured to acquire processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework; and [0015] a data storage module, configured to store the processed data to the distributed database. [0016] In still another aspect, the embodiments of the present disclosure provide a server. The server includes a processor and a memory configured to store one or more computer programs. The processor, when loading and running the one or more computer programs, is caused to perform the method as described above.
[0017] In yet still another aspect, the embodiments of the present disclosure provide a computer- readable storage medium, storing one or more computer programs, wherein the one or more computer programs, when loaded and run by a processor of a server, cause the server to perform the method as described above.
[0018] In yet still another aspect, the embodiments of the present disclosure provide a computer program product, wherein the computer program product, when loaded and run by a processor of a server, causes the server to perform the method as described above.
[0019] The technical solutions according to the embodiments of the present disclosure achieve at least the following beneficial effects:
[0020] The data structure of the to-be-analyzed data is converted, by the distributed framework, prior to processing the to-be-analyzed data, such that the converted data satisfies the data structure requirement corresponding to the single-machine algorithm to access the single-machine algorithm to the distributed framework. Moreover, because it is the single-machine algorithm that is accessed to the distributed framework, the algorithm personnel develop the algorithm based on single-machine environment, and engineering personnel access the developed algorithm to the distributed framework. In this way, the data and the algorithm of the distributed framework are decoupled, the requirements of development and application of the distributed framework on the ability of the algorithm personnel are lowered, the learning cost of the algorithm personnel is reduced, the life cycle of the development of the distributed framework is shortened, and thus popularization and application of the distributed framework are facilitated. In addition, in the embodiments of the present disclosure, the distributed framework processes the converted data by parallel processing using the single-machine algorithm at the data processing nodes, which takes full advantage of efficient data throughput, scheduling, and parallel capabilities of the distributed framework.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] For clearer descriptions of the technical solutions in the embodiments of the present disclosure, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
[0022] FIG. 1 is a schematic diagram of a scheduling mode according to an embodiment of the present disclosure;
[0023] FIG. 2 is a schematic diagram of a system for processing data according to an embodiment of the present disclosure;
[0024] FIG. 3 is a flowchart of a method for processing data according to an embodiment of the present disclosure;
[0025] FIG. 4 is a schematic diagram of a data slicing mode according to an embodiment of the present disclosure;
[0026] FIG. 5 is a schematic diagram of another method for processing data according to an embodiment of the present disclosure;
[0027] FIG. 6 is a block diagram of an apparatus for processing data according to an embodiment of the present disclosure;
[0028] FIG. 7 is a block diagram of another apparatus for processing data according to an embodiment of the present disclosure; and
[0029] FIG. 8 is a structure block diagram of a server according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
[0030] The present disclosure will be described in further detail with reference to the enclosed drawings, to clearly present the objects, technical solutions, and advantages of the present disclosure.
[0031] The terms involved in the embodiments of the present disclosure are described hereinafter first.
[0032] A distributed framework is an algorithm access framework written based on the distributed storage, computing, and machine learning framework that enables an algorithm to achieve distributed read-write and computation. A single-machine framework is an algorithm access framework written based on single-machine and relational databases that enables an algorithm to achieve single-machine read-write and computing. In an example, modes related to data scheduling and computing of the distributed framework and the single-machine framework are shown in FIG. 1. Based on this, a summary shown in Table 1 may be acquired.
Table 1 Comparison of scheduling modes between the distributed framework and the singlemachine framework
[0033] FIG. 2 is a schematic diagram of a system for processing data according to an embodiment of the present disclosure. Referring to FIG. 2, the system includes the distributed framework 10.
[0034] The distributed framework 10 refers to the algorithm access framework that is capable of achieving the distributed read-write and computation. In the embodiments of the present disclosure, the distributed framework 10 includes a master node 20 and a server cluster composed of a plurality of servers 30. Each of the plurality of servers 30 may conduct corresponding computation based on data and algorithms sent from the master node 20, and return the computation result to the master node 20.
[0035] In an example, as illustrated in FIG. 2, the distributed framework 10 may call data stored in a distributed database 40 and conduct computation on the called data, and computed results are still stored to the distributed database 40. Optionally, the distributed framework 10 is communicably connected to the distributed database 40 over a network. The network may be a wired network or a wireless network.
[0036] Optionally, the distributed database 40 includes any one of: Hive (a data warehouse tool based on Hadoop, configured for data extraction, conversion, and loading, which is a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop), and a Hadoop Distributed File System (HDFS). Optionally, the data structure of the data stored in the distributed database 40 is Resilient Distributed Dataset (RDD).
[0037] In the method for processing data according to the embodiments of the present disclosure, the data and the algorithm of the distributed framework are decoupled. In one aspect, algorithm personnel develop the algorithm based on single-machine environment; and in another aspect, engineering personnel access the developed algorithm to the distributed framework. Therefore, in the embodiments of the present disclosure, the machine learning library used by the distributed framework during a process of computing is consistent with the machine learning library used by the single-machine framework during a process of computing. For example, the machine learning libraries used by the distributed framework during the process of computing include machine learning libraries compatible with Python (e.g., scikit-learn or pyspark LightGBM).
[0038] Description is given to the method for processing data according to the embodiments of the present disclosure with several examples below.
[0039] FIG. 3 is a flowchart of a method for processing data according to an embodiment of the present disclosure. Referring to FIG. 3, the method may be applicable to the distributed framework 10 of the system for processing data as described above. The method may include the following steps.
[0040] In step 310, to-be-analyzed data is acquired from a distributed database, wherein a data structure of the to-be-analyzed data is RDD.
[0041] The distributed database is a database with a distributed storage capability, such as Hive, HDFS, and the like. The distributed framework may acquire the to-be-analyzed data from the distributed database, and the data structure of the to-be-analyzed data is RDD. A timing at which the distributed framework acquires the to-be-analyzed data from the distributed database is not limited herein. In an example, the distributed framework acquires the to-be-analyzed data from the distributed database when the data needs to be processed. In another example, the distributed framework acquires the to-be-analyzed data from the distributed database at a preset time interval. In still another example, the distributed database actively pushes the to-be-analyzed data to the distributed framework at a preset time interval.
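The second acquisition timing described in paragraph [0041] — pulling at a preset time interval — can be sketched in plain Python. The function and parameter names below are illustrative assumptions, not part of the disclosure, and `fetch` stands in for a real query against Hive or HDFS.

```python
import time

def poll_to_be_analyzed(fetch, interval_s, rounds):
    """Acquire the to-be-analyzed data from the distributed database at a
    preset time interval (hypothetical helper; `fetch` is a stand-in for a
    distributed-database query)."""
    batches = []
    for _ in range(rounds):
        batches.append(fetch())  # one pull of to-be-analyzed data
        time.sleep(interval_s)
    return batches
```

In a real deployment the loop would run indefinitely; a bounded round count is used here only to keep the sketch self-contained.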
[0042] In step 320, converted data is acquired by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm.
[0043] The distributed framework, based on actual service demand, on the one hand, may acquire to-be-analyzed data corresponding to the service demand from the distributed database; on the other hand, may determine an algorithm employed to achieve the service demand. In the embodiments of the present disclosure, because the algorithm determined by the distributed framework is developed by algorithm personnel based on single-machine environment, the algorithm is referred to as the single-machine algorithm.
[0044] Because the single-machine algorithm is developed before the single-machine algorithm is accessed to the distributed framework, the data structure requirement corresponding to the single-machine algorithm is determined. In order to enable the distributed framework to use the single-machine algorithm to process the to-be-analyzed data, in the embodiments of the present disclosure, the distributed framework processes the to-be-analyzed data by format conversion upon acquiring the to-be-analyzed data, such that the data structure of the converted data satisfies the data structure requirement corresponding to the single-machine algorithm. Afterwards, the distributed framework may then use the single-machine algorithm to process the converted data.
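As a rough illustration of the format conversion in paragraph [0044], the sketch below turns RDD-style row records into a column-oriented structure of the kind a single-machine algorithm might expect. The row/column layout and the name `to_columnar` are assumptions made for illustration only.

```python
def to_columnar(rows):
    """Convert an iterable of row dicts (as an RDD partition might yield)
    into a dict of columns that a single-machine algorithm can consume."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns
```

In practice the conversion target would be whatever structure the accessed single-machine algorithm requires, for example a pandas DataFrame.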
[0045] In step 330, processed data is acquired by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework.
[0046] The distributed framework includes a server cluster including a plurality of servers. Each server in the server cluster may process data, and the servers thus form the data processing nodes in the distributed framework. Optionally, the servers are in one-to-one correspondence to the data processing nodes; or the plurality of servers correspond to one of the data processing nodes, which is not limited herein.
[0047] The distributed framework may process the converted data by parallel processing using the single-machine algorithm at the data processing nodes in the distributed framework, such that data throughput and parallel capabilities are efficient. Optionally, the distributed framework processes the converted data by parallel processing using the single-machine algorithm at all the data processing nodes in the distributed framework; or, the distributed framework processes the converted data by parallel processing using the single-machine algorithm at a part of the data processing nodes in the distributed framework, which is not limited herein. In practice, the number of data processing nodes involved in computation may be determined by a size of the converted data, the granularity of the service demand, and the like.
[0048] In an example, the distributed framework includes n data processing nodes, and n is an integer greater than 1; and step 330 includes: acquiring n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes, and acquiring the processed data by data fusion on the n processed results.
[0049] Because the distributed framework distributes and collects data by an iterator, and each of the n data processing nodes generates the processed result, it is necessary to acquire the processed data by data fusion on the processed results respectively generated by the n data processing nodes, which ensures that the iterator returns the processed data to the distributed framework. For other descriptions of the data fusion processed by the distributed framework, reference may be made to the following method embodiments, which are not repeated herein.
[0050] The single-machine algorithm and the converted data used by each of the data processing nodes are not limited in the embodiments of the present disclosure. Optionally, the n data processing nodes use the same single-machine algorithm, and the n data processing nodes use different converted data; or, the n data processing nodes use different single-machine algorithms, and the n data processing nodes use different converted data; or, the n data processing nodes use the same converted data, and the n data processing nodes use different single-machine algorithms. For other descriptions of the parallel processing processed by each of the data processing nodes, reference may be made to the following method embodiments, which are not repeated herein.
[0051] It should be understood that the number of data processing nodes in the distributed framework may be greater than or equal to n. In this example, description is given only using a scenario where the n data processing nodes of the data processing nodes in the distributed framework need to be used as an example. However, the embodiments of the present disclosure are not limited thereto.
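The step-330 decomposition in paragraph [0048] — run the unchanged single-machine algorithm over each slice of the converted data and collect the n processed results — can be imitated on one machine with a thread pool. This is only a stand-in for the distributed scheduler; all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_parallel(data_slices, single_machine_algorithm):
    """Apply the single-machine algorithm to every data slice concurrently
    and collect the n processed results for later data fusion."""
    with ThreadPoolExecutor(max_workers=len(data_slices)) as pool:
        return list(pool.map(single_machine_algorithm, data_slices))
```

In a Spark-style framework the same shape would typically be expressed with a per-partition transformation rather than threads.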
[0052] In step 340, the processed data is stored to the distributed database.
[0053] Upon acquiring the processed data, the distributed framework returns the processed data to the distributed database for storage to facilitate calling corresponding data from the distributed database for data analysis and the like subsequently.
[0054] In summary, in the technical solution according to the embodiments of the present disclosure, the data structure of the to-be-analyzed data is converted prior to processing the to-be-analyzed data by the distributed framework, such that the converted data satisfies the data structure requirement corresponding to the single-machine algorithm to access the single-machine algorithm to the distributed framework. Moreover, because it is the single-machine algorithm that is accessed to the distributed framework, the algorithm personnel develop the algorithm based on the single-machine environment, and the engineering personnel access the developed algorithm to the distributed framework. In this way, the data and the algorithm of the distributed framework are decoupled, the requirements of development and application of the distributed framework on the ability of the algorithm personnel are lowered, the learning cost of the algorithm personnel is reduced, the life cycle of the development of the distributed framework is shortened, and thus popularization and application of the distributed framework are facilitated. In addition, in the embodiments of the present disclosure, the distributed framework processes the converted data by parallel processing using the single-machine algorithm at the data processing nodes, which takes full advantage of efficient data throughput, scheduling, and parallel capabilities of the distributed framework.
[0055] In an example, acquiring the n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes includes: acquiring n data slices by data slicing on the converted data in a target slicing mode; issuing the n data slices to the n data processing nodes; and acquiring an i-th processed result by processing an i-th data slice in the n data slices using the single-machine algorithm at the i-th data processing node in the n data processing nodes, wherein the n processed results include the i-th processed result, i being an integer less than or equal to n. [0056] Upon acquiring the to-be-analyzed data from the distributed database, the distributed framework first acquires the converted data by format conversion on the to-be-analyzed data. Because in the distributed framework, the plurality of data processing nodes participate in processing data, in the embodiments of the present disclosure, the distributed framework acquires the n data slices by data slicing on the converted data, and respectively issues the n data slices to the n data processing nodes in the distributed framework, so as to alleviate a processing overhead of each of the data processing nodes and accelerate the speed of data processing. Optionally, when the distributed framework issues the n data slices, the single-machine algorithms are issued simultaneously. At each of the data processing nodes, the distributed framework acquires the processed result by processing the data slice issued to the data processing node using the single-machine algorithm.
[0057] In the embodiments of the present disclosure, the distributed framework may slice the converted data based on the target slicing mode. Taking a scenario where the technical solution according to the embodiments of the present disclosure is applied in the field of wind energy as an example, the to-be-analyzed data acquired by the distributed framework includes data of a fan. In an example, as illustrated in FIG. 4, the target slicing modes include at least one of: data slicing based on a wind field, for example, as illustrated in FIG. 4(a), data slicing based on a fan, for example, as illustrated in FIG. 4(b), and data slicing based on time, for example, as illustrated in FIG. 4(c).
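The three target slicing modes of FIG. 4 all amount to grouping the fan records by some key — the wind field, the individual fan, or a time bucket. A minimal sketch follows; field names such as `fan_id` are assumptions, not taken from the disclosure.

```python
def slice_records(records, key):
    """Slice converted data by grouping records on a chosen field,
    e.g. 'wind_field', 'fan_id', or a precomputed time bucket."""
    slices = {}
    for record in records:
        slices.setdefault(record[key], []).append(record)
    return slices
```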
[0058] Taking a scenario where one of the data processing nodes processes one of the data slices as an example, the distributed framework needs to issue the n data slices to the n data processing nodes. In an example, the distributed framework may randomly select n data processing nodes from the included data processing nodes, and randomly issue the n data slices to the n data processing nodes, such that an efficiency in issuing the data slices is improved. In another example, issuing the n data slices to the n data processing nodes includes: determining data sizes respectively corresponding to the n data slices; acquiring processing capabilities respectively corresponding to the n data processing nodes; and issuing the n data slices to the n data processing nodes, based on the data sizes respectively corresponding to the n data slices and the processing capabilities respectively corresponding to the n data processing nodes. By issuing the data slices according to the processing capability of each of the data processing nodes and the data size of each of the data slices, the size of the data slice is ensured to adapt to the processing capability of the data processing node, and thus an effect in processing data is improved.
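One plausible reading of the capability-aware issuing in paragraph [0058] is a greedy matching: sort slices by size, sort nodes by processing capability, and pair them off so larger slices go to more capable nodes. The disclosure does not fix a concrete algorithm, so this sketch is an assumption.

```python
def assign_slices(slice_sizes, node_capabilities):
    """Issue the k-th largest data slice to the k-th most capable node,
    so slice size adapts to node processing capability (greedy pairing).
    Returns a {slice_index: node_index} mapping."""
    slices_desc = sorted(range(len(slice_sizes)), key=lambda i: -slice_sizes[i])
    nodes_desc = sorted(range(len(node_capabilities)), key=lambda j: -node_capabilities[j])
    return dict(zip(slices_desc, nodes_desc))
```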
[0059] In summary, the technical solution according to the embodiments of the present disclosure acquires the plurality of data slices by data slicing on the to-be-analyzed data, and then respectively issues the plurality of data slices to the plurality of data processing nodes in the distributed framework, which ensures that each of the data processing nodes processes the issued data slice using the single-machine algorithm. Due to data slicing, a data volume required to be processed by each of the data processing nodes is reduced. In this way, the processing overhead of each of the data processing nodes is alleviated, and the speed of data processing is accelerated.
[0060] In general, the data fusion includes splicing. Taking a scenario where forms of the processed results include a table as an example, the plurality of processed results may be fused by splicing in rows or columns.
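The two splicing styles in paragraph [0060] can be sketched over column-oriented tables. The `None` padding in the row splice makes visible the redundant bits that the disclosure later criticizes; the dict-of-columns representation is an assumption for illustration.

```python
def splice_columns(left, right):
    """Column-wise splice: place two tables side by side."""
    merged = dict(left)
    merged.update(right)
    return merged

def splice_rows(top, bottom):
    """Row-wise splice: stack two tables; columns absent on one side are
    padded with None, producing redundant bits that waste storage."""
    n_top = len(next(iter(top.values())))
    n_bottom = len(next(iter(bottom.values())))
    return {name: top.get(name, [None] * n_top) + bottom.get(name, [None] * n_bottom)
            for name in set(top) | set(bottom)}
```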
[0061] Exemplarily, the n processed results include results listed in Table 2 and Table 3.
Table 2 Wens_wtg_10m
Table 3 Wens_wtg_info
[0062] In a case that the results in Table 2 and Table 3 are spliced in columns, a spliced result shown in the following Table 4 may be acquired.
Table 4 A result spliced in columns
[0063] In another example, the n processed results include results listed in Table 5 and Table 6.
Table 5 Wens_wtg_10m
Table 6 Wens_wtg_info
[0064] In a case that the results in Table 5 and Table 6 are spliced in rows, a spliced result shown in the following Table 7 may be acquired.
Table 7 A result spliced in rows
[0065] Redundant bits are introduced in the process of splicing in rows or columns, and these redundant bits occupy the storage space, thus causing a waste of a storage resource of the distributed database. Moreover, as illustrated in Table 7, in the process of splicing in rows, data labels (column names herein) of a part of the processed results are abandoned, which is not conducive to subsequent standard management and retrieval.
[0066] Based on this, the embodiments of the present disclosure provide a method for data fusion, which may solve the above problem. Description is given to the method for data fusion below.
[0067] In an example, acquiring the processed data by data fusion on the n processed results includes: acquiring virtualized labels by virtualizing data labels of the n processed results; acquiring n dictionary results by dictionary processing on the n processed results, based on the virtualized labels; and acquiring the processed data by data integration on the n dictionary results. [0068] The data labels are configured to identify data of the processed results, and generally indicate meanings of the data. In an example, forms of the processed results include a table including one or more columns of data, and the data labels of the processed results include names of one or more columns of the table; or forms of the processed results include a table including one or more rows of data, and the data labels of the processed results include names of one or more rows of the table.
[0069] Because each of the data processing nodes processes different data slices, the processed results acquired by each of the data processing nodes may be different, and the data labels of the processed results are also different. To facilitate management and reduce a size of the data label, in the embodiments of the present disclosure, the distributed framework may acquire the virtualized labels by virtualizing the data labels of the n processed results. For example, in a case that the n processed results include results listed in Table 2 and Table 3, a column name "wf_id" may be virtualized to "c0," and a column name "wtg_id" may be virtualized to "c1," wherein c0 and c1 are the virtualized labels.
[0070] Based on the virtualized labels, the distributed framework may acquire the n dictionary results by dictionary processing on the n processed results. For example, in a case that the n processed results include results listed in Table 2 and Table 3, a result of Table 2 upon dictionary processing is as follows:
[Table image: result of Table 2 upon dictionary processing]
[0071] A result of Table 3 upon dictionary processing is as follows:
{'c0': '34.31365', 'c1': '120.019'}
{'c0': '42.4992', 'c1': '117.8644'}
[0072] Then the distributed framework acquires the processed data by data integration on the dictionary results, mapping the dictionary results to a new data structure. For example, processed results shown in Table 8 may be acquired by data integration on the results listed in Table 2 and Table 3 upon dictionary processing.
Table 8 Processed results
[Table image: Table 8 processed results acquired by data integration]
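As a rough sketch in plain Python, the dictionary-processing and data-integration steps might look as follows (the helper names and the row values standing in for Table 3 are illustrative assumptions; the actual contents of Tables 2, 3, and 8 are defined by the figures of the disclosure):

```python
def to_dict_rows(rows, mapping):
    # Dictionary processing: re-key each row of a processed result with
    # its virtualized labels, producing one dictionary per row.
    return [{mapping[k]: v for k, v in row.items()} for row in rows]

def integrate(*dict_results):
    # Data integration: map the per-node dictionary results into a single
    # new data structure by concatenating their row dictionaries.
    merged = []
    for result in dict_results:
        merged.extend(result)
    return merged

# Hypothetical rows standing in for Table 3, keyed by original labels.
table3 = [{"lat": "34.31365", "lon": "120.019"},
          {"lat": "42.4992", "lon": "117.8644"}]
rows = to_dict_rows(table3, {"lat": "c0", "lon": "c1"})
# rows == [{'c0': '34.31365', 'c1': '120.019'},
#          {'c0': '42.4992', 'c1': '117.8644'}]
```

Because every row carries its own (compact) labels, no padding bits are needed when results of different shapes are merged, and no label is discarded during integration.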
[0073] In summary, in the technical solution according to the embodiments of the present disclosure, by data integration on the processed results output by the data processing nodes, the distributed framework may collect the processed results by an iterator. Moreover, in the embodiments of the present disclosure, the distributed framework acquires the virtualized labels by virtualizing the data labels of the processed results, then acquires the dictionary results by dictionary processing on the processed results, based on the virtualized labels, and then acquires the processed data by integration on the dictionary results. In this way, storage space required for the processed results is reduced, and the storage resource of the distributed database is saved, while facilitating management of the processed results.
[0074] FIG. 5 is a schematic diagram of another method for processing data according to an embodiment of the present disclosure. Referring to FIG. 5, the method may be applicable to the distributed framework 10 of the system for processing data as described above. [0075] First, the distributed framework acquires to-be-analyzed data from Hive. Referring to FIG. 5, a data structure of the to-be-analyzed data is RDD. To access a single-machine algorithm to the distributed framework, the distributed framework needs to acquire converted data by format conversion on the to-be-analyzed data upon acquiring the to-be-analyzed data. A data structure of the converted data satisfies a data structure requirement corresponding to the single-machine algorithm.
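The format-conversion step can be sketched with plain Python stand-ins — a list of row dictionaries playing the role of the RDD, and a column-oriented dict playing the role of the single-machine input format. In a real Spark deployment this would more likely be something like a `toPandas()` call; all names and sample values below are illustrative assumptions:

```python
def convert_for_single_machine(rdd_rows):
    # Reshape row-oriented distributed data into the column-oriented
    # structure a single-machine algorithm typically expects.
    columns = {}
    for row in rdd_rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

converted = convert_for_single_machine(
    [{"wf_id": 1, "power": 2.5}, {"wf_id": 2, "power": 3.1}])
# converted == {"wf_id": [1, 2], "power": [2.5, 3.1]}
```

The point of the conversion is decoupling: the single-machine algorithm never sees an RDD, so it can be developed and tested without any knowledge of the distributed framework.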
[0076] To alleviate the processing overhead of a single data processing node and accelerate data processing, the distributed framework acquires a plurality of data slices by data slicing on the converted data. Afterwards, referring to FIG. 5, the distributed framework issues the data slices and the algorithms to the data processing nodes.
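A minimal sketch of the slicing step, assuming a simple contiguous even split (the disclosure also contemplates other slicing modes, e.g. by wind field, by fan, or by time; the function name is illustrative):

```python
def slice_data(rows, n):
    # Split the converted data into n contiguous slices of near-equal
    # size, one slice per data processing node.
    size, rem = divmod(len(rows), n)
    slices, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        slices.append(rows[start:end])
        start = end
    return slices

parts = slice_data(list(range(7)), 3)
# parts == [[0, 1, 2], [3, 4], [5, 6]]
```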
[0077] Because it is the single-machine algorithm that is accessed to the distributed framework in the embodiments of the present disclosure, a machine learning library that the distributed framework needs to call is consistent with a machine learning library that a single-machine framework needs to call in a process of data processing. For example, as illustrated in FIG. 5, the machine learning libraries that the distributed framework needs to call include: Python, scikit-learn, and PySpark LightGBM.
[0078] Each of the data processing nodes may acquire a corresponding processed result. The distributed framework acquires the processed data by data fusion on the processed results acquired by the data processing nodes, and stores the processed data to the distributed database.
[0079] The following is an embodiment of an apparatus for processing data according to the present disclosure. The apparatus is applicable to the method embodiments of the present disclosure. For details not disclosed in the apparatus embodiment of the present disclosure, reference may be made to the method embodiments of the present disclosure.
[0080] FIG. 6 is a block diagram of an apparatus 600 for processing data according to an embodiment of the present disclosure. Referring to FIG. 6, the apparatus 600 has a function of achieving the above method embodiment, and the function may be implemented by hardware, or may be realized by corresponding software executed by hardware. The apparatus 600 may be the server described above, or may be disposed in the server described above. The apparatus 600 may include: a data acquiring module 610, a format converting module 620, a data processing module 630, and a data storage module 640.
[0081] The data acquiring module 610 is configured to acquire to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is RDD.
[0082] The format converting module 620 is configured to acquire converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm. [0083] The data processing module 630 is configured to acquire processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in a distributed framework.
[0084] The data storage module 640 is configured to store the processed data to the distributed database.
[0085] In an example, the distributed framework includes n data processing nodes, wherein n is an integer greater than 1. As illustrated in FIG. 7, the data processing module 630 includes: a data processing unit 632, configured to acquire n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes; and a data fusion unit 634, configured to acquire the processed data by data fusion on the n processed results.
[0086] In an example, as illustrated in FIG. 7, the data fusion unit 634 is configured to: acquire virtualized labels by virtualizing data labels of the n processed results; acquire n dictionary results by dictionary processing on the n processed results based on the virtualized labels; and acquire the processed data by data integration on the n dictionary results.
[0087] In an example, forms of the processed results include a table at least including one or more columns of data; and the data labels of the processed results include names of one or more columns of the table. Alternatively, forms of the processed results include a table at least including one or more rows of data; and the data labels of the processed results include names of one or more rows of the table.
[0088] In an example, as illustrated in FIG. 7, the data processing unit 632 is configured to: acquire n data slices by data slicing on the converted data in a target slicing mode; issue the n data slices to the n data processing nodes; and acquire an ith processed result by processing an ith data slice in the n data slices using the single-machine algorithm at the ith data processing node in the n data processing nodes, wherein the n processed results include the ith processed result, i being an integer less than or equal to n.
[0089] In an example, issuing the n data slices to the n data processing nodes includes: determining data sizes respectively corresponding to the n data slices; acquiring processing capabilities respectively corresponding to the n data processing nodes; and issuing the n data slices to the n data processing nodes based on the data sizes respectively corresponding to the n data slices and the processing capabilities respectively corresponding to the n data processing nodes.
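One plausible reading of this capability-aware issuing is a greedy balancing scheme, sketched below. The greedy policy and all names are illustrative assumptions; the disclosure does not fix a particular assignment algorithm:

```python
def issue_slices(slice_sizes, capabilities):
    # slice_sizes[i]  : data size of slice i
    # capabilities[j] : relative processing capability of node j
    # Assign each slice, largest first, to the node whose estimated
    # finish time (assigned load / capability) would remain lowest.
    loads = [0.0] * len(capabilities)
    assignment = [0] * len(slice_sizes)
    for i in sorted(range(len(slice_sizes)), key=lambda k: -slice_sizes[k]):
        j = min(range(len(capabilities)),
                key=lambda k: (loads[k] + slice_sizes[i]) / capabilities[k])
        assignment[i] = j
        loads[j] += slice_sizes[i]
    return assignment
```

With equal capabilities the scheme spreads slices evenly; with unequal capabilities it steers larger slices toward the stronger nodes, which is the stated goal of matching data sizes to processing capabilities.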
[0090] In an example, the to-be-analyzed data includes data of a fan; and the target slicing modes include at least one of: data slicing based on a wind field, data slicing based on a fan, and data slicing based on time. [0091] In summary, the technical solution according to the embodiments of the present disclosure converts, by the distributed framework, the data structure of the to-be-analyzed data prior to processing the to-be-analyzed data, such that the converted data satisfies the data structure requirement corresponding to the single-machine algorithm to access the single-machine algorithm to the distributed framework. Moreover, because it is the single-machine algorithm that is accessed to the distributed framework, algorithm personnel develop the algorithm based on a single-machine environment, and engineering personnel access the developed algorithm to the distributed framework. In this way, the data and the algorithm of the distributed framework are decoupled, the requirements of development and application of the distributed framework on the ability of the algorithm personnel are lowered, the learning cost of the algorithm personnel is reduced, the life cycle of the development of the distributed framework is shortened, and thus popularization and application of the distributed framework are facilitated. In addition, in the embodiments of the present disclosure, the distributed framework processes the converted data by parallel processing using the single-machine algorithm at the data processing nodes, which takes full advantage of the efficient data throughput, scheduling, and parallel capabilities of the distributed framework.
[0092] It should be noted that, when the functions of the apparatus according to the embodiments of the present disclosure are realized, description is only given to the above division of the functional modules. The above functions of the apparatus may be distributed to different functional modules according to actual needs. That is, an internal structure of the apparatus is divided into different functional modules to implement a part or all of the functions described above. In addition, the apparatus according to the above embodiments is based on the same concept as the method embodiments described above, and the specific implementation process of the apparatus is detailed in the method embodiments, which is not repeated herein.
[0093] FIG. 8 is a structure block diagram of a server according to an embodiment of the present disclosure. Referring to FIG. 8, the server may be configured to perform the method described above.
[0094] The server 800 includes a processing unit 801 including a central processing unit (CPU), a graphics processing unit (GPU), or a field programmable gate array (FPGA), a system memory 804 including a random-access memory (RAM) 802 or a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 to the central processing unit 801. The server 800 further includes an input/output (I/O) system 806 that facilitates information transfer between devices in the server, and a mass storage device 807 configured to store an operating system 813, an application program 814, and a program module 815. [0095] The I/O system 806 includes a display 808 configured to display information, and an input device 809, such as a mouse, a keyboard, and the like, configured for users to input information. The display 808 and the input device 809 are both connected to the central processing unit 801 by an input and output controller 810 connected to the system bus 805. The I/O system 806 may further include the input and output controller 810 to receive and process input from a plurality of other devices, such as the keyboard, the mouse, an electronic stylus, and the like. Similarly, the input and output controller 810 may further provide outputs to other output devices, such as a display screen, a printer, and the like.
[0096] The mass storage device 807 is connected to the central processing unit 801 by a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and related computer-readable medium provide a nonvolatile storage. That is, the mass storage device 807 may include computer-readable medium (not shown) such as a hard disk, a compact disc read-only memory (CD-ROM), or a driver.
[0097] Generally, the computer-readable medium may include a computer storage medium and a communication medium. The computer storage medium includes volatile and nonvolatile, and removable and unremovable media, which are practiced by any method or technology configured to store information such as a computer-readable instruction, a data structure, a program module, or other data. The computer storage medium includes a RAM, a ROM, an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other solid-state memories, a CD-ROM, a digital video disc (DVD) or other optical storages, a tape cartridge, a tape, a disk storage, or other magnetic storage devices. Those skilled in the art will know that the computer storage medium is not limited to the above. The system memory 804 and the mass storage device 807 are collectively referred to as a memory.
[0098] According to the embodiments of the present disclosure, the server 800 may be connected to a remote computer on a network, such as the Internet, to run. That is, the server 800 may be connected to the network 812 by a network interface unit 811 connected to the system bus 805. In other words, the network interface unit 811 may be connected to other types of networks or remote computer systems (not shown).
[0099] The memory further stores one or more computer programs. The one or more computer programs are configured to be loaded and run by one or more processors to perform the method described above.
[00100] An embodiment of the present disclosure provides a computer-readable storage medium storing one or more computer programs, wherein the one or more computer programs, when loaded and run by a processor of a server, cause the server to perform the method described above.
[00101] An exemplary embodiment provides a computer program product, wherein the computer program product, when loaded and run by a processor of a server, is configured to perform the method as described above.
[00102] It should be understood that the term "a plurality of" mentioned in the embodiments of the present disclosure indicates two or more. The term "and/or" mentioned in the embodiments of the present disclosure indicates three possible relationships between contextual objects. For example, A and/or B may mean that A exists alone, A and B exist at the same time, or B exists alone. The symbol "/" generally denotes an "OR" relationship between contextual objects.
[00103] Described above are merely optional embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modifications, equivalent substitutions, improvements, and the like may be made within the protection scope of the present disclosure, without departing from the spirit and principles of the present disclosure.

Claims

CLAIMS What is claimed is:
1. A method for processing data, applicable to a distributed framework, the method comprising: acquiring to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is resilient distributed dataset (RDD); acquiring converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm; acquiring processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework; and storing the processed data in the distributed database.
2. The method according to claim 1, wherein the distributed framework comprises n data processing nodes, n being an integer greater than 1; and acquiring processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework comprises: acquiring n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes; and acquiring the processed data by data fusion on the n processed results.
3. The method according to claim 2, wherein acquiring the processed data by data fusion on the n processed results comprises: acquiring virtualized labels by virtualizing data labels of the n processed results; acquiring n dictionary results by dictionary processing on the n processed results, based on the virtualized labels; and acquiring the processed data by data integration on the n dictionary results.
4. The method according to claim 3, wherein forms of the processed results comprise a table comprising one or more columns of data; and the data labels of the processed results comprise names of one or more columns of the table; or forms of the processed results comprise a table comprising one or more rows of data; and the data labels of the processed results comprise names of one or more rows of the table.
5. The method according to claim 2, wherein acquiring the n processed results by parallel processing on the converted data using the single-machine algorithm at the n data processing nodes comprises: acquiring n data slices by data slicing on the converted data in a target slicing mode; issuing the n data slices to the n data processing nodes; and acquiring an ith processed result by processing an ith data slice in the n data slices using the single-machine algorithm at the ith data processing node in the n data processing nodes, wherein the n processed results comprise the ith processed result, i being an integer less than or equal to n.
6. The method according to claim 5, wherein issuing the n data slices to the n data processing nodes comprises: determining data sizes respectively corresponding to the n data slices; acquiring processing capabilities respectively corresponding to the n data processing nodes; and issuing the n data slices to the n data processing nodes, based on the data sizes respectively corresponding to the n data slices and the processing capabilities respectively corresponding to the n data processing nodes.
7. The method according to claim 5, wherein the to-be-analyzed data comprises data of a fan; and the target slicing modes comprise at least one of: data slicing based on a wind field, data slicing based on a fan, and data slicing based on time.
8. An apparatus for processing data, disposed in a distributed framework, the apparatus comprising: a data acquiring module, configured to acquire to-be-analyzed data from a distributed database, wherein a data structure of the to-be-analyzed data is resilient distributed dataset (RDD); a format converting module, configured to acquire converted data by format conversion on the to-be-analyzed data, wherein a data structure of the converted data satisfies a data structure requirement corresponding to a single-machine algorithm; a data processing module, configured to acquire processed data by parallel processing on the converted data using the single-machine algorithm at data processing nodes in the distributed framework; and a data storage module, configured to store the processed data to the distributed database.
9. A server, comprising: a processor and a memory configured to store one or more computer programs, wherein the processor, when loading and running the one or more computer programs, is caused to perform the method as defined in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, storing one or more computer programs, wherein the one or more computer programs, when loaded and run by a processor of a server, cause the server to perform the method as defined in any one of claims 1 to 7.
PCT/SG2022/050611 2021-08-30 2022-08-26 Method and apparatus for processing data, and server and storage medium thereof WO2023033726A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111004548.X 2021-08-30
CN202111004548.XA CN113704340B (en) 2021-08-30 2021-08-30 Data processing method, device, server and storage medium

Publications (2)

Publication Number Publication Date
WO2023033726A2 true WO2023033726A2 (en) 2023-03-09
WO2023033726A3 WO2023033726A3 (en) 2023-05-04

Family

ID=78656821

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2022/050611 WO2023033726A2 (en) 2021-08-30 2022-08-26 Method and apparatus for processing data, and server and storage medium thereof

Country Status (2)

Country Link
CN (1) CN113704340B (en)
WO (1) WO2023033726A2 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477731B2 (en) * 2013-10-01 2016-10-25 Cloudera, Inc. Background format optimization for enhanced SQL-like queries in Hadoop
CN107609141B (en) * 2017-09-20 2020-07-31 国网上海市电力公司 Method for performing rapid probabilistic modeling on large-scale renewable energy data
CN109063842A (en) * 2018-07-06 2018-12-21 无锡雪浪数制科技有限公司 A kind of machine learning platform of compatible many algorithms frame
CN109658006B (en) * 2018-12-30 2022-02-15 广东电网有限责任公司 Large-scale wind power plant group auxiliary scheduling method and device
CN110209734B (en) * 2019-05-05 2022-11-18 深圳市腾讯计算机系统有限公司 Data copying method and device, computer equipment and storage medium
CN110704995B (en) * 2019-11-28 2020-05-01 电子科技大学中山学院 Cable layout method and computer storage medium for multiple types of fans of multi-substation
CN112185572B (en) * 2020-09-25 2024-03-01 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic equipment and medium
CN112241872A (en) * 2020-10-12 2021-01-19 上海众言网络科技有限公司 Distributed data calculation analysis method, device, equipment and storage medium
CN112487125B (en) * 2020-12-09 2022-08-16 武汉大学 Distributed space object organization method for space-time big data calculation
CN113220427A (en) * 2021-04-15 2021-08-06 远景智能国际私人投资有限公司 Task scheduling method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2023033726A3 (en) 2023-05-04
CN113704340A (en) 2021-11-26
CN113704340B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
US11544623B2 (en) Consistent filtering of machine learning data
CN111324610A (en) Data synchronization method and device
US20160292162A1 (en) Streamlined system to restore an analytic model state for training and scoring
CN111241203B (en) Hive data warehouse synchronization method, system, equipment and storage medium
KR102610636B1 (en) Offload parallel compute to database accelerators
US10185743B2 (en) Method and system for optimizing reduce-side join operation in a map-reduce framework
CN114417408B (en) Data processing method, device, equipment and storage medium
CN103699656A (en) GPU-based mass-multimedia-data-oriented MapReduce platform
Luo et al. Big-data analytics: challenges, key technologies and prospects
CN110888972A (en) Sensitive content identification method and device based on Spark Streaming
CN113918532A (en) Portrait label aggregation method, electronic device and storage medium
CN108334532B (en) Spark-based Eclat parallelization method, system and device
CN111611479B (en) Data processing method and related device for network resource recommendation
US20150149498A1 (en) Method and System for Performing an Operation Using Map Reduce
CN116132448B (en) Data distribution method based on artificial intelligence and related equipment
WO2023033726A2 (en) Method and apparatus for processing data, and server and storage medium thereof
CN107562943B (en) Data calculation method and system
CN116303427A (en) Data processing method and device, electronic equipment and storage medium
CN113220530B (en) Data quality monitoring method and platform
CN113760950A (en) Index data query method and device, electronic equipment and storage medium
CN116451005B (en) Spark-based distributed grid algebra operation method, system and equipment
CN117390040B (en) Service request processing method, device and storage medium based on real-time wide table
CN117891400A (en) Model simulation data storage method, device, equipment and storage medium
CN117093580A (en) Data storage method, system, equipment and medium based on label calculation
CN114969139A (en) Big data operation and maintenance management method, system, device and storage medium

Legal Events

Date Code Title Description
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE