CN109740765B - Machine learning system building method based on Amazon network server - Google Patents


Info

Publication number
CN109740765B
CN109740765B CN201910106145.2A
Authority
CN
China
Prior art keywords
amazon, machine learning, zeppelin, aws, storage platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910106145.2A
Other languages
Chinese (zh)
Other versions
CN109740765A (en)
Inventor
何海林
徐滢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Pinguo Technology Co Ltd
Original Assignee
Chengdu Pinguo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Pinguo Technology Co Ltd filed Critical Chengdu Pinguo Technology Co Ltd
Priority to CN201910106145.2A priority Critical patent/CN109740765B/en
Publication of CN109740765A publication Critical patent/CN109740765A/en
Application granted granted Critical
Publication of CN109740765B publication Critical patent/CN109740765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a machine learning system building method based on an Amazon network server, comprising the following steps: creating an Amazon EMR cluster through the AWS boto3 interface, the Amazon EMR cluster being configured with a Zeppelin storage platform; copying precompiled Spark task code from Amazon S3 onto the Master machine of the Amazon EMR cluster; registering the storage path of the Spark task code on the Master machine into the Spark interpreter of the Zeppelin storage platform through the service interface of the Zeppelin storage platform, and registering the code repository of the Zeppelin Notebook of the Zeppelin storage platform into Amazon S3; and creating the required machine learning instance through AWS. The technical scheme provided by the invention can rapidly complete the construction of the required system and improve development efficiency.

Description

Machine learning system building method based on Amazon network server
Technical Field
The invention relates to the technical field of computer network resource management, in particular to a machine learning system building method based on an Amazon network server.
Background
Amazon Web Services (AWS) is a cloud computing resource management platform operated by Amazon that provides various types of AWS resources to enterprises as remote web services, such as Amazon Elastic Compute Cloud (AWS EC2) service resources and Amazon Simple Storage Service (Amazon S3) resources. AWS EC2 lets a user remotely operate a computer system composed of different types of virtual machines by renting virtual computers (instances); any application software the user needs can run on this system, and the user can create, operate, and terminate EC2 instances at any time. Amazon S3 is used for network data storage. The hosted Hadoop framework provided by Amazon EMR allows users to process large amounts of data across multiple dynamically scalable Amazon EC2 instances. A cluster is a collection of Amazon EC2 instances and is the core component of Amazon EMR.
Based on the various basic services provided by AWS, developers can quickly build computing environments for cloud computing, big data, machine learning, and so on. How to respond quickly to product demands and accomplish machine learning goals under controllable resource costs becomes a factor that developers need to prioritize. Machine learning involves many tasks, mainly data acquisition, data management, data feature processing, model selection, hyper-parameter search, and model training. Early-stage data acquisition and management can be completed with software tools such as Flume, Spark, Kinesis, Kafka, and Elasticsearch; subsequent data feature processing and model training tasks require computing engines and platforms such as Spark and TensorFlow. After the machine learning task target is determined, the developer still needs to go through several stages such as data verification, model training, and A/B testing, which require quickly calling AWS resources for computation and verification. Specifically, the EMR cluster and EC2 instances must first be created in the Amazon console, then the relevant code deployed, computing tasks submitted with the Spark computing engine, or a Python script started to debug the data features and models while observing the relevant data results and metrics. The entire development cycle is time-consuming in itself, because different models may require different types of data and model tuning may take some time. If the AWS resources are reclaimed because the early cost budget is insufficient, the developer needs to re-create the EMR cluster and EC2 instances, which further lengthens the development cycle, and the process of re-creating them greatly increases the developer's workload.
All of the above problems result in low developer productivity.
Disclosure of Invention
The invention aims to provide a machine learning system building method based on an Amazon network server, which can quickly complete the building of a required system and improve the development efficiency.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a machine learning system building method based on an Amazon network server comprises the following steps: creating an Amazon EMR cluster by adopting an AWS bot 3 interface, wherein the Amazon EMR cluster is configured with a Zeppelin storage platform; copying precompiled Spark task code from Amazon S3 onto a Master machine of the Amazon EMR cluster; registering a memory path of the Spark task code on the Master machine into a Spark interpreter of the Zeppelin storage platform through a service interface of the Zeppelin storage platform, and registering a code warehouse of Zeppelin Notebook of the Zeppelin storage platform into the Amazon S3; the required machine learning instance is created by the AWS.
Preferably, creating the required machine learning instance through AWS includes: selecting an AWS EC2 instance and creating a machine image from it, the image containing a predetermined software tool; and creating the required machine learning instance from the image through the AWS boto3 interface.
Further, the method also comprises: specifying the IP address of the required machine learning instance through the AWS boto3 interface, and/or specifying the identity (ID) of the required machine learning instance through the AWS boto3 interface.
Further, the method also comprises: adding a Jupyter Notebook function to the required machine learning instance and setting the Jupyter Notebook function to start automatically at boot.
Preferably, the predetermined software tool is TensorFlow.
Further, after registering the code repository of the Zeppelin Notebook of the Zeppelin storage platform into Amazon S3, the method also comprises: restarting the Zeppelin storage platform.
Preferably, the precompiled Spark task code is copied from Amazon S3 onto the Master machine of the Amazon EMR cluster using an AWS CLI shell command.
According to the machine learning system building method based on the Amazon network server, the Spark task code is synchronized to the Master machine of the Amazon EMR cluster so that it can be called later from the Zeppelin Notebook; registering the code repository of the Zeppelin Notebook into Amazon S3 allows the notebooks to be quickly synchronized back when the Amazon EMR cluster is re-created after being released. In addition, when the machine learning instance is created, image technology is adopted so that a re-created EC2 instance is quickly configured from the image; and performing the various operations through the AWS boto3 interface also greatly helps developers. Therefore, the technical scheme provided by the invention can rapidly complete the construction of the required system and improve development efficiency.
Drawings
FIG. 1 is a flow chart of a method according to a first embodiment of the invention;
fig. 2 is a flowchart of a method according to a second embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Fig. 1 is a flowchart of a method according to a first embodiment of the present invention, including:
Step 101, creating an Amazon EMR cluster through the AWS boto3 interface, the Amazon EMR cluster being configured with a Zeppelin storage platform;
boto3 is the Python SDK for AWS (AWS also provides SDKs for other languages, such as Ruby and Java); it lets developers use services such as S3 and EC2 from their own software. boto3 provides a simple, object-oriented API as well as low-level service access.
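As a sketch of how such a boto3 call can look, the snippet below assembles the arguments for `run_job_flow`, the boto3 EMR call that creates a cluster with Zeppelin installed. All concrete values (cluster name, EMR release label, instance type and count, log bucket) are illustrative assumptions, not values prescribed by the patent:

```python
# Sketch of creating an EMR cluster with Spark and Zeppelin via boto3.
# All concrete values (release label, roles, names) are assumptions.

def build_emr_request(name, instance_type, instance_count, log_uri):
    """Assemble the keyword arguments for emr_client.run_job_flow()."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.20.0",  # assumed EMR release
        "Applications": [{"Name": "Spark"}, {"Name": "Zeppelin"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep cluster up for notebooks
        },
        "LogUri": log_uri,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Against a live AWS account (assumed credentials and region), this would be:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   resp = emr.run_job_flow(**build_emr_request(
#       "feature-processing", "r4.xlarge", 50, "s3://my-bucket/emr-logs/"))
#   print(resp["JobFlowId"])
```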
Zeppelin here is Apache Zeppelin, a web-based notebook platform for interactive data analytics on large clusters. The platform is a starting point rather than an end point: through its interpreter mechanism it not only provides direct data access, but more complex processing requirements can also be met with a simple layer of adaptation.
Step 102, copying a pre-compiled Spark task code from Amazon S3 to a Master machine of the Amazon EMR cluster;
In this step, the precompiled Spark task code is copied from Amazon S3 to the Master machine of the Amazon EMR cluster using an AWS CLI shell command. The Spark task code is mainly a jar file.
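A minimal sketch of this copy step as run on the Master machine. The bucket name, object key, and destination directory are hypothetical; the helper simply builds the `aws s3 cp` command line that the AWS CLI executes:

```python
# Build the AWS CLI command that copies a precompiled Spark jar from S3
# to the local filesystem of the EMR Master machine.
# Bucket name, key, and destination directory are illustrative assumptions.

def s3_copy_command(bucket, key, dest_dir):
    """Return the `aws s3 cp` argument list for copying s3://bucket/key."""
    return ["aws", "s3", "cp", f"s3://{bucket}/{key}", dest_dir]

# On the Master machine this would be executed with subprocess, e.g.:
#   import subprocess
#   subprocess.run(s3_copy_command("my-code-bucket", "jobs/feature-task.jar",
#                                  "/home/hadoop/jars/"), check=True)
```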
Step 103, registering the storage path of the Spark task code on the Master machine into the Spark interpreter of the Zeppelin storage platform through the service interface of the Zeppelin storage platform, and registering the code repository of the Zeppelin Notebook of the Zeppelin storage platform into Amazon S3;
after the step is completed, the method further comprises the step of restarting the Zeppelin storage platform.
If an operation needs to be performed on a Zeppelin Notebook, it is triggered through the service interface of the Zeppelin storage platform. The triggered call mainly runs the Spark tasks associated with the notebook to process data features, for example normalizing or vectorizing data; the processed data can then be saved in Amazon S3 or a Redshift data warehouse to be read by subsequent machine learning tasks.
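As an illustration, Zeppelin exposes a notebook REST API through which such a run can be triggered. The sketch below only builds the request; the base URL, note ID, and parameter names are hypothetical, and exact endpoint paths depend on the Zeppelin version:

```python
# Sketch of triggering a Zeppelin Notebook run through its REST service
# interface. Host, note id, and the parameter passed to the paragraphs
# are illustrative assumptions.

def notebook_job_request(base_url, note_id, params):
    """Build the URL and JSON body for running all paragraphs of a note."""
    url = f"{base_url}/api/notebook/job/{note_id}"
    body = {"params": params}  # e.g. a date that the paragraphs read as input
    return url, body

# Against a live Zeppelin server (assumed reachable), this would be sent as:
#   import requests
#   url, body = notebook_job_request("http://emr-master:8890", "2ABC123",
#                                    {"run_date": "2019-01-31"})
#   requests.post(url, json=body)
```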
Step 104, creating a required machine learning instance through the AWS.
The machine learning instance is used to complete tasks such as training the machine learning model; the data required for training is the feature-processed data described above, read from Amazon S3 or the Redshift data warehouse.
Fig. 2 is a flowchart of a method according to a second embodiment of the present invention. It differs from the first embodiment in that, in step 104, a machine image is used to create the required machine learning instance. Specifically, creating the required machine learning instance through AWS includes:
step 1041, selecting an AWS EC2 instance, and creating a mirror image for the AWS EC2 instance; the mirror image comprises a predetermined software tool;
the image needs to include a currently mainstream machine learning platform or toolkit, the predetermined software tools including, but not limited to Tensorflow, MXnet, theano, pytorch, CNTK, caffe. In this embodiment, we use Tensorflow as the machine learning platform. Related toolkits may be installed in a designated Conda environment based on Conda or Pip tools as needed before creating the image.
Step 1042, creating the required machine learning instance from the image through the AWS boto3 interface.
In the required machine learning instance, the corresponding working environment (for example, a TensorFlow environment on Python 3.6 in this example) is started, and the Jupyter Notebook or customized Python script that needs to run is executed to accomplish the machine learning objective.
After the machine learning instance is created, the method further comprises: specifying the IP address of the required machine learning instance through the AWS boto3 interface, and/or specifying the identity (ID) of the required machine learning instance through the AWS boto3 interface. Further, the method also comprises: adding a Jupyter Notebook function to the required machine learning instance and setting the Jupyter Notebook function to start automatically at boot.
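These steps can be sketched with boto3's `run_instances`: launch from the saved image, pin a private IP, and pass user data that starts Jupyter Notebook at boot. The AMI ID, instance type, subnet, IP address, and the start command are all assumptions:

```python
# Sketch: launch the machine learning instance from the saved image,
# assign a fixed private IP, and auto-start Jupyter Notebook at boot.
# AMI id, instance type, subnet, IP, and the user-data script are assumptions.

JUPYTER_USER_DATA = """#!/bin/bash
# Assumed autostart: run Jupyter Notebook in the background at boot.
nohup jupyter notebook --no-browser --ip=0.0.0.0 &
"""

def build_run_instances_request(image_id, instance_type, subnet_id, private_ip):
    """Assemble the keyword arguments for ec2_client.run_instances()."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,
        "PrivateIpAddress": private_ip,  # the IP specified via boto3
        "UserData": JUPYTER_USER_DATA,   # boot-time Jupyter autostart
    }

# Live call (assumed credentials and region):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   resp = ec2.run_instances(**build_run_instances_request(
#       "ami-0abc...", "p2.xlarge", "subnet-0def...", "10.0.1.25"))
#   print(resp["Instances"][0]["InstanceId"])  # the instance ID to record
```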
The method of the present invention is described in further detail below in connection with the specific requirements of the development effort:
in this embodiment, the EMR cluster needs to be started to complete the data feature processing in a certain business model. The feature processing here is to complete the primary feature processing with sufficient computing resources in the EMR cluster, in hopes of reducing the time expenditure and resource utilization that would otherwise be required to process the primary feature if the model processing stage resources were relatively small.
Because AWS provides a very large number of EMR instance types, the actual service can configure the designated resources according to its own needs, for example: set the EMR cluster name; use r4.xlarge instances, 50 of them; use spot (bid) instances to obtain idle AWS resources at lower cost; and set the security groups of the cluster, the Zeppelin version, the S3 address that Zeppelin uses to synchronize notebooks, the custom jar files to synchronize from S3 after EMR cluster creation is completed, and so on.
The data processing flow in the Zeppelin Notebook relies on the actual processing-logic code, which can be custom-developed; compiling and packaging the code can be completed by publishing it to a platform such as Jenkins. The jar file is then synchronized to the Master machine of the EMR cluster. Since Zeppelin cannot automatically identify a custom jar file, the custom jar file must be registered in the Spark interpreter of the Zeppelin storage platform. After successful registration, the Zeppelin storage platform is restarted, and the relevant logic code in the custom jar can then be called from the Zeppelin Notebook.
After registering the custom jar file and restarting the Zeppelin service, the Zeppelin web page can be accessed to perform the data feature processing operations. The flow varies with the model and the demands of the business. The current flow can generally be described as: read data from Amazon S3 or a data warehouse, organize the related data into Spark DataFrames, complete the transformation of the feature data using the Spark ML toolkit, and store the transformed data in a storage medium such as S3 or a data warehouse.
In daily development, EMR cluster resources are released regularly for cost and other reasons, so the Zeppelin Notebooks in the EMR cluster need to be saved in time for subsequent editing and invocation. For this reason, in the present invention a code repository is configured for the Zeppelin storage platform, and the Zeppelin Notebooks in this embodiment are saved to a specified Amazon S3 bucket and path. A subsequently re-created EMR cluster will then synchronize the Zeppelin Notebooks back from the specified bucket and path for the developer to edit and call again.
The saved Zeppelin Notebook is called as part of the daily processing flow and is executed on a fixed schedule (for example daily or weekly) to output the data feature processing results. Variables in a specified paragraph of the Zeppelin Notebook (for example a specified-date parameter) can be modified, and the call completed, through the service interface of the Zeppelin storage platform. After the notebook finishes running, the numbers of successful and failed paragraphs are returned so that the caller is informed.
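The success/failure summary above can be sketched as a small helper that tallies the paragraph statuses in a job-status response. The status strings follow Zeppelin's FINISHED/ERROR convention; the exact response shape is an assumption:

```python
# Sketch: after a triggered notebook run, tally how many paragraphs
# finished and how many failed, for the caller's benefit.
# The status field and values mirror Zeppelin's FINISHED/ERROR convention.

def count_paragraph_results(paragraphs):
    """Return (succeeded, failed) counts from a list of paragraph dicts."""
    succeeded = sum(1 for p in paragraphs if p.get("status") == "FINISHED")
    failed = sum(1 for p in paragraphs if p.get("status") == "ERROR")
    return succeeded, failed

# Example with a hypothetical job-status payload:
sample = [{"status": "FINISHED"}, {"status": "FINISHED"}, {"status": "ERROR"}]
print(count_paragraph_results(sample))  # → (2, 1)
```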
In this embodiment, an AWS EC2 instance needs to be started to complete the debugging of the business model. The feature data stored in the data warehouse or S3 after feature processing can be retrieved for use; specifically, the processing logic of the business model reads the data from the data warehouse or S3 and loads it into memory. The machine learning model uses the TensorFlow toolkit to complete the reprocessing of the feature data, where reprocessing means converting the feature data into data that the model can directly read and recognize.
AWS provides a number of EC2 instance types, including general purpose, compute optimized, memory optimized, accelerated computing, storage optimized, and so on. The model in this case requires deep learning, so an accelerated-computing EC2 instance type is selected. For economic benefit, spot (bid) instances may be selected based on the parameters of the present invention; the benefit of a spot instance is that in most cases idle resources can be chosen from the AWS market and computing resources obtained at lower cost. In addition to the instance type and whether to use spot bidding, the present invention also provides settings for the size of the EC2 local storage space and for the original AWS image.
By default, the present invention completes a fast EC2 boot process based on the custom image, without reinstalling the Python toolkits the work depends on, thereby saving the time spent installing them. These Python toolkits may cover functions such as database access, numerical computation, graph plotting, and parameter search. The instance types that AWS provides for machine learning may not themselves contain these toolkits, so they must be configured according to the business, and after the EC2 machine is created the designated Conda environment is started to install them so that the business model can be invoked.
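The post-boot installation step can be sketched as building the Conda command that installs the business toolkits into the designated environment. The environment name and package list are illustrative placeholders:

```python
# Sketch: build the Conda command that installs extra Python toolkits
# into a designated environment after the EC2 machine is created.
# Environment name and package list are illustrative placeholders.

def conda_install_command(env_name, packages):
    """Return the `conda install` argument list for the given environment."""
    return ["conda", "install", "--name", env_name, "--yes", *packages]

# Run on the instance, e.g.:
#   import subprocess
#   subprocess.run(conda_install_command(
#       "tf36", ["sqlalchemy", "matplotlib", "scikit-learn"]), check=True)
```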
Of course, after the EC2 instance is created, its IP address or ID may be specified and the toolkit list updated again to complete a new installation. This function mainly addresses the need to extend existing functionality in the course of model exploration.
The Jupyter Notebook function provides, in a web page, an exploration environment for completing the machine learning business model in the Python language. In this embodiment, after the Jupyter Notebook function starts automatically, the Jupyter web page can be accessed directly, a notebook created, the relevant development work carried out in it, and the toolkits used to complete the metric plotting of the business model.
The principle of the method of the invention is as follows:
In existing machine learning tasks, data feature processing needs relatively many machine resources, such as CPU and memory, and the types of resources needed by different data processing modes also differ: some are compute-intensive, some are memory-intensive, and some are both. Traditional data centers cannot provide computing resources, or different types of computing resources, anytime and anywhere because their hardware resources are limited. Cloud computing offers many possibilities: Amazon's cloud computing services solve the resource limitations of the traditional data center while providing many services to meet the requirements of different scenarios.
When a machine learning task is carried out on AWS services, computing resources for the data feature processing task are first requested through the AWS EMR service; data feature processing with Spark as the computing framework needs several EMR instances for distributed computation. To handle different feature processing and to make data verification and debugging more convenient, a Spark computing task would normally have to be submitted on EMR and the data written to services such as S3 or a data warehouse. This submission process includes coding, compiling, publishing, performing file synchronization, submitting jobs, and so on, and modifying a computing DAG costs a significant amount of waiting time, dragging down development efficiency. By using the Zeppelin service, the arrangement, debugging, and verification of the computing DAG can be completed in the Zeppelin Notebook without re-compiling, re-publishing, re-synchronizing, or re-submitting jobs for the DAG, which greatly improves efficiency.
For the machine learning task, TensorFlow is used as the platform toolkit in this embodiment, and the processed feature data is used as model input to complete model tuning, parameter debugging, model training, inference, and so on. Current learning targets such as deep learning and reinforcement learning depend on the CPU and the GPU. For this purpose, computing resources are acquired through the AWS EC2 service, and Python programs are executed to accomplish the specific business objectives.
Both parts require creating computing resources, configuring the toolkit environment, configuring code synchronization policies, and executing computing DAGs or machine learning targets, which calls for a complete set of automation mechanisms.
The beneficial effects of the invention are as follows:
1. In contrast to the traditional mode, the created EC2 instance does not require reinstallation and configuration of its environment, thanks to the image used in the present invention. While it takes 11 minutes to prepare an EC2 resource for machine learning in the traditional mode, the time spent in the present invention is reduced to 6.5 minutes, a large reduction.
2. Machine learning requires related exploration at different stages; for example, a third-party Python toolkit may be needed. The invention provides a configuration-based function: the installation of a new toolkit can be completed merely by adding the related parameters to the configuration and executing an update operation, without re-applying for EC2 machine resources.
3. Because different models place different demands on data processing, the computing DAG also needs to be adjusted in time during feature processing. The invention supports updating the jar file that the Zeppelin Notebook relies on, without re-applying for EMR resources.
4. The invention provides an interface for executing the Zeppelin Notebooks that need to run daily once the business model is determined; a call to a designated Zeppelin Notebook can be completed through the corresponding interface. At the same time, through a further interface, the running of a specific Python program on EC2 can be initiated. This is more convenient and efficient.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention.

Claims (6)

1. A machine learning system building method based on an Amazon network server, characterized by comprising the following steps:
creating an Amazon EMR cluster through an AWS boto3 interface, wherein the Amazon EMR cluster is configured with a Zeppelin storage platform;
copying precompiled Spark task code from Amazon S3 onto a Master machine of the Amazon EMR cluster;
registering the storage path of the Spark task code on the Master machine into a Spark interpreter of the Zeppelin storage platform through a service interface of the Zeppelin storage platform, and registering a code repository of the Zeppelin Notebook of the Zeppelin storage platform into the Amazon S3;
creating a required machine learning instance through AWS, the required machine learning instance being created using a machine image; specifically comprising the following steps:
selecting an AWS EC2 instance and creating an image from the AWS EC2 instance, the image comprising a predetermined software tool;
creating the required machine learning instance from the image through the AWS boto3 interface.
2. The Amazon-network-server-based machine learning system building method of claim 1, further comprising:
the AWS bot 3 interface is adopted to specify the IP address of the required machine learning instance, and/or the AWS bot 3 interface is adopted to specify the identity of the required machine learning instance.
3. The Amazon-network-server-based machine learning system building method of claim 1, further comprising:
and adding a Jupyter Notebook function in the required machine learning example, and setting the Jupyter Notebook function to be in a starting-up self-starting state.
4. The Amazon-network-server-based machine learning system building method of claim 1, wherein the predetermined software tool is TensorFlow.
5. The Amazon web server-based machine learning system building method of claim 1, further comprising, after said registering the code repository of Zeppelin Notebook of the Zeppelin storage platform into the Amazon S3:
and restarting the Zeppelin storage platform.
6. The Amazon-network-server-based machine learning system building method of claim 1, wherein the precompiled Spark task code is copied from Amazon S3 to the Master machine of the Amazon EMR cluster using an AWS CLI shell command.
CN201910106145.2A 2019-01-31 2019-01-31 Machine learning system building method based on Amazon network server Active CN109740765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910106145.2A CN109740765B (en) 2019-01-31 2019-01-31 Machine learning system building method based on Amazon network server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910106145.2A CN109740765B (en) 2019-01-31 2019-01-31 Machine learning system building method based on Amazon network server

Publications (2)

Publication Number Publication Date
CN109740765A CN109740765A (en) 2019-05-10
CN109740765B true CN109740765B (en) 2023-05-02

Family

ID=66367221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910106145.2A Active CN109740765B (en) 2019-01-31 2019-01-31 Machine learning system building method based on Amazon network server

Country Status (1)

Country Link
CN (1) CN109740765B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117215635A (en) * 2019-06-28 2023-12-12 杭州海康威视数字技术股份有限公司 Task processing method, device and storage medium
CN111159270A (en) * 2019-12-31 2020-05-15 杭州依图医疗技术有限公司 Method, system, computing device and storage medium for scheduling Zeppelin tasks
CN111158672B (en) * 2019-12-31 2023-04-21 浪潮云信息技术股份公司 Integrated interactive Elastic MapReduce job management method
CN112181644B (en) * 2020-09-21 2021-08-06 上海微亿智造科技有限公司 Method, system and device for cross-domain machine learning component Jupitter
CN112631527A (en) * 2021-01-07 2021-04-09 上海明略人工智能(集团)有限公司 Juypter notewood code remote storage method and device based on k8s multi-tenant
CN112906907B (en) * 2021-03-24 2024-02-23 成都工业学院 Method and system for layering management and distribution of machine learning pipeline model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772141B1 (en) * 1999-12-14 2004-08-03 Novell, Inc. Method and apparatus for organizing and using indexes utilizing a search decision table
CN104243563A (en) * 2014-09-03 2014-12-24 河海大学 Quick parallel system allocation method for AWS platform
CN106547910A (en) * 2016-11-25 2017-03-29 山东浪潮商用系统有限公司 Moosefs realizes the multistage storage method of file based on Tachyon
CN108037984A (en) * 2017-11-28 2018-05-15 深圳前海微众银行股份有限公司 Method for managing resource, system and the readable storage medium storing program for executing of data analysis
CN108595473A (en) * 2018-03-09 2018-09-28 广州市优普计算机有限公司 A kind of big data application platform based on cloud computing
CN109255440A (en) * 2017-07-11 2019-01-22 上海有孚网络股份有限公司 The method that predictive maintenance is carried out to Electric Power Generating Equipment based on recurrent neural network (RNN)
CN109284184A (en) * 2018-03-07 2019-01-29 中山大学 A kind of building method of the distributed machines learning platform based on containerization technique

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341956A1 (en) * 2017-05-26 2018-11-29 Digital River, Inc. Real-Time Web Analytics System and Method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772141B1 (en) * 1999-12-14 2004-08-03 Novell, Inc. Method and apparatus for organizing and using indexes utilizing a search decision table
CN104243563A (en) * 2014-09-03 2014-12-24 Hohai University Rapid parallel system deployment method for the AWS platform
CN106547910A (en) * 2016-11-25 2017-03-29 Shandong Inspur Business System Co., Ltd. Multi-level file storage method for MooseFS based on Tachyon
CN109255440A (en) * 2017-07-11 2019-01-22 Shanghai Yovole Networks Inc. Method for predictive maintenance of power generation equipment based on a recurrent neural network (RNN)
CN108037984A (en) * 2017-11-28 2018-05-15 Shenzhen Qianhai WeBank Co., Ltd. Resource management method, system and readable storage medium for data analysis
CN109284184A (en) * 2018-03-07 2019-01-29 Sun Yat-sen University Method for building a distributed machine learning platform based on containerization technology
CN108595473A (en) * 2018-03-09 2018-09-28 Guangzhou Youpu Computer Co., Ltd. Big data application platform based on cloud computing

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
"Configuring Zeppelin's Spark Interpreter on EMR When Launching a Cluster"; Thinbug; 《https://www.thinbug.com/q/45328671》; 20170726; 1-4 *
"How to Use EMR, Spark ML and Zeppelin to Build a Machine Learning Service - AWS.PDF"; zhaoxiaoj; 《https://max.book118.com/html/2018/1127/8040000130001134.shtm》; 20181203; 1-28 *
Cerebral Blood Flow Monitoring Using IoT Enabled Cloud Computing for mHealth Applications; Beulah Preethi Vallur et al.; 《FICC 2018: Advances in Information and Communication Networks》; 20181227; vol. 887; 578-590 *
Consolidating billions of Taxi rides with AWS EMR and Spark in the Cloud: Tuning, Analytics and Best Practices; Alex Kaplunovich et al.; 《2018 IEEE International Conference on Big Data (Big Data)》; 20190124; 4501-4507 *
Research and Implementation of Performance Optimization for the Hadoop Platform; Yang Hao; 《China Master's Theses Full-text Database, Information Science and Technology》; 20160115; no. 01 (2016); I138-341 *
Research on Hadoop-based Genetic Algorithm for TSP; Cao Lilu; 《China Master's Theses Full-text Database, Information Science and Technology》; 20180715; no. 07 (2018); I140-97 *
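The first two non-patent citations concern configuring Zeppelin's Spark interpreter at EMR cluster launch time. As an illustrative sketch only (not taken from the patent; the classification names follow Amazon EMR's application-configuration JSON format, and every property value shown is an assumption), such a configuration could be passed to `aws emr create-cluster --configurations`:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g",
      "spark.executor.cores": "2"
    }
  },
  {
    "Classification": "zeppelin-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_MEM": "\"-Xms1024m -Xmx4096m\""
        }
      }
    ]
  }
]
```

Because the Spark interpreter in Zeppelin on EMR submits jobs through the cluster's Spark installation, settings placed in `spark-defaults` are generally picked up by notebook jobs without editing interpreter settings on the master node by hand.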

Also Published As

Publication number Publication date
CN109740765A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN109740765B (en) Machine learning system building method based on Amazon network server
Li et al. Serverless computing: state-of-the-art, challenges and opportunities
Sahni et al. A hybrid approach to live migration of virtual machines
US8887158B2 (en) Dynamic cluster expansion through virtualization-based live cloning
Rodríguez et al. CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications
CN107329799A (en) A system fusing Docker containers with KVM virtualization technology
KR20230106719A (en) Containerized deployment of microservices based on monolithic legacy applications
RU2658190C2 (en) Controlling runtime access to application programming interfaces
CN105630488A (en) Continuous integration implementation method based on Docker container technology
US8156211B2 (en) Transitioning from dynamic cluster management to virtualized cluster management
US20100162245A1 (en) Runtime task with inherited dependencies for batch processing
US8407713B2 (en) Infrastructure of data summarization including light programs and helper steps
US20210365457A1 (en) Graph database and methods with improved functionality
CN112948025B (en) Data loading method and device, storage medium, computing equipment and computing system
US20170322792A1 (en) Updating of operating system images
CN110928659B (en) Remote multi-platform access method for a numerical pool system with adaptive capability
CN112286633A (en) Virtual machine creating method, device, equipment and storage medium based on CloudStack platform
Cai et al. Deployment and verification of machine learning tool-chain based on Kubernetes distributed clusters
CN115629860A (en) Software parameter tuning method, container management platform, storage medium and system
US11163603B1 (en) Managing asynchronous operations in cloud computing environments
Wu et al. An automatic artificial intelligence training platform based on kubernetes
CN112181403A (en) DevOps integration implementation method, apparatus, device and readable storage medium
US20230297346A1 (en) Intelligent data processing system with metadata generation from iterative data analysis
CN115878711B (en) Redis connection pool realization method, device, equipment and medium
Rong Design and Implementation of Operating System in Distributed Computer System Based on Virtual Machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant