CN110502213A

CN110502213A - Artificial intelligence capability development platform

Info

Publication number: CN110502213A
Application number: CN201910441898.9A
Authority: CN
Inventors: 刘阳
Original assignee: Networks Technology Co Ltd
Current assignee: Networks Technology Co Ltd
Priority date: 2019-05-24
Filing date: 2019-05-24
Publication date: 2019-11-26

Abstract

The invention discloses an artificial intelligence capability development platform comprising a hardware layer, a system layer, a data layer, a resource management and scheduling module, a framework layer, and a model management module. The hardware layer includes a CPU, a GPU, and memory; the CPU is electrically connected to the memory, and the GPU performs high-performance parallel computation. The system layer supports the Linux operating system. Data files in the data layer are stored and managed using a parallel file system and a cloud object storage system. The resource management and scheduling module manages and schedules compute nodes. The framework layer has multiple open-source deep learning frameworks built in. The model management module monitors and manages model training, model generation, and results. By providing multiple open-source deep learning frameworks, the platform supports the deployment of many types of artificial intelligence applications and reduces the maintenance cost and complexity that users would otherwise face in maintaining and using open-source products themselves.

Description

Artificial intelligence capability development platform

Technical field

The present invention relates to the field of artificial intelligence technology, and in particular to an artificial intelligence capability development platform.

Background art

At present, the maturing of cloud computing and big data has catalyzed the progress and rapid development of artificial intelligence (AI), enabling machines to simulate human functions on a large scale and to serve customers in bulk with humanized and personalized services. As AI continues to develop, a wide variety of open-source frameworks have emerged, and how an enterprise should manage machine resources and AI open-source frameworks has become a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the present invention is to provide an artificial intelligence capability development platform that reduces the maintenance cost of open-source products and shortens model training time.

The object of the present invention is achieved by the following technical solution:

An artificial intelligence capability development platform comprises a hardware layer, a system layer, a data layer, a resource management and scheduling module, a framework layer, and a model management module. The hardware layer includes a CPU, a GPU, and memory; the CPU is electrically connected to the memory, and the GPU performs high-performance parallel computation. The system layer supports the Linux operating system. Data files in the data layer are stored and managed using a parallel file system and a cloud object storage system. The resource management and scheduling module manages and schedules CPUs, GPUs, and compute nodes. The framework layer has multiple open-source deep learning frameworks built in. The model management module monitors and manages model training, model generation, and results.

Further, the Linux operating system includes the Red Hat system and the Ubuntu system.

Further, the multiple open-source deep learning frameworks include TensorFlow, PyTorch, Keras, and Caffe.

Further, the platform also includes a model publishing module, which publishes a trained artificial intelligence model as an easily callable artificial intelligence inference service for other application systems to call.

Further, other application systems can use the AI inference service published by the artificial intelligence capability development platform by calling it through one of the RESTful, Stream, or gRPC API interfaces.

Further, the model management module provides fully graphical interface operation, which covers the data preprocessing required by a deep learning project, model import and management, model training, hyperparameter search, training process visualization, and model validation.

Further, the hyperparameter search achieves hyperparameter optimization mainly through one of random search, the Tree-structured Parzen Estimator (TPE), and Bayesian optimization based on a Gaussian process.

Further, the data layer integrates Apache Spark to simplify the process of transforming unstructured and structured data sets.

Further, the framework layer uses a distributed deep learning framework, which uses multi-ring dynamic network technology to improve the speed-up ratio of parallel training.

Further, the resource management and scheduling module uses the EGO resource management framework to manage and schedule CPUs, GPUs, and compute nodes.

Compared with prior art, the beneficial effects of the present invention are:

By providing multiple open-source deep learning frameworks, the artificial intelligence capability development platform of the invention supports the deployment of many types of artificial intelligence applications and reduces the maintenance cost and complexity that users would otherwise face in maintaining and using open-source products themselves. The platform also provides resource management, scheduling, and multi-tenant management functions, meeting the need for unified operation and management of multiple artificial intelligence applications from different business departments.

Brief description of the drawings

Fig. 1 is a structural block diagram of the artificial intelligence capability development platform of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing and specific embodiments. It should be noted that, provided there is no conflict, the embodiments or technical features described below may be combined in any way to form new embodiments.

As shown in Fig. 1, this embodiment provides an artificial intelligence capability development platform comprising a hardware layer, a system layer, a data layer, a resource management and scheduling module, a framework layer, and a model management module. The hardware layer includes a CPU, a GPU, and memory; the CPU is electrically connected to the memory, and the GPU performs high-performance parallel computation. The system layer supports the Linux operating system. Data files in the data layer are stored and managed using a parallel file system and a cloud object storage system. The resource management and scheduling module manages and schedules CPUs, GPUs, and compute nodes. The framework layer has multiple open-source deep learning frameworks built in. The model management module monitors and manages model training, model generation, and results.

Further, the Linux operating system includes the Red Hat system and the Ubuntu system; in this embodiment, besides these two operating systems, distributions from other currently popular Linux vendors may also be used. The multiple open-source deep learning frameworks include TensorFlow, PyTorch, Keras, and Caffe. That is, the platform integrates the open-source frameworks that are currently most popular. The platform is highly open: it has built in a range of ready-to-use, industry-popular open-source deep learning and machine learning libraries and frameworks, including TensorFlow, PyTorch, Keras, and Caffe, and it can maintain these open-source libraries and frameworks on an ongoing basis. It therefore supports the deployment of many types of AI applications and reduces the maintenance cost and complexity that users would otherwise face in maintaining and using open-source products themselves. In this embodiment only TensorFlow is briefly described; those skilled in the art, knowing the names PyTorch, Keras, and Caffe, can likewise set up the corresponding frameworks.

TensorFlow is an open-source software library that performs numerical computation using data flow graphs. A data flow graph is a directed graph that describes a mathematical computation using nodes (usually drawn as circles or squares, each representing a mathematical operation, a data input point, or a data output point) and edges (representing numbers, matrices, or tensors). The nodes of a data flow graph can easily be assigned to different computing devices to carry out asynchronous parallel computation, which makes the approach well suited to large-scale machine learning applications. TensorFlow supports a variety of heterogeneous platforms, including multiple CPUs/GPUs, servers, and mobile devices, and so has good cross-platform characteristics. Its architecture is flexible and can support a wide range of network models, giving it good generality and scalability. In addition, the TensorFlow kernel is developed in C/C++, and client APIs are provided for C++, Python, Java, and Go. TensorFlow.js supports training deep learning models in the browser using the GPU through WebGL, and supports loading and running machine learning models on iOS and Android systems.
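To make the data flow graph idea concrete, the following is a minimal sketch in plain Python. It is not TensorFlow code; the `Node` class and the `evaluate` function are hypothetical names introduced only for illustration. Nodes stand for inputs or operations, and the values passed between them play the role of the edges.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A graph node: an input placeholder (op is None) or a math operation."""
    name: str
    op: Optional[Callable[..., float]] = None
    inputs: List["Node"] = field(default_factory=list)


def evaluate(node: Node, feed: Dict[str, float],
             cache: Optional[Dict[str, float]] = None) -> float:
    """Evaluate a node after evaluating the nodes its incoming edges start from."""
    if cache is None:
        cache = {}
    if node.name in cache:
        return cache[node.name]
    if node.op is None:
        value = feed[node.name]          # input node: read the fed value
    else:
        args = [evaluate(n, feed, cache) for n in node.inputs]
        value = node.op(*args)           # operation node: apply the op
    cache[node.name] = value
    return value


# Graph for y = (a + b) * c
a, b, c = Node("a"), Node("b"), Node("c")
add = Node("add", lambda x, y: x + y, [a, b])
mul = Node("mul", lambda x, y: x * y, [add, c])

result = evaluate(mul, {"a": 2.0, "b": 3.0, "c": 4.0})
```

Because each node depends only on the nodes feeding it, independent nodes could in principle be placed on different devices, which is the property the description attributes to data flow graphs.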

Further, the platform also includes a model publishing module, which publishes a trained artificial intelligence model as an easily callable artificial intelligence inference service for other application systems to call. Other application systems can use the AI inference service published by the artificial intelligence capability development platform by calling it through one of the RESTful, Stream, or gRPC API interfaces. Through this module the platform provides good support for opening up AI capabilities: research and development staff can publish a trained AI model as an easily callable AI inference service with one click in the visual interface, and other application systems can use the published service through API interfaces such as RESTful, Stream, and gRPC. More preferably, the model management module provides fully graphical interface operation, which covers the data preprocessing required by a deep learning project, model import and management, model training, hyperparameter search, training process visualization, and model validation. The hyperparameter search achieves hyperparameter optimization mainly through one of random search, the Tree-structured Parzen Estimator (TPE), and Bayesian optimization based on a Gaussian process. The platform provides visual model training and monitoring tools that simplify AI model implementation and management, so that research and development staff can use the visual interface to inspect and preprocess data, set the training mode, observe and correct training problems, and adjust hyperparameters. The platform supports a hyperparameter optimization search function for model training, which automatically proposes an optimal combination of hyperparameters for AI model training and greatly reduces the trial-and-error cost for AI model developers.
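As an illustration of what calling a published inference service over a RESTful interface might look like, here is a minimal sketch using only the Python standard library. The endpoint path `/v1/predict`, the JSON field names, and the `predict` stand-in model are all assumptions made for this example; the source does not specify the platform's actual API.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


def inference_app(environ, start_response):
    """WSGI application exposing POST /v1/predict (path chosen for illustration)."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/v1/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call(app, path, data):
    """Invoke the app in-process, the way a WSGI server would for one request."""
    raw = json.dumps(data).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    })
    status = []
    chunks = app(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks))


status, reply = call(inference_app, "/v1/predict", {"features": [1.0, 2.0, 3.0]})
```

In a real deployment the application would sit behind a WSGI server and a load balancer; the in-process `call` helper is used here only so the sketch runs without opening a network port.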

The data layer integrates Apache Spark to simplify the process of transforming unstructured and structured data sets. Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. Spark has the following advantage: it provides a comprehensive, unified framework for managing the big data processing needs of data sets of different natures (text data, graph data, and so on) and of different data sources (batch data or real-time streaming data).

The framework layer uses a distributed deep learning framework, which uses multi-ring dynamic network technology to improve the speed-up ratio of parallel training. Specifically, it uses IBM's DDL (Distributed Deep Learning) technology, which applies multi-ring dynamic network technology to transmit data while making full use of the bandwidth and performance of the different layers of the communication network, thereby improving the speed-up ratio of parallel training.
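The description names the multi-ring technique without giving its details. As a rough point of reference only, the sketch below simulates a single-ring all-reduce, a well-known related technique for summing gradients across workers; it is not IBM's DDL algorithm. The function name `ring_allreduce` and the simplification of one vector element per chunk are assumptions made for this example.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Sum equally sized gradient vectors across workers arranged in a ring.

    Each worker's vector is split into n chunks, one element per chunk here,
    and worker i only ever sends to worker (i + 1) % n.
    """
    n = len(grads)
    assert all(len(g) == n for g in grads), "sketch assumes vector length == worker count"
    data = [list(g) for g in grads]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the completed chunks around the ring
    # until every worker has every sum.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value
    return data


summed = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays roughly constant as workers are added, which is why ring-style schemes are commonly used to improve parallel training speed-up.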

The resource management and scheduling module uses the EGO resource management framework to manage and schedule CPUs, GPUs, and compute nodes. The EGO cluster management framework consists of multiple layers, including a workload manager or application framework, a distributed file system, a resource manager, and cluster-oriented system management tools. EGO focuses on resource management; it interacts across multiple application frameworks and coordinates the sharing of resources.

In this embodiment, AlphaMind refers to the artificial intelligence capability development platform. Specifically, in AlphaMind the operating system and the deep learning framework layer sit on top of the hardware. The operating system supports the distributions of the most popular Linux vendors, including Red Hat and Ubuntu. Data files are stored and managed using the IBM Spectrum Scale parallel file system and IBM Cloud Object Storage. Above that is the resource management and scheduling module, which has IBM Spectrum Conductor with Spark (CwS) built in and uses the industry-recognized EGO as the resource management and scheduling module. EGO can manage and schedule CPUs, GPUs, and compute nodes, and it also supports multi-tenant management. This meets customers' need to share the same physical resources in scenarios where multiple people or multiple teams collaborate.

The platform integrates currently popular open-source frameworks such as TensorFlow and Caffe. In the deep learning framework layer it uses IBM's DDL (Distributed Deep Learning) technology, which applies multi-ring dynamic network technology to transmit data while making full use of the bandwidth and performance of the different layers of the communication network, thereby improving the speed-up ratio of parallel training. Above that, the platform provides fully graphical interface operation, covering the data preprocessing required by a deep learning project, model import and management, model training and hyperparameter search, training process visualization, and model validation and publishing. It also provides a model publishing platform, Inference as a Service. The platform provides several API access methods, including RESTful, Streaming, and gRPC APIs, and it provides dynamic horizontal scaling and load balancing for externally facing model services to handle scenarios with sudden surges in access load.

The development platform of this embodiment can implement the following functions:

1. Multi-tenant support

The platform is a multi-tenant solution: different users have different views of, and configuration permissions for, the deep learning components. Allocation to tenants is achieved by configuring users (User), resource groups (Resource Group), and Spark instance groups (Spark Instance Group).

2. Resource scheduling

Management, scheduling, and monitoring of resources are achieved through the EGO software and the configuration of its policies.
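The idea of per-tenant resource groups with quota-based scheduling can be sketched as follows. This is a toy model, not the EGO API: the class names `ResourceGroup` and `Scheduler`, the tenant names, and the quota policy are all hypothetical and chosen only to illustrate how tenants can share physical hardware while staying logically separate.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceGroup:
    """A named pool of GPU slots assigned to one tenant (hypothetical model)."""
    name: str
    gpu_quota: int
    allocations: Dict[str, int] = field(default_factory=dict)  # job -> GPUs held

    def free_gpus(self) -> int:
        return self.gpu_quota - sum(self.allocations.values())


class Scheduler:
    """Grants GPU requests against each tenant's own resource group."""

    def __init__(self, groups: List[ResourceGroup]):
        self.groups = {g.name: g for g in groups}

    def request(self, tenant: str, job: str, gpus: int) -> bool:
        group = self.groups[tenant]
        if gpus > group.free_gpus():
            return False                 # over quota: the job must wait
        group.allocations[job] = gpus
        return True

    def release(self, tenant: str, job: str) -> None:
        self.groups[tenant].allocations.pop(job, None)


sched = Scheduler([ResourceGroup("vision-team", 4), ResourceGroup("nlp-team", 2)])
first = sched.request("vision-team", "train-a", 3)    # fits in the quota of 4
second = sched.request("vision-team", "train-b", 2)   # only 1 GPU left, refused
third = sched.request("nlp-team", "train-c", 2)       # other tenant is unaffected
sched.release("vision-team", "train-a")
fourth = sched.request("vision-team", "train-b", 2)   # fits after the release
```

A production scheduler would also handle queuing, priorities, and borrowing between groups; those policies are outside what the description states and are omitted here.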

3. Monitoring and optimization of an individual training model

Deep learning training tasks are monitored through the relevant logs captured from the underlying deep learning framework. The training process is visualized by aggregating the monitoring data of the training tasks. The training visualization function can show run time, iteration count, loss value, accuracy, and histograms of weights, activations, and neural network gradients. From these charts, the user can tell whether a training run is proceeding normally or something has gone wrong.
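Log-based monitoring of this kind can be sketched as a parser that turns framework log lines into records ready for plotting. The log format below (`iter=... loss=... acc=...`) is a made-up example, since each framework prints its own format, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical log line format; real frameworks each use their own.
LINE = re.compile(r"iter=(\d+)\s+loss=([0-9.]+)\s+acc=([0-9.]+)")


def parse_training_log(text: str) -> List[Dict[str, float]]:
    """Extract iteration, loss, and accuracy from lines that carry them."""
    records = []
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            records.append({
                "iter": int(match.group(1)),
                "loss": float(match.group(2)),
                "acc": float(match.group(3)),
            })
    return records


def mean_loss(records: List[Dict[str, float]]) -> float:
    return sum(r["loss"] for r in records) / len(records)


def looks_healthy(records: List[Dict[str, float]], window: int = 2) -> bool:
    """Crude check: is the recent mean loss lower than the early mean loss?"""
    if len(records) < 2 * window:
        return True                      # not enough data to judge yet
    return mean_loss(records[-window:]) < mean_loss(records[:window])


SAMPLE_LOG = """\
[INFO] iter=100 loss=2.30 acc=0.11
[INFO] iter=200 loss=1.75 acc=0.38
[WARN] data loader is slow
[INFO] iter=300 loss=1.20 acc=0.57
[INFO] iter=400 loss=0.90 acc=0.69
"""

records = parse_training_log(SAMPLE_LOG)
healthy = looks_healthy(records)
```

The resulting list of records is the kind of aggregated monitoring data that a visualization layer could plot as loss and accuracy curves.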

4. Hyperparameter optimization

A hyperparameter is a parameter whose value is set before the model training process begins. It is distinct from a parameter, which is learned as part of model training, such as a regression coefficient or a neural network weight. Put simply, parameters are obtained by training the model, whereas hyperparameters are configured by a person (they are essentially parameters of the parameters; each time a hyperparameter changes, the model must be retrained).
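The distinction can be seen in a small gradient descent example. In this sketch, which uses a hypothetical `fit_slope` function and made-up data, the slope `w` is a parameter learned from the data, while `learning_rate` and `epochs` are hyperparameters fixed before training starts.

```python
from typing import List


def fit_slope(xs: List[float], ys: List[float],
              learning_rate: float, epochs: int) -> float:
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0                                  # parameter: learned during training
    n = len(xs)
    for _ in range(epochs):                  # hyperparameters: set before training
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                         # generated with a true slope of 2

w_good = fit_slope(xs, ys, learning_rate=0.05, epochs=200)
w_slow = fit_slope(xs, ys, learning_rate=0.0001, epochs=200)
```

With the larger learning rate the learned slope converges to the true value; with the much smaller one the same number of epochs leaves it far from converged, which shows why changing a hyperparameter requires retraining and why the choice matters.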

Deep learning models such as CNNs and RNNs may have ten to a hundred hyperparameters, for example the learning rate and regularization. These hyperparameters affect both the model training process and the final model result, and hyperparameter optimization has become one of the direct obstacles to tuning deep learning models. Here, three algorithms are used to optimize these hyperparameters automatically: random search, the Tree-structured Parzen Estimator (TPE), and Bayesian optimization based on a Gaussian process. Random search is briefly explained as follows. Random search is a method that uses random numbers to find an approximate optimum of a function, in contrast to the brute-force approach of grid search. Principle: within a given interval, points are generated repeatedly at random, without any bias, and the values of the constraint functions and the objective function are computed at each point. Among the points that satisfy the constraints, the objective values are compared one by one and the worse points are discarded; the point that remains at the end is the approximate optimum. The method rests on probability theory: the more random points are drawn, the higher the probability of obtaining the optimum. Its precision is relatively poor, but it finds an approximate optimum more efficiently than grid search. Random search is generally used for coarse selection or a preliminary survey.
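The random search procedure described above can be sketched in a few lines. The objective function here is a toy stand-in for a validation loss, and the hyperparameter names, ranges, and the optional `feasible` constraint check are assumptions made for the example.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Params = Dict[str, float]


def random_search(objective: Callable[[Params], float],
                  space: Dict[str, Tuple[float, float]],
                  n_trials: int,
                  feasible: Callable[[Params], bool] = lambda p: True,
                  seed: int = 0) -> Tuple[Optional[Params], float]:
    """Sample points uniformly, keep the feasible point with the lowest objective."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        if not feasible(params):
            continue                     # discard points that violate constraints
        score = objective(params)
        if score < best_score:           # discard the worse point, keep the better
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(p: Params) -> float:
    """Stand-in for a validation loss; minimum 0 at log10_lr = -3, dropout = 0.5."""
    return (p["log10_lr"] + 3.0) ** 2 + (p["dropout"] - 0.5) ** 2


space = {"log10_lr": (-5.0, -1.0), "dropout": (0.0, 0.9)}
_, few = random_search(toy_objective, space, n_trials=10)
best, many = random_search(toy_objective, space, n_trials=200)
```

Because both runs use the same seed, the 200-trial run sees every point the 10-trial run saw plus more, so its best score can only be equal or better. This mirrors the statement that drawing more random points raises the chance of finding the optimum.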

5. Fabric distributed training

Fabric is a general-purpose distributed deep learning framework. It provides broad support for distributed deep learning across multiple GPUs and multiple nodes and is compatible with existing TensorFlow and Caffe models. Fabric supports parallel designs of different gradient descent methods, with a speed-up ratio greater than 80%.

Fabric, a general-purpose distributed training framework, mainly supports the following functions:

Single-node, multi-GPU training for TensorFlow/Caffe models without a parameter server (PS);

Multi-node, multi-GPU training for TensorFlow/Caffe models with a parameter server (PS);

Multi-node distributed synchronous/asynchronous training algorithms, including a synchronous gradient data control algorithm, an asynchronous gradient data control algorithm, a synchronous weight data control algorithm, and an asynchronous weight data control algorithm;

NCCL-supported broadcast and reduction of gradient and weight data across multiple GPUs;

Resuming training by saving and restoring training checkpoints or snapshot files.
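Two of the listed functions, a synchronous gradient update through a parameter server and resuming from a saved checkpoint, can be illustrated with a small in-process sketch. The `ParameterServer` class, its JSON checkpoint format, and the numbers used are hypothetical and do not represent Fabric's actual implementation.

```python
import json
import os
import tempfile
from typing import List


class ParameterServer:
    """Holds the shared weights; workers push gradients and pull weights."""

    def __init__(self, weights: List[float], learning_rate: float, step: int = 0):
        self.weights = list(weights)
        self.learning_rate = learning_rate
        self.step = step

    def apply_sync(self, worker_grads: List[List[float]]) -> None:
        """Synchronous update: use gradients from every worker, averaged."""
        n = len(worker_grads)
        for i in range(len(self.weights)):
            avg = sum(g[i] for g in worker_grads) / n
            self.weights[i] -= self.learning_rate * avg
        self.step += 1

    def save(self, path: str) -> None:
        """Write a checkpoint so training can be resumed later."""
        with open(path, "w") as f:
            json.dump({"weights": self.weights,
                       "learning_rate": self.learning_rate,
                       "step": self.step}, f)

    @classmethod
    def restore(cls, path: str) -> "ParameterServer":
        with open(path) as f:
            state = json.load(f)
        return cls(state["weights"], state["learning_rate"], state["step"])


ps = ParameterServer([1.0, 1.0], learning_rate=0.5)
ps.apply_sync([[1.0, 2.0], [3.0, 4.0]])      # averaged gradient is [2.0, 3.0]

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.json")
    ps.save(checkpoint)
    resumed = ParameterServer.restore(checkpoint)
```

An asynchronous variant would apply each worker's gradient as soon as it arrives instead of waiting for all workers, trading gradient freshness for throughput.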

6. Elastic distribution with Fabric

The elastic design uses fine-grained control during training. Because a coarse-grained strategy limits resource utilization, more GPUs cannot be added during the training phase, and scaling up is likewise limited. The Spark scheduler is responsible for distributing tasks, while the session scheduler is responsible for requesting resources dynamically. When resources are reclaimed, the scheduler lets the task finish its current training iteration and send the gradient result to the parameter server before it exits. When the cluster expands, the scheduler can dynamically add more tasks to the training job to execute the pending training iterations.
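The reclaim and expansion behaviour described above can be modelled with a small simulation. The `ElasticJob` class and the worker names are hypothetical; the sketch only shows the ordering guarantee that a reclaimed worker finishes its current iteration and pushes its gradient before it leaves, and that newly added workers pick up pending iterations.

```python
from typing import Dict, List, Tuple


class ElasticJob:
    """Toy model of a training job whose worker set can grow or shrink."""

    def __init__(self, total_iterations: int):
        self.pending: List[int] = list(range(total_iterations))
        self.pushed: List[Tuple[str, int]] = []   # gradients received by the server
        self.workers: Dict[str, bool] = {}        # worker name -> reclaim requested?

    def add_worker(self, name: str) -> None:
        self.workers[name] = False

    def reclaim(self, name: str) -> None:
        self.workers[name] = True                 # honoured after the current iteration

    def run_round(self) -> None:
        """Each active worker runs one pending iteration, then honours reclaim."""
        for name in list(self.workers):
            if self.pending:
                iteration = self.pending.pop(0)
                self.pushed.append((name, iteration))   # finish and push first
            if self.workers[name]:
                del self.workers[name]                  # only then give up the resource


job = ElasticJob(total_iterations=6)
job.add_worker("w0")
job.run_round()          # w0 runs iteration 0
job.add_worker("w1")     # cluster expands
job.reclaim("w0")        # w0's resources are requested back
job.run_round()          # w0 finishes iteration 1 and exits; w1 runs iteration 2
while job.pending:
    job.run_round()      # w1 completes the remaining iterations
```

No iteration is lost or repeated across the scale-up and the reclaim, which is the property fine-grained control is meant to provide.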

By simplifying the tooling and the data preparation development experience, the development platform of this embodiment addresses the main challenges faced by data scientists and developers, while substantially reducing the time needed to train AI systems. It has the following advantages:

Ease of use: with the new software tool "AI Vision", application developers with limited deep learning knowledge can train and deploy deep learning models for computer vision applications;

Tools for data preparation: integration with the IBM Spectrum Conductor cluster virtualization software, which incorporates Apache Spark, simplifies the process of transforming unstructured and structured data sets in preparation for deep learning training;

Reduced training time: a distributed computing version of TensorFlow, the popular open-source machine learning framework initiated by Google. The distributed version of TensorFlow uses a virtualized cluster of GPU-accelerated servers and applies cost-effective, high-performance computing methods to shorten deep learning training time from weeks to hours;

Easier model development: a new software tool called "DL Insight" enables data scientists to obtain higher accuracy from deep learning models quickly. The tool monitors the deep learning training process and automatically adjusts parameters for peak performance.

The platform is highly open: it has built in a range of ready-to-use, industry-popular open-source deep learning and machine learning libraries and frameworks, including TensorFlow, PyTorch, Keras, and Caffe, and it can maintain these open-source libraries and frameworks on an ongoing basis. It therefore supports the deployment of many types of AI applications and reduces the maintenance cost and complexity that users would otherwise face in maintaining and using open-source products themselves.

The platform provides resource management, scheduling, and multi-tenant management functions, meeting the need of multiple AI applications from different business departments for physically unified operation and management with logically independent construction. The platform supports parallel training across multiple machines and multiple GPU cards, which shortens AI model training time and thus the cycle for implementing AI applications.

The platform supports a hyperparameter optimization search function for model training, which automatically proposes an optimal combination of hyperparameters for AI model training and greatly reduces the trial-and-error cost for AI model developers. The platform provides visual model training and monitoring tools that simplify AI model implementation and management: research and development staff can use the visual interface to inspect and preprocess data, set the training mode, observe and correct training problems, and adjust hyperparameters. For the special technical requirements of AI applications, the platform provides a more optimized file system and scheduling mechanisms, which give AI applications more stable and faster support for model inference. The platform provides good support for opening up AI capabilities: research and development staff can publish a trained AI model as an easily callable AI inference service with one click in the visual interface, and other application systems can use the published service through API interfaces such as RESTful, Stream, and gRPC.

The above embodiments are merely preferred embodiments of the present invention and do not limit its scope of protection. Any insubstantial change or substitution made by those skilled in the art on the basis of the present invention falls within the scope claimed by the present invention.

Claims

1. An artificial intelligence capability development platform, characterized by comprising a hardware layer, a system layer, a data layer, a resource management and scheduling module, a framework layer, and a model management module; the hardware layer includes a CPU, a GPU, and memory, the CPU is electrically connected to the memory, and the GPU performs high-performance parallel computation; the system layer supports the Linux operating system; data files in the data layer are stored and managed using a parallel file system and a cloud object storage system; the resource management and scheduling module manages and schedules CPUs, GPUs, and compute nodes; the framework layer has multiple open-source deep learning frameworks built in; and the model management module monitors and manages model training, model generation, and results.

2. The artificial intelligence capability development platform according to claim 1, characterized in that the Linux operating system includes the Red Hat system and the Ubuntu system.

3. The artificial intelligence capability development platform according to claim 1, characterized in that the multiple open-source deep learning frameworks include TensorFlow, PyTorch, Keras, and Caffe.

4. The artificial intelligence capability development platform according to claim 1, characterized in that the platform further comprises a model publishing module, which publishes a trained artificial intelligence model as an easily callable artificial intelligence inference service for other application systems to call.

5. The artificial intelligence capability development platform according to claim 4, characterized in that other application systems can use the AI inference service published by the artificial intelligence capability development platform by calling it through one of the RESTful, Stream, or gRPC API interfaces.

6. The artificial intelligence capability development platform according to any one of claims 1-5, characterized in that the model management module provides fully graphical interface operation, which covers the data preprocessing required by a deep learning project, model import and management, model training, hyperparameter search, training process visualization, and model validation.

7. The artificial intelligence capability development platform according to claim 6, characterized in that the hyperparameter search achieves hyperparameter optimization mainly through one of random search, the Tree-structured Parzen Estimator (TPE), and Bayesian optimization based on a Gaussian process.

8. The artificial intelligence capability development platform according to claim 1, characterized in that the data layer integrates Apache Spark to simplify the process of transforming unstructured and structured data sets.

9. The artificial intelligence capability development platform according to claim 1, characterized in that the framework layer uses a distributed deep learning framework, which uses multi-ring dynamic network technology to improve the speed-up ratio of parallel training.

10. The artificial intelligence capability development platform according to claim 1, characterized in that the resource management and scheduling module uses the EGO resource management framework to manage and schedule CPUs, GPUs, and compute nodes.