A machine learning and artificial intelligence application all-in-one machine deployment method
Technical field
The present invention relates to the field of machine learning and artificial intelligence, and more particularly to a deployment method for a machine learning and artificial intelligence application all-in-one machine.
Background technology
Artificial intelligence was first proposed as early as the 1950s. It is a cross-discipline in which cybernetics, information theory, computer science, mathematical logic, neurophysiology, psychology, linguistics, pedagogy, medicine, engineering technology, philosophy and other subjects interpenetrate. People dreamed of using the then-nascent computer to construct complex machines possessing the essential characteristics of human wisdom: omnipotent machines with all of our perception (or even more than a person's) and all of our rationality, able to think as we do. Machine learning studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills, and to reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span every field of artificial intelligence. The most basic approach of machine learning is to use algorithms to parse data, learn from it, and then make decisions and predictions about events in the real world. Unlike traditional hard-coded software programs written to solve a particular task, machine learning "trains" various algorithms with large amounts of data so that they learn from the data how to complete the task.
Machine learning is a very popular field in the development of artificial intelligence. The research goal of machine learning is to give computers the ability, like humans, to acquire knowledge from the real world, while establishing a computational theory of learning and constructing various learning systems for application in every field. Machine learning research has three main directions. The first starts from simulating the human learning process and aims to establish a cognitive and physiological model of learning; the development of this direction is closely related to cognitive science. The second is basic research: developing learning theories suited to the characteristics of machines, exploring all possible learning methods, and comparing the similarities, differences and connections between human learning and machine learning. The third is applied research: building practical learning systems or knowledge-acquisition aids, establishing automatic knowledge-acquisition systems in the application fields of artificial intelligence, accumulating experience, and improving knowledge bases and control knowledge, so that the intelligence level of machines approaches that of humans.
At present, technology giants including Baidu and Google invested between 20 and 30 billion dollars in artificial intelligence in 2016, of which 90% went to research, development and deployment, and the remaining 10% to acquisitions. The rate of external investment in artificial intelligence has tripled since 2013. Artificial intelligence development is mainly concentrated in high tech/telecommunications, automotive/assembly and financial services. Machine learning can genuinely help industry solve problems, particularly today's hot topics such as deep learning, whose influence on autonomous driving and artificial intelligence assistants is enormous for industry.
Big data has driven the development of artificial intelligence; at the same time, the development of artificial intelligence also allows data to produce enormous value, becoming "intelligent data". Artificial intelligence is now applied in all kinds of big data applications, such as search recommendation, shopping recommendation, speech recognition, image recognition, chatbots and intelligent medicine. Machine learning and artificial intelligence continue to grow on the foundation of big data. To make disorderly masses of data produce value, the data must be analyzed at large scale with complex network models before high-accuracy models can be trained, which demands an enormous amount of computation. Computing power has therefore become more and more important to the development of machine learning and artificial intelligence.
Current big data machine learning algorithms and artificial intelligence analyses use resources relatively inefficiently, with high resource occupancy and slow processing of massive data; moreover, massive data places high demands on hardware during processing, and cannot meet the intelligent-computing requirements of rapidly growing data-driven enterprises.
Summary of the invention
To remedy the deficiencies of the prior art, the object of the present invention is to provide a machine learning and artificial intelligence application all-in-one machine deployment method. The method employs a special design and a variety of optimization techniques so that the all-in-one machine has ultra-high computing performance, can significantly speed up the running of programs, and is suitable for machine learning and artificial intelligence applications in big data environments.
In order to realize the above object, the present invention adopts the following technical scheme: a machine learning and artificial intelligence application all-in-one machine deployment method, characterized by comprising the following steps:
Step 1: isolate data storage from data processing, build the overall system architecture using a highly scalable Shared-Nothing architecture, and logically separate the system architecture into an application layer, a computation layer and a storage layer, with the application layer, the computation layer and the storage layer all using a distributed architecture;
Step 2: build the network architecture, which is divided into a single-rack networking topology or a multi-rack networking topology and is logically divided into an external network, a management network, a computation network and a storage network;
Step 3: optimize the design of the system for scalability.
Further, the application layer configures a varying number of application nodes according to actual needs; the computation layer configures a varying number of compute nodes according to actual needs; and the storage layer configures a varying number of storage nodes according to actual needs.
Further, the compute nodes are configured with the following software stack:
support for a variety of programming languages;
APIs for machine learning and deep learning;
the integrated deep learning framework TensorFlow;
an integrated, optimized distributed computing framework Spark;
an integrated, optimized distributed in-memory file system Alluxio to accelerate data reads and writes;
integrated, optimized RDMA features.
Further, the storage nodes provide two kinds of storage service: database and general-purpose file system. The databases include the relational database PostgreSQL and a time-series database; the relational database PostgreSQL uses the HAWQ distributed architecture, and the time-series database uses an OpenTSDB+HBase distributed architecture. The general-purpose file system uses a mixed HDFS+Ceph structure, with the HAWQ bottom layer using HDFS.
Further, an external network NIC and a management network NIC are deployed on the application nodes; a management network NIC and a computation-storage network NIC are deployed on the compute nodes; and a management network NIC and a computation-storage network NIC are deployed on the storage nodes.
Further, the single-rack networking topology comprises one rack, constructed as follows:
provide one Ethernet switch whose port count is greater than or equal to the total number of nodes in the rack;
provide one computation-storage network switch whose port count is greater than or equal to the total number of nodes in the rack;
provide one external network switch.
Further, the multi-rack networking topology comprises multiple racks, constructed as follows:
each rack is provided with one Ethernet switch whose port count exceeds the total number of nodes in the rack, with ports reserved to connect other racks;
each rack is provided with one computation-storage network switch whose port count exceeds the total number of nodes in the rack, with ports reserved to connect other racks;
an appropriate number of external network switches are provided;
core switches are provided, and the management network switch of each rack is connected to a core switch in a simple tree.
Further, the computation-storage network switches are InfiniBand switches, and the InfiniBand switches of the racks connect to multiple core switches to form a fat-tree structure.
Further, optimizing the design of the system for scalability comprises the following steps:
use a horizontally scalable architecture to improve performance;
use the layered architecture to increase storage capacity.
Further, the step of using a horizontally scalable architecture to improve performance comprises:
increasing the number of compute nodes in the computation layer;
adding an appropriate number of network switches.
The present invention is advantageous in that:
(1) The all-in-one machine isolates data storage from data processing, adopting a highly scalable Shared-Nothing architecture in which clients, data processing and data storage are separated and logically divided into three levels: an application layer, a computation layer and a storage layer. Each level uses a distributed architecture, achieving high computational concurrency and data read/write concurrency while giving the whole system good scalability, reliability and maintainability.
(2) The distributed hyper-converged hardware architecture and its careful pairing with the software stack avoid wasting storage and computing resources, guarantee the stability of the data analysis pipeline, and improve analysis efficiency. Every aspect of the hardware architecture, including CPU, memory, tiered storage and GPU, has been specially optimized to fully exploit the hardware's capability. At the same time, frameworks such as TensorFlow are deeply integrated, and substantial optimizations are made to distributed machine learning algorithms and communication mechanisms.
(3) Through the architecture, algorithmic improvements and full use of the hardware, an order-of-magnitude computational speedup is achieved, reducing enterprises' investment in big data infrastructure and manpower. Through data cleansing and modeling analysis, high-quality, meaningful information is obtained, thereby mining the value of the data.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the overall framework of the system;
Fig. 3 is a schematic diagram of the component-deployment framework of the system.
Detailed description of embodiments
The present invention is specifically described below with reference to the drawings and specific embodiments.
With reference to Fig. 1, a machine learning and artificial intelligence application all-in-one machine deployment method of the present invention comprises the following steps:
Step 1: the all-in-one machine isolates data storage from data processing, adopting a highly scalable Shared-Nothing architecture that divides the whole into three layers: an application layer, a computation layer and a storage layer. The all-in-one machine's layered architecture provides fully redundant hardware protection: if any single compute node or storage node fails, it is guaranteed that no data is lost and the all-in-one machine continues to work normally, greatly improving the reliability of the system.
Clients, data processing and data storage are separated and logically divided into three levels. Each level uses a distributed architecture, which achieves high computational concurrency and data read/write concurrency while giving the whole system good scalability, reliability and maintainability.
Among these, the application layer mainly runs user-interface services, handling work such as login, monitoring, management, and compute-task orchestration/submission; it requires medium CPU and memory configuration and low storage-capacity configuration. The computation layer executes the compute tasks submitted by users; it requires high CPU and memory configuration and low storage-capacity configuration. The storage layer mainly provides mass storage for the compute nodes; it requires low CPU and memory configuration and high storage-capacity configuration. As shown in Fig. 2, the all-in-one machine can also flexibly configure varying numbers of application nodes, compute nodes and storage nodes according to different actual needs; it is highly scalable and integrates functions such as a Web UI, resource control, system monitoring, resource scheduling and task management.
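As a minimal sketch (not part of the claimed method), the three node roles and their resource profiles described above can be modeled as a configuration table; the concrete node counts and profile structure below are illustrative assumptions:

```python
# Illustrative resource profiles for the three node roles described above.
# "medium" / "high" / "low" follow the text; the structure itself is an assumption.
PROFILES = {
    "application": {"cpu": "medium", "memory": "medium", "storage": "low"},
    "compute":     {"cpu": "high",   "memory": "high",   "storage": "low"},
    "storage":     {"cpu": "low",    "memory": "low",    "storage": "high"},
}

def build_cluster(app_nodes, compute_nodes, storage_nodes):
    """Assemble a cluster description with a varying number of nodes per layer."""
    cluster = []
    for role, count in (("application", app_nodes),
                        ("compute", compute_nodes),
                        ("storage", storage_nodes)):
        for i in range(count):
            cluster.append({"role": role, "id": f"{role}-{i}", **PROFILES[role]})
    return cluster

cluster = build_cluster(app_nodes=2, compute_nodes=8, storage_nodes=4)
print(len(cluster))  # 14
```

Because each layer is independent, changing one count scales that layer alone, which is exactly the flexibility the text attributes to the architecture.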
With reference to Fig. 3: by providing a Web UI, the application nodes let users conveniently perform task management, system monitoring, resource management and other administration. As the outermost layer of the all-in-one machine, the application nodes are exposed to user operation. Specifically, the application nodes provide the following functional interfaces: application management (application submission / application deletion / application status query), data storage and query (structured storage interface / unstructured storage interface), file management (copy / paste / upload / download / create / move / delete), resource monitoring (GPU / CPU / Memory / Network / Disk / others), and administration (resource management / role management / user management / assurance management / node administration).
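A minimal sketch of how the functional interfaces listed above might be catalogued and validated on an application node; the group and action names are hypothetical labels for illustration, not an API defined by the method:

```python
# Hypothetical catalogue of the application-node interfaces listed above.
INTERFACES = {
    "application": ["submit", "delete", "status"],
    "storage":     ["structured", "unstructured"],
    "file":        ["copy", "paste", "upload", "download", "create", "move", "delete"],
    "monitoring":  ["gpu", "cpu", "memory", "network", "disk"],
    "admin":       ["resource", "role", "user", "assurance", "node"],
}

def dispatch(group, action):
    """Validate a request against the catalogue before routing it to a handler."""
    if group not in INTERFACES or action not in INTERFACES[group]:
        raise ValueError(f"unknown interface: {group}/{action}")
    return f"{group}:{action}"   # a real application node would invoke a handler here

print(dispatch("application", "status"))  # application:status
```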
The compute nodes are optimized for the heavy consumption of computing resources, using a specially designed software stack:
A. support is provided for a variety of programming languages, such as Python, R, Java and Scala;
B. APIs for machine learning and deep learning are provided, along with some other general-purpose computation APIs;
C. the deep learning framework TensorFlow is integrated. A large number of applications are developed on TensorFlow as a deep learning framework, and the all-in-one machine integrates it so that applications developed on this framework can run on the machine directly.
D. an optimized distributed computing framework, Spark, is integrated. Spark is an efficient distributed computing system; on this basis, Spark's underlying algorithm libraries are optimized so that distributed tasks run faster on each compute node.
E. an optimized distributed in-memory file system, Alluxio, is integrated to accelerate data reads and writes. Alluxio is a distributed in-memory file system that allows files to be shared reliably at memory speed within a cluster framework; on this basis, it is further optimized so that the scheduling framework can make better use of Alluxio's distributed-memory characteristics.
F. optimized RDMA features are integrated: JXIO. RDMA (Remote Direct Memory Access) technology addresses the latency of server-side data processing during network transmission. RDMA passes data over the network directly into a computer's memory region, moving data quickly from one system into the memory of a remote system without any impact on the operating system, and therefore uses very little CPU resource. It eliminates external memory copies and context switches, freeing memory bandwidth and CPU cycles to improve application performance.
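The role Alluxio plays in the stack above, a memory tier in front of slower backing storage, can be illustrated with a toy read-through cache; this is a pure-Python sketch of the idea, not the Alluxio API:

```python
class MemoryTier:
    """Toy read-through cache: serve repeated reads from memory, fall back to a backing store."""
    def __init__(self, backing_store):
        self.backing_store = backing_store   # a dict standing in for a slow file system
        self.cache = {}                      # the in-memory tier

    def read(self, path):
        if path in self.cache:               # memory-speed hit
            return self.cache[path]
        data = self.backing_store[path]      # slow path: fetch from the backing store
        self.cache[path] = data              # promote into the memory tier
        return data

store = {"/data/train.csv": b"feature,label\n1,0\n"}
tier = MemoryTier(store)
first = tier.read("/data/train.csv")         # miss: loaded from the backing store
second = tier.read("/data/train.csv")        # hit: served from memory
print(first == second)  # True
```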
Further, in keeping with the distributed architecture of the all-in-one machine, the optimized computing platform can be deployed onto each compute node. Because the upper layer uses Mesos for task scheduling and resource management, every compute node plays an identical role, with no distinction between master and worker nodes.
The storage nodes of the all-in-one machine are responsible for providing storage, mainly offering two kinds of storage service: database and general-purpose file system. The databases fall into two classes: the relational database PostgreSQL and a time-series database. The distributed cluster scheme adopts HAWQ for PostgreSQL and OpenTSDB+HBase for the time-series database. With reference to Fig. 3, the file system likewise uses a distributed structure, a mixed HDFS+Ceph structure: the HAWQ bottom layer uses HDFS, while the other components use Ceph for storage.
Accordingly, the upper-layer data management tool can automatically choose, based on a file's storage mode, whether data is stored on HDFS or on Ceph. The database software layer is deployed in the computing cluster and the file system software in the storage cluster, so that the data-management functions of each cluster are positioned as follows:
1) Application cluster:
provide a unified data access interface;
provide import/export interfaces for large-scale data;
deploy the management-software client of the database;
deploy monitoring tools for database status.
2) Computing cluster:
deploy the database management system;
provide SQL/REST API interfaces.
3) Storage cluster:
use the mixed HDFS/Ceph distributed file system;
support block storage and object storage.
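The automatic selection between HDFS and Ceph described above can be sketched as a simple routing rule; the function and mode names are illustrative assumptions, with HAWQ-backed structured data routed to HDFS and the other components to Ceph, as the text states:

```python
def choose_backend(storage_mode):
    """Route data to HDFS or Ceph according to its storage mode (illustrative rule)."""
    # HAWQ's bottom layer uses HDFS; other components use Ceph for storage.
    hdfs_modes = {"hawq", "structured"}
    return "hdfs" if storage_mode in hdfs_modes else "ceph"

print(choose_backend("hawq"))    # hdfs
print(choose_backend("object"))  # ceph
```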
Step 2: build the network architecture, which comes in two forms, a single-rack networking topology and a multi-rack networking topology, and is logically divided into an external network, a management network, a computation network and a storage network.
External network: connects to the user-facing switches and provides the network through which the all-in-one machine's services are accessed externally. The external network interface uses ordinary 1 Gbps Ethernet, and external network NICs are deployed only on the application nodes.
Management network: used to monitor and manage each node of the all-in-one machine and to submit compute tasks to the compute nodes. These tasks do not place high demands on network bandwidth or latency; to avoid affecting the computation and storage networks, ordinary 1 Gbps Ethernet independent of them is used, and a management network NIC must be deployed on every node.
Computation network: connects the compute nodes; its latency requirements are very high (the high-spec version uses InfiniBand NICs).
Storage network: connects the storage nodes; its bandwidth and latency requirements are very high. Here, high-bandwidth, low-latency 56 Gbps InfiniBand (the standard version may use 10 Gbps RoCE NICs) merges the storage network and the computation network into a single network. InfiniBand NICs are deployed on all compute nodes and storage nodes; considering that the application nodes may also need to access data on the storage nodes, deploying InfiniBand NICs on the application nodes may be considered as well.
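The NIC placement rules for the four network planes above can be summarized in a small table-driven helper; the NIC labels are illustrative, and the optional InfiniBand NIC on application nodes is modeled as a flag, per the text:

```python
def nics_for(role, app_access_storage=False):
    """Return the NICs to deploy on a node of the given role (labels are illustrative)."""
    nics = {
        "application": ["external-eth", "management-eth"],
        "compute":     ["management-eth", "compute-storage-ib"],
        "storage":     ["management-eth", "compute-storage-ib"],
    }[role]
    # Application nodes may optionally get an InfiniBand NIC to reach storage data.
    if role == "application" and app_access_storage:
        nics = nics + ["compute-storage-ib"]
    return nics

print(nics_for("compute"))  # ['management-eth', 'compute-storage-ib']
```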
The all-in-one machine is installed in units of racks, and each rack can hold several application nodes, compute nodes and storage nodes. Each rack is equipped with one Ethernet switch (management network) and one InfiniBand switch (computation-storage network), and the port count of each switch should be no less than the total number of nodes in that rack. If multiple racks are to be connected, the switches must also reserve a certain number of ports for connecting to the other racks. As for external network switches, considering that application nodes are relatively few, multiple racks may share one switch. Networking multiple racks requires additional core switches to connect the racks: the management network switches of the racks can converge to a single core switch in a simple tree, whereas the InfiniBand computation-storage network switches require multiple core switches forming a fat-tree structure, to guarantee a full-bandwidth path between any two nodes.
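The per-rack switch sizing rule above (port count at least the number of nodes in the rack, plus reserved ports when racks are interconnected) can be checked with a one-line predicate; the port and uplink counts below are illustrative:

```python
def rack_switch_ok(ports, nodes_in_rack, uplinks_reserved=0):
    """A rack switch is adequate if its ports cover every node plus any reserved uplinks."""
    return ports >= nodes_in_rack + uplinks_reserved

# Single rack: the port count only needs to cover the nodes.
print(rack_switch_ok(ports=48, nodes_in_rack=40))                      # True
# Multi-rack: ports must also be reserved for connecting the other racks.
print(rack_switch_ok(ports=48, nodes_in_rack=46, uplinks_reserved=4))  # False
```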
Step 3: the deployment of the all-in-one machine is designed to give it good system scalability, divided broadly into performance scaling and storage-capacity scaling.
The all-in-one machine uses a horizontally scalable architecture: the overall computing resources (GPU/CPU/Memory) can be increased by adding compute nodes to the computation layer, thereby raising the running speed of application programs. When the number of compute nodes grows sharply, the network may become a bottleneck; to keep computing resources (GPU/CPU/Memory), storage and network in balance, an appropriate number of network switches must be added to resolve the bottleneck. Because the layered architecture of the all-in-one machine separates the data storage units from the computation processing units, storage capacity can be increased laterally and directly whenever large amounts of storage are needed, which is very convenient.
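The horizontal-scaling argument above can be sketched numerically; the per-node resource figures are purely illustrative assumptions:

```python
def scale_out(compute_nodes, gpu_per_node=4, cpu_per_node=32, mem_gb_per_node=256):
    """Aggregate compute resources grow linearly with the number of compute nodes."""
    return {
        "gpu":    compute_nodes * gpu_per_node,
        "cpu":    compute_nodes * cpu_per_node,
        "mem_gb": compute_nodes * mem_gb_per_node,
    }

before = scale_out(8)
after = scale_out(16)   # doubling the computation layer doubles every resource pool
print(before["gpu"], after["gpu"])  # 32 64
```

Linear growth in the resource pools is what makes the network the eventual bottleneck, which is why the text pairs node scaling with adding switches.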
The basic principles, principal features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited in any way by the above embodiments; all technical schemes obtained by means of equivalent substitution or equivalent transformation fall within the protection scope of the present invention.