CN111506367B - Multi-cluster type artificial intelligence online service method and system

Info

Publication number: CN111506367B
Application number: CN202010310086.3A
Authority: CN
Inventors: 谢冬鸣; 林健
Original assignee: Dongyun Ruilian Wuhan Computing Technology Co ltd
Current assignee: Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority date: 2020-04-20
Filing date: 2020-04-20
Publication date: 2023-09-22
Anticipated expiration: 2040-04-20
Also published as: CN111506367A

Abstract

The invention discloses a multi-cluster artificial intelligence online service method and system. The method comprises: acquiring a target job request and a target job submitted by a user, analyzing the target job, and determining a job operation mode of the target job; determining, according to the job operation mode, a target job cluster corresponding to the target job and a target gateway component or a target container required for executing the target job; and executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result, and feeding it back. The artificial intelligence online service can thus be deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to the end user, and external cluster resources can be accessed conveniently and efficiently from within the service, achieving secure cross-cluster access.

Description

Multi-cluster type artificial intelligence online service method and system

Technical Field

The invention relates to the field of distributed systems, in particular to a multi-cluster artificial intelligence online service method and system.

Background

In recent years, artificial intelligence technology has developed rapidly and has been widely adopted across industries. Providing artificial intelligence technology as an online service is an important way to bring the technology into practical use and to support applications. Online service modes for artificial intelligence include cloud computing modes such as public cloud, private cloud, and hybrid cloud, as well as the traditional mode of a service built into a local area network.

There is a clear business need to run artificial intelligence online services in a multi-cluster environment. However, the characteristics of artificial intelligence services and of multi-cluster online service environments make it difficult to implement such services efficiently and securely. First, an artificial intelligence online service must support two operation modes, batch processing and micro-service, as well as multiple interaction modes such as non-real-time web interface interaction and programmatic API access. These interaction modes have different technical characteristics and place different requirements on the service software architecture and communication components, so multi-cluster service software is difficult to manage uniformly through one simple mechanism. Second, in a multi-cluster environment the networks of the clusters cannot access one another freely and must satisfy rule constraints, including ingress and egress IP range constraints, protocol constraints, port constraints, and connection persistence constraints. Third, compared with an ordinary multi-cluster environment, a hybrid cloud also involves security problems that cross administrative domains and the coexistence of multiple user authentication mechanisms.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The main purpose of the invention is to provide a multi-cluster artificial intelligence online service method and system, aiming to solve the technical problems in the prior art that artificial intelligence jobs are of many types, that these job types differ greatly from one another and in their modes of external interaction, that permission constraints exist, and that the resulting design is complex.

In order to achieve the above object, the present invention provides a multi-cluster type artificial intelligence online service method, which includes the following steps:

acquiring a target job request and a target job submitted by a user, analyzing the target job, and determining a job operation mode of the target job;

determining a target job cluster corresponding to the target job and a target gateway component or a target container required by executing the target job according to the job operation mode;

and executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back.

Preferably, the obtaining the target job request and the target job submitted by the user, analyzing the target job, and determining the job operation mode of the target job includes:

acquiring a target job request and a target job submitted by a user, analyzing the target job, and acquiring state query information and operation log query information corresponding to the target job;

and receiving a job operation mode of the target job obtained from a local business system in response to the state query information and the operation log query information.

Preferably, the obtaining the target job request and the target job submitted by the user, analyzing the target job, and obtaining the state query information and the operation log query information corresponding to the target job includes:

acquiring a target job request submitted by a user and a target job;

querying, in response to the target job request, a target file storage directory corresponding to the target job in a preset log storage service file system;

and acquiring state query information and operation log query information corresponding to the target job according to the target file storage directory.

Preferably, after the step of obtaining the target job request and the target job submitted by the user, the multi-cluster artificial intelligence online service method further includes:

acquiring user information corresponding to the target job request, and performing identity authentication on the user information through a preset identity authentication component;

when the identity authentication passes, allowing the step of querying, in response to the target job request, the target file storage directory corresponding to the target job in the preset log storage service file system;

and when the identity authentication fails, refusing to respond to the target job request and feeding back authentication failure information.

Preferably, the determining, according to the job operation mode, a target job cluster corresponding to the target job, and a target gateway component or a target container required for executing the target job, includes:

when the job operation mode is a batch job mode, determining that a target job cluster corresponding to the target job is a job cluster for running batch job, and a target gateway component required for executing the target job is a log gateway component or a target container is a batch job container;

when the job operation mode is a micro-service job mode, determining that the target job cluster corresponding to the target job is a job cluster for running micro-service jobs, and that the target gateway component required for executing the target job is a proxy forwarding component or the target container is a micro-service job container.

Preferably, the executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back, including:

processing the target job in the job cluster running batch jobs through the log gateway component or the batch job container to obtain a processing result, and feeding back the processing result to the end user;

and processing the target job in the job cluster running the micro-service job through the proxy forwarding component or the micro-service job container to obtain a processing result, and feeding back the processing result to a terminal user.

Preferably, before the target job is executed in the target job cluster through the target gateway component or the target container to obtain a processing result and perform feedback, the multi-cluster artificial intelligence online service method further includes:

and forwarding the target job to the target job cluster according to a preset forwarding rule or a dynamic forwarding rule.

In addition, in order to achieve the above object, the present invention also provides a multi-cluster type artificial intelligence online service system, which includes:

the analysis module is used for acquiring a target job request and a target job submitted by a user, analyzing the target job and determining a job operation mode of the target job;

The component determining module is used for determining a target job cluster corresponding to the target job according to the job running mode and a target gateway component or a target container required by executing the target job;

and the execution module is used for executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back the processing result.

Preferably, the component determining module includes:

the batch processing component module is used for determining that a target job cluster corresponding to the target job is a job cluster for running batch processing jobs when the job running mode is a batch processing job mode, and a target gateway component required for executing the target job is a log gateway component or a target container is a batch processing job container;

and the micro-service component module is used for determining that the target job cluster corresponding to the target job is the job cluster for running the micro-service job when the job running mode is the micro-service job mode, and the target gateway component required by executing the target job is a proxy forwarding component or the target container is a micro-service job container.

Preferably, the execution module includes:

The batch processing execution module is used for processing the target job through the log gateway component or the batch processing job container in the job cluster running batch processing job to obtain a processing result, and feeding back the processing result to a terminal user;

and the micro-service execution module is used for processing the target job in the job cluster running the micro-service job through the proxy forwarding component or the micro-service job container to obtain a processing result, and feeding back the processing result to a terminal user.

According to the multi-cluster artificial intelligence online service method of the invention, a target job request and a target job submitted by a user are acquired, the target job is analyzed, and the job operation mode of the target job is determined; a target job cluster corresponding to the target job, and a target gateway component or target container required for executing the target job, are determined according to the job operation mode; and the target job is executed in the target job cluster through the target gateway component or the target container, and a processing result is obtained and fed back. The artificial intelligence online service can thus be deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to the end user, and external cluster resources can be accessed conveniently and efficiently from within the service, achieving secure cross-cluster access.

Drawings

FIG. 1 is a flow chart of a first embodiment of a multi-cluster artificial intelligence online service method of the present invention;

FIG. 2 is a flow chart of a second embodiment of a multi-cluster artificial intelligence online service method of the present invention;

FIG. 3 is a flow chart of a third embodiment of a multi-cluster artificial intelligence online service method of the present invention;

FIG. 4 is a functional block diagram of a first embodiment of a multi-cluster artificial intelligence online service system according to the present invention;

FIG. 5 is a schematic diagram of an operation cluster architecture of the multi-cluster artificial intelligence online service method of the present invention.

FIG. 6 is a schematic diagram of a service cluster architecture of the multi-cluster artificial intelligence online service method of the present invention.

FIG. 7 is a schematic diagram of a central gateway architecture of a multi-cluster artificial intelligence online service method according to the present invention.

FIG. 8 is a schematic diagram of operation of a job cluster in the multi-cluster artificial intelligence online service method of the present invention.

FIG. 9 is a diagram of log components of a multi-cluster artificial intelligence online service method of the present invention.

FIG. 10 is a schematic diagram of a proxy forwarding component of the multi-cluster artificial intelligence online service method of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The solution of the embodiment of the invention is mainly as follows: a target job request and a target job submitted by a user are acquired, the target job is analyzed, and the job operation mode of the target job is determined; a target job cluster corresponding to the target job, and a target gateway component or target container required for executing the target job, are determined according to the job operation mode; and the target job is executed in the target job cluster through the target gateway component or the target container, and a processing result is obtained and fed back. The artificial intelligence online service can thus be deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to the end user, and external cluster resources can be accessed conveniently and efficiently from within the service, achieving secure cross-cluster access. This solves the technical problems in the prior art that artificial intelligence jobs are of many types, that these job types differ greatly from one another and in their modes of external interaction, that permission constraints exist, and that the resulting design is complex.

Referring to fig. 1, fig. 1 is a flowchart of a first embodiment of a multi-cluster artificial intelligence online service method according to the present invention.

In a first embodiment, the multi-cluster artificial intelligence online service method includes the steps of:

step S10, acquiring a target job request and a target job submitted by a user, analyzing the target job, and determining a job operation mode of the target job.

The target job request is a job request that the user initiates to the service cluster through a World Wide Web (Web) interface or an application programming interface (API). Generally, the target job request includes user information and the job information of the job. Analyzing the target job determines its job operation mode.

In a specific implementation, the execution body of this embodiment is the service cluster, which manages the metadata and life cycle of the jobs submitted by users, that is, it implements the logic functions of the artificial intelligence online service system. The target job may be a deep learning job. The computing tasks in deep learning application scenarios mainly include training, inference, development environment debugging, and visualization, so the job types of deep learning jobs mainly include training jobs, development environment debugging jobs, visualization jobs, and inference jobs. A training job may be single-machine or distributed, and distributed training can be implemented in several ways. A distributed training job is run under one of several scheduling modes: its tasks usually need to be started and run as a batch according to a certain policy, all tasks exit when training completes, and the job ends automatically. Other job types, such as inference, development environment debugging, and visualization, are usually not run in batch form. Their running time is controlled manually by the user, so they must run for a long time and serve clients through a web page or an API while running; such jobs are commonly called micro-service jobs.
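The split between batch jobs and micro-service jobs described above can be sketched as a simple lookup. This is a minimal illustration rather than the implementation of the invention; the `RunMode` enum, the job-type labels, and the `run_mode_for` function are hypothetical names introduced here.

```python
from enum import Enum


class RunMode(Enum):
    BATCH = "batch"
    MICROSERVICE = "microservice"


# Hypothetical job-type labels: the text names the categories, not identifiers.
BATCH_TYPES = {"training"}
MICROSERVICE_TYPES = {"inference", "dev_debug", "visualization"}


def run_mode_for(job_type: str) -> RunMode:
    """Map a deep learning job type to the run mode described in the text."""
    if job_type in BATCH_TYPES:
        return RunMode.BATCH
    if job_type in MICROSERVICE_TYPES:
        return RunMode.MICROSERVICE
    raise ValueError(f"unknown job type: {job_type!r}")
```

Raising on an unknown type, instead of falling back to a default mode, is a design choice of this sketch: a job routed to the wrong kind of cluster would fail later in a less obvious way.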

Step S20, determining a target job cluster corresponding to the target job according to the job operation mode, and a target gateway component or a target container required for executing the target job.

It should be noted that different job operation modes correspond to different target job clusters and to different gateway components or containers. The job operation mode determines the appropriate job system, that is, the appropriate job cluster; pairing it with the corresponding gateway component or container allows the artificial intelligence online service to be deployed in a targeted way and achieves secure cross-cluster access.

And step S30, executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back the processing result.

It can be understood that, once the target job cluster and the target gateway component or target container have been determined, the target job can be executed in the target job cluster through the target gateway component or the target container, and the processing result is obtained and fed back, generally to the user terminal, so that the user learns the job processing status in real time. In this way the artificial intelligence online service is deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to the end user.

In a specific implementation, the service cluster uses the job information carried in the user's target job request to look up, in its own business system, the target job cluster where the target job is located, and then forwards the target job request onward to that cluster through the central gateway. When the target job cluster receives the target job request, the request is routed to the proxy forwarding component in the target job cluster. The proxy forwarding component sends an authentication confirmation request to the identity service component based on the job information and user information in the request and, after receiving confirmation, forwards the request to the corresponding job container or log component according to the forwarding rules.

Further, before the step S30, the multi-cluster artificial intelligence online service method further includes:

forwarding the target job to the target job cluster according to a preset forwarding rule or a dynamic forwarding rule.

The preset forwarding rule is a forwarding rule configured in advance. The dynamic forwarding rule is a rule that the service cluster creates dynamically through an API, or a rule created on the fly in response to a user request. The preset and dynamic forwarding rules together route the target job to the correct target service, and a request received for an external artificial intelligence micro-service is forwarded to the corresponding target job cluster.

In a specific implementation, the proxy forwarding component is preconfigured with some forwarding rules and also accepts dynamic rules that the service cluster creates through an API (application programming interface), or creates rules on the fly in response to a user request. Each such dynamic rule corresponds to a batch job or micro-service running in the job cluster. When the proxy forwarding component receives a specific job request, it first authenticates against the authentication service using the job and user information in the request; after the user is authenticated, it finds the matching job container or log component according to the rules belonging to that user and forwards the request.
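The authenticate-then-forward behaviour of the proxy forwarding component can be sketched as follows. This is an illustrative model only: the class name, the rule representation (path prefix for preset rules, job identifier for dynamic rules), and the choice to let a dynamic rule take precedence are all assumptions that the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ProxyForwarder:
    """Illustrative proxy forwarding component (all names are assumptions)."""

    # Callback standing in for the identity service: (user, job_id) -> allowed?
    authenticate: Callable[[str, str], bool]
    # Preset rules: request path prefix -> target component.
    preset_rules: Dict[str, str] = field(default_factory=dict)
    # Dynamic rules: job id -> target container, one per running job.
    dynamic_rules: Dict[str, str] = field(default_factory=dict)

    def add_dynamic_rule(self, job_id: str, target: str) -> None:
        # Called by the service cluster, e.g. through an API, when a job starts.
        self.dynamic_rules[job_id] = target

    def route(self, user: str, job_id: str, path: str) -> Optional[str]:
        # Authenticate first; unauthenticated requests are never forwarded.
        if not self.authenticate(user, job_id):
            return None
        # Assumption of this sketch: a per-job dynamic rule wins over presets.
        if job_id in self.dynamic_rules:
            return self.dynamic_rules[job_id]
        for prefix, target in self.preset_rules.items():
            if path.startswith(prefix):
                return target
        return None
```

Returning `None` for both an authentication failure and a missing rule keeps the sketch short; a real component would presumably distinguish the two cases in its response.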

According to the above scheme, in this embodiment a target job request and a target job submitted by a user are acquired, the target job is analyzed, and the job operation mode of the target job is determined; a target job cluster corresponding to the target job, and a target gateway component or target container required for executing the target job, are determined according to the job operation mode; and the target job is executed in the target job cluster through the target gateway component or the target container, and a processing result is obtained and fed back. The artificial intelligence online service can thus be deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to the end user, and external cluster resources can be accessed conveniently and efficiently from within the service, achieving secure cross-cluster access.

Further, fig. 2 is a flow chart of a second embodiment of the multi-cluster type artificial intelligence online service method of the present invention, as shown in fig. 2, the second embodiment of the multi-cluster type artificial intelligence online service method of the present invention is proposed based on the first embodiment, in this embodiment, the step S10 includes the following steps:

step S11, a target job request and a target job submitted by a user are obtained, the target job is analyzed, and state query information and operation log query information corresponding to the target job are obtained.

It can be understood that analyzing the target job yields the state query information and the operation log query information corresponding to the target job. The state query information is the queried information reflecting the interface state of the target job. Among the various interactive interfaces based on the HyperText Transfer Protocol (HTTP), some are stateful and some are stateless; some are idempotent and some are not; some allow a user authentication mechanism to be added freely, while others require a transparent user authentication mechanism so that the original protocol is not disturbed. Every type of job generates logs, and the operation log query information is the queried log information generated while the target job runs, or its historical log information. Further, the step S11 includes the steps of:

acquiring a target job request and a target job submitted by a user;

querying, in response to the target job request, a target file storage directory corresponding to the target job in a preset log storage service file system;

and acquiring the state query information and operation log query information corresponding to the target job according to the target file storage directory.

It should be understood that, in response to the target job request, the target file storage directory corresponding to the target job can be queried in the preset log storage service file system, and the state query information and operation log query information of the target job obtained from it. The preset log storage service file system stores in advance the file storage directories allocated to different jobs, so that when it receives a request from the service cluster concerning the logs of an artificial intelligence job, it can accurately retrieve the underlying log content and respond with it.
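The per-job directory lookup can be sketched as below. The one-directory-per-job layout, the `*.log` file naming, and the `LogStore` class are assumptions made for illustration; the text only states that a storage directory is allocated to each job in advance and queried on request.

```python
from pathlib import Path
from typing import Dict, List


class LogStore:
    """Illustrative log storage service: one directory per job (assumed layout)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._dirs: Dict[str, Path] = {}

    def allocate(self, job_id: str) -> Path:
        # Assign and create the job's storage directory ahead of time.
        job_dir = self.root / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        self._dirs[job_id] = job_dir
        return job_dir

    def read_logs(self, job_id: str) -> List[str]:
        # Look up the job's directory, then return its log lines in file-name order.
        job_dir = self._dirs[job_id]  # KeyError if no directory was allocated
        lines: List[str] = []
        for log_file in sorted(job_dir.glob("*.log")):
            lines.extend(log_file.read_text(encoding="utf-8").splitlines())
        return lines
```

In a deployment the root would more plausibly be a mount point of the distributed file system that the job cluster provides or connects to; a local path is used here only so the sketch runs anywhere.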

Further, after the step of obtaining the target job request and the target job submitted by the user, the multi-cluster artificial intelligence online service method further comprises:

acquiring user information corresponding to the target job request, and performing identity authentication on the user information through a preset identity authentication component;

when the identity authentication passes, allowing the step of querying, in response to the target job request, the target file storage directory corresponding to the target job in the preset log storage service file system;

and when the identity authentication fails, refusing to respond to the target job request and feeding back authentication failure information.

It should be noted that, in actual operation, the job semantics of the target job request may be bound to the corresponding user identity information. The preset identity authentication component may be deployed in the service cluster, in an internal job cluster, or separately. User information corresponding to the target job request is acquired and authenticated by the preset identity authentication component, for example by checking permissions against the user information. If the identity authentication passes, the subsequent steps continue; if it fails, the target job request is refused and authentication failure information is fed back. The failure information may be fed back to the server side, that is, the service cluster, to the user equipment, or to another terminal, which this embodiment does not limit.

In a specific implementation, the user information is generally parsed from the HTTP request, and the identity authentication service uses data in an HTTP Header or Cookie as the source information for identity authentication; other source information may also serve as the basis for identity authentication, which this embodiment does not limit. The identity authentication service may be a REST service developed in Java that runs as a Pod on a service cluster implemented with Kubernetes; it may also be a service component developed in another language, or run independently outside both the job cluster and the service cluster. Its main functions are as follows. It provides identity authentication for the service cluster: when a user initiates a request to the service cluster through the Web interface or the API, it provides the login function and the subsequent verification of identity validity. It also serves the job cluster: after the job cluster receives a job request, it queries the identity authentication service with the user information in the request to verify the authenticity of the identity information, ensuring that the user is legitimate and the identity is not forged.
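Reading the credential from an HTTP Header or Cookie can be sketched as follows. The header name `X-Auth-Token`, the cookie name `auth_token`, and the token-to-user table are hypothetical; the text says only that Header or Cookie data is the source information. The text also describes a Java REST service, whereas this sketch uses Python's standard `http.cookies` module purely for illustration.

```python
from http.cookies import SimpleCookie
from typing import Mapping, Optional

# Hypothetical names: the text does not say which header or cookie is used.
TOKEN_HEADER = "X-Auth-Token"
TOKEN_COOKIE = "auth_token"


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the credential from the token header, else from the Cookie header."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    token = lowered.get(TOKEN_HEADER.lower())
    if token:
        return token
    cookie = SimpleCookie()
    cookie.load(lowered.get("cookie", ""))
    morsel = cookie.get(TOKEN_COOKIE)
    return morsel.value if morsel is not None else None


def authenticate(headers: Mapping[str, str],
                 valid_tokens: Mapping[str, str]) -> Optional[str]:
    """Return the user name bound to the request's token, or None if invalid."""
    token = extract_token(headers)
    if token is None:
        return None
    return valid_tokens.get(token)
```

Because the credential travels in request metadata rather than in the body, this style of check can be added in front of an existing HTTP interface without changing the payload that the job container receives.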

And step S12, receiving a job operation mode of the target job obtained from a local business system in response to the state query information and the operation log query information.

In a specific implementation, the job operation mode is generally determined when the job is submitted and is stored in the local business system; receiving the job operation mode of the target job from the local business system in response to the state query information and the operation log query information therefore amounts to querying the service cluster's own business system. Generally, there are two types of job operation mode: batch jobs, represented by model training and batch inference, and micro-service jobs, represented by real-time inference, debugging tools, and visualization tools. Batch jobs are similar to classical high-performance computing or big data processing jobs: they are generally managed and executed by a batch job scheduler and have a finite life cycle tied to a specific business task. Micro-service jobs are similar to classical web services or Platform-as-a-Service (PaaS) layer API services: they are typically managed and executed by a micro-service scheduler and have a stateless, indefinitely extensible life cycle.
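Storing the job operation mode at submission time and reading it back for later queries can be sketched with a small in-memory registry. The `JobRecord` fields and the `BusinessSystem` class are hypothetical stand-ins for the service cluster's local business system, whose actual storage is not described in the text.

```python
from dataclasses import dataclass
from typing import Dict

VALID_RUN_MODES = ("batch", "microservice")


@dataclass(frozen=True)
class JobRecord:
    """Metadata recorded when a job is submitted (field names are assumed)."""

    job_id: str
    owner: str
    run_mode: str
    job_cluster: str


class BusinessSystem:
    """In-memory stand-in for the service cluster's local business system."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRecord] = {}

    def submit(self, record: JobRecord) -> None:
        # The run mode is fixed at submission time and stored with the job.
        if record.run_mode not in VALID_RUN_MODES:
            raise ValueError(f"unsupported run mode: {record.run_mode!r}")
        self._jobs[record.job_id] = record

    def run_mode_of(self, job_id: str) -> str:
        # Later status or log queries read the stored mode back.
        return self._jobs[job_id].run_mode

    def cluster_of(self, job_id: str) -> str:
        # The same record tells the service cluster where to forward requests.
        return self._jobs[job_id].job_cluster
```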

According to the above scheme, in this embodiment a target job request and a target job submitted by a user are acquired, and the target job is analyzed to obtain the corresponding state query information and operation log query information; the job operation mode of the target job, obtained from the local business system in response to the state query information and the operation log query information, is then received. The artificial intelligence online service can thus be deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to the end user, and external cluster resources can be accessed conveniently and efficiently from within the service, achieving secure cross-cluster access.

Further, fig. 3 is a flow chart of a third embodiment of the multi-cluster artificial intelligence online service method of the present invention. As shown in fig. 3, the third embodiment is proposed based on the first embodiment. In this embodiment, the step S20 includes the following steps:

and S21, when the job operation mode is a batch job mode, determining that a target job cluster corresponding to the target job is a job cluster for running batch job, and the target gateway component required for executing the target job is a log gateway component or the target container is a batch job container.

It should be noted that, in the batch job mode, a log gateway component needs to be deployed to provide log collection and query services. The job cluster running batch jobs provides the actual computing resources and other runtime environment for batch jobs; provides scheduling capability for batch jobs, which differs from scheduling for micro-service jobs in that batch jobs require specialized group scheduling capability; provides or connects to a distributed file system for log storage; schedules each received batch job onto suitable resource nodes to run; and, for a log request sent by the service cluster, returns the correct log response for the corresponding job through the log service component in the job cluster. Typical group scheduling capabilities include "single process", "multi-process distributed", "PS-Worker distributed", and "MPI distributed". A job cluster that needs to support artificial intelligence batch jobs must support at least one of these typical group scheduling capabilities.
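The requirement that a batch job cluster support the group scheduling capability a job needs can be sketched as a capability check. The mode identifiers (with `ps_worker` standing for parameter server and worker training) and the `eligible_clusters` function are assumptions; the sketch only selects clusters and does not model the scheduling itself.

```python
from typing import Dict, FrozenSet, List

# The four group scheduling modes named in the text; identifiers are assumed.
GROUP_SCHEDULING_MODES = frozenset(
    {"single_process", "multi_process", "ps_worker", "mpi"}
)


def eligible_clusters(required_mode: str,
                      clusters: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return, sorted by name, the clusters that support the required mode."""
    if required_mode not in GROUP_SCHEDULING_MODES:
        raise ValueError(f"unknown group scheduling mode: {required_mode!r}")
    return sorted(
        name for name, modes in clusters.items() if required_mode in modes
    )
```

A cluster with an empty capability set, such as one that runs only micro-service jobs, is never returned, which matches the statement that batch support requires at least one of the listed capabilities.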

Step S22, when the job operation mode is a micro-service job mode, determining that the target job cluster corresponding to the target job is a job cluster for running micro-service jobs, and that the target gateway component required for executing the target job is a proxy forwarding component or the target container is a micro-service job container.

It can be understood that, in the micro-service job mode, a proxy forwarding component needs to be deployed; typical scenarios are a REST inference service or a web development environment. The service cluster contains a proxy forwarding component, which may also be called the central gateway; for a micro-service request or a batch job log request, it transparently forwards the user request to the appropriate job cluster and obtains the result. The job request sent by the service cluster is forwarded to the appropriate batch job or micro-service job container through the proxy forwarding component.
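Steps S21 and S22 together amount to a mapping from job operation mode to a job cluster, a gateway component, and a container. The following sketch shows that mapping; the `ExecutionTarget` type and every string label in it are hypothetical names chosen for illustration.

```python
from typing import Dict, NamedTuple


class ExecutionTarget(NamedTuple):
    job_cluster: str
    gateway_component: str
    container: str


# Hypothetical labels for the clusters and components named in S21 and S22.
TARGETS: Dict[str, ExecutionTarget] = {
    "batch": ExecutionTarget(
        "batch-job-cluster", "log-gateway", "batch-job-container"
    ),
    "microservice": ExecutionTarget(
        "microservice-job-cluster", "proxy-forwarder", "microservice-job-container"
    ),
}


def select_target(run_mode: str) -> ExecutionTarget:
    """Pick the job cluster, gateway component and container for a run mode."""
    try:
        return TARGETS[run_mode]
    except KeyError:
        raise ValueError(f"unsupported run mode: {run_mode!r}") from None
```

Keeping the mapping in one table means the caller of `select_target` never needs to know which cluster handles which mode, which is one way to keep the multi-cluster layout transparent to the end user.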

Accordingly, the step S30 includes the steps of:

It should be appreciated that different job operation modes correspond to different ways of processing the target job: the target job is processed in the job cluster running batch jobs through the log gateway component or the batch job container, or in the job cluster running micro-service jobs through the proxy forwarding component or the micro-service job container. In this way the artificial intelligence online service can be deployed in a multi-cluster environment planned around requirements such as security, while remaining transparent to end users.

According to this scheme, when the job operation mode is the batch job mode, the target job cluster corresponding to the target job is determined to be a job cluster running batch jobs, and the target gateway component required for executing the target job is a log gateway component or the target container is a batch job container; when the job operation mode is the micro-service job mode, the target job cluster corresponding to the target job is determined to be a job cluster running micro-service jobs, and the target gateway component required for executing the target job is a proxy forwarding component or the target container is a micro-service job container. The artificial intelligence online service can thus be deployed in a multi-cluster environment planned around requirements such as security while remaining transparent to end users, and external cluster resources can be accessed conveniently and efficiently from within the service, achieving secure cross-cluster access.
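The mode-dependent selection summarized above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `JobMode`, `Target`, `TARGETS` and `determine_target`, as well as the string labels, are hypothetical and do not appear in the original text.

```python
from dataclasses import dataclass
from enum import Enum


class JobMode(Enum):
    BATCH = "batch"
    MICROSERVICE = "microservice"


@dataclass(frozen=True)
class Target:
    cluster_kind: str       # kind of job cluster that runs the job
    gateway_component: str  # gateway component used to reach the job
    container_kind: str     # container type that executes the job


# Hypothetical lookup table: one entry per job operation mode.
TARGETS = {
    JobMode.BATCH: Target(
        "batch-job-cluster", "log-gateway", "batch-job-container"),
    JobMode.MICROSERVICE: Target(
        "microservice-job-cluster", "proxy-forwarding",
        "microservice-job-container"),
}


def determine_target(mode: JobMode) -> Target:
    """Return the target cluster kind and component for a job mode."""
    return TARGETS[mode]
```

A table lookup is used here only because there are two fixed modes; a real system would also need to choose a concrete cluster instance.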

Based on the embodiment of the multi-cluster type artificial intelligence online service method, the invention further provides a multi-cluster type artificial intelligence online service system.

Referring to fig. 4, fig. 4 is a functional block diagram of a first embodiment of the multi-cluster artificial intelligence online service system of the present invention.

In a first embodiment of the multi-cluster type artificial intelligence online service system of the present invention, the multi-cluster type artificial intelligence online service system comprises:

the analysis module 10 is configured to obtain a target job request and a target job submitted by a user, analyze the target job, and determine a job operation mode of the target job.

The component determining module 20 is configured to determine, according to the job operation mode, a target job cluster corresponding to the target job, and a target gateway component or a target container required for executing the target job.

Further, the component determination module 20 includes:

a batch processing component module, configured to determine, when the job operation mode is the batch job mode, that the target job cluster corresponding to the target job is a job cluster running batch jobs, and that the target gateway component required for executing the target job is a log gateway component or the target container is a batch job container.

It will be appreciated that, in the batch job mode, a log gateway component needs to be deployed to provide log collection and query services. The job cluster running the batch job provides the actual computing resources and the rest of the runtime environment for the batch job; provides scheduling capability for batch jobs, which differs from the scheduling of micro-service jobs in that batch jobs require specialized group scheduling capability; provides or accesses a distributed file system for log storage; schedules each received batch job to a suitable resource node for running; and, in response to a log request sent by the service cluster, correctly returns the log of the corresponding job through the log service component in the job cluster. Typical group scheduling capabilities include "single process", "multi-process distributed", "PS-worker distributed" and "MPI distributed"; a job cluster that needs to support artificial intelligence batch jobs must support at least one of these group scheduling capabilities.

It should be understood that, in the micro-service job mode, a proxy forwarding component needs to be deployed; a typical scenario is a REST inference service or a web development environment. The service cluster includes a proxy forwarding component, which may also be called a central gateway; for a micro-service request or a batch job log request, this component transparently forwards the user's request to the appropriate job cluster and obtains the result. Within the job cluster, the proxy forwarding component forwards the job request sent by the service cluster to the appropriate batch job or micro-service job container.

And the execution module 30 is used for executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back.

Further, the execution module 30 includes:

For the steps implemented by each functional module of the multi-cluster artificial intelligence online service system, reference may be made to the embodiments of the multi-cluster artificial intelligence online service method of the present invention; they are not repeated here.

Based on the embodiment of the multi-cluster type artificial intelligence online service method, the embodiment of the invention also provides a job cluster architecture schematic diagram of the multi-cluster type artificial intelligence online service method, and fig. 5 is the job cluster architecture schematic diagram of the multi-cluster type artificial intelligence online service method.

Referring to FIG. 5, artificial intelligence batch jobs are managed uniformly by the service cluster and may then be scheduled to run in different job clusters. When a user accesses an artificial intelligence micro-service job, the request is forwarded uniformly through the service cluster gateway to the proxy forwarding component, so that the multiple job clusters are transparent to the user. One purpose of this architecture is to solve the network planning problem of the service cluster and the job clusters: in a real deployment, the job clusters may be located in several subnets of an enterprise, and each job cluster can be isolated from the others at the network level as well as in data and business.

Based on the design of the embodiment of the invention, the artificial intelligence online service system can add a new job cluster dynamically at run time; once the new job cluster has registered with the service cluster, the service cluster can submit subsequent artificial intelligence jobs to it. Resources can therefore be expanded dynamically: when the workload grows and the existing computing resources are insufficient, a new job cluster can be added to extend them. The new cluster may also be built on rented public cloud resources, which then work together with the original cluster resources as a hybrid cloud.
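The dynamic registration described above can be sketched as a small in-memory registry. This is a minimal illustration only: the class and field names (`JobCluster`, `ClusterRegistry`, `free_gpus`) and the "most free GPUs" selection policy are assumptions made for the example and are not specified in the original text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JobCluster:
    name: str
    endpoint: str   # address of the cluster's proxy forwarding component
    free_gpus: int  # assumed capacity metric, for illustration only


@dataclass
class ClusterRegistry:
    clusters: Dict[str, JobCluster] = field(default_factory=dict)

    def register(self, cluster: JobCluster) -> None:
        """Add a job cluster at run time, or update an existing entry."""
        self.clusters[cluster.name] = cluster

    def pick(self, gpus_needed: int) -> Optional[JobCluster]:
        """Pick a registered cluster with enough capacity, or None."""
        candidates = [c for c in self.clusters.values()
                      if c.free_gpus >= gpus_needed]
        if not candidates:
            return None
        # Assumed policy: prefer the cluster with the most free capacity.
        return max(candidates, key=lambda c: c.free_gpus)
```

For example, if the only registered cluster has too few free GPUs, `pick` returns `None`; after a rented public cloud cluster registers, later jobs can be placed there.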

Communication between the service cluster and the job clusters is restricted to the link between the central gateway of the service cluster and the proxy forwarding component of each job cluster, so there is a single communication path. In an actual implementation, each cluster may enforce its own network constraints, which are then verified and satisfied at the proxy forwarding component of the job cluster and the central gateway of the service cluster, ensuring secure communication among the clusters.

Correspondingly, the embodiment of the invention also provides a service cluster architecture schematic diagram of the multi-cluster type artificial intelligence online service method, and fig. 6 is the service cluster architecture schematic diagram of the multi-cluster type artificial intelligence online service method.

Referring to fig. 6, the functional components required by the artificial intelligence online service system are deployed in the service cluster and provide the business capabilities required by the system; its main capabilities include:

providing a service portal to end users; allowing users to manage the life cycle of batch jobs and micro-service jobs; forwarding artificial intelligence jobs submitted by users to a suitable back-end job cluster for running; on receiving a user request concerning an existing batch job or micro-service, performing a business-level check, looking up the job cluster where the job is located, and forwarding the request to that job cluster; and, on receiving a log request for a job, performing a business-level check, looking up the job cluster where the job is located, and forwarding the request to that job cluster. In practice, the service cluster runs on top of Kubernetes, and the individual service components of the service cluster run on it as Docker containers; Kubernetes provides a series of service abstractions that keep the numerous service components in the service cluster running stably and with high availability.

Correspondingly, the embodiment of the invention also provides a central gateway architecture diagram of the multi-cluster type artificial intelligence online service method, and fig. 7 is the central gateway architecture diagram of the multi-cluster type artificial intelligence online service method.

Referring to fig. 7, in the actual implementation the central gateway component is developed using Spring Cloud Gateway and runs as a Pod on the service cluster, which is implemented with Kubernetes. On receiving a user request, the central gateway component first decides whether the request needs to be forwarded to a job cluster. If so, it first verifies the user's identity with the identity authentication service, using the job and user information in the job request; once the identity check passes, it obtains the cluster information of the job by querying the service component, proxies the request to the actual job cluster, and responds to the end user once the response of the corresponding job cluster is obtained. The whole process is transparent to the end user, who is unaware that separate job clusters exist.
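The decision flow of the central gateway can be sketched as follows. The embodiment itself is built on Spring Cloud Gateway; the Python below is only a language-neutral illustration of the control flow. The function name `handle_request`, the request fields `job_id` and `user`, the rule "a request carrying a job id must be forwarded", and the status codes are all assumptions introduced for this example.

```python
from typing import Callable, Dict, Optional

Request = Dict[str, str]
Response = Dict[str, object]


def handle_request(
    request: Request,
    authenticate: Callable[[Optional[str], str], bool],
    lookup_cluster: Callable[[str], Optional[str]],
    forward: Callable[[str, Request], Response],
) -> Response:
    """Decide, authenticate, look up the job cluster, then proxy."""
    job_id = request.get("job_id")
    # Assumed criterion: only requests that name a job are forwarded.
    if job_id is None:
        return {"status": 200, "body": "handled by service cluster"}
    # Identity check against the authentication service (injected here).
    if not authenticate(request.get("user"), job_id):
        return {"status": 401, "body": "authentication failed"}
    # Query which job cluster currently hosts the job.
    cluster = lookup_cluster(job_id)
    if cluster is None:
        return {"status": 404, "body": "unknown job"}
    # Proxy the request; the job cluster's response goes back to the user.
    return forward(cluster, request)
```

The authentication, lookup and forwarding steps are passed in as callables so that the sketch stays independent of any particular identity service or HTTP client.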

Correspondingly, the embodiment of the invention also provides a job cluster operation schematic diagram of the multi-cluster type artificial intelligence online service method, and fig. 8 is the job cluster operation schematic diagram of the multi-cluster type artificial intelligence online service method.

Referring to FIG. 8, the job cluster is implemented with the Kubernetes container orchestration service. In this embodiment, artificial intelligence job tasks run as Docker containers, and the base image of a job is a Docker image; Kubernetes uses the open-source container management software Docker as the platform software to create and manage containers. Docker containers run on multiple compute nodes, and one compute node can run several containers. The application corresponding to the user's artificial intelligence job runs inside a Docker container, and the user can access resources in the container through the interface exposed by the job container. The batch job scheduler of the job cluster may be any open-source scheduling component that can run on Kubernetes; it only needs to provide a high-performance task scheduling engine so that the many artificial intelligence batch jobs in the system are scheduled stably and efficiently. The job cluster exposes its service through the Kubernetes REST API, through which service components in the service cluster submit artificial intelligence batch jobs. After receiving a request, the job cluster hands the job to the scheduler for scheduling and running according to the scheduling information in the job metadata.

Correspondingly, the embodiment of the invention also provides a log component schematic diagram of the multi-cluster type artificial intelligence online service method, and fig. 9 is a log component schematic diagram of the multi-cluster type artificial intelligence online service method.

Referring to fig. 9, in the actual implementation the log storage file system required by the job cluster is a Network File System (NFS). NFS is used to hold the job log files because it offers a single storage space that is globally accessible within the distributed system, which makes it easy for the multiple nodes of a batch job to write logs. NFS is mounted, in the form of Kubernetes Persistent Volumes, into the Docker containers in which artificial intelligence jobs run, transparently providing a log storage file system for those jobs; the same NFS service is also read by the log component of the system to serve end users' log requests.
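As an illustration of how a shared log volume might be mounted into a job container, the sketch below builds a Kubernetes Pod manifest as a plain Python dictionary. The manifest fields used (`volumeMounts`, `subPath`, `persistentVolumeClaim.claimName`) are standard Kubernetes Pod fields, but the function name `build_job_pod`, the claim name `job-logs`, the mount path `/logs` and the label keys are assumptions for this example; the original text does not give a manifest.

```python
def build_job_pod(user: str, job: str, task: str, image: str,
                  claim_name: str = "job-logs") -> dict:
    """Build a Pod manifest that mounts the shared log volume.

    The claim is assumed to be bound to an NFS-backed Persistent Volume.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{job}-{task}",
            "labels": {"user": user, "job": job},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": task,
                "image": image,
                # Each job sees only its own folder of the shared volume.
                "volumeMounts": [{
                    "name": "logs",
                    "mountPath": "/logs",
                    "subPath": f"{user}/{job}",
                }],
            }],
            "volumes": [{
                "name": "logs",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }
```

Such a dictionary could be serialized to JSON or YAML and submitted through the Kubernetes REST API; the submission step is left out of the sketch.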

The log component is a REST service developed in Golang that runs as a Pod on the job cluster implemented with Kubernetes; the log component reads the NFS deployed in this embodiment, looks up the log file contents of a specific artificial intelligence job, and returns them to the user.

In practice, the workflow of the log component can be illustrated with a batch job such as TensorFlow distributed training. Assume that user1 runs a batch job named job1 that comprises 1 PS task process and 2 worker task processes. After the batch job is scheduled and starts running, a log folder such as '/user1/job1/' is created in the log file system NFS, and three job logs, 'ps1.log', 'worker1.log' and 'worker2.log', are written under that folder. Suppose user1 wants to view a log of job1: when the log component receives a request from 'user1' to view the log of the 'ps1' container of 'job1', it obtains the three parameters 'user1', 'job1' and 'ps1' from the request parameters, looks up '/user1/job1/ps1.log' in the log storage file system NFS, and returns the content of that log file to the user.
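The lookup in this example can be sketched in a few lines. The embodiment's log component is written in Golang; the Python below only illustrates the path mapping from the three request parameters. The function names `resolve_log_path` and `read_log`, the lowercase `.log` extension and the rejection of path separators are assumptions added for the example.

```python
from pathlib import Path


def resolve_log_path(root: str, user: str, job: str, task: str) -> Path:
    """Map (user, job, task) to <root>/<user>/<job>/<task>.log."""
    for part in (user, job, task):
        # Assumed safeguard: refuse values that could escape the log root.
        if not part or "/" in part or "\\" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return Path(root) / user / job / f"{task}.log"


def read_log(root: str, user: str, job: str, task: str) -> str:
    """Return the content of the requested job log file."""
    return resolve_log_path(root, user, job, task).read_text(encoding="utf-8")
```

With the NFS share mounted at a hypothetical root such as `/nfs`, the request parameters 'user1', 'job1' and 'ps1' resolve to `/nfs/user1/job1/ps1.log`.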

Correspondingly, the embodiment of the application also provides a schematic diagram of a proxy forwarding component of the multi-cluster type artificial intelligence online service method, and fig. 10 is the schematic diagram of the proxy forwarding component of the multi-cluster type artificial intelligence online service method.

Referring to fig. 10, the proxy forwarding component of the job cluster is developed using Spring Cloud Gateway, runs as a Pod on the job cluster implemented with Kubernetes, and proxies, over the HTTPS protocol, all job requests sent to the job cluster. The proxy forwarding component of the embodiment of the application has some forwarding rules preset; it also accepts dynamic rules created by the service cluster through the API, or creates rules in real time according to user requests. Each such dynamic rule corresponds to a batch job or micro-service running in the job cluster. When the proxy forwarding component receives a specific job request, it first verifies the user's identity with the identity authentication service, using the job and user information in the job request; after the identity check passes, it finds the matching job container or log component according to the rules owned by the user and forwards the request.

The upper diagram in fig. 10 shows the workflow, in the job cluster proxy forwarding component of an embodiment of the present application, of a micro-service job request sent from the service cluster to the job cluster. The workflow is: (1) the job cluster proxy forwarding component receives a micro-service job request sent by the service cluster; (2) it parses the user identity information from the received HTTP request; (3) it sends the parsed identity information to the identity authentication service deployed in the embodiment of the application and obtains the authentication result; if authentication fails, the process ends and failure information is returned to the service cluster; if authentication succeeds, the flow continues; (4) according to the routing information in the request, it searches the existing routing rules, finds the route matching the micro-service job, and forwards the request to the micro-service job container; (5) the micro-service job container returns a result to the job cluster proxy forwarding component; (6) the response returned by the micro-service job is returned to the service cluster.

The lower diagram in fig. 10 shows the workflow, in the job cluster proxy forwarding component of an embodiment of the present application, of an artificial intelligence job log request sent from the service cluster to the job cluster. The workflow is: (1) the job cluster proxy forwarding component receives a job log request sent by the service cluster; (2) it parses the user identity information from the received HTTP request; (3) it sends the parsed identity information to the identity authentication service deployed in the embodiment of the application and obtains the authentication result; if authentication fails, the process ends and failure information is returned to the service cluster; if authentication succeeds, the flow continues; (4) according to the routing information in the request, it matches the preset log component route; (5) it forwards the request to the log component obtained in step 4; (6) the log component container returns a result to the job cluster proxy forwarding component; (7) the response returned by the log component is returned to the service cluster.
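Both workflows of fig. 10 share the same shape: authenticate, match a route, forward, return the response. The sketch below illustrates that shape under stated assumptions. The class name `ProxyForwarder`, the path-prefix form of the rules, the longest-prefix matching policy, and the status codes are hypothetical; the embodiment uses Spring Cloud Gateway and does not specify its rule format.

```python
from typing import Callable, Dict, Optional

Request = Dict[str, str]
Response = Dict[str, object]
Backend = Callable[[Request], Response]


class ProxyForwarder:
    """Authenticate, match a routing rule, forward, return the response."""

    def __init__(self, authenticate: Callable[[str], bool],
                 preset_rules: Optional[Dict[str, Backend]] = None) -> None:
        self.authenticate = authenticate
        # Preset rules, e.g. the route to the log component.
        self.rules: Dict[str, Backend] = dict(preset_rules or {})

    def add_rule(self, path_prefix: str, backend: Backend) -> None:
        """Register a dynamic rule for one batch job or micro-service."""
        self.rules[path_prefix] = backend

    def handle(self, request: Request) -> Response:
        # Steps 2 and 3: check the identity carried by the request.
        if not self.authenticate(request["user"]):
            return {"status": 401, "body": "authentication failed"}
        # Step 4: find the matching rule (assumed: longest prefix wins).
        matches = [p for p in self.rules if request["path"].startswith(p)]
        if not matches:
            return {"status": 404, "body": "no matching route"}
        backend = self.rules[max(matches, key=len)]
        # Remaining steps: forward and hand the backend's response back.
        return backend(request)
```

In this sketch a log request and a micro-service request differ only in which rule they match: the former hits a preset rule, the latter a rule added dynamically when the job was created.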


It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process derived from the content disclosed herein, whether applied directly or indirectly in other related arts, is likewise covered by the scope of the invention.

Claims

1. The multi-cluster type artificial intelligence online service method is characterized by comprising the following steps of:

executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back the processing result;

the determining, according to the job operation mode, a target job cluster corresponding to the target job, and a target gateway component or a target container required for executing the target job, includes:

when the job operation mode is a micro-service job mode, determining that the target job cluster corresponding to the target job is a job cluster running micro-service jobs, and that the target gateway component required for executing the target job is a proxy forwarding component, or the target container is a micro-service job container;

the target job is executed in the target job cluster through the target gateway component or the target container, a processing result is obtained, and feedback is performed, including:

processing the target job in the job cluster running the micro-service job through the proxy forwarding component or the micro-service job container to obtain a processing result, and feeding back the processing result to a terminal user;

wherein the target job cluster is implemented in a Kubernetes container orchestration service.

2. The multi-cluster artificial intelligence online service method of claim 1, wherein the obtaining the target job request and the target job submitted by the user, analyzing the target job, determining the job operation mode of the target job, comprises:

3. The method for online service of multi-cluster artificial intelligence according to claim 2, wherein the steps of obtaining a target job request and a target job submitted by a user, analyzing the target job, and obtaining state query information and operation log query information corresponding to the target job, comprise:

acquiring a target job request submitted by a user and a target job;

4. The multi-cluster artificial intelligence online service method according to claim 3, wherein after the step of obtaining the target job request and the target job submitted by the user, the multi-cluster artificial intelligence online service method further comprises:

5. The multi-cluster artificial intelligence online service method of any of claims 1-4, wherein the multi-cluster artificial intelligence online service method further comprises, before the target job is executed in the target job cluster by the target gateway component or target container to obtain a processing result and perform feedback:

6. A multi-cluster artificial intelligence online service system, the multi-cluster artificial intelligence online service system comprising:

the execution module is used for executing the target job in the target job cluster through the target gateway component or the target container, obtaining a processing result and feeding back the processing result;

the component determination module includes:

the micro-service component module is used for determining that a target job cluster corresponding to the target job is a job cluster for running the micro-service job when the job running mode is a micro-service job mode, and a target gateway component required for executing the target job is a proxy forwarding component or a target container is a micro-service job container;

the execution module comprises:

The micro-service execution module is used for processing the target job in the job cluster running the micro-service job through the proxy forwarding component or the micro-service job container to obtain a processing result, and feeding back the processing result to a terminal user;

the component determination module is further configured to run the target job cluster on top of Kubernetes, and the target gateway component or the target container runs on top of Kubernetes in the form of a Docker container.