Machine learning training system and method based on edge cloud computing
Technical Field
The invention relates to a machine learning training system and method based on edge cloud computing.
Background
There are many machine learning platforms at present, and setting up and maintaining them is tedious work. How to set up a machine learning training platform quickly and start training tasks promptly has become an urgent problem. A cloud computing platform can be used to solve it.
Cloud computing generally refers to an infrastructure platform based on large-scale data centers. The cloud end is formed by establishing centralized large-scale data centers and computing centers, and it provides the resources and services required by users; such a large-scale data center is therefore also called a core cloud. Edge cloud computing is an open cloud platform that integrates core network, computing, storage and application capabilities on the side close to the object or data source, and provides services nearby at the nearest end.
An edge cloud is composed of server nodes distributed in the same region. It is dedicated to processing the service requests of local users and provides cloud computing services to them rapidly and flexibly. Edge clouds are connected to one another through a backbone network, and a user logs in over the network to the geographically nearest edge cloud and uses the services it provides. On one hand, the edge cloud processes the data flow between the core cloud and the user terminal; by exploiting the correlation between communication data, it reduces network overhead and time delay and guarantees cloud computing service quality. On the other hand, the edge cloud stores the data commonly accessed by terminals and the common data required by the cloud computing service.
For machine learning, a training task needs high-density computing resources. The computing resources of the core cloud are located far from the data generated by users and from the place where the model is applied, so user requirements cannot be responded to quickly. The edge cloud brings the computing resources needed for training closer to the user terminal, so that a trained model can be delivered to the terminal more conveniently and the terminal can feed back the use condition of the model in time.
In addition, the models produced by most current machine learning platforms cannot be rapidly released and applied, user feedback cannot be obtained quickly, and the models cannot be corrected in time. As a result, the user cannot quickly experience the intelligent effect brought by machine learning.
In summary, the training methods for most machine learning platforms at present have the following disadvantages:
(1) the number of machine learning platforms is large, the deployment and implementation processes are complex, the maintenance cost is high, and unified management cannot be realized;
(2) the computing resources of the training platform are far away from the user data and cannot be close to the user;
(3) the training model cannot be quickly released to the end user;
(4) the application effect of the training model cannot be timely acquired and fed back to the training platform for training again.
In view of these defects, the invention provides a machine learning training method based on edge cloud computing that addresses four problems: the complex deployment and lack of unified management of machine learning platforms, the distance between computing resources and users, the slow release of trained models, and the collection of user feedback for retraining.
Disclosure of Invention
In order to solve the problems, the invention provides a machine learning training system and method based on edge cloud computing.
Firstly, the invention provides a machine learning training system based on edge cloud computing, in which the user terminal continuously acquires new data to expand and optimize the training samples, the machine learning platform trains and outputs a model, the model output by training is rapidly released to the user terminal, and the user terminal collects the application condition of the model and feeds it back to the machine learning platform to complete the retraining of the model. This forms a virtuous circle that continuously improves the training capability of the machine learning platform and the accuracy of the model.
Meanwhile, the invention provides a machine learning training method based on edge cloud computing, which uses an edge cloud computing platform supporting virtualization and container technology to provide computing resources required by training, so that the computing resources are closer to users, and meanwhile, quick deployment and unified management of the machine learning platform are realized by means of resource pooling and resource management capacity of the cloud computing platform.
In order to achieve the purpose, the invention adopts the following technical scheme:
a machine learning training system based on edge cloud computing comprises an edge cloud computing system and a terminal, wherein:
the edge cloud computing system comprises a machine learning platform and a cloud platform, wherein the cloud platform simultaneously manages a virtualization platform and a container platform, the virtualization platform realizes pooling of resources through a virtualization technology and elastic allocation of the resources, so that the utilization rate of infrastructure is improved, and the container platform realizes decoupling of the machine learning platform and hardware resources by using the container technology;
the machine learning platform is configured to execute the training tasks, and specifically comprises a model production device and a model feedback device, wherein the model production device is configured to manage the training scheduling, verification, archiving, publishing, subscription and updating of models; the model feedback device is configured to realize the collection, analysis and feedback of data in the production environment;
and the terminal receives the trained model, collects the results of its use and feeds them back to the machine learning platform.
Further, the model production device comprises a training scheduling module, a model verification module, a model archiving module, a model publishing module, a user subscription module, an update decision module and a sample management module, wherein:
the training scheduling module realizes the arrangement and scheduling of the training tasks and ensures that the training tasks run effectively;
the model verification module realizes the local verification function before model archiving and prevents the output of invalid models;
the model archiving module realizes the functions of querying, retrieving, storing, backing up, deleting, destroying and classifying the model files;
the model publishing module rapidly publishes the model output by training and distributes the model to each user terminal;
the user subscription module realizes the function of subscribing to the model, allows the user terminal to subscribe to the model, and distributes the model according to the user requirement, thereby avoiding unnecessary transmission;
after the update decision module receives feedback from a user, it decides whether the model needs to be updated, and informs the training scheduling module to create a new training task if the model needs to be updated;
and the sample management module is used for realizing the management of the samples, including sampling, storage, expansion, optimization and updating.
Further, the model feedback device comprises a collection module, an analysis module and a feedback module, wherein:
the collection module is used for realizing the collection of actual effect data of the model application and finishing the data acquisition of the terminal;
the analysis module is responsible for cleaning, filtering, analyzing and summarizing data, generating a sample and completing data preparation work;
the feedback module is responsible for adding the initial samples output by the analysis module to the formal sample set after verification, and for feeding them back to the update decision module of the model production device, which decides whether the model needs to be updated.
Furthermore, the edge cloud computing system is located at the near end of the terminal and responds quickly to data changes.
The training method based on the system comprises the following steps:
after the training task starts, retrieving the corresponding samples from the sample set;
applying for container resources, and starting training after initialization configuration;
after the training is finished, outputting a model;
verifying the model on the test sample set, if the model is an effective model, archiving, and otherwise finishing training;
after archiving, starting a publishing process, distributing the model to a corresponding terminal, and receiving feedback accuracy and performance data;
filtering and analyzing the feedback data, and screening effective data from the feedback data and expanding the effective data to a sample set;
deciding whether to start a model updating task by analyzing the accuracy data and the performance data; if the updating is needed, a new training task is started, and if the updating is not needed, the feedback is terminated.
The training task scheduling method based on the system comprises the following steps:
summarizing all currently received training tasks, including: timing tasks, temporary tasks, subscription tasks and updating tasks;
performing priority evaluation on the tasks, and sequencing;
calculating the time cost of all tasks and evaluating the resource requirements of all tasks;
evaluating available resources at present, collecting container resources in the cloud platform, and calculating the amount of the available resources;
performing task arrangement according to the priority, the time cost, the resource requirement and the available resources;
after the tasks are arranged, the container resources are applied according to the sequence, and the training tasks are scheduled.
Further, the priority from high to low is: subscription tasks, temporary tasks, updating tasks and timing tasks.
Further, the time cost evaluation parameters include, but are not limited to: number of model parameters, sample set size, task type, and historical time cost.
Further, the resource demand evaluation parameters include, but are not limited to: model size, sample set size, and historical demand.
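As an illustration only, the time cost evaluation could be sketched as follows. The specification names the input parameters but gives no formula, so the throughput constant, the discount for updating tasks and the equal blend with historical cost are all hypothetical assumptions; a resource demand estimate could follow the same pattern.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List

# Hypothetical constant: parameter-sample products processed per hour.
ASSUMED_WORK_PER_HOUR = 1e12
# Hypothetical assumption: an updating task retrains incrementally and costs less.
TYPE_FACTOR = {"update": 0.5}


@dataclass
class TaskProfile:
    param_count: int                 # number of model parameters
    sample_count: int                # sample set size
    task_type: str                   # "timing", "temporary", "subscription" or "update"
    past_hours: List[float] = field(default_factory=list)  # historical time cost


def estimate_hours(profile: TaskProfile) -> float:
    """Blend an analytic estimate with the historical average, if any."""
    factor = TYPE_FACTOR.get(profile.task_type, 1.0)
    analytic = profile.param_count * profile.sample_count / ASSUMED_WORK_PER_HOUR * factor
    if not profile.past_hours:
        return analytic
    # Equal weighting of analytic and historical cost is an arbitrary choice.
    return 0.5 * analytic + 0.5 * mean(profile.past_hours)
```

A deployment would replace the constants with values measured on its own hardware.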
The training model issuing method based on the system comprises the following steps:
firstly, inquiring a user list subscribed to the model;
carrying out priority sequencing on the users, wherein sequencing parameters comprise user levels, model use frequency and online states;
and querying the authorization granted by each terminal to the machine learning platform, pushing the model directly if the permission allows direct updating, and only sending a model update notification to the terminal if the permission allows only notifications.
The feedback and update implementation method based on the system comprises the following steps:
firstly, collecting accuracy data, performance data and use frequency data of model application;
pushing the collected data, filtering the data for completeness and validity, and then performing classification, aggregation and statistical analysis;
according to the analysis result, a part of data is used for generating a user use report, and a part of available data is used for generating sample data;
and feeding back the user report directly, verifying the sample data against the requirements for formal samples, adding the sample data that passes verification to the sample set, and finally feeding back the data.
Compared with the prior art, the invention has the beneficial effects that:
1. The method can be applied to the edge cloud, which is located at the nearest end of the user data and has strong computing resources, so that a machine learning platform based on the edge cloud can train the model in the environment closest to the user. This solves the problems of long distance between user data and computing resources, high data transmission cost and delayed response. Meanwhile, centralized management of the machine learning platform is achieved by using the resource management function of the edge cloud computing system.
2. The training method provided by the invention realizes a model production device, which can uniformly schedule training tasks, complete the training, verification and archiving of the model, and quickly release the model so that a user can quickly experience it. The problems that the model is updated slowly and cannot be adapted to the user scene quickly are thereby fundamentally solved.
3. The training method realizes a feedback device, can uninterruptedly collect performance data of model application, and then feeds the performance data back to the machine learning platform to form a virtuous cycle of model continuous optimization.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a schematic view of an edge cloud of the present invention;
FIG. 2 is a machine learning platform deployment diagram of the present invention;
FIG. 3 is an overall flow chart of the training method of the present invention;
FIG. 4 is a schematic view of the model production device of the present invention;
FIG. 5 is a schematic view of the model feedback device of the present invention;
FIG. 6 is a training task scheduling flow diagram of the present invention;
FIG. 7 is a flow diagram of model release in accordance with the present invention;
FIG. 8 is a flow chart of the feedback and update of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings. They are relational terms used only for convenience in describing the structural relationships of the parts or elements of the present invention, do not designate any particular part or element, and are not to be construed as limiting the present invention.
In the present invention, terms such as "fixedly connected", "connected", and the like are to be understood in a broad sense, and mean either a fixed connection or an integrally connected or detachable connection; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be determined according to specific situations by persons skilled in the relevant scientific or technical field, and are not to be construed as limiting the present invention.
As described in the background, existing machine learning platforms have several problems: the deployment and implementation process is complicated, the maintenance cost is high, unified management is not possible, the platform cannot be located close to the user, and a trained model cannot be rapidly released to the end user.
As shown in fig. 1, the present invention employs an edge cloud computing system to manage the machine learning platform. The edge cloud computing system supports both virtualization and container technology, and can manage virtual machines and containers at the same time. It manages the various services of the machine learning platform in a unified way, which effectively reduces the routine work of deployment, operation and maintenance, upgrading and reconstruction. The non-core services of machine learning run on the virtualization platform, which pools resources through virtualization technology and allocates them elastically, thereby improving the utilization of the infrastructure. The core services (including but not limited to the model training service) run on the container platform, which uses container technology to decouple the machine learning platform from the hardware resources. This not only supports a variety of machine learning platforms, but also improves the utilization of GPU hardware resources and reduces losses, ensuring that GPU resources are effectively allocated to the core services. Meanwhile, the edge cloud computing system is located at the near end of the user terminal, so computing resources are closer to the data, the data transmission cost is effectively reduced, and data changes can be responded to quickly.
Specifically, the machine learning platform comprises a training platform, a model production device and a model feedback device. The training platform supports the various frameworks that are currently popular (including but not limited to TensorFlow and Caffe) and is used to execute the training tasks. The model production device is used for managing the training scheduling, verification, archiving, publishing, subscription and updating of models. The model feedback device is used for realizing data collection, analysis and feedback in the production environment.
The model production device, as shown in fig. 4, uses the training scheduling module to arrange and schedule the training tasks, which ensures that the training tasks run effectively. The model verification module realizes the local verification function before model archiving and prevents the output of invalid models. The model archiving module realizes the functions of querying, retrieving, storing, backing up, deleting, destroying and classifying the model files. The model publishing module rapidly publishes the model output by training and distributes it to each user terminal. The user subscription module realizes the function of subscribing to the model, allows the user terminal to subscribe to the model, and distributes the model according to the user requirement, thereby avoiding unnecessary transmission. After the update decision module receives feedback from the user, it decides whether the model needs to be updated, and informs the training scheduling module to create a new training task if the model needs to be updated. The sample management module is responsible for the management of the samples, including sampling, storage, expansion, optimization and updating. With the model production device, models can be managed effectively, distributed in time, and the latest model can be put into use promptly.
The model feedback device, as shown in fig. 5, uses the collection module to collect actual effect data of the model application, completing the data collection at the user terminal. The analysis module is responsible for cleaning, filtering, analyzing and summarizing the data, generating samples and completing the data preparation work. The feedback module is responsible for adding the initial samples output by the analysis module to the formal sample set after verification, and for feeding them back to the update decision module of the model production device, which decides whether the model needs to be updated. The feedback device automatically completes the collection and reporting of user data, so that the machine learning platform obtains user feedback in time, responds quickly to changes in user data, and provides the user with a continuously improved and optimized model.
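The relationship between local verification and archiving can be illustrated with a minimal in-memory sketch. The class name `ModelArchive`, its methods and the integer versioning scheme are hypothetical and are not taken from the specification; a real archive would also cover backup, destruction and classification.

```python
from typing import Dict, Optional, Tuple


class ModelArchive:
    """In-memory stand-in for the model archiving module (hypothetical API)."""

    def __init__(self) -> None:
        # Keyed by (model name, version); the value is the serialized model file.
        self._models: Dict[Tuple[str, int], bytes] = {}

    def store(self, name: str, version: int, blob: bytes, verified: bool) -> None:
        # Local verification gate: an invalid model is never archived.
        if not verified:
            raise ValueError("model failed local verification; not archived")
        self._models[(name, version)] = blob

    def retrieve(self, name: str, version: int) -> Optional[bytes]:
        return self._models.get((name, version))

    def latest_version(self, name: str) -> Optional[int]:
        versions = [v for (n, v) in self._models if n == name]
        return max(versions) if versions else None

    def delete(self, name: str, version: int) -> bool:
        # Returns True only if something was actually removed.
        return self._models.pop((name, version), None) is not None
```

In this sketch the publishing module would call `latest_version` to find the model to distribute.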
The machine learning platform deployment method of the invention is as follows:
(1) As shown in fig. 1, the edge cloud is located between the core cloud and the user terminal, and provides computing resources for the user nearby;
(2) as shown in fig. 2, the edge cloud platform provides two resources, a virtual machine and a container, to the outside;
(3) the machine learning platform runs on the cloud platform: the non-core services, namely the model production device and the model feedback device, use virtual machine resources and run on the virtualization platform, while the training platform, which relies on physical GPU resources, uses container resources and runs on the container platform.
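The placement rule in step (3) can be written down as a small sketch. The function `place_service` and the use of a single `needs_gpu` flag as the deciding criterion are illustrative assumptions; the specification only states which services run where.

```python
from enum import Enum
from typing import Dict


class Platform(Enum):
    VIRTUAL_MACHINE = "virtualization platform"
    CONTAINER = "container platform"


def place_service(needs_gpu: bool) -> Platform:
    """Assumed rule: GPU-dependent core services go to containers, the rest to VMs."""
    return Platform.CONTAINER if needs_gpu else Platform.VIRTUAL_MACHINE


# Services named in the description, flagged by whether they rely on physical GPUs.
SERVICES: Dict[str, bool] = {
    "model production device": False,
    "model feedback device": False,
    "training platform": True,
}

placement = {name: place_service(gpu) for name, gpu in SERVICES.items()}
```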
The overall process of the training method of the invention is shown in fig. 3:
(1) after a training task starts, firstly retrieving the corresponding samples from the sample set;
(2) applying for container resources, and starting training after initialization configuration;
(3) after the training is finished, outputting a model;
(4) verifying the model on the test sample set, if the model is an effective model, archiving, and otherwise finishing training;
(5) after archiving, starting a publishing process and distributing the model to users;
(6) when the user terminal uses the model, the accuracy and the performance data are collected and fed back to the machine learning platform;
(7) after the machine learning platform receives the feedback, the data is filtered and analyzed, and effective data is screened out and added to the sample set;
(8) the update decision module of the model production device decides whether to start a model updating task by analyzing the accuracy data and the performance data; if updating is needed, a new training task is started, and if updating is not needed, the feedback is terminated.
In summary, the method is applicable to the training tasks of all models. It covers starting the training task, generating the model, verifying, archiving and releasing the model, collecting feedback from the user terminal, and finally deciding whether to start a model updating task according to the feedback result. The whole process forms a virtuous circle in which the model is continuously upgraded and improved.
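The closed loop of steps (1) to (8) can be sketched with a toy example. Everything concrete here is an assumption made for illustration: the one-dimensional threshold classifier stands in for real training, the sample shape `(feature, label)` and the 0.8 accuracy threshold are arbitrary, and container allocation, archiving and publishing are omitted.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, int]        # hypothetical sample shape: (feature, label in {0, 1})
Model = Callable[[float], int]


def train(samples: List[Sample]) -> Model:
    """Toy training: threshold halfway between the two class means.

    Assumes both classes are present in the sample set.
    """
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > threshold)


def accuracy(model: Model, samples: List[Sample]) -> float:
    return sum(model(x) == y for x, y in samples) / len(samples)


def run_training_task(sample_set: List[Sample], test_set: List[Sample],
                      min_accuracy: float = 0.8) -> Optional[Model]:
    model = train(sample_set)                       # steps (1) to (3)
    if accuracy(model, test_set) < min_accuracy:    # step (4): verify on the test set
        return None                                 # invalid model: training ends
    return model                                    # step (5): ready to archive and release


def handle_feedback(sample_set: List[Sample], feedback: List[Sample],
                    field_accuracy: float, min_accuracy: float = 0.8) -> bool:
    # Step (7): keep only well-formed records and add them to the sample set.
    sample_set.extend((x, y) for x, y in feedback if y in (0, 1))
    # Step (8): start an updating task only if accuracy in the field is too low.
    return field_accuracy < min_accuracy
```

When `handle_feedback` returns `True`, calling `run_training_task` again on the enlarged sample set closes the loop.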
The training task scheduling implementation steps involved in the present invention are as shown in fig. 6:
(1) the training scheduling module summarizes all currently received training tasks, including: timing tasks, temporary tasks, subscription tasks and updating tasks;
(2) performing priority evaluation on the tasks and sorting them, wherein the priority from high to low is: subscription tasks, temporary tasks, updating tasks and timing tasks;
(3) calculating the time cost of all tasks, and the time cost evaluation parameters include but are not limited to: model parameter number, sample set size, task type and historical time cost;
(4) evaluating resource requirements of all tasks, the resource requirement evaluation parameters including but not limited to: model size, sample set size, historical demand;
(5) evaluating available resources at present, collecting container resources in the cloud platform, and calculating the amount of the available resources;
(6) performing task arrangement according to the priority, time cost, resource requirements and available resources;
(7) after the tasks are arranged, the container resources are applied according to the sequence, and the training tasks are scheduled.
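The scheduling steps above can be sketched as a priority sort followed by a greedy fit against the available container count. The priority order is taken from step (2); breaking ties by shorter estimated time, counting resources in whole containers, and deferring tasks that do not fit are assumptions of this sketch, not requirements of the method.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Priority from high to low, as listed in step (2); lower number runs first.
PRIORITY = {"subscription": 0, "temporary": 1, "update": 2, "timing": 3}


@dataclass
class Task:
    name: str
    kind: str          # one of the keys of PRIORITY
    hours: float       # estimated time cost, step (3)
    containers: int    # estimated resource requirement, step (4)


def arrange(tasks: List[Task], available: int) -> Tuple[List[Task], List[Task]]:
    """Return (scheduled, deferred) given the number of free containers."""
    # Steps (2) and (6): order by priority, then by shorter time cost (assumed tie-break).
    ordered = sorted(tasks, key=lambda t: (PRIORITY[t.kind], t.hours))
    scheduled: List[Task] = []
    deferred: List[Task] = []
    for task in ordered:
        # Step (7): apply for containers in order while resources remain.
        if task.containers <= available:
            available -= task.containers
            scheduled.append(task)
        else:
            deferred.append(task)
    return scheduled, deferred
```

Deferred tasks would be reconsidered on the next scheduling pass, once step (5) reports freed containers.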
The model release implementation steps involved in the present invention are as shown in fig. 7:
(1) firstly, inquiring a user list subscribed to the model;
(2) sorting the users by priority, wherein the sorting parameters are: user level, model usage frequency and online status;
(3) inquiring authorization of a user terminal to a machine learning platform, directly pushing the model under the condition that permission allows direct updating, and only notifying a model updating message to a user if the permission only allows receiving notification;
(4) the model is downloaded by the user and the new model is applied.
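A minimal sketch of the release flow follows. The three sorting parameters come from step (2), but the way they are combined (online terminals first, then higher level, then higher usage frequency) is an assumption, as are the `Subscriber` fields and the `"push"` and `"notify"` action labels.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subscriber:
    terminal_id: str
    level: int                  # user level, higher is more important
    use_frequency: int          # how often the terminal uses the model
    online: bool
    allow_direct_update: bool   # authorization granted to the platform


def plan_release(subscribers: List[Subscriber]) -> List[Tuple[str, str]]:
    """Return (terminal_id, action) pairs in delivery order."""
    # Step (2): assumed ordering, online first, then level, then usage frequency.
    ordered = sorted(
        subscribers,
        key=lambda s: (s.online, s.level, s.use_frequency),
        reverse=True,
    )
    # Step (3): push directly when permitted, otherwise only notify.
    return [
        (s.terminal_id, "push" if s.allow_direct_update else "notify")
        for s in ordered
    ]
```

A notified terminal then downloads and applies the model itself, as in step (4).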
The feedback and update implementation steps involved in the present invention are as shown in fig. 8:
(1) firstly, collecting accuracy data, performance data and use frequency data of model application by a user terminal;
(2) pushing the collected data to the analysis module of the feedback device, filtering the data for completeness and validity, and then performing classification, aggregation and statistical analysis;
(3) according to the analysis result, a part of data is used for generating a user use report, and a part of available data is used for generating sample data;
(4) feeding back the user report directly, verifying the sample data against the requirements for formal samples, adding the sample data that passes verification to the sample set, and feeding back the data to the update decision module of the model production device.
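The feedback steps can be sketched as a filter, a summary and a promotion stage. The record fields (`accuracy`, `latency_ms`, `uses`, `sample`), the validity rule and the contents of the report are hypothetical; the specification does not define a data format or the requirements for formal samples, so the latter is passed in as a caller-supplied predicate.

```python
from statistics import mean
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
REQUIRED_FIELDS = ("accuracy", "latency_ms", "uses")   # hypothetical record schema


def is_valid(record: Record) -> bool:
    """Step (2): completeness and validity filter."""
    complete = all(key in record for key in REQUIRED_FIELDS)
    return complete and 0.0 <= record["accuracy"] <= 1.0


def analyze(records: List[Record]) -> Tuple[Dict[str, Any], List[Any]]:
    """Steps (2) and (3): build a usage report and extract candidate samples."""
    valid = [r for r in records if is_valid(r)]
    report = {
        "records": len(valid),
        "mean_accuracy": mean(r["accuracy"] for r in valid) if valid else None,
        "total_uses": sum(r["uses"] for r in valid),
    }
    candidates = [r["sample"] for r in valid if "sample" in r]
    return report, candidates


def promote(candidates: List[Any], sample_set: List[Any],
            meets_formal_requirements: Callable[[Any], bool]) -> List[Any]:
    """Step (4): add only the candidates that pass the formal sample check."""
    accepted = [s for s in candidates if meets_formal_requirements(s)]
    sample_set.extend(accepted)
    return accepted
```

The report and the accepted samples are what would be handed to the update decision module.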
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of protection of the present invention. It should be understood by those skilled in the art that various modifications and variations that can be made without inventive effort on the basis of the technical solution of the present invention still fall within its scope of protection.