WO2024041035A1 - Machine learning model management method and apparatus, management platform, and storage medium - Google Patents

Machine learning model management method and apparatus, management platform, and storage medium Download PDF

Info

Publication number
WO2024041035A1
WO2024041035A1 (PCT/CN2023/093188)
Authority
WO
WIPO (PCT)
Prior art keywords
model
target model
file
description information
configuration
Prior art date
Application number
PCT/CN2023/093188
Other languages
English (en)
French (fr)
Inventor
石鸿伟
陈超
史精文
徐倩
黄韬
刘韵洁
Original Assignee
网络通信与安全紫金山实验室 (Purple Mountain Laboratories)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网络通信与安全紫金山实验室 (Purple Mountain Laboratories)
Publication of WO2024041035A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management

Definitions

  • This application relates to the field of artificial intelligence technology, for example, to a management method, device, management platform and storage medium for a machine learning model.
  • TensorFlow is an open-source, end-to-end machine learning platform provided by Google. It has a comprehensive, healthy ecosystem and a rich resource library, which can help developers easily build applications powered by machine learning.
  • the Kubernetes platform can create multiple container instances for an application and, through the platform's built-in load-balancing strategy, provide load-balanced access to this group of applications, while the platform cluster ensures the availability of application functions when some machines fail.
  • Kubernetes supports persistent volume storage and persistent volume claims, which application containers can mount to achieve persistent storage of container data and synchronized sharing of data on each node in the cluster.
  • This application provides a machine learning model management method and apparatus, a management platform, and a storage medium, to realize automated bringing online, taking offline, and deletion of machine learning models, thereby reducing the burden on developers.
  • This application provides a management method for TensorFlow models, which is executed by the model management platform.
  • the method includes: receiving the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, storing the configuration description information of the target model in the database; creating a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and storing the decompressed folder corresponding to the package in the model file directory; receiving the enable instruction for the target model through the model list page, obtaining the configuration description information of the target model from the database, and generating the enable configuration items of the target model based on the configuration description information; and adding the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
  • This application also provides a management device for TensorFlow models, including:
  • the configuration description information storage module is configured to receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database;
  • the decompression folder storage module is configured to create a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and to store the decompression folder corresponding to the package in the model file directory;
  • the enable configuration item generation module is configured to receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information; the model list displayed on the model list page matches the models stored in the database;
  • the enable configuration item adding module is configured to add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
  • the model management platform includes a Kubernetes cluster, and multiple Kubernetes nodes in the Kubernetes cluster each deploy a TensorFlow service container; each TensorFlow service container has a preset refresh time; each TensorFlow service container and each model maintenance service container mounts a pre-created shared storage volume; and each TensorFlow service container is configured to execute the TensorFlow model management method described in any embodiment of this application.
  • This application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the TensorFlow model management method described in any embodiment of this application.
  • Figure 1a is a flow chart of a TensorFlow model management method provided according to Embodiment 1 of the present application;
  • Figure 1b is a schematic structural diagram of enabling the target model in the method provided in Embodiment 1 of the present application;
  • Figure 2a is a flow chart of another TensorFlow model management method provided according to Embodiment 2 of the present application;
  • Figure 2b is a schematic structural diagram of deleting the target model in the method provided in Embodiment 2 of the present application;
  • Figure 3 is a schematic structural diagram of a TensorFlow model management device provided according to Embodiment 3 of the present application.
  • Figure 4 is a schematic structural diagram of a management platform that implements the TensorFlow model management method according to the embodiment of the present application.
  • the inventor found that the related technology has the following defects: when adding a new machine learning model, the new model's serialized files need to be manually copied to a path recognized by the service, the model configuration file needs to be modified, and the service needs to be restarted before the newly added model can be started.
  • the operations of adding and modifying models are complex and error-prone. If the business scenario requires frequent model updates, this greatly increases the workload of the staff and makes it difficult to synchronize and share data in containers on different cluster servers.
  • Figure 1a is a flow chart of a TensorFlow model management method provided in Embodiment 1 of the present application. This embodiment can be applied to situations where machine learning models are updated or added. This method can be executed by a TensorFlow model management device.
  • the management device of the TensorFlow model can be implemented in the form of hardware and/or software.
  • as shown in Figure 1a, the method includes the operations described below.
  • the model import page may be a page that allows model import on the model management platform.
  • the target model can be a model to be added or updated on the model management platform.
  • the configuration description information may be the basic configuration information describing the target model.
  • the configuration description information may include the unique identifier of the model and the type of the model.
  • the database can be a MySQL database.
  • the file compression package associated with the target model, trained offline by an algorithm engineer, is imported from the model import page of the model management platform.
  • after the model management platform's business service completes the basic verification of the decompressed content of the imported file compression package, it parses the agreed model configuration information file in the model package, extracts the configuration description information, and stores the configuration description information in the MySQL database.
  • model management platform deployment environment preparation includes: preparing the Kubernetes cluster environment, installing the nfs-utils tool, creating a shared storage volume (Persistent Volume, PV) in NFS format with the access mode set to ReadWriteMany so that it can be mounted by multiple nodes, and creating the corresponding shared volume claim (Persistent Volume Claim, PVC).
  • PV Persistent Volume
  • PVC Persistent Volume Claim
  • considering load balancing and high availability, Kubernetes is selected as the deployment platform, and TensorFlow service containers are deployed on multiple nodes of the platform.
  • a PV is created with access mode ReadWriteMany, which can be mounted by multiple service containers on multiple nodes, and the corresponding PVC is created.
  • the TensorFlow service containers of all nodes are configured with a model-recognizable directory containing the configuration file models.config; for example, the directory path can be /models/. The newly created PVC (named PVC_NAME) is selected for mounting, so that the model configuration files and the configuration description information record file models.config in all TensorFlow service containers are persistently stored and shared with synchronized data.
  • the configuration description information record file may be an information record file composed of the basic configuration information describing the target model.
  • the model configuration file may be an information file describing the configuration of the target model.
  • the TensorFlow service container can realize the import, activation and deactivation of models.
  • the model maintenance service container can also be mounted on each Kubernetes node.
  • the model maintenance service container can organize and manage each model in the database.
  • after determining that the decompressed content of the file compression package passes verification, storing the configuration description information of the target model in the database includes: decompressing the file compression package to obtain the decompressed folder corresponding to the package; parsing the directory structure of the decompressed folder and reading the configuration description information record file in the decompressed folder; when it is determined that the directory structure meets the preset structural requirements and the configuration description information record file contains the attribute values of the model's mandatory attributes, determining that the decompressed content of the package passes verification; and storing the configuration description information of the target model recorded in the configuration description information record file in the database.
  • the preset structural requirement can be a preset directory structure on the model management platform.
  • the preset structural requirement can be set to include a model file directory with a corresponding subdirectory under it. If the current directory structure does not match the preset structural requirements, verification fails. If the current directory structure matches the preset structural requirements, it is then necessary to determine whether the configuration description information record file contains the attribute values of the model's required attributes: if it does, verification passes; if it does not, verification fails.
  • the attribute values of the model's required attributes include parameters such as the model type and unique identifier.
  • the model management platform needs to decompress the file compression package; through decompression, the decompressed folder can be obtained.
  • the model management platform can store the configuration description information of the target model recorded in the configuration description information record file into the database, that is, store parameters such as the model type and unique identification in the database.
  • the file storage identifier may be the identifier that the database sends to the model management platform when the target model is stored in the database.
  • the file storage identifier returned for each target model stored in the database is unique.
  • the model management platform can identify the relevant target model according to the file storage identifier.
  • the model file directory can be a list directory of files that can be retrieved.
  • the model file directory is located on the model management platform, and the corresponding decompressed folder can be stored in the relevant model file directory.
  • the directory-name folder can be a folder named after the directory name.
  • the subdirectory can be the next-level directory under the model file directory. It is understandable that a model file directory has one corresponding subdirectory, and a new directory-name folder can be created in the model file directory, with the file storage identifier used as the folder's directory name.
  • the algorithm engineer trains the target model on the TensorFlow platform from the sampled data, and packages the model configuration file containing the model configuration information together with the configuration description information record file according to the agreed configuration file format.
  • the trained model is saved as a pb file (that is, saved_model.pb) and a variables folder (variables).
  • the target model package contains, under the model file directory, the file config.json with the model configuration information.
  • operation and maintenance staff import the target model package provided by the algorithm engineer into the system from the World Wide Web (WEB) page of the model management platform; that is, the compressed file package of the target model is imported into the model management platform.
  • the model management platform decompresses the compressed package of the target model and verifies whether its directory structure meets the preset structural requirements and whether the required attributes in the configuration description information are non-empty; if so, the decompressed content of the file compression package has passed verification.
  • the model management platform extracts the basic information and configuration information of the target model into the database, and uses the database table's auto-increment primary key to generate the unique identifier KEY of the model record.
  • under the recognizable directory configured for the TensorFlow service, a new folder named KEY is created, with the type TYPE of the target model as a subdirectory that stores the target model files; for example, if the configured TensorFlow model file directory is models, the directory where the model files are stored is /models/KEY/TYPE.
  • the advantage of this arrangement is that, under the recognition directory of the TensorFlow service, a new model file directory is created with the file storage identifier as the directory-name folder and the target model's type as the subdirectory. In this way, the model file directories of target models are organized clearly on the model management platform and can be searched accurately and quickly.
  • the model list displayed on the model list page matches the models stored in the database.
  • the model list page may be a page that can display the target model on the model management platform.
  • the enable instruction may be an instruction that enables the use of the target model on the model management platform.
  • the enable configuration items can be the content that needs to be configured for the target model based on the configuration description information returned by the database.
  • receiving the enable instruction for the target model through the model list page, obtaining the configuration description information of the target model from the database, and generating the enable configuration items of the target model based on the configuration description information includes: receiving the enable instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the instruction; extracting the type of the target model from the configuration description information, and constructing a unique name and storage path corresponding to the target model according to the target model's file storage identifier and type; and filling the unique name and storage path into a preset enable configuration item generation template to generate the enable configuration items of the target model.
  • the enabled configuration item generation template may be a template used for configuring the enabled configuration item.
  • on the model management platform, you can view the basic information and detailed configuration description information of the target model, such as the number of training iterations and normalization parameters. Through these values, you can understand the model training scenario, which helps to analyze the model's strengths and weaknesses.
  • the model management platform can send a request to the TensorFlow service to bring the model online.
  • the model management platform can automatically write the configuration description information of the target model to the configuration description information record file models.config of the TensorFlow service.
  • the enable instruction for the target model is received through the model list page, and the configuration description information of the target model is obtained from the database according to the file storage identifier of the target model included in the instruction, that is, KEY; the type of the target model, that is, TYPE, is extracted from the configuration description information, and a unique name and storage path corresponding to the target model are constructed based on the target model's file storage identifier and type; the unique name and storage path are filled into the preset enable configuration item generation template to generate the enable configuration items of the target model.
  • illustratively, suppose the preset enable configuration item generation template is: config:{name:,base_path:,model_platform:"TensorFlow"}.
  • the unique name of the target model in models.config is agreed to be "model type_KEY", that is, TYPE_KEY, so the information format of the enable configuration item required by models.config is: config:{name:"TYPE_KEY",base_path:"/models/KEY/TYPE",model_platform:"TensorFlow"}.
  • the advantage of this arrangement is that, by extracting the type of the target model from the configuration description information and constructing a unique name and storage path corresponding to the target model based on the target model's file storage identifier and type, the enable configuration items of the target model are generated. In this way, each target model can generate its corresponding enable configuration items, so that target models can be brought online on the model management platform accurately; determining the unique name and storage path from the target model's type and file storage identifier also reduces the burden on staff.
  • a model configuration file may be an information file describing the configuration of a target model.
  • the model management platform may be a platform that can manage multiple target models, for example, a platform that can manage description management information, model configuration files and other information of the target model.
  • the TensorFlow service comes with Restful and Google Remote Procedure Call (GRPC) interface services, so the enabled target models can be accessed on the model management platform through the Restful Application Programming Interface (API) and the GRPC interface.
  • GRPC Google Remote Procedure Call
  • both methods require the name, input, and output of the model to be clarified first.
  • the name of the model can be viewed through the model list page. After determining the model name, you can obtain the input and output of the corresponding target model through TensorFlow's own metadata viewing interface URL, http://${url}:${port}/v1/models/${MODEL_NAME}/metadata; assemble the input parameters according to the input format, access the model interface through the Restful API or the GRPC communication protocol, obtain the data returned by running the target model, and parse the returned data according to the output format.
  • FIG. 1b is a schematic structural diagram of enabling the target model.
  • the enable instruction for the target model is received through the model list page (WEB page), and the target model's configuration description information is queried from the database interface through the enable-model interface.
  • the database returns the model configuration information to the model management platform.
  • the model management platform generates the enable configuration items of the target model based on the configuration description information, which may include obtaining the model name and storage path, and configures them in the configuration description information record file models.config.
  • in the model configuration file of the TensorFlow service, the enable configuration items of the target model are added and updated, so as to bring the target model online on the model management platform.
  • the method further includes: receiving a details-viewing instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the details-viewing instruction; and feeding the configuration description information of the target model back to the model list page for user display.
  • the detailed viewing instruction may be an instruction describing viewing detailed information of the current model. For example, by sending a detailed viewing instruction to the model management platform, where the detailed viewing instruction includes a file storage identifier, the file storage identifier can be searched in the corresponding database and fed back to the user.
  • users can view the detailed configuration description information of the target model, such as the number of training iterations and normalization parameters.
  • by receiving the relevant parameter values of the target model, users can more accurately understand the model training scenario, which helps analyze the model's strengths and weaknesses.
  • the technical solution of this embodiment receives the file compression package of the target model through the model import page and, after determining that the decompressed content passes verification, stores the configuration description information in the database; creates a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and stores the decompressed folder in the model file directory; receives the enable instruction through the model list page, obtains the configuration description information from the database, and generates the enable configuration items; and adds the enable configuration items, so as to bring the target model online on the model management platform.
  • the technical solution of this application solves the problems of complex modification operations, lack of visualization, and inability to synchronize and share data caused by adding or updating machine learning models; it realizes automated import and bringing online of machine learning models, reduces the burden on developers, and achieves synchronization and sharing of data on different cluster servers.
  • Figure 2a is a flow chart of another TensorFlow model management method provided according to Embodiment 2 of the present application. This embodiment is refined on the basis of the above embodiments, detailing the operations of deactivating and deleting the target model. As shown in Figure 2a, the method includes:
  • S210: Receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database.
  • S220: According to the file storage identifier returned by the database for the target model, create a model file directory under the recognition directory of the TensorFlow service, and store the decompressed folder corresponding to the package in the model file directory.
  • S230: Receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information.
  • S240: Add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
  • S250: Receive a deactivation instruction for the target model through the model list page, and obtain the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deactivation instruction.
  • the deactivation instruction may be an instruction to deactivate the model on the model management platform.
  • S260: Extract the type of the target model from the configuration description information, and construct a unique name corresponding to the target model according to the target model's file storage identifier and type.
  • after the deactivation instruction is received, the target model's file storage identifier KEY and configuration description information can be obtained; the target model's type TYPE can also be obtained, and the unique name TYPE_KEY can be constructed.
  • TYPE_KEY is the agreed name of the target model.
  • S270: In the model configuration file of the TensorFlow service, delete the enable configuration item matching the unique name, so as to take the target model offline on the model management platform.
  • after deleting the enable configuration item matching the unique name in the model configuration file of the TensorFlow service, the method further includes: receiving a deletion instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deletion instruction; extracting the type of the target model from the configuration description information, and determining, from the target model's file storage identifier and type, the model file directory of the target model under the recognition directory of the TensorFlow service; and deleting the model file directory and deleting the configuration description information of the target model from the database, so as to delete the target model on the model management platform.
  • the deletion instruction may be an instruction to delete the model on the model management platform.
  • a model can be deleted from the model management platform only when its status is disabled, which makes it possible to delete unnecessary models on the model management platform.
  • on the model management platform, the relevant model file directory of the target model must be deleted, and the related description and configuration information in the database must also be deleted, before the target model can be deleted.
  • FIG. 2b is a schematic structural diagram of deleting the target model.
  • the deletion instruction for the target model is received through the model list page (WEB page), and the target model's configuration description information is queried from the database interface through the delete-model interface.
  • the database returns the model configuration description information to the model management platform.
  • the model management platform determines the model file directory of the target model under the recognition directory of the TensorFlow service based on the configuration description information, deletes the model file directory, and deletes the configuration description information of the target model in the database. After confirming that the configuration description information of the target model has been deleted, the folder needs to be deleted according to the model storage path, so as to delete the target model on the model management platform.
  • after the deletion instruction is received, the target model's file storage identifier KEY and configuration description information can be obtained.
  • the model management platform deletes the KEY-level directory according to the stored parent directory path, that is, /models/KEY, and deletes the record of the target model in the database, thereby deleting the target model.
  • the technical solution of this embodiment receives the file compression package of the target model through the model import page and, after determining that the decompressed content passes verification, stores the configuration description information in the database; creates a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and stores the decompressed folder in the model file directory; receives the enable instruction through the model list page, obtains the configuration description information from the database, and generates the enable configuration items; adds the enable configuration items, so as to bring the target model online on the model management platform; receives the deactivation instruction for the target model through the model list page, and obtains the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deactivation instruction; extracts the type of the target model from the configuration description information and, based on the target model's file storage identifier and type, constructs a unique name corresponding to the target model; and deletes, in the model configuration file of the TensorFlow service, the enable configuration item matching the unique name, so as to take the target model offline on the model management platform.
  • FIG 3 is a schematic structural diagram of a TensorFlow model management device provided in Embodiment 3 of the present application.
  • the TensorFlow model management device provided in this embodiment can be implemented by software and/or hardware, and can be configured in a server or terminal device to implement a TensorFlow model management method in the embodiment of the present application.
  • the device includes: a configuration description information storage module 310, a decompression folder storage module 320, an enabled configuration item generating module 330, and an enabled configuration item adding module 340.
  • the configuration description information storage module 310 is configured to receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database; the decompression folder storage module 320 is configured to create a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and to store the decompression folder corresponding to the package in the model file directory; the enable configuration item generation module 330 is configured to receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information, where the model list displayed on the model list page matches the models stored in the database;
  • the enable configuration item adding module 340 is configured to add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
  • the technical solution of this embodiment receives the file compression package of the target model through the model import page and, after determining that the decompressed content passes verification, stores the configuration description information in the database; creates a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and stores the decompressed folder in the model file directory; receives the enable instruction through the model list page, obtains the configuration description information from the database, and generates the enable configuration items; and adds the enable configuration items, so as to bring the target model online on the model management platform.
  • the technical solution of this application solves the problems of complex modification operations, lack of visualization, and inability to synchronize and share data caused by adding or updating machine learning models; it realizes automated import and bringing online of machine learning models, reduces the burden on developers, and achieves data synchronization and sharing on different cluster servers.
  • the configuration description information storage module 310 is configured to store the configuration description information of the target model into the database in the following manner: decompressing the file compression package to obtain a decompression folder corresponding to the package; parsing the directory structure of the decompression folder and reading the configuration description information record file in the decompression folder; when it is determined that the directory structure meets the preset structural requirements and the configuration description information record file contains the attribute values of the model's mandatory attributes, determining that the decompressed content of the package passes verification; and storing the configuration description information of the target model recorded in the configuration description information record file into the database.
  • the decompression folder storage module 320 is configured to establish the model file directory under the recognition directory of the TensorFlow service in the following manner: under the recognition directory of the TensorFlow service, creating a new model file directory whose folder is named with the file storage identifier and whose subdirectory is the type of the target model.
  • the enable configuration item generation module 330 is configured to generate the enable configuration items of the target model in the following manner: receiving the enable instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the instruction; extracting the type of the target model from the configuration description information, and constructing a unique name and storage path corresponding to the target model according to the target model's file storage identifier and type; and filling the unique name and storage path into the preset enable configuration item generation template to generate the enable configuration items of the target model.
  • the device further includes a configuration description information feedback module, configured to: after the configuration description information of the target model is stored in the database, receive a details-viewing instruction for the target model through the model list page, and obtain the configuration description information of the target model from the database according to the file storage identifier of the target model included in the details-viewing instruction; and feed the configuration description information of the target model back to the model list page for user display.
  • the device further includes an enable configuration item deletion module, including: a configuration description information acquisition unit, configured to receive the deactivation instruction for the target model through the model list page, and to obtain the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deactivation instruction;
  • a unique name construction unit, configured to extract the type of the target model from the configuration description information and, using the file storage identifier and type, construct a unique name corresponding to the target model;
  • an enable configuration item deletion unit, configured to delete the enable configuration item matching the unique name in the model configuration file of the TensorFlow service, so as to take the target model offline on the model management platform.
  • the device further includes a configuration description information deletion unit, configured to: after the enable configuration item matching the unique name is deleted from the model configuration file of the TensorFlow service, receive the deletion instruction for the target model through the model list page, and obtain the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deletion instruction; extract the type of the target model from the configuration description information, and determine the model file directory of the target model under the recognition directory of the TensorFlow service according to the target model's file storage identifier and type; and delete the model file directory and delete the configuration description information of the target model in the database, so as to delete the target model on the model management platform.
  • the TensorFlow model management device provided by the embodiments of this application can execute the TensorFlow model management method provided by any embodiment of this application, and has functional modules and effects corresponding to the execution method.
  • FIG 4 is a schematic structural diagram of a model management platform provided in Embodiment 4 of the present application.
  • the model management platform 410 includes a Kubernetes cluster 420. Multiple Kubernetes nodes 430 in the Kubernetes cluster 420 respectively deploy TensorFlow service containers 440; each TensorFlow service container 440 has a preset refresh time; and each TensorFlow service container 440 separately mounts a pre-created shared storage volume.
  • Each TensorFlow service container 440 is configured to execute the TensorFlow model management method described in any one of this application.
  • the method includes: receiving the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, storing the configuration description information of the target model in the database; creating a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and storing the decompression folder corresponding to the package in the model file directory; receiving the enable instruction for the target model through the model list page, obtaining the configuration description information of the target model from the database, and generating the enable configuration items of the target model based on the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and, in the model configuration file of the TensorFlow service, adding the enable configuration items of the target model, so as to bring the target model online on the model management platform.
  • distributed storage generally uses a distributed file management system such as Network File System (NFS).
  • NFS Network File System
  • a model maintenance service container (not shown in the figure) can be deployed to implement organization and management of each model stored in the database.
  • the above-mentioned model maintenance service container can likewise mount the shared storage volume to further expand the functions of the model management platform.
  • the technical solution of this embodiment receives the file compression package of the target model through the model import page and, after determining that the decompressed content passes verification, stores the configuration description information in the database; creates a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and stores the decompressed folder in the model file directory; receives the enable instruction through the model list page, obtains the configuration description information from the database, and generates the enable configuration items; and adds the enable configuration items, so as to bring the target model online on the model management platform.
  • the technical solution of this application solves the problems of complex modification operations, lack of visualization, and inability to synchronize and share data caused by adding or updating machine learning models; it realizes automated bringing online, taking offline, and deletion of machine learning models, reduces the burden on developers, and achieves synchronization and sharing of data on different cluster servers.
  • Embodiment 5 of the present application also provides a computer-readable storage medium whose computer-readable instructions, when executed by a computer processor, are used to execute a TensorFlow model management method.
  • the method includes: receiving the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, storing the configuration description information of the target model in the database; creating a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and storing the decompression folder corresponding to the package in the model file directory; receiving the enable instruction for the target model through the model list page, obtaining the configuration description information of the target model from the database, and generating the enable configuration items of the target model based on the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and, in the model configuration file of the TensorFlow service, adding the enable configuration items of the target model, so as to bring the target model online on the model management platform.
  • of course, the computer-executable instructions of the computer-readable storage medium provided in the embodiments of the present application are not limited to the above-mentioned method operations, and can also execute related operations in the TensorFlow model management method provided by any embodiment of the present application.
  • the present application can be implemented with the help of software and necessary general hardware, and of course can also be implemented with hardware.
  • the technical solution of this application can be embodied in the form of a software product.
  • the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk, or optical disk, and includes at least one instruction to cause a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in the embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a machine learning model management method and apparatus, a management platform, and a storage medium. A file compression package of a target model is received through a model import page and, after the decompressed content is determined to pass verification, configuration description information is stored in a database; according to a file storage identifier returned by the database for the target model, a model file directory is created under the recognition directory of the TensorFlow service, and the decompressed folder is stored in that directory; an enable instruction is received through a model list page, the configuration description information is obtained from the database, and enable configuration items are generated; and the enable configuration items are added, so as to bring the target model online on the model management platform.

Description

Machine learning model management method and apparatus, management platform, and storage medium
This application claims priority to Chinese patent application No. 202211015154.9, filed with the Chinese Patent Office on August 23, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and relates, for example, to a machine learning model management method and apparatus, a management platform, and a storage medium.
Background
With the rapid development of artificial intelligence technology, modeling and analyzing business data with machine learning and deep learning models has become an important need across industries. TensorFlow is an open-source, end-to-end machine learning platform provided by Google; it has a comprehensive, healthy ecosystem and a rich resource library, helping developers easily build applications powered by machine learning. The Kubernetes platform can create multiple container instances for an application and, through its built-in load-balancing strategy, provide load-balanced access to this group of applications, while the platform cluster keeps application functions available when some machines fail. Kubernetes supports persistent volume storage and persistent volume claims, which application containers can mount to achieve persistent storage of container data and synchronized data sharing across the nodes of the cluster.
Summary
This application provides a machine learning model management method and apparatus, a management platform, and a storage medium, so as to realize automated bringing online, taking offline, and deletion of machine learning models, reducing the burden on developers.
This application provides a TensorFlow model management method, executed by a model management platform, the method including:
receiving a file compression package of a target model through a model import page and, after determining that the decompressed content of the package passes verification, storing configuration description information of the target model in a database;
creating a model file directory under the recognition directory of the TensorFlow service according to a file storage identifier returned by the database for the target model, and storing the decompressed folder corresponding to the package in the model file directory;
receiving an enable instruction for the target model through a model list page, obtaining the configuration description information of the target model from the database, and generating enable configuration items of the target model based on the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and
adding the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
This application further provides a TensorFlow model management apparatus, including:
a configuration description information storage module, configured to receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database;
a decompression folder storage module, configured to create a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and to store the decompressed folder corresponding to the package in the model file directory;
an enable configuration item generation module, configured to receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and
an enable configuration item adding module, configured to add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
This application further provides a model management platform. The model management platform includes a Kubernetes cluster, and multiple Kubernetes nodes in the Kubernetes cluster each deploy a TensorFlow service container; each TensorFlow service container has a preset refresh time; each TensorFlow service container and each model maintenance service container mounts a pre-created shared storage volume; and each TensorFlow service container is configured to execute the TensorFlow model management method described in any embodiment of this application.
This application further provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the TensorFlow model management method described in any embodiment of this application.
Brief Description of the Drawings
The drawings needed in the description of the embodiments are briefly introduced below.
Figure 1a is a flowchart of a TensorFlow model management method provided in Embodiment 1 of this application;
Figure 1b is a schematic structural diagram of enabling the target model in the method provided in Embodiment 1 of this application;
Figure 2a is a flowchart of another TensorFlow model management method provided in Embodiment 2 of this application;
Figure 2b is a schematic structural diagram of deleting the target model in the method provided in Embodiment 2 of this application;
Figure 3 is a schematic structural diagram of a TensorFlow model management apparatus provided in Embodiment 3 of this application;
Figure 4 is a schematic structural diagram of a management platform implementing the TensorFlow model management method of the embodiments of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the drawings in the embodiments of this application.
The terms "target", "current", and the like in the specification and claims of this application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such processes, methods, products, or devices.
In the course of implementing this application, the inventors found that the related technology has the following defects: when a new machine learning model is added, the new model's serialized files must be manually copied to a path recognizable by the service, the model configuration file must be modified, and the service must be restarted before the new model can be started; adding and modifying models is complex and error-prone. If the business scenario requires frequent model updates, this greatly increases staff workload, and it is difficult to synchronize and share the data of containers on different cluster servers.
Embodiment 1
Figure 1a is a flowchart of a TensorFlow model management method provided in Embodiment 1 of this application. This embodiment is applicable to situations where a machine learning model is updated or added. The method may be executed by a TensorFlow model management apparatus, which may be implemented in the form of hardware and/or software.
As shown in Figure 1a, the method includes:
S110: Receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database.
The model import page may be a page on the model management platform through which models can be imported. The target model may be a model to be added or updated on the model management platform. The configuration description information may be the basic configuration information describing the target model; illustratively, it may include the model's unique identifier and the model's type. The database may be a MySQL database.
It can be understood that the file compression package associated with a target model trained offline by an algorithm engineer is imported through the model import page of the model management platform; after the business service of the model management platform completes basic verification of the decompressed content of the imported package, it parses the agreed model configuration information file in the model package, extracts the configuration description information, and stores it in the MySQL database.
Optionally, deployment environment preparation for the model management platform includes: preparing the Kubernetes cluster environment, installing the nfs-utils tool, creating an NFS-format shared storage volume (Persistent Volume, PV) with the access mode set to ReadWriteMany so that multiple nodes can mount it, and creating the corresponding Persistent Volume Claim (PVC).
Considering load balancing and high availability of the model management platform, Kubernetes is chosen as the deployment platform, with TensorFlow service containers deployed on multiple nodes. Considering persistent storage of container data and data sharing between different containers, the PV and PVC storage resource objects supported by Kubernetes are chosen: a PV is created with access mode ReadWriteMany so that multiple service containers on multiple nodes can mount it, and the corresponding PVC is created.
With TensorFlow service containers deployed on multiple Kubernetes nodes, the refresh time is set to 30 s via "--model_config_file_poll_wait_seconds=30". The TensorFlow service containers on all nodes mount the newly created PVC (named PVC_NAME) at the configured model-recognizable directory, which contains the configuration file models.config (the directory path may be, for example, /models/), so that the model configuration file and the configuration description information record file models.config in all TensorFlow service containers are persistently stored and kept synchronized.
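As an illustrative aside (not part of the original disclosure), the ReadWriteMany PVC described above could be created with the official kubernetes Python client roughly as sketched below; the claim name PVC_NAME comes from the text, while the namespace, the PV name, and the storage size are assumptions.

```python
# Minimal sketch: create the ReadWriteMany PVC that every TensorFlow service
# container mounts at the model-recognizable directory (e.g. /models/).
# Assumptions: an NFS-format PV named "models-nfs-pv" already exists, and the
# namespace/size below are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core_v1 = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="PVC_NAME"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],  # mountable by containers on many nodes
        volume_name="models-nfs-pv",     # the pre-created NFS-format PV
        storage_class_name="",           # bind statically to that PV
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
core_v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```

Each TensorFlow service container then mounts this claim at /models/ and re-reads models.config at the interval set by --model_config_file_poll_wait_seconds.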
The configuration description information record file may be an information record file composed of the basic configuration information describing the target model, and the model configuration file may be an information file describing the configuration of the target model.
With the multi-node Kubernetes deployment, the management platform mounts PVC_NAME, thereby achieving synchronized sharing, across the TensorFlow service containers, of the two types of data: the model configuration file and the configuration description information record file models.config.
In addition, the TensorFlow service container supports importing, enabling, and disabling models; a model maintenance service container may also be mounted on each Kubernetes node, and the model maintenance service container can organize and manage the models in the database.
Optionally, after determining that the decompressed content of the package passes verification, storing the configuration description information of the target model in the database includes: decompressing the package to obtain the decompressed folder corresponding to the package; parsing the directory structure of the decompressed folder and reading the configuration description information record file in the decompressed folder; when it is determined that the directory structure meets the preset structural requirements and the configuration description information record file contains the attribute values of the model's mandatory attributes, determining that the decompressed content of the package passes verification; and storing the configuration description information of the target model recorded in the configuration description information record file in the database.
In this embodiment, the preset structural requirement may be a preset directory structure on the model management platform; illustratively, the preset structural requirement may be that the package contains a model file directory with a corresponding subdirectory under it. If the current directory structure does not match the preset structural requirement, verification fails. If it matches, it is then determined whether the configuration description information record file contains the attribute values of the model's mandatory attributes: if it does, verification passes; if it does not, verification fails.
The attribute values of the model's mandatory attributes include parameters such as the model's type and unique identifier.
It can be understood that the model management platform needs to decompress the package; through decompression, the decompressed folder is obtained. By parsing the directory structure of the decompressed files and obtaining the configuration description information record file, the number of directory levels making up the structure can be determined.
Whether the directory structure meets the preset structural requirements is then judged; if not, a verification-failure message is returned directly. If it does, the decompressed content of the package is determined to pass verification, and the model management platform can store the configuration description information of the target model recorded in the configuration description information record file in the database, that is, store parameters such as the model's type and unique identifier in the database.
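A minimal sketch of this import-time verification (an editor's illustration, not the patent's code), assuming the package layout described later in this embodiment — a model directory holding saved_model.pb and a variables folder, plus a config.json record file — and assuming illustrative JSON keys for the mandatory type and unique-identifier attributes:

```python
import json
import zipfile
from pathlib import Path

REQUIRED_ATTRS = ("type", "identifier")  # assumed keys for TYPE / unique ID

def verify_model_package(archive: str, workdir: str) -> dict:
    """Unpack the uploaded archive and apply both verification checks."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir)
    root = Path(workdir)

    # Check 1: directory structure - a model directory containing the frozen
    # saved_model.pb together with its variables/ subdirectory must exist.
    model_dirs = [p for p in root.rglob("*") if (p / "saved_model.pb").is_file()]
    if not model_dirs or not (model_dirs[0] / "variables").is_dir():
        raise ValueError("directory structure does not meet the preset requirements")

    # Check 2: the configuration description record file must carry non-empty
    # values for the model's mandatory attributes.
    cfg_file = next(root.rglob("config.json"), None)
    if cfg_file is None:
        raise ValueError("configuration description record file is missing")
    cfg = json.loads(cfg_file.read_text(encoding="utf-8"))
    if any(not cfg.get(attr) for attr in REQUIRED_ATTRS):
        raise ValueError("mandatory model attributes are empty")
    return cfg  # the caller stores these fields in MySQL
```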
S120: According to the file storage identifier returned by the database for the target model, create a model file directory under the recognition directory of the TensorFlow service, and store the decompressed folder corresponding to the package in the model file directory.
The file storage identifier may be the identifier that the database sends to the model management platform when the target model is stored in the database. The file storage identifier returned for each target model stored in the database is unique, and the model management platform can determine the relevant target model from the file storage identifier.
The model file directory may be a list directory in which files can be retrieved. The model file directory is located on the model management platform, and the corresponding decompressed folder can be stored under the relevant model file directory.
Optionally, creating a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model includes: under the recognition directory of the TensorFlow service, creating a new model file directory whose folder is named with the file storage identifier and whose subdirectory is the type of the target model.
The directory-name folder may be a folder named after the directory name. The subdirectory may be the next-level directory under the model file directory. It can be understood that there is one corresponding subdirectory under a model file directory, and a new directory-name folder can be created in the model file directory with the file storage identifier as the folder's directory name.
It can be understood that the algorithm engineer trains the target model on the TensorFlow platform from the sampled data and, following the agreed configuration file format, packages the model configuration file containing the model configuration information together with the configuration description information record file. The trained model is saved as a pb file (that is, saved_model.pb) and a variables folder (variables); the target model package contains, under the model file directory, the file config.json with the model configuration information.
Operation and maintenance staff import the target model package provided by the algorithm engineer into the system through the World Wide Web (WEB) page of the model management platform. When the target model's file compression package is imported into the model management platform, the platform decompresses it and verifies whether its directory structure meets the preset structural requirements and whether the mandatory attributes in the configuration description information are non-empty; if so, the decompressed content of the package passes verification.
The model management platform extracts the basic information and configuration information of the target model into the database, and uses the database table's auto-increment primary key to generate the unique identifier KEY of the model record. Under the recognizable directory configured for the TensorFlow service, a new folder named KEY is created, with the target model's type TYPE as a subdirectory storing the target model files. For example, if the configured TensorFlow model file directory is models, the directory storing the model files is /models/KEY/TYPE.
The benefit of this arrangement is that, under the recognition directory of the TensorFlow service, a new model file directory is created with the file storage identifier as the directory-name folder and the target model's type as the subdirectory. In this way, the model file directories of target models are organized clearly on the model management platform and can be located accurately and quickly.
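For illustration only, the S120 bookkeeping could look roughly like the following sketch; MODELS_ROOT and the function name are assumptions, while the /models/KEY/TYPE layout comes from the text above.

```python
import shutil
from pathlib import Path

MODELS_ROOT = Path("/models")  # the recognizable directory of the TF service

def store_model_files(key: int, model_type: str, unpacked_dir: str) -> Path:
    """Place the decompressed model files under /models/KEY/TYPE."""
    target = MODELS_ROOT / str(key) / model_type
    target.parent.mkdir(parents=True, exist_ok=True)  # create /models/KEY
    shutil.move(unpacked_dir, str(target))            # saved_model.pb + variables/
    return target
```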
S130: Receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information.
The model list displayed on the model list page matches the models stored in the database.
The model list page may be a page on the model management platform that displays target models. The enable instruction may be an instruction for enabling the target model on the model management platform. The enable configuration items may be the content that needs to be configured for the target model based on the configuration description information returned by the database.
Optionally, receiving the enable instruction for the target model through the model list page, obtaining the configuration description information of the target model from the database, and generating the enable configuration items of the target model based on the configuration description information includes: receiving the enable instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the instruction; extracting the type of the target model from the configuration description information, and constructing a unique name and storage path corresponding to the target model from the target model's file storage identifier and type; and filling the unique name and storage path into a preset enable configuration item generation template to generate the enable configuration items of the target model.
The enable configuration item generation template may be the template used to configure enable configuration items.
Illustratively, the basic information and detailed configuration description information of a target model — such as the number of training iterations and normalization parameters — can be viewed on the model management platform. These values reveal the model's training scenario and help in analyzing the model's strengths and weaknesses.
The model management platform can send a request for the TensorFlow service to bring the model online, and can automatically write the configuration description information of the target model into the configuration description information record file models.config of the TensorFlow service.
Next, the enable instruction for the target model is received through the model list page, and the configuration description information of the target model is obtained from the database according to the file storage identifier — that is, KEY — included in the instruction; the target model's type — that is, TYPE — is extracted from the configuration description information, and a unique name and storage path corresponding to the target model are constructed from the file storage identifier and type; the unique name and storage path are filled into the preset enable configuration item generation template to generate the target model's enable configuration items. Illustratively, suppose the preset enable configuration item generation template is: config:{name:,base_path:,model_platform:"TensorFlow"}. It can thus be agreed that the target model's unique name in models.config is "model type_KEY", that is, TYPE_KEY, so the information format of the enable configuration item required by models.config is: config:{name:"TYPE_KEY",base_path:"/models/KEY/TYPE",model_platform:"TensorFlow"}.
The benefit of this arrangement is that, by extracting the target model's type from the configuration description information and constructing a unique name and storage path corresponding to the target model from its file storage identifier and type, enable configuration items are generated for each target model, so that target models can be brought online on the model management platform accurately. Determining the unique name and storage path from the target model's type and file storage identifier reduces the burden on staff.
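A minimal sketch of generating and adding the enable configuration item (an editor's illustration under the entry format agreed above; file locking, de-duplication, and the real template engine are omitted):

```python
# The agreed entry format from the text; doubled braces are literal braces.
TEMPLATE = 'config:{{name:"{name}",base_path:"{path}",model_platform:"TensorFlow"}}'

def enable_model(key: int, model_type: str,
                 config_path: str = "/models/models.config") -> str:
    name = f"{model_type}_{key}"               # the agreed unique name TYPE_KEY
    base_path = f"/models/{key}/{model_type}"  # the storage path /models/KEY/TYPE
    entry = TEMPLATE.format(name=name, path=base_path)
    with open(config_path, "a", encoding="utf-8") as f:
        f.write(entry + "\n")                  # one enable configuration item per line
    return name  # the TensorFlow service loads it on its next refresh
```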
S140: Add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
The model configuration file may be an information file describing the configuration of the target model. The model management platform may be a platform capable of managing multiple target models, for example a platform that manages the target models' description management information, model configuration files, and other information.
Continuing the previous example, the enable configuration item of the target model — that is, config:{name:"TYPE_KEY",base_path:"/models/KEY/TYPE",model_platform:"TensorFlow"} — is added to the model configuration file of the TensorFlow service, thereby bringing the target model online on the model management platform.
In addition, enabled target models can be accessed on the model management platform. The TensorFlow service comes with Restful and Google Remote Procedure Call (GRPC) interface services, so enabled target models can be accessed on the model management platform through the Restful Application Programming Interface (API) and the GRPC interface.
Both approaches first require the model's name, inputs, and outputs to be identified. The model's name can be viewed on the model list page. Once the model name is determined, the corresponding target model's inputs and outputs can be obtained through TensorFlow's own metadata viewing interface URL, http://${url}:${port}/v1/models/${MODEL_NAME}/metadata. The input parameters are assembled in the input format, the model interface is accessed via the Restful API or the GRPC communication protocol, and after the data returned by running the target model is obtained, it is parsed according to the output format.
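As an illustrative sketch of this access path (the host and the example payload are assumptions; the metadata URL and the :predict endpoint are standard TensorFlow Serving REST routes):

```python
import requests

BASE = "http://tf-serving.example:8501"  # assumed ${url}:${port}

def query_model(model_name: str, instances: list) -> list:
    # Obtain the model's inputs and outputs from the metadata interface.
    meta = requests.get(f"{BASE}/v1/models/{model_name}/metadata").json()
    print(meta["metadata"]["signature_def"])  # inspect the input/output signature

    # Assemble the input parameters and access the model over the Restful API.
    resp = requests.post(f"{BASE}/v1/models/{model_name}:predict",
                         json={"instances": instances})
    resp.raise_for_status()
    return resp.json()["predictions"]  # parse per the model's output format
```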
As shown in Figure 1b, which is a schematic structural diagram of enabling the target model: the enable instruction for the target model is received through the model list page (WEB page), the target model's configuration description information is queried from the database interface via the enable-model interface, and the database returns the model configuration information to the model management platform. Based on the configuration description information, the model management platform generates the target model's enable configuration items — which may include obtaining the model name and storage path — and writes them into the configuration description information record file models.config. The enable configuration items of the target model are added and updated in the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
Optionally, after storing the configuration description information of the target model in the database, the method further includes: receiving a details-viewing instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the details-viewing instruction; and feeding the configuration description information of the target model back to the model list page for display to the user.
The details-viewing instruction may be an instruction for viewing the detailed information of the current model. Illustratively, a details-viewing instruction containing a file storage identifier may be sent to the model management platform; the corresponding database can then be searched by that file storage identifier and the result fed back to the user.
On the model management platform, users can view the detailed configuration description information of a target model, such as the number of training iterations and normalization parameters. By receiving the relevant parameter values of the target model, users can understand the model's training scenario more accurately, which helps in analyzing the model's strengths and weaknesses.
In the technical solution of this embodiment, the file compression package of the target model is received through the model import page and, after the decompressed content is determined to pass verification, the configuration description information is stored in the database; according to the file storage identifier returned by the database for the target model, a model file directory is created under the recognition directory of the TensorFlow service and the decompressed folder is stored in it; the enable instruction is received through the model list page, the configuration description information is obtained from the database, and the enable configuration items are generated; and the enable configuration items are added, so as to bring the target model online on the model management platform. The technical solution of this application solves the problems of complex modification operations, lack of visualization, and inability to synchronize and share data caused by adding or updating machine learning models; it realizes automated import and bringing online of machine learning models, reduces the burden on developers, and achieves synchronization and sharing of data on different cluster servers.
Embodiment 2
Figure 2a is a flowchart of another TensorFlow model management method provided in Embodiment 2 of this application. This embodiment is refined on the basis of the above embodiments, detailing the operations of deactivating and deleting the target model. As shown in Figure 2a, the method includes:
S210: Receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database.
S220: According to the file storage identifier returned by the database for the target model, create a model file directory under the recognition directory of the TensorFlow service, and store the decompressed folder corresponding to the package in the model file directory.
S230: Receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information.
S240: Add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
S250: Receive a deactivation instruction for the target model through the model list page, and obtain the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deactivation instruction.
The deactivation instruction may be an instruction for deactivating a model on the model management platform.
S260: Extract the type of the target model from the configuration description information, and construct a unique name corresponding to the target model from the target model's file storage identifier and type.
S270: In the model configuration file of the TensorFlow service, delete the enable configuration item matching the unique name, so as to take the target model offline on the model management platform.
Illustratively, on the model management platform, after the deactivation instruction for the target model is received, the target model's file storage identifier KEY and configuration description information can be obtained, the target model's type TYPE can be obtained, and the unique name TYPE_KEY can be constructed.
A request can be sent for the TensorFlow service to take the target model offline. According to the agreed unique target model name, TYPE_KEY, the model management platform deletes from the model configuration file models.config the enable configuration item block whose model name is "TYPE_KEY". After the TensorFlow service's automatic refresh is triggered, the target model named TYPE_KEY is taken offline on the model management platform.
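A minimal sketch of removing the matching enable configuration item (assuming, as in the earlier sketch, one entry per line in the agreed format; a production version would parse the configuration file properly rather than filter lines):

```python
def disable_model(key: int, model_type: str,
                  config_path: str = "/models/models.config") -> None:
    quoted_name = f'"{model_type}_{key}"'  # e.g. "TYPE_KEY", as quoted in the entry
    with open(config_path, encoding="utf-8") as f:
        lines = f.readlines()
    kept = [ln for ln in lines if f"name:{quoted_name}" not in ln]
    with open(config_path, "w", encoding="utf-8") as f:
        f.writelines(kept)  # the TF service unloads the model after its refresh
```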
Optionally, after deleting the enable configuration item matching the unique name in the model configuration file of the TensorFlow service, the method further includes: receiving a deletion instruction for the target model through the model list page, and obtaining the configuration description information of the target model from the database according to the file storage identifier of the target model included in the deletion instruction; extracting the type of the target model from the configuration description information, and determining, from the target model's file storage identifier and type, the model file directory of the target model under the recognition directory of the TensorFlow service; and deleting the model file directory and deleting the configuration description information of the target model from the database, so as to delete the target model on the model management platform.
The deletion instruction may be an instruction for deleting a model on the model management platform.
It can be understood that a model can be deleted from the model management platform only when its status is disabled; this makes it possible to remove unnecessary models from the platform. On the model management platform, the target model's model file directory must be deleted and the related description and configuration information in the database must also be deleted before the deletion of the target model is complete.
As shown in Figure 2b, which is a schematic structural diagram of deleting the target model: the deletion instruction for the target model is received through the model list page (WEB page), the target model's configuration description information is queried from the database interface via the delete-model interface, and the database returns the configuration description information to the model management platform. Based on the configuration description information, the model management platform determines the target model's model file directory under the recognition directory of the TensorFlow service, deletes the model file directory, and deletes the target model's configuration description information from the database. After confirming that the target model's configuration description information has been deleted, the folder must be deleted according to the model storage path, so as to delete the target model on the model management platform.
Illustratively, on the model management platform, after the deletion instruction for the target model is received, the target model's file storage identifier KEY and configuration description information can be obtained, the target model's type TYPE can be obtained, and the target model's model file directory under the recognition directory of the TensorFlow service — that is, the directory-name folder named KEY — can be determined.
A request can be sent for the TensorFlow service to delete the target model. The model management platform deletes the KEY-level directory according to the stored parent directory path, that is, /models/KEY, and deletes the target model's record in the database, thereby deleting the target model.
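For illustration, the deletion flow could be sketched as follows; the pymysql usage and the table and column names are assumptions, while the /models/KEY parent path comes from the text.

```python
import shutil
from pathlib import Path

import pymysql

def delete_model(key: int) -> None:
    # Delete the KEY-level directory under the stored parent path /models/KEY.
    shutil.rmtree(Path("/models") / str(key), ignore_errors=True)
    # Delete the target model's record from the MySQL database.
    conn = pymysql.connect(host="db-host", user="user",
                           password="secret", database="models_db")
    try:
        with conn.cursor() as cur:
            cur.execute("DELETE FROM model WHERE id = %s", (key,))
        conn.commit()
    finally:
        conn.close()
```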
In the technical solution of this embodiment, the file compression package of the target model is received through the model import page and, after the decompressed content is determined to pass verification, the configuration description information is stored in the database; according to the file storage identifier returned by the database for the target model, a model file directory is created under the recognition directory of the TensorFlow service and the decompressed folder is stored in it; the enable instruction is received through the model list page, the configuration description information is obtained from the database, and the enable configuration items are generated; the enable configuration items are added so as to bring the target model online on the model management platform; the deactivation instruction for the target model is received through the model list page, and the configuration description information of the target model is obtained from the database according to the file storage identifier of the target model included in the deactivation instruction; the type of the target model is extracted from the configuration description information, and a unique name corresponding to the target model is constructed from the target model's file storage identifier and type; and the enable configuration item matching the unique name is deleted from the model configuration file of the TensorFlow service, so as to take the target model offline on the model management platform. This realizes automated deactivation and deletion of machine learning models, simplifying model operation steps and reducing the burden on developers.
Embodiment 3
Figure 3 is a schematic structural diagram of a TensorFlow model management apparatus provided in Embodiment 3 of this application. The TensorFlow model management apparatus provided in this embodiment may be implemented in software and/or hardware, and may be configured in a server or terminal device to implement a TensorFlow model management method of the embodiments of this application. As shown in Figure 3, the apparatus includes: a configuration description information storage module 310, a decompression folder storage module 320, an enable configuration item generation module 330, and an enable configuration item adding module 340.
The configuration description information storage module 310 is configured to receive the file compression package of the target model through the model import page and, after determining that the decompressed content of the package passes verification, store the configuration description information of the target model in the database; the decompression folder storage module 320 is configured to create a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and to store the decompressed folder corresponding to the package in the model file directory; the enable configuration item generation module 330 is configured to receive the enable instruction for the target model through the model list page, obtain the configuration description information of the target model from the database, and generate the enable configuration items of the target model based on the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and the enable configuration item adding module 340 is configured to add the enable configuration items of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
In the technical solution of this embodiment, the file compression package of the target model is received through the model import page and, after the decompressed content is determined to pass verification, the configuration description information is stored in the database; according to the file storage identifier returned by the database for the target model, a model file directory is created under the recognition directory of the TensorFlow service and the decompressed folder is stored in it; the enable instruction is received through the model list page, the configuration description information is obtained from the database, and the enable configuration items are generated; and the enable configuration items are added, so as to bring the target model online on the model management platform. The technical solution of this application solves the problems of complex modification operations, lack of visualization, and inability to synchronize and share data caused by adding or updating machine learning models; it realizes automated import and bringing online of machine learning models, reduces the burden on developers, and achieves synchronization and sharing of data on different cluster servers.
Optionally, the configuration description information storage module 310 is configured to store the configuration description information of the target model in the database in the following manner: decompressing the file compression package to obtain a decompressed folder corresponding to the file compression package; parsing the directory structure of the decompressed folder, and reading the configuration description information record file in the decompressed folder; determining that the decompressed content of the file compression package passes verification in a case where it is determined that the directory structure satisfies the preset structure requirement and the configuration description information record file contains the attribute values of the required model attributes; and storing the configuration description information of the target model recorded in the configuration description information record file in the database.
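As a sketch of the verification performed by this module, assuming the record file is a JSON document at the root of the decompressed folder; the record-file name and the set of required attributes are illustrative assumptions:

```python
import json
import zipfile
from pathlib import Path

REQUIRED_ATTRS = {"name", "type", "version"}  # assumed required model attributes

def validate_package(archive_path: str, work_dir: str) -> dict:
    """Unzip the package, check the directory structure and required attribute
    values, and return the parsed configuration description on success."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(work_dir)
    root = Path(work_dir)
    record = root / "description.json"            # assumed record-file name
    if not record.is_file():
        raise ValueError("configuration description record file is missing")
    if not any(p.is_dir() for p in root.iterdir()):
        raise ValueError("directory structure does not meet the preset requirement")
    description = json.loads(record.read_text(encoding="utf-8"))
    missing = REQUIRED_ATTRS - description.keys()
    if missing:
        raise ValueError(f"missing required attribute values: {sorted(missing)}")
    return description
```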
Optionally, the decompressed folder storage module 320 is configured to create the model file directory under the recognition directory of the TensorFlow service in the following manner: under the recognition directory of the TensorFlow service, creating a model file directory with the file storage identifier as the directory-name folder and the type of the target model as a subdirectory.
Optionally, the enable configuration item generation module 330 is configured to generate the enable configuration item of the target model in the following manner: receiving an enable instruction for the target model through the model list page, and acquiring the configuration description information of the target model from the database according to the file storage identifier of the target model included in the enable instruction; extracting the type of the target model from the configuration description information, and constructing a unique name and a storage path corresponding to the target model according to the file storage identifier and the type of the target model; and filling the unique name and the storage path into the preset enable configuration item generation template to generate the enable configuration item of the target model.
Optionally, the apparatus further includes a configuration description information feedback module, configured to: after the configuration description information of the target model is stored in the database, receive a detail-view instruction for the target model through the model list page, and acquire the configuration description information of the target model from the database according to the file storage identifier of the target model included in the detail-view instruction; and feed back the configuration description information of the target model to the model list page for display to the user.
Optionally, the apparatus further includes an enable configuration item deletion module, which includes: a configuration description information acquisition unit, configured to receive a disable instruction for the target model through the model list page, and acquire the configuration description information of the target model from the database according to the file storage identifier of the target model included in the disable instruction; a unique name construction unit, configured to extract the type of the target model from the configuration description information, and construct a unique name corresponding to the target model according to the file storage identifier and the type of the target model; and an enable configuration item deletion unit, configured to delete the enable configuration item matching the unique name from the model configuration file of the TensorFlow service, so as to take the target model offline on the model management platform.
Optionally, the apparatus further includes a configuration description information deletion unit, configured to: after the enable configuration item matching the unique name is deleted from the model configuration file of the TensorFlow service, receive a delete instruction for the target model through the model list page, and acquire the configuration description information of the target model from the database according to the file storage identifier of the target model included in the delete instruction; extract the type of the target model from the configuration description information, and determine the model file directory of the target model under the recognition directory of the TensorFlow service according to the file storage identifier and the type of the target model; and delete the model file directory and delete the configuration description information of the target model from the database, so as to delete the target model from the model management platform.
The management apparatus for TensorFlow models provided by the embodiments of this application can execute the management method for TensorFlow models provided by any embodiment of this application, and has the functional modules and effects corresponding to the executed method.
Embodiment 4
Fig. 4 is a schematic structural diagram of a model management platform provided by Embodiment 4 of this application. The model management platform 410 includes a Kubernetes cluster 420; multiple Kubernetes nodes 430 in the Kubernetes cluster 420 each deploy a TensorFlow service container 440; each TensorFlow service container 440 has a preset refresh time; and each TensorFlow service container 440 mounts a pre-created shared storage volume.
Each TensorFlow service container 440 is configured to execute the management method for TensorFlow models according to any embodiment of this application, the method including: receiving a file compression package of the target model through the model import page, and storing the configuration description information of the target model in the database after determining that the decompressed content of the file compression package passes verification; creating a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and storing the decompressed folder corresponding to the file compression package under the model file directory; receiving an enable instruction for the target model through the model list page, acquiring the configuration description information of the target model from the database, and generating an enable configuration item of the target model according to the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and adding the enable configuration item of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
In addition, distributed storage generally uses a distributed file management system of the Network File System (NFS) type; a pre-created shared storage volume in NFS format can be mounted in each TensorFlow service container, which is not limited here.
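A hedged sketch of how such a deployment could be declared with the official `kubernetes` Python client; the image tag, NFS server address, replica count, and the use of TensorFlow Serving's config-file polling flag for the preset refresh time are illustrative assumptions:

```python
from kubernetes import client, config

def serving_deployment(replicas: int = 3) -> client.V1Deployment:
    """Build a Deployment whose pods run TensorFlow Serving and share one NFS volume."""
    nfs_volume = client.V1Volume(
        name="model-store",
        nfs=client.V1NFSVolumeSource(server="10.0.0.10", path="/models"),
    )
    serving = client.V1Container(
        name="tf-serving",
        image="tensorflow/serving:latest",
        args=[
            "--model_config_file=/models/models.config",
            "--model_config_file_poll_wait_seconds=60",  # the preset refresh time
        ],
        volume_mounts=[client.V1VolumeMount(name="model-store", mount_path="/models")],
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": "tf-serving"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "tf-serving"}),
            spec=client.V1PodSpec(containers=[serving], volumes=[nfs_volume]),
        ),
    )
    return client.V1Deployment(metadata=client.V1ObjectMeta(name="tf-serving"), spec=spec)

# Usage (assumes a reachable cluster and a configured kubeconfig):
# config.load_kube_config()
# client.AppsV1Api().create_namespaced_deployment("default", serving_deployment())
```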
Among the multiple Kubernetes nodes 430, a model maintenance service container (not shown in the figure) may also be deployed to organize and manage the models stored in the database. Correspondingly, the model maintenance service container may likewise mount the shared storage volume to further extend the functions of the model management platform.
In the technical solution of the embodiments of this application, a file compression package of the target model is received through the model import page, and after the decompressed content is determined to pass verification, the configuration description information is stored in the database; according to the file storage identifier returned by the database for the target model, a model file directory is created under the recognition directory of the TensorFlow service, and the decompressed folder is stored under the model file directory; an enable instruction is received through the model list page, the configuration description information is acquired from the database, and an enable configuration item is generated; and the enable configuration item is added, so as to bring the target model online on the model management platform. The technical solution of this application solves the problems of complicated modification operations, lack of visualization, and inability to synchronize and share data caused by adding or updating machine learning models, realizes automated online deployment, offline removal, and deletion of machine learning models, eases the work difficulty of developers, and realizes synchronization and sharing of data across different cluster servers.
Embodiment 5
Embodiment 5 of this application further provides a computer-readable storage medium containing computer-readable instructions which, when executed by a computer processor, are used to execute a management method for TensorFlow models, the method including: receiving a file compression package of the target model through the model import page, and storing the configuration description information of the target model in the database after determining that the decompressed content of the file compression package passes verification; creating a model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model, and storing the decompressed folder corresponding to the file compression package under the model file directory; receiving an enable instruction for the target model through the model list page, acquiring the configuration description information of the target model from the database, and generating an enable configuration item of the target model according to the configuration description information, where the model list displayed on the model list page matches the models stored in the database; and adding the enable configuration item of the target model to the model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
Certainly, for the computer-readable storage medium containing computer-executable instructions provided by the embodiments of this application, the computer-executable instructions are not limited to the method operations described above, and can also execute related operations in the management method for TensorFlow models provided by any embodiment of this application.
From the above description of the implementations, those skilled in the art can clearly understand that this application can be implemented by means of software plus the necessary general-purpose hardware, and certainly can also be implemented by hardware. Based on this understanding, the technical solution of this application can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disc of a computer, and includes at least one instruction for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
It is worth noting that, in the above embodiments of the management apparatus for TensorFlow models, the included units and modules are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of this application.

Claims (10)

  1. A management method for TensorFlow models, executed by a model management platform, the management method comprising:
    receiving a file compression package of a target model through a model import page, and storing configuration description information of the target model in a database after determining that decompressed content of the file compression package passes verification;
    creating a model file directory under a recognition directory of a TensorFlow service according to a file storage identifier returned by the database for the target model, and storing a decompressed folder corresponding to the file compression package under the model file directory;
    receiving an enable instruction for the target model through a model list page, acquiring the configuration description information of the target model from the database, and generating an enable configuration item of the target model according to the configuration description information, wherein a model list displayed on the model list page matches models stored in the database; and
    adding the enable configuration item of the target model to a model configuration file of the TensorFlow service, so as to bring the target model online on the model management platform.
  2. The method according to claim 1, wherein storing the configuration description information of the target model in the database after determining that the decompressed content of the file compression package passes verification comprises:
    decompressing the file compression package to obtain a decompressed folder corresponding to the file compression package;
    parsing a directory structure of the decompressed folder, and reading a configuration description information record file in the decompressed folder;
    determining that the decompressed content of the file compression package passes verification in a case where it is determined that the directory structure satisfies a preset structure requirement and the configuration description information record file contains attribute values of required model attributes; and
    storing the configuration description information of the target model recorded in the configuration description information record file in the database.
  3. The method according to claim 1, wherein creating the model file directory under the recognition directory of the TensorFlow service according to the file storage identifier returned by the database for the target model comprises:
    under the recognition directory of the TensorFlow service, creating a model file directory with the file storage identifier as a directory-name folder and a type of the target model as a subdirectory.
  4. The method according to claim 3, wherein receiving the enable instruction for the target model through the model list page, acquiring the configuration description information of the target model from the database, and generating the enable configuration item of the target model according to the configuration description information comprises:
    receiving the enable instruction for the target model through the model list page, and acquiring the configuration description information of the target model from the database according to the file storage identifier of the target model included in the enable instruction;
    extracting the type of the target model from the configuration description information, and constructing a unique name and a storage path corresponding to the target model according to the file storage identifier and the type of the target model; and
    filling the unique name and the storage path into a preset enable configuration item generation template to generate the enable configuration item of the target model.
  5. The method according to claim 1, after storing the configuration description information of the target model in the database, further comprising:
    receiving a detail-view instruction for the target model through the model list page, and acquiring the configuration description information of the target model from the database according to the file storage identifier of the target model included in the detail-view instruction; and
    feeding back the configuration description information of the target model to the model list page for display to a user.
  6. The method according to claim 1, after adding the enable configuration item of the target model to the model configuration file of the TensorFlow service, further comprising:
    receiving a disable instruction for the target model through the model list page, and acquiring the configuration description information of the target model from the database according to the file storage identifier of the target model included in the disable instruction;
    extracting the type of the target model from the configuration description information, and constructing a unique name corresponding to the target model according to the file storage identifier and the type of the target model; and
    deleting the enable configuration item matching the unique name from the model configuration file of the TensorFlow service, so as to take the target model offline on the model management platform.
  7. The method according to claim 6, after deleting the enable configuration item matching the unique name from the model configuration file of the TensorFlow service, further comprising:
    receiving a delete instruction for the target model through the model list page, and acquiring the configuration description information of the target model from the database according to the file storage identifier of the target model included in the delete instruction;
    extracting the type of the target model from the configuration description information, and determining the model file directory of the target model under the recognition directory of the TensorFlow service according to the file storage identifier and the type of the target model; and
    deleting the model file directory, and deleting the configuration description information of the target model from the database, so as to delete the target model from the model management platform.
  8. A management apparatus for TensorFlow models, comprising:
    a configuration description information storage module, configured to receive a file compression package of a target model through a model import page, and store configuration description information of the target model in a database after determining that decompressed content of the file compression package passes verification;
    a decompressed folder storage module, configured to create a model file directory under a recognition directory of a TensorFlow service according to a file storage identifier returned by the database for the target model, and store a decompressed folder corresponding to the file compression package under the model file directory;
    an enable configuration item generation module, configured to receive an enable instruction for the target model through a model list page, acquire the configuration description information of the target model from the database, and generate an enable configuration item of the target model according to the configuration description information, wherein a model list displayed on the model list page matches models stored in the database; and
    an enable configuration item addition module, configured to add the enable configuration item of the target model to a model configuration file of the TensorFlow service, so as to bring the target model online on a model management platform.
  9. A model management platform, comprising a Kubernetes cluster, wherein multiple Kubernetes nodes in the Kubernetes cluster each deploy a TensorFlow service container; each TensorFlow service container has a preset refresh time; and each TensorFlow service container mounts a pre-created shared storage volume;
    wherein each TensorFlow service container is configured to execute the management method for TensorFlow models according to any one of claims 1-7.
  10. A computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a processor, when executing the instructions, to implement the management method for TensorFlow models according to any one of claims 1-7.
PCT/CN2023/093188 2022-08-23 2023-05-10 Management method and apparatus for machine learning model, management platform, and storage medium WO2024041035A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211015154.9 2022-08-23
CN202211015154.9A CN115293365A (zh) 2022-08-23 Management method and apparatus for machine learning model, management platform, and storage medium

Publications (1)

Publication Number Publication Date
WO2024041035A1 (zh)

Family

ID=83831710

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093188 WO2024041035A1 (zh) 2022-08-23 2023-05-10 Management method and apparatus for machine learning model, management platform, and storage medium

Country Status (2)

Country Link
CN (1) CN115293365A (zh)
WO (1) WO2024041035A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293365A (zh) * 2022-08-23 2022-11-04 网络通信与安全紫金山实验室 机器学习模型的管理方法、装置、管理平台和存储介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491549A (zh) * 2018-04-09 2018-09-04 深圳市茁壮网络股份有限公司 Method and device for creating file directories in a distributed storage system
CN109254992A (zh) * 2018-10-12 2019-01-22 北京京东金融科技控股有限公司 Project generation method and system, computer system, and computer-readable storage medium
CN109508238A (zh) * 2019-01-05 2019-03-22 咪付(广西)网络技术有限公司 Resource management system and method for deep learning
CN110083334A (zh) * 2018-01-25 2019-08-02 北京顺智信科技有限公司 Method and device for bringing a model online
CN110569085A (zh) * 2019-08-15 2019-12-13 上海易点时空网络有限公司 Configuration file loading method and system
CN112015519A (zh) * 2020-08-28 2020-12-01 江苏银承网络科技股份有限公司 Online model deployment method and device
CN114385192A (zh) * 2022-01-18 2022-04-22 北京字节跳动网络技术有限公司 Application deployment method and device, computer equipment, and storage medium
CN114721674A (zh) * 2022-04-26 2022-07-08 上海浦东发展银行股份有限公司 Model deployment method, device, equipment, and storage medium
CN115293365A (zh) * 2022-08-23 2022-11-04 网络通信与安全紫金山实验室 Management method and apparatus for machine learning model, management platform, and storage medium

Also Published As

Publication number Publication date
CN115293365A (zh) 2022-11-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23856129

Country of ref document: EP

Kind code of ref document: A1