CN115220817A - Model distributed loading method, device, equipment and storage medium - Google Patents

Model distributed loading method, device, equipment and storage medium

Info

Publication number
CN115220817A
CN115220817A (application CN202210897604.5A)
Authority
CN
China
Prior art keywords
model
cluster
loading
node
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210897604.5A
Other languages
Chinese (zh)
Inventor
刘涛 (Liu Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210897604.5A
Publication of CN115220817A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/445 - Program loading or initiating
    • G06F9/44505 - Configuring for program initiating, e.g. using registry, configuration files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 - Remedial or corrective actions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G06F9/5072 - Grid computing

Abstract

The invention relates to artificial intelligence technology and discloses a model distributed loading method comprising the following steps: obtaining a model set and performing host cluster mapping processing on it to obtain original mapping information; receiving a model to be online and performing a cluster update on the model based on the original mapping information to obtain a standard model cluster; and receiving a model loading service, performing distributed loading of the models on all nodes of the standard model cluster based on a preset subscription and release system, and feeding the loading result back to the sending end of the model loading service. The invention further relates to blockchain technology: the original mapping information may be stored in a node of the blockchain. The invention also provides a model distributed loading apparatus, an electronic device and a readable storage medium. The invention can solve the problem of low model loading efficiency.

Description

Model distributed loading method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a distributed model loading method and device, electronic equipment and a readable storage medium.
Background
With the development of artificial intelligence, the number of intention recognition models in online training systems has grown sharply. The models are mainly loaded by a single server, but their total memory footprint already exceeds the memory limit of a single host: for example, a single server with 32 GB of memory supports about 500 intention models, and one with 64 GB supports about 1000.
In the prior art, when there are many models, the number of models that can be supported is limited by the memory of the deployment server, which makes model loading unstable and error-prone and severely restricts model loading efficiency.
Disclosure of Invention
The invention provides a distributed model loading method and device, electronic equipment and a readable storage medium, and mainly aims to solve the problem of low model loading efficiency.
In order to achieve the above object, the present invention provides a distributed model loading method, including:
obtaining a model set, and carrying out host cluster mapping processing on the model set to obtain original mapping information;
receiving a model to be on-line, and performing cluster updating on the model to be on-line based on the original mapping information to obtain a standard model cluster;
receiving a model loading service, carrying out distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feeding back a loading result to an issuing end of the model loading service.
Optionally, before performing the host cluster mapping process on the model set, the method further includes:
and constructing an original host cluster, and performing host node configuration and model distribution configuration on the original host cluster to obtain a standard host cluster.
Optionally, the performing host cluster mapping processing on the model set to obtain original mapping information includes:
initializing and cleaning a registration directory of the standard host cluster, and copying the model in the model set based on the model distribution configuration to obtain a copy set;
allocating the replica set to nodes of the standard host cluster using the host node configuration and the model allocation configuration;
and writing the distributed nodes into the registration directory, and generating a mapping relation between the copy set and the standard host cluster to obtain the original mapping information.
Optionally, the cluster updating is performed on the to-be-online model based on the original mapping information to obtain a standard model cluster, including:
inquiring whether the model to be online has data records in the original mapping information;
if the model to be online has a data record in the original mapping information, the original mapping information is not updated;
if the model to be online does not have a data record in the original mapping information, the host node configuration and the model distribution configuration are reused to perform model distribution on the model to be online to obtain the standard model cluster, and the original mapping information is updated using the mapping relation between the model to be online and the standard host cluster to obtain the standard mapping information.
Optionally, the receiving a model loading service, performing distributed loading on the model of each node in the standard model cluster based on a preset subscription and publication system, and feeding back a loading result to an issuing end of the model loading service includes:
receiving the model loading service by using the subscription and publication system, and sending the model loading service to each node of the standard model cluster based on the subscription and publication system;
and carrying out model monitoring on each node of the standard model cluster, carrying out model loading on the model in each node of the standard model cluster based on the monitoring result, and feeding back the loading results of all the nodes to an issuing end of the model loading service.
Optionally, the performing model monitoring on each node of the standard model cluster, and performing model loading on the model in each node of the standard model cluster based on the monitoring result includes:
carrying out model detection processing on the model in each node of the standard model cluster;
if an error is detected in the model in a node, the monitoring result is abnormal and the model in that node is not loaded;
and if no error is detected in the model in a node, the monitoring result is normal and the model in that node is loaded.
Optionally, after the cluster update is performed on the to-be-online model based on the original mapping information to obtain a standard model cluster, the method further includes:
receiving a model deletion service, and determining a model to be deleted based on the model deletion service;
sending the model deleting service to each node of the standard model cluster based on the subscription and release system;
if the model to be deleted does not exist in the node, the model in the node is not processed;
and if the model to be deleted exists in the node, deleting the model to be deleted in the node, and feeding back the deletion results of all the nodes to the sending end of the model deletion service.
In order to solve the above problem, the present invention further provides a model distributed loading apparatus, including:
the host cluster mapping module is used for acquiring a model set and carrying out host cluster mapping processing on the model set to obtain original mapping information;
the cluster updating module is used for receiving the model to be online and performing cluster updating on the model to be online based on the original mapping information to obtain a standard model cluster;
and the model loading module is used for receiving the model loading service, carrying out distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feeding back the loading result to the sending end of the model loading service.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
and the processor executes the computer program stored in the memory to realize the model distributed loading method.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the model distributed loading method described above.
In the present invention, host cluster mapping processing is performed on the model set to obtain original mapping information, and the model to be online is cluster-updated based on that information; the cluster can thus be configured flexibly and quickly according to the number of hosts and the number of models to obtain a standard model cluster, which removes the limitation that the growing number of online models exceeds the memory of a single machine and improves the stability of the model service. Meanwhile, distributed loading of the models on each node of the standard model cluster based on the preset subscription and release system greatly improves model loading efficiency, and feeding the loading result back to the sending end of the model loading service also improves the efficiency of monitoring model loading. Therefore, the distributed model loading method, the distributed model loading apparatus, the electronic device and the computer-readable storage medium of the present invention can solve the problem of low model loading efficiency.
Drawings
Fig. 1 is a schematic flowchart of a distributed loading method for a model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart showing a detailed implementation of one of the steps in FIG. 1;
FIG. 3 is a schematic flow chart showing another step of FIG. 1;
FIG. 4 is a functional block diagram of a distributed model loading apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device for implementing the model distributed loading method according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a distributed loading method for a model. The execution subject of the model distributed loading method includes, but is not limited to, at least one of electronic devices such as a server and a terminal that can be configured to execute the method provided by the embodiment of the present invention. In other words, the model distributed loading method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Fig. 1 is a schematic flow chart of a distributed model loading method according to an embodiment of the present invention. In this embodiment, the model distributed loading method includes the following steps S1 to S3:
s1, obtaining a model set, and carrying out host cluster mapping processing on the model set to obtain original mapping information.
In the embodiment of the present invention, the model set refers to the models of all business product lines; for example, in the financial field, the model set may be the set of intention recognition models in the online training systems of each business subsidiary and product line.
In detail, before the host cluster mapping process is performed on the model set, the method further includes:
and constructing an original host cluster, and performing host node configuration and model distribution configuration on the original host cluster to obtain a standard host cluster.
In an embodiment of the present invention, the original host cluster may be a ZooKeeper cluster. ZooKeeper is a distributed service management framework designed on the observer pattern that provides services as a cluster: a cluster contains a plurality of nodes, each node corresponds to a host running the ZooKeeper program, all nodes provide services together, and the cluster environment as a whole guarantees the consistency of distributed data.
The host node configuration is specified through HOST_LIST, i.e. a list of the hosts in the cluster. The model allocation configuration comprises a REPLICAS configuration and an ADMIN_IP configuration: the REPLICAS configuration determines the number of copies of each model, i.e. on how many machines each model is deployed; the ADMIN_IP configuration means that when the cluster first starts, one node is set manually or selected by default from HOST_LIST to allocate cluster machines and models, and that node computes the allocation according to the configured allocation strategy algorithm, which may be a uniform distribution strategy, a maximum-remaining-memory-first distribution strategy, or the like.
In an optional embodiment of the present invention, the uniform distribution strategy allocates model A to REPLICAS of the machines in HOST_LIST, then allocates model B starting from the next machine, and so on, until the number of models allocated to each machine is roughly even; the maximum-remaining-memory-first distribution strategy preferentially loads onto the cluster host with the most remaining memory.
Specifically, referring to fig. 2, the performing host cluster mapping processing on the model set to obtain original mapping information includes the following steps S10 to S12:
s10, initializing and cleaning a registration directory of the standard host cluster, and copying the model in the model set based on the model distribution configuration to obtain a copy set;
s11, distributing the copy set to the nodes of the standard host cluster by utilizing the host node configuration and the model distribution configuration;
and S12, writing the distributed nodes into the registration directory, and generating a mapping relation between the copy set and the standard host cluster to obtain the original mapping information.
In an optional embodiment of the present invention, since registration with the ZooKeeper cluster uses temporary (ephemeral) nodes at startup, the registration directory of the cluster needs to be cleared first, and all models are then allocated according to the configured information to obtain the mapping relationship between the cluster and the models. The original mapping information includes an MD5 fingerprint of the current model allocation policy, the cluster machine list and the number of copies, among other things, and may be stored in a PG database.
For example, suppose there are three models m1, m2 and m3, the host node configuration HOST_LIST contains four nodes/hosts (Pods), the REPLICAS configuration in the model allocation configuration is 3 (i.e. each model has 3 copies), and the ADMIN_IP configuration uses the uniform distribution strategy for model allocation; then m1 is allocated to Pod1, Pod2 and Pod3; m2 to Pod2, Pod3 and Pod4; and m3 to Pod3, Pod4 and Pod1.
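The following is a minimal Python sketch of the uniform distribution strategy described above, reproducing the m1/m2/m3 example; the names HOST_LIST, REPLICAS and allocate_uniform are illustrative assumptions and not part of the patent's implementation.

    # Illustrative sketch only: a round-robin uniform-distribution allocator.
    # HOST_LIST, REPLICAS and allocate_uniform are assumed names.

    HOST_LIST = ["Pod1", "Pod2", "Pod3", "Pod4"]   # host node configuration
    REPLICAS = 3                                   # copies of each model
    MODELS = ["m1", "m2", "m3"]

    def allocate_uniform(models, hosts, replicas):
        """Assign each model to `replicas` hosts, shifting the starting host by
        one for every model so the per-host model counts stay roughly even."""
        mapping = {}
        for start, model in enumerate(models):
            mapping[model] = [hosts[(start + i) % len(hosts)] for i in range(replicas)]
        return mapping

    if __name__ == "__main__":
        print(allocate_uniform(MODELS, HOST_LIST, REPLICAS))
        # {'m1': ['Pod1', 'Pod2', 'Pod3'],
        #  'm2': ['Pod2', 'Pod3', 'Pod4'],
        #  'm3': ['Pod3', 'Pod4', 'Pod1']}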
S2, receiving a model to be on-line, and performing cluster updating on the model to be on-line based on the original mapping information to obtain a standard model cluster.
In the embodiment of the invention, the original mapping information is essentially a table; the table has a unique key in each environment (test/online). Before each go-live, whether the table has changed (including the MD5 fingerprint of the current model allocation strategy, the cluster machine list and the number of copies) is queried and the table is updated accordingly, which improves the accuracy of model loading.
In detail, the cluster updating is performed on the model to be online based on the original mapping information to obtain a standard model cluster, including:
inquiring whether the model to be online has data records in the original mapping information;
if the model to be online has data records in the original mapping information, the original mapping information is not updated;
if the model to be online does not have a data record in the original mapping information, the host node configuration and the model distribution configuration are reused to perform model distribution on the model to be online to obtain the standard model cluster, and the original mapping information is updated using the mapping relation between the model to be online and the standard host cluster to obtain the standard mapping information.
In an optional embodiment of the present invention, if a data record for the model to be online exists in the original mapping information, the go-live is either a restart of a single Pod already online or a release with no change to the model configuration policy compared with the previous version, so all Pods load according to the original mapping information. If no record is found, the model to be online has changed from the previous version (the machine list, copy number or allocation strategy algorithm may have changed); the node designated by the ADMIN_IP configuration then invokes the host model allocation strategy to reallocate, and the allocation result is written into the original mapping information to obtain the standard mapping information. After allocation is complete, each node of the standard model cluster reads its latest allocated models.
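A hedged sketch of this pre-launch check follows, assuming a dictionary-like store for the mapping records and the helper names config_fingerprint and update_cluster; the system described above uses a database table and the ADMIN_IP node, which this illustration only approximates.

    # Illustrative only: compare the stored mapping record against the current
    # configuration fingerprint and reallocate when the configuration changed.
    import hashlib
    import json

    def config_fingerprint(policy, host_list, replicas):
        """MD5 over the allocation policy, host list and replica count."""
        payload = json.dumps({"policy": policy, "hosts": sorted(host_list),
                              "replicas": replicas}, sort_keys=True)
        return hashlib.md5(payload.encode("utf-8")).hexdigest()

    def update_cluster(model, mapping_db, policy, host_list, replicas, allocate):
        """Reuse the stored allocation if the fingerprint is unchanged,
        otherwise reallocate the model and persist the new record."""
        fp = config_fingerprint(policy, host_list, replicas)
        record = mapping_db.get(model)           # assumed dict-like lookup
        if record is not None and record["fingerprint"] == fp:
            return record["nodes"]               # no update to the mapping info
        nodes = allocate([model], host_list, replicas)[model]
        mapping_db[model] = {"fingerprint": fp, "nodes": nodes}
        return nodes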
In the embodiment of the present invention, performing the cluster update on the model to be online based on the original mapping information allows the cluster to be configured flexibly and quickly according to the number of hosts and the number of models to obtain the standard model cluster, removes the limitation that the growing number of online models exceeds the memory of a single machine, and improves the stability of the model service.
And S3, receiving a model loading service, carrying out distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feeding back a loading result to an issuing end of the model loading service.
In this embodiment of the present invention, the preset subscription and publication system may be the Redis publish/subscribe system. Redis publish/subscribe (pub/sub) is a message communication mode in which a sender (pub) publishes messages and subscribers (sub) receive them: the sender publishes a message to a channel, and subscribers receive it by subscribing to that channel. A Redis client may subscribe to any number of channels, and each node in the standard model cluster acts as a subscriber to the model loading information.
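For illustration, a minimal publish/subscribe round trip using the redis-py client is sketched below; the channel name "model_load" and the message format are assumptions, not details given in the patent.

    # Illustrative only: sender publishes a load request, a node subscribes.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def publish_load_request(model_name):
        """Sender (pub) side: the model loading interface publishes a message."""
        r.publish("model_load", json.dumps({"action": "load", "model": model_name}))

    def node_listen():
        """Node (sub) side: each Pod subscribes to the channel and reacts."""
        pubsub = r.pubsub()
        pubsub.subscribe("model_load")
        for msg in pubsub.listen():
            if msg["type"] != "message":
                continue                          # skip subscribe confirmations
            request = json.loads(msg["data"])
            if request.get("action") == "load":
                print(f"received load request for {request['model']}")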
In detail, referring to fig. 3, the receiving a model loading service, performing distributed loading on the model of each node in the standard model cluster based on a preset subscription and publication system, and feeding back a loading result to an issuing end of the model loading service includes the following steps S30 to S31:
s30, receiving the model loading service by using the subscription and publication system, and sending the model loading service to each node of the standard model cluster based on the subscription and publication system;
and S31, carrying out model monitoring on each node of the standard model cluster, carrying out model loading on the model in each node of the standard model cluster based on the monitoring result, and feeding back the loading results of all the nodes to an issuing end of the model loading service.
In an optional embodiment of the present invention, the entire model loading logic is based on the Redis publish/subscribe system: when the model loading interface receives a model loading service from an upstream server (the sender), it triggers the publication of a Redis message; each node (Pod) in the standard model cluster subscribes to and listens for the published Redis messages, parses each message it receives, and then decides from the parsing result whether to load the model.
Specifically, the performing model monitoring on each node of the standard model cluster, and performing model loading on the model in each node of the standard model cluster based on the monitoring result includes:
carrying out model detection processing on the model in each node of the standard model cluster;
if an error is detected in the model in a node, the monitoring result is abnormal and the model in that node is not loaded;
and if no error is detected in the model in a node, the monitoring result is normal and the model in that node is loaded.
In the embodiment of the present invention, model loading may be performed through a TFS (TensorFlow Serving) service and mainly comprises preloading, formal loading, cluster loading monitoring and callback. Preloading performs model detection on the models in each node, i.e. checks whether the model in a node is problematic, to avoid unpredictable errors caused by blindly loading online (for example, inconsistent versions or incomplete models). Formal loading loads the model in a node whose preloading is normal. Cluster loading monitoring tracks the loading status of the models in the cluster nodes through Redis. The callback returns the cluster loading status (success/failure) of the model after the model cluster has finished loading, that is, it feeds the loading results of all nodes back to the sending end of the model loading service.
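The per-node preload/formal-load/callback flow can be sketched as follows; check_model, load_into_serving and report are placeholder names standing in for the TensorFlow Serving integration and the Redis callback, which the patent does not spell out.

    # Illustrative only: preload check, formal load, then report the result.
    import os

    def check_model(model_dir, expected_version):
        """Preload: verify the model directory exists and the version matches."""
        return (os.path.isdir(model_dir)
                and os.path.basename(model_dir.rstrip("/")) == str(expected_version))

    def load_model_on_node(model_dir, expected_version, load_into_serving, report):
        """Formally load only when preloading is normal; report success/failure."""
        if not check_model(model_dir, expected_version):
            report(status="failure", reason="preload check failed")
            return False
        try:
            load_into_serving(model_dir)          # formal loading step
        except Exception as exc:                  # unpredictable loading error
            report(status="failure", reason=str(exc))
            return False
        report(status="success", reason="loaded")
        return True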
In another optional embodiment of the present invention, after the cluster updating is performed on the to-be-online model based on the original mapping information to obtain a standard model cluster, the method further includes:
receiving a model deletion service, and determining a model to be deleted based on the model deletion service;
sending the model deleting service to each node of the standard model cluster based on the subscription and release system;
if the model to be deleted does not exist in the node, the model in the node is not processed;
and if the model to be deleted exists in the node, deleting the model to be deleted in the node, and feeding back the deletion results of all the nodes to the sending end of the model deletion service.
In the embodiment of the present invention, the model deletion logic is the same as the model loading logic and is likewise based on the Redis publish/subscribe mechanism. During model deletion each node subscribes to and listens for the published Redis messages, receives the model deletion service, parses out the model to be deleted, checks whether that model exists locally and deletes it if so, and the deletion result is returned to the sending end of the model deletion service once every node of the cluster has completed the deletion.
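A corresponding sketch of the per-node deletion handling, again with assumed names (loaded_models as the node's local model registry, report as the callback to the sender of the deletion service):

    # Illustrative only: delete the model if it exists on this node and report.
    def handle_delete(model_name, loaded_models, report):
        """loaded_models: dict of models currently loaded on this node."""
        if model_name not in loaded_models:
            report(node_result="skipped")         # model not on this node
            return
        loaded_models.pop(model_name)             # remove the local copy
        report(node_result="deleted")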
In the embodiment of the present invention, the state of a ZK (ZooKeeper) node is init_status while the host cluster is being mapped and monitor_status while the cluster's ZK nodes are being monitored, so that the cluster state (including normal startup of the cluster, the cluster going online, and the like) can be grasped more quickly from the machine and node states in the ZK registration directory.
In the present invention, host cluster mapping processing is performed on the model set to obtain original mapping information, and the model to be online is cluster-updated based on that information; the cluster can thus be configured flexibly and quickly according to the number of hosts and the number of models to obtain a standard model cluster, which removes the limitation that the growing number of online models exceeds the memory of a single machine and improves the stability of the model service. Meanwhile, distributed loading of the models on each node of the standard model cluster based on the preset subscription and release system greatly improves model loading efficiency, and feeding the loading result back to the sending end of the model loading service also improves the efficiency of monitoring model loading. Therefore, the distributed model loading method provided by the invention can solve the problem of low model loading efficiency.
Fig. 4 is a functional block diagram of a model distributed loading apparatus according to an embodiment of the present invention.
The model distributed loading apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the model distributed loading apparatus 100 may include a host cluster mapping module 101, a cluster updating module 102, and a model loading module 103. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the host cluster mapping module 101 is configured to obtain a model set, and perform host cluster mapping processing on the model set to obtain original mapping information;
the cluster updating module 102 is configured to receive a model to be online, and perform cluster updating on the model to be online based on the original mapping information to obtain a standard model cluster;
the model loading module 103 is configured to receive a model loading service, perform distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feed back a loading result to an issuing end of the model loading service.
In detail, the specific implementation of each module of the model distributed loading apparatus 100 is as follows:
step one, obtaining a model set, and carrying out host cluster mapping processing on the model set to obtain original mapping information.
In the embodiment of the present invention, the model set refers to the models of all business product lines; for example, in the financial field, the model set may be the set of intention recognition models in the online training systems of each business subsidiary and product line.
In detail, before the host cluster mapping process is performed on the model set, the method further includes:
and constructing an original host cluster, and performing host node configuration and model distribution configuration on the original host cluster to obtain a standard host cluster.
In an embodiment of the present invention, the original host cluster may be a ZooKeeper cluster. ZooKeeper is a distributed service management framework designed on the observer pattern that provides services as a cluster: a cluster contains a plurality of nodes, each node corresponds to a host running the ZooKeeper program, all nodes provide services together, and the cluster environment as a whole guarantees the consistency of distributed data.
The host node configuration is specified through HOST_LIST, i.e. a list of the hosts in the cluster. The model allocation configuration comprises a REPLICAS configuration and an ADMIN_IP configuration: the REPLICAS configuration determines the number of copies of each model, i.e. on how many machines each model is deployed; the ADMIN_IP configuration means that when the cluster first starts, one node is set manually or selected by default from HOST_LIST to allocate cluster machines and models, and that node computes the allocation according to the configured allocation strategy algorithm, which may be a uniform distribution strategy, a maximum-remaining-memory-first distribution strategy, or the like.
In an optional embodiment of the present invention, the uniform distribution strategy allocates model A to REPLICAS of the machines in HOST_LIST, then allocates model B starting from the next machine, and so on, until the number of models allocated to each machine is roughly even; the maximum-remaining-memory-first distribution strategy preferentially loads onto the cluster host with the most remaining memory.
Specifically, the performing host cluster mapping processing on the model set to obtain original mapping information includes:
initializing and cleaning a registration directory of the standard host cluster, and copying the model in the model set based on the model distribution configuration to obtain a copy set;
allocating the replica set to nodes of the standard host cluster using the host node configuration and the model allocation configuration;
writing the distributed nodes into the registration directory, and generating a mapping relation between the copy set and the standard host cluster to obtain the original mapping information.
In an optional embodiment of the present invention, since registration with the ZooKeeper cluster uses temporary (ephemeral) nodes at startup, the registration directory of the cluster needs to be cleared first, and all models are then allocated according to the configured information to obtain the mapping relationship between the cluster and the models. The original mapping information includes an MD5 fingerprint of the current model allocation policy, the cluster machine list and the number of copies, among other things, and may be stored in a PG database.
For example, suppose there are three models m1, m2 and m3, the host node configuration HOST_LIST contains four nodes/hosts (Pods), the REPLICAS configuration in the model allocation configuration is 3 (i.e. each model has 3 copies), and the ADMIN_IP configuration uses the uniform distribution strategy for model allocation; then m1 is allocated to Pod1, Pod2 and Pod3; m2 to Pod2, Pod3 and Pod4; and m3 to Pod3, Pod4 and Pod1.
And step two, receiving a model to be on-line, and performing cluster updating on the model to be on-line based on the original mapping information to obtain a standard model cluster.
In the embodiment of the invention, the original mapping information is essentially a table; the table has a unique key in each environment (test/online). Before each go-live, whether the table has changed (including the MD5 fingerprint of the current model allocation strategy, the cluster machine list and the number of copies) is queried and the table is updated accordingly, which improves the accuracy of model loading.
In detail, the cluster updating is performed on the model to be online based on the original mapping information to obtain a standard model cluster, including:
inquiring whether the model to be online has data records in the original mapping information;
if the model to be online has data records in the original mapping information, the original mapping information is not updated;
if the model to be online does not have a data record in the original mapping information, the host node configuration and the model distribution configuration are reused to perform model distribution on the model to be online to obtain the standard model cluster, and the original mapping information is updated using the mapping relation between the model to be online and the standard host cluster to obtain the standard mapping information.
In an optional embodiment of the present invention, if a data record for the model to be online exists in the original mapping information, the go-live is either a restart of a single Pod already online or a release with no change to the model configuration policy compared with the previous version, so all Pods load according to the original mapping information. If no record is found, the model to be online has changed from the previous version (the machine list, copy number or allocation strategy algorithm may have changed); the node designated by the ADMIN_IP configuration then invokes the host model allocation strategy to reallocate, and the allocation result is written into the original mapping information to obtain the standard mapping information. After allocation is complete, each node of the standard model cluster reads its latest allocated models.
In the embodiment of the present invention, performing the cluster update on the model to be online based on the original mapping information allows the cluster to be configured flexibly and quickly according to the number of hosts and the number of models to obtain the standard model cluster, removes the limitation that the growing number of online models exceeds the memory of a single machine, and improves the stability of the model service.
And step three, receiving a model loading service, carrying out distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feeding back a loading result to an issuing end of the model loading service.
In this embodiment of the present invention, the preset subscription and publication system may be the Redis publish/subscribe system. Redis publish/subscribe (pub/sub) is a message communication mode in which a sender (pub) publishes messages and subscribers (sub) receive them: the sender publishes a message to a channel, and subscribers receive it by subscribing to that channel. A Redis client may subscribe to any number of channels, and each node in the standard model cluster acts as a subscriber to the model loading information.
In detail, the receiving the model loading service, performing distributed loading on the model of each node in the standard model cluster based on a preset subscription and publication system, and feeding back a loading result to an issuing end of the model loading service includes:
receiving the model loading service by using the subscription and publication system, and sending the model loading service to each node of the standard model cluster based on the subscription and publication system;
and carrying out model monitoring on each node of the standard model cluster, carrying out model loading on the model in each node of the standard model cluster based on a monitoring result, and feeding back the loading results of all the nodes to an issuing end of the model loading service.
In an optional embodiment of the present invention, the entire model loading logic is based on the Redis publish/subscribe system: when the model loading interface receives a model loading service from an upstream server (the sender), it triggers the publication of a Redis message; each node (Pod) in the standard model cluster subscribes to and listens for the published Redis messages, parses each message it receives, and then decides from the parsing result whether to load the model.
Specifically, the performing model monitoring on each node of the standard model cluster, and performing model loading on the model in each node of the standard model cluster based on the monitoring result includes:
carrying out model detection processing on the model in each node of the standard model cluster;
if an error is detected in the model in a node, the monitoring result is abnormal and the model in that node is not loaded;
and if no error is detected in the model in a node, the monitoring result is normal and the model in that node is loaded.
In the embodiment of the present invention, model loading may be performed through a TFS (TensorFlow Serving) service and mainly comprises preloading, formal loading, cluster loading monitoring and callback. Preloading performs model detection on the models in each node, i.e. checks whether the model in a node is problematic, to avoid unpredictable errors caused by blindly loading online (for example, inconsistent versions or incomplete models). Formal loading loads the model in a node whose preloading is normal. Cluster loading monitoring tracks the loading status of the models in the cluster nodes through Redis. The callback returns the cluster loading status (success/failure) of the model after the model cluster has finished loading, that is, it feeds the loading results of all nodes back to the sending end of the model loading service.
In another optional embodiment of the present invention, after the cluster updating is performed on the model to be on-line based on the original mapping information to obtain a standard model cluster, the method further includes:
receiving a model deletion service, and determining a model to be deleted based on the model deletion service;
sending the model deleting service to each node of the standard model cluster based on the subscription and release system;
if the model to be deleted does not exist in the node, the model in the node is not processed;
and if the model to be deleted exists in the node, deleting the model to be deleted in the node, and feeding back the deletion results of all the nodes to the sending end of the model deletion service.
In the embodiment of the present invention, the model deletion logic is the same as the model loading logic and is likewise based on the Redis publish/subscribe mechanism. During model deletion each node subscribes to and listens for the published Redis messages, receives the model deletion service, parses out the model to be deleted, checks whether that model exists locally and deletes it if so, and the deletion result is returned to the sending end of the model deletion service once every node of the cluster has completed the deletion.
In the embodiment of the present invention, the state of a ZK (ZooKeeper) node is init_status while the host cluster is being mapped and monitor_status while the cluster's ZK nodes are being monitored, so that the cluster state (including normal startup of the cluster, the cluster going online, and the like) can be grasped more quickly from the machine and node states in the ZK registration directory.
In the present invention, host cluster mapping processing is performed on the model set to obtain original mapping information, and the model to be online is cluster-updated based on that information; the cluster can thus be configured flexibly and quickly according to the number of hosts and the number of models to obtain a standard model cluster, which removes the limitation that the growing number of online models exceeds the memory of a single machine and improves the stability of the model service. Meanwhile, distributed loading of the models on each node of the standard model cluster based on the preset subscription and release system greatly improves model loading efficiency, and feeding the loading result back to the sending end of the model loading service also improves the efficiency of monitoring model loading. Therefore, the distributed model loading device provided by the invention can solve the problem of low model loading efficiency.
Fig. 5 is a schematic structural diagram of an electronic device for implementing the model distributed loading method according to an embodiment of the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication interface 12 and a bus 13, and may further comprise a computer program, such as a model distributed loader, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as a code of a model distributed loader, etc., but also to temporarily store data that has been output or will be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., model distributed loader, etc.) stored in the memory 11 and calling data stored in the memory 11.
The communication interface 12 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 13 may be divided into an address bus, a data bus, a control bus, etc. The bus 13 is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 5 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and another electronic device.
Optionally, the electronic device may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The model distributed loader stored in the memory 11 of the electronic device is a combination of instructions that, when executed in the processor 10, can implement:
obtaining a model set, and carrying out host cluster mapping processing on the model set to obtain original mapping information;
receiving a model to be on-line, and performing cluster updating on the model to be on-line based on the original mapping information to obtain a standard model cluster;
and receiving a model loading service, carrying out distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feeding back a loading result to an issuing end of the model loading service.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the electronic device integrated module/unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, implements:
obtaining a model set, and carrying out host cluster mapping processing on the model set to obtain original mapping information;
receiving a model to be on-line, and performing cluster updating on the model to be on-line based on the original mapping information to obtain a standard model cluster;
and receiving a model loading service, carrying out distributed loading on the model of each node in the standard model cluster based on a preset subscription and release system, and feeding back a loading result to an issuing end of the model loading service.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A model distributed loading method, the method comprising:
obtaining a model set, and performing host cluster mapping processing on the model set to obtain original mapping information;
receiving a model to be brought online, and performing cluster updating on the model to be brought online based on the original mapping information to obtain a standard model cluster;
and receiving a model loading service, performing distributed loading on the model of each node in the standard model cluster based on a preset subscription and publication system, and feeding back a loading result to an issuing end of the model loading service.
2. The model distributed loading method according to claim 1, wherein before the host cluster mapping processing is performed on the model set, the method further comprises:
constructing an original host cluster, and performing host node configuration and model allocation configuration on the original host cluster to obtain a standard host cluster.
3. The model distributed loading method according to claim 2, wherein the performing host cluster mapping processing on the model set to obtain original mapping information comprises:
initializing and cleaning a registration directory of the standard host cluster, and copying the models in the model set based on the model allocation configuration to obtain a replica set;
allocating the replica set to nodes of the standard host cluster by using the host node configuration and the model allocation configuration;
and writing the allocated nodes into the registration directory, and generating a mapping relation between the replica set and the standard host cluster to obtain the original mapping information.
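For illustration only, the mapping processing of claim 3 can be sketched as follows, with the registration directory modeled as an in-memory dictionary; the function name host_cluster_mapping, the configuration keys and the rotation-based allocation policy are assumptions made for the example, not part of the claim.

def host_cluster_mapping(model_set, node_config, allocation_config, registry):
    # Clean the registration directory, replicate each model, allocate the
    # replica set to nodes, and record the resulting mapping.
    registry.clear()                                       # initialize / clean the directory
    nodes = list(node_config["nodes"])
    replicas = allocation_config.get("replicas", 2)
    mapping = {}
    for i, model in enumerate(sorted(model_set)):
        replica_set = [f"{model}#copy{r}" for r in range(replicas)]
        assigned = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
        for replica, node in zip(replica_set, assigned):
            registry.setdefault(node, []).append(replica)  # write the allocated node into the directory
        mapping[model] = assigned
    return mapping                                         # original mapping information

registry = {}
print(host_cluster_mapping({"model-a", "model-b"},
                           {"nodes": ["node-1", "node-2", "node-3"]},
                           {"replicas": 2},
                           registry))
print(registry)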
4. The model distributed loading method according to claim 3, wherein the performing cluster updating on the model to be brought online based on the original mapping information to obtain a standard model cluster comprises:
querying whether the model to be brought online has a data record in the original mapping information;
if the model to be brought online has a data record in the original mapping information, not updating the original mapping information;
and if the model to be brought online has no data record in the original mapping information, reusing the host node configuration and the model allocation configuration to perform model allocation on the model to be brought online to obtain the standard model cluster, and updating the original mapping information by using the mapping relation between the model to be brought online and the standard host cluster to obtain standard mapping information.
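A minimal sketch of the cluster-update decision in claim 4, under the assumption of a dictionary-based mapping and a least-loaded allocation policy; cluster_update and the configuration keys are hypothetical names introduced only for this illustration.

def cluster_update(original_mapping, new_model, node_config, allocation_config):
    # Keep the mapping unchanged if the model already has a data record;
    # otherwise allocate it to the least-loaded nodes and update the mapping.
    if new_model in original_mapping:
        return original_mapping
    nodes = list(node_config["nodes"])
    replicas = allocation_config.get("replicas", 2)
    load = {n: sum(n in assigned for assigned in original_mapping.values()) for n in nodes}
    updated = dict(original_mapping)
    updated[new_model] = sorted(nodes, key=lambda n: load[n])[:replicas]
    return updated                                         # standard mapping information

print(cluster_update({"model-a": ["node-1", "node-2"]},
                     "model-c",
                     {"nodes": ["node-1", "node-2", "node-3"]},
                     {"replicas": 2}))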
5. The model distributed loading method according to claim 1, wherein the receiving a model loading service, performing distributed loading on the model of each node in the standard model cluster based on a preset subscription and publication system, and feeding back a loading result to an issuing end of the model loading service comprises:
receiving the model loading service by using the subscription and publication system, and sending the model loading service to each node of the standard model cluster based on the subscription and publication system;
and performing model monitoring on each node of the standard model cluster, performing model loading on the model in each node of the standard model cluster based on the monitoring result, and feeding back the loading results of all the nodes to the issuing end of the model loading service.
6. The model distributed loading method according to claim 5, wherein the performing model monitoring on each node of the standard model cluster and performing model loading on the model in each node of the standard model cluster based on the monitoring result comprises:
performing model detection processing on the model in each node of the standard model cluster;
if the model in a node fails the detection, determining that the monitoring result is abnormal, and not loading the model in the node;
and if the model in a node passes the detection, determining that the monitoring result is normal, and loading the model in the node.
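The subscription/publication distribution of claim 5 and the monitoring-gated loading of claim 6 can be illustrated with an in-process toy broker; SimpleBroker, NodeAgent and their methods are hypothetical stand-ins for a real subscription and publication system and node-side agents, and the detection check is deliberately simplified.

class SimpleBroker:
    # Toy subscription/publication system: nodes subscribe, the issuing end publishes.
    def __init__(self):
        self.subscribers = []
    def subscribe(self, node):
        self.subscribers.append(node)
    def publish(self, message):
        # Deliver the service to every subscribed node and collect its result.
        return {node.name: node.handle(message) for node in self.subscribers}

class NodeAgent:
    def __init__(self, name, models):
        self.name = name
        self.models = models                       # models allocated to this node
    def detect(self, model):
        # Model detection processing: here simply "is the model present on this node".
        return model in self.models
    def handle(self, message):
        model = message["model"]
        if not self.detect(model):
            return "abnormal: not loaded"          # monitoring result abnormal, skip loading
        return "loaded"                            # monitoring result normal, load the model

broker = SimpleBroker()
broker.subscribe(NodeAgent("node-1", {"model-c"}))
broker.subscribe(NodeAgent("node-2", set()))
print(broker.publish({"model": "model-c"}))        # results fed back to the issuing end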
7. The model distributed loading method according to any one of claims 1 to 6, wherein after the cluster updating is performed on the model to be brought online based on the original mapping information to obtain the standard model cluster, the method further comprises:
receiving a model deletion service, and determining a model to be deleted based on the model deletion service;
sending the model deletion service to each node of the standard model cluster based on the subscription and publication system;
if the model to be deleted does not exist in a node, not processing the models in the node;
and if the model to be deleted exists in a node, deleting the model to be deleted in the node, and feeding back the deletion results of all the nodes to an issuing end of the model deletion service.
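Likewise, the deletion flow of claim 7 reduces, per node, to a presence check followed by a conditional delete; the sketch below is an assumption-laden illustration in the same in-memory style, with handle_delete as a hypothetical helper.

def handle_delete(node_models, model_to_delete):
    # Per-node handling of a model deletion service.
    if model_to_delete not in node_models:
        return "not present: no action"            # model absent on this node
    node_models.remove(model_to_delete)            # delete the model on this node
    return "deleted"

cluster_state = {"node-1": {"model-a", "model-c"}, "node-2": {"model-a"}}
results = {node: handle_delete(models, "model-c") for node, models in cluster_state.items()}
print(results)                                     # deletion results fed back to the issuing end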
8. A model distributed loading device, the device comprising:
a host cluster mapping module, configured to obtain a model set, and perform host cluster mapping processing on the model set to obtain original mapping information;
a cluster updating module, configured to receive a model to be brought online, and perform cluster updating on the model to be brought online based on the original mapping information to obtain a standard model cluster;
and a model loading module, configured to receive a model loading service, perform distributed loading on the model of each node in the standard model cluster based on a preset subscription and publication system, and feed back a loading result to an issuing end of the model loading service.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the model distributed loading method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the model distributed loading method according to any one of claims 1 to 7.
CN202210897604.5A 2022-07-28 2022-07-28 Model distributed loading method, device, equipment and storage medium Pending CN115220817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210897604.5A CN115220817A (en) 2022-07-28 2022-07-28 Model distributed loading method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210897604.5A CN115220817A (en) 2022-07-28 2022-07-28 Model distributed loading method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115220817A true CN115220817A (en) 2022-10-21

Family

ID=83612916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210897604.5A Pending CN115220817A (en) 2022-07-28 2022-07-28 Model distributed loading method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115220817A (en)

Similar Documents

Publication Publication Date Title
CN111625252A (en) Cluster upgrading maintenance method and device, electronic equipment and storage medium
CN110290166B (en) Cross-cluster data interaction method, system and device and readable storage medium
CN104657158A (en) Method and device for processing business in business system
CN113468049A (en) Test method, device, equipment and medium based on configurable interface
CN114816820A (en) Method, device, equipment and storage medium for repairing chproxy cluster fault
CN112445623A (en) Multi-cluster management method and device, electronic equipment and storage medium
CN111651426A (en) Data migration method and device and computer readable storage medium
CN115118738A (en) Disaster recovery backup method, device, equipment and medium based on RDMA
CN115543198A (en) Method and device for lake entering of unstructured data, electronic equipment and storage medium
CN112948380A (en) Data storage method and device based on big data, electronic equipment and storage medium
CN110007946B (en) Method, device, equipment and medium for updating algorithm model
CN115220817A (en) Model distributed loading method, device, equipment and storage medium
CN115145870A (en) Method and device for positioning reason of failed task, electronic equipment and storage medium
CN115687384A (en) UUID (user identifier) identification generation method, device, equipment and storage medium
CN114510400A (en) Task execution method and device, electronic equipment and storage medium
CN114237982A (en) System disaster recovery switching method, device, equipment and storage medium
CN112015534A (en) Configurated platform scheduling method, system and storage medium
CN111966388A (en) Space-saving mirror image version update management method, device, equipment and medium
CN109814911A (en) Method, apparatus, computer equipment and storage medium for Manage Scripts program
CN114785789B (en) Database fault management method and device, electronic equipment and storage medium
CN114185589A (en) Spring boot-based file transmission method, device, equipment and medium
CN114860314B (en) Deployment upgrading method, device, equipment and medium based on database compatibility
CN115242628A (en) Application downloading method, device and equipment based on module deployment and storage medium
CN113362040B (en) Approval chain configuration updating method and device, electronic equipment and storage medium
CN116860508B (en) Distributed system software defect continuous self-healing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination