WO2023098460A1 - Model updating method and apparatus and related device - Google Patents

Model updating method and apparatus and related device

Info

Publication number
WO2023098460A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
updated
trained
training
sample set
Prior art date
Application number
PCT/CN2022/131668
Other languages
French (fr)
Chinese (zh)
Inventor
吕超群
刘凌辉
杨锦
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023098460A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G06F8/65 Updates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a model updating method, apparatus, and related device.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • In other words, artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and to produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • Classification models are used to classify data, which automates data classification and improves classification efficiency.
  • Image recognition models are used to identify image content, which automates image content recognition and improves recognition efficiency.
  • Existing methods for updating a model based on incremental learning usually use offline learning or online learning.
  • With offline learning, the performance of the model must be tracked manually, the model must be retrained repeatedly, and the retrained model must be manually deployed online. This inevitably consumes considerable human resources and time, so the model update efficiency is relatively low.
  • With online learning, new models are continuously trained, continuously verified, and continuously used to replace the old model, which inevitably consumes a large amount of computing resources.
  • This application provides a model updating method, apparatus, and related device, which can solve the problem of low model update efficiency when the model is updated by offline learning, and also the problem of high computing resource consumption when the model is updated by online learning.
  • A method for updating a model is provided. The method includes: first, obtaining a training sample set; then, when a first trigger mechanism is used to determine that the first model needs to be updated and trained, performing update training on the first model through the training sample set to obtain an updated trained model; and finally, when a second trigger mechanism is used to determine that the first model needs to be replaced with the updated trained model, replacing the first model with the updated trained model.
  • The present application determines, through the first trigger mechanism, whether the first model needs to be updated and trained, and determines, through the second trigger mechanism, whether the first model needs to be replaced with the updated trained model. The model can therefore be triggered on demand for automatic update training and automatic update deployment, which reduces the consumption of computing resources while improving model update efficiency.
  • The first trigger mechanism includes: if the number of difficult samples in the training sample set reaches a first threshold, determining that the first model needs to be updated and trained; or, if the current time reaches the model update time, determining that the first model needs to be updated and trained; or, if the number of samples in the training sample set reaches a second threshold, determining that the first model needs to be updated and trained; or, if the online duration of the first model reaches a preset duration, determining that the first model needs to be updated and trained.
  • The second threshold is a natural number greater than 1.
  • The present application provides multiple mechanisms for determining whether to trigger update training of the model, and the user can choose any of them, which makes the method flexible.
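  • The four conditions of the first trigger mechanism can be sketched as a single check. This is a minimal illustration only: the names `FirstTriggerConfig` and `needs_update_training` are hypothetical, and the choice to treat a condition as disabled when its setting is `None` is an assumption, not something the application specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FirstTriggerConfig:
    """Hypothetical settings; a field left as None disables that condition."""
    hard_sample_threshold: Optional[int] = None   # first threshold
    sample_count_threshold: Optional[int] = None  # second threshold (> 1)
    update_time: Optional[datetime] = None        # model update time
    max_online_hours: Optional[float] = None      # preset online duration


def needs_update_training(cfg, num_hard, num_samples, now, online_hours):
    """Return True if any enabled condition of the first trigger mechanism holds."""
    if cfg.hard_sample_threshold is not None and num_hard >= cfg.hard_sample_threshold:
        return True
    if cfg.update_time is not None and now >= cfg.update_time:
        return True
    if cfg.sample_count_threshold is not None and num_samples >= cfg.sample_count_threshold:
        return True
    if cfg.max_online_hours is not None and online_hours >= cfg.max_online_hours:
        return True
    return False


# Only the difficult-sample condition is enabled here.
cfg = FirstTriggerConfig(hard_sample_threshold=300)
below = needs_update_training(cfg, 120, 900, datetime(2022, 11, 14), 48.0)
reached = needs_update_training(cfg, 300, 900, datetime(2022, 11, 14), 48.0)
```

  • In this sketch "reaches" is read as greater than or equal to the threshold, which is one plausible interpretation of the wording above.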
  • The second trigger mechanism includes: if the prediction performance of the first model is lower than the prediction performance of the updated trained model, determining that the first model needs to be replaced with the updated trained model; or, if the prediction performance of the updated trained model is within the expected prediction performance range, determining that the first model needs to be replaced with the updated trained model.
  • The present application provides multiple mechanisms for determining whether to trigger the replacement of the old model with the new model, and the user can choose any of them, which makes the method flexible.
  • The update training of the first model through the training sample set can be implemented as follows: first, the training sample set is screened to determine the difficult samples in it, and then the difficult samples in the training sample set are used to update and train the first model.
  • This application uses the difficult samples in the training sample set to update and train the model, instead of using all the samples in the training sample set as in the prior art. In this way, the consumption of computing resources can be further reduced and the model update efficiency improved.
  • The samples in the training sample set can be screened in the following way to determine the difficult samples: first, each sample in the training sample set is input into the first model for inference to obtain the attributes of the inference result corresponding to each sample, where the attributes include either of the following: confidence or cross entropy; then, according to the attributes of the inference result of each sample, it is determined whether each sample is a difficult sample.
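  • The screening step above can be illustrated with a small sketch. It assumes that each inference result is a list of class probabilities, that confidence means the highest class probability, and that cross entropy is computed against a reference label index; all function names and the threshold value 0.6 are hypothetical.

```python
import math


def confidence(probs):
    """Confidence of an inference result: the highest predicted class probability."""
    return max(probs)


def cross_entropy(probs, label):
    """Cross entropy of the predicted distribution against a reference label index."""
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)


def screen_difficult(samples, conf_threshold=0.6):
    """Return the samples whose inference confidence is below the threshold."""
    return [s for s in samples if confidence(s["probs"]) < conf_threshold]


# "probs" stands in for the output of running the first model on each sample.
samples = [
    {"id": "a", "probs": [0.90, 0.05, 0.05]},
    {"id": "b", "probs": [0.50, 0.30, 0.20]},
]
difficult_ids = [s["id"] for s in screen_difficult(samples)]
```

  • A cross-entropy criterion would work the same way with the comparison reversed, since a larger cross entropy indicates a worse prediction.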
  • A device for updating a model is provided. The device includes: an acquisition unit, configured to acquire a training sample set;
  • a model training unit, configured to perform update training on the first model through the training sample set to obtain an updated trained model when the first trigger mechanism is used to determine that the first model needs to be updated and trained; and
  • a model deployment unit, configured to replace the first model with the updated trained model when the second trigger mechanism is used to determine that the first model needs to be replaced with the updated trained model.
  • The first trigger mechanism includes: if the number of difficult samples in the training sample set reaches a first threshold, determining that the first model needs to be updated and trained; or, if the current time reaches the model update time, determining that the first model needs to be updated and trained; or, if the number of samples in the training sample set reaches a second threshold, determining that the first model needs to be updated and trained; or, if the online duration of the first model reaches a preset duration, determining that the first model needs to be updated and trained.
  • The second threshold is a natural number greater than 1.
  • The second trigger mechanism includes: if the prediction performance of the first model is lower than the prediction performance of the updated trained model, determining that the first model needs to be replaced with the updated trained model; or, if the prediction performance of the updated trained model is within the expected prediction performance range, determining that the first model needs to be replaced with the updated trained model.
  • The model training unit is specifically configured to: first, screen the training sample set to determine the difficult samples in it, and then use those difficult samples to perform update training on the first model.
  • The model training unit is specifically configured to: first, input each sample in the training sample set into the first model for inference to obtain the attributes of the inference result corresponding to each sample, where the attributes include either of the following: confidence or cross entropy; and then, according to the attributes of the inference result of each sample, determine whether each sample is a difficult sample.
  • A computer-readable storage medium is provided. The storage medium stores instructions, and the instructions are used to implement the method provided in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, a computing device is provided. The computing device includes a processor and a memory; the processor is configured to execute instructions stored in the memory, so that the computing device implements the method provided in the first aspect or any possible implementation of the first aspect.
  • A computer program product including a computer program is provided. When the computer program is read and executed by a computing device, the computing device executes the method provided in the first aspect or any possible implementation of the first aspect.
  • Fig. 1 is a schematic diagram of an artificial intelligence main framework provided by the present application.
  • Fig. 2 is a schematic flow chart of a model updating method provided by the present application.
  • FIG. 3 is a schematic flow diagram of determining difficult samples from the training sample set provided by the present application.
  • Fig. 4 is a schematic diagram of deployment of a model updating device provided by the present application.
  • FIG. 5A is a schematic structural diagram of a model updating device provided by the present application.
  • Fig. 5B is a schematic structural diagram of another model updating device provided by the present application.
  • Fig. 6 is a schematic flow chart of another model updating method provided by the present application.
  • FIG. 7 is a schematic structural diagram of a computing device provided by the present application.
  • Offline learning: also called batch learning or offline training. The model weights are updated only after a whole batch of data has been trained on. In this case, all training data must be available during model training, and the model can be deployed online to predict online data only after training is completed. Offline learning has the disadvantages of low training efficiency, difficulty in scaling the training process to large-data scenarios, and inability of the model to adapt to dynamically changing environments.
  • Online learning: also called adaptive learning or online training. Data is received in a certain order. Each time a data item is received, the model predicts on it and the current model is trained on it, and then the next data item is processed. Online learning updates the weights directly after training on each data item, rather than after a whole batch has been trained on. That is, online learning does not need a complete training data set at the beginning; as more real-time online data arrives, the model is continuously updated during operation.
  • Incremental learning: a model continuously learns new knowledge from new samples while preserving most of the previously learned knowledge. Incremental learning closely resembles the way humans learn: people learn and accept new things every day as they grow up, learning proceeds gradually, and people generally do not forget what they have already learned. The idea of incremental learning can be described as follows: whenever new data is added, the model does not need to rebuild its whole knowledge base, but only updates the changes caused by the new data on the basis of the original knowledge base. Online learning is necessarily incremental, because it is implemented by streaming in data items one by one to update the model. Incremental learning is not necessarily online, because given a model and a batch of offline data, incremental learning can use that batch to update the previously trained model without training a model from scratch.
  • Difficult samples: also called difficult examples for short. They are samples for which the inference results of the model do not meet expectations during inference.
  • Model updating is a long-term process, such as updating and training the model on a weekly or monthly basis, or updating and training the model when the accumulated data reaches a certain amount. If the full amount of data is used for model update training, it takes a lot of labeling manpower and training time.
  • If difficult samples are screened from the full amount of data and only the difficult samples are used for model update training, labeling manpower and training time can be saved and model update efficiency improved.
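  • The difference between the offline (batch) and online update styles defined above can be seen on a toy problem. The sketch below assumes a one-parameter linear model y = w·x trained with squared error and plain gradient descent; the model, the data, and the learning rate are illustrative choices, not taken from the application.

```python
def batch_update(w, data, lr):
    """Offline (batch) style: a single weight update after the whole batch."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def online_update(w, data, lr):
    """Online style: the weight is updated after every individual sample."""
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x

# Both calls start from an existing weight (here 0.0). Starting from a
# previously trained weight instead of re-initialising is what makes
# either style incremental.
w_batch = batch_update(0.0, data, lr=0.05)
w_online = online_update(0.0, data, lr=0.05)
```

  • After one pass over the same three samples, the batch style has made one update and the online style three, which is why the online weight ends up closer to the true value 2.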
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements.
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provided and processed by technology) to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform.
  • Computing power is provided by smart chips, that is, hardware acceleration chips such as a central processing unit (CPU), neural-network processing unit (NPU), graphics processing unit (GPU), application-specific integrated circuit (ASIC), or field programmable gate array (FPGA).
  • The basic platform includes platform guarantees and support such as a distributed computing framework and network, which may include cloud storage and computing, interconnection networks, etc.
  • Sensors communicate with the outside world to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • Data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • The data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • The typical functions are search and matching.
  • Machine learning models (also called machine learning algorithms, hereinafter referred to as models) are a commonly used means.
  • For example, object detection models in unmanned inspection scenes where items are placed can detect the category and position of the items, which automates detection and improves detection efficiency.
  • The prediction accuracy of the model will continue to decrease.
  • The model therefore needs to be updated. For example, the colors of images captured by a camera in winter are more monotonous than those of images captured in spring. If the model is trained using the more brightly colored images captured in spring, its recognition accuracy is higher for images captured in spring but relatively low for images captured in winter.
  • The commonly used model update method is based on incremental learning.
  • Existing incremental-learning-based model update methods usually use offline learning or online learning to update the model.
  • With offline learning, the performance of the model must be tracked manually and the model must be retrained repeatedly. After the update training is completed, the model is manually deployed online, which inevitably consumes considerable human resources and time, so the update efficiency of the model is relatively low. With online learning, new models are continuously trained, continuously verified, and continuously used to replace the old model. Although this improves the update efficiency of the model, it consumes a large amount of computing resources.
  • To address these problems, this application provides a model updating method, apparatus, and related device.
  • By providing a mechanism that automatically triggers model update training under certain conditions and a mechanism that automatically triggers model update deployment under certain conditions, the model can be triggered on demand for automatic update training and automatic update deployment, which reduces the consumption of computing resources while improving model update efficiency.
  • Fig. 2 is a schematic flow chart of a model updating method provided by the present application. As shown in Fig. 2, the method includes:
  • The training sample set includes multiple samples. The samples may all be newly generated online, may all be obtained offline, or may be partly newly generated online and partly obtained offline, which is not specifically limited here.
  • When the training sample set includes samples obtained offline, some or all of those samples may be old samples used for training the model before it was deployed online, new samples obtained offline, or old samples generated by using generative adversarial networks (GAN). This application does not limit the source of the samples in the training sample set.
  • The samples included in the training sample set may be various types of data such as images, videos, audio, and text, which is not specifically limited here.
  • S202 Use the first trigger mechanism to determine whether the first model needs to be updated and trained. When it is determined that the first model needs to be updated and trained, execute S203 and S205. When it is determined that the first model does not need to be updated and trained, execute S204.
  • The first model can be a model for various purposes, such as an image classification model, an object detection model, a sound classification model, or a text classification model.
  • The algorithm for realizing the first model can be a random forest, a support vector machine (SVM), a graph neural network (GNN), a convolutional neural network (CNN), etc., which is not specifically limited here.
  • The first trigger mechanism can take any of the following forms:
  • Form 1: Obtain the number of difficult samples in the training sample set in real time or periodically. When it is detected that the number of difficult samples reaches the first threshold, it is determined that the first model needs to be updated and trained, and the execution of step S203 is automatically triggered.
  • The first threshold is a natural number greater than 0, and its size can be set according to actual conditions.
  • For example, the first threshold can be 300, 500, etc., which is not specifically limited here.
  • For the process of determining the difficult samples in the training sample set, refer to FIG. 3 and the related description.
  • Form 2: Obtain the number of samples in the training sample set in real time or periodically. When it is detected that the number of samples reaches the second threshold, it is determined that the first model needs to be updated and trained, and the execution of step S203 is automatically triggered.
  • The second threshold is a natural number greater than 1, and its size can be set according to actual conditions.
  • For example, the second threshold can be 500, 1000, etc., which is not specifically limited here.
  • Form 3: Monitor whether the current time reaches the preset model update time. When it is detected that the current time reaches the preset model update time, it is determined that the first model needs to be updated and trained, and the execution of step S203 is automatically triggered.
  • The preset model update time can be set according to the actual situation, for example, 2:00 am every day or 00:00 on the 15th of every month, which is not specifically limited here.
  • Form 4: Monitor whether the online duration of the first model reaches the preset duration. When it is detected that the online duration reaches the preset duration, it is determined that the first model needs to be updated and trained, and the execution of step S203 is automatically triggered.
  • The preset duration can be set according to the actual situation, for example, 500 hours or 1000 hours, which is not specifically limited here.
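  • The two time-based forms (Form 3 and Form 4) can be sketched with the standard `datetime` module, using the example values above (2:00 am every day, 500 hours). The function names are hypothetical, and how the device actually schedules its checks is not specified in the application.

```python
from datetime import datetime, timedelta


def next_daily_update(now, hour=2):
    """Form 3: next occurrence of the daily model update time (default 2:00 am)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


def online_duration_reached(deployed_at, now, preset_hours=500):
    """Form 4: True once the model has been online for the preset duration."""
    return (now - deployed_at) >= timedelta(hours=preset_hours)


after_slot = next_daily_update(datetime(2022, 11, 14, 3, 0))
before_slot = next_daily_update(datetime(2022, 11, 14, 1, 30))
reached = online_duration_reached(datetime(2022, 11, 1), datetime(2022, 11, 22))
```

  • A monthly update time, such as 00:00 on the 15th, would follow the same pattern with the day of the month fixed instead of the hour.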
  • The difficult samples in the training sample set can be determined through the steps shown in FIG. 3:
  • The attributes of an inference result include confidence, cross entropy, and so on.
  • A sample corresponds to one or more inference results.
  • When a sample corresponds to one inference result, if the confidence of that inference result is less than the first confidence threshold, it is determined that the sample is a difficult sample; otherwise, it is not a difficult sample.
  • When a sample corresponds to multiple inference results, the mean value of the confidences corresponding to the multiple inference results can be calculated. If the mean confidence is less than the second confidence threshold, it is determined that the sample is a difficult sample; otherwise, it is not a difficult sample.
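  • The confidence rule above, with a first threshold for a single inference result and a second threshold for the mean over several results, can be sketched as follows. The function name and the threshold values 0.5 and 0.6 are hypothetical.

```python
from statistics import mean


def is_difficult(confidences, first_threshold=0.5, second_threshold=0.6):
    """Decide whether a sample is difficult from its inference confidences.

    One inference result: compare its confidence with the first threshold.
    Several results (for example several detected objects): compare the
    mean confidence with the second threshold.
    """
    if not confidences:
        raise ValueError("a sample needs at least one inference result")
    if len(confidences) == 1:
        return confidences[0] < first_threshold
    return mean(confidences) < second_threshold


single_low = is_difficult([0.4])        # one result below the first threshold
multi_mixed = is_difficult([0.9, 0.2])  # mean 0.55 is below the second threshold
multi_high = is_difficult([0.9, 0.8])   # mean 0.85 is above the second threshold
```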
  • S203 Perform update training on the first model by using the training sample set to obtain an updated and trained model.
  • the implementation form of the first triggering mechanism is the above-mentioned form 1
  • all the samples in the training sample set can be directly used to update the first model to obtain an updated model.
  • the first model can be updated and trained by using the difficult samples in the training sample set to obtain an updated and trained model.
  • When the implementation form of the first trigger mechanism is Form 3 or Form 4 above, all the samples in the training sample set can be used directly to update and train the first model, or the difficult samples in the training sample set can be used to update and train the first model, to obtain the updated trained model.
  • the former is a preferred way to update and train the first model.
  • the first model performs update training, which is not specifically limited here.
  • S205 Use the second trigger mechanism to determine whether the first model needs to be replaced by the updated trained model. When it is determined that the first model needs to be replaced, perform S206. When it is determined that the first model does not need to be replaced, perform S207.
  • the second trigger mechanism can be any of the following forms:
  • Form 1': Evaluate the prediction performance (such as prediction accuracy, recall rate, etc.) of the updated trained model and of the first model. If the prediction performance of the updated trained model is better than that of the first model, it is determined that the first model needs to be replaced with the updated trained model, and the execution of step S206 is automatically triggered. For example, assuming that the evaluated prediction accuracy of the first model is 0.80 and the prediction accuracy of the updated trained model is 0.81, it is determined that the first model needs to be replaced with the updated trained model.
  • Form 2': Evaluate only the prediction performance of the updated trained model. If it is within the expected prediction performance range, it is determined that the first model needs to be replaced with the updated trained model, and the execution of step S206 is automatically triggered.
  • For example, assuming that the prediction performance is prediction accuracy, the expected prediction performance range is 0.80 to 0.90, and the prediction accuracy of the updated trained model is 0.85, it is determined that the first model needs to be replaced with the updated trained model.
  • The method for evaluating the prediction performance of the updated trained model and of the first model may be the hold-out method, cross validation, etc., which is not specifically limited here.
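  • The two forms of the second trigger mechanism reduce to simple comparisons once the prediction performance has been measured. The sketch below uses the example figures from the text (0.80 versus 0.81, and the range 0.80 to 0.90 with a measured 0.85), plus a toy accuracy computation on a held-out set. The function names are hypothetical, and treating the expected range as inclusive at both ends is an assumption.

```python
def accuracy(predict, holdout):
    """Prediction accuracy of `predict` on a held-out list of (input, label) pairs."""
    correct = sum(1 for x, y in holdout if predict(x) == y)
    return correct / len(holdout)


def replace_if_better(old_perf, new_perf):
    """Form 1': replace when the updated model outperforms the first model."""
    return new_perf > old_perf


def replace_if_in_range(new_perf, low, high):
    """Form 2': replace when the updated model is within the expected range."""
    return low <= new_perf <= high


# Example figures from the text.
form1 = replace_if_better(0.80, 0.81)
form2 = replace_if_in_range(0.85, 0.80, 0.90)

# Toy hold-out evaluation: the "old" model always answers "odd".
holdout = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]
old_acc = accuracy(lambda x: "odd", holdout)
new_acc = accuracy(lambda x: "odd" if x % 2 else "even", holdout)
```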
  • After the first model is replaced with the updated trained model, S201 is executed again to obtain a new training sample set, and then S202 to S207 are executed for a new round of model updating.
  • In this method, the first trigger mechanism is used to determine whether the first model needs to be updated and trained, and the second trigger mechanism is used to determine whether the updated trained model needs to be used to replace the first model.
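  • One round of the flow from S202 to S207 can be sketched as a single function that takes the two trigger mechanisms and the training step as parameters. All names here are hypothetical. The text does not spell out S204 and S207, so the sketch assumes that both mean "keep using the first model".

```python
def update_round(model, samples, first_trigger, train, second_trigger):
    """Run one update round and return the model that should serve afterwards."""
    if not first_trigger(model, samples):   # S202: is update training needed?
        return model                        # S204 (assumed): keep the first model
    candidate = train(model, samples)       # S203: update training
    if second_trigger(model, candidate):    # S205: is replacement needed?
        return candidate                    # S206: deploy the updated model
    return model                            # S207 (assumed): keep the first model


# Toy stand-ins: a model is a dict carrying its evaluated accuracy.
first_model = {"name": "v1", "accuracy": 0.80}

def enough_samples(model, samples):
    return len(samples) >= 3

def toy_train(model, samples):
    return {"name": "v2", "accuracy": 0.81}

def better(old, new):
    return new["accuracy"] > old["accuracy"]

served = update_round(first_model, [1, 2, 3], enough_samples, toy_train, better)
unchanged = update_round(first_model, [1], enough_samples, toy_train, better)
```

  • Calling `update_round` repeatedly with freshly obtained samples (S201) gives the round-by-round behaviour described above, with training and deployment happening only when their triggers fire.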
  • The deployment of the model updating device provided by the present application is flexible. It can be deployed in an edge environment, specifically as an edge computing device in the edge environment or as a software system running on one or more edge computing devices.
  • The edge environment refers to an edge computing device cluster built at the edge of the network, geographically close to users, to provide computing, storage, and communication resources.
  • The model updating device can also be deployed in a cloud environment, which is an entity that uses basic resources to provide users with cloud services under the cloud computing model.
  • The cloud environment includes a cloud data center and a cloud service platform, and the cloud data center includes a large number of basic resources (including computing resources, storage resources, and network resources) owned by the cloud service provider.
  • The model updating device can be a server in the cloud data center, a virtual machine created in the cloud data center, or a software system deployed on a server or a virtual machine in the cloud data center. The software system can be deployed in a distributed manner on multiple servers, on multiple virtual machines, or on a combination of virtual machines and servers.
  • The model updating device can also be partially deployed in the edge environment and partially deployed in the cloud environment, as shown in FIG. 4.
  • The modules inside the model updating device can be divided in multiple ways, and each module can be a software module, a hardware module, or partly a software module and partly a hardware module, which is not limited in this application.
  • The model updating device 500A shown in FIG. 5A and the model updating device 500B shown in FIG. 5B are two example divisions of the model updating device in this application.
  • the model update apparatus 500A shown in FIG. 5A includes: an acquisition unit 501 , a model training unit 502 and a model deployment unit 503 .
  • The modules in the model updating device 500A can all be deployed on the same edge computing device, in the same cloud data center, or on the same physical machine.
  • They can also be deployed partly on an edge computing device and partly in a cloud data center. For example, the acquisition unit 501 is deployed on the edge computing device, and the model training unit 502 and the model deployment unit 503 are deployed in the cloud data center. This is not specifically limited in this application.
  • the obtaining unit 501 is configured to obtain a training sample set.
  • the model training unit 502 is configured to perform update training on the first model through the training sample set when using the first trigger mechanism to determine that the first model needs to be updated and trained, to obtain an updated and trained model;
  • the model deploying unit 503 is configured to replace the first model with the updated trained model when it is determined that the first model needs to be replaced with the updated trained model by using the second trigger mechanism.
  • the first trigger mechanism includes: if the number of difficult samples in the training sample set reaches the first threshold, it is determined that the first model needs to be updated and trained; or, if the current time reaches the model update time, it is determined that the first model needs to be updated and trained; or, if the number of samples in the training sample set reaches a second threshold, it is determined that the first model needs to be updated and trained, wherein the second threshold is a natural number greater than 1; or, if the online duration of the first model reaches the preset duration, it is determined that the first model needs to be updated and trained.
  • the second trigger mechanism includes: if the prediction performance of the updated trained model is better than the prediction performance of the first model, it is determined that the first model needs to be replaced with the updated trained model; or, if the prediction performance of the updated trained model is within the expected prediction performance range, it is determined that the first model needs to be replaced with the updated trained model.
  • the model training unit 502 can update and train the first model through the training sample set in the following manner: first, filter the training sample set to determine the difficult samples in the training sample set; then, update and train the first model using the difficult samples in the training sample set.
  • the model training unit 502 can filter the samples in the training sample set to determine the difficult samples in the following manner: first, input each sample in the training sample set into the first model for inference to obtain the attributes of the inference result corresponding to each sample, where the attributes include any of the following: confidence, cross entropy; then, according to the attributes of the inference result of each sample, determine whether each sample is a difficult sample.
  • the device 500B includes: a storage unit 510 , a management and control unit 520 , an inference unit 530 , a training unit 540 and an evaluation unit 550 .
  • each module in the model updating device 500B can be deployed on the same edge computing device, on the same cloud data center, or on the same physical machine. The modules can also be split, with some deployed on the edge computing device and others on the cloud data center; for example, the storage unit 510 and the inference unit 530 are deployed on the edge computing device, and the management and control unit 520, the training unit 540 and the evaluation unit 550 are deployed on the cloud data center. This is not specifically limited in this application.
  • the storage unit 510 is configured to store the training sample set and the first model acquired by the model updating apparatus 500B, and also store the evaluation sample set, as shown in FIG. 5B .
  • the training sample set is used to update and train the first model to obtain an updated and trained model
  • the evaluation sample set is used to evaluate the prediction performance of the obtained updated and trained model.
  • the storage unit 510 may also store a verification sample set, which is used to verify the performance of the updated trained model before the evaluation sample set is used to evaluate its prediction performance, and to adjust the hyperparameters of the updated trained model so that it reaches an optimal state.
  • the management and control unit 520 is used to control the entire model update process (including the model update training process and the model update deployment process), for example, to manage whether the training unit 540 performs update training on the first model, whether the evaluation unit 550 evaluates the prediction performance of the updated trained model, and whether the inference unit 530 replaces the first model with the updated trained model.
  • the inference unit 530, the training unit 540 and the evaluation unit 550 are all in a non-running state.
  • the management and control unit 520 executes S601 to trigger the inference unit 530 to enter the running state.
  • specifically, the model and data acquisition subunit 5301 is triggered to execute S602 to acquire the first model and the training sample set from the storage unit 510; the data screening subunit 5302 then inputs each sample in the training sample set into the first model for inference and obtains the attributes of the inference result corresponding to each sample. The attributes include confidence, cross entropy, etc. According to the attributes of the inference result of each sample, the data screening subunit 5302 determines whether the sample is a difficult sample and, if so, executes S603 to store the sample in the difficult sample set of the storage unit 510.
  • the data set management subunit 5201A in the management and control unit 520 can perform S604 to monitor the number of difficult samples in the difficult sample set in real time or periodically. When it detects that the number of difficult samples reaches the first threshold, it determines that the first model needs to be updated and trained; otherwise, it continues to monitor the number of difficult samples until that number reaches the first threshold, at which point it determines that the first model needs to be updated and trained.
  • the management and control unit 520 can also monitor whether the current time reaches the model update time. When it detects that the current time reaches the model update time, it determines that the first model needs to be updated and trained; otherwise, it continues to monitor the current time until the model update time is reached, at which point it determines that the first model needs to be updated and trained.
  • the first trigger 5202A in the management and control unit 520 executes S605 to trigger the training unit 540 to enter the running state; specifically, it triggers the model and data acquisition subunit 5401 in the training unit 540 to execute S606 to acquire the first model and the difficult sample set from the storage unit 510, and then the model training subunit 5402 uses the difficult samples in the difficult sample set to perform update training on the first model to obtain an updated trained model.
  • the management and control unit 520 can monitor the number of iterations for which the training unit 540 has used the difficult samples in the difficult sample set to update and train the first model, and when the number of iterations reaches the maximum number of iterations, the first trigger 5202A in the management and control unit 520 notifies the training unit 540 that the training is over.
  • the management and control unit 520 can also monitor whether the current time reaches the preset training end time, and when the current time reaches the training end time, the first trigger 5202A in the management and control unit 520 notifies the training unit 540 that the training is over.
  • the management and control unit 520 can also monitor the duration of the training unit 540 updating the first model. When the training duration reaches the maximum training duration, the first trigger 5202A in the management and control unit 520 notifies the training unit 540 that the training is over.
  • the training unit 540 may execute S607 to send the first message to the management and control unit 520, notifying the management and control unit 520 that the first model update training is over, and execute S608 to store the updated and trained model in storage unit 510 .
  • the management and control unit 520 executes S609 to trigger the evaluation unit 550 to enter the running state; specifically, it triggers the model and data acquisition subunit 5501 to execute S610 to acquire the evaluation sample set, the updated trained model and the first model from the storage unit 510. The model evaluation subunit 5502 then uses the evaluation sample set to evaluate the prediction performance of the updated trained model and of the first model respectively. Finally, the evaluation unit 550 executes S611 to upload the obtained prediction performance of the updated trained model and of the first model to the management and control unit 520.
  • after the management and control unit 520 receives the prediction performance of the updated trained model and of the first model uploaded by the evaluation unit 550, it determines whether the prediction performance of the updated trained model is better than that of the first model. If so, the management and control unit 520 executes S612 to control the inference unit 530 to perform model update deployment. Specifically, the management and control unit 520 obtains the updated trained model from the storage unit 510 and sends it to the inference unit 530, so that the model deployment subunit 5303 in the inference unit 530 deploys the updated trained model, that is, replaces the previously locally deployed first model with the updated trained model.
  • the management and control unit 520 may also send a model update deployment instruction to the inference unit 530, instructing the model deployment subunit 5303 in the inference unit 530 to obtain the updated trained model from the storage unit 510 and use it to replace the previously deployed first model.
  • the management and control unit 520 may be deployed with a second trigger 5202B, which is used to determine, according to the prediction performance of the updated trained model and the prediction performance of the first model, whether to control the inference unit 530 to perform model update deployment.
  • after the evaluation unit 550 enters the running state, it may use the evaluation sample set to evaluate only the prediction performance of the updated trained model, without evaluating the prediction performance of the first model, and then upload only the prediction performance of the updated trained model to the management and control unit 520.
  • the management and control unit 520 judges whether the prediction performance is within the expected prediction performance range, and controls the inference unit 530 to perform model update deployment if it is.
  • the second trigger 5202B deployed in the management and control unit 520 is used to judge, according to the prediction performance of the updated trained model, whether to control the inference unit 530 to perform model update deployment.
  • the model update device uses two trigger mechanisms to determine whether the first model needs to be updated and trained and whether the first model needs to be replaced with the updated trained model. This allows automatic update training and automatic update deployment of the model to be triggered on demand, which reduces the consumption of computing resources while improving model update efficiency.
  • FIG. 7 is a schematic structural diagram of a computing device 700 provided by the present application.
  • the computing device 700 includes: a processor 710, a memory 720 and a communication interface 730, wherein the processor 710, the memory 720 and the communication interface 730 can be connected to each other through a bus 740.
  • the processor 710 can read the program codes (including instructions) stored in the memory 720 and execute them, so that the computing device 700 executes the steps in the model update method provided by the above method embodiments, or so that the model updating apparatus 500A or 500B is deployed on the computing device 700.
  • the processor 710 may have multiple specific implementation forms, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • Processor 710 executes various types of digitally stored instructions, such as software or firmware programs stored in memory 720, which enable computing device 700 to provide various services.
  • the memory 720 is used to store program codes, which are executed under the control of the processor 710, so as to execute the processing steps in any of the above-mentioned embodiments in FIG. 2 , FIG. 3 or FIG. 6 .
  • the program code may include one or more software modules.
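  • the one or more software modules may be the software modules provided in the embodiment of FIG. 5A, such as the acquisition unit 501, the model training unit 502 and the model deployment unit 503, which can be used to execute steps S201 to S207 in the embodiment of FIG. 2, and will not be repeated here.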
  • the one or more software modules may be the software modules provided in the embodiment of FIG. 5B, such as the storage unit 510, the management and control unit 520, the inference unit 530, the training unit 540 and the evaluation unit 550, which can be used to execute steps S601 to S612 in the embodiment of FIG. 6, and will not be repeated here.
  • the memory 720 may include a volatile memory, such as a random access memory (RAM); the memory 720 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 720 may also include a combination of the above types.
  • the communication interface 730 can be a wired interface (such as an Ethernet interface, a fiber optic interface, or another type of interface such as an InfiniBand interface) or a wireless interface (such as a cellular network interface or a wireless local area network interface), and is used for communicating with other computing devices or apparatuses.
  • the communication interface 730 can adopt a protocol family on top of the transmission control protocol/internet protocol (TCP/IP), for example, the remote function call (RFC) protocol, the simple object access protocol (SOAP), the simple network management protocol (SNMP), the common object request broker architecture (CORBA) protocol, distributed protocols, etc.
  • the bus 740 can be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), etc.
  • the bus 740 can be divided into an address bus, a data bus, a control bus, and the like.
  • the bus 740 may also include a power bus, a control bus, a status signal bus, and the like.
  • the various buses are labeled as bus 740 in the figure. For ease of representation, only one thick line is used in FIG. 7 , but it does not mean that there is only one bus or one type of bus.
  • the above-mentioned computing device 700 is used to execute the method executed in the above-mentioned embodiment of the model update method, which belongs to the same idea as the above-mentioned method embodiment, and its specific implementation process is detailed in the above-mentioned method embodiment, and will not be repeated here.
  • the computing device 700 is only an example provided by the embodiment of the present application; the computing device 700 may have more or fewer components than those shown in FIG. 7, may combine two or more components, or may be realized with a different configuration of components.
  • the present application also provides a computer-readable storage medium, in which instructions are stored, and when the instructions are executed, some or all steps of the model updating method described in the above-mentioned embodiments can be implemented.
  • the present application also provides a computer program product.
  • when the computer program product is read and executed by a computer, some or all steps of the model updating method described in the above method embodiments can be realized.
  • the above embodiments may be implemented in whole or in part by software, hardware or any combination thereof.
  • when implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, DSL) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium, or a semiconductor medium.

Abstract

The present application provides a model updating method and apparatus and a related device, applied to the field of artificial intelligence (AI). The method comprises: first, obtaining a training sample set; then, when a first trigger mechanism is used to determine that updating and training a first model is required, updating and training the first model by means of the training sample set to obtain a trained model; and finally, when a second trigger mechanism is used to determine that replacing the first model with the updated and trained model is required, replacing the first model with the updated and trained model. The method can solve the problem of low model updating efficiency in the prior art, and meanwhile, reduce consumption of computing resources.

Description

A model updating method, apparatus and related device
This application claims priority to Chinese patent application No. 202111443976.2, entitled "A Model Updating Method, Apparatus and Related Device", filed with the China Patent Office on November 30, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of artificial intelligence, and in particular to a model updating method, apparatus and related device.
Background
AI is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
In the AI field, machine learning models are a commonly used tool. For example, a classification model classifies data, automating data classification and improving classification efficiency; an image recognition model recognizes the content of images, automating image content recognition and improving recognition efficiency. After a model is deployed online, if the features of the online data to be predicted change over time, the model needs to be updated so that it can adapt to the dynamically changing environment; otherwise, the prediction accuracy of the model will keep decreasing. At present, models are usually updated based on incremental learning, which improves the accuracy of the model in the current environment without the updated model forgetting previously learned knowledge.
However, existing methods for updating a model based on incremental learning usually use either offline learning or online learning. With offline learning, the model performance must be tracked manually, and the model must be repeatedly retrained and manually deployed online, which inevitably consumes considerable human resources and time, so the model update efficiency is relatively low. With online learning, new models are continuously trained, continuously verified, and continuously used to replace old models, which inevitably consumes a large amount of computing resources.
Summary
The present application provides a model updating method, apparatus and related device, which can solve the low model update efficiency of methods that update a model by offline learning and, at the same time, the large consumption of computing resources of methods that update a model by online learning.
In a first aspect, a model updating method is provided. The method includes: first, obtaining a training sample set; then, when a first trigger mechanism is used to determine that a first model needs to be updated and trained, performing update training on the first model through the training sample set to obtain a trained model; and finally, when a second trigger mechanism is used to determine that the first model needs to be replaced with the updated trained model, replacing the first model with the updated trained model.
As can be seen from the above solution, the present application determines through the first trigger mechanism whether the first model needs to be updated and trained, and determines through the second trigger mechanism whether the first model needs to be replaced with the updated trained model. This allows automatic update training and automatic update deployment of the model to be triggered on demand, which improves model update efficiency while reducing the consumption of computing resources.
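As a rough illustration of the two-trigger flow summarized above, the following Python sketch gates update training and deployment behind two separate predicates. It is only a sketch under stated assumptions: the names `update_model`, `should_train`, `train` and `should_deploy` are hypothetical and do not appear in the application, and the model is treated as an opaque object.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def update_model(
    first_model: Model,
    training_samples: Sequence[Sample],
    should_train: Callable[[Model, Sequence[Sample]], bool],  # stands in for the first trigger mechanism
    train: Callable[[Model, Sequence[Sample]], Model],        # stands in for update training
    should_deploy: Callable[[Model, Model], bool],            # stands in for the second trigger mechanism
) -> Model:
    """Return the model that should be serving after one update cycle (hypothetical helper)."""
    if not should_train(first_model, training_samples):
        return first_model  # first trigger not met: no training is performed
    candidate = train(first_model, training_samples)
    if should_deploy(first_model, candidate):
        return candidate    # second trigger met: the candidate replaces the first model
    return first_model      # second trigger not met: the first model stays deployed
```

Passing the triggers in as callables keeps the choice of trigger condition separate from the control flow, which mirrors the way the application lets either trigger take one of several forms.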
In a possible implementation, the first trigger mechanism includes: if the number of difficult samples in the training sample set reaches a first threshold, determining that the first model needs to be updated and trained; or, if the current time reaches the model update time, determining that the first model needs to be updated and trained; or, if the number of samples in the training sample set reaches a second threshold, determining that the first model needs to be updated and trained; or, if the online duration of the first model reaches a preset duration, determining that the first model needs to be updated and trained. The second threshold is a natural number greater than 1.
As can be seen from the above solution, the present application provides multiple mechanisms for determining whether to trigger update training of the model, and the user can select any one of them, which offers considerable flexibility.
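The four conditions of the first trigger mechanism could be expressed along the following lines. This is a minimal sketch, not the application's implementation: `TrainingTriggerConfig` and `needs_update_training` are hypothetical names, and the choice to fire as soon as any enabled condition holds is an assumption made here (the application only states that the user may select any one mechanism).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TrainingTriggerConfig:
    """Hypothetical settings; a field left as None disables that condition."""
    hard_sample_threshold: Optional[int] = None      # the "first threshold"
    sample_count_threshold: Optional[int] = None     # the "second threshold" (> 1)
    update_time: Optional[datetime] = None           # scheduled model update time
    max_online_duration: Optional[timedelta] = None  # preset online duration


def needs_update_training(
    cfg: TrainingTriggerConfig,
    *,
    hard_samples: int,
    total_samples: int,
    now: datetime,
    deployed_at: datetime,
) -> bool:
    """Return True if any enabled condition is met (assumed combination rule)."""
    checks = []
    if cfg.hard_sample_threshold is not None:
        checks.append(hard_samples >= cfg.hard_sample_threshold)
    if cfg.sample_count_threshold is not None:
        checks.append(total_samples >= cfg.sample_count_threshold)
    if cfg.update_time is not None:
        checks.append(now >= cfg.update_time)
    if cfg.max_online_duration is not None:
        checks.append(now - deployed_at >= cfg.max_online_duration)
    return any(checks)
```

With only `hard_sample_threshold` set, the function reduces to the first condition listed above; with no field set, it never triggers training.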
In a possible implementation, the second trigger mechanism includes: if the prediction performance of the first model is lower than the prediction performance of the updated trained model, determining that the first model needs to be replaced with the updated trained model; or, if the prediction performance of the updated trained model is within the expected prediction performance range, determining that the first model needs to be replaced with the updated trained model.
As can be seen from the above solution, the present application provides multiple mechanisms for determining whether to trigger replacement of the old model with the new model, and the user can select any one of them, which offers considerable flexibility.
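The two forms of the second trigger mechanism can be sketched as a single decision function. The sketch assumes that prediction performance is summarized as one number where higher is better (for example accuracy); the application does not fix a metric. The name `needs_replacement` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def needs_replacement(
    updated_score: float,
    first_score: Optional[float] = None,
    expected_range: Optional[Tuple[float, float]] = None,
) -> bool:
    """Decide whether the updated trained model should replace the first model.

    Assumes a higher score means better prediction performance.
    """
    if first_score is not None:
        # Form 1: compare the updated trained model against the first model.
        return updated_score > first_score
    if expected_range is not None:
        # Form 2: only the updated trained model is evaluated.
        low, high = expected_range
        return low <= updated_score <= high
    raise ValueError("provide either first_score or expected_range")
```

The second form needs no evaluation of the first model, so it saves one evaluation pass at the cost of requiring an expected range to be chosen in advance.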
In a possible implementation, the update training of the first model through the training sample set can be implemented as follows: first, the training sample set is filtered to determine the difficult samples in the training sample set; then, the first model is updated and trained using the difficult samples in the training sample set.
As can be seen from the above solution, the present application uses the difficult samples in the training sample set to update and train the model, rather than using all the samples in the training sample set as in the prior art, which further reduces the consumption of computing resources and improves model update efficiency.
In a possible implementation, the samples in the training sample set can be filtered to determine the difficult samples as follows: first, each sample in the training sample set is input into the first model for inference to obtain the attributes of the inference result corresponding to each sample, where the attributes include any of the following: confidence, cross entropy; then, according to the attributes of the inference result of each sample, it is determined whether each sample is a difficult sample.
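The screening step above can be sketched for a classifier that outputs class probabilities. This passage does not specify the decision rule or any thresholds, so the rule below (low top-class confidence when no label is available, high cross entropy when a label is available) and the default values `min_confidence=0.6` and `max_cross_entropy=1.0` are assumptions chosen purely for illustration; the function names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross entropy of one prediction against an integer class label."""
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)


def is_hard_sample(
    probs: Sequence[float],
    label: Optional[int] = None,
    min_confidence: float = 0.6,     # assumed threshold
    max_cross_entropy: float = 1.0,  # assumed threshold
) -> bool:
    """Classify one inference result as a difficult (hard) sample or not."""
    if label is None:
        return max(probs) < min_confidence  # confidence-based rule
    return cross_entropy(probs, label) > max_cross_entropy  # label-based rule


def select_hard_samples(
    samples: Sequence[Tuple[object, Optional[int]]],
    predict: Callable[[object], Sequence[float]],  # the first model's inference function
) -> List[Tuple[object, Optional[int]]]:
    """Run every (input, label) pair through the model and keep the difficult ones."""
    return [(x, y) for x, y in samples if is_hard_sample(predict(x), y)]
```

Only the samples returned by `select_hard_samples` would then be passed to update training, which is where the saving in computing resources described above comes from.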
In a second aspect, a model updating apparatus is provided. The apparatus includes: an acquisition unit, configured to obtain a training sample set;
a model training unit, configured to perform update training on the first model through the training sample set to obtain an updated trained model when a first trigger mechanism is used to determine that the first model needs to be updated and trained;
a model deployment unit, configured to replace the first model with the updated trained model when a second trigger mechanism is used to determine that the first model needs to be replaced with the updated trained model.
In a possible implementation, the first trigger mechanism includes: if the number of difficult samples in the training sample set reaches a first threshold, determining that the first model needs to be updated and trained; or, if the current time reaches the model update time, determining that the first model needs to be updated and trained; or, if the number of samples in the training sample set reaches a second threshold, determining that the first model needs to be updated and trained; or, if the online duration of the first model reaches a preset duration, determining that the first model needs to be updated and trained. The second threshold is a natural number greater than 1.
In a possible implementation, the second trigger mechanism includes: if the prediction performance of the first model is lower than the prediction performance of the updated trained model, determining that the first model needs to be replaced with the updated trained model; or, if the prediction performance of the updated trained model is within the expected prediction performance range, determining that the first model needs to be replaced with the updated trained model.
In a possible implementation, the model training unit is specifically configured to: first, filter the training sample set to determine the difficult samples in the training sample set; then, update and train the first model using the difficult samples in the training sample set.
In a possible implementation, the model training unit is specifically configured to: first, input each sample in the training sample set into the first model for inference to obtain the attributes of the inference result corresponding to each sample, where the attributes include any of the following: confidence, cross entropy; then, according to the attributes of the inference result of each sample, determine whether each sample is a difficult sample.
According to a third aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions, and the instructions are used to implement the method provided in the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, a computing device is provided. The computing device includes a processor and a memory, and the processor is configured to execute instructions stored in the memory, so that the computing device implements the method provided in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, a computer program product is provided, including a computer program. When the computer program is read and executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect.
Description of Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by this application;
FIG. 2 is a schematic flowchart of a model updating method provided by this application;
FIG. 3 is a schematic flowchart of determining hard samples from a training sample set provided by this application;
FIG. 4 is a schematic deployment diagram of a model updating apparatus provided by this application;
FIG. 5A is a schematic structural diagram of a model updating apparatus provided by this application;
FIG. 5B is a schematic structural diagram of another model updating apparatus provided by this application;
FIG. 6 is a schematic flowchart of another model updating method provided by this application;
FIG. 7 is a schematic structural diagram of a computing device provided by this application.
Detailed Description
The technical solutions provided by this application are described below with reference to the accompanying drawings. Clearly, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
To make the technical solutions provided by this application clearer, the relevant terms are explained first, before the technical solutions are described in detail.
Offline learning: also called batch learning or offline training. The model weights are updated only after a batch (a batch of data) has been trained on, which requires all training data to be available during model training. In addition, the model can be deployed online to make predictions on online data only after model training is complete. Offline learning has the disadvantages of low training efficiency, a training process that is hard to scale to big-data scenarios, and a model that cannot adapt to a dynamically changing environment.
Online learning: also called adaptive learning or online training. Data is received in a certain order; each time a piece of data is received, the model makes a prediction on it and the current model is trained on it, and then the next piece of data is processed. In online learning the weights are updated directly after training on a single piece of data, rather than after training on a whole batch. In other words, online learning does not require the complete training data set to be provided at the outset; as more real-time online data arrives, the model is continuously trained and updated during operation.
Incremental learning: a model can continuously learn new knowledge from new samples while retaining most of the knowledge it has already learned. Incremental learning closely resembles the way humans learn: as people grow up, they learn and take in new things every day, learning proceeds step by step, and knowledge already learned is generally not forgotten. The idea of incremental learning can be described as follows: whenever new data is added, the model does not need to rebuild the entire knowledge base; instead, on the basis of the existing knowledge base, it updates only the changes caused by the new data. Incremental learning is therefore closer to the way humans think. Online learning is necessarily incremental, because it is implemented by updating the model as data streams in one piece at a time. Incremental learning, however, is not necessarily online: given a model and a batch of offline data, incremental learning can use that batch of offline data to update the previously trained model without training a model from scratch.
Hard sample: also called a hard example for short. It refers to a difficult sample for which the model's inference result does not meet expectations during inference. In real business scenarios, model updating is a long-term process; for example, update training is performed on the model every week or every month, or whenever the accumulated data reaches a certain amount. Using the full data for update training requires considerable labeling effort and training time. To improve model update efficiency, hard samples are screened out of the full data and only the hard samples are used for update training, which saves labeling effort and training time and improves model update efficiency.
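To make the distinction between offline (batch) learning and online learning defined above more concrete, the following is a minimal sketch rather than anything taken from this application. It assumes a one-parameter linear model y ≈ w·x trained with squared error; the function names `batch_step` and `online_steps` and the learning rate `lr` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Sequence, Tuple

Pair = Tuple[float, float]  # (input x, target y); an assumed toy data format


def batch_step(w: float, batch: Sequence[Pair], lr: float) -> float:
    """Offline (batch) learning: one weight update after the whole batch is processed."""
    # Gradient of the mean squared error (w*x - y)^2 averaged over the batch.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad


def online_steps(w: float, stream: Iterable[Pair], lr: float) -> float:
    """Online learning: the weight is updated right after each sample arrives."""
    for x, y in stream:
        w -= lr * 2.0 * (w * x - y) * x
    return w
```

With the same two samples, `batch_step` changes the weight once, whereas `online_steps` changes it twice, once per sample; this is the difference in update granularity that the definitions describe.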
FIG. 1 is a schematic diagram of an artificial intelligence main framework. The main framework describes the overall workflow of an artificial intelligence system and is applicable to general requirements in the field of artificial intelligence.
The artificial intelligence main framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects a series of processes from data acquisition to data processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement from data to information to knowledge to wisdom.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the technologies for providing and processing it) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power for the artificial intelligence system, enables communication with the outside world, and is supported by a basic platform. Communication with the outside world takes place through sensors. Computing power is provided by smart chips, for example hardware acceleration chips such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA). The basic platform includes platform assurance and support related to distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to obtain data, and the data is provided to smart chips in a distributed computing system provided by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet of Things data from conventional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like. Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on, on data. Reasoning refers to the process in which a computer or intelligent system simulates human intelligent reasoning and, according to a reasoning control strategy, uses formalized information to carry out machine thinking and solve problems; its typical functions are searching and matching.
The application scenarios involved in the embodiments of this application are described below.
In the AI field, machine learning models (also called machine learning algorithms, hereinafter referred to as models) are commonly used. For example, an object detection model used for unmanned inspection of item placement can detect the category and position of objects in a picture, which enables automatic detection and improves detection efficiency. After a model is deployed online, if the features of the online data to be predicted change over time, the prediction accuracy of the model keeps decreasing. To ensure that the model can adapt to a dynamically changing environment, the model needs to be updated. For example, images captured by a camera in winter are less colorful than images captured in spring. If the model was trained on brightly colored images captured in spring, it recognizes images captured in spring with higher accuracy and images captured in winter with lower accuracy.
When the model is updated to adapt to a dynamically changing environment, it must not forget the knowledge it has already learned; that is, the accuracy of the updated model must still meet the requirements when it makes predictions on data generated in the old environment. At present, the commonly used model updating methods are based on incremental learning.
However, existing incremental-learning-based model updating methods usually update the model by offline learning or by online learning. With offline learning, the model performance has to be tracked manually, the model has to be retrained repeatedly, and after update training is complete the model has to be deployed online manually. This inevitably consumes considerable human resources and time, so the model update efficiency is low. With online learning, new models are continuously produced by training, continuously validated, and continuously used to replace old models. Although this improves the model update efficiency, it consumes a large amount of computing resources.
To solve the problem that existing incremental-learning-based model updating methods either have low update efficiency or have high update efficiency but consume a large amount of computing resources, this application provides a model updating method, an apparatus, and a related device. By designing a mechanism that automatically triggers update training of the model under certain conditions and a mechanism that automatically triggers update deployment of the model under certain conditions, automatic update training and automatic update deployment of the model can be triggered on demand, which improves model update efficiency while reducing the consumption of computing resources.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a model updating method provided by this application. As shown in FIG. 2, the method includes:
S201. Obtain a training sample set.
The training sample set includes a plurality of samples. The samples may all be newly generated online, may all be obtained offline, or may be partly newly generated online and partly obtained offline; this is not specifically limited here. When the training sample set includes samples obtained offline, some or all of those samples may be old samples that were used to train the model before it was deployed online, may be new samples obtained offline, or may be old samples generated by using a generative adversarial network (GAN). This application does not limit the source of the samples in the training sample set.
In a specific implementation, the samples included in the training sample set may be data of various types, such as images, video, audio, and text; this is not specifically limited here.
S202. Use a first trigger mechanism to determine whether update training needs to be performed on the first model. When it is determined that update training is needed, perform S203 and S205; when it is determined that update training is not needed, perform S204.
The first model may be a model for any of various purposes, such as an image classification model, an object detection model, a sound classification model, or a text classification model. The algorithm that implements the first model may be a random forest, a support vector machine (SVM), a graph neural network (GNN), a convolutional neural network (CNN), or the like; this is not specifically limited here.
In a specific embodiment of this application, the first trigger mechanism may take any one of the following forms:
Form 1: Obtain the number of hard samples in the training sample set in real time or periodically. When it is detected that the number of hard samples reaches the first threshold, determine that update training needs to be performed on the first model, and then automatically trigger step S203. The first threshold is a natural number greater than 0 and can be set according to actual conditions; for example, the first threshold may be 300 or 500, which is not specifically limited here. For the process of determining the hard samples in the training sample set, see FIG. 3 and the related description.
Form 2: Obtain the number of samples in the training sample set in real time or periodically. When it is detected that the number of samples reaches the second threshold, determine that update training needs to be performed on the first model, and then automatically trigger step S203. The second threshold is a natural number greater than 1 and can be set according to actual conditions; for example, the second threshold may be 500 or 1000, which is not specifically limited here.
Form 3: Monitor whether the current time has reached a preset model update time. When it is detected that the current time has reached the preset model update time, determine that update training needs to be performed on the first model, and then automatically trigger step S203. The preset model update time can be set according to actual conditions; for example, it may be set to 2:00 a.m. every day or to 00:00 on the 15th of every month, which is not specifically limited here.
Form 4: Monitor whether the online duration of the first model has reached a preset duration. When it is detected that the online duration has reached the preset duration, determine that update training needs to be performed on the first model, and then automatically trigger step S203. The preset duration can be set according to actual conditions; for example, it may be set to 500 hours or 1000 hours, which is not specifically limited here.
It should be noted that the forms of the first trigger mechanism listed above are merely examples. Other ways of determining whether update training of the first model needs to be triggered also fall within the protection scope of this application and are not specifically limited here.
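The four forms of the first trigger mechanism described above can be sketched as a single check. This is only an illustrative sketch under assumptions: the class `FirstTriggerConfig`, the function `needs_update_training`, and all field names are hypothetical and do not appear in this application, and a condition is evaluated only if its field has been set.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FirstTriggerConfig:
    """Hypothetical settings; leave a field as None to disable that form."""
    hard_sample_threshold: Optional[int] = None      # Form 1: first threshold
    sample_count_threshold: Optional[int] = None     # Form 2: second threshold
    update_time: Optional[datetime] = None           # Form 3: preset model update time
    max_online_duration: Optional[timedelta] = None  # Form 4: preset online duration


def needs_update_training(cfg: FirstTriggerConfig, num_hard: int, num_samples: int,
                          now: datetime, online_since: datetime) -> bool:
    """Return True if any enabled condition of the first trigger mechanism holds."""
    if cfg.hard_sample_threshold is not None and num_hard >= cfg.hard_sample_threshold:
        return True  # Form 1: enough hard samples have accumulated
    if cfg.sample_count_threshold is not None and num_samples >= cfg.sample_count_threshold:
        return True  # Form 2: enough samples have accumulated
    if cfg.update_time is not None and now >= cfg.update_time:
        return True  # Form 3: the preset update time has been reached
    if cfg.max_online_duration is not None and now - online_since >= cfg.max_online_duration:
        return True  # Form 4: the model has been online for the preset duration
    return False
```

For example, `FirstTriggerConfig(hard_sample_threshold=300)` enables only Form 1, so the check returns True once 300 hard samples have accumulated, regardless of the total sample count.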
In a specific embodiment of this application, the hard samples in the training sample set can be determined through the steps shown in FIG. 3:
S301. Input each sample in the training sample set into the first model for inference, to obtain an attribute of the inference result corresponding to each sample.
The attribute includes confidence, cross entropy, or the like.
S302. Determine, according to the attribute of the inference result corresponding to each sample, whether the sample is a hard sample.
Take confidence as an example of the attribute. A sample usually has one or more corresponding inference results. If a sample has one inference result, then after the confidence of that inference result is obtained, it is determined whether the confidence is less than a first confidence threshold; if so, the sample is determined to be a hard sample, and otherwise it is not a hard sample. If a sample has multiple inference results, then after the confidence of each of those inference results is obtained, the mean of the confidences can be calculated; if the mean confidence is less than a second confidence threshold, the sample is determined to be a hard sample, and otherwise it is not a hard sample.
Take cross entropy as another example of the attribute. After the cross entropy of the inference result corresponding to a sample is obtained, it is determined whether the cross entropy is less than a cross entropy threshold; if so, the sample is determined to be a hard sample, and otherwise it is not a hard sample.
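The confidence-based rule in S302 can be sketched as follows. This is a minimal illustration covering only the confidence attribute; the function name `is_hard_sample` and the default threshold values of 0.6 are assumptions made for the example, since this application does not specify threshold values.

```python
from statistics import mean
from typing import Sequence


def is_hard_sample(confidences: Sequence[float],
                   first_threshold: float = 0.6,
                   second_threshold: float = 0.6) -> bool:
    """Decide whether a sample is a hard sample from its inference confidences.

    `confidences` holds one value per inference result of the sample.
    The default thresholds are placeholder values, not values from the source.
    """
    if not confidences:
        raise ValueError("a sample needs at least one inference result")
    if len(confidences) == 1:
        # Single inference result: compare it with the first confidence threshold.
        return confidences[0] < first_threshold
    # Multiple inference results: compare their mean with the second threshold.
    return mean(confidences) < second_threshold
```

Applied over a training sample set, a list comprehension such as `[s for s, c in zip(samples, all_confidences) if is_hard_sample(c)]` would yield the hard samples, where `samples` and `all_confidences` are hypothetical parallel lists.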
S203. Perform update training on the first model by using the training sample set, to obtain an updated and trained model.
As can be seen from the description of S202, the first trigger mechanism has several possible forms. Depending on the form used, the process of performing update training on the first model by using the training sample set differs.
When the first trigger mechanism takes Form 1 above, all samples in the training sample set can be used directly to perform update training on the first model, to obtain the updated and trained model.
When the first trigger mechanism takes Form 2 above, the hard samples in the training sample set can be used to perform update training on the first model, to obtain the updated and trained model.
When the first trigger mechanism takes Form 3 or Form 4 above, either all samples in the training sample set or only the hard samples in the training sample set can be used to perform update training on the first model, to obtain the updated and trained model.
It can be understood that, compared with using all samples in the training sample set, using only the hard samples for update training consumes fewer computing resources, requires less training time, and gives higher training efficiency. Therefore, in a specific implementation, training on the hard samples is the preferred way to perform update training on the first model.
It should be noted that the above ways of performing update training on the first model are merely examples. In a specific implementation, update training can also be performed on the first model by using the hard samples together with some of the easy samples in the training sample set; this is not specifically limited here.
S204. Do not perform update training on the first model.
S205. Use a second trigger mechanism to determine whether the first model needs to be replaced with the updated and trained model. When it is determined that the first model needs to be replaced, perform S206; when it is determined that the first model does not need to be replaced, perform S207.
In a specific embodiment of this application, the second trigger mechanism may take any one of the following forms:
Form 1': Evaluate the prediction performance (for example, prediction accuracy or recall) of both the updated and trained model and the first model. If the prediction performance of the updated and trained model is better than that of the first model, determine that the first model needs to be replaced with the updated and trained model, and then automatically trigger step S206. For example, if the evaluated prediction accuracy of the first model is 0.80 and that of the updated and trained model is 0.81, it is determined that the first model needs to be replaced with the updated and trained model.
Form 2': Evaluate only the prediction performance of the updated and trained model. If the prediction performance of the updated and trained model falls within the expected prediction performance range, determine that the first model needs to be replaced with the updated and trained model, and then automatically trigger step S206. For example, if the prediction performance is prediction accuracy, the expected prediction performance range is 0.80 to 0.90, and the prediction accuracy of the updated and trained model is 0.85, it is determined that the first model needs to be replaced with the updated and trained model.
In a specific implementation, the prediction performance of the updated and trained model and of the first model can be evaluated by the hold-out method, cross-validation, or the like; this is not specifically limited here.
It should be noted that the forms of the second trigger mechanism listed above are merely examples. Other ways of determining whether replacement of the first model needs to be triggered also fall within the protection scope of this application and are not specifically limited here.
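The two forms of the second trigger mechanism reduce to simple comparisons once the prediction performance has been evaluated. The sketch below is illustrative only: the function names are hypothetical, the scores are assumed to be "higher is better" metrics such as accuracy, and treating the expected range as inclusive at both ends is an assumption.

```python
def replace_by_comparison(new_score: float, old_score: float) -> bool:
    """Form 1': replace when the updated model outperforms the first model."""
    return new_score > old_score


def replace_by_range(new_score: float, low: float, high: float) -> bool:
    """Form 2': replace when the updated model's score lies in the expected range.

    The range is treated as inclusive here, which is an assumption.
    """
    return low <= new_score <= high
```

Using the figures from the examples above, `replace_by_comparison(0.81, 0.80)` and `replace_by_range(0.85, 0.80, 0.90)` both indicate that the first model should be replaced.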
S206. Replace the first model with the updated and trained model.
S207. Do not replace the first model.
In a specific embodiment of this application, after the first model is replaced with the updated and trained model, S201 is performed again to obtain a new training sample set, and then S202 to S207 are performed for a new round of model updating.
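The control flow of S202 to S207 can be summarized in a short sketch. It is not the implementation of this application: the function `update_round` and its three callable parameters are hypothetical placeholders standing in for the first trigger mechanism, the update training step, and the second trigger mechanism.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def update_round(
    model: Model,
    samples: Sequence[Sample],
    needs_training: Callable[[Model, Sequence[Sample]], bool],
    train: Callable[[Model, Sequence[Sample]], Model],
    should_replace: Callable[[Model, Model], bool],
) -> Model:
    """Run one round of S202 to S207 and return the model to keep online."""
    if not needs_training(model, samples):  # S202 -> S204: no update training
        return model
    candidate = train(model, samples)       # S203: update training
    if should_replace(model, candidate):    # S205 -> S206: replace the first model
        return candidate
    return model                            # S207: keep the first model
```

A caller would obtain a training sample set (S201), invoke `update_round`, and repeat with the returned model for the next round.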
In summary, in the model updating method provided by this application, the first trigger mechanism determines whether update training needs to be performed on the first model, and the second trigger mechanism determines whether the first model needs to be replaced with the updated and trained model. Automatic update training and automatic update deployment of the model can thus be triggered on demand, which solves the low update efficiency of offline-learning-based model updating and, at the same time, the heavy consumption of computing resources of online-learning-based model updating.
The model updating method provided by this application has been described in detail above. To facilitate implementation of the above solution, an apparatus and a related device for implementing the solution are correspondingly provided below.
The model updating apparatus provided by this application can be deployed flexibly. It can be deployed in an edge environment; specifically, it may be an edge computing device in the edge environment or a software system running on one or more edge computing devices. An edge environment is a cluster of edge computing devices built at the network edge, geographically close to users, to provide computing, storage, and communication resources.
The model updating apparatus can also be deployed in a cloud environment. A cloud environment is an entity that uses basic resources to provide cloud services to users under the cloud computing model. The cloud environment includes a cloud data center and a cloud service platform, and the cloud data center includes a large number of basic resources (including computing resources, storage resources, and network resources) owned by the cloud service provider. The model updating apparatus may be a server in the cloud data center, a virtual machine created in the cloud data center, or a software system deployed on servers or virtual machines in the cloud data center. The software system may be deployed in a distributed manner on multiple servers, on multiple virtual machines, or on a combination of virtual machines and servers.
The model updating apparatus can also be deployed partly in an edge environment and partly in a cloud environment, as shown in FIG. 4.
It should be understood that the unit modules inside the model updating apparatus can be divided in various ways. Each module can be a software module or a hardware module, or some modules can be software and others hardware; this application does not limit this. The model updating apparatus 500A shown in FIG. 5A and the model updating apparatus 500B shown in FIG. 5B are two exemplary ways of dividing the model updating apparatus.
First, the model updating apparatus 500A shown in FIG. 5A is introduced. As shown in FIG. 5A, the apparatus 500A includes an acquisition unit 501, a model training unit 502, and a model deployment unit 503.
It should be noted that, because the model updating apparatus 500A can be deployed flexibly, its modules can all be deployed on the same edge computing device, in the same cloud data center, or on the same physical machine. Alternatively, some modules can be deployed on an edge computing device and others in a cloud data center. For example, the acquisition unit 501 can be deployed on an edge computing device while the model training unit 502 and the model deployment unit 503 are deployed in a cloud data center. This application does not specifically limit this.
The acquisition unit 501 is configured to obtain a training sample set.
The model training unit 502 is configured to, when it is determined by using a first trigger mechanism that the first model needs update training, perform update training on the first model by using the training sample set to obtain an updated trained model.
The model deployment unit 503 is configured to, when it is determined by using a second trigger mechanism that the first model needs to be replaced with the updated trained model, replace the first model with the updated trained model.
In a possible implementation, the first trigger mechanism includes: if the number of hard samples in the training sample set reaches a first threshold, determining that the first model needs update training; or, if the current time reaches a model update time, determining that the first model needs update training; or, if the number of samples in the training sample set reaches a second threshold, determining that the first model needs update training, where the second threshold is a natural number greater than 1; or, if the online duration of the first model reaches a preset duration, determining that the first model needs update training.
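The four alternative conditions of the first trigger mechanism can be sketched as follows. This is only a minimal illustration, not the applicant's implementation: the names `RetrainConfig` and `needs_retraining` are hypothetical, and treating the alternatives as a logical OR over whichever conditions are enabled is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainConfig:
    """Hypothetical settings; a condition left as None is disabled."""
    hard_sample_threshold: Optional[int] = None      # the "first threshold"
    sample_count_threshold: Optional[int] = None     # the "second threshold" (> 1)
    model_update_time: Optional[datetime] = None     # scheduled update time
    max_online_duration: Optional[timedelta] = None  # preset online duration


def needs_retraining(cfg: RetrainConfig, *, num_hard_samples: int,
                     num_samples: int, now: datetime,
                     deployed_at: datetime) -> bool:
    """Return True if any enabled condition of the first trigger is met."""
    if (cfg.hard_sample_threshold is not None
            and num_hard_samples >= cfg.hard_sample_threshold):
        return True
    if cfg.model_update_time is not None and now >= cfg.model_update_time:
        return True
    if (cfg.sample_count_threshold is not None
            and num_samples >= cfg.sample_count_threshold):
        return True
    if (cfg.max_online_duration is not None
            and now - deployed_at >= cfg.max_online_duration):
        return True
    return False
```

A deployment that only wants one of the alternatives would set a single field and leave the others as `None`.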
In a possible implementation, the second trigger mechanism includes: if the prediction performance of the updated trained model is better than the prediction performance of the first model, determining that the first model needs to be replaced with the updated trained model; or, if the prediction performance of the updated trained model is within an expected prediction performance range, determining that the first model needs to be replaced with the updated trained model.
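The two variants of the second trigger mechanism can be illustrated with a short sketch. The function name `should_deploy` and its parameters are hypothetical, and the sketch assumes that prediction performance is a single score where higher is better (for example, accuracy); the application does not fix a particular metric.

```python
from typing import Optional, Tuple


def should_deploy(new_score: float,
                  old_score: Optional[float] = None,
                  expected_range: Optional[Tuple[float, float]] = None) -> bool:
    """Decide whether the updated trained model should replace the first model.

    Variant 1: compare against the first model's score (old_score).
    Variant 2: check that the new score lies in an expected range.
    Assumes a higher score means better prediction performance.
    """
    if old_score is not None:
        return new_score > old_score
    if expected_range is not None:
        low, high = expected_range
        return low <= new_score <= high
    raise ValueError("provide either old_score or expected_range")
```

Variant 2 avoids evaluating the first model at all, which matches the case where only the updated trained model is evaluated.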
In a possible implementation, the model training unit 502 can perform update training on the first model by using the training sample set as follows: first, screen the training sample set to determine the hard samples in it; then, use those hard samples to perform update training on the first model.
In a possible implementation, the model training unit 502 can screen the samples in the training sample set and determine the hard samples as follows: first, input each sample in the training sample set into the first model for inference to obtain an attribute of the inference result for each sample, where the attribute includes either of the following: confidence or cross entropy; then, determine whether each sample is a hard sample according to the attribute of its inference result.
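A minimal sketch of this screening step is shown below. The text names confidence and cross entropy as the attributes but does not state how they are turned into a decision, so the threshold rule here is an assumption: a sample is treated as hard when the top-class probability is low or, if a label is available, when the cross entropy is high. The names `is_hard_sample` and `select_hard_samples` and the default threshold values are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence


def is_hard_sample(probs: Sequence[float], label: Optional[int] = None,
                   min_confidence: float = 0.6,
                   max_cross_entropy: float = 1.0) -> bool:
    """Classify one inference result as hard or not (assumed threshold rule)."""
    confidence = max(probs)  # probability of the predicted class
    if confidence < min_confidence:
        return True
    if label is not None:
        # Cross entropy against a one-hot label reduces to -log p[label].
        cross_entropy = -math.log(max(probs[label], 1e-12))
        if cross_entropy > max_cross_entropy:
            return True
    return False


def select_hard_samples(model: Callable[[object], Sequence[float]],
                        samples: Sequence[object],
                        labels: Optional[Sequence[int]] = None,
                        min_confidence: float = 0.6,
                        max_cross_entropy: float = 1.0) -> List[object]:
    """Run the first model on every sample and keep the hard ones."""
    hard = []
    for i, sample in enumerate(samples):
        probs = model(sample)
        label = labels[i] if labels is not None else None
        if is_hard_sample(probs, label, min_confidence, max_cross_entropy):
            hard.append(sample)
    return hard
```

Here `model` stands for any callable that maps a sample to a list of class probabilities; cross entropy can only be used when labels are available.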
Next, the model updating apparatus 500B shown in FIG. 5B is introduced. As shown in FIG. 5B, the apparatus 500B includes a storage unit 510, a management and control unit 520, an inference unit 530, a training unit 540, and an evaluation unit 550.
It should be noted that, because the model updating apparatus 500B can be deployed flexibly, its modules can all be deployed on the same edge computing device, in the same cloud data center, or on the same physical machine. Alternatively, some modules can be deployed on an edge computing device and others in a cloud data center. For example, the storage unit 510 and the inference unit 530 can be deployed on an edge computing device while the management and control unit 520, the training unit 540, and the evaluation unit 550 are deployed in a cloud data center. This application does not specifically limit this.
The storage unit 510 stores the training sample set and the first model obtained by the model updating apparatus 500B, and also stores an evaluation sample set, as shown in FIG. 5B. The training sample set is used for update training of the first model to obtain the updated trained model, and the evaluation sample set is used to evaluate the prediction performance of the updated trained model. Optionally, the storage unit 510 can also store a validation sample set. Before the evaluation sample set is used to evaluate the prediction performance of the updated trained model, the validation sample set is used to check how the updated trained model performs and, at the same time, to tune its hyperparameters so that the updated trained model reaches its best state.
The management and control unit 520 controls the entire model update process (including the model update training process and the model update deployment process). Specifically, it controls whether the inference unit 530 screens the training sample set for hard samples, whether the training unit 540 performs update training on the first model, whether the evaluation unit 550 evaluates the prediction performance of the updated trained model, and whether the inference unit 530 replaces the first model with the updated trained model.
The process by which the management and control unit 520 controls the entire model update process is described in detail below with reference to FIG. 6.
It should be noted that, in the initial state, the inference unit 530, the training unit 540, and the evaluation unit 550 are all not running.
First, the management and control unit 520 executes S601 to trigger the inference unit 530 to enter the running state. Specifically, it triggers the model and data acquisition subunit 5301 to execute S602 and obtain the first model and the training sample set from the storage unit 510. The data screening subunit 5302 then inputs each sample in the training sample set into the first model for inference and obtains attributes of the inference result for each sample, such as confidence and cross entropy. Based on these attributes, it determines whether each sample is a hard sample and, if so, executes S603 to store that sample in the hard sample set in the storage unit 510.
While the inference unit 530 keeps adding hard samples to the hard sample set, the data set management subunit 5201A in the management and control unit 520 can execute S604 to monitor, in real time or periodically, the number of hard samples in the hard sample set. When the number of hard samples reaches the first threshold, it determines that the first model needs update training. Otherwise, it continues monitoring until the number of hard samples reaches the first threshold.
In a possible implementation, while the inference unit 530 keeps adding hard samples to the hard sample set, the management and control unit 520 can instead monitor whether the current time has reached the model update time. When the current time reaches the model update time, it determines that the first model needs update training. Otherwise, it continues monitoring the current time until the model update time is reached.
When the management and control unit 520 determines that the first model needs update training, the first trigger 5202A in the management and control unit 520 executes S605 to trigger the training unit 540 to enter the running state. Specifically, it triggers the model and data acquisition subunit 5401 in the training unit 540 to execute S606 and obtain the first model and the hard sample set from the storage unit 510. The model training subunit 5402 then uses the hard samples in the hard sample set to perform update training on the first model and obtains the updated trained model.
While the training unit 540 performs update training on the first model, the management and control unit 520 can monitor the number of training iterations that the training unit 540 has run on the hard samples. When the number of iterations reaches a maximum number of iterations, the first trigger 5202A in the management and control unit 520 notifies the training unit 540 that training is over. Optionally, the management and control unit 520 can instead monitor whether the current time has reached a preset training end time; when it has, the first trigger 5202A notifies the training unit 540 that training is over. Optionally, the management and control unit 520 can instead monitor how long the training unit 540 has been training the first model; when the training duration reaches a maximum training duration, the first trigger 5202A notifies the training unit 540 that training is over.
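The three stop conditions (maximum iterations, preset end time, maximum training duration) can be sketched as a single loop. This is a simplification: in the text the management and control unit 520 monitors training and notifies the training unit 540, whereas the sketch folds the check into the training loop itself. The name `train_with_limits`, the injectable `clock`, and the representation of times as seconds are assumptions made for the example.

```python
import time
from typing import Callable, Optional


def train_with_limits(train_step: Callable[[], None],
                      max_iterations: Optional[int] = None,
                      end_time: Optional[float] = None,
                      max_duration: Optional[float] = None,
                      clock: Callable[[], float] = time.time) -> int:
    """Run train_step until any enabled stop condition is met.

    end_time is an absolute timestamp and max_duration is in seconds,
    both measured with `clock`. Returns the number of iterations run.
    """
    if max_iterations is None and end_time is None and max_duration is None:
        raise ValueError("at least one stop condition is required")
    start = clock()
    iterations = 0
    while True:
        if max_iterations is not None and iterations >= max_iterations:
            break  # maximum number of iterations reached
        now = clock()
        if end_time is not None and now >= end_time:
            break  # preset training end time reached
        if max_duration is not None and now - start >= max_duration:
            break  # maximum training duration reached
        train_step()
        iterations += 1
    return iterations
```

Passing a fake `clock` makes the time-based conditions easy to exercise without waiting for real time to pass.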
After obtaining the updated trained model, the training unit 540 can execute S607 to send a first message to the management and control unit 520, notifying it that update training of the first model is over, and execute S608 to store the updated trained model in the storage unit 510.
On receiving the first message from the training unit 540, the management and control unit 520 executes S609 to trigger the evaluation unit 550 to enter the running state. Specifically, it triggers the model and data acquisition subunit 5501 to execute S610 and obtain the evaluation sample set, the updated trained model, and the first model from the storage unit 510. The model evaluation subunit 5502 then uses the evaluation sample set to evaluate the prediction performance of the updated trained model and of the first model separately. Finally, the evaluation unit 550 executes S611 to upload the prediction performance of both models to the management and control unit 520.
After receiving the prediction performance of the updated trained model and of the first model from the evaluation unit 550, the management and control unit 520 can determine whether the prediction performance of the updated trained model is better than that of the first model. If it is, the management and control unit 520 executes S612 to control the inference unit 530 to deploy the model update. The specific process is as follows: the management and control unit 520 obtains the updated trained model from the storage unit 510 and sends it to the inference unit 530, so that the model deployment subunit 5303 in the inference unit 530 deploys the updated trained model, that is, replaces the previously deployed local first model with the updated trained model. Optionally, the management and control unit 520 can instead send a model update deployment instruction to the inference unit 530, instructing the model deployment subunit 5303 to obtain the updated trained model from the storage unit 510 and replace the previously deployed local first model with it. In a specific implementation, a second trigger 5202B can be deployed in the management and control unit 520. This trigger determines, based on the prediction performance of the updated trained model and of the first model, whether to control the inference unit 530 to deploy the model update.
In a possible implementation, after entering the running state, the evaluation unit 550 can use the evaluation sample set to evaluate only the prediction performance of the updated trained model, without evaluating the first model, and then upload only that result to the management and control unit 520. After receiving the prediction performance of the updated trained model, the management and control unit 520 determines whether it is within the expected prediction performance range and, if so, controls the inference unit 530 to deploy the model update. In this case, the second trigger 5202B deployed in the management and control unit 520 determines, based on the prediction performance of the updated trained model alone, whether to control the inference unit 530 to deploy the model update.
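The control flow of FIG. 6 (S601 to S612) can be condensed into one illustrative function. This is only a sketch under simplifying assumptions: the units are collapsed into plain callables passed as arguments, storage and messaging between units are omitted, the hard-sample count is used as the first trigger, and the comparison against the first model (higher score is better) is used as the second trigger. The name `run_update_cycle` and all parameter names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

M = TypeVar("M")  # model type
S = TypeVar("S")  # sample type


def run_update_cycle(first_model: M,
                     training_samples: Sequence[S],
                     eval_samples: Sequence[S],
                     is_hard: Callable[[M, S], bool],
                     retrain: Callable[[M, List[S]], M],
                     evaluate: Callable[[M, Sequence[S]], float],
                     hard_threshold: int) -> M:
    """Return the model that should be serving after one update cycle."""
    # S601-S603: inference unit screens the training samples for hard samples.
    hard_samples = [s for s in training_samples if is_hard(first_model, s)]

    # S604: first trigger, here the hard-sample count reaching the threshold.
    if len(hard_samples) < hard_threshold:
        return first_model

    # S605-S608: training unit performs update training on the hard samples.
    candidate = retrain(first_model, hard_samples)

    # S609-S611: evaluation unit scores both models on the evaluation set.
    new_score = evaluate(candidate, eval_samples)
    old_score = evaluate(first_model, eval_samples)

    # S612: second trigger, deploy only if the candidate performs better.
    return candidate if new_score > old_score else first_model
```

Keeping the two triggers as separate early exits mirrors the point of the design: retraining is skipped when not needed, and a retrained model that does not improve is never deployed.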
For the specific implementation of the operations performed by the apparatus 500A shown in FIG. 5A and the apparatus 500B shown in FIG. 5B, refer to the descriptions in the foregoing model updating method embodiments. For brevity, details are not repeated here.
In summary, the model updating apparatus provided by this application (the apparatus 500A shown in FIG. 5A or the apparatus 500B shown in FIG. 5B) uses two trigger mechanisms: one to determine whether the first model needs update training, and the other to determine whether the first model needs to be replaced with the updated trained model. This allows update training and update deployment to be triggered automatically and on demand, which improves model update efficiency while reducing the consumption of computing resources.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a computing device 700 provided by this application. The computing device 700 includes a processor 710, a memory 720, and a communication interface 730, which can be connected to each other through a bus 740.
The processor 710 can read and execute the program code (including instructions) stored in the memory 720, so that the computing device 700 performs the steps of the model updating method provided by the foregoing method embodiments, or so that the computing device 700 deploys the model updating apparatus 500A or 500B.
The processor 710 can be implemented in various forms, for example as a central processing unit (CPU) or as a combination of a CPU and a hardware chip. The hardware chip can be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The processor 710 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 720, which enable the computing device 700 to provide various services.
The memory 720 stores program code, whose execution is controlled by the processor 710, to perform the processing steps in any of the embodiments of FIG. 2, FIG. 3, or FIG. 6. The program code can include one or more software modules. These can be the software modules provided in the embodiment of FIG. 5A, such as the acquisition unit 501, the model training unit 502, and the model deployment unit 503, which can be used to perform steps S201 to S207 in the embodiment of FIG. 2; details are not repeated here. Alternatively, they can be the software modules provided in the embodiment of FIG. 5B, such as the storage unit 510, the management and control unit 520, the inference unit 530, the training unit 540, and the evaluation unit 550, which can be used to perform steps S601 to S612 in the embodiment of FIG. 6; details are not repeated here.
The memory 720 can include volatile memory, such as random access memory (RAM). The memory 720 can also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 720 can also include a combination of the above types.
The communication interface 730 can be a wired interface (for example, an Ethernet interface, an optical fiber interface, or another type of interface such as an InfiniBand interface) or a wireless interface (for example, a cellular network interface or a wireless LAN interface), and is used to communicate with other computing devices or apparatuses. The communication interface 730 can use a protocol family on top of the transmission control protocol/internet protocol (TCP/IP), for example, the remote function call (RFC) protocol, the simple object access protocol (SOAP), the simple network management protocol (SNMP), the common object request broker architecture (CORBA) protocol, and distributed protocols.
The bus 740 can be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), or the like. The bus 740 can be divided into an address bus, a data bus, a control bus, and so on. In addition to a data bus, the bus 740 can also include a power bus, a control bus, a status signal bus, and the like. For clarity, however, all of these buses are labeled as the bus 740 in the figure. For ease of representation, FIG. 7 uses only one thick line, but this does not mean that there is only one bus or one type of bus.
The computing device 700 is used to perform the method described in the foregoing model updating method embodiments and is based on the same concept as those embodiments. For its specific implementation, see the foregoing method embodiments; details are not repeated here.
It should be understood that the computing device 700 is only one example provided by the embodiments of this application. The computing device 700 can have more or fewer components than shown in FIG. 7, can combine two or more components, or can be implemented with a different configuration of components.
This application also provides a computer-readable storage medium that stores instructions. When the instructions are run, some or all of the steps of the model updating method described in the foregoing embodiments can be implemented.
This application also provides a computer program product. When the computer program product is read and executed by a computer, some or all of the steps of the model updating method described in the foregoing method embodiments can be implemented.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
The foregoing embodiments can be implemented in whole or in part by software, hardware, or any combination thereof. When implemented in software, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium can be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium, a semiconductor medium, or the like.
The above are only specific implementations of this application. Any changes or substitutions that a person skilled in the art can readily conceive based on the specific implementations provided in this application shall fall within the protection scope of this application.

Claims (12)

  1. 一种模型更新方法,其特征在于,所述方法包括:A method for updating a model, characterized in that the method comprises:
    获取训练样本集;Obtain a training sample set;
    在使用第一触发机制确定需要对所述第一模型进行更新训练时,通过所述训练样本集对所述第一模型进行更新训练,得到更新训练后的模型;When the first trigger mechanism is used to determine that the first model needs to be updated and trained, the first model is updated and trained through the training sample set to obtain an updated and trained model;
    在使用第二触发机制确定需要使用所述更新训练后的模型替换所述第一模型时,使用所述更新训练后的模型替换所述第一模型。When the second trigger mechanism is used to determine that the first model needs to be replaced with the updated trained model, the first model is replaced with the updated trained model.
  2. 根据权利要求1所述的方法,其特征在于,所述第一触发机制包括:The method according to claim 1, wherein the first trigger mechanism comprises:
    若所述训练样本集中的难例样本的数量达到第一阈值,则确定需要对所述第一模型进行更新训练;If the number of difficult samples in the training sample set reaches a first threshold, it is determined that the first model needs to be updated and trained;
    或者,若当前时间到达模型更新时间,则确定需要对所述第一模型进行更新训练;Or, if the current time reaches the model update time, it is determined that the first model needs to be updated and trained;
    或者,若所述训练样本集中的样本数量达到第二阈值,则确定需要对所述第一模型进行更新训练,其中,第二阈值为大于1的自然数;Or, if the number of samples in the training sample set reaches a second threshold, it is determined that the first model needs to be updated and trained, where the second threshold is a natural number greater than 1;
    或者,若所述第一模型的上线时长达到预设时长,则确定需要对所述第一模型进行更新训练。Alternatively, if the online duration of the first model reaches a preset duration, it is determined that the first model needs to be updated and trained.
  3. 根据权利要求1或2所述的方法,其特征在于,所述第二触发机制包括:The method according to claim 1 or 2, wherein the second trigger mechanism comprises:
    若所述更新训练后的模型的预测性能优于所述第一模型的预测性能,则确定需要使用所述更新训练后的模型替换所述第一模型;If the prediction performance of the updated trained model is better than the predicted performance of the first model, it is determined that the first model needs to be replaced with the updated trained model;
    或者,若所述更新训练后的模型的预测性能处于期望预测性能范围,则确定需要使用所述更新训练后的模型替换所述第一模型。Alternatively, if the prediction performance of the updated and trained model is within an expected prediction performance range, it is determined that the first model needs to be replaced with the updated and trained model.
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述通过所述训练样本集对所述第一模型进行更新训练,包括:The method according to any one of claims 1 to 3, wherein the updating training of the first model through the training sample set includes:
    对所述训练样本集中的样本进行筛选,确定所述训练样本集中的难例样本;Screening samples in the training sample set to determine difficult samples in the training sample set;
    使用所述训练样本集中的难例样本,对所述第一模型进行更新训练。Perform update training on the first model by using the difficult samples in the training sample set.
  5. The method according to claim 4, wherein screening the samples in the training sample set to determine the difficult samples in the training sample set comprises:
    inputting each sample in the training sample set into the first model for inference to obtain an attribute of the inference result corresponding to each sample, wherein the attribute comprises any one or more of the following: confidence, cross entropy;
    determining, according to the attribute of the inference result corresponding to each sample, whether each sample is a difficult sample.
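The screening in claims 4 and 5 can be sketched as follows. This is an illustrative sketch only: it assumes a classification model whose inference result is a list of class probabilities, assumes each sample carries a label so that cross entropy can be computed, and uses hypothetical names and thresholds (`conf_threshold`, `ce_threshold`) that do not appear in the application.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple


def confidence(probs: Sequence[float]) -> float:
    """Confidence of an inference result: the highest predicted probability."""
    return max(probs)


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross entropy of the inference result against the sample's label."""
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)


def is_difficult(
    probs: Sequence[float],
    label: int,
    conf_threshold: float = 0.6,  # hypothetical value
    ce_threshold: float = 1.0,    # hypothetical value
) -> bool:
    """A sample is treated as difficult if confidence is low or loss is high."""
    return (
        confidence(probs) < conf_threshold
        or cross_entropy(probs, label) > ce_threshold
    )


def screen_difficult_samples(
    samples: List[Tuple[Any, int]],
    predict: Callable[[Any], Sequence[float]],
) -> List[Tuple[Any, int]]:
    """Run the first model on every sample and keep only the difficult ones."""
    return [(x, y) for x, y in samples if is_difficult(predict(x), y)]
```

Here `predict` stands in for inference with the first model. The returned subset is what would then be used for update training.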
  6. A model updating apparatus, wherein the apparatus comprises:
    an acquisition unit, configured to acquire a training sample set;
    a model training unit, configured to, when it is determined by using a first trigger mechanism that update training needs to be performed on the first model, perform update training on the first model by using the training sample set to obtain an updated trained model;
    a model deployment unit, configured to, when it is determined by using a second trigger mechanism that the first model needs to be replaced with the updated trained model, replace the first model with the updated trained model.
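One way to picture how the three units of claim 6 cooperate is the sketch below. It is an assumption-laden illustration, not the claimed apparatus: the class `ModelUpdater`, its `step` method, and the callable signatures are hypothetical, and both trigger mechanisms are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ModelUpdater:
    model: Any                                       # the first model (currently deployed)
    acquire: Callable[[], List[Any]]                 # acquisition unit
    train: Callable[[Any, List[Any]], Any]           # model training unit
    first_trigger: Callable[[Any, List[Any]], bool]  # is update training needed?
    second_trigger: Callable[[Any, Any], bool]       # should the candidate be deployed?

    def step(self) -> bool:
        """Run one update cycle; return True if the deployed model was replaced."""
        samples = self.acquire()
        # First trigger mechanism gates update training.
        if not self.first_trigger(self.model, samples):
            return False
        candidate = self.train(self.model, samples)
        # Second trigger mechanism gates deployment (model deployment unit).
        if not self.second_trigger(candidate, self.model):
            return False
        self.model = candidate
        return True
```

Keeping the two triggers as separate callables mirrors the claim structure: training and replacement are decided independently, so a model can be retrained without necessarily being deployed.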
  7. The apparatus according to claim 6, wherein the first trigger mechanism comprises:
    if the number of difficult samples in the training sample set reaches a first threshold, determining that update training needs to be performed on the first model;
    or, if the current time reaches a model update time, determining that update training needs to be performed on the first model;
    or, if the number of samples in the training sample set reaches a second threshold, determining that update training needs to be performed on the first model, wherein the second threshold is a natural number greater than 1;
    or, if the online duration of the first model reaches a preset duration, determining that update training needs to be performed on the first model.
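The four alternatives of the first trigger mechanism can be sketched as a single check. The sketch rests on assumptions not found in the application: the names `FirstTriggerConfig` and `needs_update_training` are hypothetical, times are expressed as seconds, and any one enabled condition is treated as sufficient.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstTriggerConfig:
    difficult_threshold: Optional[int] = None   # first threshold
    update_time: Optional[float] = None         # scheduled model update time
    sample_threshold: Optional[int] = None      # second threshold (natural number > 1)
    max_online_seconds: Optional[float] = None  # preset online duration


def needs_update_training(
    cfg: FirstTriggerConfig,
    *,
    num_difficult: int,
    num_samples: int,
    now: float,
    online_since: float,
) -> bool:
    """Return True if any configured alternative calls for update training."""
    if cfg.difficult_threshold is not None and num_difficult >= cfg.difficult_threshold:
        return True  # enough difficult samples have accumulated
    if cfg.update_time is not None and now >= cfg.update_time:
        return True  # the scheduled model update time has arrived
    if cfg.sample_threshold is not None and num_samples >= cfg.sample_threshold:
        return True  # enough samples have accumulated
    if cfg.max_online_seconds is not None and now - online_since >= cfg.max_online_seconds:
        return True  # the first model has been online for the preset duration
    return False
```

Leaving a field as `None` disables that alternative, which allows a deployment to use, say, only the difficult-sample count or only the online duration.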
  8. The apparatus according to claim 6 or 7, wherein the second trigger mechanism comprises:
    if the prediction performance of the updated trained model is better than the prediction performance of the first model, determining that the first model needs to be replaced with the updated trained model;
    or, if the prediction performance of the updated trained model is within an expected prediction performance range, determining that the first model needs to be replaced with the updated trained model.
  9. The apparatus according to any one of claims 6 to 8, wherein the model training unit is specifically configured to:
    screen the samples in the training sample set to determine difficult samples in the training sample set;
    perform update training on the first model by using the difficult samples in the training sample set.
  10. The apparatus according to claim 9, wherein the model training unit is specifically configured to:
    input each sample in the training sample set into the first model for inference to obtain an attribute of the inference result corresponding to each sample, wherein the attribute comprises any one of the following: confidence, cross entropy;
    determine, according to the attribute of the inference result corresponding to each sample, whether each sample is a difficult sample.
  11. A computing device, wherein the computing device comprises a processor and a memory, and the processor is configured to execute instructions stored in the memory, so that the computing device implements the method according to any one of claims 1 to 5.
  12. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and the instructions are used to implement the method according to any one of claims 1 to 5.
PCT/CN2022/131668 2021-11-30 2022-11-14 Model updating method and apparatus and related device WO2023098460A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111443976.2A CN116205304A (en) 2021-11-30 2021-11-30 Model updating method and device and related equipment
CN202111443976.2 2021-11-30

Publications (1)

Publication Number Publication Date
WO2023098460A1 (en) 2023-06-08

Family

ID=86511648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/131668 WO2023098460A1 (en) 2021-11-30 2022-11-14 Model updating method and apparatus and related device

Country Status (2)

Country Link
CN (1) CN116205304A (en)
WO (1) WO2023098460A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598842A (en) * 2019-07-17 2019-12-20 深圳大学 Deep neural network hyper-parameter optimization method, electronic device and storage medium
US20200311541A1 (en) * 2019-03-28 2020-10-01 International Business Machines Corporation Metric value calculation for continuous learning system
US20210357805A1 (en) * 2020-05-15 2021-11-18 Vmware, Inc. Machine learning with an intelligent continuous learning service in a big data environment

Also Published As

Publication number Publication date
CN116205304A (en) 2023-06-02

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22900265

Country of ref document: EP

Kind code of ref document: A1