WO2021143155A1 - Model training method and device - Google Patents

Model training method and device

Info

Publication number
WO2021143155A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
server
training
inference
input data
Application number
PCT/CN2020/113610
Other languages
English (en)
French (fr)
Inventor
刘志飘
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to JP2022543575A (publication JP2023511327A)
Priority to EP20913624.1A (publication EP4080419A4)
Publication of WO2021143155A1
Priority to US17/865,106 (publication US20220351081A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the field of communication technology, and in particular to a model training method and device.
  • AI: artificial intelligence.
  • In a hybrid cloud, AI model training and inference usually follow an "online training, offline inference" pattern: the training data is first uploaded to the public cloud's online training platform for model training; once a training model that meets requirements is determined, it is pushed down to the private cloud's offline inference platform, which publishes it as a service for inference.
  • This implementation makes full use of the public cloud's computing power for model training while using the private cloud to ensure the security of user data.
  • However, the accuracy of the inference results obtained by using the training model for inference may decrease over time.
  • Existing AI model training methods in the hybrid cloud scenario cannot detect a decrease in the inference accuracy of the training model in time, which may lead to frequent false alarms or unavailability of business systems (such as a face recognition system).
  • For example, aging, replacement, or repositioning of a checkpoint camera (such as a camera at a community entrance) may change the clarity and angle of the video it captures; that is, the input data of the training model changes.
  • Performing inference on the changed input data with the existing training model may greatly reduce the accuracy of subsequent inference results and affect the normal use of the video surveillance system for security.
  • This application provides a model training method and device.
  • In this application, the training model is evaluated according to its inference results, and an evaluation result is determined for each model evaluation index, so as to monitor the inference effect of the training model. The purpose is to retrain the training model in time according to its inference effect, determine a training model with a better inference effect, improve the accuracy of the inference results, and ensure the performance of the business system.
  • an embodiment of the present application provides a model training method, which is applied to a system including a first server and a second server.
  • The first server is located in a private cloud and is used for model inference; the second server is located in a public cloud and is used for model training.
  • The method includes: the first server obtains a first training model from the second server, inputs input data into the first training model to perform model inference, and obtains an inference result. Subsequently, the first server evaluates the first training model against the model evaluation indices based on the inference result, and obtains an evaluation result for each model evaluation index.
  • If the evaluation result of at least one model evaluation index does not exceed its corresponding preset threshold, the first server sends a retraining instruction for the first training model to the second server.
  • the retraining instruction is used to instruct the second server to retrain the first training model.
  • the first server can determine the inference effect of the first training model by evaluating the model evaluation index of the first training model, and realize the monitoring of the inference effect of the first training model.
  • If the inference effect is poor, the first server sends a retraining instruction for model retraining to the second server, so that the second server can retrain the training model in time according to its inference effect, determine a training model with a better inference effect, improve the accuracy of the inference results, and ensure the performance of the business system.
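  The trigger condition above, that a retraining instruction is sent as soon as any evaluation index does not exceed its preset threshold, can be sketched in Python. The function name, index names, and numeric values below are illustrative assumptions, not taken from the patent:

```python
def needs_retraining(evaluation: dict, thresholds: dict) -> bool:
    """Return True if at least one model evaluation index does not
    exceed its corresponding preset threshold (the condition under
    which the first server sends a retraining instruction)."""
    return any(evaluation[index] <= thresholds[index] for index in thresholds)

# Illustrative values: evaluation results of the first training model
# and per-index preset thresholds.
evaluation = {"accuracy": 0.91, "f1": 0.78}
thresholds = {"accuracy": 0.90, "f1": 0.80}

# f1 (0.78) does not exceed its threshold (0.80), so a retraining
# instruction would be sent to the second server.
print(needs_retraining(evaluation, thresholds))  # True
```

  If every index exceeds its threshold, the function returns False, matching the case in which no retraining instruction is sent.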
  • The method further includes: the first server sends the input data and the inference result to the second server.
  • This realizes model training in the hybrid cloud scenario and a data closed loop in the inference system, so that the application can use the input data input into the training model and the inference result obtained by model inference to retrain the training model, thereby improving the inference effect of the training model, that is, the accuracy of the inference results, and ensuring the performance of the business system.
  • the input data and the inference result are used to retrain the first training model.
  • The model evaluation indices include at least one of the following: accuracy of the inference results, precision of the inference results, recall of the inference results, F1 score (F1-Score) of the inference results, and the area under the receiver operating characteristic (ROC) curve (AUC) of the inference results.
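  For a binary classifier, the first four indices in this list can be computed directly from the confusion matrix. The following pure-Python sketch is illustrative (AUC is omitted since it requires ranked prediction scores):

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 score from
    binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative inference results compared against ground-truth labels;
# here tp=3, tn=1, fp=1, fn=1, so precision and recall are both 0.75.
result = evaluate([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```

  Each evaluation result can then be compared with the preset threshold configured for that index.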
  • If the evaluation results of all model evaluation indices exceed their corresponding preset thresholds, the first server does not send a retraining instruction for the first training model to the second server.
  • an embodiment of the present application provides a model training method, which is applied to a system including a first server and a second server.
  • The first server is located in a private cloud and is used for model inference; the second server is located in a public cloud and is used for model training.
  • the method includes: the second server obtains the retraining instruction, the input data and the inference result for the first training model from the first server.
  • The retraining instruction is used to instruct the second server to retrain the first training model; the input data is the data the first server inputs into the first training model; and the inference result is the result obtained after the first server inputs the input data into the first training model for model inference.
  • The second server determines a retraining sample set according to the input data and the inference result, and then retrains the first training model according to the retraining sample set to determine a second training model, which is used to replace the first training model. Finally, the second server sends the second training model to the first server.
  • That the second server obtains the retraining instruction, input data, and inference result for the first training model from the first server specifically includes: the second server obtains the input data and the inference result in response to the retraining instruction received from the first server.
  • That the second server determines the retraining sample set according to the input data and the inference result specifically includes: the second server annotates the input data to obtain annotated input data, and then stores the annotated input data and the inference results in the retraining sample set.
  • Before the second server annotates the input data, the method further includes: if an inference result is correct, the second server retains the inference result and its corresponding input data; if an inference result is wrong, the second server deletes the inference result and its corresponding input data, or replaces the wrong inference result with the correct inference result corresponding to the input data.
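  The cleanup described above (keep correct inference results; for wrong results, either delete the pair or substitute the correct result for the input data) might be sketched as follows. The record layout and the hypothetical `fix_labels` lookup are assumptions for illustration:

```python
def build_retraining_samples(records, fix_labels=None, drop_wrong=True):
    """records: list of (input_data, inference_result, is_correct) tuples.

    Correct pairs are kept as-is. Wrong pairs are either deleted
    (drop_wrong=True) or have their inference result replaced with the
    correct result looked up from fix_labels[input_data]."""
    samples = []
    for input_data, inference_result, is_correct in records:
        if is_correct:
            samples.append((input_data, inference_result))
        elif not drop_wrong and fix_labels and input_data in fix_labels:
            samples.append((input_data, fix_labels[input_data]))
        # otherwise the wrong pair is deleted
    return samples

records = [("img_a", "cat", True), ("img_b", "dog", False)]
keep_only = build_retraining_samples(records)                 # wrong pair dropped
corrected = build_retraining_samples(records, {"img_b": "fox"}, drop_wrong=False)
```

  The resulting samples would then be annotated and stored in the retraining sample set.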
  • this application also provides a model training device as a first server, which is applied to a system of a first server and a second server.
  • the first server is located in a private cloud and is used for model inference
  • The second server is located in a public cloud and is used for model training.
  • the device includes an acquisition unit, an inference unit, an evaluation unit, and a sending unit: the acquisition unit is used to acquire the first training model from the second server.
  • the inference unit is used to input the input data into the first training model for model inference, and obtain the inference result.
  • The evaluation unit is used to evaluate the first training model against the model evaluation indices based on the inference result, and obtain an evaluation result for each model evaluation index.
  • the sending unit is configured to send a retraining instruction for the first training model to the second server if the evaluation result of at least one model evaluation index does not exceed its corresponding preset threshold.
  • the retraining instruction is used to instruct the second server to retrain the first training model.
  • The model evaluation indices include at least one of the following: accuracy of the inference results, precision of the inference results, recall of the inference results, F1 score (F1-Score) of the inference results, and the area under the receiver operating characteristic (ROC) curve (AUC) of the inference results.
  • the sending unit is also used to send input data and inference results to the second server.
  • the input data and the inference result are used to retrain the first training model.
  • The sending unit is further configured not to send a retraining instruction for the first training model to the second server if the evaluation results of all model evaluation indices exceed their corresponding preset thresholds.
  • This application also provides a model training device as a second server, which is applied to a system of a first server and a second server.
  • the first server is located in a private cloud for model inference
  • The second server is located in a public cloud and is used for model training.
  • the device includes an acquiring unit, a determining unit, a sending unit, and a processing unit: the acquiring unit is used to acquire retraining instructions, input data, and inference results for the first training model from the first server.
  • the determining unit is used to determine the retraining sample set according to the input data and the inference result.
  • the determining unit is further configured to retrain the first training model according to the retraining sample set to determine the second training model.
  • the second training model is used to replace the first training model.
  • the sending unit is used to send the second training model to the first server.
  • The retraining instruction is used to instruct the second server to retrain the first training model.
  • The input data is the data the first server inputs into the first training model for model inference.
  • The inference result is the result obtained after the first server inputs the input data into the first training model for model inference.
  • The processing unit is configured to retain the inference result and its corresponding input data if the inference result is correct; and, if the inference result is wrong, to delete the inference result and its corresponding input data, or to replace the wrong inference result with the correct inference result corresponding to the input data.
  • the obtaining unit is specifically configured to obtain input data and inference results in response to the retraining instruction received from the first server.
  • the determining unit is specifically used to label the input data, obtain the labeled input data, and store the labeled input data and the inference result in the retraining sample set.
  • the present application provides a model training device, which is characterized in that the device includes a processor, a memory, and a communication interface.
  • the communication interface is used to communicate with other devices or communication networks
  • the memory is used to store one or more programs
  • The one or more programs include computer-executable instructions.
  • The processor executes the computer-executable instructions to cause the device to execute the model training method described in any one of the first aspect or the second aspect and their various optional implementations.
  • The present application provides a computer program product containing instructions, wherein, when the instructions are executed by a processor, the computer is caused to execute the model training method described in any one of the first aspect or the second aspect and their various optional implementations.
  • The present application provides a computer-readable storage medium storing one or more programs.
  • The one or more programs include instructions which, when executed by a processor, cause the computer to execute the model training method described in any one of the first aspect or the second aspect and their various optional implementations.
  • FIG. 1 is a first schematic diagram of a system for AI model training and inference in a hybrid cloud scenario in the prior art;
  • FIG. 2 is a second schematic diagram of a system for AI model training and inference in a hybrid cloud scenario in the prior art;
  • FIG. 3 is a schematic diagram of a system for AI model training and inference in a hybrid cloud scenario provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of the hardware structure of a chip provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of a model training device provided by an embodiment of the application.
  • FIG. 6 is a schematic flowchart of a model training method provided by an embodiment of this application.
  • FIG. 7 is a schematic diagram of the area under the receiver operating characteristic (ROC) curve (AUC) of an inference result provided by an embodiment of the application.
  • FIG. 8 is a schematic diagram of a model training device as a first server provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of a model training device as a second server provided by an embodiment of the application.
  • Cloud: servers or server clusters.
  • The services provided by the servers or server clusters that make up the cloud are cloud services (such as storage and computing), and can also be described as cloud computing services; users obtain the resources and services they need from the cloud through the network.
  • Private cloud, also known as an internal cloud or corporate cloud, refers to a cloud that provides cloud computing services to specific users (rather than the general public) through the Internet or a dedicated internal network. Private clouds are generally deployed behind firewalls or at hosting sites, which effectively guarantees the security and quality of the services provided.
  • Public cloud refers to a cloud that provides cloud computing services to all users through the Internet or a dedicated internal network.
  • The core of the public cloud is shared resource services, which can be used across the entire open public network.
  • Hybrid cloud is a cloud that integrates two forms of public cloud and private cloud to provide users with cloud computing services.
  • a system for AI model training and inference in a hybrid cloud scenario may include a first server 10 and a second server 20.
  • the first server 10 is located in a private cloud and is used for model inference
  • the second server 20 is located in a public cloud and is used for model training.
  • the first server 10 may be a server in a private cloud, or may be a server cluster composed of multiple servers in the private cloud.
  • the second server 20 may be a server in the public cloud, or may be a server cluster composed of multiple servers in the public cloud.
  • The system for AI model training and inference in the hybrid cloud scenario shown in Figure 1 may include a data collection module 101, a model training module 102, and a first model storage module 103.
  • The data collection module 101, the model training module 102, and the first model storage module 103 are located in the second server 20. Specifically, if the second server 20 is a server in the public cloud, the three modules are all located on the second server 20; if the second server 20 is a server cluster composed of multiple servers in the public cloud, the three modules are located on the same server in the cluster, or on different servers in the cluster.
  • the data collection module 101 is configured to acquire data used for model training, and add the acquired data as training samples to the training sample set, and then send the training sample set to the model training module 102.
  • The data collection module 101 can use an object storage service (OBS) to store the data it obtains.
  • The model training module 102 is used to receive the training sample set sent by the data collection module 101, perform model training on a preset model according to the training sample set to obtain a training model that meets application requirements, and store the training model in the first model storage module 103.
  • the preset model may be a pre-stored model (for example, a training model obtained through model training before), or a model set according to the application scenario of the training model.
  • For example, in natural language processing (NLP), users usually use their own training sample sets to perform model training on certain benchmark models to obtain the required training model; the preset model referred to in the embodiments of this application is such a reference model.
  • the model training module 102 is configured to obtain a preset model for model training from the first model storage module 103.
  • the first model storage module 103 is configured to obtain a trained training model from the model training module 102 and store the obtained training model.
  • the first model storage module 103 is also used to store the description information of the training model.
  • the description information of the training model includes the name, purpose, and life cycle of the training model.
  • the first model storage module 103 is also used to modify the description information of the training model.
  • The first model storage module 103 is also used to perform life cycle management on the training model according to the life cycle in the training model's description information and the duration for which the training model has been stored. That is, during the life cycle of the training model, the first model storage module 103 stores, updates, or deletes the training model and its description information; after the life cycle of the training model ends, it deletes the training model and its description information.
  • For example, suppose the first model storage module 103 obtains training model A at 11:30 am, and the life cycle of training model A is 2 h, that is, from 11:30 am to 1:30 pm. Then, if the current time is any time from 11:30 am to 1:30 pm, the first model storage module 103 can update or delete the stored training model A, or modify its stored description information. By modifying the description information of training model A, the first model storage module 103 can change the name and purpose of training model A, or extend or shorten its life cycle.
  • If the first model storage module 103 does not modify the stored life cycle of training model A between 11:30 am and 1:30 pm, then after 1:30 pm the life cycle of training model A ends, and the first model storage module 103 deletes the stored training model A and its description information.
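  The life-cycle bookkeeping in this example, where a model may be updated while its life cycle is running and is deleted once it ends, could look like the following sketch; the class and method names are illustrative, not from the patent:

```python
from datetime import datetime, timedelta

class ModelStore:
    """Minimal model store that deletes entries whose life cycle has ended."""

    def __init__(self):
        self._models = {}  # name -> (model, expiry time)

    def put(self, name, model, received_at, life_cycle: timedelta):
        # Expiry is the receipt time plus the configured life cycle.
        self._models[name] = (model, received_at + life_cycle)

    def sweep(self, now):
        """Delete every model whose life cycle ended at or before `now`."""
        expired = [n for n, (_, expiry) in self._models.items() if expiry <= now]
        for name in expired:
            del self._models[name]
        return expired

    def has(self, name):
        return name in self._models

store = ModelStore()
# Training model A received at 11:30 am with a 2 h life cycle.
store.put("model_A", object(), datetime(2020, 1, 1, 11, 30), timedelta(hours=2))
store.sweep(datetime(2020, 1, 1, 12, 0))   # still within 11:30 am-1:30 pm: kept
store.sweep(datetime(2020, 1, 1, 13, 30))  # life cycle ends at 1:30 pm: deleted
```

  In practice `sweep` would run periodically; updating the description information would simply rewrite the stored expiry.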
  • the first model storage module 103 is also used to store a preset model used for model training.
  • the first model storage module 103 is also used to send the training model received from the model training module 102 to the second model storage module 104.
  • the first model storage module 103 is further configured to send description information of the training model to the second model storage module 104, so that the model reasoning management module 105 uses the training model to perform model reasoning.
  • the second model storage module 104 is configured to obtain and store the training model sent by the first model storage module 103.
  • the second model storage module 104 is further configured to obtain and store the description information of the training model sent by the first model storage module 103.
  • the second model storage module 104 is also used to modify the description information of the stored training model.
  • the second model storage module 104 is also used to perform life cycle management on the training model according to the life cycle in the description information of the training model, and the duration of storing the training model.
  • the second model storage module 104 is also used to send the training model obtained from the first model storage module 103 to the model reasoning management module 105.
  • The second model storage module 104 and the model reasoning management module 105 are located in the first server 10. Specifically, if the first server 10 is a server in the private cloud, the second model storage module 104 and the model reasoning management module 105 are both located on the first server 10; if the first server 10 is a server cluster composed of multiple servers in the private cloud, the second model storage module 104 and the model reasoning management module 105 are located on the same server in the cluster, or on different servers in the cluster.
  • the model reasoning management module 105 is used to call the training model from the second model storage module 104 and publish the training model as a service.
  • the model reasoning management module 105 is also used for updating and deleting the training model in the second model storage module 104, and updating and deleting the service published by the training model corresponding to the training model.
  • the model reasoning management module 105 is also used to call the training model corresponding to the service after receiving the instruction to provide the service, input the input data into the training model for model reasoning to obtain the reasoning result, and send the reasoning result to the user terminal.
  • The model reasoning management module 105 is also used to prune the training model, for example by reducing the number of network layers or merging operators in a deep learning model, so as to speed up the inference process and improve the efficiency of model inference.
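  The serving flow described for the model reasoning management module 105 (publish a training model as a service, then, on request, call the model behind the service and return the inference result) can be sketched minimally; the registry shape and the stand-in model below are assumptions:

```python
class ModelInferenceManager:
    """Minimal registry mapping published services to training models."""

    def __init__(self):
        self._services = {}  # service name -> model (a callable here)

    def publish(self, service, model):
        # Publish a training model under a service name.
        self._services[service] = model

    def infer(self, service, input_data):
        """Call the training model behind `service` on `input_data`
        and return the inference result to the caller."""
        model = self._services[service]
        return model(input_data)

manager = ModelInferenceManager()
manager.publish("face_recognition", lambda x: x.upper())  # stand-in model
print(manager.infer("face_recognition", "alice"))  # ALICE
```

  Updating or deleting a service would amount to replacing or removing its registry entry.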
  • The accuracy of the inference results obtained by using the training model for model inference may decrease, that is, the inference effect of the training model may be poor, and the performance of the business system may also be affected.
  • To this end, this application proposes a model training method applied in the hybrid cloud scenario. The training model can be evaluated according to the inference results obtained by its model inference and the model evaluation indices, so as to monitor the inference effect of the training model and retrain a training model with a poor inference effect in time, thereby improving the accuracy of the inference results obtained by model inference and ensuring the performance of the business system (such as a face recognition system).
  • Therefore, as shown in FIG. 3, if the AI model training and inference system in the hybrid cloud scenario is divided into modules according to functions, the embodiment of the present application adds a model retraining management module 201 and a model evaluation module 202 on the basis of the foregoing modules.
  • If the second server 20 is a server in the public cloud, the data collection module 101, the model training module 102, the first model storage module 103, and the model retraining management module 201 are located on the second server 20.
  • If the second server 20 is a server cluster composed of multiple servers in the public cloud, the data collection module 101, the model training module 102, the first model storage module 103, and the model retraining management module 201 are located on the same server in the cluster, or on different servers in the cluster.
  • the model retraining management module 201 is configured to receive the retraining instruction for the training model sent by the model evaluation module 202.
  • In response to the retraining instruction it receives for the training model, the model retraining management module 201 is also used to instruct the data collection module 101 to obtain data for model retraining from the model evaluation module 202, and to add the obtained data as retraining samples to the retraining sample set sent to the model training module 102.
  • the model retraining management module 201 is further configured to instruct the model training module 102 to obtain a training model for model retraining from the first model storage module 103 in response to the received retraining instruction.
  • The data collection module 101 is also used to obtain the data for model retraining, that is, the input data and the inference results, from the model evaluation module 202 according to the instruction of the model retraining management module 201.
  • The input data is the data input into the training model for model inference, and the inference result is the result obtained by inputting the input data into the training model for model inference.
  • the data collection module 101 is also used to add the data obtained by it as a model retraining sample to the retraining sample set, and then send the retraining sample set to the model training module 102.
  • The data collection module 101 is also used to use an object storage service (OBS) to store the acquired data.
  • The model training module 102 is also used to obtain the retraining sample set for model retraining from the data collection module 101, and to obtain, according to the instruction of the model retraining management module 201, the training model to be retrained from the first model storage module 103.
  • The model training module 102 is also used to retrain the training model according to the retraining sample set it obtains, to obtain a retrained model that meets application requirements, and to replace the training model stored in the first model storage module 103 with the retrained model.
  • If the first server 10 is a server in the private cloud, the second model storage module 104, the model reasoning management module 105, and the model evaluation module 202 are all located on the first server 10.
  • If the first server 10 is a server cluster composed of multiple servers in the private cloud, the second model storage module 104, the model reasoning management module 105, and the model evaluation module 202 are located on the same server in the cluster, or on different servers in the cluster.
  • For the introduction of the model training module 102, the first model storage module 103, the second model storage module 104, and the model reasoning management module 105, refer to the above description; details are not repeated here.
  • the model evaluation module 202 is configured to evaluate the training model according to the inference result obtained by inputting the input data into the training model for model inference and the model evaluation index, and determine the evaluation result of the model evaluation index for the training model. Then, the model evaluation module 202 is also used to determine the inference effect of the training model according to the evaluation result of the model evaluation index and its corresponding preset threshold. The model evaluation module 202 is further configured to send a retraining instruction for the training model to the model retraining management module 201 according to the reasoning effect of the training model. The model evaluation module 202 is also used to send data for model retraining to the data collection module 101.
  • the model evaluation module 202 determines that the inference effect of the training model is poor, the model evaluation module 202 sends a retraining instruction for the training model to the model retraining management module 201. If the model evaluation module 202 determines that the inference effect of the training model is good, the model evaluation module 202 does not send a retraining instruction for the training model to the model retraining management module 201.
  • model evaluation module 202 can also be more specifically divided into an evaluation strategy configuration module 2021, a data collection module 2022, and a model evaluation index evaluation module 2023.
  • the evaluation strategy configuration module 2021 is used to configure the evaluation strategy of the training model, including configuring at least one model evaluation index used to evaluate the training model, the evaluation rules of the model evaluation indexes, the preset thresholds corresponding to the model evaluation indexes, the selection rules for the input data and inference results used to evaluate the model evaluation indexes, and the trigger strategy for retraining the training model.
  • the input data is the data input to the training model for model inference
  • the inference result is the prediction result obtained by inputting the input data into the training model for model inference. Therefore, there is a corresponding relationship between the input data and the inference result.
  • the evaluation strategy configuration module 2021 is also used to send the selection rules for the input data and inference results used to evaluate the model evaluation indexes, the at least one model evaluation index used to evaluate the training model, and the evaluation rules of the model evaluation indexes to the data collection module 2022, and to send the at least one model evaluation index used to evaluate the training model, the evaluation rules of the model evaluation indexes, the preset thresholds corresponding to the model evaluation indexes, and the trigger strategy for retraining the training model to the model evaluation index evaluation module 2023.
  • the data collection module 2022 is used to obtain input data and inference results, and send the obtained input data and inference results to the data collection module 101.
  • the data collection module 2022 is also used to obtain, from the evaluation strategy configuration module 2021, the selection rules for the input data and inference results used to evaluate the model evaluation indexes, the at least one model evaluation index used to evaluate the training model, the evaluation rules of the model evaluation indexes, and other information; to determine the input data and inference results for model evaluation based on this information; and to send the input data and inference results for model evaluation to the model evaluation index evaluation module 2023.
  • the model evaluation index evaluation module 2023 is configured to obtain the input data and inference results for model evaluation from the data collection module 2022, and to obtain, from the evaluation strategy configuration module 2021, the at least one model evaluation index used to evaluate the training model, the evaluation rules of the model evaluation indexes, the preset thresholds corresponding to the model evaluation indexes, and the trigger strategy for retraining the training model.
  • the model evaluation index evaluation module 2023 is also used to determine, from the input data and inference results for model evaluation, and according to the at least one model evaluation index used to evaluate the training model and the evaluation rules of the model evaluation indexes, the evaluation result of the at least one model evaluation index for the training model.
  • the model evaluation index evaluation module 2023 is also used to determine the inference effect of the training model according to the evaluation result of the model evaluation index and the preset threshold corresponding to the model evaluation index.
  • the model evaluation index evaluation module 2023 is also used to determine whether to send a retraining instruction for the training model to the model retraining management module 201 according to the trigger strategy for the retraining of the training model.
  • the trigger strategy for retraining the training model is: if the evaluation result of at least one model evaluation index used to evaluate the training model does not exceed its corresponding preset threshold, the model evaluation index evaluation module 2023 determines that the inference effect of the training model is poor, and sends a retraining instruction for the training model to the model retraining management module 201; if the evaluation results of the model evaluation indexes used to evaluate the training model all exceed their corresponding preset thresholds, the model evaluation index evaluation module 2023 determines that the inference effect of the training model is good, and does not send a retraining instruction for the training model to the model retraining management module 201.
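The threshold comparison in the trigger strategy above can be sketched in a few lines of Python. The metric names, threshold values, and the `should_retrain` helper are illustrative assumptions, not part of the claimed system:

```python
def should_retrain(evaluation_results, thresholds):
    """Return True if any evaluated index fails to exceed its preset threshold,
    i.e. the inference effect of the training model is judged to be poor."""
    return any(evaluation_results[name] <= thresholds[name]
               for name in evaluation_results)

# Hypothetical evaluation results and preset thresholds
results = {"accuracy": 0.70, "recall": 0.67}
thresholds = {"accuracy": 0.80, "recall": 0.60}
print(should_retrain(results, thresholds))  # accuracy fails its threshold -> True
```

When `should_retrain` returns True, the module 2023 would send the retraining instruction; otherwise it stays silent.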
  • the division is performed according to functional modules.
  • the system using hybrid cloud for AI model training and inference shown in FIG. 2 and FIG. 3 may also include a data labeling module, which is loosely coupled with other modules.
  • the data collection module 101 is also used to send the acquired data for model training/model retraining to the data labeling module after acquiring data for model training or model retraining. Subsequently, the data labeling module is used to label the received data used for model training/model retraining, and send the labelled data to the data collection module 101.
  • the data collection module 101 adds the labeled data obtained from the data labeling module to the training sample set/retraining sample set as training samples/retraining samples, so that the model training module 102 can use the data in the training sample set/retraining sample set to perform model training on the preset model to obtain a training model that meets the requirements, or to perform model retraining on the training model to obtain a retrained model.
  • the process of data labeling is to use a labeling tool to add a rectangular frame around the objects in the image, and then add labels to the objects in the rectangular frame, such as "a cat" or "a cell phone" and so on.
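A labeled sample produced by the step above might look like the following structure; the field names (`image`, `annotations`, `bbox`, `label`) are hypothetical, chosen only to illustrate the rectangular-frame-plus-label format:

```python
# One labeled image: each annotation pairs a rectangular frame with a label
labeled_sample = {
    "image": "photo_001.jpg",
    "annotations": [
        {"bbox": [40, 30, 200, 180], "label": "a cat"},         # x, y, width, height
        {"bbox": [220, 60, 90, 150], "label": "a cell phone"},
    ],
}

# The data collection module 101 can then treat each annotation as a training sample
labels = [a["label"] for a in labeled_sample["annotations"]]
print(labels)  # ['a cat', 'a cell phone']
```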
  • this application can also implement a closed data loop in the model training and inference system in the hybrid cloud scenario, that is, an end-to-end business closed loop of model inference, model evaluation, model retraining, retraining model delivery, and inference with the retrained model, so that this application can make use of the input data fed into the training model for inference and the inference results obtained by model inference. When the inference effect of the training model decreases due to environmental changes, changes in the input data, or other factors, the training model can be retrained in time, and the retrained model can then be used for model inference, thereby improving the inference effect of the training model, that is, the accuracy of the inference results, and ensuring the performance of the business system.
  • FIG. 4 shows the hardware structure of a chip provided by an embodiment of the application.
  • the chip includes a neural network processor 300.
  • the chip can be set in the first server 10 and/or the second server 20 shown in FIG. 1 to complete the work of each module shown in FIG. 2 or FIG. 3, including obtaining a training model through model training, using the training model for model reasoning, evaluating the training model, and obtaining a retraining model through model retraining.
  • the neural network processor (NPU) 300 is mounted as a coprocessor on a host CPU (central processing unit), and the host CPU distributes tasks.
  • the core part of the NPU 300 is the arithmetic circuit 303.
  • the controller 304 controls the arithmetic circuit 303 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 303 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit 303 fetches the data corresponding to the matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit 303.
  • the arithmetic circuit 303 takes the matrix A data and the matrix B from the input memory 301 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 308 (accumulator).
  • the vector calculation unit 307 can perform further processing on the output of the arithmetic circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 307 can be used for network calculations in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 307 can store the processed output vector to the unified memory 306.
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 307 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 303, for example, for use in a subsequent layer in a neural network.
  • the unified memory 306 is used to store input data and output data.
  • the direct memory access controller (DMAC) 305 is used to transfer the input data in the external memory into the input memory 301 and/or the unified memory 306, transfer the weight data in the external memory into the weight memory 302, and store the data in the unified memory 306 into the external memory.
  • the bus interface unit 310 (bus interface unit, BIU) is used to implement interaction between the main CPU, the DMAC, and the fetch memory 309 through the bus.
  • the instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304.
  • the controller 304 is used to call the instructions cached in the instruction fetch memory 309 to control the working process of the computing accelerator.
  • the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all on-chip (On-Chip) memories.
  • the external memory is a memory external to the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the main CPU and the NPU work together to implement the corresponding algorithms for the functions required by the first server 10 and the second server 20 in FIG. 1; the calculations of the modules in the system shown in FIG. 2 or FIG. 3 can be executed by the arithmetic circuit 303 or the vector calculation unit 307.
  • the first server 10 and the second server 20 in FIG. 1 introduced above can perform each step of the model training method in the embodiment of the present application, and the chip shown in FIG. 4 can also be used to perform the model training method in the embodiment of the present application.
  • the present application also provides a model training device.
  • the model training device 400 includes one or more processors, such as a processor 401 and/or a processor 407, at least one communication interface, such as a communication interface 404, and a communication line 402.
  • the model training device 400 may further include a memory 403. The following description takes the processor 401 as an example.
  • the processor 401 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or one or more integrated circuits that integrate multiple processing circuit functions (such as CPU+ASIC).
  • the communication line 402 may include one or more paths for connecting different components.
  • the communication interface 404 may be a transceiver circuit for communicating with other devices or communication networks, such as a cloud computing network, Ethernet, a radio access network (RAN), a wireless local area network (WLAN), etc.
  • the transceiver circuit may be a device such as a transceiver.
  • the communication interface 404 may also be an input/output (I/O) circuit of the processor 401 to implement signal input and signal output of the processor 401.
  • the memory 403 may be a device having a storage function. For example, it can be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
  • the memory 403 may exist independently and is connected to the processor 401 through a communication line 402. Of course, the memory 403 may also be integrated with the processor 401.
  • the memory 403 is used to store computer execution instructions for executing the solution of the present application, and the processor 401 controls the execution.
  • the processor 401 is configured to read and execute computer instructions (for example, for a CPU) or configuration files (for example, for an FPGA) stored in the memory 403, so as to implement the model training method provided in the embodiment of the present application.
  • the processor 401 may also execute the relevant processing functions in the model training method provided in the following embodiments of the present application, and the communication interface 404 is responsible for communicating with other devices or communication networks.
  • the embodiment does not specifically limit this.
  • the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
  • the processor 401 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 5.
  • the model training apparatus 400 may also include multiple processors, such as the processor 401 and the processor 407 in FIG. 5. Each of these processors can be a single-CPU (single-CPU) processor or a multi-core (multi-CPU) processor.
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
  • the model training apparatus 400 may further include an output device 405 and an input device 406.
  • the output device 405 communicates with the processor 401, and can output information in a variety of ways.
  • the output device 405 may be a touch screen, a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, a printer, etc.
  • the input device 406 communicates with the processor 401, and can receive user input in a variety of ways.
  • the input device 406 may be a mouse, a keyboard, a touch screen device, a sensor device, or the like.
  • the above-mentioned model training device 400 may sometimes be called a training device, which may be a general-purpose device or a special-purpose device.
  • the training device may be a client, a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device with a similar structure.
  • the aforementioned model training apparatus 400 may also be a software and/or hardware entity provided in each of the aforementioned single devices, such as a chip or a chip system used to perform the tasks provided in the embodiments of the present application.
  • the embodiment of the present application does not limit the type of the model training device 400.
  • FIG. 5 is only a simplified schematic diagram of an example for ease of understanding, and the model training device may also include other components, circuits, or devices, which are not shown in FIG. 5.
  • the model training device 400 shown in FIG. 5 can execute the model training method shown in FIG. 6.
  • This application also provides a model training method, which is used in the system described in FIG. 1 above.
  • the model training method involved in this application will be introduced below with reference to FIG. 6.
  • the method mainly includes steps S601-S608:
  • the first server obtains the first training model from the second server.
  • the first training model is obtained by the second server performing model training for the preset model.
  • the preset model may be a model stored in advance in the system, or a model determined according to the requirements of the application scenario of the required first training model.
  • in the field of natural language processing (NLP), users usually use training sample sets obtained by themselves to perform model training on certain benchmark models, such as machine translation models and sentiment analysis models, to obtain a required training model; the preset model referred to in the embodiment of the present application is such a benchmark model.
  • the preset model may be a machine translation model
  • the preset model is an emotion analysis model.
  • the second server uses the acquired training data to perform model training on the preset model to obtain the first training model, then stores the first training model and the description information of the first training model, and delivers the first training model and the description information of the first training model to the first server.
  • the description information of the first training model includes the name, purpose, and life cycle of the first training model.
  • the description information of the first training model includes the name of the first training model, such as a face recognition model, the purpose of the first training model, such as face recognition, and the life cycle of the first training model, such as 1h.
  • the description information of the first training model includes the storage time of the first training model, for example 11:30 am, and the life cycle of the first training model, for example 0.5 h; the first training model is then retained for 0.5 h after being stored.
  • the first server or the second server may modify the stored description information of the first training model, and may also delete or update the first training model according to the life cycle of the first training model.
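The description information and the life-cycle-based deletion/update described above can be sketched as follows; the field names and the expiry rule are illustrative assumptions based on the 11:30 am / 0.5 h example, not a definitive implementation:

```python
from datetime import datetime, timedelta

# Hypothetical description information for the first training model
description = {
    "name": "face recognition model",
    "purpose": "face recognition",
    "stored_at": datetime(2020, 1, 1, 11, 30),   # storage time, e.g. 11:30 am
    "life_cycle": timedelta(hours=0.5),          # life cycle, e.g. 0.5 h
}

def is_expired(desc, now):
    """A stored model becomes a candidate for deletion or update
    once its life cycle has elapsed since its storage time."""
    return now >= desc["stored_at"] + desc["life_cycle"]

print(is_expired(description, datetime(2020, 1, 1, 12, 0)))   # expired at 12:00
print(is_expired(description, datetime(2020, 1, 1, 11, 45)))  # still valid at 11:45
```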
  • the first server inputs the input data into the first training model for model inference, and obtains the inference result.
  • after receiving the service invocation request sent by the user terminal and the input data used for model inference, the first server determines the first training model corresponding to the service through the service published by the first server. Then, the first server inputs the input data used for model inference, such as user input or locally stored data, into the first training model for model inference, obtains the inference result, and sends the inference result to the corresponding user terminal.
  • for example, after the above step S602, the first server obtains the face recognition model sent by the second server, stores the face recognition model, and publishes the face recognition model as a face recognition service.
  • the user terminal sends a service call request for the face recognition service, together with the picture, to the first server; the first server determines the face recognition model corresponding to the face recognition service that needs to be called, and inputs the picture into the face recognition model for model reasoning to determine the reasoning result, for example, that the person in the picture is A.
  • the first server sends this inference result, that is, the person in the picture is A, to the user terminal.
  • the type of model reasoning can be divided into batch reasoning and real-time reasoning.
  • the data obtained in real time is input into the training model for model inference, and the process of obtaining the inference result is the process of real-time inference.
  • the process of inputting multiple pre-stored data into the training model for model inference to obtain multiple inference results is the process of batch inference, and there is a correspondence between multiple inference results and multiple data.
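The distinction above between real-time inference (one piece of data arriving now, one result) and batch inference (many pre-stored inputs, one result per input, with the correspondence preserved) can be sketched as follows; `model` is a stand-in callable, not the actual first training model:

```python
def real_time_inference(model, sample):
    """Real-time inference: data obtained in real time is fed to the
    training model and a single inference result is returned."""
    return model(sample)

def batch_inference(model, samples):
    """Batch inference: multiple pre-stored data items are fed to the
    training model, yielding one inference result per input and keeping
    the correspondence between inputs and results."""
    return [model(s) for s in samples]

model = lambda x: x * 2          # placeholder for the training model
print(real_time_inference(model, 3))
print(batch_inference(model, [1, 2, 3]))
```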
  • the first server receives the input data, that is, the data sent by the user terminal in real time, and the service call request. Subsequently, the first server invokes the inference service corresponding to the service call request through an application programming interface (API) according to the received service call request, and then inputs the input data into the first training model corresponding to the inference service for model reasoning to get the reasoning result.
  • for example, after the gate captures a face image through a camera, it sends the face image, together with a call request for the face recognition service, to the first server through the API between the gate and the first server.
  • the first server calls the corresponding face recognition model according to the call request of the face recognition service (for example, a call request in the form of http), uses the face recognition model to recognize the face image, and sends the recognition result to the gate, so that the gate can be opened or kept closed according to the recognition result.
  • the first server receives the input data and the service call request, where the input data is pre-stored data or path information of the pre-stored input data (for example, a network file system (NFS) address or a file transfer protocol (FTP) address). Subsequently, the first server obtains the pre-stored input data according to the path information, calls the corresponding first training model according to the service call request, and inputs the obtained pre-stored data into the first training model for model inference to obtain the inference result.
  • for example, the user terminal sends a call request of the face recognition service and the path information of the input data (for example, the NFS address) to the first server; the first server obtains, according to the NFS address, the pictures stored in the folder that the address points to, as well as the face recognition model corresponding to the face recognition service. Taking 10 pictures stored in the folder as an example, the first server inputs these 10 pictures into the face recognition model for model inference and obtains 10 inference results.
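A minimal sketch of this resolve-path-then-batch-infer step, assuming the path resolves to a local folder; the `batch_infer_from_path` helper and the placeholder model (which just measures input size) are illustrative, not the actual face recognition model:

```python
import os
import tempfile

def batch_infer_from_path(path, recognize):
    """Load every pre-stored input under `path` and run model inference
    on each, keeping the file-name -> inference-result correspondence."""
    results = {}
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name), "rb") as f:
            results[name] = recognize(f.read())
    return results

# Demo: create a folder of fake "pictures" and batch-infer over it
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        with open(os.path.join(d, f"img_{i}.jpg"), "wb") as f:
            f.write(b"x" * (i + 1))
    out = batch_infer_from_path(d, len)   # len() stands in for the model
print(out)
```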
  • the first server inputs the input data into the first training model for model inference, and after obtaining the inference result, stores the input data and the inference result.
  • the first server may also send the acquired input data and the inference result to the second server, so that the second server can retrain the first preset model according to the received input data and the inference result.
  • the first server evaluates the first training model according to the inference result and the model evaluation index, and obtains the evaluation result of the model evaluation index.
  • model evaluation indicators include at least one of the accuracy of the inference result, the precision of the inference result, the recall of the inference result, the F1 score (F1-Score) of the inference result, and the area under the receiver operating characteristic (ROC) curve of the inference result, i.e., the AUC.
  • model evaluation indicators may also include indicators such as the mean absolute error (MAE) and the mean squared error (MSE).
  • the above model evaluation indicators such as accuracy, precision, recall, F1-Score, and AUC are mainly used for the evaluation of binary classification models, while indicators such as MAE and MSE are mainly used for the evaluation of regression models (such as a face recognition model).
  • the model evaluation index used to evaluate the first training model is determined according to the current application scenario, that is, the purpose of the first training model.
  • the recall rate of the inference result can be set as a model evaluation index for evaluating the inference effect of the first training model.
  • the user can also set, according to the needs of the application scenario, the model evaluation indexes used to evaluate the first training model to be the accuracy of the inference result, the recall of the inference result, and the F1-Score of the inference result.
  • the numbers on 60 of the 100 cards are odd, that is, there are 60 positive samples, and the numbers on 40 cards are even, that is, there are 40 negative samples.
  • the first training model is used to predict the numbers in these 100 cards, that is, 100 inferences are performed, and 100 inference results are generated.
  • among the positive samples, the inference results corresponding to 40 cards are correct and 20 inference results are wrong; among the negative samples, the inference results corresponding to 30 cards are correct and 10 inference results are wrong.
  • therefore, the number of positive samples predicted as positive (TP) is 40, the number of positive samples predicted as negative (FN) is 20, the number of negative samples predicted as positive (FP) is 10, and the number of negative samples predicted as negative (TN) is 30.
  • the accuracy of the inference result, Accuracy, is (TP+TN)/(TP+FN+FP+TN), which is 70%
  • the precision of the inference result, Precision, is TP/(TP+FP), i.e. 80%
  • the recall of the inference result, Recall, is TP/(TP+FN), i.e. 2/3
  • the F1 score of the inference result, F1-Score, is the harmonic mean of the precision and the recall, 2*Precision*Recall/(Precision+Recall), which is 8/11
  • ROC and AUC are shown in Figure 7, and the coordinates of a point A on the ROC are (1/4, 2/3).
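The card example can be checked by computing the indicators directly from the confusion-matrix counts given in the text (TP=40, FN=20, FP=10, TN=30); this is plain arithmetic, not the system's evaluation module:

```python
TP, FN, FP, TN = 40, 20, 10, 30

accuracy  = (TP + TN) / (TP + FN + FP + TN)               # 70%
precision = TP / (TP + FP)                                # 80%
recall    = TP / (TP + FN)                                # 2/3
f1        = 2 * precision * recall / (precision + recall) # 8/11

# coordinates of point A on the ROC curve: (false positive rate, true positive rate)
fpr = FP / (FP + TN)                                      # 1/4
tpr = recall                                              # 2/3
print(accuracy, precision, recall, f1, (fpr, tpr))
```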
  • the model evaluation of the first training model may be periodic, for example, according to a preset time interval, model evaluation is performed using the inference results of real-time inference within the first preset time interval to obtain the first training model The evaluation result of the model evaluation index. Then, the model evaluation is performed again according to the inference result of the real-time inference in the second preset time interval, and the evaluation result of the model evaluation index of the first training model is obtained again.
  • the number of real-time inferences performed using the first training model is 20 times.
  • taking a preset time interval of 40 min as an example, use the inference results of real-time inference between 10:00 am and 10:40 am to perform a model evaluation and obtain the evaluation results of the model evaluation indicators of the first training model; then use the inference results of real-time inference between 10:40 am and 11:20 am to perform model evaluation and once again obtain the evaluation result of the model evaluation index of the first training model; finally, use the inference results of real-time inference between 11:20 am and 12:00 pm to perform model evaluation and once again obtain the evaluation result of the model evaluation index of the first training model.
  • alternatively, the input data of batch reasoning or real-time reasoning and the inference results corresponding to that input data are evaluated according to a preset interval of inference counts, and the evaluation result of the model evaluation index of the first training model is obtained.
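Evaluating every preset number of inferences, as described above, can be sketched with a small buffer that hands a batch of (input data, inference result) pairs to model evaluation once the count interval is reached; the `EvaluationBuffer` class is an illustrative assumption, not part of the claimed system:

```python
class EvaluationBuffer:
    """Collect (input, result) pairs and release a batch for model
    evaluation every `every_n` inferences."""
    def __init__(self, every_n):
        self.every_n = every_n
        self.pairs = []                  # pending (input data, inference result) pairs

    def record(self, x, y):
        self.pairs.append((x, y))
        if len(self.pairs) >= self.every_n:
            batch, self.pairs = self.pairs, []
            return batch                 # hand this batch to model evaluation
        return None                      # not enough inferences yet

buf = EvaluationBuffer(every_n=3)
batches = [b for b in (buf.record(i, i * 2) for i in range(7)) if b]
print(len(batches), len(buf.pairs))      # two full batches, one pair still pending
```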
  • the first server configures the evaluation strategy of the first training model, including configuring at least one model evaluation index used to evaluate the training model, the evaluation rules of the model evaluation indexes, the preset thresholds corresponding to the model evaluation indexes, the selection rules for the input data and inference results used to evaluate the model evaluation indexes, and the trigger strategy for retraining the training model.
  • the first server may also configure the storage path of the input data and the inference result.
  • the retraining instruction for the first training model is used to instruct the second server to perform model retraining on the first training model.
  • the first server determines that the inference effect of the first training model is poor, and sends a retraining instruction for the first training model to the second server. If the evaluation results of the model evaluation indexes all exceed their corresponding preset thresholds, the first server determines that the inference effect of the first training model is good and does not need to be updated; therefore, the first server does not send a retraining instruction for the first training model to the second server.
  • the preset threshold corresponding to the evaluation result of the model evaluation index may be preset according to the current application scenario, or may be preset by the user.
  • take, for example, the case in which the model evaluation index includes at least one of the accuracy of the inference result, the precision of the inference result, the recall of the inference result, the F1-Score of the inference result, and the area under the ROC curve of the inference result (AUC).
  • the first server determines that the inference effect of the first training model is poor, and sends a retraining instruction for the first training model to the second server.
  • the first server determines that the inference effect of the first training model is good, and does not send a retraining instruction for the first training model to the second server.
  • if the evaluation results of all the model evaluation indexes do not exceed their corresponding preset thresholds, the first server determines that the inference effect of the first training model is poor, and sends a retraining instruction for the first training model to the second server. If the evaluation result of at least one model evaluation index exceeds its corresponding preset threshold, the first server determines that the inference effect of the first training model is good and the model does not need to be updated; therefore, the first server does not send a retraining instruction for the first training model to the second server.
  • the user can further configure, according to the needs of the application scenario, the trigger condition for the first server to send the retraining instruction to the second server.
  • the first server sends a retraining instruction for the first training model to the second server. It is understood that the first server sending the retraining instruction for the first training model means that the retraining of the first training model is started.
  • the user sets three model evaluation indicators for evaluating the first training model according to the needs of the business scenario: the accuracy of the inference result, the recall of the inference result, and the F1-Score of the inference result. The user can then determine, according to the needs of the business scenario, the condition that triggers the first server to send the retraining instruction to the second server: if at least one of the evaluation results of the accuracy of the inference result and the F1-Score of the inference result does not exceed its corresponding preset threshold, and the evaluation result of the recall of the inference result does not exceed its corresponding preset threshold, the first server sends a retraining instruction for the first training model to the second server.
  • the evaluation result of the accuracy of the inference result is a
  • the evaluation result of the F1-Score of the inference result is b
  • the evaluation result of the recall rate of the inference result is c
  • the preset thresholds corresponding to a, b, and c are A, B, and C, respectively.
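  Reading "does not exceed" as less-than-or-equal, the trigger condition of this example can be sketched as follows (the helper name `should_retrain` is illustrative, not from the application):

```python
def should_retrain(a, b, c, A, B, C):
    """Trigger retraining when at least one of accuracy (a) and F1-Score (b)
    does not exceed its threshold AND recall (c) does not exceed its threshold."""
    return (a <= A or b <= B) and c <= C
```

  For instance, with thresholds A = B = 0.85 and C = 0.6, an accuracy of 0.8 and a recall of 0.5 would trigger retraining, whereas a recall above 0.6 would not, whatever the other two results are.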
  • the first server can periodically or aperiodically obtain the inference results for model evaluation and, according to the pre-configured evaluation strategy of the training model and the pre-configured retraining trigger strategy of the training model, complete the evaluation of the inference effect of the training model, and then determine whether or not to trigger the retraining of the training model. In this way, the inference effect of the training model is monitored, and a training model with a poor inference effect is retrained in time to ensure the performance of the business system.
  • the second server obtains the input data and the inference result from the first server.
  • the input data is data that the first server inputs into the first training model for model inference
  • the inference result is the result obtained by the first server after inputting the input data into the first training model for model inference
  • in response to the retraining instruction received from the first server, the second server sends to the first server an acquisition request for the input data and the inference results corresponding to the input data; the acquisition request requests the first server to send, to the second server, the input data used for model inference in the first training model and the inference results corresponding to that input data. The first server then sends data to the second server in response to the acquisition request; the data includes the input data that the first server inputs into the first training model, and the inference results obtained after inputting the input data into the first training model and performing model inference.
  • the first server sends the input data and the inference result corresponding to the input data to the second server while sending the retraining instruction.
  • the second server may first perform step S605 after step S602, and then perform step S603 and step S604. At this time, step S605 may be performed periodically.
  • the second server periodically obtains, from the first server at a preset time interval or for a preset number of times, the input data used for model evaluation and the inference results corresponding to the input data.
  • the first server can periodically send the input data used for model evaluation and the inference result corresponding to the input data to the second server according to the preset time interval or the preset number of times.
  • in step S604, the second server obtains the input data and inference results for model evaluation in response to the received retraining instruction, in order to perform the retraining of the first training model; compared with the first server continuously sending the input data and inference results of the first training model, this can save resources for data transmission.
  • the first server and the second server may not perform step S603 and step S604, but directly perform step S605 after step S602, and the first server periodically or aperiodically sends a retraining instruction to the second server, so that the second server retrains the first training model according to the retraining instruction, using the input data and the inference results corresponding to the input data, to determine the second training model.
  • this technical solution of directly instructing the second server to perform model retraining without performing model evaluation can make the retrained model match the input data in the current environment well, thereby achieving a better inference effect.
  • however, the model may be retrained too frequently, or may be retrained when the inference effect of the existing training model is already good, which causes a heavy computational burden and unnecessary consumption of software and hardware resources; moreover, frequently replacing the training model used for model inference in the business system may make the business system unstable and affect its performance.
  • by contrast, sending the retraining instruction to the second server according to the model's inference effect, so that the model is retrained only when needed, can greatly reduce the waste of software and hardware resources such as bandwidth, and ensure the stable operation of the business system.
  • the second server determines a retraining sample set according to the input data and the inference result.
  • in response to the received retraining instruction for the first training model, the second server adds the input data of the first training model obtained from the first server and the inference results corresponding to the input data to the retraining sample set for model retraining; the retraining sample set stores the training data used for model retraining.
  • after obtaining the input data and the inference results corresponding to the input data, the second server first annotates the input data to obtain annotated input data, and then adds the annotated input data and the inference results to the retraining sample set, obtaining a retraining sample set for model retraining.
  • the second server screens or modifies the input data and the inference results according to whether the inference results are correct. For a correct inference result, the second server adds the inference result and its corresponding input data to the retraining sample set, or, after annotating the input data corresponding to the correct inference result, stores the annotated input data and the corresponding inference result in the retraining sample set. For an incorrect inference result, the second server deletes the inference result and its corresponding input data, or modifies the wrong inference result into the correct inference result and adds the modified inference result and the input data (or the annotated input data) to the retraining sample set, obtaining the retraining sample set for model retraining.
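  The screening step above can be sketched as follows. This is a minimal illustration, not the application's implementation: records are assumed to carry a ground-truth label (obtained, e.g., by annotation) against which correctness is judged, and the `policy` parameter chooses between the two treatments of wrong results described above (dropping them, or repairing them to the correct label):

```python
def build_retrain_sample_set(records, policy="correct"):
    """records: iterable of dicts with keys 'input', 'result', 'truth'.
    A record is correct when result == truth. Wrong records are either
    dropped ('drop') or repaired to the ground-truth label ('correct')."""
    sample_set = []
    for r in records:
        if r["result"] == r["truth"]:
            # correct inference result: keep it as a training sample
            sample_set.append({"input": r["input"], "label": r["result"]})
        elif policy == "correct":
            # wrong inference result: replace it with the correct label
            sample_set.append({"input": r["input"], "label": r["truth"]})
        # policy == 'drop': skip wrong records entirely
    return sample_set
```

  The resulting list plays the role of the retraining sample set passed to model retraining in step S606.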
  • the second server performs model retraining on the first training model according to the retraining sample set, and determines the second training model.
  • the second server uses the retraining sample set determined in step S606 to perform model retraining on the first training model to obtain the second training model. Subsequently, the second server replaces the stored first training model with the second training model.
  • the second server further stores the description information of the second training model. For specific description of the description information, please refer to the above content, which will not be repeated here.
  • S608 The second server sends the second training model to the first server.
  • the second server sends, to the first server, the second training model obtained by retraining the first training model based on the retraining sample set determined in step S606, together with the description information of the second training model.
  • the first server replaces the stored first training model with the second training model, and stores the description information of the second training model.
  • the first server also publishes the second training model as a service, so that the user can use the service to call the second training model for model inference.
  • the first server deletes the service corresponding to the first training model, or replaces the service corresponding to the first training model with the service corresponding to the second training model.
  • the second server uses the correct inference results and the input data corresponding to the correct inference results to retrain the first training model and obtain the second training model, which can improve the inference effect of the second training model, so as to ensure the performance of the business system where the second training model is located.
  • the retraining instruction sent by the first server to the second server may be for a preset model, so that the second server retrains the preset model according to the retraining sample set determined in step S606 to obtain the second training model.
  • the second server replaces the stored first training model with the second training model, and stores the description information of the second training model.
  • for the preset model, please refer to the above content, which will not be repeated here.
  • the second training model determined by retraining the preset model is more suitable for the current scenario, but its generalization ability is poorer, that is, the inference effect when previous data is used as the input data of the second training model cannot be guaranteed.
  • the present application provides a model training method applied in a hybrid cloud scenario, which can evaluate the model evaluation indexes of the first training model and determine its inference effect, thereby monitoring the inference effect of the first training model. When the inference effect of the first training model is poor, a retraining instruction for model retraining is sent to the second server, so that the second server can retrain the model in time according to the inference effect of the trained model, determine a training model with a better inference effect, improve the accuracy of the inference results, and ensure the performance of the business system.
  • the present application also provides a model training device applied to a system of a first server and a second server, where the first server is located in a private cloud and is used for model inference, and the second server is located in a public cloud and is used for model training.
  • the device can be used to execute the steps performed by the first server in the foregoing method embodiment.
  • the device includes an acquisition unit 801, an inference unit 802, an evaluation unit 803, and a sending unit 804.
  • the obtaining unit 801 is configured to obtain the first training model from the second server.
  • the inference unit 802 is configured to input the input data into the first training model for model inference, and obtain the inference result.
  • the evaluation unit 803 is configured to evaluate the first training model according to the model evaluation index according to the inference result, and obtain the evaluation result of the model evaluation index.
  • the model evaluation indicators include at least one of the following: the accuracy of the inference result, the precision of the inference result, the recall of the inference result, the F1-Score of the inference result, and the area AUC under the receiver operating characteristic curve ROC of the inference result.
  • the sending unit 804 is configured to send a retraining instruction for the first training model to the second server if the evaluation result of at least one model evaluation index does not exceed its corresponding preset threshold.
  • the retraining instruction is used to instruct the second server to retrain the first training model.
  • the sending unit 804 is further configured to send input data and inference results to the second server.
  • the input data and the inference result are used to retrain the first training model.
  • the sending unit 804 is further configured to not send a retraining instruction for the first training model to the second server if the evaluation results of the model evaluation indicators all exceed their corresponding preset thresholds.
  • the present application also provides a model training device applied to a system of a first server and a second server, where the first server is located in a private cloud and is used for model inference, and the second server is located in a public cloud and is used for model training.
  • the device can be used to execute the steps performed by the second server in the foregoing method embodiment.
  • the device includes an acquiring unit 901, a determining unit 902, a sending unit 903, and a processing unit 904.
  • the obtaining unit 901 is configured to obtain retraining instructions, input data, and inference results for the first training model from the first server.
  • the retraining instruction is used to instruct the second server to retrain the first training model
  • the input data is the data that the first server inputs into the first training model for model inference
  • the inference result is the result obtained by the first server after inputting the input data into the first training model for model inference.
  • the obtaining unit 901 is specifically configured to obtain input data and inference results in response to a retraining instruction received from the first server.
  • the determining unit 902 is configured to determine a retraining sample set according to the input data and the inference result.
  • the processing unit 904 is configured to: if the inference result is correct, retain the inference result and the input data corresponding to the inference result; if the inference result is incorrect, delete the inference result and the input data corresponding to the inference result, or replace the inference result with the correct inference result corresponding to the input data.
  • the determining unit 902 is specifically configured to annotate the input data, obtain the annotated input data, and store the annotated input data and the inference result in the retraining sample set.
  • the determining unit 902 is further configured to retrain the first training model according to the retraining sample set, and determine the second training model. Wherein, the second training model is used to replace the first training model.
  • the sending unit 903 is configured to send the second training model to the first server.
  • the embodiment of the present application also provides a computer-readable storage medium on which instructions are stored; when executed by a processor, the instructions cause the methods in the foregoing method embodiments to be performed.
  • the embodiments of the present application also provide a computer program product containing instructions, which when executed by a processor on a computer, cause the computer to execute the method in the foregoing method embodiment.
  • the embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit.
  • the transceiver unit may be an input/output circuit or a communication interface;
  • the processing unit is a processor, microprocessor, or integrated circuit integrated on the chip.
  • the chip can execute the method in the above method embodiment.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • the above-mentioned embodiments may appear in the form of a computer program product in whole or in part, and the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • Computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division; in actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate parts may or may not be physically separated, and the parts displayed as a unit may be one physical unit or multiple physical units, that is, they may be located in one place or distributed to many different places. In the application process, some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a device (which may be a personal computer, a server, a network device, a single-chip microcomputer, or a chip, etc.) or a processor execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, and other media that can store program code.


Abstract

A model training method and apparatus, relating to the field of communication technology, applied to a system including a first server (10) and a second server (20). The first server (10) is located in a private cloud and is used for model inference; the second server (20) is located in a public cloud and is used for model training. A training model is evaluated according to the inference results and the model evaluation indicators, so as to monitor the inference effect of the training model, retrain the training model in time, improve the accuracy of the inference results, and ensure the performance of the business system. The method includes: the first server (10) obtains a first training model from the second server (20) (S601), and inputs input data into the first training model for inference to obtain an inference result (S602); then, according to the inference result, the first training model is evaluated against the model evaluation indicators to obtain evaluation results of the indicators (S603); if the evaluation result of at least one model evaluation indicator does not exceed its corresponding preset threshold, a retraining instruction for the first training model is sent to the second server (20) (S604).

Description

Model training method and apparatus

This application claims priority to the Chinese patent application No. 202010049320.1, entitled "Model training method and apparatus", filed with the China National Intellectual Property Administration on January 16, 2020, the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of communication technology, and in particular to a model training method and apparatus.

Background

Data, algorithms, and computing power are the three elements of artificial intelligence (AI). The growing richness of data collection means allows AI chips to obtain more and more data at lower cost, and continuous breakthroughs in AI algorithms make AI chips compute faster and faster. As a result, the computing power of AI chips keeps increasing, and AI is increasingly applied in real life. In the prior art, a hybrid cloud is usually used for AI model training and inference, that is, an "online training, offline inference" mode: the training data is first uploaded to an online training platform in the public cloud for model training to determine a training model that meets the requirements; the training model is then pushed down to an offline inference platform in the private cloud, which publishes it as a service for inference. This implementation can make maximum use of the computing power of the public cloud for model training and inference while using the private cloud to guarantee the security of user data.

In an actual production environment, after a hybrid cloud is used for model training and a training model has been determined, changes in the input data of the training model may reduce the accuracy of the inference results obtained with that model. Existing AI model training approaches in hybrid cloud scenarios cannot perceive the drop in inference accuracy of the training model in time, which may cause frequent false alarms or unavailability of the business system (for example, a face recognition system). For example, in a video surveillance scenario for security, the aging, replacement, or repositioning of a checkpoint camera (for example, a camera at the entrance of a residential community) may change the clarity and angle of the video it captures, that is, the input data of the training model changes. Performing inference with the changed input data and the training model may greatly reduce the accuracy of subsequent inference results and affect the normal use of the video surveillance system.

Summary

This application provides a model training method and apparatus. In a hybrid cloud scenario, a training model is evaluated according to its inference results to determine the evaluation results of the model evaluation indicators of the training model, so as to monitor the inference effect of the training model. The training model can then be retrained in time according to its inference effect to determine a training model with a better inference effect, improve the correctness of the inference results, and ensure the performance of the business system.

To achieve the above purpose, this application adopts the following technical solutions:

In a first aspect, an embodiment of this application provides a model training method applied to a system including a first server and a second server, where the first server is located in a private cloud and is used for model inference, and the second server is located in a public cloud and is used for model training. The method includes: the first server obtains a first training model from the second server, inputs input data into the first training model for model inference, and obtains an inference result. The first server then evaluates the first training model against the model evaluation indicators according to the inference result, and obtains the evaluation results of the indicators. Finally, if the evaluation result of at least one model evaluation indicator does not exceed its corresponding preset threshold, the first server sends a retraining instruction for the first training model to the second server, where the retraining instruction instructs the second server to retrain the first training model.

In summary, the first server can determine the inference effect of the first training model by evaluating its model evaluation indicators, thereby monitoring the inference effect of the first training model. When the inference effect of the first training model is poor, the first server sends a retraining instruction for model retraining to the second server, so that the second server can retrain the training model in time according to its inference effect, determine a training model with a better inference effect, improve the accuracy of the inference results, and ensure the performance of the business system.

In a possible implementation, after the first server inputs the input data into the first training model for model inference and obtains the inference result, the method further includes: the first server sends the input data and the inference result to the second server, so as to close the data loop in the model training and inference system in the hybrid cloud scenario. In this way, this application can use the input data used for inference and the inference results obtained by model inference to retrain the training model, improving the inference effect of the training model, that is, the accuracy of the inference results, and ensuring the performance of the business system. The input data and the inference result are used to retrain the first training model.

In a possible implementation, the model evaluation indicators include at least one of the following: the accuracy of the inference result, the precision of the inference result, the recall of the inference result, the F1-Score of the inference result, and the area AUC under the receiver operating characteristic curve ROC of the inference result.

In a possible implementation, if the evaluation results of the model evaluation indicators all exceed their corresponding preset thresholds, the first server does not send a retraining instruction for the first training model to the second server.
In a second aspect, an embodiment of this application provides a model training method applied to a system including a first server and a second server, where the first server is located in a private cloud and is used for model inference, and the second server is located in a public cloud and is used for model training. The method includes: the second server obtains, from the first server, a retraining instruction for a first training model, input data, and an inference result, where the retraining instruction instructs the second server to retrain the first training model, the input data is the data the first server inputs into the first training model, and the inference result is the result the first server obtains after inputting the input data into the first training model for model inference. The second server then determines a retraining sample set according to the input data and the inference result, retrains the first training model according to the retraining sample set, and determines a second training model, where the second training model is used to replace the first training model. Finally, the second server sends the second training model to the first server.

In a possible implementation, the second server obtaining the retraining instruction for the first training model, the input data, and the inference result from the first server specifically includes: the second server obtains the input data and the inference result in response to the retraining instruction received from the first server.

In a possible implementation, the second server determining the retraining sample set according to the input data and the inference result specifically includes: the second server annotates the input data to obtain annotated input data, and then stores the annotated input data and the inference result into the retraining sample set.

In a possible implementation, before the second server annotates the input data to obtain the annotated input data, the method further includes: if the inference result is correct, the second server retains the inference result and the input data corresponding to it; if the inference result is incorrect, the second server deletes the inference result and its corresponding input data, or replaces the inference result with the correct inference result corresponding to the input data.

In a third aspect, this application further provides a model training apparatus serving as a first server, applied to a system of a first server and a second server, where the first server is located in a private cloud and is used for model inference, and the second server is located in a public cloud and is used for model training. The apparatus, serving as the first server, includes an obtaining unit, an inference unit, an evaluation unit, and a sending unit. The obtaining unit is configured to obtain a first training model from the second server. The inference unit is configured to input input data into the first training model for model inference and obtain an inference result. The evaluation unit is configured to evaluate the first training model against the model evaluation indicators according to the inference result, and obtain the evaluation results of the indicators. The sending unit is configured to send a retraining instruction for the first training model to the second server if the evaluation result of at least one model evaluation indicator does not exceed its corresponding preset threshold, where the retraining instruction instructs the second server to retrain the first training model.

In a possible implementation, the model evaluation indicators include at least one of the following: the accuracy of the inference result, the precision of the inference result, the recall of the inference result, the F1-Score of the inference result, and the area AUC under the receiver operating characteristic curve ROC of the inference result.

In a possible implementation, the sending unit is further configured to send the input data and the inference result to the second server, where the input data and the inference result are used to retrain the first training model.

In a possible implementation, the sending unit is further configured not to send a retraining instruction for the first training model to the second server if the evaluation results of the model evaluation indicators all exceed their corresponding preset thresholds.

In a fourth aspect, this application further provides a model training apparatus serving as a second server, applied to a system of a first server and a second server, where the first server is located in a private cloud and is used for model inference, and the second server is located in a public cloud and is used for model training. The apparatus, serving as the second server, includes an obtaining unit, a determining unit, a sending unit, and a processing unit. The obtaining unit is configured to obtain, from the first server, a retraining instruction for a first training model, input data, and an inference result. The determining unit is configured to determine a retraining sample set according to the input data and the inference result, and is further configured to retrain the first training model according to the retraining sample set and determine a second training model, where the second training model is used to replace the first training model. The sending unit is configured to send the second training model to the first server. The retraining instruction instructs the second server to retrain the first training model, the input data is the data the first server inputs into the first training model for model inference, and the inference result is the result the first server obtains after inputting the input data into the first training model for model inference.

In a possible implementation, the processing unit is configured to: if the inference result is correct, retain the inference result and its corresponding input data; if the inference result is incorrect, delete the inference result and its corresponding input data, or replace the inference result with the correct inference result corresponding to the input data.

In a possible implementation, the obtaining unit is specifically configured to obtain the input data and the inference result in response to the retraining instruction received from the first server.

In a possible implementation, the determining unit is specifically configured to annotate the input data, obtain the annotated input data, and store the annotated input data and the inference result into the retraining sample set.

In a fifth aspect, this application provides a model training apparatus, including a processor, a memory, and a communication interface. The communication interface is used to communicate with other devices or a communication network, and the memory is used to store one or more programs including computer-executable instructions. When the apparatus runs, the processor executes the computer-executable instructions stored in the memory, so that the apparatus performs the model training method described in any one of the first aspect or the second aspect and their various optional implementations.

In a sixth aspect, this application provides a computer program product containing instructions which, when run by a processor, cause the computer to perform the model training method described in any one of the first aspect or the second aspect and their various optional implementations.

In a seventh aspect, this application provides a computer-readable storage medium storing one or more programs; the computer-readable storage medium stores instructions, and the one or more programs include instructions which, when executed by a processor, cause the computer to perform the model training method described in any one of the first aspect or the second aspect and their various optional implementations.
Brief description of the drawings

Fig. 1 is a first schematic diagram of a prior-art system for AI model training and inference in a hybrid cloud scenario;

Fig. 2 is a second schematic diagram of a prior-art system for AI model training and inference in a hybrid cloud scenario;

Fig. 3 is a schematic diagram of a system for AI model training and inference in a hybrid cloud scenario according to an embodiment of this application;

Fig. 4 is a schematic diagram of the hardware structure of a chip according to an embodiment of this application;

Fig. 5 is a schematic diagram of a model training apparatus according to an embodiment of this application;

Fig. 6 is a schematic flowchart of a model training method according to an embodiment of this application;

Fig. 7 is a schematic diagram of the area AUC under the receiver operating characteristic curve ROC of an inference result according to an embodiment of this application;

Fig. 8 is a schematic diagram of a model training apparatus serving as a first server according to an embodiment of this application;

Fig. 9 is a schematic diagram of a model training apparatus serving as a second server according to an embodiment of this application.

Detailed description

The technical terms involved in this application are explained first below:

Cloud: a server or server cluster. The services provided by the servers or server clusters that make up the cloud are cloud services (such as storage and computing), also described as cloud computing services; users obtain the resources and services they need from the cloud over the network.

Private cloud: also called an internal cloud or corporate cloud, a private cloud provides cloud computing services to specific users (rather than the general public) through the Internet or a dedicated internal network. A private cloud is generally deployed behind a firewall or in a hosting location, which can effectively guarantee the security and quality of the services provided.

Public cloud: a public cloud provides cloud computing services to all users through the Internet or a dedicated internal network. The core of the public cloud is shared resource services, which can be used throughout the open public network.

Hybrid cloud: a hybrid cloud combines the public cloud and the private cloud to provide cloud computing services to users.
After this brief description of the hybrid cloud, the system for AI model training and inference in a hybrid cloud scenario is first introduced with reference to Fig. 1:

As shown in Fig. 1, the system for AI model training and inference in a hybrid cloud scenario may include a first server 10 and a second server 20, where the first server 10 is located in a private cloud and is used for model inference, and the second server 20 is located in a public cloud and is used for model training.

The first server 10 may be a single server in the private cloud or a server cluster composed of multiple servers in the private cloud. The second server 20 may be a single server in the public cloud or a server cluster composed of multiple servers in the public cloud.

As shown in Fig. 2, if modules are divided by function, the system for AI model training and inference in the hybrid cloud scenario shown in Fig. 1 may include a data collection module 101, a model training module 102, a first model storage module 103, a second model storage module 104, and a model inference management module 105.

The data collection module 101, the model training module 102, and the first model storage module 103 are located in the second server 20. Specifically, if the second server 20 is a single server in the public cloud, the data collection module 101, the model training module 102, and the first model storage module 103 are all located on that second server 20; if the second server 20 is a server cluster composed of multiple servers in the public cloud, these modules are located on the same server in the cluster, or on different servers in the cluster.

The data collection module 101 is configured to obtain data for model training, add the obtained data as training samples to a training sample set, and send the training sample set to the model training module 102. Generally, the data collection module 101 may use an object storage service (OBS) to store the data it obtains.

The model training module 102 is configured to receive the training sample set sent by the data collection module 101, perform model training on a preset model according to the training sample set to obtain a training model that meets the application requirements, and store the training model in the first model storage module 103. The preset model may be a pre-stored model (for example, a training model previously obtained by model training), or a model set according to the application scenario of the training model. For example, in the field of neuro-linguistic programming (NPL), users usually perform model training on certain benchmark models with training sample sets they have obtained themselves, so as to obtain the desired training model; the preset model referred to in the embodiments of this application is such a benchmark model.

Optionally, the model training module 102 is configured to obtain the preset model for model training from the first model storage module 103.

The first model storage module 103 is configured to obtain the trained training model from the model training module 102 and store the obtained training model. The first model storage module 103 is further configured to store description information of the training model, which includes the name, purpose, and lifecycle of the training model. In addition, the first model storage module 103 is further configured to modify the description information of the training model, and to perform lifecycle management on the training model according to the lifecycle in its description information and the length of time the model has been stored. That is, within the lifecycle of the training model, the first model storage module 103 stores, updates, or deletes the training model and its description information, and after the lifecycle of the training model ends, deletes or updates the training model and its description information.

For example, suppose the first model storage module 103 obtains training model A at 11:30 am, and the lifecycle of training model A is 2 h, that is, from 11:30 am to 1:30 pm. Then, at any point between 11:30 am and 1:30 pm, the first model storage module 103 may update or delete the stored training model A, or modify its description information. By modifying the description information of training model A, the first model storage module 103 can change the name or purpose of training model A, or extend/shorten its lifecycle. If the first model storage module 103 does not modify the lifecycle of the stored training model A between 11:30 am and 1:30 pm, then after 1:30 pm the lifecycle of training model A ends, and the first model storage module 103 deletes the stored training model A and its description information.
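The lifecycle management described above can be sketched as a small store that sweeps out models whose lifecycle has ended. This is a toy illustration under stated assumptions (the class name `ModelStore` and its methods are illustrative, not from the application; a model is deleted as soon as the current time reaches stored-at plus lifetime):

```python
import datetime

class ModelStore:
    """Toy lifecycle manager: a model is kept only while now < stored_at + lifetime."""
    def __init__(self):
        self._models = {}  # name -> (stored_at, lifetime, model)

    def put(self, name, model, stored_at, lifetime):
        self._models[name] = (stored_at, lifetime, model)

    def extend(self, name, extra):
        # modifying the description information can extend the lifecycle
        stored_at, lifetime, model = self._models[name]
        self._models[name] = (stored_at, lifetime + extra, model)

    def sweep(self, now):
        """Delete every model whose lifecycle has ended; return their names."""
        expired = [n for n, (t0, life, _) in self._models.items()
                   if now >= t0 + life]
        for n in expired:
            del self._models[n]
        return expired
```

With the example above, model A stored at 11:30 am with a 2 h lifecycle would be swept at any time from 1:30 pm onward, unless its lifecycle was extended beforehand.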
Optionally, the first model storage module 103 is further configured to store the preset model used for model training.

The first model storage module 103 is further configured to send the training model it receives from the model training module 102 to the second model storage module 104. Optionally, the first model storage module 103 is further configured to send the description information of the training model to the second model storage module 104, so that the model inference management module 105 can use the training model for model inference.

The second model storage module 104 is configured to obtain and store the training model sent by the first model storage module 103. Optionally, the second model storage module 104 is further configured to obtain and store the description information of that training model; for the introduction of the description information, refer to the above content, which is not repeated here. The second model storage module 104 is further configured to modify the description information of the stored training model. In addition, the second model storage module 104 is further configured to perform lifecycle management on the training model according to the lifecycle in its description information and the length of time the model has been stored; for the specific description of lifecycle management, refer to the above content, which is not repeated here. The second model storage module 104 is further configured to send the training model it obtains from the first model storage module 103 to the model inference management module 105.

The second model storage module 104 and the model inference management module 105 are located in the first server 10. Specifically, if the first server 10 is a single server in the private cloud, both the second model storage module 104 and the model inference management module 105 are located on that first server 10; if the first server 10 is a server cluster composed of multiple servers in the private cloud, the two modules are located on the same server in the cluster, or on different servers in the cluster.

The model inference management module 105 is configured to call the training model from the second model storage module 104 and publish it as a service. The model inference management module 105 is further configured to update and delete the published service corresponding to the training model when the training model in the second model storage module 104 is updated or deleted. The model inference management module 105 is further configured to, after receiving an instruction to provide the service, call the training model corresponding to the service, input the input data into the training model for model inference to obtain an inference result, and deliver the inference result to the user terminal. Optionally, the model inference management module 105 is further configured to prune the training model, for example by reducing the number of network layers in a deep learning model or merging operators, so as to accelerate the inference process and improve the efficiency of model inference.
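The publish/replace/infer behavior of the model inference management module can be sketched as a minimal service registry. This is an illustrative toy, not the application's implementation (the class name `InferenceService` is an assumption, and models are represented as plain callables):

```python
class InferenceService:
    """Toy model-inference management: publish a model as a named service,
    replace it when a retrained model arrives, and serve inference calls."""
    def __init__(self):
        self._services = {}  # service name -> model (a callable here)

    def publish(self, name, model):
        self._services[name] = model

    def replace(self, name, new_model):
        # when a retrained model is delivered, the service is updated in place
        self._services[name] = new_model

    def infer(self, name, input_data):
        # call the training model corresponding to the service
        return self._services[name](input_data)
```

Replacing the model behind a service name leaves callers unchanged, which mirrors how the published service survives the swap from the first training model to the second training model.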
In the existing way of using a hybrid cloud for AI model training and inference, if the input data of the training model changes (taking a video surveillance scenario as an example, the aging, replacement, or repositioning of a surveillance camera will change the input data of the training model), the accuracy of the inference results obtained by using the training model for model inference may decrease, that is, the inference effect of the training model may be poor, and the performance of the business system may also be affected.

To solve the above problem, this application proposes a model training method applied in a hybrid cloud scenario, which can evaluate a training model according to the inference results obtained by performing model inference with the model and the model evaluation indicators, so as to monitor the inference effect of the training model, retrain a poorly performing training model in time, improve the accuracy of the inference results obtained by model inference, and ensure the performance of the business system (for example, a face recognition system). Therefore, as shown in Fig. 3, if the system for AI model training and inference in the hybrid cloud scenario is divided into modules by function, this embodiment of the application further adds a model retraining management module 201 and a model evaluation module 202 on the basis of the modules shown in Fig. 2.

If the second server 20 is a single server in the public cloud, the data collection module 101, the model training module 102, the first model storage module 103, and the model retraining management module 201 are all located on that second server 20; if the second server 20 is a server cluster composed of multiple servers in the public cloud, these modules are located on the same server in the cluster, or on different servers in the cluster.

The model retraining management module 201 is configured to receive a retraining instruction for the training model sent by the model evaluation module 202. The model retraining management module 201 is further configured to, in response to the received retraining instruction for the training model, instruct the data collection module 101 to obtain data for model retraining from the model evaluation module 202 and add the obtained data as retraining samples to the retraining sample set to be sent to the model training module 102. Optionally, the model retraining management module 201 is further configured to, in response to the received retraining instruction, instruct the model training module 102 to obtain the training model to be retrained from the first model storage module 103.

The data collection module 101 is further configured to obtain, as instructed by the model retraining management module 201, data for model retraining from the model evaluation module 202, namely the input data and the inference results, where the input data is the data input into the training model for model inference, and the inference results are the results obtained by inputting the input data into the training model for model inference. The data collection module 101 is further configured to add the obtained data as model retraining samples to the retraining sample set, and then send the retraining sample set to the model training module 102. Generally, the data collection module 101 may use an object storage service (OBS) to store the obtained data.

The model training module 102 is further configured to obtain the retraining sample set for model retraining from the data collection module 101, and, as instructed by the model retraining management module 201, obtain the training model to be retrained from the first model storage module 103. The model training module 102 is further configured to retrain the training model according to the obtained retraining sample set, obtain a retrained model that meets the application requirements, and replace the training model stored in the first model storage module 103 with the retrained model.

If the first server 10 is a single server in the private cloud, the second model storage module 104, the model inference management module 105, and the model evaluation module 202 are all located on that first server 10; if the first server 10 is a server cluster composed of multiple servers in the private cloud, these modules are located on the same server in the cluster, or on different servers in the cluster. For the introduction of the model training module 102, the first model storage module 103, the second model storage module 104, and the model inference management module 105, refer to the above description, which is not repeated here.

The model evaluation module 202 is configured to evaluate the training model according to the inference results obtained by inputting the input data into the training model for model inference and the model evaluation indicators, and determine the evaluation results of the model evaluation indicators for the training model. The model evaluation module 202 is further configured to determine the inference effect of the training model according to the evaluation results of the indicators and their corresponding preset thresholds. The model evaluation module 202 is further configured to send a retraining instruction for the training model to the model retraining management module 201 according to the inference effect of the training model, and to send the data used for model retraining to the data collection module 101.
示例性的,若模型评估模块202确定该训练模型的推理效果较差,则该模型评估 模块202发送针对该训练模型的重训练指令给模型重训练管理模块201。若模型评估模块202确定该训练模型的推理效果较好,则该模型评估模块202不发送针对该训练模型的重训练指令给模型重训练管理模块201。
可选的,模型评估模块202还可以更具体的划分为评估策略配置模块2021、数据采集模块2022,以及模型评估指标评价模块2023。
其中,评估策略配置模块2021用于配置训练模型的评估策略,包括配置用于评估训练模型的至少一项模型评估指标、模型评估指标的评估规则、模型评估指标对应的预设阈值、用于评估模型评估指标的输入数据和推理结果的选取规则,以及针对训练模型的重训练的触发策略。其中,输入数据为输入训练模型进行模型推理的数据,推理结果为将输入数据输入训练模型进行模型推理得到的预测结果,因此,输入数据和推理结果之间存在对应关系。评估策略配置模块2021,还用于将用于评估模型评估指标的输入数据和推理结果的选取规则、用于评估训练模型的至少一项模型评估指标,以及模型评估指标的评估规则发送给数据采集模块2022,将用于评估训练模型的至少一项模型评估指标、模型评估指标的评估规则、模型评估指标对应的预设阈值,以及针对训练模型的重训练的触发策略发送给模型评估指标评价模块2023。
数据采集模块2022,用于获取输入数据和推理结果,并将其获取到的输入数据和推理结果发送给数据收集模块101。数据采集模块2022,还用于从评估策略配置模块2021获取用于评估模型评估指标的输入数据和推理结果的选取规则、用于评估训练模型的至少一项模型评估指标,以及模型评估指标的评估规则等信息,并根据这些信息确定用于模型评估的输入数据和推理结果,将用于模型评估的输入数据和推理结果发送给模型评估指标评价模块2023。
模型评估指标评价模块2023,用于从数据采集模块2022获取用于模型评估的输入数据和推理结果,从评估策略配置模块2021获取用于评估训练模型的至少一项模型评估指标、模型评估指标的评估规则、模型评估指标对应的预设阈值,以及针对训练模型的重训练的触发策略。模型评估指标评价模块2023,还用于利用用于模型评估的输入数据和推理结果,按照用于评估训练模型的至少一项模型评估指标和模型评估指标的评估规则,来确定用于评估训练模型的至少一项模型评估指标的评估结果。模型评估指标评价模块2023,还用于根据模型评估指标的评估结果,以及模型评估指标对应的预设阈值,确定训练模型的推理效果。模型评估指标评价模块2023,还用于根据针对训练模型的重训练的触发策略,确定是否发送针对训练模型的重训练指令给模型重训练管理模块201。
示例性的,针对训练模型的重训练的触发策略为:若用于评估训练模型的至少一项模型评估指标的评估结果未超过其对应的预设阈值,则模型评估指标评价模块2023确定该训练模型的推理效果较差,并发送针对该训练模型的重训练指令给模型重训练管理模块201;若用于评估训练模型的模型评估指标的评估结果均超过其对应的预设阈值,则模型评估指标评价模块2023确定该训练模型的推理效果较好,不发送针对该训练模型的重训练指令给模型重训练管理模块201。
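作为一种示意,上述触发策略可以用如下Python代码草图表达,其中的函数名与字典结构均为说明用的假设,并非本申请限定的实现:

```python
def should_retrain(metric_results, thresholds):
    """按上述触发策略判断是否发送重训练指令:
    只要至少一项模型评估指标的评估结果未超过其对应的预设阈值,
    即认为训练模型的推理效果较差,需要触发重训练。
    metric_results 与 thresholds 均为 {指标名: 数值} 字典(示意性假设)。"""
    return any(metric_results[name] <= thresholds[name] for name in thresholds)
```

例如,召回率的评估结果为0.5、其预设阈值为0.6时,should_retrain返回True,对应于发送重训练指令;各项指标的评估结果均超过对应阈值时返回False,对应于不发送。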
可选的,按照功能模块来进行划分,图2和图3所示的采用混合云进行AI模型训练和推理的系统中还可以包括数据标注模块,该数据标注模块与其他模块松耦合。数据收集模块101,还用于在获取到用于模型训练的数据,或用于模型重训练的数据后,将获取到的用于模型训练/模型重训练的数据发送给数据标注模块。随后,数据标注模块,用于对接收到的用于模型训练/模型重训练的数据进行标注处理,并将标注后的数据发送给数据收集模块101。最后,数据收集模块101将从数据标注模块获取到的标注后的数据,作为训练样本/重训练样本加入到训练样本集/重训练样本集中,使得模型训练模块102可以利用训练样本集/重训练样本集中的数据,对预设模型进行模型训练,得到符合需求的训练模型,或者对训练模型进行模型重训练,得到重训练模型。
示例性的,以对图片中的物体进行分类的场景为例,数据标注的过程即利用标注工具对图像中的物体加矩形框,然后为该矩形框内的物体添加标签,例如“一只猫”或者“一部手机”等等。
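上述标注过程得到的一条标注记录,可以用如下Python草图示意,其中的字段命名为说明用的假设:

```python
def annotate(image_id, box, label):
    """对图像中的物体加矩形框并为框内物体添加标签,返回一条标注记录。
    box 以 (x, y, w, h) 表示矩形框的位置与宽高(示意性约定)。"""
    x, y, w, h = box
    assert w > 0 and h > 0, "矩形框的宽和高应为正数"
    return {"image_id": image_id, "bbox": [x, y, w, h], "label": label}
```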
关于数据收集模块101、模型训练模块102的其他作用,以及第一模型存储模块103、第二模型存储模块104、模型推理管理模块105的作用可以参见上述描述,在此不再赘述。
需要说明的是,在上述过程中,本申请还可以实现混合云场景下模型训练和推理系统中的数据闭环,即实现模型推理、模型评估、模型重训练、重训练模型下发、重训练模型推理的端到端业务闭环。这样,当训练模型由于环境变化导致输入数据变化等因素而出现推理效果下降时,本申请可以利用输入训练模型进行推理的输入数据以及模型推理得到的推理结果,及时对训练模型进行重训练,再利用重训练模型进行模型推理,从而提高训练模型的推理效果,即推理结果的准确率,保证业务系统性能。
图4为本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器300。该芯片可以被设置在如图1所示的第一服务器10和/或第二服务器20中,用以完成图2或图3所示的各个模块的工作,包括通过模型训练得到训练模型、利用训练模型进行模型推理、对训练模型进行模型评估以及通过模型重训练得到重训练模型等。
神经网络处理器NPU300作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU 300的核心部分为运算电路303,控制器304控制运算电路303提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路303内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路303从权重存储器302中取矩阵B相应的数据,并缓存在运算电路303中每一个PE上。运算电路303从输入存储器301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器308(accumulator)中。
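上述矩阵运算的数学效果(仅为示意,与NPU的实际数据通路无关)可以用如下纯Python草图复现:部分积按节拍逐步累加,最终结果即C=A×B:

```python
def matmul_accumulate(A, B):
    """以嵌套列表表示矩阵,模拟运算电路逐步累加部分结果的过程:
    外层循环 t 对应每一个节拍,部分结果累加保存在累加器 C 中。"""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]          # 累加器,初始为 0
    for t in range(k):                       # 逐节拍取 A 的一列与 B 的一行
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j] # 部分结果累加
    return C
```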
向量计算单元307可以对运算电路303的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元307可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元307能将经处理的输出的向量存储到统一存储器306。例如,向量计算单元307可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。
在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。
在一些实现中,处理过的输出的向量能够用作到运算电路303的激活输入,例如,用于在神经网络中的后续层中的使用。
统一存储器306用于存放输入数据以及输出数据。存储单元访问控制器305(direct memory access controller,DMAC)用于将外部存储器中的输入数据存入输入存储器301和/或统一存储器306,将外部存储器中的权重数据存入权重存储器302,以及将统一存储器306中的数据存入外部存储器。
总线接口单元310(bus interface unit,BIU),用于通过总线实现主CPU、DMAC和取指存储器309之间进行交互。
与控制器304连接的取指存储器309(instruction fetch buffer)用于存储控制器304使用的指令。控制器304用于调用取指存储器309中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器306,输入存储器301,权重存储器302以及取指存储器309均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
在图1至图3中,主CPU和NPU共同配合,可实现图1中第一服务器10和第二服务器20所需功能的相应算法,图2或图3所示的系统中各模块的运算可以由运算电路303或向量计算单元307执行。
上文中介绍的图1中的第一服务器10和第二服务器20能够执行本申请实施例中的模型训练方法的各个步骤,图4所示的芯片也可以用于执行本申请实施例的模型训练方法的各个步骤。
参见图5,本申请还提供一种模型训练装置,该模型训练装置400包括一个或多个处理器,如处理器401和/或处理器407,至少一个通信接口,如通信接口404,以及通信线路402。可选的,通信装置400还可以包括存储器403。下面以处理器401为例进行说明。
处理器401可以是一个通用中央处理器(central processing unit,CPU),微处理器,专用集成电路(application-specific integrated circuit,ASIC),现场可编程门阵列(field programmable gate array,FPGA),或一个或多个集成了多种处理电路功能(如CPU+ASIC)的集成电路。
通信线路402可包括一个或多个通路,用于连接不同组件。
通信接口404,可以是收发电路,用于与其他设备或通信网络通信,如云计算网络、以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。例如,所述收发电路可以是收发器、收发机一类的装置。可选的,所述通信接口404也可以是处理器401的输入/输出(input/output,I/O)电路,用以实现处理器401的信号输入和信号输出。
存储器403可以是具有存储功能的装置。例如可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器403可以是独立存在,并通过通信线路402与处理器401相连接。当然,存储器403也可以和处理器401集成在一起。
其中,存储器403用于存储执行本申请方案的计算机执行指令,并由处理器401来控制执行。处理器401用于读取并执行存储器403中存储的计算机指令(如用于CPU)或者配置文件(如用于FPGA),从而实现本申请实施例提供的模型训练方法。
或者,可选的,本申请实施例中,也可以是处理器401执行本申请下述实施例提供的模型训练方法中的相关处理功能,通信接口404负责与其他设备或通信网络通信,本申请实施例对此不作具体限定。
可选的,本申请实施例中的计算机执行指令也可以称之为应用程序代码,本申请实施例对此不作具体限定。
在具体实现中,作为一种实施例,处理器401可以包括一个或多个CPU,例如图5中的CPU0和CPU1。
在具体实现中,作为一种实施例,模型训练装置400也可以包括多个处理器,例如图5中的处理器401和处理器407。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。
在具体实现中,作为一种实施例,模型训练装置400还可以包括输出设备405和输入设备406。输出设备405和处理器401通信,可以以多种方式来输出信息。例如,输出设备405可以是触摸屏,液晶显示器(liquid crystal display,LCD),发光二级管(light emitting diode,LED)显示设备,阴极射线管(cathode ray tube,CRT)显示设备,投影仪(projector),或打印机(printer)等。输入设备406和处理器401通信,可以以多种方式接收用户的输入。例如,输入设备406可以是鼠标、键盘、触摸屏设备或传感设备等。
上述模型训练装置400有时也可以称为训练设备,其可以是一个通用设备或者是一个专用设备。例如,该训练设备可以是客户端、台式机、便携式电脑、网络服务器、掌上电脑(personal digital assistant,PDA)、移动手机、平板电脑、无线终端设备、嵌入式设备、或具有类似结构的设备。当然,上述模型训练装置400也可以是设置于上述各单一设备内部的软件和/或硬件实体,如用于执行本申请实施例提供的任务的芯片或芯片系统。本申请实施例不限定模型训练装置400的类型。
应理解,图5仅为便于理解而示例的简化示意图,该模型训练装置中还可以包括其他组件、电路或装置,图5中均未予以画出。
在本申请实施例中,图5所示的模型训练装置400,可以执行图6所示的模型训练方法。
本申请还提供一种模型训练方法,用于上述图1所描述的系统中,下面结合图6对本申请涉及到的模型训练方法进行介绍,该方法主要包括步骤S601-S608:
S601、第一服务器从第二服务器获取第一训练模型。
其中,第一训练模型是第二服务器针对预设模型进行模型训练得到的,预设模型可以是系统中预先存储的模型,也可以是根据所需第一训练模型的应用场景的需求确定的模型。
示例性的,在自然语言处理(NLP,natural language processing)领域中,用户通常会利用其自身获取到的训练样本集对某些基准模型,例如机器翻译模型、情感分析模型等来进行模型训练,以得到所需的训练模型,本申请实施例所指的预设模型即为所述基准模型。若当前应用场景为文本翻译场景,则预设模型可以为机器翻译模型;若当前应用场景为分析文本所表达的感情的场景,则预设模型为情感分析模型。
第二服务器利用获取到的训练数据,对预设模型进行模型训练得到第一训练模型后,存储该第一训练模型,以及该第一训练模型的描述信息,并将该第一训练模型以及该第一训练模型的描述信息下发给第一服务器。其中,第一训练模型的描述信息包括第一训练模型的名称、用途以及生命周期等。第一服务器在获取到第二服务器下发的第一训练模型,以及该第一训练模型的描述信息后,将接收到的第一训练模型和描述信息存储起来,并将该第一训练模型发布为服务,以使得用户可以通过该服务来调用相关的第一训练模型进行模型推理。
示例性的,第一训练模型的描述信息中包括该第一训练模型的名称,例如人脸识别模型,该第一训练模型的用途,例如人脸识别,该第一训练模型的生命周期,例如1h。可选的,第一训练模型的描述信息中包括该第一训练模型的存储时间,例如11:30am,该第一训练模型的存储时长,例如0.5h,则第一训练模型的可存储时长为0.5h。
可选的,第一服务器或第二服务器可以对其存储的第一训练模型的描述信息进行修改,还可以根据第一训练模型的生命周期,删除或者更新该第一训练模型。
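根据生命周期删除或更新训练模型的判断,可以用如下Python草图示意,时间均以小时为单位,字段含义为说明用的假设:

```python
def is_expired(stored_at_h, ttl_h, now_h):
    """根据描述信息中的存储时间 stored_at_h 与可存储时长 ttl_h,
    判断训练模型在当前时刻 now_h 是否到期,可被删除或更新。"""
    return now_h - stored_at_h >= ttl_h
```

例如,11:30am存储、可存储时长为0.5h的第一训练模型,在12:00pm即视为到期。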
S602、第一服务器将输入数据输入第一训练模型中进行模型推理,得到推理结果。
第一服务器在接收到用户终端发送的服务调用请求,以及用于模型推理的输入数据后,通过其发布的服务,来确定该服务对应的第一训练模型。然后,第一服务器将用于模型推理的输入数据,例如用户输入或者本地存储的数据输入到第一训练模型中进行模型推理,得到推理结果,并将推理结果发送给相应的用户终端。
示例性的,以第一训练模型为人脸识别模型为例,第一服务器在经过上述步骤S601后,获取到第二服务器发送的该人脸识别模型,将该人脸识别模型存储起来,并将该人脸识别模型发布为人脸识别服务。随后,若用户终端需要对某一图片中的人脸进行识别,则该用户终端发送人脸识别服务的服务调用请求,以及该图片给第一服务器,第一服务器确定需要调用的人脸识别服务对应的人脸识别模型,并将该图片输入到该人脸识别模型中进行模型推理,以确定推理结果,例如推理结果为图片中的人为A。最后,第一服务器将这个推理结果,即图片中的人为A,发送给用户终端。
针对输入数据的不同,模型推理的类型可以划分为批量推理和实时推理。其中,将实时获取到的数据,输入训练模型中进行模型推理,得到推理结果的过程即为实时推理的过程。将预先存储的多个数据,输入到训练模型中进行模型推理,得到多个推理结果的过程,即为批量推理的过程,多个推理结果与多个数据之间存在对应关系。
在实时推理的过程中,第一服务器接收输入数据,即用户终端实时发送的数据,和服务调用请求。随后,第一服务器根据接收到的服务调用请求通过应用程序编程接口(API,application programming interface)来调用该服务调用请求对应的推理服务,然后将输入数据输入到该推理服务对应的第一训练模型中进行模型推理,得到推理结果。
示例性的,以人脸识别场景中的闸机为例,闸机通过摄像头捕捉到人脸图像后,进一步通过闸机与第一服务器之间的API将该人脸图像,以及人脸识别服务的调用请求发送至第一服务器。随后,第一服务器根据该人脸识别服务的调用请求(例如http形式的调用请求)调用相应的人脸识别模型,并利用该人脸识别模型对人脸图像进行识别,将识别结果发送给闸机,以使得闸机可以根据该识别结果打开或者保持关闭。
在批量推理过程中,第一服务器接收输入数据和服务调用请求,其中,输入数据为预先存储的数据,或者该预先存储的输入数据的路径信息(例如网络文件系统(NFS,network file system)地址、文件传输协议(FTP,file transfer protocol)地址等)。随后,第一服务器根据该路径信息获取预先存储的输入数据,根据服务调用请求调用相应的第一训练模型,并将获取到的预先存储的数据输入第一训练模型中进行模型推理,得到推理结果。
示例性的,以人脸识别场景为例,用户终端发送人脸识别服务的调用请求,以及输入数据的路径信息(例如NFS地址)给第一服务器,第一服务器根据该NFS地址获取该地址指向的文件夹内存储的图片,并调用人脸识别服务对应的人脸识别模型。以该文件夹内存储有10张图片为例,第一服务器将这10张图片分别输入人脸识别模型中进行模型推理,得到10个推理结果。
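上述批量推理的流程可以用如下Python草图示意,其中model、load_fn等可调用对象均为说明用的假设,并非本申请限定的接口:

```python
def batch_inference(model, input_paths, load_fn):
    """批量推理:根据路径信息逐条读取预先存储的输入数据,
    输入训练模型进行模型推理;推理结果与输入数据一一对应。"""
    results = []
    for path in input_paths:
        data = load_fn(path)            # 例如按 NFS/FTP 地址读取一张图片
        results.append(model(data))     # 将输入数据输入模型,记录推理结果
    return results
```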
第一服务器将输入数据输入第一训练模型进行模型推理,并得到推理结果后,将该输入数据和该推理结果存储起来。可选的,第一服务器还可以将获取到的输入数据和推理结果发送给第二服务器,以使得第二服务器可以根据接收到的输入数据和推理结果对第一训练模型进行重训练。
S603、第一服务器根据推理结果,按照模型评估指标对第一训练模型进行评估,得到模型评估指标的评估结果。
其中,模型评估指标包括推理结果的准确率、推理结果的查准率、推理结果的召回率、推理结果的F1分数F1-Score、推理结果的接收者操作特征曲线ROC下的面积AUC等项中的至少一项。可选的,模型评估指标还可能包括平均绝对误差(MAE,mean absolute error)、均方误差(MSE,mean square error)等模型评估指标。一般的,上述准确率、查准率、召回率、F1分数F1-Score、AUC等模型评估指标主要用于二分类模型的评估,上述MAE、MSE等模型评估指标主要用于回归模型(例如人脸识别模型)的评估。
可选的,用于评估第一训练模型的模型评估指标,是根据当前应用场景,即第一训练模型的用途来确定的。
示例性的,若在当前应用场景下,相比于利用其他模型评估指标对第一训练模型的推理效果进行评估,利用推理结果的召回率对第一训练模型的推理效果进行评估,可以获得较好的评估效果,则可以将推理结果的召回率设置为用于评估第一训练模型的推理效果的模型评估指标。
示例性的,在某一应用场景下,用户也可以根据该应用场景的需要,自行设置用于评估第一训练模型的模型评估指标为推理结果的准确率、推理结果的召回率以及推理结果的F1-Score。
示例性的,以第一训练模型为二分类模型为例,100张卡片中有60张卡片上的数字是奇数,即有60个正类样本,以及有40张卡片上的数字为偶数,即有40个负类样本。利用第一训练模型对这100张卡片中的数字进行预测,即进行了100次推理,产生了100个推理结果。其中,在正类样本中,有40个卡片对应的推理结果是准确的,以及20个推理结果是错误的。在负类样本中,有30个卡片对应的推理结果是正确的,以及10个推理结果是错误的。将正类样本预测为正类的样本数量TP为40,将正类样本预测为负类的样本数量FN为20,将负类样本预测为正类样本数FP为10,将负类样本预测为负类样本的数量TN为30。那么,根据模型评估指标的评估规则,可以确定推理结果的准确率Accuracy为(TP+TN)/(TP+FN+FP+TN),即70%,推理结果的查准率Precision为TP/(TP+FP),即80%,推理结果的召回率Recall为TP/(TP+FN),即2/3,推理结果的F1分数F1-Score为查准率和召回率的调和均值2*Precision*Recall/(Precision+Recall),即8/11。以x=FP/(FP+TN)为横坐标和y=TP/(TP+FN)为纵坐标,来确定ROC以及AUC,x和y的取值为[0,1]。示例性的,ROC和AUC如图7所示,ROC上有一点A的坐标为(1/4,2/3)。
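上述各项模型评估指标的计算,可以用如下Python草图复现,评估规则与上文一致:

```python
def binary_metrics(tp, fn, fp, tn):
    """按上文的评估规则,由混淆矩阵的四个计数计算二分类模型的评估指标。"""
    accuracy = (tp + tn) / (tp + fn + fp + tn)          # 准确率
    precision = tp / (tp + fp)                          # 查准率
    recall = tp / (tp + fn)                             # 召回率
    f1 = 2 * precision * recall / (precision + recall)  # F1 分数(调和均值)
    return accuracy, precision, recall, f1
```

代入上例中的TP=40、FN=20、FP=10、TN=30,可得准确率70%、查准率80%、召回率2/3、F1分数8/11,与上文一致。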
可选的,第一训练模型的模型评估可以是周期性的,例如,按照预设时间间隔,利用第一个预设时间间隔内的实时推理的推理结果进行模型评估,得到第一训练模型的模型评估指标的评估结果。然后,根据第二个预设时间间隔内的实时推理的推理结果再次进行模型评估,又一次得到第一训练模型的模型评估指标的评估结果。
示例性的,10:00am-12:00pm之间,利用第一训练模型进行实时推理的次数为20次。以预设时间间隔为40min为例,利用10:00am-10:40am之间的实时推理的推理结果来进行一次模型评估,得到第一训练模型的模型评估指标的评估结果;然后,利用10:40am-11:20am之间的实时推理的推理结果进行模型评估,又一次得到第一训练模型的模型评估指标的评估结果;最后,利用11:20am-12:00pm之间的实时推理的推理结果进行模型评估,再一次得到第一训练模型的模型评估指标的评估结果。
可选的,按照预设次数间隔,利用批量推理或者实时推理的输入数据,和该输入数据对应的推理结果进行评估,得到该第一训练模型的模型评估指标的评估结果。
示例性的,以预设次数间隔为1次为例,若利用第一训练模型进行批量推理的次数有5次,即A、B、C、D和E,则分别利用批量推理A、C、E中的推理结果,进行模型评估,得到3组第一训练模型的模型评估指标的评估结果。
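按预设次数间隔选取用于模型评估的推理,可以用如下Python草图示意,选取规则为说明用的假设:

```python
def select_for_evaluation(inferences, interval):
    """按预设次数间隔 interval,从多次批量/实时推理中选取
    用于模型评估的推理:每选取一次后跳过 interval 次。"""
    return inferences[::interval + 1]
```

例如,对5次批量推理A、B、C、D、E,预设次数间隔为1次时,选取A、C、E三次批量推理的推理结果用于模型评估,与上例一致。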
可选的,在进行步骤S603之前,需要对第一训练模型的评估策略进行配置,包括配置用于评估训练模型的至少一项模型评估指标、模型评估指标的评估规则、模型评估指标对应的预设阈值、用于评估模型评估指标的输入数据和推理结果的选取规则,以及针对训练模型的重训练的触发策略。关于对评估策略的详细介绍,可以参见上述描述,在此不再赘述。可选的,第一服务器还可以对输入数据和推理结果的存储路径进行配置。
S604、若至少一项模型评估指标的评估结果未超过其对应的预设阈值,则第一服务器发送针对第一训练模型的重训练指令。
其中,针对第一训练模型的重训练指令用于指示第二服务器对该第一训练模型进行模型重训练。
可选的,若至少一项模型评估指标的评估结果未超过其对应的预设阈值,则第一服务器确定第一训练模型的推理效果较差,并发送针对第一训练模型的重训练指令给第二服务器。若模型评估指标的评估结果均超过其对应的预设阈值,则第一服务器确定第一训练模型的推理效果较好,不需要进行更新,因此,第一服务器不发送针对第一训练模型的重训练指令给第二服务器。其中,模型评估指标的评估结果对应的预设阈值可以是根据当前应用场景预先设定的,也可以是用户预先设定的。
示例性的,以模型评估指标包括推理结果的准确率、推理结果的查准率、推理结果的召回率、推理结果的F1-Score、推理结果的ROC下的面积AUC等项中的至少一项为例。若推理结果的准确率未超过预设准确率阈值、和/或推理结果的查准率未超过预设查准率阈值、和/或推理结果的召回率未超过预设召回率阈值、和/或推理结果的AUC未超过预设AUC阈值、和/或推理结果的F1-Score未超过预设F1-Score阈值,那么,第一服务器确定第一训练模型的推理效果较差,并发送针对第一训练模型的重训练指令给第二服务器。若推理结果的准确率超过预设准确率阈值、推理结果的查准率超过预设查准率阈值、推理结果的召回率超过预设召回率阈值、推理结果的AUC超过预设AUC阈值、且推理结果的F1-Score超过预设F1-Score阈值,那么,第一服务器确定第一训练模型的推理效果较好,不发送针对第一训练模型的重训练指令给第二服务器。
可选的,在另一种可能的实现方式中,若模型评估指标的评估结果均未超过其对应的预设阈值,则第一服务器确定第一训练模型的推理效果较差,并发送针对第一训练模型的重训练指令给第二服务器。若至少一项模型评估指标的评估结果超过其对应的预设阈值,则第一服务器确定第一训练模型的推理效果较好,不需要进行更新,因此,第一服务器不发送针对第一训练模型的重训练指令给第二服务器。
可选的,在一种可能的实现方式中,在根据应用场景的需要,确定用于评估第一训练模型的模型评估指标,并执行步骤S604之后,用户还可以根据该应用场景的需要,进一步配置第一服务器发送重训练指令给第二服务器的触发条件。当用于评估第一训练模型的模型评估指标的评估结果满足该触发条件时,第一服务器发送针对该第一训练模型的重训练指令给第二服务器。可以理解的是,第一服务器发送针对该第一训练模型的重训练指令,也就是开启了针对第一训练模型的重训练。
示例性的,在当前的业务场景下,用户根据该业务场景的需求,设置用于评估第一训练模型的模型评估指标有3个,这3个模型评估指标分别为推理结果的准确率、 推理结果的召回率以及推理结果的F1-Score。然后,用户可以根据该业务场景的需要,确定触发第一服务器发送重训练指令给第二服务器的条件为:若推理结果的准确率和推理结果的F1-Score中,有至少一项的评估结果未超过其对应的预设阈值,且推理结果的召回率的评估结果也未超过其对应的预设阈值,则第一服务器发送针对第一训练模型的重训练指令给第二服务器。假定对于第一训练模型来说,推理结果的准确率的评估结果为a,推理结果的F1-Score的评估结果为b,推理结果的召回率的评估结果为c,且a、b、c对应的预设阈值分别为A、B、C。那么,若a<=A且c<=C,则不论b是否大于B,第一服务器发送针对第一训练模型的重训练指令给第二服务器;若b<=B且c<=C,则不论a是否大于A,第一服务器发送针对第一训练模型的重训练指令给第二服务器;若a>A且b>B,则不论c是否大于C,第一服务器不发送针对第一训练模型的重训练指令给第二服务器;若c>C,则不论a是否大于A,b是否大于B,第一服务器不发送针对第一训练模型的重训练指令给第二服务器。
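该业务场景下用户配置的触发条件,等价于"准确率与F1分数中至少一项未超过阈值,且召回率也未超过阈值",可以用如下Python草图验证,变量命名沿用上文的a、b、c与A、B、C:

```python
def custom_trigger(a, b, c, A, B, C):
    """用户配置的重训练触发条件:
    推理结果的准确率 a 与 F1 分数 b 中至少一项未超过对应阈值,
    且召回率 c 也未超过其阈值时,才发送重训练指令。"""
    return (a <= A or b <= B) and c <= C
```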
需要说明的是,通过上述过程,在任一业务场景下,第一服务器可以自行周期性的或者非周期性的获取用于模型评估的推理结果,并根据预先配置的训练模型的评估策略,以及预先配置的训练模型的重训练的触发策略,来完成训练模型的推理效果的评估,进而确定触发/不触发该训练模型的重训练,以实现对训练模型的推理效果的监控,及时对推理效果较差的训练模型进行重训练,保证业务系统的性能。
S605、第二服务器从第一服务器获取输入数据以及推理结果。
其中,输入数据为第一服务器输入第一训练模型中进行模型推理的数据,推理结果为第一服务器将输入数据输入第一训练模型中进行模型推理,得到的推理结果。
第二服务器响应于其从第一服务器接收到的重训练指令,向第一服务器发送输入数据和该输入数据对应的推理结果的获取请求,该获取请求用于请求第一服务器发送输入第一训练模型中进行模型推理的输入数据,和该输入数据对应的推理结果给第二服务器。然后,第一服务器响应于该获取请求,向第二服务器发送数据,该数据中包括第一服务器输入第一训练模型的输入数据,以及将输入数据输入第一训练模型后进行模型推理得到的推理结果。
可选的,在另一种可能的实现方式中,第一服务器在发送重训练指令的同时,发送输入数据和该输入数据对应的推理结果给第二服务器。
可选的,在另一种可能的实现方式中,第二服务器可以在步骤S602之后先执行步骤S605,再执行步骤S603和步骤S604。此时,步骤S605可以是周期性执行的,第二服务器按照预设时间间隔,或者预设次数间隔,周期性的从第一服务器获取用于模型评估的输入数据,以及该输入数据对应的推理结果,即第一服务器可以按照预设时间间隔或者预设次数间隔,周期性地将用于模型评估的输入数据,该输入数据对应的推理结果发送给第二服务器。
需要说明的是,相对于周期性的执行步骤S605来说,在步骤S604之后,第二服务器响应于接收到的重训练指令,获取用于模型评估的输入数据和推理结果,来对第一训练模型进行重训练,可以在将第一训练模型的输入数据和推理结果发送到第一服务器的同时,节省用于数据发送的资源。
可选的,在另一种可能的实现方式中,第一服务器和第二服务器可以不执行步骤S603和步骤S604,而是在步骤S602之后,直接执行步骤S605,并周期性或非周期性地发送重训练指令给第二服务器,以使得第二服务器根据该重训练指令,利用输入数据和输入数据对应的推理结果,对第一训练模型重新进行训练,以确定第二训练模型。
需要说明的是,这种不进行模型评估,直接指示第二服务器进行模型重训练的技术方案,可以使得重训练模型与当前环境下的输入数据很好地匹配,从而达到较好的推理效果。但是对模型进行重训练的次数可能会过于频繁,也可能在现有训练模型推理效果良好的情况下仍进行模型重训练,造成繁重的计算负担以及不必要的软硬件资源消耗。另外,过于频繁地更换业务系统中用于模型推理的训练模型,可能会造成业务系统的不稳定,影响业务系统性能。而在进行模型评估后,根据模型的推理效果发送重训练指令给第二服务器的方式,可以在需要时才进行模型重训练,从而减少带宽等软硬件资源的浪费,保证业务系统的稳定运行。
S606、第二服务器根据输入数据和推理结果,确定重训练样本集。
第二服务器响应于接收到的针对第一训练模型的重训练指令,将其从第一服务器获取到的第一训练模型的输入数据,以及该输入数据对应的推理结果,加入到用于模型重训练的重训练样本集中,该重训练样本集存储用于模型重训练的训练数据。
可选的,第二服务器在获取到输入数据,以及输入数据对应的推理结果后,先对输入数据进行标注,得到标注后的输入数据,然后将标注后的输入数据和推理结果加入到重训练样本集中,得到用于模型重训练的重训练样本集。
可选的,第二服务器获取到输入数据,以及输入数据对应的推理结果后,根据推理结果是否正确,对这些输入数据以及推理结果进行筛选或者修改。对于正确的推理结果,第二服务器将该推理结果和该推理结果对应的输入数据,加入到重训练样本集中。或者在对该正确的推理结果对应的输入数据进行标注后,将标注后的输入数据和输入数据对应的推理结果存储到重训练样本集中。对于错误的推理结果,第二服务器删除该推理结果以及该推理结果对应的输入数据,或者第二服务器将错误的推理结果修改为正确的推理结果,并将修改后的推理结果和输入数据(或者标注后的输入数据)加入到重训练样本集中,得到用于模型重训练的重训练样本集。
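上述对重训练样本的筛选与修正,可以用如下Python草图示意:samples的三元组结构为说明用的假设,其中正确结果为None表示该条推理结果本身正确;对于错误的推理结果,这里选择修改为正确结果后加入(按上文描述,也可以选择直接删除该条样本):

```python
def build_retrain_set(samples):
    """samples 为 (输入数据, 推理结果, 正确结果或 None) 的列表。
    推理结果正确的样本直接保留;推理结果错误的样本,
    将其推理结果修改为正确结果后再加入重训练样本集。"""
    retrain_set = []
    for x, pred, truth in samples:
        if truth is None:               # 推理结果正确,保留原样
            retrain_set.append((x, pred))
        else:                           # 推理结果错误,修改为正确结果
            retrain_set.append((x, truth))
    return retrain_set
```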
S607、第二服务器根据重训练样本集,对第一训练模型进行模型重训练,确定第二训练模型。
第二服务器响应于其接收到的重训练指令,利用其在步骤S606中确定的重训练样本集,对第一训练模型进行模型重训练,得到第二训练模型。随后,第二服务器将其存储的第一训练模型,替换为该第二训练模型。可选的,第二服务器还存储该第二训练模型的描述信息,关于描述信息的具体描述,可参见上述内容,在此不进行赘述。
S608、第二服务器向第一服务器发送第二训练模型。
第二服务器将其根据步骤S606中确定的重训练样本集,对第一训练模型进行模型重训练得到的第二训练模型,以及该第二训练模型的描述信息,一并下发给第一服务器。第一服务器将其存储的第一训练模型,替换为该第二训练模型,并存储该第二训练模型的描述信息。随后,第一服务器还将该第二训练模型发布为服务,以使得用户可以通过该服务来调用第二训练模型进行模型推理。可选地,第一服务器将第一训练模型对应的服务删除,或者说将第一训练模型对应的服务,替换为第二训练模型对应的服务。
通过上述过程,第二服务器利用正确的推理结果,和该正确的推理结果对应的输入数据,对第一训练模型重新进行训练得到第二训练模型,可以很好的提高第二训练模型的推理效果,从而保证第二训练模型所在业务系统的性能。
可选的,在一种可能的实现方式中,第一服务器发送给第二服务器的重训练指令可以是针对预设模型的,以使得第二服务器根据步骤S606中确定的重训练样本集对预设模型进行模型重训练,得到第二训练模型。随后,第二服务器将其存储的第一训练模型,替换为该第二训练模型,并存储该第二训练模型的描述信息。关于预设模型的介绍可以参见上述内容,在此不再赘述。
需要说明的是,相对于根据步骤S606中确定的重训练样本集,对第一训练模型进行重训练得到的第二训练模型来说,根据步骤S606中确定的重训练样本集,对预设模型进行模型重训练来确定的第二训练模型,更加适用于当前场景,只是泛化能力较差,即不能保证之前的数据作为第二训练模型的输入数据时的推理效果。
通过上述实施例,本申请提供一种应用于混合云场景下的模型训练方法,可以对第一训练模型的模型评估指标进行评估,可以确定该第一训练模型的推理效果,实现对第一训练模型的推理效果的监控,从而在第一训练模型的推理效果较差时,向第二服务器发送用于模型重训练的重训练指令,使得第二服务器可以根据训练模型的推理效果,及时对训练模型进行重训练,以确定推理效果更好的训练模型,提高推理结果的准确度,保证业务系统的性能。
如图8所示,本申请还提供一种模型训练装置,应用于第一服务器和第二服务器的系统中,第一服务器位于私有云中,用于模型推理,第二服务器位于公有云中,用于模型训练。该装置作为第一服务器,可以用于执行上述方法实施例中第一服务器执行的步骤,该装置中包括获取单元801、推理单元802、评估单元803以及发送单元804。
获取单元801,用于从第二服务器获取第一训练模型。
推理单元802,用于将输入数据输入第一训练模型中进行模型推理,得到推理结果。
评估单元803,用于根据推理结果,按照模型评估指标对第一训练模型进行评估,得到模型评估指标的评估结果。
其中,模型评估指标包括以下至少一项:推理结果的准确率、推理结果的查准率、推理结果的召回率、推理结果的F1分数F1-Score、推理结果的接收者操作特征曲线ROC下的面积AUC。
发送单元804,用于若至少一项模型评估指标的评估结果未超过其对应的预设阈值,则向第二服务器发送针对所述第一训练模型的重训练指令。其中,重训练指令用于指示第二服务器对第一训练模型进行重训练。
可选的,发送单元804,还用于向第二服务器发送输入数据和推理结果。其中,输入数据和推理结果用于对第一训练模型进行重训练。
可选的,发送单元804,还用于若模型评估指标的评估结果均超过其对应的预设阈值,则不向第二服务器发送针对第一训练模型的重训练指令。
如图9所示,本申请还提供一种模型训练装置,应用于第一服务器和第二服务器的系统中,第一服务器位于私有云中,用于模型推理,第二服务器位于公有云中,用于模型训练。该装置作为第二服务器,可以用于执行上述方法实施例中第二服务器执行的步骤,该装置中包括获取单元901、确定单元902、发送单元903以及处理单元904。
获取单元901,用于从第一服务器获取针对第一训练模型的重训练指令、输入数据以及推理结果。其中,重训练指令用于指示第二服务器对所述第一训练模型进行重训练,输入数据为第一服务器输入第一训练模型中进行模型推理的数据,推理结果为第一服务器将输入数据输入第一训练模型中进行模型推理后得到的结果。
获取单元901,具体用于响应于从第一服务器接收到的重训练指令,获取输入数据和推理结果。
确定单元902,用于根据输入数据以及所述推理结果,确定重训练样本集。
处理单元904,用于若推理结果为正确的推理结果,则第二服务器保留该推理结果和该推理结果对应的输入数据。若推理结果为错误的推理结果,则第二服务器删除该推理结果和该推理结果对应的输入数据,或者,第二服务器将该推理结果替换为输入数据对应的正确的推理结果。
确定单元902,具体用于对输入数据进行标注,得到标注后的输入数据,并将标注后的输入数据和推理结果存储到重训练样本集中。
确定单元902,还用于根据重训练样本集对所述第一训练模型进行重训练,确定第二训练模型。其中,第二训练模型用于替换所述第一训练模型。
发送单元903,用于向第一服务器发送第二训练模型。
本申请实施例还提供一种计算机可读存储介质,其上存储有指令,该指令被处理器运行时执行上述方法实施例中的方法。
本申请实施例还提供一种包含指令的计算机程序产品,该指令在计算机上被处理器执行时,使得计算机执行上述方法实施例中的方法。
本申请实施例还提供一种芯片,该芯片包括收发单元和处理单元。其中,收发单元可以是输入输出电路、通信接口;处理单元为该芯片上集成的处理器或者微处理器或者集成电路。该芯片可以执行上述方法实施例中的方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
上述实施例可以全部或部分通过软件,硬件,固件或者其任意组合实现。当使用软件程序实现时,上述实施例可以全部或部分地以计算机程序产品的形式出现,计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机程序指令时,全部或部分地产生按照本申请实施例的流程或功能。
其中,所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计 算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质,(例如,软盘,硬盘、磁带)、光介质(例如,DVD)或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是物理上分开的,或者也可以不是物理上分开的,作为单元显示的部件可以是一个物理单元或多个物理单元,即可以位于一个地方,或者也可以分布到多个不同地方。在应用过程中,可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是个人计算机,服务器,网络设备,单片机或者芯片等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。

Claims (19)

  1. 一种模型训练方法,其特征在于,应用于包括第一服务器和第二服务器的系统中,所述第一服务器位于私有云中,用于模型推理;所述第二服务器位于公有云中,用于模型训练;所述方法包括:
    所述第一服务器从所述第二服务器获取第一训练模型;
    所述第一服务器将输入数据输入所述第一训练模型中进行模型推理,得到推理结果;
    所述第一服务器根据所述推理结果,按照模型评估指标对所述第一训练模型进行评估,得到模型评估指标的评估结果;
    若至少一项模型评估指标的评估结果未超过其对应的预设阈值,则所述第一服务器向所述第二服务器发送针对所述第一训练模型的重训练指令,所述重训练指令用于指示所述第二服务器对所述第一训练模型进行重训练。
  2. 根据权利要求1所述的模型训练方法,其特征在于,在所述第一服务器将输入数据输入所述第一训练模型中进行模型推理,得到推理结果之后,所述方法还包括:
    所述第一服务器向所述第二服务器发送所述输入数据和所述推理结果;所述输入数据和所述推理结果用于对所述第一训练模型进行重训练。
  3. 根据权利要求1或2所述的模型训练方法,其特征在于,所述模型评估指标包括以下至少一项:推理结果的准确率、推理结果的查准率、推理结果的召回率、推理结果的F1分数F1-Score、推理结果的接收者操作特征曲线ROC下的面积AUC。
  4. 根据权利要求1-3中任一项所述的模型训练方法,其特征在于,若模型评估指标的评估结果均超过其对应的预设阈值,则所述第一服务器不向所述第二服务器发送针对所述第一训练模型的重训练指令。
  5. 一种模型训练方法,其特征在于,应用于包括第一服务器和第二服务器的系统中,所述第一服务器位于私有云中,用于模型推理;所述第二服务器位于公有云中,用于模型训练;所述方法包括:
    所述第二服务器从所述第一服务器获取针对第一训练模型的重训练指令、输入数据以及推理结果,所述重训练指令用于指示所述第二服务器对所述第一训练模型进行重训练,所述输入数据为所述第一服务器输入第一训练模型中的数据,所述推理结果为所述第一服务器将输入数据输入所述第一训练模型中进行模型推理后得到的结果;
    所述第二服务器根据所述输入数据以及所述推理结果,确定重训练样本集;
    所述第二服务器根据所述重训练样本集对所述第一训练模型进行重训练,确定第二训练模型,所述第二训练模型用于替换所述第一训练模型;
    所述第二服务器向所述第一服务器发送所述第二训练模型。
  6. 根据权利要求5所述的模型训练方法,其特征在于,所述第二服务器从所述第一服务器获取针对第一训练模型的重训练指令、输入数据以及推理结果,具体包括:
    所述第二服务器响应于从所述第一服务器接收到的所述重训练指令,获取所述输入数据和所述推理结果。
  7. 根据权利要求5或6所述的模型训练方法,其特征在于,所述第二服务器根据所述输入数据以及所述推理结果,确定重训练样本集,具体包括:
    所述第二服务器对所述输入数据进行标注,得到标注后的输入数据;
    所述第二服务器将所述标注后的输入数据和所述推理结果存储到重训练样本集中。
  8. 根据权利要求7所述的模型训练方法,其特征在于,在所述第二服务器对所述输入数据进行标注,得到标注后的输入数据之前,还包括:
    若所述推理结果为正确的推理结果,则所述第二服务器保留所述推理结果和所述推理结果对应的输入数据;
    若所述推理结果为错误的推理结果,则所述第二服务器删除所述推理结果和所述推理结果对应的输入数据,或者,所述第二服务器将所述推理结果替换为所述输入数据对应的正确的推理结果。
  9. 一种模型训练装置,其特征在于,应用于包括第一服务器和第二服务器的系统中,所述第一服务器位于私有云中,用于模型推理;所述第二服务器位于公有云中,用于模型训练;所述装置作为所述第一服务器包括:
    获取单元,用于从所述第二服务器获取第一训练模型;
    推理单元,用于将输入数据输入所述第一训练模型中进行模型推理,得到推理结果;
    评估单元,用于根据所述推理结果,按照模型评估指标对所述第一训练模型进行评估,得到模型评估指标的评估结果;
    发送单元,用于若至少一项模型评估指标的评估结果未超过其对应的预设阈值,则向所述第二服务器发送针对所述第一训练模型的重训练指令,所述重训练指令用于指示所述第二服务器对所述第一训练模型进行重训练。
  10. 根据权利要求9所述的模型训练装置,其特征在于,
    所述发送单元,还用于向所述第二服务器发送所述输入数据和所述推理结果;所述输入数据和所述推理结果用于对所述第一训练模型进行重训练。
  11. 根据权利要求9或10所述的模型训练装置,其特征在于,所述模型评估指标包括以下至少一项:推理结果的准确率、推理结果的查准率、推理结果的召回率、推理结果的F1分数F1-Score、推理结果的接收者操作特征曲线ROC下的面积AUC。
  12. 根据权利要求9-11中任一项所述的模型训练装置,其特征在于,
    所述发送单元,还用于若模型评估指标的评估结果均超过其对应的预设阈值,则不向所述第二服务器发送针对所述第一训练模型的重训练指令。
  13. 一种模型训练装置,其特征在于,应用于包括第一服务器和第二服务器的系统中,所述第一服务器位于私有云中,用于模型推理;所述第二服务器位于公有云中,用于模型训练;所述装置作为所述第二服务器包括:
    获取单元,用于从所述第一服务器获取针对第一训练模型的重训练指令、输入数据以及推理结果,所述重训练指令用于指示所述第二服务器对所述第一训练模型进行重训练,所述输入数据为所述第一服务器输入第一训练模型中的数据,所述推理结果为所述第一服务器将输入数据输入所述第一训练模型中进行模型推理后得到的结果;
    确定单元,用于根据所述输入数据以及所述推理结果,确定重训练样本集;
    所述确定单元,还用于根据所述重训练样本集对所述第一训练模型进行重训练,确定第二训练模型,所述第二训练模型用于替换所述第一训练模型;
    发送单元,用于向所述第一服务器发送所述第二训练模型。
  14. 根据权利要求13所述的模型训练装置,其特征在于,
    所述获取单元,具体用于响应于从所述第一服务器接收到的所述重训练指令,获取所述输入数据和所述推理结果。
  15. 根据权利要求13或14所述的模型训练装置,其特征在于,
    所述确定单元,具体用于对所述输入数据进行标注,得到标注后的输入数据;
    所述确定单元,具体还用于将所述标注后的输入数据和所述推理结果存储到重训练样本集中。
  16. 根据权利要求15所述的模型训练装置,其特征在于,
    所述确定单元,用于若所述推理结果为正确的推理结果,则所述第二服务器保留所述推理结果和所述推理结果对应的输入数据;
    所述确定单元,还用于若所述推理结果为错误的推理结果,则所述第二服务器删除所述推理结果和所述推理结果对应的输入数据,或者,所述第二服务器将所述推理结果替换为所述输入数据对应的正确的推理结果。
  17. 一种模型训练装置,其特征在于,所述装置包括处理器、存储器和通信接口;其中,通信接口用于与其他设备或通信网络通信,存储器用于存储一个或多个程序,所述一个或多个程序包括计算机执行指令,当该装置运行时,处理器执行存储器存储的所述计算机执行指令以使该装置执行如权利要求1-4或5-8任一项所述的模型训练方法。
  18. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,当所述程序指令被处理器运行时,实现权利要求1-4或5-8任一项所述的模型训练方法。
  19. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在计算机上被处理器运行时,使得所述计算机执行如权利要求1-4或5-8任一项所述的模型训练方法。
PCT/CN2020/113610 2020-01-16 2020-09-04 模型训练方法及装置 WO2021143155A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022543575A JP2023511327A (ja) 2020-01-16 2020-09-04 モデル訓練方法および装置
EP20913624.1A EP4080419A4 (en) 2020-01-16 2020-09-04 PATTERN LEARNING METHOD AND APPARATUS
US17/865,106 US20220351081A1 (en) 2020-01-16 2022-07-14 Model training method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010049320.1A CN113128686A (zh) 2020-01-16 2020-01-16 模型训练方法及装置
CN202010049320.1 2020-01-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/865,106 Continuation US20220351081A1 (en) 2020-01-16 2022-07-14 Model training method and apparatus

Publications (1)

Publication Number Publication Date
WO2021143155A1

Family

ID=76772124

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113610 WO2021143155A1 (zh) 2020-01-16 2020-09-04 模型训练方法及装置

Country Status (5)

Country Link
US (1) US20220351081A1 (zh)
EP (1) EP4080419A4 (zh)
JP (1) JP2023511327A (zh)
CN (1) CN113128686A (zh)
WO (1) WO2021143155A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114707654A (zh) * 2022-06-06 2022-07-05 浙江大学 基于人工智能框架的算法训练推理性能可视化方法及装置
CN115146737A (zh) * 2022-07-21 2022-10-04 中国电信股份有限公司 匹配模型的建模方法、防护实现方法及相关设备
CN117114141A (zh) * 2023-10-20 2023-11-24 安徽蔚来智驾科技有限公司 模型训练的方法、评估方法、计算机设备及存储介质

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
KR20220049165A (ko) * 2020-10-14 2022-04-21 삼성에스디에스 주식회사 추론 데이터 기반 예측 모델 보정 시스템 및 방법
WO2023092428A1 (zh) * 2021-11-25 2023-06-01 华为技术有限公司 数据价值评估方法和相关产品
CN116233857A (zh) * 2021-12-02 2023-06-06 华为技术有限公司 通信方法和通信装置
CN114741269B (zh) * 2022-04-14 2022-09-23 网思科技股份有限公司 一种推理系统业务性能评估的方法
CN117993744A (zh) * 2022-10-31 2024-05-07 华为技术有限公司 模型评估方法和装置
CN116819964B (zh) * 2023-06-20 2024-02-06 小米汽车科技有限公司 模型优化方法、模型优化装置、电子设备、车辆和介质
CN116594846A (zh) * 2023-07-14 2023-08-15 支付宝(杭州)信息技术有限公司 推理服务监控方法及装置

Citations (6)

Publication number Priority date Publication date Assignee Title
CN109543829A (zh) * 2018-10-15 2019-03-29 华东计算技术研究所(中国电子科技集团公司第三十二研究所) 在终端和云端上混合部署深度学习神经网络的方法和系统
CN109716346A (zh) * 2016-07-18 2019-05-03 河谷生物组学有限责任公司 分布式机器学习系统、装置和方法
CN109840591A (zh) * 2017-11-29 2019-06-04 华为技术有限公司 模型训练系统、方法和存储介质
CN110135575A (zh) * 2017-12-29 2019-08-16 英特尔公司 用于分布式机器学习的通信优化
WO2019172868A1 (en) * 2018-03-05 2019-09-12 Clinc, Inc. Systems and method for automatically configuring machine learning models
CN110276387A (zh) * 2019-06-12 2019-09-24 深圳前海微众银行股份有限公司 一种模型的生成方法及装置

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
GB201805296D0 (en) * 2018-03-29 2018-05-16 Benevolentai Tech Limited Shortlist Selection Model For Active Learning

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN109716346A (zh) * 2016-07-18 2019-05-03 河谷生物组学有限责任公司 分布式机器学习系统、装置和方法
CN109840591A (zh) * 2017-11-29 2019-06-04 华为技术有限公司 模型训练系统、方法和存储介质
CN110135575A (zh) * 2017-12-29 2019-08-16 英特尔公司 用于分布式机器学习的通信优化
WO2019172868A1 (en) * 2018-03-05 2019-09-12 Clinc, Inc. Systems and method for automatically configuring machine learning models
CN109543829A (zh) * 2018-10-15 2019-03-29 华东计算技术研究所(中国电子科技集团公司第三十二研究所) 在终端和云端上混合部署深度学习神经网络的方法和系统
CN110276387A (zh) * 2019-06-12 2019-09-24 深圳前海微众银行股份有限公司 一种模型的生成方法及装置

Non-Patent Citations (1)

Title
See also references of EP4080419A4

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN114707654A (zh) * 2022-06-06 2022-07-05 浙江大学 基于人工智能框架的算法训练推理性能可视化方法及装置
CN114707654B (zh) * 2022-06-06 2022-08-23 浙江大学 基于人工智能框架的算法训练推理性能可视化方法及装置
CN115146737A (zh) * 2022-07-21 2022-10-04 中国电信股份有限公司 匹配模型的建模方法、防护实现方法及相关设备
CN115146737B (zh) * 2022-07-21 2024-03-29 中国电信股份有限公司 匹配模型的建模方法、防护实现方法及相关设备
CN117114141A (zh) * 2023-10-20 2023-11-24 安徽蔚来智驾科技有限公司 模型训练的方法、评估方法、计算机设备及存储介质
CN117114141B (zh) * 2023-10-20 2024-02-27 安徽蔚来智驾科技有限公司 模型训练的方法、评估方法、计算机设备及存储介质

Also Published As

Publication number Publication date
CN113128686A (zh) 2021-07-16
EP4080419A4 (en) 2023-06-14
JP2023511327A (ja) 2023-03-17
US20220351081A1 (en) 2022-11-03
EP4080419A1 (en) 2022-10-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20913624; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022543575; Country of ref document: JP; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2020913624; Country of ref document: EP; Effective date: 20220719)
NENP Non-entry into the national phase (Ref country code: DE)