WO2023179038A1 - Data annotation method, AI development platform, computing device cluster and storage medium - Google Patents


Info

Publication number
WO2023179038A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
annotation
data
user
basic model
Prior art date
Application number
PCT/CN2022/130153
Other languages
English (en)
French (fr)
Inventor
李明磊
糜飞
陈志毅
王雅圣
邓晓峰
怀宝兴
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd. (华为云计算技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202210855348.3A (published as CN116862001A)
Application filed by Huawei Cloud Computing Technologies Co., Ltd.
Publication of WO2023179038A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 — Computing arrangements using knowledge-based models
    • G06N5/04 — Inference or reasoning models

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a data annotation method, an AI development platform, a computing device cluster and a storage medium.
  • With the widespread application of AI technology, algorithm training requires large amounts of labeled data, so labeling data efficiently and accurately has become a top priority.
  • Embodiments of the present application provide a data annotation method that can reduce dependence on initial annotation data, realize zero-shot and few-shot inference, and thereby reduce the labor cost required for data annotation.
  • Because the basic model is shared by different tasks and continuously absorbs knowledge, it becomes more and more capable and its inference efficiency keeps improving.
  • This makes the method well suited to cloud deployment, where it can achieve lifelong learning.
  • In the cloud, the knowledge of different customers can be inherited and shared, realizing "knowledge as a service" and "experience as a service".
  • This application also provides corresponding AI development platforms, computing device clusters, computer-readable storage media, and computer program products.
  • embodiments of the present application provide a method of data annotation.
  • the AI development platform receives a first prompt template input by the first user.
  • The first prompt template is used to describe the relationship between input data and annotation results.
  • The AI development platform annotates the first data set based on the prompt template and a basic model deployed in advance on the platform; it then determines a first hard example set in the first data set and generates a display interface to present the first hard example set to the first user.
  • The first hard example set includes at least one hard example, and the user confirms the annotations of the first hard examples in the display interface.
  • Confirmation includes directly confirming correct annotations and confirming incorrect annotations after modifying them.
  • The AI platform then trains the basic model based on the result of the first user's confirmation of the first hard example set's annotations, obtaining an updated basic model.
  • This method can start inference directly from the prompt template and the basic model, reducing dependence on initial annotation data and the labor cost of data annotation.
  • It can also use the confirmation results from hard example mining to update the basic model deployed in the AI development platform, letting the basic model retain new knowledge and become more capable and more efficient at inference.
  • In some embodiments, the data annotation method further includes: the AI development platform annotates the first data set based on the updated basic model; when the annotation accuracy of the updated basic model is higher than or equal to a threshold, it returns an annotation-completion response; when the annotation accuracy is lower than the threshold, it determines a second hard example set in the first data set and generates a display interface to present it to the first user, who confirms the annotations of the second hard examples in the interface; the AI development platform then trains the updated basic model based on the first user's confirmations to update the basic model again.
  • In other words, this method presents multiple rounds of hard example sets to the user based on annotation accuracy, and trains the basic model repeatedly on the user's confirmations until the annotation accuracy of the basic model meets the target.
  • Through these iterations, the method continuously improves both the annotation results and the inference capability of the basic model.
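The multi-round loop described above can be sketched as a toy simulation. All names, the numeric "difficulty" values, and the confidence-based accuracy proxy are illustrative assumptions, not the platform's actual API:

```python
def annotate(skill, dataset):
    # Toy inference step: each sample has a numeric "difficulty"; the basic
    # model labels it with high confidence only if difficulty <= current skill.
    return [("label", 0.9 if d <= skill else 0.4) for d in dataset]

def mine_hard_examples(results, conf_threshold=0.5):
    # Samples the model is unsure about are the hard examples.
    return [i for i, (_, conf) in enumerate(results) if conf < conf_threshold]

def iterative_annotation(skill, dataset, accuracy_target=1.0, max_rounds=10):
    """Repeat: annotate -> mine hard examples -> user confirms -> retrain."""
    for round_no in range(1, max_rounds + 1):
        results = annotate(skill, dataset)
        hard = mine_hard_examples(results)
        accuracy = 1 - len(hard) / len(dataset)   # share labeled confidently
        if accuracy >= accuracy_target:
            return round_no, accuracy             # annotation-completion response
        # The user confirms/corrects the hard examples in the display interface;
        # retraining on the confirmations raises the model's capability:
        skill = max(skill, max(dataset[i] for i in hard))
    return max_rounds, accuracy

rounds, acc = iterative_annotation(skill=2, dataset=[1, 2, 3, 4, 5])
# first round mines the three hard samples; the second round completes annotation
```

The loop terminates either with an annotation-completion response (accuracy at or above the threshold) or after a bounded number of rounds, mirroring the accuracy-gated iteration the method describes.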
  • In some embodiments, training the basic model includes: training the basic model based on the result of the first user's confirmation of the first hard example set's annotations together with the first prompt template, to obtain the updated basic model.
  • In some embodiments, the data annotation method further includes: receiving a second prompt template input by a second user, and annotating a second data set based on the updated basic model and the second prompt template.
  • In some embodiments, the method further includes: determining the annotated first data set based on the result of the user's confirmation of the first hard example set's annotations and the annotations of the non-hard example set, where the annotations of the non-hard example set are those generated when the first data set was annotated based on the basic model and the prompt template, and the non-hard example set consists of the data remaining in the first data set after the first hard example set is excluded.
  • In some embodiments, the first user's target requirements are obtained, the target requirements including a task type; based on the target requirements and the annotated first data set, knowledge distillation is performed on the updated basic model to obtain a target model, which implements the task indicated by the task type.
  • In other embodiments, the annotated first data set and the first user's target requirements (including a task type) are obtained, and model training is performed based on the annotated first data set and the target requirements to obtain a target model that implements the task indicated by the task type.
  • In some embodiments, the target requirements also include performance requirements, which describe the accuracy or performance expected of the target model.
  • In some embodiments, the task type is any one of: text sentiment analysis, text classification, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
  • In some embodiments, the input first prompt template is one of multiple prompt templates preset in the AI development platform, each preset prompt template corresponding to a task type; alternatively, the first prompt template is designed by the user in the display interface.
  • Embodiments of the present application further provide an artificial intelligence (AI) development platform.
  • The AI development platform includes multiple modules whose combination can implement the method described in the first aspect or any optional implementation of the first aspect.
  • For example, the AI development platform may include: an input/output (IO) module, configured to receive a first prompt template input by a first user, the first prompt template describing the relationship between input data and annotation results; an inference module, configured to annotate a first data set based on the prompt template and a basic model deployed on the AI development platform; a hard example mining module, configured to determine a first hard example set in the first data set and generate a display interface to present it to the first user, the first hard example set including at least one hard example; and a basic model update module, configured to train the basic model based on the result of the first user's confirmation of the first hard example set's annotations to obtain an updated basic model.
  • In some embodiments, the inference module is further configured to annotate the first data set based on the updated basic model and, when the annotation accuracy of the updated basic model is higher than or equal to the threshold, return an annotation-completion response; the hard example mining module is further configured to, when the annotation accuracy of the updated basic model is lower than the threshold, determine a second hard example set in the first data set, generate a display interface to present it to the first user, and train the updated basic model based on the first user's confirmation of the second hard example set's annotations.
  • In some embodiments, the basic model update module is configured to train the basic model based on the result of the first user's confirmation of the first hard example set's annotations and the first prompt template, to obtain the updated basic model.
  • In some embodiments, the IO module is further configured to receive a second prompt template input by a second user, and the inference module is further configured to annotate a second data set based on the updated basic model and the second prompt template.
  • In some embodiments, the hard example mining module is further configured to determine the annotated first data set based on the result of the user's confirmation of the first hard example set's annotations and the annotations of the non-hard example set, where the annotations of the non-hard example set are those generated when the first data set was annotated based on the basic model and the prompt template, and the non-hard example set consists of the data remaining in the first data set after the first hard example set is excluded.
  • In some embodiments, the AI development platform further includes a model distillation module, configured to obtain the first user's target requirements (including a task type) and, based on the target requirements and the annotated first data set, perform knowledge distillation on the updated basic model to obtain a target model.
  • the target model is used to implement the task indicated by the task type.
  • In some embodiments, the AI development platform further includes a model training module, configured to obtain the annotated first data set and the first user's target requirements (including a task type), and to perform model training based on the annotated first data set and the target requirements to obtain a target model that implements the task indicated by the task type.
  • In some embodiments, the target requirements also include performance requirements, which describe the accuracy or performance expected of the target model.
  • In some embodiments, the task type is any one of: text sentiment analysis, text classification, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
  • In some embodiments, the input first prompt template is one of multiple prompt templates preset in the AI development platform, each preset prompt template corresponding to a task type; alternatively, the first prompt template is designed by the user in the display interface.
  • The present application further provides a computing device cluster.
  • The computing device cluster includes at least one computing device.
  • The at least one computing device includes at least one processor and at least one memory; the at least one memory is used to store instructions.
  • The at least one processor executes the instructions stored in the at least one memory, so that the computing device cluster performs the data annotation method of the first aspect or any possible implementation of the first aspect.
  • The present application further provides a computer-readable storage medium storing instructions that, when run on at least one computing device, cause the at least one computing device to perform the method described in the first aspect or any implementation of the first aspect.
  • The present application further provides a computer program product containing instructions that, when run on at least one computing device, cause the at least one computing device to perform the method described in the first aspect or any implementation of the first aspect.
  • Figure 1 is a schematic diagram of the basic functions of an AI development platform 100 provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of the network architecture of an AI development platform 100 provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the network architecture of another AI development platform 100 provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a data annotation solution on the AI development platform 100 provided by the embodiment of the present application.
  • Figure 5 is a flow chart of a data annotation and model training method provided by an embodiment of the present application.
  • Figure 6(a) is a schematic diagram of a user interface for creating intelligent annotations provided by an embodiment of the present application.
  • Figure 6(b) is a schematic diagram of a user interface for creating a new prompt template provided by an embodiment of the present application.
  • Figure 6(c) is a schematic diagram of another user interface for creating intelligent annotations provided by an embodiment of the present application.
  • Figure 6(d) is a schematic user interface diagram of a few-sample annotation interface provided by an embodiment of the present application.
  • FIG. 6(e) is a schematic diagram of a user interface showing an annotation result interface provided by an embodiment of the present application.
  • Figure 6(f) is a schematic user interface diagram of a manual hard-example confirmation interface provided by an embodiment of the present application.
  • Figure 6(g) is a schematic diagram of a user interface of a data annotation completion response provided by an embodiment of the present application.
  • Figure 6(h) is a schematic diagram of a user interface for model structure distillation provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of a data annotation device 300 provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computing device 400 provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a computing device cluster provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of an implementation manner of a computing device cluster provided by an embodiment of the present application.
  • AI development platform: a platform that provides AI developers and users with a convenient AI development environment and convenient development tools. Based on the user's own algorithm and training image set, the AI development platform can train an AI model that meets the user's needs, and users can use the trained AI model to complete their own specific tasks. In this process, the AI development platform can provide users with services such as data annotation, model training, model optimization, and model deployment.
  • Foundation model (basic model): a pre-trained AI model with a large parameter count. It can be fine-tuned to adapt to a variety of downstream task models; in other words, it is the "basis" of downstream tasks, hence the name basic model. Since the parameter scale of the basic model is usually large, it is also sometimes called a large model. This type of large model is trained on massive unlabeled data, with a parameter count usually above 1 billion; for example, Huawei Cloud's current Pangu CV large model has 3 billion parameters, and the Pangu NLP large model has reached 100 billion parameters.
  • Data annotation: the process of adding scenario-appropriate labels to unlabeled data.
  • For example, the unlabeled data may be unlabeled images: the category to which an unlabeled image belongs is added, or location information and categories are added for the targets in the unlabeled image.
  • the annotated data carries labels.
  • the parameters in the AI model can be adjusted based on the labels of the data.
  • Automatic data labeling uses active learning to complete data labeling.
  • The core idea is to randomly select part of the data for labeling, train a model on the labeled data, and determine a confidence threshold on a validation data set; the model then predicts the unlabeled data, producing a confidence score for each item. Data above the confidence threshold is treated as automatically labeled, and data below the threshold is sent to the user for relabeling. The automatically labeled portion reduces the amount of manual labeling.
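The confidence-threshold split at the heart of this active-learning scheme might look like the following sketch (the function name, the triple format, and the 0.8 threshold are illustrative assumptions):

```python
def split_by_confidence(predictions, threshold):
    """Split predictions into auto-labeled data and data routed to the user.

    predictions: iterable of (sample, predicted_label, confidence) triples.
    """
    auto_labeled, needs_review = [], []
    for sample, label, conf in predictions:
        if conf >= threshold:
            auto_labeled.append((sample, label))   # accepted as automatically labeled
        else:
            needs_review.append((sample, label))   # sent to the user for relabeling
    return auto_labeled, needs_review

preds = [("img_a", "cat", 0.97), ("img_b", "dog", 0.55), ("img_c", "cat", 0.88)]
auto, review = split_by_confidence(preds, threshold=0.8)
# img_a and img_c are auto-labeled; img_b goes to manual review
```

Only the low-confidence bucket reaches the user, which is exactly how the automatic portion reduces manual labeling effort.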
  • Hard examples are data on which the AI model performs poorly during training, evaluation, or inference.
  • Input data for which the AI model cannot produce sufficiently accurate inference results are called hard examples of the AI model.
  • For example, input data whose loss between the prediction and the label during training exceeds a certain threshold can be treated as hard examples.
  • As another example, if a data item D in the inference data set is input to the AI model and the error rate of the output inference result is higher than a target threshold, D is a hard example.
  • The AI model can also be used to intelligently label unlabeled data; this intelligent labeling process is in fact the AI model's inference process, and input data that is labeled incorrectly or with a high labeling error rate is determined to be a hard example.
  • Hard example mining refers to the method of determining whether a piece of data is a hard example.
  • Small-sample learning refers to training an AI model that effectively recognizes targets given only a small number of training samples of those targets, so as to obtain a model that classifies test samples accurately.
  • Small-sample learning can be divided into three categories according to the number of training samples: a) few-shot learning: model training with training samples on the order of dozens; b) one-shot learning: only one training sample is available, the mode closest to human learning; c) zero-shot learning: prediction without any labeled data, with the goal of predicting classes that did not appear in the training data set.
  • Model fine-tuning: starting from the pre-trained large model, fine-tune the parameters of the fully connected layer or the top layers of the neural network on a small-sample data set to obtain a fine-tuned model matched to different downstream tasks, so that the fine-tuned model's effect on downstream tasks is significantly improved. Fine-tuning makes it possible to solve the target problem with less domain-specific data and less training cost than training from scratch.
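The "fine-tune only the top layers" idea amounts to freezing the pre-trained parameters and updating only a chosen subset. A minimal sketch with toy dict-based parameters and a hand-written SGD step (all names and values are illustrative, not a real framework API):

```python
# Toy model parameters: layer name -> list of weights.
params = {"embedding": [0.10], "encoder": [0.20], "top_layer": [0.30], "head": [0.40]}
TRAINABLE = {"top_layer", "head"}   # only the top layers are fine-tuned

def sgd_step(params, grads, lr=0.1):
    """Apply one gradient step, skipping the frozen (pre-trained) layers."""
    for name, grad in grads.items():
        if name in TRAINABLE:
            params[name] = [w - lr * g for w, g in zip(params[name], grad)]
    return params

grads = {name: [1.0] for name in params}   # pretend gradient of 1.0 everywhere
params = sgd_step(params, grads)
# "embedding" and "encoder" are unchanged; "top_layer" and "head" moved by lr * grad
```

Because most parameters stay frozen, each fine-tuning run touches only a small fraction of the model, which is why it needs less data and compute than full training.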
  • Prompt template: a "prompt" is context given to the model along with the input; it tells and guides the model what task to do next.
  • The prompt template can be understood as expressing how we expect the large model to label the data; in other words, it is a template that prompts the relationship between the data and the labeling results.
  • The "relationship" here can be a contextual or other logical relationship, much like building a cloze template: the large model only needs to fill in the blanks. For examples of prompt templates, see Table 1 below.
  • Prompt learning is currently used mainly in the NLP field. Without significantly changing the structure or parameters of the pre-trained language model (the basic model), the downstream task is turned into a text generation task by adding prompt information to the input. Pre-trained language models contain a great deal of knowledge and many patterns; some are ready to use directly, while others require certain methods to "stimulate" them.
  • Prompt learning can be applied to knowledge probing (factual and linguistic probing), classification tasks (text classification and natural language inference), information extraction (relation extraction, semantic parsing, and named entity recognition), reasoning in NLP (common-sense reasoning and mathematical reasoning), question answering, text generation, automatic evaluation of text generation, multi-modal learning, meta-applications (domain adaptation, debiasing, and data set creation), and other task types. This application places no restrictions on task types.
  • Knowledge distillation is a commonly used method for model compression. Unlike pruning and quantization, knowledge distillation transfers the knowledge of one network to another by building a lightweight small model; the two networks can be homogeneous or heterogeneous.
  • The typical approach is to first train a teacher network, and then use the teacher network's outputs together with the data's true labels to train the student network.
  • Through distillation training, the student model acquires the teacher's knowledge: the knowledge of the complex teacher model is transferred to the simple student model at a small cost in accuracy, yielding better performance on downstream tasks.
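Training the student on the teacher's outputs plus the true labels is usually expressed as a combined loss. A minimal pure-Python sketch (the temperature `T`, weight `alpha`, and the example logits are conventional but illustrative choices, not values from this application):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    return -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, predicted))

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.5):
    """Weighted sum of the soft-target loss (match the teacher's
    temperature-smoothed outputs) and the hard-label loss (true label)."""
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    one_hot = [1.0 if i == true_label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss([2.0, 0.5, -1.0], [2.2, 0.3, -1.1], true_label=0)
```

The temperature softens the teacher's distribution so the student also learns from the relative probabilities of the wrong classes, which is where much of the teacher's "knowledge" lives.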
  • FIG. 1 is a schematic diagram of the basic functions of an AI development platform 100 according to an embodiment of the present application.
  • The AI development platform 100 is a PaaS cloud service on the cloud platform. Built on the large pool of basic resources and software capabilities owned by the cloud service provider, it helps users (also known as tenants, AI developers, etc.) develop AI models.
  • The basic capabilities provided by the AI development platform 100 can include the following six parts: data preprocessing 110, model construction and training 120, model management 130, model deployment 140, data optimization 150, and model optimization and update 160. Each functional module is introduced as follows:
  • Data preprocessing 110: users can perform one or more operations on the data set, such as data selection, data annotation, data augmentation, data cleaning, and feature analysis, according to their needs.
  • Among these, data annotation is the most important step in data preprocessing 110.
  • Data annotation is usually performed on the data set required for AI model training; the data set can be collected in advance by the user according to the actual application scenario and uploaded to the platform 100, or an open-source data set established in the industry can be used.
  • For data annotation 111, please refer to the description of Figure 3 below.
  • Model construction and training 120: constructing and training AI models is a key capability of the AI basic development platform, mainly in three ways: (1) based on the user's goals (such as task type and target accuracy), automatically select an initial model built into the AI basic development platform and train it to obtain an AI model that meets the user's goals; (2) based on the user's goals and an initial AI model provided by the user or selected by the user on the platform, train the initial AI model to obtain an AI model that meets the user's goals; (3) based on the user's goals, use the platform's background neural architecture search algorithm to automatically search for a suitable AI model, train it, and obtain an AI model that meets the user's goals.
  • the first two methods mainly use the computing power of the cloud environment to train the AI model.
  • the third method includes both the search for the AI model architecture and the training of the AI model.
  • The principle of AI model training is well known in the art and will not be described further here.
  • Model management 130: the AI basic development platform also provides model management functions.
  • The managed models can come from the AI models trained as described above as well as from the user's own AI models. Unified model management includes model evaluation, diagnosis, optimization, conversion, and so on. Among these, model evaluation mainly uses at least one evaluation metric to measure the performance of the trained AI model; for example, the accuracy of the trained AI model's inference results on an evaluation data set can be calculated.
  • Model deployment 140: the aforementioned target AI model can be deployed on nodes in the cloud environment or in the edge environment; nodes in the cloud environment can be virtual machine instances, container instances, physical servers, and so on.
  • the AI model can be deployed distributedly on multiple nodes based on the idea of model parallelism.
  • AI models can also be deployed independently on multiple nodes to support a larger number of visits to online services.
  • the nodes in the edge environment can be various edge devices.
  • the deployed AI model can become an AI application or become a part of an AI application.
  • users can access AI applications online through Web pages, or access AI applications online through client apps.
  • The AI application can provide responses by calling, online, the AI model deployed in the edge environment or cloud environment.
  • the AI model developed and trained through the AI basic development platform can implement inference on online request data and return inference results.
  • the cloud platform can bill based on the number of calls to the AI model, or based on the resource consumption inferred by the AI model.
  • The AI model developed and trained on the aforementioned AI basic development platform need not be deployed online; instead, users can download the trained AI model and deploy it locally as they wish. For example, a user can save the trained AI model to OBS and then download it from OBS to a local computer.
  • user 1 uses the aforementioned AI basic development platform 100 to complete the training of the AI model and then publishes it to the AI market.
  • the AI model in the AI market can be subscribed and used by other users.
  • The functions of the AI model can be integrated into other users' AI applications.
  • users can complete the development of AI models and the deployment and management of AI applications based on the AI basic development platform 100.
  • Various capabilities in the AI basic development platform can be integrated for users to use the entire AI process, or they can provide independent functions for users.
  • FIG. 2 is a schematic diagram of the network architecture of an AI development platform 100 provided in the embodiment of this application.
  • the AI development platform 100 can be deployed independently on a server or virtual machine in a data center in a cloud environment.
  • the AI development platform 100 can also be deployed in a distributed manner on multiple servers in a data center, or in a distributed manner. Deployed on multiple virtual machines in the data center.
  • the data center in Figure 2 is the central cloud data center of the cloud service provider.
  • The interaction between the user and the AI development platform 100 mainly takes the following form: the user logs in to the cloud platform through a client web page and selects and purchases the cloud service of the AI development platform 100; usually, the user first needs to purchase a prepaid package.
  • Purchasing the package means the user can use the capabilities provided by the AI basic development platform and the basic computing resources included in the prepaid package for data annotation, model construction, training, deployment, and so on.
  • users can conduct full-process AI development based on the functions provided by the AI development platform 100.
  • users develop and train their own AI models on the AI base platform, they are based on the basic resources (including computing resources, storage resources and network resources) in the cloud service provider's data center.
  • the computing resources include CPU, GPU, NPU, etc.
  • the cloud platform charges for the excess resources in a pay-as-you-go manner.
  • users can specify the tasks to be completed by the AI model and upload unlabeled data sets to the cloud environment through the application program interface (API) or graphical user interface (GUI).
  • the AI development platform 100 in the cloud environment receives the user's task information, unlabeled image sets, etc., performs data preprocessing, AI model training, and uses the trained AI model for reasoning.
  • the trained AI model can be downloaded by the user or deployed online to complete specific tasks.
  • the aforementioned data center may also include an edge data center provided to users by a cloud service provider.
  • FIG 3 is a schematic network architecture diagram of another AI development platform 100 provided by an embodiment of the present application.
  • the AI development platform 100 in Figure 3 can also be deployed in a distributed manner in different environments.
  • the AI development platform 100 can be logically divided into multiple parts, each with different functions.
  • part of the AI development platform 100 may be deployed in computing devices in an edge environment (also called edge computing devices), and another part may be deployed in devices in a cloud environment.
  • the edge environment is an environment that is geographically close to the user's terminal computing device.
  • the edge environment includes edge computing devices, such as edge servers, edge stations with computing capabilities, etc.
  • the resources in the public cloud are used to run the model construction and training 120 and model management 130 functions in Figure 1 provided by the AI development platform, while the resources in the private cloud are used to run the data storage OBS and data preprocessing 110 functions provided by the AI development platform, which provides stronger security for user data.
  • public cloud resources can come from the central cloud data center, and private cloud resources can come from edge data centers.
  • FIG 4 is a schematic diagram of a data annotation scheme on an AI development platform 100 given in the embodiment of this application.
  • the process mainly includes: user input of a prompt template 111, intelligent annotation by the AI development platform 112, manual confirmation of difficult cases 113, and training of the basic model 114.
  • the data annotation process in the embodiment of this application includes: first, the user inputs a prompt template 111, and the AI development platform 100 intelligently annotates data set A based on basic model B 112; at the same time, the AI development platform 100 performs difficult example mining during the intelligent annotation 112. Then, the AI development platform 100 sends the difficult examples to the user for manual confirmation of the difficult examples 113. Next, the AI development platform 100 continues to train basic model B based on the results of the difficult example confirmation 114, learns the new knowledge brought by data set A, and performs intelligent annotation again based on the trained basic model 112. In the process shown in Figure 4, the AI development platform 100 continuously repeats intelligent annotation 112, manual confirmation of difficult cases 113, and basic model training 114 until the accuracy of the intelligent annotation 112 meets the conditions.
  • if the annotation accuracy after the first intelligent annotation 112 already meets the conditions, for example, the accuracy reaches the threshold of 99.8%, then the manual confirmation step 113 is not required, and the annotation completion response is returned directly.
  • this is an ideal situation. It may occur after the data annotation function module has been running on the AI development platform for a long time: as basic model B continues to absorb knowledge, the model's capability becomes stronger and stronger, the zero-shot effect becomes better and better, and the efficiency becomes higher and higher, so that intelligent annotation with high accuracy can even be completed in one pass.
  • steps 111-114 are as follows:
  • Step 111 Enter the prompt template.
  • Table 1 lists example prompt templates for each task type:
  • Text sentiment analysis: "X, the emotional polarity is <MASK>"
  • Named entity recognition: "What entities like Z are there in X? Answer: <MASK>"
  • Text classification: "X is <MASK> news"
  • Sound classification: "X is the sound of <MASK>"
  • Speech content recognition: "The content of X is <MASK>"
  • Image classification: "X is a kind of Z? Answer: <MASK>"
  • Object detection: "The coordinates of object Z in X are <MASK>"
  • Video annotation: "The coordinates of object <MASK> in X are <MASK>"
  • X represents the input data, which can be text, image, audio or video; <MASK> represents the output, that is, the result of data annotation. It should be noted that the above examples are only illustrative and are not used to limit the format of the prompt template in this application; X or <MASK> may also be omitted in some cases.
  • X in Table 1 indicates that the input is a sentence, and directional words such as "sentiment polarity" and "news" are all part of the prompt.
  • the NLP large model can combine the contextual meaning of the prompt template and output the result corresponding to <MASK>; for example, for the text sentiment analysis task, it outputs the sentiment polarity of the input text.
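The template mechanics above can be sketched in a few lines. This is a hypothetical illustration of how X is substituted into a template while <MASK> is left for the basic model to predict; `fill_template` is not the platform's actual API.

```python
# Hypothetical sketch of combining input data X (and an optional entity
# category Z) with a prompt template; <MASK> stays for the model to fill.
def fill_template(template: str, x: str, z: str = "") -> str:
    """Substitute the input data into a prompt template."""
    return template.replace("X", x).replace("Z", z)

sentiment_prompt = fill_template(
    "X The emotional polarity is <MASK>",
    "I watched this movie and liked it very much.")
# The basic model would then predict the word at <MASK>, e.g. "good".
```

The same helper works for any row of Table 1, since each template only differs in its surrounding directional words.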
  • for example, when a customer uploads named entity recognition data for news, he can choose the named entity recognition template "What entities like Z are there in X? Answer: <MASK>", where X is the original text content, Z is an example entity word of a certain entity category, and <MASK> is the content to be generated from the prompt template.
  • for image classification, the template "X is a kind of Z? Answer: <MASK>" can be used. This is similar to a natural language understanding task, except that the input X here changes from a sentence to a picture, and the following Z can also be a picture we input. For example, X can be a picture in which there is a black dog.
  • the large CV model may have photos of dogs similar to this picture among the 400 million pictures it has seen.
  • the user only needs to input the prompt template, and the basic model B can perform zero-shot learning based on the prompt model to perform intelligent annotation.
  • alternatively, while inputting the prompt template, the user can also input a small number of samples corresponding to the prompt template format to help basic model B perform few-sample learning for intelligent annotation.
  • Step 112: The AI development platform performs prompt learning based on basic model B and labels data set A. Specifically: based on the prompt template input in the previous step 111, or based on the prompt template and a small amount of sample data, it performs inference (i.e., prompt learning) on the unlabeled data in data set A, such as zero-shot inference/few-shot inference, and outputs the labeling results corresponding to the unlabeled data.
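For the few-shot case, the labeled samples can simply be prepended to the templated query so the model performs in-context learning. The following is an illustrative sketch; `build_few_shot_prompt` is a hypothetical helper, not part of the platform.

```python
# Illustrative few-shot prompt construction: labeled examples are rendered
# with their answers, and the final line leaves <MASK> for the model.
def build_few_shot_prompt(template, examples, query):
    """examples: list of (text, label) pairs; query: unlabeled text."""
    lines = [template.replace("X", text).replace("<MASK>", label)
             for text, label in examples]
    lines.append(template.replace("X", query))  # <MASK> left unfilled
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "X The emotional polarity is <MASK>",
    [("The movie is very touching.", "good"),
     ("The plot is boring.", "bad")],
    "An unforgettable masterpiece.")
```

The model sees two solved examples followed by the query, which is what allows few-sample inference without any parameter update.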
  • basic model B is a pre-trained large model deployed on the AI development platform. This type of large model is usually trained from massive unlabeled data; its parameter scale is usually large and it has excellent generalization capabilities. Large models are mainly divided into two categories according to the type of training data: Natural Language Processing (NLP) large models and Computer Vision (CV) large models. In addition, large models may also include: multi-modal large models, scientific computing large models, etc. This application does not limit this.
  • the basic model B in the embodiment of the application may be any of the large models described above.
  • Step 113 Manual confirmation of difficult cases.
  • the AI development platform 100 also introduces difficult case mining technology, which allows the AI development platform to perform a closed-loop process of reasoning, hard case mining, training, and re-reasoning based on the pre-trained basic model B.
  • One possible implementation is that after the basic model performs inference (intelligent annotation), the AI development platform uses a difficult example mining algorithm to sort the samples in which the model's predictions are not confident (that is, the difficult examples), determines the difficult examples in the first data set and their attributes, and then presents them to the user through the user interface program for manual annotation. The proportion of difficult cases marked here can be adjusted manually.
  • the AI development platform 100 can use one of a temporal consistency algorithm, a data-feature-distribution-based algorithm, a data-augmentation consistency algorithm, an uncertainty-based algorithm, a clustering algorithm, or an anomaly detection algorithm to determine the difficult examples in the unlabeled image set and the difficult example attributes of each difficult example.
  • alternatively, multiple algorithms can be used together to determine the difficult examples and their attributes, with different weights assigned to different algorithms and to different features.
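As one concrete illustration of the uncertainty-based algorithm mentioned above, samples whose predicted class distribution has high entropy can be treated as difficult examples. This is a minimal sketch under that assumption; the threshold and probability vectors are made up for illustration.

```python
import math

# Uncertainty-based difficult-example criterion: high prediction entropy
# means the model is not confident, so the sample is flagged as difficult.
def prediction_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def mine_difficult(samples, threshold=0.5):
    """samples: list of (sample_id, class_probabilities)."""
    return [sid for sid, probs in samples
            if prediction_entropy(probs) > threshold]

preds = [("a", [0.98, 0.01, 0.01]),   # confident -> not difficult
         ("b", [0.40, 0.35, 0.25])]   # uncertain -> difficult
```

A real implementation would combine several such criteria with per-algorithm weights, as the text notes.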
  • the user can see the number of difficult examples that need to be confirmed, as well as the proportion of difficult or non-hard cases in the unlabeled data set A, and then determine whether the inference performance of the current basic model meets the requirements.
  • users can also see the accuracy of the current smart annotation.
  • the AI development platform determines the difficult examples in the unlabeled image set and the difficult example attributes of each difficult example based on the inference results of the basic model.
  • the difficult example attributes include the difficult example coefficient, which is used to describe the difficult example.
  • the difficulty coefficient can be a number between 0 and 1, which is used to reflect the degree of difficulty of the difficult example (for example, the difficulty of classifying or detecting through an AI model to obtain correct results).
  • the greater the difficulty coefficient, the higher the degree of difficulty; conversely, the smaller the difficulty coefficient, the lower the degree of difficulty.
  • the AI development platform sorts the difficult examples and sends at least some of them to the user according to the set labeling ratio (or difficulty coefficient threshold). For example, the AI development platform can set the difficulty coefficient threshold to 0.6; in other words, only difficult cases with a difficulty coefficient greater than 0.6 are returned to the user for confirmation.
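The sort-and-threshold step above can be sketched as follows; the helper name and example coefficients are illustrative assumptions, not the platform's API.

```python
# Select difficult examples above a difficulty-coefficient threshold for
# manual confirmation, hardest first. Coefficients are in [0, 1].
def select_for_confirmation(hard_examples, threshold=0.6):
    """hard_examples: list of (sample_id, difficulty_coefficient)."""
    kept = [(sid, c) for sid, c in hard_examples if c > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [sid for sid, _ in kept]

examples = [("s1", 0.9), ("s2", 0.3), ("s3", 0.7)]
# with the default threshold 0.6, only s1 and s3 go back to the user
```

Raising the threshold (or lowering the labeling ratio) directly trades annotation quality against the amount of manual work.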
  • Step 114 Train the basic model.
  • the AI development platform continues to train the basic model to update the basic model.
  • the update here refers to: training the basic model in the AI development platform based on the annotation results after confirmation of the difficult examples and the aforementioned prompt template, so as to adjust the parameters in the basic model.
  • then, the samples in data set A are intelligently annotated again based on the current basic model B, difficult cases are manually confirmed, and the basic model is updated; that is, steps 112 to 114 are repeated.
  • the above-mentioned labeling process is not terminated until the accuracy rate of the automatic labeling of the basic model is higher than (or equal to) the threshold T in a certain step 112.
  • the proportion of non-difficult-example data obtained by Customer C will be higher than that obtained by Customer D.
  • accordingly, the number of difficult cases and the rounds of manual confirmation will be fewer; that is, the overall efficiency of automatic labeling will be higher.
  • Figure 5 is a flow chart of a method of data annotation and model training provided by the embodiment of this application.
  • the method is executed by the AI development platform.
  • the data annotation and model training methods of this application will be introduced below by taking the annotation of text data for a sentiment analysis task as an example.
  • Sentiment analysis is an important branch of natural language understanding. It mainly focuses on text fragments, automatically identifying whether a text fragment conveys a positive, negative or neutral evaluation. This is a text classification problem with the category labels positive, negative and neutral. For example, after watching a movie, users can leave their own evaluation of the movie on a group-buying website. Next, the embodiment of this application introduces the method based on the sentiment analysis task. The steps include:
  • Step 201 The AI development platform receives the first data set uploaded by the first user.
  • the first data set can be pre-collected by the first user based on actual application scenarios, or an open source data set that has been formed in the industry can be used. For example, the first user collected 800 movie reviews in advance as the first data set to be annotated.
  • users can pre-purchase object storage service (OBS) on the cloud platform, which is an object-based cloud storage service.
  • Users can store data sets under a certain path of OBS, and then, when using the data preprocessing 110 (for example, data annotation) function provided by the AI basic development platform, directly enter the OBS path in the user interface; the data in the data set is then read from OBS when intelligent annotation is performed later.
  • the user can also directly upload the first data set to be annotated in the user interface of the data annotation service.
  • Figure 6(a) is a schematic diagram of a user graphical interface for creating a task according to an embodiment of the present application, which is used to create this intelligent annotation task.
  • the user can directly "select" an existing OBS directory "obs/buckets/test", which has stored the data set previously uploaded by the user, or "create” a new OBS directory and upload 800 movie reviews.
  • Step 202 The AI development platform receives a first prompt template input by the first user, where the input first prompt template is used to describe the relationship between the data in the first data set and the annotation results.
  • the AI development platform 100 can provide intelligent annotation based on basic models, such as the Pangu NLP large model and the Pangu CV large model on Huawei Cloud.
  • the user can provide only a "prompt template" as a reference, and the basic model will perform intelligent annotation. That is, the user can directly start the intelligent annotation service without providing annotation samples. This approach is called zero-shot learning.
  • alternatively, in addition to the input prompt template, the user only needs to label a small number of samples (for example, 1 to 10) to quickly start intelligent labeling.
  • the first user directly selects the required task type "Text Sentiment Analysis" in the task-type drop-down box of the GUI.
  • after the task type is selected, the optional prompt template "Text X, the emotional polarity is <MASK>" appears in the "Prompt Template" column, and the first user can directly select this prompt template.
  • Figure 6(b) is a first user interface for creating a new prompt template provided by an embodiment of the present application.
  • the first user wants to design a prompt template to identify the emotional polarity of movie reviews.
  • the first user first selects the data type as "text", and then, after reading the AI development platform's "Format Description", designs a prompt template that better suits his needs: "Comment X, this movie is really <MASK> worth watching."
  • FIG 6(c) is another interface diagram for creating an intelligent annotation task provided by an embodiment of the present application. In this interface, the first user can also select the annotation mode as "few samples".
  • after clicking "Next", the first user enters the few-sample annotation interface in Figure 6(d). The first user provides several examples in the interface, such as "The movie is very touching, and the emotional polarity is <good>" and "The plot is boring, and the emotional polarity is <bad>". The AI development platform 100 performs few-sample learning based on these two examples and generates the content corresponding to <MASK> in the prompt template for the other samples in the data set, thereby directly predicting the data without the need for manual labeling.
  • Step 203 The AI development platform performs data annotation on the data in the first data set based on the basic model and the first prompt template.
  • the AI development platform 100 obtains basic model B (for example, a large NLP model) and directly performs inference on the data in the first data set based on the first prompt template input by the first user, thereby realizing automatic data annotation of the data set. For example, "I watched this movie and liked it very much. This movie is very <MASK>".
  • the basic model can predict that the word corresponding to <MASK> is most likely to be "good", and then map it to a "positive" evaluation.
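The mapping from the predicted <MASK> word to an annotation label is often called a verbalizer. The sketch below illustrates the idea; the word lists and helper name are assumptions for this example, not the platform's actual vocabulary.

```python
# Hypothetical verbalizer: map the word predicted at <MASK> to a label.
VERBALIZER = {
    "positive": {"good", "great", "touching"},
    "negative": {"bad", "boring", "terrible"},
}

def word_to_label(mask_word: str) -> str:
    for label, words in VERBALIZER.items():
        if mask_word in words:
            return label
    return "neutral"   # fall back when no listed word matches
```

In practice the model's output distribution over the whole vocabulary would be aggregated per label rather than matched against fixed word sets.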
  • the AI development platform 100 learns based on the basic model, the prompt template input by the user, and a small number of labeled samples, and infers the data in the first data set, thereby implementing automatic labeling of the data set.
  • the AI development platform 100 will also store the labeled data in the corresponding path of OBS.
  • Step 204 The AI development platform determines the first difficult case set in the first data set, and displays the first difficult case set to the first user through a display interface.
  • the first set of difficult cases includes one or more difficult cases.
  • the AI development platform 100 in the embodiment of this application introduces difficult example mining technology, which can identify which input data are difficult examples during the reasoning process of the basic model, that is, determine the first difficult example set, which includes one or more difficult examples.
  • the AI development platform 100 can provide one or more difficult examples to the user through the display interface.
  • the user can see the current data annotation results, as well as the number of difficult examples and the accuracy. As shown in Figure 6(e), 80 data items are identified as difficult examples by the system, and the accuracy of the basic model is 90%.
  • the accuracy of the automatic annotation of the current basic model can be defined as the proportion of non-difficult examples in the unlabeled data set A. For example, if, after the current basic model automatically labels data set A, the proportion of non-difficult examples in data set A is 90%, it can be understood that the accuracy of automatic labeling by the current basic model is 90%. It should be noted that the first user selected a specific first prompt template according to his own task type, and the basic model annotated automatically based on this prompt template; therefore, the automatic annotation accuracy here is specific to the current task type.
  • the accuracy of the automatic annotation of the current basic model can also be defined as its prediction accuracy on the test set B.
  • the first user can also upload a test set B in step 111. If the test accuracy of the current basic model on test set B is 85%, it can be understood that the accuracy of the automatic annotation of the basic model is 85%.
  • the first user can click "Settings" in Figure 6(e) to manually adjust the marked proportion of difficult cases and the threshold of difficult case coefficients. Please refer to the previous article for related content. After the first user confirms the current annotation result, he can click "Manual confirmation of difficult cases" to enter the manual confirmation interface of difficult cases in Figure 6(f).
  • the user can confirm the annotation of difficult examples provided by the AI development platform in the display interface (specifically including direct confirmation, modified confirmation, etc.).
  • Figure 6(f) is an interface for manual confirmation of difficult cases provided by the embodiment of the present application.
  • the annotation results of difficult cases include whether the evaluation of the movie conveyed by the text comments in the figure is positive or negative.
  • click "Confirm” directly; if the user does not agree with the result of automatic annotation, click "Modify".
  • Step 205 The AI development platform obtains the annotation results after the first user confirms the first difficult case set in the display interface.
  • the AI development platform obtains the annotation result of the first user's annotation and confirmation of the first difficult example.
  • the annotation results include different contents.
  • after the AI development platform provides the user with one or more difficult cases following the first intelligent annotation, the user only needs to mark and confirm the difficult cases and provide the confirmation results to the AI development platform; this helps the platform optimize the basic model, making the automatic annotation provided by the basic model more accurate next time.
  • the AI development platform will synchronize the confirmed difficult cases to the labeled first data set, that is, store them in the corresponding path of OBS.
  • the AI development platform can also convert the first user's to-be-confirmed difficult example set into a labeled difficult example set, a labeled non-difficult example set, an unlabeled difficult example set, or an unlabeled non-difficult example set based on the first user's label confirmation.
  • Step 206 Train the basic model according to the confirmed annotation results of the first difficult example to update the basic model.
  • specifically, the annotation of the first data set is first updated (for example, with the annotation generated by the automatic annotation in step 203), and the basic model is trained based on the updated (labeled) first data set to update the basic model.
  • "training the basic model to update the basic model" in this step is a general description and is not used to limit the model to only one update.
  • the AI development platform may have returned one or more rounds of difficult example sets for users to confirm, and trained the basic model based on the confirmed first data set.
  • the AI development platform 100 trains the basic model based on the first prompt template input by the first user in the previous step and the first data set (labeled) after manual confirmation of difficult cases, so as to update the basic model. For example, since the task type performed by the first user is "text sentiment analysis", the AI development platform fine-tunes some parameters of the large NLP model in the platform based on the aforementioned first data set (labeled) and the first prompt template to update the basic model.
  • Step 207 Based on the updated basic model, perform data annotation on the first data set again.
  • the first data set needs to be automatically labeled again.
  • the AI development platform 100 will return the labeling results, and the first user can see the completion status of this labeling, as well as the number and accuracy of difficult cases.
  • in another embodiment, the AI platform 100 first receives a second prompt template input by a second user, and performs data annotation on a second data set based on the updated basic model and the second prompt template. It can be seen that, after training, the basic model has accumulated knowledge from the first user's data; at this time, using the updated basic model to annotate the second user's second data set can improve accuracy to a certain extent.
  • Step 208 Determine whether the labeling accuracy of the updated basic model is lower than the threshold.
  • if the accuracy of this annotation is lower than the threshold, return to step 204 and repeat steps 204 to 208 until the accuracy of a certain annotation is not lower than the threshold. For example, if the annotation accuracy of the updated basic model is lower than the threshold, the updated basic model is trained again, thereby updating basic model B on the AI development platform once more.
  • otherwise, a response that the annotation is completed is returned (step 209).
  • the AI development platform will also determine whether the labeling accuracy of the updated basic model is lower than the threshold.
  • if the accuracy is not lower than the threshold, the system can directly return the annotation completion response (step 209).
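The closed loop of steps 204 to 208 can be sketched as a control-flow skeleton. All functions passed in here are hypothetical placeholders, not the platform's API; the stub simulation at the end simply makes the difficult-example count shrink each round.

```python
# Skeleton of the annotate -> mine -> confirm -> train loop of steps 204-208.
def annotation_loop(annotate, mine_difficult, confirm, train,
                    dataset, threshold=0.998, max_rounds=10):
    """Repeat until the proportion of non-difficult examples (step 208's
    accuracy measure) reaches the threshold, then return the labels."""
    labels = annotate(dataset)                 # step 203
    for _ in range(max_rounds):
        hard = mine_difficult(dataset, labels) # step 204
        if 1 - len(hard) / len(dataset) >= threshold:
            return labels                      # step 209: annotation done
        train(confirm(hard))                   # steps 205-206
        labels = annotate(dataset)             # step 207: annotate again
    return labels

# stub simulation: each round the model flags fewer difficult examples
rounds = {"n": 0}
def fake_annotate(ds): return ["label"] * len(ds)
def fake_mine(ds, labels):
    rounds["n"] += 1
    return ds[: max(0, 5 - 2 * rounds["n"])]
out = annotation_loop(fake_annotate, fake_mine, lambda h: h, lambda c: None,
                      list(range(1000)))
```

With 1000 samples and a 99.8% threshold, the stub loop stops as soon as at most 2 samples remain difficult.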
  • Step 209 Return the annotation completion response of the first data set.
  • Figure 6(g) is a schematic diagram of the user interface of a data annotation completion response given in the embodiment of the present application.
  • the accuracy at this time is as high as 99.9%; it can be considered that the reasoning performance of the updated basic model on the first data set and the text sentiment analysis task is excellent.
  • the completed labeled first data set returned in this step includes: the labeled first data set obtained after the difficult case confirmation results of step 205 are synchronized into the first data set.
  • the first dataset of completed annotation includes the results of automatic annotation and difficult example confirmation.
  • the completed annotated first data set returned in this step includes: the result of automatic annotation of the first data set by the updated basic model (i.e., the automatic annotation result of step 207).
  • Step 210 Obtain the labeled first data set and the first user's target requirements.
  • the embodiment of this application also provides a method of model construction and training, which can generate an AI model (i.e., a target model) that meets the first user's expected task based on the first user's target requirements.
  • the first user's target requirements may include: task type, model capability, where the model capability refers to the accuracy, performance, price and other requirements that the first user expects the target model to achieve.
  • Step 211 Based on the labeled first data set and the target requirement, train to obtain a target model, or distill the target model from the updated basic model.
  • the following two types of model construction/training methods are given:
  • model distillation: it is also called knowledge distillation, that is, using the aforementioned basic model as supervision information and the labeled first data set as training samples to train the target model (a lightweight small model), thereby transferring the knowledge of the basic model to the target model and improving its reasoning capability on the task type set by the user (such as a sentiment analysis task). Since user C's task type is text-based, this distillation is based on the NLP large model in the AI development platform.
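The core of knowledge distillation described above is training the small target model to match the basic model's softened output distribution. The following is a minimal numeric sketch; the temperature and logits are illustrative, and a real implementation would use a deep-learning framework over the labeled first data set.

```python
import math

# Teacher (basic model) supervises student (target model) via softened
# class probabilities; higher temperature exposes more of the teacher's
# knowledge about relative class similarities.
def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

loss_close = distillation_loss([4.0, 1.0], [3.9, 1.1])
loss_far = distillation_loss([4.0, 1.0], [1.0, 4.0])
```

A student whose outputs track the teacher's incurs a much smaller loss, which is the gradient signal that transfers the basic model's knowledge into the lightweight target model.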
  • Figure 6(h) is an interface for model structure distillation provided by the embodiment of the present application.
  • the first user again selects the OBS location and task type of the data set.
  • the obs/buckets/test1 path stores the annotated data of the first data set.
  • the first user can also set in the interface what kind of performance he expects the target model to be.
  • for example, the target model can be set as "high precision", "high performance" or "economical", where "economical" means that the comprehensive cost of the generated target model is low; that is, the first user can obtain the distilled target model by paying a reasonably low price.
  • in addition, during model distillation the first user can further set parameters for the accuracy and performance expected of the model. Specifically, accuracy can refer to conventional indicators such as model accuracy, precision and recall, while performance can refer to indicators such as computing time and space consumption.
  • the user can create a model training task and enter the parameters of the training job in the user interface, such as task type, input path, algorithm name, AI engine, number of computing nodes, training specifications and other parameters, where the input path refers to the OBS of the input data. path.
  • users can further set parameters for the accuracy and performance expected of the target model during model training. Specifically, accuracy can refer to conventional indicators such as model accuracy, precision and recall, while performance can refer to indicators such as computing time and space consumption.
  • the target model in the model distillation method provided in the embodiment of this application is obtained by performing knowledge distillation based on a basic model with a large number of parameters. Since the basic model has learned knowledge from different users and different tasks, the target model that meets the user's needs can be trained with higher efficiency.
  • the AI development platform when the AI development platform provides users with the services in the embodiments of this application, it can be divided into two parts, namely: intelligent annotation service and model training/model distillation service. Users can first purchase only intelligent annotation services on the cloud service platform, or only purchase model training/model distillation services. For example, after users purchase basic service cloud services, the cloud service provider provides APIs for these two types of services, and ultimately the intelligent annotation service and model training/model distillation service are billed additionally based on the number of API calls.
  • Figure 7 is an example of a data annotation device 300 (which can also be understood as an AI development platform 300) provided by an embodiment of the present application.
  • the device can also provide a model training function.
  • the device 300 can be implemented as part or all of the AI development platform 100 through software, hardware, or a combination of the two, and can be used to implement the methods in Figures 3 and 4 of the embodiment of the present application.
  • the device 300 includes: input and output IO module 301, data storage module 302, inference module 303, basic model storage module 304, difficult example mining module 305, basic model update module 306, model distillation module 307 and model training module 308.
  • the input and output IO module 301 is used to receive a first prompt template input by the first user through the display interface, where the input first prompt template is used to describe the relationship between the data in the first data set and the annotation results.
  • the user only needs to input the prompt template, and the basic model B can perform zero-shot learning based on the prompt model to perform intelligent annotation.
  • alternatively, while inputting the prompt template, the user can also input a small number of samples corresponding to the prompt template format to help basic model B perform few-sample learning for intelligent annotation.
  • the module 301 presets multiple prompt templates, and each preset prompt template corresponds to a business type; optionally, the prompt template can also be designed by the first user himself in the display interface.
  • the IO module 301 is also configured to receive the first data set uploaded by the first user, and store the first data set in the data storage module 302.
  • the first data set is an unlabeled data set.
  • users can purchase OBS services on the cloud platform in advance. Users can store the data set in a certain path of OBS. In this step, they only need to enter the path of OBS in the user interface. When performing intelligent annotation later, the data in the data set will be read from OBS. The user can also directly upload the first data set to be annotated in the user interface of the data annotation service.
  • the data storage module 302 is used to store the first data set uploaded by the first user.
  • the data storage module 302 may be an OBS service provided by the cloud platform.
  • the AI development platform 100 will also store the annotated first data in the corresponding path of the OBS.
  • the OBS service is another cloud service that is different from the AI development platform.
  • The inference module 303 is used to annotate the data in the first data set based on the updated basic model and the first prompt template, where the basic model is a pre-trained AI model deployed in the AI development platform.
  • The inference module 303 performs prompt learning based on the basic model and annotates data set A. Specifically: based on the first prompt template input in the preceding step, or based on the first prompt template and a small amount of sample data, it performs inference (i.e., prompt learning) on the data of the first data set, such as zero-shot or few-shot inference, and outputs the annotations corresponding to the unlabeled data.
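The zero-shot path of the inference module can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: `build_prompt`, `annotate`, and the `toy_model` stand-in for the deployed basic model's inference API are all hypothetical names introduced here for the example.

```python
def build_prompt(template: str, sample: str) -> str:
    """Instantiate a prompt template by substituting the input data for X."""
    return template.replace("X", sample)

def annotate(dataset, template, base_model):
    """Zero-shot smart annotation: the model fills the <MASK> slot per sample."""
    return {sample: base_model(build_prompt(template, sample)) for sample in dataset}

# Hypothetical stand-in for the pre-trained basic model's inference API.
def toy_model(prompt: str) -> str:
    return "positive" if "nice" in prompt else "negative"

labels = annotate(["This phone takes nice photos", "The battery dies fast"],
                  "X, the sentiment polarity is <MASK>", toy_model)
```

In practice the `base_model` callable would be an inference request against the deployed foundation model rather than a keyword heuristic.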
  • the inference module 303 is also configured to return an annotation completion response of the first data set when the annotation accuracy of the updated basic model reaches a threshold.
  • the basic model storage module 304 is used to store the basic model, where the basic model refers to a pre-trained large model deployed on the AI development platform 100.
  • This type of large model is usually trained from massive unlabeled data.
  • the parameter scale of the basic model is usually large and has excellent generalization capabilities.
  • According to the type of training data, large models fall mainly into two categories: natural language processing (NLP) large models and computer vision (CV) large models.
  • the large model may also include: multi-modal large model, scientific computing large model, etc. This application does not limit this.
  • the basic model in the embodiment of the application may be any of the large models described above.
  • The difficult example mining module 305 is used to determine one or more difficult examples in the first data set (i.e., the first difficult example set) and display the first difficult example set to the first user through a display interface.
  • The AI development platform 100 also introduces difficult example mining technology, which allows the platform to run a closed loop of inference, difficult example mining, training, and re-inference based on the pre-trained basic model B.
  • The AI development platform 100 uses a difficult example mining algorithm to rank the samples about whose predictions the model is uncertain (i.e., difficult examples), determines the difficult examples in the first data set and their attributes, and then presents them to the first user through the user interface so that the user can manually confirm and modify the annotation results of the difficult examples.
  • The difficult example mining module 305 is also used to: when the annotation accuracy of the updated basic model for the first data set is lower than a threshold, determine the second difficult example set in the first data set and display the second difficult example set to the first user through a display interface.
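One common way to rank the samples the model is least certain about is predictive entropy. The sketch below is a generic uncertainty-based illustration, not the platform's specific mining algorithm; `mine_hard_examples` and the sample ids are hypothetical.

```python
import math

def entropy(probs):
    """Predictive entropy of a class-probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mine_hard_examples(predictions, top_k=2):
    """Rank samples by uncertainty and return the top_k hardest ones.

    `predictions` maps a sample id to the model's class-probability vector.
    """
    ranked = sorted(predictions, key=lambda s: entropy(predictions[s]), reverse=True)
    return ranked[:top_k]

preds = {
    "s1": [0.98, 0.01, 0.01],   # confident prediction: easy sample
    "s2": [0.34, 0.33, 0.33],   # near-uniform: hard sample
    "s3": [0.60, 0.30, 0.10],
}
hard = mine_hard_examples(preds, top_k=2)
```

The hardest (most uncertain) samples are then routed to the user for confirmation.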
  • the AI development platform will synchronize the confirmed difficult cases to the labeled first data set, that is, store the confirmed results of the difficult cases in the corresponding path of OBS.
  • The basic model update module 306 is configured to train the basic model according to the results obtained after the first user confirms the annotations of the difficult examples, so as to update the basic model. Specifically, based on those confirmed results, the annotations of the first data set in OBS are first updated, and the basic model is then trained on the updated (labeled) first data set to update the basic model.
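The label-merge step that precedes retraining can be sketched as below. This is a simplified illustration of "confirmed results overwrite the automatic annotations"; the function and sample names are hypothetical, and in the platform the merged labels would be written back to the OBS path before training.

```python
def update_labels(auto_labels: dict, confirmed: dict) -> dict:
    """Overwrite automatic annotations with the user-confirmed difficult-example
    labels; the merged result is what the basic model is fine-tuned on."""
    merged = dict(auto_labels)
    merged.update(confirmed)   # confirmed results take precedence
    return merged

auto = {"img1": "cat", "img2": "dog", "img3": "cat"}
confirmed = {"img3": "fox"}    # the user corrected one difficult example
training_set = update_labels(auto, confirmed)
```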
  • the basic model update module 306 is also configured to train the updated basic model based on the result of the first user confirming the annotation of the second difficult example set.
  • The IO module 301 receives the second prompt template input by the second user, and the second data set is annotated based on the updated basic model and the second prompt template. After training, the basic model has accumulated knowledge from the first user's data; using the updated basic model to annotate the second user's second data set can therefore improve accuracy to a certain extent.
  • The model distillation module 307 is configured to: obtain the target demand of the first user, where the target demand includes a task type; and, based on the task type, perform knowledge distillation on the updated basic model to obtain the target model, where the target model is used to implement the task indicated by the task type.
  • the task types include: any one of text sentiment analysis, text classification, entity naming, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
  • The model training module 308 is configured to: obtain the labeled first data set and the first user's target demand, where the target demand includes a task type; and train a target model based on the labeled first data set and the target demand, the target model being used to implement the task indicated by the task type.
  • Any of modules 301 to 308 can be used to perform some or all of the steps in the methods of FIG. 4 and FIG. 5 of this application.
  • the reasoning module 303 can be used to perform any steps of the methods in the embodiments of this application.
  • other modules can also be used to perform any steps of the methods in the embodiments of this application.
  • The steps that modules 301 to 308 are responsible for implementing can be specified as needed. For example, module A, module B, and module C may respectively implement different steps of the methods in the embodiments of this application, thereby realizing all functions of the data annotation device 300.
  • Below, the implementation (software and hardware) of the inference module 303 is introduced; the implementation of the other modules in the device 300 can refer to that of the inference module 303.
  • a module is an example of a software functional unit.
  • the inference module 303 may be an application program or code block running on a computer device.
  • The computer device may be at least one of a physical host, a virtual machine, a container, and other computing devices; further, there may be one or more such computer devices.
  • The inference module 303 may be an application running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the application can be distributed in the same availability zone (AZ) or in different AZs, and in the same region or in different regions; usually, a region can include multiple AZs.
  • multiple hosts/virtual machines/containers used to run the application can be distributed in the same virtual private cloud (VPC) or across multiple VPCs.
  • the inference module 303 may include at least one computing device, such as a server.
  • Alternatively, a module may be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • The above PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • Multiple computing devices included in the inference module 303 may be distributed in the same AZ or in different AZs.
  • The multiple computing devices included in a module can be distributed in the same region or in different regions, and in the same VPC or across multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • Figure 8 shows a schematic structural diagram of a computing device 400.
  • The above data annotation device 300 can be deployed on the computing device 400.
  • The computing device can be a computing device in a cloud environment (such as a server), a computing device in an edge environment, a terminal device, or the like, and can specifically be used to implement the functions of the modules in the above device 300.
  • computing device 400 includes processor 401 , memory 402 , communication interface 403 and bus 404 .
  • the processor 401, the memory 402 and the communication interface 403 communicate through the bus 404.
  • the bus 404 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in Figure 8, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 403 is used to communicate with the outside, such as receiving original data provided by the first user and the feature extraction network model to be trained, etc.
  • the processor 401 can be a central processing unit (CPU), an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits.
  • the processor 401 may also be an integrated circuit chip with signal processing capabilities.
  • the functions of each module in the model training device can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 401 .
  • The processor 401 can also be a general-purpose processor, a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • the general processor can be a microprocessor or the processor can be any conventional processor, etc.
  • The steps of the method disclosed in the embodiments of this application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 402.
  • the processor 401 reads the information in the memory 402 and completes the functions of each module in the model training device in combination with its hardware.
  • Memory 402 may include volatile memory, such as random access memory (RAM).
  • The memory 402 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • the memory 402 stores executable code, and the processor 401 executes the executable code to perform the data annotation and model training methods proposed in the embodiments of this application, so as to realize the functions of the aforementioned modules 301 to 308 respectively.
  • the memory 402 also stores data required for execution of this method, such as the first data set and the basic model file.
  • FIG. 9 is a computing device cluster provided by an embodiment of the present application.
  • the computing device cluster includes at least one computing device 400, which may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • The memory 402 of one or more computing devices 400 in the computing device cluster may store the same instructions of the data annotation device 300 for executing the data annotation and model training methods proposed in the embodiments of the present application.
  • one or more computing devices 400 in the computing device cluster can also be used to implement the functions of some modules in the data annotation device 300, that is, to execute some instructions of the methods in the embodiments of the present application.
  • a combination of one or more computing devices 400 can jointly store instructions for modules in the data annotation device 300 to perform the methods of data annotation and model training proposed in the embodiments of this application.
  • Alternatively, the memories 402 in different computing devices 400 in the computing device cluster can store different instructions for executing part of the functions of the data annotation device 300. That is, the instructions stored in the memories 402 of different computing devices 400 can implement the functions of one or more of the IO module 301, the data storage module 302, the inference module 303, the basic model storage module 304, the difficult example mining module 305, the basic model update module 306, the model distillation module 307, and the model training module 308.
  • the memory 402 also stores data required for the execution of this method, such as the first data set and the model file of the basic model.
  • Figure 10 is a possible implementation manner of a computing device cluster provided by an embodiment of the present application.
  • Four computing devices 400A, 400B, 400C, and 400D are connected through a network, where the network may be a wide area network, a local area network, or the like.
  • the connection to the network is made through a communication interface in each computing device.
  • The instructions or program code stored in the memories 402 of the different computing devices 400 can implement the functions of one or more of the IO module 301, the data storage module 302, the inference module 303, the basic model storage module 304, the difficult example mining module 305, the basic model update module 306, the model distillation module 307, and the model training module 308.
  • Automatic annotation (basic model inference and update), model training, and difficult example mining can be provided as independent cloud services to users on the cloud platform 100, and users can purchase them separately; therefore, their functions may be implemented by different computing devices.
  • For example, the memory 402 in the computing device 400A stores the program code for executing the functions of the IO module 301, the inference module 303, the basic model storage module 304, and the basic model update module 306.
  • The computing device 400A is used to implement the automatic annotation function, which specifically includes: performing inference based on the basic model and the prompt template input by the user to automatically annotate the first data set, and updating the basic model based on the (labeled) first data set after the difficult examples are confirmed.
  • the memory 402 in the computing device 400B stores program codes for executing the functions of the model distillation module 307 and the model training module 308, which can implement model training and model distillation based on the labeled first data set.
  • The memory 402 in the computing device 400C stores the program code implementing the function of the difficult example mining module 305, which can run a closed loop of inference, difficult example mining, training, and re-inference based on the AI model.
  • the memory 402 in the computing device 400D stores program code that implements the function of the data storage module 302.
  • The data storage module 302 may be an OBS service used to store the first data set uploaded by the user; then, when the computing device 400A performs the functions of the inference module 303, it can read the data of the data set from OBS.
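The division of labor in the Figure 10 example can be represented as a simple lookup table. The device names and module identifiers below are hypothetical labels for this illustration only, not an API of the platform.

```python
# Hypothetical deployment plan mirroring the Figure 10 example: each computing
# device in the cluster hosts the program code of a subset of modules 301-308.
DEPLOYMENT = {
    "400A": ["IO_301", "inference_303", "model_store_304", "model_update_306"],
    "400B": ["distillation_307", "training_308"],
    "400C": ["hard_example_mining_305"],
    "400D": ["data_storage_302"],
}

def device_for(module: str) -> str:
    """Look up which computing device runs a given module."""
    for device, modules in DEPLOYMENT.items():
        if module in modules:
            return device
    raise KeyError(module)
```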
  • The functions performed by the computing device 400A shown in FIG. 10 may also be performed by multiple computing devices 400.
  • the functions of computing devices 400B, 400C, and 400D can also be completed by multiple computing devices 400 respectively.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • The computer-readable storage medium stores instructions which, when run on one or more computing devices, cause the one or more computing devices to execute the methods executed by the modules of the data annotation device of the above embodiments.
  • Embodiments of the present application also provide a computer program product.
  • When the computer program product is executed by one or more computing devices, the one or more computing devices execute any one of the foregoing methods.
  • The computer program product can be a software installation package; if any of the foregoing methods needs to be used, the computer program product can be downloaded and executed on a computer.
  • the device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • This application can be implemented by software plus necessary general-purpose hardware; of course, it can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software implementation is the better choice in most cases. Based on this understanding, the essence of the technical solution of this application, or the part that contributes over the existing technology, can be embodied in the form of a software product.
  • The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc of a computer, and includes several instructions to cause a computer device (which can be a personal computer, a training device, a network device, etc.) to execute the methods described in the various embodiments of this application.
  • a computer device which can be a personal computer, training device, or network device, etc.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), etc.


Abstract

This application provides a data annotation method, an AI development platform, and a computing device cluster. The method may be executed by an AI development platform. First, the AI development platform may annotate a first data set based on a prompt template input by a user and a basic model deployed in advance on the AI platform; then, it generates a display interface to show the first user the difficult example set in the first data set; finally, the AI platform trains the basic model according to the results obtained after the first user confirms the annotations of the first difficult example set, to obtain an updated basic model. The method can reduce the dependence on initially annotated data and the labor cost required for data annotation. It can also update the basic model deployed in the AI development platform using the confirmed results of difficult example mining, so that the basic model continuously absorbs new knowledge, its capability grows stronger, and its inference efficiency becomes higher.

Description

Data annotation method, AI development platform, computing device cluster, and storage medium
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a data annotation method, an AI development platform, a computing device cluster, and a storage medium.
Background Art
With the wide application of AI technology, a large amount of labeled data is needed for algorithm training, so labeling data efficiently and accurately has become a top priority.
On the one hand, in every field the cost of obtaining massive labeled data is very high, and labeling is currently done mainly by hand; how to reduce labeling cost is therefore an urgent problem.
On the other hand, different tasks use different models, and training a separate model on each task's data means that the knowledge accumulated from the labeled data of different tasks cannot be pooled, causing model and knowledge fragmentation; how to resolve this fragmentation so that knowledge can accumulate continuously is also an urgent problem.
Summary of the Invention
In view of this, embodiments of the present application provide a data annotation method that can reduce the dependence on initially annotated data and realize zero-shot and few-shot inference, further reducing the labor cost required for data annotation. The method also allows a basic model shared by different tasks to continuously absorb knowledge, so that the basic model becomes more and more capable and its inference efficiency becomes higher and higher. In addition, the method is well suited to the cloud, enabling lifelong learning: knowledge from different customers can be inherited and shared, achieving knowledge as a service and experience as a service. This application also provides a corresponding AI development platform, computing device cluster, computer-readable storage medium, and computer program product.
In a first aspect, an embodiment of this application provides a data annotation method. In the method: first, the AI development platform receives a first prompt template input by a first user, the first prompt template being used to describe the relationship between input data and annotation results; next, the AI development platform annotates a first data set based on the prompt template and a basic model deployed in advance on the AI platform; then, the AI development platform determines a first difficult example set in the first data set and generates a display interface to show the first difficult example set to the first user, the first difficult example set including at least one difficult example, and the user confirms the annotations of the first difficult examples in the display interface (correct annotations are confirmed directly; incorrect annotations are modified and then confirmed); finally, the AI platform trains the basic model according to the results obtained after the first user confirms the annotations of the first difficult example set, to obtain an updated basic model.
The method can start inference directly from a prompt template and the basic model, reducing the dependence on initially annotated data and the labor cost of data annotation. It can also update the basic model deployed in the AI development platform with the confirmed results of difficult example mining, so that the basic model continuously absorbs new knowledge, its capability grows stronger, and its inference efficiency becomes higher.
In an optional implementation, after the updated basic model is obtained, the data annotation method further includes: the AI development platform annotates the first data set based on the updated basic model; when the annotation accuracy of the updated basic model is higher than or equal to a threshold, it returns an annotation-complete response; or, when the annotation accuracy of the updated basic model is lower than the threshold, it determines a second difficult example set in the first data set and generates a display interface to show the second difficult example set to the first user, who confirms the annotations of the second difficult examples in the display interface; the AI development platform then trains the updated basic model according to the results obtained after the first user confirms the annotations of the second difficult example set, updating the basic model again. In other words, the method shows the user multiple rounds of difficult example sets according to the annotation accuracy and trains the basic model multiple times according to the user's confirmations, until the annotation accuracy of the basic model meets the target; through these iterations, the method continuously improves the annotation results while also continuously improving the inference capability of the basic model.
In an optional implementation, training the basic model according to the results obtained after the user confirms the annotations of the first difficult example set, to obtain the updated basic model, includes: training the basic model according to the results obtained after the first user confirms the annotations of the first difficult example set and the first prompt template, to obtain the updated basic model.
In an optional implementation, the data annotation method further includes: receiving a second prompt template input by a second user; and annotating a second data set based on the updated basic model and the second prompt template.
In an optional implementation, the data annotation method further includes: determining the labeled first data set according to the results obtained after the user confirms the annotations of the first difficult example set and the annotations of the non-difficult example set, where the annotations of the non-difficult example set are the annotations generated in the step of annotating the first data set based on the basic model and the prompt template, and the non-difficult example set is the set of data remaining in the first data set after the first difficult example set is removed.
In an optional implementation: obtaining a target demand of the first user, the target demand including a task type; and performing knowledge distillation on the updated basic model based on the target demand and the labeled first data set, to obtain a target model used to implement the task indicated by the task type.
In an optional implementation: obtaining the labeled first data set and the target demand of the first user, the target demand including a task type; and performing model training based on the labeled first data set and the target demand, to obtain a target model used to implement the task indicated by the task type.
In an optional implementation, the target demand further includes a model capability requirement, which is used to describe the precision or performance of the target model.
In an optional implementation, the task type includes any one of: text sentiment analysis, text classification, entity naming, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
In an optional implementation, the input first prompt template is preset in the AI development platform, where multiple prompt templates are preset in the AI development platform and each preset prompt template corresponds to a task type; or, the first prompt template is designed by the user in the display interface.
In a second aspect, an embodiment of this application provides an artificial intelligence (AI) development platform. The AI development platform includes multiple modules, and a combination of the multiple modules can implement the method described in the first aspect or any optional implementation of the first aspect.
In a third aspect, an embodiment of this application provides an AI development platform, which may include: an input/output (IO) module, configured to receive a first prompt template input by a first user, the first prompt template being used to describe the relationship between input data and annotation results; an inference module, configured to annotate a first data set based on a basic model and the prompt template, the basic model being deployed on the AI development platform; a difficult example mining module, configured to determine a first difficult example set in the first data set and generate a display interface to show the first difficult example set to the first user, the first difficult example set including at least one difficult example; and a basic model update module, configured to train the basic model according to the results obtained after the first user confirms the annotations of the first difficult example set, to obtain an updated basic model.
In an optional implementation, in the AI development platform: the inference module is further configured to annotate the first data set based on the updated basic model, and to return an annotation-complete response when the annotation accuracy of the updated basic model is higher than or equal to a threshold; the difficult example mining module is further configured to, when the annotation accuracy of the updated basic model is lower than the threshold, determine a second difficult example set in the first data set, generate a display interface to show the second difficult example set to the first user, and train the updated basic model according to the results obtained after the first user confirms the annotations of the second difficult example set.
In an optional implementation, the basic model update module is configured to train the basic model according to the results obtained after the first user confirms the annotations of the first difficult example set and the first prompt template, to obtain the updated basic model.
In an optional implementation, the IO module is further configured to receive a second prompt template input by a second user; and the inference module is further configured to annotate a second data set based on the updated basic model and the second prompt template.
In an optional implementation, the IO module is further configured to receive a second prompt template input by a second user; and the difficult example mining module is further configured to determine the labeled first data set according to the results obtained after the user confirms the annotations of the first difficult example set and the annotations of the non-difficult example set, where the annotations of the non-difficult example set are the annotations generated in the step of annotating the first data set based on the basic model and the prompt template, and the non-difficult example set is the set of data remaining in the first data set after the first difficult example set is removed.
In an optional implementation, the AI development platform further includes a model distillation module, configured to: obtain a target demand of the first user, the target demand including a task type; and perform knowledge distillation on the updated basic model based on the target demand and the labeled first data set, to obtain a target model used to implement the task indicated by the task type.
In an optional implementation, the AI development platform further includes a model training module, configured to: obtain the labeled first data set and the target demand of the first user, the target demand including a task type; and perform model training based on the labeled first data set and the target demand, to obtain a target model used to implement the task indicated by the task type.
In an optional implementation, the target demand further includes a model capability requirement, which is used to describe the precision or performance of the target model.
In an optional implementation, the task type includes any one of: text sentiment analysis, text classification, entity naming, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
In an optional implementation, the input first prompt template is preset in the AI development platform, where multiple prompt templates are preset in the AI development platform and each preset prompt template corresponds to a task type; or, the first prompt template is designed by the user in the display interface.
In a third aspect, this application provides a computing device cluster. The cluster includes at least one computing device, and the at least one computing device includes at least one processor and at least one memory; the at least one memory is used to store instructions, and the at least one processor executes the instructions stored in the at least one memory, so that the computing device cluster executes the data annotation method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, this application provides a computer-readable storage medium storing instructions which, when run on at least one computing device, cause the at least one computing device to execute the method of the first aspect or any implementation of the first aspect.
In a fifth aspect, this application provides a computer program product containing instructions which, when run on at least one computing device, cause the computing device cluster to execute the data annotation method of the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the basic functions of an AI development platform 100 provided by an embodiment of this application.
Figure 2 is a schematic diagram of a network architecture of an AI development platform 100 provided by an embodiment of this application.
Figure 3 is a schematic diagram of another network architecture of an AI development platform 100 provided by an embodiment of this application.
Figure 4 is a schematic diagram of a data annotation scheme on an AI development platform 100 given in an embodiment of this application.
Figure 5 is a flowchart of a data annotation and model training method provided by an embodiment of this application.
Figure 6(a) is a schematic diagram of a user interface for creating a smart annotation task provided by an embodiment of this application.
Figure 6(b) is a schematic diagram of a user interface for creating a new prompt template provided by an embodiment of this application.
Figure 6(c) is a schematic diagram of another user interface for creating a smart annotation task provided by an embodiment of this application.
Figure 6(d) is a schematic diagram of a few-shot annotation user interface provided by an embodiment of this application.
Figure 6(e) is a schematic diagram of a user interface showing the annotation results provided by an embodiment of this application.
Figure 6(f) is a schematic diagram of a user interface for manual confirmation of difficult examples provided by an embodiment of this application.
Figure 6(g) is a schematic diagram of a user interface of a data-annotation-complete response provided by an embodiment of this application.
Figure 6(h) is a schematic diagram of a model distillation user interface provided by an embodiment of this application.
Figure 7 is a schematic diagram of a data annotation device 300 provided by an embodiment of this application.
Figure 8 is a schematic structural diagram of a computing device 400 provided by an embodiment of this application.
Figure 9 is a schematic diagram of a computing device cluster provided by an embodiment of this application.
Figure 10 is a schematic diagram of an implementation of a computing device cluster provided by an embodiment of this application.
Detailed Description
First, to facilitate understanding of the technical solutions and embodiments provided in this application, the concepts of AI development platform, AI model, data annotation, difficult example, difficult example mining, and so on are explained here:
AI development platform: a platform that provides AI developers and users with a convenient AI development environment and handy development tools. Based on the user's own algorithm and training image set, the AI development platform can train an AI model that meets the user's needs, and the user can use the trained AI model to complete a specific task. In this process, the AI development platform can provide the user with services such as data annotation, model training, model optimization, and model deployment.
Foundation models: a foundation model is a pre-trained AI model with a very large number of parameters. A foundation model can be fine-tuned to adapt to a variety of downstream task models; in other words, it is the "foundation" of downstream tasks, hence the name. Because the parameter scale of a foundation model is usually large, it may in some cases also be called a large model. Such large models are trained from massive unlabeled data, and their parameter counts are usually above one billion; for example, Huawei Cloud's Pangu CV large model currently has 3 billion parameters, and the Pangu NLP large model even reaches 100 billion parameters.
Data annotation: the process of adding labels appropriate to a given scenario to unlabeled data. For example, if the unlabeled data are unlabeled images, in an image classification scenario a category is added to each unlabeled image, and in an object detection scenario position information and a category are added to each target in the unlabeled images. Annotated data carry labels; when the data are used as input for training an AI model, the parameters of the AI model can be adjusted according to the labels.
Automatic data annotation: data annotation completed using active learning. The core idea is to randomly select a portion of the data for annotation, train a model on the annotated data, and determine a confidence threshold on a validation data set; the model then predicts on the unlabeled data and generates a confidence score for each sample. Data above the confidence threshold are treated directly as automatically labeled, while data below the threshold are sent back to the user for re-annotation. This automatic annotation process can partially reduce the amount of manual labeling.
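The confidence-threshold split at the heart of this active-learning scheme can be sketched as follows. This is a minimal illustration under the assumption that each prediction carries a scalar confidence score; the function and sample names are hypothetical.

```python
def split_by_confidence(scored, threshold=0.9):
    """Partition predictions: confident ones become automatic labels,
    the rest are routed back to the user for manual re-annotation."""
    auto, manual = {}, []
    for sample, (label, confidence) in scored.items():
        if confidence >= threshold:
            auto[sample] = label
        else:
            manual.append(sample)
    return auto, manual

scored = {"a": ("cat", 0.97), "b": ("dog", 0.55), "c": ("cat", 0.92)}
auto, manual = split_by_confidence(scored)
```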
Difficult example: in training, evaluation, or inference, input data for which an AI model cannot give a sufficiently accurate inference result are called difficult examples of that AI model. For example, during training, input data whose loss between the predicted result and the label exceeds a certain threshold are treated as difficult examples. During inference, if data D from the inference data set is input to the AI model and the error rate of the output result is higher than a target threshold, data D is a difficult example. In one scenario, an AI model can also be used to intelligently annotate unlabeled data; using the AI model for smart annotation is in fact also an inference process, and input data that are annotated incorrectly or with a high error rate are determined to be difficult examples.
Difficult example mining: a method for determining that a piece of data (e.g., an image) is a difficult example.
Small-sample learning: the problem of training an AI model that can effectively recognize targets given only a small number of training samples of those targets, so as to obtain a model that classifies test samples accurately. According to the number of training samples, small-sample learning can be divided into three categories: a) few-shot learning: model training with on the order of tens of training samples; b) one-shot learning: only one training sample, the mode closest to the way humans learn; c) zero-shot learning: prediction without any labeled data, aiming to predict classes that never appeared in the training data set.
Model fine-tuning: based on a pre-trained large model, fine-tune the parameters of the fully connected layer or the top few layers of the neural network on a small data set to obtain a fine-tuned model that matches a different downstream task, significantly improving the fine-tuned model's performance on that task. Model fine-tuning achieves the goal of solving the target problem with less domain-specific data and without a full training procedure.
Prompt template: a "prompt" is context given to the model along with the input; it tells and guides the model what task it should do next. Put differently, the prompt reshapes the downstream task into the form the pre-trained model expects. In data annotation, a prompt template can be understood as how we expect the large model to annotate the data; in other words, it is a template describing the relationship between the data and the annotation result, where the "relationship" can be a contextual or other logical relationship. It is like building a cloze (fill-in-the-blank) template: the large model only needs to fill in the blank. For examples of prompt templates, see Table 1 below.
Prompt learning: currently applied mainly in the NLP field. Without significantly changing the structure or parameters of the pre-trained language model (the basic model), it converts the downstream task into a text generation task by adding prompt information to the input. A pre-trained language model contains a great deal of knowledge and patterns, some ready to use directly and some that need certain methods to be "elicited". Prompt learning can be applied to task types such as knowledge probing (factual and linguistic probing), classification tasks (text classification and natural language inference), information extraction (relation extraction, semantic parsing, and named entity recognition), reasoning in NLP (commonsense and mathematical reasoning), question answering, text generation, automatic evaluation of text generation, multimodal learning, and meta-applications (domain adaptation, debiasing, and dataset creation); this application places no restriction on the task type.
Knowledge distillation: a common method of model compression. Unlike pruning and quantization in model compression, knowledge distillation transfers the knowledge of one network to another by building a lightweight small model; the two networks can be homogeneous or heterogeneous. The practice is to first train a teacher network, and then use the teacher network's outputs together with the data's true labels to train a student network. Through distillation training, the student model acquires the teacher's knowledge: at the cost of a slight performance loss, the knowledge of a complex teacher model can be transferred to a simple student model to obtain better performance on downstream tasks.
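The core term of a distillation objective, the cross-entropy between the teacher's temperature-softened outputs and the student's, can be sketched as follows. This is a generic textbook formulation, not the platform's specific distillation procedure; the logit values are made up for illustration.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher T softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between the teacher's softened outputs and the student's,
    the soft-target term of a knowledge-distillation objective."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# A student that matches the teacher incurs a lower loss than one that disagrees.
loss_same = kd_loss([4.0, 1.0, 0.5], [4.0, 1.0, 0.5])
loss_diff = kd_loss([4.0, 1.0, 0.5], [0.5, 1.0, 4.0])
```

In a full objective this term is typically combined with the ordinary cross-entropy on the true labels.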
Below, this specification introduces the application from several aspects: the basic function modules of the AI development platform (Figure 1), the network structure of the AI development platform (Figures 2 and 3), the data annotation flow on the AI development platform (Figure 4), the method flowchart of this application (Figure 5), schematic diagrams of the graphical user interface (Figures 6a to 6h), the software device (Figure 7), and the hardware structure (Figures 8 to 10).
Figure 1 is a schematic diagram of the basic functions of an AI development platform 100 given in an embodiment of this application. The AI development platform 100 is a PaaS cloud service on a cloud platform: a software platform, built on the large amount of basic resources and software capabilities owned by the cloud service provider, that assists users (also called tenants, AI developers, etc.) in building, training, and deploying AI models and in developing and deploying AI applications.
As shown in Figure 1, the basic capabilities provided by the AI development platform 100 can include the following six parts: data preprocessing 110, model building and training 120, model management 130, model deployment 140, data optimization 150, and model optimization and update 160. Each functional module is introduced as follows:
Data preprocessing 110: users can perform one or more operations on a data set as needed, such as data selection, data annotation, data augmentation, data cleaning, and feature analysis. Data annotation is the most important step in data preprocessing 110. The data to be annotated usually refer to the data set needed to train an AI model; this data set can be collected in advance by the user according to the actual application scenario and uploaded to the platform 100, or an existing open-source data set from the industry can be used. For the details of data annotation 111, see the description of Figure 3 below.
Model building and training 120: building and training AI models is a key capability of the AI basic development platform, mainly: (1) based on the user's goals (e.g., task type, target precision), automatically selecting a built-in initial model in the AI basic development platform for the user and training it to obtain an AI model that meets the user's goals; (2) based on the user's goals and an initial AI model provided by the user or selected by the user on the AI basic development platform, training the initial AI model to obtain an AI model that meets the user's goals; (3) based on the user's goals, the AI basic development platform uses a back-end neural architecture search algorithm to automatically find a suitable AI model and trains it to obtain an AI model that meets the user's goals.
Of the above three approaches, the first two mainly use the computing power of the cloud environment to train the AI model, while the third includes both the search of the AI model architecture and the training of the AI model. The principle of AI model training is not repeated here.
Model management 130: the AI basic development platform also provides model management functions; the models can come from the previously trained AI models as well as the user's own AI models. Unified model management includes model evaluation and diagnosis, optimization, conversion, and so on. Model evaluation mainly uses at least one evaluation metric to measure the performance of a trained AI model; for example, the accuracy of the trained AI model's inference results on an evaluation data set can be computed.
Model deployment 140: the aforementioned target AI model can be deployed on nodes in a cloud environment or nodes in an edge environment, where nodes in the cloud environment can be virtual machine instances, container instances, physical servers, and the like. On the one hand, when the AI model is large, it can be deployed in a distributed manner across multiple nodes based on the idea of model parallelism. On the other hand, the AI model can also be deployed independently on multiple nodes to support a large volume of online service traffic. Nodes in an edge environment can be various edge devices.
A deployed AI model can become an AI application or part of one. As shown, users can access the AI application online through a web page or through a client app. When the AI application is used, the AI model deployed in the edge or cloud environment can be invoked online to provide a response. In this way, an AI model developed and trained through the AI basic development platform can perform inference on online request data and return inference results. When an AI model provides online services, the cloud platform can charge by the number of AI model invocations or by the resource consumption of AI model inference.
It should be understood that, in other cases, the AI model developed and trained by the aforementioned AI basic development platform may not be deployed online; instead, the user may download the trained AI model locally and deploy it freely. For example, the user can choose to save the trained AI model to OBS and then download the AI model from OBS to a local machine.
In other cases, after user 1 has trained an AI model using the aforementioned AI basic development platform 100, the model can be published to the AI marketplace, where it can be subscribed to and used by other users; for example, the functions of the AI model can be integrated into other users' AI applications.
Based on the various functions above, users can complete AI model development and AI application deployment and management on the AI basic development platform 100. The capabilities of the AI basic development platform can be integrated for the user's full AI workflow, or provided to users as independent functions.
Figure 2 is a schematic diagram of a network architecture of an AI development platform 100 provided in an embodiment of this application.
Because the AI development platform is in practice sold as software capabilities bundled with virtualized hardware basic resources, the basic resources supporting any one process of the AI development platform may be distributed across different physical devices. The AI development platform 100 can be deployed independently on a server or virtual machine in a data center of the cloud environment, or deployed in a distributed manner on multiple servers or multiple virtual machines in the data center.
In one possible implementation, the data center in Figure 2 is the central cloud data center of the cloud service provider.
As shown in Figure 2, the user's interaction with the AI development platform 100 mainly takes the following form: the user logs in to the cloud platform through a client web page and selects and purchases the cloud service of the AI development platform 100 on the cloud platform. Usually, the user first purchases a prepaid package and can then use the capabilities provided by the AI basic development platform and the basic computing resources included in the prepaid package for data annotation and for model building, training, deployment, and so on. After purchase, the user can carry out full-process AI development based on the functions provided by the AI development platform 100. When users develop and train their own AI models on the AI platform, this is done on the basic resources in the cloud service provider's data center (including computing, storage, and network resources, where computing resources include CPUs, GPUs, NPUs, etc.). When resource usage exceeds the quota of the current prepaid package, the cloud platform charges for the excess on a pay-per-use basis. When using the cloud service of the AI development platform, the user can specify, through an application program interface (API) or a graphical user interface (GUI), the task the AI model should complete, and upload the unlabeled data set to the cloud environment. The AI development platform 100 in the cloud environment receives the user's task information, unlabeled image set, and so on, performs data preprocessing and AI model training, and uses the trained AI model for inference. The trained AI model can be downloaded by the user or deployed online to complete a specific task.
In another embodiment, the aforementioned data center may also include an edge data center provided by the cloud service provider to users.
Figure 3 is a schematic diagram of another network architecture of an AI development platform 100 provided by an embodiment of this application. The AI development platform 100 in Figure 3 can also be deployed in a distributed manner across different environments. The AI development platform 100 can be logically divided into multiple parts, each with a different function. For example, one part of the AI development platform 100 can be deployed on computing devices in an edge environment (also called edge computing devices), and another part on devices in a cloud environment. The edge environment is an environment geographically close to the user's terminal computing devices, and includes edge computing devices such as edge servers and edge stations with computing capabilities.
For example, in a scenario combining a public cloud and a private cloud, resources in the public cloud run the model building and training 120 and model management 130 functions of Figure 1 provided by the AI development platform, while resources in the private cloud run the data storage (OBS) and data preprocessing 110 functions provided by the AI development platform, which provides stronger security for the user's data. In this scenario, the public cloud resources can come from the central cloud data center and the private cloud resources from an edge data center.
Figure 4 is a schematic diagram of a data annotation scheme on an AI development platform 100 given in an embodiment of this application. The flow mainly includes: the user inputs a prompt template 111, the AI development platform performs smart annotation 112, difficult examples are manually confirmed 113, and the basic model is trained 114.
Traditional data annotation is usually done manually. Because the data to be annotated are usually massive, this consumes considerable human resources. Although existing AI development platforms can automatically annotate unlabeled data through an active learning model, even automatic annotation requires an initial portion of the data to be manually annotated for model training, which is inefficient; the high cost of manual annotation leads to a high cold-start cost.
The data annotation flow in the embodiments of this application includes: first, the user inputs a prompt template 111, and the AI development platform 100 performs smart annotation 112 on data set A based on basic model B; at the same time, the AI development platform 100 performs difficult example mining on the smart annotation 112. Next, the AI development platform 100 sends the difficult examples to the user for manual confirmation 113. Then, the AI development platform 100 continues training basic model B based on the confirmed results of the difficult examples 114, learning the new knowledge brought by data set A, and performs smart annotation 112 again based on the trained basic model. In the flow shown in Figure 4, the AI development platform 100 keeps repeating smart annotation 112, manual confirmation of difficult examples 113, and basic model training 114 until the accuracy of smart annotation 112 meets the condition.
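This closed loop (smart annotation, difficult example mining, manual confirmation, retraining, re-annotation) can be sketched as follows. This is a minimal illustration only: the model interface (`annotate`, `mine_hard_examples`, `train`), the `confirm` and `accuracy_of` callbacks, and the toy data are all hypothetical names introduced for the example, not the platform's API.

```python
def annotation_loop(dataset, model, confirm, accuracy_of, threshold=0.998,
                    max_rounds=5):
    """Repeat steps 112-114: annotate, mine difficult examples, let the user
    confirm them, retrain, and re-annotate until accuracy reaches the threshold."""
    labels = {}
    for _ in range(max_rounds):
        labels = model.annotate(dataset)          # step 112: smart annotation
        if accuracy_of(labels) >= threshold:
            return labels                         # annotation-complete response
        hard = model.mine_hard_examples(dataset)  # difficult example mining
        labels.update(confirm(hard))              # step 113: manual confirmation
        model.train(labels)                       # step 114: update basic model
    return labels

class ToyModel:
    """Minimal stand-in that starts out wrong and learns from confirmed labels."""
    def __init__(self):
        self.known = {}
    def annotate(self, data):
        return {s: self.known.get(s, "neg") for s in data}
    def mine_hard_examples(self, data):
        return [s for s in data if s not in self.known]
    def train(self, labels):
        self.known.update(labels)

truth = {"a": "pos", "b": "neg"}
model = ToyModel()
result = annotation_loop(
    ["a", "b"], model,
    confirm=lambda hard: {s: truth[s] for s in hard},
    accuracy_of=lambda ls: sum(ls[s] == truth[s] for s in truth) / len(truth),
)
```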
一种可能的实施方式中,第一次进行智能标注112(或后文的步骤203)后的标注准确率就满足条件,例如准确率达到阈值99.8%,此时,可以不进入难例人工确认113的步骤,而直接返回标注完成响应。这种方式是一种理想情况,可能发生在该数据标注功能模块上线AI开发平台较长时间后,由于基础模型B持续吸收知识,模型能力越来越强,零样本效果越来越好,效率越来越高,甚至可以一次性完成较高准确率的智能标注。
Specifically, steps 111-114 are described as follows:
Step 111: input a prompt template.
According to the annotation task the user wants to accomplish, user C selects or inputs a specific prompt template in the GUI. Table 1 gives some examples of task types and prompt templates:
Table 1
Task type | Prompt template
Text sentiment analysis | X, the sentiment polarity is <MASK>
Named entity recognition | Which entities like Z appear in X? Answer: <MASK>
Text classification | X is <MASK>-category news
Sound classification | X is the sound of <MASK>
Speech content recognition | The content of X is <MASK>
Image classification | Is X a kind of Z? Answer: <MASK>
Object detection | The coordinates of object Z in X are <MASK>
Video annotation | The coordinates of object <MASK> in X are <MASK>
Here, X represents the input data, which may be text, an image, audio, or video, and <MASK> represents the output, i.e., the annotation result. Note that the above examples are illustrative only and do not limit the format of the prompt templates in this application; in some cases, X or <MASK> may also be omitted. Some specific examples follow:
For natural language understanding tasks such as text sentiment analysis, text classification, and named entity recognition, X in Table 1 represents an input sentence, and directive phrases such as "sentiment polarity" and "news" serve as prompts. The NLP large model can combine the contextual meaning of the prompt template to output the result corresponding to <MASK>. For example, in a text sentiment analysis task where X is "This phone takes great photos", the NLP model completes the prompt template to obtain "This phone takes great photos, the sentiment polarity is <good>", where "good" is the output corresponding to <MASK>. As another example, if a customer uploads news named-entity-recognition data, the customer can select the named entity recognition template "Which entities like Z appear in X? Answer: <MASK>", where X is the original text, Z is an example entity word of some entity category, and <MASK> is the content to be generated from the prompt template.
For computer vision tasks such as image classification, object detection, and object recognition: suppose the user's dataset contains animals and the user wants a classification task that recognizes a black dog. The user can provide the prompt template "Is X a kind of Z? Answer: <MASK>". This is similar to the natural language understanding case, except that the input X is now an image rather than a sentence, and Z may also be an input image, for example one containing a black dog. Among the 400 million images the CV large model has seen, there are likely photos of dogs similar to image Z.
In one possible embodiment, the user only needs to input the prompt template, and basic model B can perform zero-shot learning based on the prompt template to carry out intelligent annotation.
In another possible embodiment, in addition to inputting the prompt template, the user may also input a small number of samples matching the template format, helping basic model B perform few-shot learning for intelligent annotation.
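As a concrete illustration of how a prompt template like those in Table 1 is instantiated, the following sketch substitutes the input X (and an optional example Z) while leaving the <MASK> slot for the basic model to complete. The helper name `fill_template` is a hypothetical illustration, not part of the platform's API:

```python
# Hypothetical helper: instantiate a Table-1 style prompt template.
# X is the input sample, Z an optional example word, and <MASK> stays
# in place for the basic model to fill in during inference.
def fill_template(template: str, x: str, z: str = "") -> str:
    return template.replace("X", x).replace("Z", z)

prompt = fill_template("X, the sentiment polarity is <MASK>",
                       "This phone takes great photos")
# prompt == "This phone takes great photos, the sentiment polarity is <MASK>"
```

The same substitution applies to the CV templates, with X and Z standing for image references rather than strings.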
Step 112: intelligent annotation.
The AI development platform performs prompt learning based on basic model B to annotate dataset A. Specifically, based on the prompt template input in step 111, or on the prompt template together with a small number of samples, the platform performs inference (i.e., prompt learning, such as zero-shot or few-shot inference) on the unannotated data in dataset A and outputs the annotation results corresponding to the unannotated data.
Basic model B is a pretrained large model deployed on the AI development platform. Such large models are usually trained on massive amounts of unannotated data; a basic model typically has a large number of parameters and excellent generalization ability. Depending on the type of training data, large models fall mainly into two categories: natural language processing (NLP) large models and computer vision (CV) large models. Large models may also include multimodal large models, scientific computing large models, and so on; this application does not limit this, and basic model B in the embodiments of this application may be any of the large models described above.
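The zero-shot inference of step 112 can be sketched as scoring candidate label words for the <MASK> slot and mapping the winner to an annotation label through a verbalizer. In the sketch below, the scoring function is a toy stand-in for the pretrained basic model's word probabilities, and the verbalizer mapping is an illustrative assumption rather than a platform API:

```python
# Verbalizer: maps label words the model may emit at <MASK> to task labels.
VERBALIZER = {"good": "positive", "bad": "negative"}

def score_word(prompt: str, word: str) -> float:
    # Toy stand-in for the basic model's score of `word` at the <MASK> slot;
    # a real platform would query the pretrained NLP large model here.
    return 1.0 if word == "good" and "great" in prompt else 0.1

def zero_shot_label(prompt: str) -> str:
    # Pick the best-scoring label word, then map it to the annotation label.
    best = max(VERBALIZER, key=lambda w: score_word(prompt, w))
    return VERBALIZER[best]

label = zero_shot_label("This phone takes great photos, "
                        "the sentiment polarity is <MASK>")
# label == "positive"
```

Few-shot annotation follows the same pattern, except that the scoring model is first conditioned on the handful of user-provided examples.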
Step 113: manual hard-example confirmation.
The AI development platform 100 also introduces hard-example mining technology, enabling a closed loop of inference, hard-example mining, training, and re-inference based on the pretrained basic model B.
In one possible implementation, after the basic model performs inference (intelligent annotation), the AI development platform uses a hard-example mining algorithm to rank the samples about which the model's predictions are least certain (i.e., hard examples), determines the hard examples in the dataset and their attributes, and then presents them to the user through the user interface program for manual annotation; the proportion of hard examples to be annotated can be adjusted manually.
In one possible implementation, the AI development platform 100 may use one or more of a temporal-consistency algorithm, a data-feature-distribution-based algorithm, a data-augmentation-consistency-based algorithm, an uncertainty-based algorithm, a clustering-based algorithm, or an anomaly-detection-based algorithm to determine the hard examples in the unannotated dataset and the attributes of each hard example. When the AI development platform uses multiple algorithms to determine hard examples and their attributes, different algorithms are given different weights, and different features are also given different weights.
Optionally, the user can see the number of hard examples to be confirmed and the proportion of hard or non-hard examples in the unannotated dataset A, and can thereby judge whether the current basic model's inference performance meets requirements: the smaller the proportion of hard examples in the unannotated dataset, the better the AI model's inference performance.
Optionally, the user can also see the current accuracy of intelligent annotation.
In one possible implementation, based on the basic model's inference results, the AI development platform determines the hard examples in the unannotated dataset and the attributes of each hard example, where the attributes include a hard-example coefficient describing the degree of difficulty. For example, the coefficient may be a number between 0 and 1 reflecting how hard the example is (e.g., how difficult it is for the AI model to classify or detect it correctly): the larger the coefficient, the harder the example, and vice versa. The AI development platform sorts the hard examples and, according to a configured annotation proportion (or coefficient threshold), sends at least some of them to the user. For example, the platform may set the coefficient threshold to 0.6; in other words, only hard examples with a coefficient greater than 0.6 are returned to the user for confirmation.
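One common realization of the hard-example coefficient described above is uncertainty-based: the coefficient is taken as 1 minus the model's top prediction confidence. This is only one of the candidate definitions (the listed algorithms suggest others), and the function names below are hypothetical:

```python
# Sketch: select hard examples whose difficulty coefficient
# (here 1 - top prediction confidence) exceeds a threshold such as 0.6,
# returned sorted hardest-first for manual confirmation.
def hard_examples(confidences, threshold=0.6):
    coeffs = {sample: 1.0 - conf for sample, conf in confidences.items()}
    hard = [(s, k) for s, k in coeffs.items() if k > threshold]
    return sorted(hard, key=lambda item: item[1], reverse=True)

ranked = hard_examples({"review-1": 0.95, "review-2": 0.30, "review-3": 0.10})
# review-3 (coefficient ~0.9) and review-2 (~0.7) are returned, hardest first;
# review-1 (coefficient ~0.05) is treated as a non-hard example.
```

A weighted combination of several such scores would model the multi-algorithm case mentioned above.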
Step 114: train the basic model. After the manual hard-example confirmation of step 113, the AI development platform continues training the basic model to update it. Specifically, updating here means training the basic model on the AI development platform based on the confirmed hard-example annotation results and the aforementioned prompt template, so as to adjust the parameters of the basic model.
Optionally, after the model update of step 114 is completed, the platform again performs intelligent annotation on the samples in dataset A based on the current basic model B, followed by manual hard-example confirmation and another update of the basic model; that is, steps 112-114 are repeated until, in some round of step 112, the accuracy of the basic model's automatic annotation is higher than (or equal to) a threshold T, at which point the above annotation workflow terminates.
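The closed loop just described (annotate, confirm hard examples, continue training, re-annotate until the accuracy threshold T is reached) can be outlined as follows. `StubModel` is a stand-in whose accuracy simply improves with each training round; the real basic model and its training live on the platform:

```python
class StubModel:
    """Stand-in for basic model B: accuracy improves after each training round."""
    def __init__(self):
        self.accuracy = 0.80

    def annotate(self, dataset):                 # step 112: intelligent annotation
        return ["auto-label"] * len(dataset), self.accuracy

    def confirm_hard_examples(self, dataset):    # step 113: manual confirmation
        return dataset[:1]                       # pretend one hard example confirmed

    def train(self, confirmed):                  # step 114: continue training
        self.accuracy += 0.10

def annotation_loop(model, dataset, threshold=0.95, max_rounds=10):
    labels, accuracy = model.annotate(dataset)
    for _ in range(max_rounds):
        if accuracy >= threshold:                # stop once threshold T is reached
            break
        confirmed = model.confirm_hard_examples(dataset)
        model.train(confirmed)
        labels, accuracy = model.annotate(dataset)
    return labels, accuracy

labels, acc = annotation_loop(StubModel(), ["s1", "s2", "s3"])
```

The `max_rounds` cap is an added safeguard not stated in the text; the described workflow terminates only on the accuracy condition.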
After the data annotation in FIG. 4 is completed, there are two ways to generate a target model for the target task type: (1) distill a target model, i.e., obtain a small model (the target model) by distillation from the basic model; or (2) train a target model, i.e., train a target model based on the annotated data.
It should be noted that different tasks from different customers of the AI development platform arrive at different times, and all users' tasks share the same basic model, so knowledge from different customers can be inherited and shared, realizing knowledge-as-a-service and experience-as-a-service. For example, in the scenario of FIG. 4, user C starts annotating dataset A at time T, while at an earlier time T-1 another prior customer D has already run the data annotation workflow on another dataset, based on which the AI development platform trained and thereby updated the basic model. Hence the basic model at time T and basic model B at time T-1 are necessarily different (some parameters have changed), and the later basic model is more robust and more capable than the earlier one, with an especially large performance gain on similar tasks. For instance, if both customer C and customer D run sentiment analysis annotation tasks, the proportion of non-hard examples obtained by customer C will be higher than it was for the prior customer D, fewer hard-example confirmations and fewer rounds will be needed, and the overall efficiency of automatic annotation will be higher.
FIG. 5 is a flowchart of a data annotation and model training method provided by an embodiment of this application. The method is executed by the AI development platform. The data annotation and model training method proposed in this application is described below using sentiment analysis as the task type and the annotation of text data as the example.
Sentiment analysis is an important branch of natural language understanding. Given a text fragment, it automatically identifies whether the fragment expresses a positive, negative, or neutral evaluation; this is a text classification problem whose class labels are positive, negative, and neutral. For example, after watching a movie, a user may leave a review of the movie on a group-buying website. The method of the embodiments of this application is introduced below based on the sentiment analysis task, with the following steps:
Step 201: the AI development platform receives a first dataset uploaded by a first user.
The first dataset may be collected in advance by the first user according to the actual application scenario, or an existing open-source dataset may be used. For example, the first user has collected 800 movie reviews in advance as the first dataset to be annotated.
In some embodiments, the user may purchase the object storage service (OBS) on the cloud platform in advance. OBS is an object-based cloud storage service; the user can store the dataset under an OBS path and then, when using the data preprocessing 110 function (e.g., data annotation) of the AI development platform, simply enter the OBS path in the user interface, so that the data in the dataset is read from OBS when intelligent annotation is later performed. Optionally, the user may also upload the first dataset to be annotated directly in the user interface of the data annotation service.
FIG. 6(a) is a schematic diagram of a graphical user interface for creating the current intelligent annotation task according to an embodiment of this application. First, the user may directly "select" an existing OBS directory, "obs/buckets/test", under which a previously uploaded dataset is stored, or "create" a new OBS directory and upload the 800 movie reviews.
Step 202: the AI development platform receives a first prompt template input by the first user, where the input first prompt template describes the relationship between the data in the first dataset and the annotation results.
In the embodiments of this application, the AI development platform 100 can provide intelligent annotation based on basic models, for example the Pangu NLP large model and Pangu CV large model on Huawei Cloud.
The user may provide only a "prompt template" as a reference and let the basic model perform intelligent annotation; that is, the user can start the intelligent annotation service directly without providing any annotated samples, which is called zero-shot learning. Optionally, in addition to inputting the prompt template, the user can quickly start intelligent annotation by annotating only a small number of samples (e.g., 1-10).
In one optional implementation, as shown in FIG. 6(a), the first user directly selects the required task type "text sentiment analysis" from the task-type drop-down box in the GUI. After the first user selects the task type, the optional prompt template "Text X, the sentiment polarity is <MASK>" appears in the "prompt template" field, and the first user can simply select it.
In another optional implementation, the first user may also directly click "create template" to design a template matching the user's own needs. FIG. 6(b) is a first user interface for creating a new prompt template according to an embodiment of this application. As shown in FIG. 6(b), the first user wants to design a prompt template to identify the sentiment polarity of movie reviews. Since movie reviews are text, the first user first selects "text" as the data type and, after reading the AI development platform's "format instructions", designs a prompt template better suited to the user's needs: "Review X, this movie is really <MASK> to watch".
As shown in FIG. 6(c), to achieve better annotation results, the embodiments of this application also provide a few-shot annotation mode: besides inputting a suitable prompt template, the first user provides a small number of annotated samples to quickly start intelligent annotation. FIG. 6(c) is another interface for creating an intelligent annotation task; in this interface the first user can also select "few-shot" as the annotation mode and, after clicking "next", enters the few-shot annotation interface of FIG. 6(d), where the first user provides a few examples, such as "The movie is very touching, the sentiment polarity is <good>" and "The plot is boring, the sentiment polarity is <not good>". Based on these two examples, the AI development platform 100 performs few-shot learning and, for the other samples in the dataset, generates the content corresponding to <MASK> in the prompt template, thereby predicting labels for the data directly without manual annotation.
Step 203: the AI development platform performs data annotation on the data in the first dataset based on the basic model and the first prompt template.
The AI development platform 100 obtains basic model B (e.g., an NLP large model) and, based on the first prompt template input by the first user, performs inference directly on the data in the first dataset, thereby automatically annotating the dataset. For example, for "I watched this movie and liked it a lot. This movie is very <MASK>", the basic model can predict that the word corresponding to <MASK> is most likely "good", which is then mapped to a "positive" evaluation.
Optionally, when the user selects the "few-shot" mode, the AI development platform 100 learns based on the basic model, the user-input prompt template, and the small number of annotated samples, performs inference on the data in the first dataset, and thereby annotates the dataset automatically.
At the same time, the AI development platform 100 also stores the annotated data under the corresponding OBS path.
It can be seen that one difference between the scheme of the embodiments of this application and existing automatic annotation is that this scheme does not require large-scale manual annotation to bootstrap annotation. Existing intelligent annotation techniques all require a portion of the data to train an initial model, or to fine-tune a pretrained model based on that portion, and then predict labels for unannotated data with the resulting model. In contrast, this scheme can perform inference on the data in the dataset and output annotation results directly based on the prompt template and the basic model: with only a prompt template, or a small number of samples, the input-output form of the downstream task is reshaped to fit the pretrained model, so the annotation workflow can start immediately without first adjusting the parameters of the pretrained large model to generate a model adapted to the current task type.
Step 204: the AI development platform determines a first hard-example set in the first dataset and displays the first hard-example set to the first user through a display interface. The first hard-example set includes one or more hard examples.
The AI development platform 100 in this embodiment introduces hard-example mining technology, which can identify, during the basic model's inference, which input data are hard examples, i.e., determine the first hard-example set, which includes one or more hard examples. In this embodiment, after obtaining the hard examples, the AI development platform 100 can provide the one or more hard examples to the user through the display interface.
In one possible implementation, the user can see the current data annotation results as well as the number of hard examples and the accuracy. As shown in FIG. 6(e), 80 data items have been identified by the system as hard examples, and the accuracy of the basic model is 90%.
Regarding the automatic-annotation accuracy of the basic model: in one possible implementation, the current basic model's automatic-annotation accuracy may be defined as the proportion of non-hard examples in the unannotated dataset A. For example, if after the current basic model automatically annotates dataset A the non-hard examples account for 90% of the unannotated dataset A, the current basic model's automatic-annotation accuracy can be understood to be 90%. Note that the first user selects a specific first prompt template according to the user's task type, and the basic model annotates automatically based on that prompt template, so the automatic-annotation accuracy here is relative to the current task type.
In another possible implementation, the current basic model's automatic-annotation accuracy may also be defined as its prediction accuracy on a test set B. For example, the first user may upload a test set B together with the dataset in step 201; if the current basic model's test accuracy on test set B is 85%, the basic model's automatic-annotation accuracy can be understood to be 85%.
Optionally, the first user can click "settings" in FIG. 6(e) to manually adjust the proportion of hard examples to annotate and the hard-example coefficient threshold; see the preceding description for details. After confirming the current annotation results, the first user can click "manual hard-example confirmation" to enter the manual hard-example confirmation interface of FIG. 6(f).
In this embodiment, the user can confirm the annotations of the hard examples provided by the AI development platform in the display interface (specifically including direct confirmation, confirmation after modification, etc.). For example, FIG. 6(f) is a manual hard-example confirmation interface according to an embodiment of this application; in this example the annotation result of a hard example indicates whether the evaluation of the movie conveyed by the review text shown in the figure is positive or negative. If the user agrees with the automatic annotation result, the user directly clicks "confirm"; if not, the user clicks "modify".
Step 205: the AI development platform obtains the annotation results after the first user confirms the first hard-example set in the display interface.
The AI development platform obtains the annotation results confirmed by the first user for the first hard examples. Specifically, for hard examples in different task types, the annotation results contain different content. In this way, because the AI development platform provides one or more hard examples to the user after the first round of intelligent annotation, the user only needs to annotate and confirm the hard examples and provide the confirmation results to the AI development platform, which helps the platform optimize the basic model so that the automatic annotation it provides next time is more accurate.
Optionally, after the first user confirms the hard-example annotations, the AI development platform synchronizes the confirmed hard examples into the annotated first dataset, i.e., stores them under the corresponding OBS path. In addition, according to the first user's annotation confirmations, the AI development platform may also convert the hard-example set awaiting the first user's confirmation into an annotated hard-example set, an annotated non-hard-example set, an unannotated hard-example set, or an unannotated non-hard-example set.
Step 206: train the basic model according to the confirmed annotation results of the first hard examples, so as to update the basic model.
Specifically, based on the results of the first user's manual confirmation of the hard-example annotations, the annotations of the first dataset (e.g., the annotations generated by the automatic annotation in step 203) are first updated, and the basic model is then trained on the updated (annotated) first dataset to update the basic model.
It should be noted that "training the basic model to update the basic model" in this step is a general description and does not imply that the model is updated only once. In other words, the AI development platform may have returned one or more rounds of hard-example sets for the user to confirm, training the basic model on the confirmed first dataset each time.
In one possible implementation, the AI development platform 100 trains the basic model based on the first prompt template input by the first user in the preceding steps and the (annotated) first dataset after manual hard-example confirmation, so as to update the basic model. For example, since the task type executed by the first user is "text sentiment analysis", the AI development platform fine-tunes the platform's NLP large model based on the preceding (annotated) first dataset and first prompt template, updating some of the parameters of the basic model.
Step 207: perform data annotation on the first dataset based on the updated basic model.
Optionally, after the basic model is updated in step 206, the first dataset needs to be automatically annotated again. In one possible implementation, the AI development platform 100 returns the annotation results, and the first user can see the completion status of this round of annotation, as well as the number of hard examples and the accuracy.
In one possible implementation, if at this point (after at least one update of the basic model has been completed) a second user has also logged in to the AI development platform 100 and starts using the data annotation function provided by the platform, the AI development platform 100 first receives a second prompt template input by the second user, and then performs data annotation on a second dataset based on the updated basic model and the second prompt template. It can be seen that after training, the basic model has accumulated knowledge from the first user's data, so using the updated basic model to annotate the second user's second dataset can improve the accuracy to some extent.
Step 208: determine whether the annotation accuracy of the updated basic model is below a threshold.
In one possible implementation, the AI development platform 100 determines whether the current basic model's annotation accuracy satisfies the condition (i.e., step 208), for example whether the accuracy is below a threshold S (e.g., S = 95%). There are two cases:
If the accuracy of this round of annotation is below the threshold, steps 204-208 are repeated until some round's annotation accuracy is not below the threshold. For example, when the updated basic model's annotation accuracy is below the threshold, the AI development platform determines a second hard-example set in the first dataset, generates a display interface to display the second hard-example set to the first user, and trains the updated basic model according to the results of the first user's confirmation of the second hard-example set's annotations, thereby updating the AI development platform's basic model B again.
If the accuracy of this round of annotation is not below the threshold, the method proceeds to step 209, i.e., an annotation-completed response is returned.
Optionally, after the annotation of step 203, the AI development platform also determines whether the annotation accuracy is below the threshold. In an ideal case, i.e., when basic model B has very strong generalization ability, if the annotation accuracy already satisfies the condition after the annotation of step 203, the system can directly return the annotation-completed response (step 209).
Step 209: return an annotation-completed response for the first dataset.
As described above, if this round's annotation accuracy is not below the threshold, the AI development platform returns a response indicating that annotation of the first dataset is complete. FIG. 6(g) is a schematic diagram of a user interface for an annotation-completed response according to an embodiment of this application; here the accuracy is as high as 99.9%, so the updated basic model's inference performance on the first dataset and the text sentiment analysis task can be considered excellent.
In one possible implementation, the annotated first dataset returned in this step is the first dataset after the hard-example confirmation results were synchronized into it in step 205; that is, the annotated first dataset here includes both the automatic annotations and the results of hard-example confirmation.
Optionally, the annotated first dataset returned in this step includes the results of automatic annotation of the first dataset by the updated basic model (i.e., the automatic annotation results of step 207).
Step 210: obtain the annotated first dataset and a target requirement of the first user.
The embodiments of this application also provide a model construction and training method that can generate, based on the first user's target requirement, an AI model (i.e., the target model) that fulfills the first user's expected task. The first user's target requirement may include a task type and a model capability, where the model capability refers to the accuracy, performance, price, and other requirements the first user expects the target model to meet.
Step 211: train a target model based on the annotated first dataset and the target requirement, or distill a target model from the updated basic model. The embodiments of this application provide the following two types of model construction/training:
(1) Distillation: based on the user's target requirement (e.g., task type, target performance), distill a target AI model meeting the user's target requirement from the basic model on the AI development platform.
"Distillation" here, also called knowledge distillation, means using the aforementioned basic model as the source of supervision and the annotated first dataset as training samples to train the target model (a lightweight small model), thereby transferring the basic model's knowledge into the target model and improving its inference ability on the task type set by the user (e.g., the sentiment analysis task). Since user C's task type is textual, this distillation is based on the NLP large model on the AI development platform.
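The distillation described here can be sketched as minimizing the divergence between the large basic (teacher) model's softened output distribution and the small target (student) model's distribution on the annotated first dataset. The temperature and the KL form below are standard knowledge-distillation choices shown as an illustration, not the platform's actual training objective:

```python
import math

def softened(logits, temperature=2.0):
    # Softmax with temperature: higher T exposes more of the teacher's knowledge.
    exps = [math.exp(v / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions; the student is trained
    # to drive this toward zero, inheriting the teacher's behavior.
    p = softened(teacher_logits, temperature)
    q = softened(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss = distillation_loss([4.0, 1.0, 0.5], [3.5, 1.2, 0.6])
# loss > 0 here; it is exactly 0 when the student's logits match the teacher's.
```

In practice this term is usually combined with an ordinary supervised loss on the confirmed annotations, which matches the description of using both the basic model and the annotated first dataset.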
FIG. 6(h) is a model distillation interface provided by an embodiment of this application. The first user again selects the OBS location of the dataset and the task type; after the preceding automatic annotation workflow, the first dataset under the path obs/buckets/test1 already contains the data annotations. Optionally, the first user can also specify in the interface the expected character of the target model, e.g., "high accuracy", "high performance", or "economical", where "economical" means the overall cost of generating the target model is low, i.e., the first user can obtain the distilled target model at a reasonable low price. Further, during model distillation the first user can additionally set parameters for the expected accuracy and performance of the model. Specifically, accuracy may refer to common metrics such as model accuracy, precision, and recall, while performance may refer to metrics such as computation time and memory consumption.
(2) Conventional training: based on the user's target requirement and the annotated dataset, automatically select an initial model built into the AI development platform for the user and train the initial model on the annotated dataset to obtain a target AI model meeting the user's requirement; or, based on the user's target requirement and an initial AI model provided by the user or selected by the user on the AI development platform, train the initial AI model to obtain an AI model meeting the user's goal.
For example, the user can create a model training task and enter training-job parameters in the user interface, such as task type, input path, algorithm name, AI engine, number of compute nodes, and training flavor, where the input path is the OBS path of the input data. Further, during model training the user can additionally set and manage parameters for the expected accuracy and performance of the target model. Specifically, accuracy may refer to common metrics such as model accuracy, precision, and recall, while performance may refer to metrics such as computation time and memory consumption.
It can be seen that, compared with obtaining the target model through conventional model training, the model distillation provided by the embodiments of this application performs knowledge distillation on a basic model with a huge number of parameters; because the basic model has learned knowledge from different users and different tasks, a target model meeting the user's requirement can be trained with higher efficiency.
In one possible implementation, the services that the AI development platform provides to users in the embodiments of this application can be divided into two parts: an intelligent annotation service and a model training/model distillation service. On the cloud service platform, the user may purchase only the intelligent annotation service, or only the model training/model distillation service. For example, after the user purchases the basic cloud service, the cloud service provider provides APIs for these two types of services, and the intelligent annotation service and the model training/model distillation service are ultimately billed additionally according to the number of API calls.
FIG. 7 is an example of a data annotation apparatus 300 (which can also be understood as an AI development platform 300) provided by an embodiment of this application; optionally, the apparatus may also provide a model training function. The apparatus 300 may be implemented through software, hardware, or a combination of both as part or all of the AI development platform 100, i.e., it can be used to implement the methods of FIG. 4 and FIG. 5 of the embodiments of this application. By way of example, the apparatus 300 includes: an input/output (IO) module 301, a data storage module 302, an inference module 303, a basic-model storage module 304, a hard-example mining module 305, a basic-model update module 306, a model distillation module 307, and a model training module 308.
The input/output (IO) module 301 is configured to receive a first prompt template input by a first user through a display interface, where the input first prompt template describes the relationship between the data in a first dataset and the annotation results. In one possible embodiment, the user only needs to input the prompt template, and basic model B can perform zero-shot learning based on the prompt template to carry out intelligent annotation. In another possible embodiment, in addition to inputting the prompt template, the user may also input a small number of samples matching the template format, helping basic model B perform few-shot learning for intelligent annotation.
In one possible implementation, module 301 presets multiple prompt templates, each preset prompt template corresponding to one task type; optionally, the prompt template may also be designed by the first user in the display interface.
Optionally, the IO module 301 is further configured to receive the first dataset uploaded by the first user and store it in the data storage module 302; the first dataset is an unannotated dataset. Optionally, the user may purchase the OBS service on the cloud platform in advance and store the dataset under an OBS path, in which case the user only needs to enter the OBS path in the user interface here; the data in the dataset is read from OBS when intelligent annotation is later performed. The user may also upload the first dataset to be annotated directly in the user interface of the data annotation service.
The data storage module 302 is configured to store the first dataset uploaded by the first user. Optionally, the data storage module 302 may be the OBS service provided by the cloud platform; after automatic annotation, the AI development platform 100 also stores the annotated first dataset under the corresponding OBS path. In one possible embodiment, the OBS service is a cloud service distinct from the AI development platform.
The inference module 303 is configured to annotate the data in the first dataset based on the basic model and the first prompt template, where the basic model is a pretrained AI model deployed on the AI development platform.
In one possible implementation, the inference module 303 performs prompt learning based on the basic model to annotate the first dataset. Specifically, based on the first prompt template input in step 111, or on the first prompt template together with a small number of samples, it performs inference (i.e., prompt learning, such as zero-shot or few-shot inference) on the data in the first dataset and outputs the annotations corresponding to the unannotated data. Optionally, the inference module 303 is further configured to return an annotation-completed response for the first dataset when the annotation accuracy of the updated basic model reaches the threshold.
The basic-model storage module 304 is configured to store the basic model, where the basic model here is the pretrained large model deployed on the AI development platform 100. Such large models are usually trained on massive amounts of unannotated data; a basic model typically has a large number of parameters and excellent generalization ability. Depending on the type of training data, large models fall mainly into two categories: natural language processing (NLP) large models and computer vision (CV) large models. Large models may also include multimodal large models, scientific computing large models, and so on; this application does not limit this, and the basic model in the embodiments of this application may be any of the large models described above.
The hard-example mining module 305 is configured to determine one or more hard examples in the first dataset (i.e., a first hard-example set) and display the first hard-example set in the first dataset to the first user through the display interface. The AI development platform 100 also introduces hard-example mining technology, enabling a closed loop of inference, hard-example mining, training, and re-inference based on the pretrained basic model B.
In one possible implementation, after the basic model performs inference (intelligent annotation), the AI development platform 100 uses a hard-example mining algorithm to rank the samples about which the model's predictions are least certain (i.e., hard examples), determines the hard examples in the first dataset and their attributes, and presents them to the first user through the user interface program for manual confirmation and modification of the hard examples' annotation results.
Optionally, the hard-example mining module 305 is further configured to: when the updated basic model's annotation accuracy on the first dataset is below the threshold, determine a second hard-example set in the first dataset and display the second hard-example set to the first user through the display interface.
Optionally, after the first user confirms the hard-example annotations, the AI development platform synchronizes the confirmed hard examples into the annotated first dataset, i.e., stores the confirmed hard-example results under the corresponding OBS path.
The basic-model update module 306 is configured to: train the basic model according to the results of the first user's confirmation of the hard-example annotations, so as to update the basic model. Specifically, based on the results of the first user's confirmation of the hard-example annotations, the annotations of the first dataset in OBS are first updated, and the basic model is then trained on the updated (annotated) first dataset to update the basic model.
The basic-model update module 306 is further configured to train the updated basic model according to the results of the first user's confirmation of the second hard-example set's annotations.
In one possible implementation, if after the basic model has been trained a second user also starts using the functions of the data annotation apparatus 300, then: the IO module 301 receives a second prompt template input by the second user, and data annotation is performed on a second dataset based on the updated basic model and the second prompt template. It can be seen that after training, the basic model has accumulated knowledge from the first user's data, so using the updated basic model to annotate the second user's second dataset can improve the accuracy to some extent.
The model distillation module 307 is configured to: obtain a target requirement of the first user, the target requirement including a task type; and, based on the task type, obtain a target model by knowledge distillation from the updated basic model, the target model being used to perform the task indicated by the task type. The task type includes any one of: text sentiment analysis, text classification, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
The model training module 308 is configured to: obtain the annotated first dataset and a target requirement of the first user, the target requirement including a task type; and train a target model based on the annotated first dataset and the target requirement, the target model being used to perform the task indicated by the task type.
It should be noted that the above division of module functions is only an example; any of modules 301-308 may be used to execute some or all of the steps of the methods of FIG. 4 and FIG. 5 of this application. In other words, in other embodiments the inference module 303 may be used to execute any step of the methods in the embodiments of this application, and likewise each of the other modules may be used to execute any step of those methods; the steps that modules 301-308 are responsible for implementing can be assigned as needed, and the full functionality of the data annotation apparatus 300 can be realized by having different modules implement different steps of the methods in the embodiments of this application.
Next, taking the inference module 303 as an example, the implementations of the inference module 303 (software and hardware) are introduced. Similarly, the implementations of the other modules in the apparatus 300 may refer to that of the inference module 303:
As an example of a module as a software functional unit, the inference module 303 may be an application program or code block running on a computer device, where the computer device may be at least one of a physical host, a virtual machine, a container, or another computing device. Further, there may be one or more such computer devices; for example, the inference module 303 may be an application running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers running the application may be distributed in the same availability zone (AZ) or in different AZs, and may be distributed in the same region or in different regions, where one region usually includes multiple AZs. Likewise, the multiple hosts/virtual machines/containers running the application may be distributed in the same virtual private cloud (VPC) or across multiple VPCs, where one region usually includes multiple VPCs and one VPC may include multiple AZs.
As an example of a module as a hardware functional unit, the inference module 303 may include at least one computing device, such as a server; alternatively, the inference module 303 may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be implemented as a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The multiple computing devices included in the inference module 303 may be distributed in the same AZ or in different AZs, in the same region or in different regions, and likewise in the same VPC or across multiple VPCs, where the multiple computing devices may be any combination of servers, ASICs, PLDs, CPLDs, FPGAs, GALs, and other computing devices.
FIG. 8 is a schematic structural diagram of a computing device 400 on which the above apparatus can be deployed. The computing device may be a computing device in a cloud environment (e.g., a server), a computing device in an edge environment, a terminal device, or the like, and may specifically be used to implement the functions of the modules in the apparatus 300 described above.
As shown in FIG. 8, the computing device 400 includes a processor 401, a memory 402, a communication interface 403, and a bus 404. The processor 401, the memory 402, and the communication interface 403 communicate with one another through the bus 404. The bus 404 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus. The communication interface 403 is used for external communication, for example receiving raw data provided by the first user and a feature-extraction network model to be trained.
The processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits. The processor 401 may also be an integrated circuit chip with signal processing capability; during implementation, the functions of the modules of the apparatus may be completed by integrated logic circuits of hardware in the processor 401 or by instructions in software form. The processor 401 may also be a general-purpose processor, a digital signal processor (DSP), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or any conventional processor. The methods disclosed in the embodiments of this application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of the hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 402; the processor 401 reads the information in the memory 402 and completes the functions of the modules of the apparatus in combination with its hardware.
The memory 402 may include a volatile memory, such as a random access memory (RAM). The memory 402 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
The memory 402 stores executable code, and the processor 401 executes the executable code to perform the data annotation and model training methods proposed in the embodiments of this application, so as to implement the functions of the aforementioned modules 301-308. The memory 402 also stores the data needed when the methods are executed, such as the first dataset and the basic model file.
FIG. 9 shows a computing device cluster provided by an embodiment of this application. The computing device cluster includes at least one computing device 400, which may be a server, such as a central server, an edge server, or a local server in an on-premises data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop, or a smartphone. The memory 402 of one or more computing devices 400 in the cluster may store the same instructions of the data annotation apparatus 300 for executing the data annotation and model training methods proposed in the embodiments of this application.
In one possible implementation, one or more computing devices 400 in the cluster may also be used to implement the functions of only some modules of the data annotation apparatus 300, i.e., to execute only part of the instructions of the methods in the embodiments of this application. In other words, a combination of one or more computing devices 400 may jointly store the instructions of the modules of the data annotation apparatus 300 for executing the data annotation and model training methods proposed in the embodiments of this application.
It should be noted that the memories 402 of different computing devices 400 in the cluster may store different instructions for executing part of the functions of the data annotation apparatus 300; that is, the instructions stored in the memories 402 of different computing devices 400 may implement the functions of one or more of the IO module 301, the data storage module 302, the inference module 303, the basic-model storage module 304, the hard-example mining module 305, the basic-model update module 306, the model distillation module 307, and the model training module 308. In addition, the memory 402 also stores the data needed when the methods are executed, such as the first dataset and the model file of the basic model.
FIG. 10 shows a possible implementation of the computing device cluster provided by an embodiment of this application. As shown in FIG. 10, four computing devices 400A, 400B, 400C, and 400D are connected through a network, which may be a wide area network, a local area network, or the like; specifically, each computing device connects to the network through its communication interface. The instructions or program code stored in the memories 402 of the different computing devices 400 may implement the functions of one or more of the IO module 301, the data storage module 302, the inference module 303, the basic-model storage module 304, the hard-example mining module 305, the basic-model update module 306, the model distillation module 307, and the model training module 308.
In one possible implementation, considering that automatic annotation (inference and updating of the basic model), model training, hard-example mining, and data storage may each be offered as independent cloud services to users on the cloud platform 100 (for example, a user may purchase the hard-example mining service alone to perform hard-example mining), their functions may be implemented by different computing devices.
As an example implementation, the memory 402 of computing device 400A stores the program code of the IO module 301, the inference module 303, the basic-model storage module 304, and the basic-model update module 306; computing device 400A implements the automatic annotation function, which specifically includes performing inference based on the basic model and the user-input prompt template to automatically annotate the first dataset, and updating the basic model based on the (annotated) first dataset after hard-example confirmation. Meanwhile, the memory 402 of computing device 400B stores program code for the functions of the model distillation module 307 and the model training module 308, enabling model training and model distillation based on the annotated first dataset. The memory 402 of computing device 400C stores program code implementing the functions of the hard-example mining module 305, supporting the closed loop of inference, hard-example mining, training, and re-inference based on the AI model. The memory 402 of computing device 400D stores program code implementing the functions of the data storage module 302; for example, the data storage module 302 may be the OBS service, used to store the first dataset uploaded by the user. Then, when computing device 400A executes the functions of the inference module 303, computing device 400A can read the data in the dataset from OBS.
It should be understood that the functions of computing device 400A shown in FIG. 10 may also be completed by multiple computing devices 400. Likewise, the functions of computing devices 400B, 400C, and 400D may each be completed by multiple computing devices 400.
An embodiment of this application also provides a computer-readable storage medium storing instructions that, when run on one or more computing devices, cause the one or more computing devices to execute the methods executed by the modules of the apparatus of the above embodiments.
An embodiment of this application also provides a computer program product; when the computer program product is executed by one or more computing devices, the one or more computing devices execute any of the aforementioned methods. The computer program product may be a software installation package; when any of the aforementioned methods needs to be used, the computer program product can be downloaded and executed on a computer.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place, or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in this application, the connections between modules indicate communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.
From the description of the above implementations, those skilled in the art can clearly understand that this application may be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is in most cases the better choice. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to execute the methods described in the embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a training device or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Claims (22)

  1. A data annotation method, applied to an artificial intelligence (AI) development platform, comprising:
    receiving a first prompt template input by a first user, wherein the first prompt template describes a relationship between input data and annotation results;
    performing data annotation on a first dataset based on a basic model and the first prompt template, wherein the basic model is deployed on the AI development platform;
    determining a first hard-example set in the first dataset, and generating a display interface to display the first hard-example set to the first user, wherein the first hard-example set comprises at least one hard example;
    training the basic model according to results of the first user's confirmation of the annotations of the first hard-example set, to obtain an updated basic model.
  2. The method according to claim 1, wherein after the updated basic model is obtained, the method further comprises:
    performing data annotation on the first dataset based on the updated basic model;
    when the annotation accuracy of the updated basic model is higher than or equal to a threshold, returning an annotation-completed response; or,
    when the annotation accuracy of the updated basic model is below the threshold, determining a second hard-example set in the first dataset, generating a display interface to display the second hard-example set to the first user, and training the updated basic model according to results of the first user's confirmation of the annotations of the second hard-example set.
  3. The method according to claim 1, wherein training the basic model according to the results of the user's confirmation of the annotations of the first hard-example set to obtain an updated basic model comprises:
    training the basic model according to the results of the first user's confirmation of the annotations of the first hard-example set and the first prompt template, to obtain the updated basic model.
  4. The method according to claim 1 or 3, wherein the method comprises:
    receiving a second prompt template input by a second user;
    performing data annotation on a second dataset based on the updated basic model and the second prompt template.
  5. The method according to claim 1, wherein the method comprises:
    determining an annotated first dataset according to the results of the user's confirmation of the annotations of the first hard-example set and the annotations of a non-hard-example set, wherein the annotations of the non-hard-example set are annotations generated in the step of performing data annotation on the first dataset based on the basic model and the prompt template, and the non-hard-example set is the set of data remaining in the first dataset after removing the first hard-example set.
  6. The method according to claim 5, wherein the method further comprises:
    obtaining a target requirement of the first user, the target requirement comprising a task type;
    performing knowledge distillation on the updated basic model based on the target requirement and the annotated first dataset, to obtain a target model, the target model being used to perform the task indicated by the task type.
  7. The method according to claim 5, wherein the method further comprises:
    obtaining the annotated first dataset and a target requirement of the first user, the target requirement comprising a task type;
    performing model training based on the annotated first dataset and the target requirement, to obtain a target model, the target model being used to perform the task indicated by the task type.
  8. The method according to claim 6 or 7, wherein the target requirement further comprises a model capability, the model capability describing the accuracy or performance of the target model.
  9. The method according to any one of claims 6-8, wherein the task type comprises any one of: text sentiment analysis, text classification, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
  10. The method according to any one of claims 1-9, wherein
    the input first prompt template is preset on the AI development platform, the AI development platform presets multiple prompt templates, and each preset prompt template corresponds to one task type; or, the first prompt template is designed by the user in the display interface.
  11. An artificial intelligence (AI) development platform, wherein the AI development platform comprises:
    an input/output (IO) module, configured to: receive a first prompt template input by a first user, wherein the first prompt template describes a relationship between input data and annotation results;
    an inference module, configured to: perform data annotation on a first dataset based on a basic model and the prompt template, wherein the basic model is deployed on the AI development platform;
    a hard-example mining module, configured to: determine a first hard-example set in the first dataset, and generate a display interface to display the first hard-example set to the first user, wherein the first hard-example set comprises at least one hard example;
    a basic-model update module, configured to: train the basic model according to results of the first user's confirmation of the annotations of the first hard-example set, to obtain an updated basic model.
  12. The AI development platform according to claim 11, wherein:
    the inference module is further configured to: perform data annotation on the first dataset based on the updated basic model;
    the inference module is further configured to: return an annotation-completed response when the annotation accuracy of the updated basic model is higher than or equal to a threshold;
    the hard-example mining module is further configured to: when the annotation accuracy of the updated basic model is below the threshold, determine a second hard-example set in the first dataset, generate a display interface to display the second hard-example set to the first user, and train the updated basic model according to results of the first user's confirmation of the annotations of the second hard-example set.
  13. The AI development platform according to claim 11, wherein the basic-model update module is configured to:
    train the basic model according to the results of the first user's confirmation of the annotations of the first hard-example set and the first prompt template, to obtain the updated basic model.
  14. The AI development platform according to claim 11 or 13, wherein:
    the IO module is further configured to: receive a second prompt template input by a second user;
    the inference module is further configured to: perform data annotation on a second dataset based on the updated basic model and the second prompt template.
  15. The AI development platform according to claim 11, wherein:
    the hard-example mining module is further configured to: determine an annotated first dataset according to the results of the user's confirmation of the annotations of the first hard-example set and the annotations of a non-hard-example set, wherein the annotations of the non-hard-example set are annotations generated in the step of performing data annotation on the first dataset based on the basic model and the prompt template, and the non-hard-example set is the set of data remaining in the first dataset after removing the first hard-example set.
  16. The AI development platform according to claim 15, wherein the AI development platform further comprises:
    a model distillation module, configured to:
    obtain a target requirement of the first user, the target requirement comprising a task type;
    perform knowledge distillation on the updated basic model based on the target requirement and the annotated first dataset, to obtain a target model, the target model being used to perform the task indicated by the task type.
  17. The AI development platform according to claim 15, wherein the AI development platform further comprises:
    a model training module, configured to:
    obtain the annotated first dataset and a target requirement of the first user, the target requirement comprising a task type;
    perform model training based on the annotated first dataset and the target requirement, to obtain a target model, the target model being used to perform the task indicated by the task type.
  18. The AI development platform according to claim 16 or 17, wherein the target requirement further comprises a model capability, the model capability describing the accuracy or performance of the target model.
  19. The AI development platform according to claim 16 or 17, wherein the task type comprises any one of: text sentiment analysis, text classification, named entity recognition, sound classification, speech content recognition, image classification, object detection, image segmentation, and video annotation.
  20. The AI development platform according to any one of claims 11-19, wherein the input first prompt template is preset on the AI development platform, the AI development platform presets multiple prompt templates, and each preset prompt template corresponds to one task type; or, the first prompt template is designed by the user in the display interface.
  21. A computing device cluster, comprising at least one computing device, each computing device comprising a processor and a memory;
    the processor of the at least one computing device being configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, comprising computer program instructions that, when executed by a computing device cluster, cause the computing device cluster to execute the method according to any one of claims 1 to 10.
PCT/CN2022/130153 2022-03-24 2022-11-05 Data annotation method, AI development platform, computing device cluster, and storage medium WO2023179038A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210303683 2022-03-24
CN202210303683.2 2022-03-24
CN202210855348.3A CN116862001A (zh) 2022-03-24 2022-07-19 Data annotation method, AI development platform, computing device cluster, and storage medium
CN202210855348.3 2022-07-19

Publications (1)

Publication Number Publication Date
WO2023179038A1

Family

ID=88099741

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130153 WO2023179038A1 (zh) 2022-03-24 2022-11-05 数据标注的方法、ai开发平台、计算设备集群和存储介质

Country Status (1)

Country Link
WO (1) WO2023179038A1 (zh)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808004A (zh) * 2017-11-15 2018-03-16 北京百度网讯科技有限公司 模型训练方法和系统、服务器、存储介质
WO2018184195A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Joint training of neural networks using multi-scale hard example mining
CN111476324A (zh) * 2020-06-28 2020-07-31 平安国际智慧城市科技股份有限公司 基于人工智能的交通数据标注方法、装置、设备及介质
CN112529026A (zh) * 2019-09-17 2021-03-19 华为技术有限公司 提供ai模型的方法、ai平台、计算设备及存储介质
CN113838058A (zh) * 2021-10-11 2021-12-24 重庆邮电大学 一种基于小样本分割的医学图像自动标注方法及系统
CN113935389A (zh) * 2020-06-29 2022-01-14 华为技术有限公司 数据标注的方法、装置、计算设备和存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557871A (zh) * 2024-01-11 2024-02-13 子亥科技(成都)有限公司 Three-dimensional model annotation method, apparatus, device, and storage medium
CN117557871B (zh) * 2024-01-11 2024-03-19 子亥科技(成都)有限公司 Three-dimensional model annotation method, apparatus, device, and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22933089

Country of ref document: EP

Kind code of ref document: A1