WO2023245523A1 - Method and apparatus for generating training data - Google Patents

Method and apparatus for generating training data

Info

Publication number
WO2023245523A1
WO2023245523A1 (PCT application no. PCT/CN2022/100583)
Authority
WO
WIPO (PCT)
Prior art keywords
data
deep learning
training
reference sample
label
Prior art date
Application number
PCT/CN2022/100583
Other languages
English (en)
French (fr)
Inventor
肖涵
王楠
王博
马克西米利安•韦克
马斯特拉帕斯•乔治奥斯
Original Assignee
极纳人工智能有限公司
极纳人工智能(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 极纳人工智能有限公司, 极纳人工智能(北京)有限公司 filed Critical 极纳人工智能有限公司
Priority to CN202280005189.6A priority Critical patent/CN115836288A/zh
Priority to EP22871054.7A priority patent/EP4322066A4/en
Priority to PCT/CN2022/100583 priority patent/WO2023245523A1/zh
Publication of WO2023245523A1 publication Critical patent/WO2023245523A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular, to methods and devices for generating training data.
  • A deep learning model is a machine learning model that aims to build and simulate a neural network that analyzes and learns in the manner of the human brain, imitating the mechanisms of the human brain to interpret data such as text, images, and sounds.
  • Deep learning models can be widely used in various fields to perform a variety of tasks, such as computer vision, language understanding, speech recognition, advertising recommendation, neural search, etc.
  • the embodiments described herein provide a method, apparatus, electronic device, and computer-readable storage medium storing a computer program for generating training data.
  • a method for generating training data is provided.
  • This training data is used to train the target deep learning model.
  • the raw data input by the user for the target deep learning model is obtained.
  • Types of original data include labeled categorical data, labeled session data, and unlabeled data.
  • the label of categorical data indicates the category of the categorical data.
  • the labels of the session data indicate the question-answer relevance of the session data.
  • In the step of generating training data according to the type of the original data, if the original data is classification data, the training data is generated according to the category indicated by the label of the classification data.
  • part or all of the classification data is selected from the classification data as reference samples. Treat each reference sample in the reference samples as a target reference sample. Classification data having the same category as the target reference sample is determined as a positive sample associated with the target reference sample. Classification data having a different class than the target reference sample is determined as a negative sample associated with the target reference sample. Then, the target reference sample, the positive samples associated with the target reference sample, and the negative samples associated with the target reference sample are combined into a set of training data.
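  • As a rough, non-authoritative sketch of the grouping described above (not part of the disclosure), the following Python code assembles reference/positive/negative groups from labeled classification data; the `LabeledItem` class and its field names are assumptions introduced only for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LabeledItem:
    content: str   # e.g. an image path or a sentence
    category: str  # the category indicated by the item's (unary) label


def build_classification_groups(data: List[LabeledItem],
                                references: List[LabeledItem]) -> List[Dict]:
    """For each reference sample, gather positives (same category) and
    negatives (different category) and combine them into one group of
    training data."""
    groups = []
    for ref in references:
        positives = [d for d in data if d is not ref and d.category == ref.category]
        negatives = [d for d in data if d.category != ref.category]
        groups.append({"reference": ref,
                       "positives": positives,
                       "negatives": negatives})
    return groups
```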
  • the classification data includes multiple labels.
  • the category of categorical data is determined by one or more labels of the categorical data.
  • In the step of generating training data according to the type of the original data, if the original data is session data, the training data is generated according to the question-answer correlation indicated by the label of the session data.
  • each piece of session data includes a reference sample and a plurality of matching samples.
  • For each piece of session data, the matching samples whose labels indicate a positive question-answer correlation are regarded as positive samples, and the matching samples whose labels indicate a negative question-answer correlation are regarded as negative samples.
  • the reference samples, positive samples, and negative samples are combined into a set of training data.
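  • A minimal sketch of this grouping for session data is shown below (illustrative only, not part of the disclosure); the dictionary layout with a `"reference"` query and `"matches"` list of (candidate, label) pairs is an assumption:

```python
from typing import Dict, List, Tuple


def build_session_groups(sessions: List[Dict]) -> List[Dict]:
    """Each session is assumed to look like
    {"reference": query, "matches": [(candidate, label), ...]},
    where label 1 marks a positive question-answer correlation
    and label 0 a negative one."""
    groups = []
    for session in sessions:
        matches: List[Tuple[str, int]] = session["matches"]
        positives = [c for c, label in matches if label == 1]
        negatives = [c for c, label in matches if label == 0]
        groups.append({"reference": session["reference"],
                       "positives": positives,
                       "negatives": negatives})
    return groups
```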
  • The labels of the classification data are unary labels, and the labels of the session data are binary labels.
  • In the step of generating training data according to the type of the original data, if the original data is unlabeled data, a data augmentation technique is used to generate the training data.
  • Each piece of unlabeled data is used as a reference sample. The data augmentation technique is used to generate multiple positive samples from the reference sample, and to generate multiple negative samples from the unlabeled data other than the reference sample.
  • When the unlabeled data is a picture, the data augmentation technique includes performing one or more of operations such as flipping, mirroring, and cropping on the picture.
  • When the unlabeled data is text, the data augmentation technique includes performing a random masking operation on the text.
  • When the unlabeled data is an audio passage, the data augmentation technique includes performing a random masking operation on the audio passage.
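  • As an illustrative sketch only, and assuming the Pillow library for picture handling, the following code generates positives by augmenting the reference picture and negatives by augmenting the remaining unlabeled pictures; the choice of operations and the `views` count are assumptions, not values from the disclosure:

```python
import random
from typing import Dict, List

from PIL import Image, ImageOps


def augment_picture(img: Image.Image) -> Image.Image:
    """Apply one randomly chosen operation: flip, mirror, or crop."""
    op = random.choice(["flip", "mirror", "crop"])
    if op == "flip":
        return ImageOps.flip(img)    # top-bottom flip
    if op == "mirror":
        return ImageOps.mirror(img)  # left-right mirror
    w, h = img.size
    return img.crop((w // 10, h // 10, w - w // 10, h - h // 10))


def build_unlabeled_groups(pictures: List[Image.Image],
                           views: int = 2) -> List[Dict]:
    """Each unlabeled picture serves as a reference sample; positives are
    augmented views of the reference, negatives are augmented views of
    the other pictures."""
    groups = []
    for i, ref in enumerate(pictures):
        positives = [augment_picture(ref) for _ in range(views)]
        others = [p for j, p in enumerate(pictures) if j != i]
        negatives = [augment_picture(p) for p in others[:views]]
        groups.append({"reference": ref,
                       "positives": positives,
                       "negatives": negatives})
    return groups
```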
  • an apparatus for generating training data includes: an acquisition module for acquiring original data input by a user for a target deep learning model; a determination module for determining the type of the original data; and a generation module for generating training data according to the type of the original data.
  • The types of the original data include labeled classification data, labeled session data, and unlabeled data.
  • The label of the classification data indicates the category of the classification data, and the label of the session data indicates the question-answer correlation of the session data.
  • an electronic device includes: at least one processor; and at least one memory storing a computer program.
  • When the computer program is executed by the at least one processor, the electronic device is caused to: obtain original data input by the user for the target deep learning model; determine the type of the original data, the type of the original data including labeled classification data, labeled session data, and unlabeled data, where the label of the classification data indicates the category of the classification data and the label of the session data indicates the question-answer relevance of the session data; and generate the training data according to the type of the original data.
  • The computer program, when executed by the at least one processor, causes the electronic device to generate training data according to the type of the original data by: in response to the original data being classification data, generating training data according to the category indicated by the label of the classification data.
  • The computer program, when executed by the at least one processor, causes the electronic device to generate training data according to the category indicated by the label of the classification data by: selecting some or all of the classification data from the classification data as reference samples; taking each of the reference samples as a target reference sample; determining the classification data having the same category as the target reference sample as positive samples associated with the target reference sample; determining the classification data having a different category from the target reference sample as negative samples associated with the target reference sample; and combining the target reference sample, the positive samples associated with the target reference sample, and the negative samples associated with the target reference sample into a set of training data.
  • The computer program, when executed by the at least one processor, causes the electronic device to generate training data according to the type of the original data by: in response to the original data being session data, generating training data according to the question-answer correlation indicated by the label of the session data.
  • each piece of session data includes a reference sample and a plurality of matching samples.
  • The computer program, when executed by the at least one processor, causes the electronic device to generate training data according to the question-answer relevance indicated by the labels of the session data by: for each piece of session data, taking the matching samples whose labels indicate a positive question-answer correlation as positive samples; taking the matching samples whose labels indicate a negative question-answer correlation as negative samples; and combining the reference sample, the positive samples, and the negative samples into a set of training data.
  • The computer program, when executed by the at least one processor, causes the electronic device to generate training data according to the type of the original data by: in response to the original data being unlabeled data, using a data augmentation technique to generate training data.
  • The computer program, when executed by the at least one processor, causes the electronic device to use the data augmentation technique to generate training data by: taking each piece of unlabeled data as a reference sample; using the data augmentation technique to generate multiple positive samples from the reference sample; and using the data augmentation technique to generate multiple negative samples from the unlabeled data other than the reference sample.
  • a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to the first aspect of the present disclosure.
  • FIG. 1 is an exemplary flowchart of a method for generating a target deep learning model according to an embodiment of the present disclosure
  • Figure 2 is an exemplary flow chart of the steps of generating training data from raw data in the embodiment shown in Figure 1;
  • Figure 3 is an exemplary flow chart of the steps of generating training data according to the type of original data in the embodiment shown in Figure 2;
  • Figure 4 is an exemplary flowchart of the steps of determining a first deep learning model corresponding to a task in the embodiment shown in Figure 1;
  • Figure 5 is a schematic block diagram of an apparatus for generating training data according to an embodiment of the present disclosure
  • FIG. 6 is a schematic block diagram of an electronic device executing a method for generating training data according to an embodiment of the present disclosure.
  • developers of deep learning models can obtain target deep learning models by fine-tuning pre-trained deep learning models.
  • In the process of fine-tuning a deep learning model, operations such as training data preparation, model selection, and training parameter selection are required. This requires developers to have extensive knowledge of deep learning models, so it is not friendly to junior developers: it not only demands a great deal of work from them but also delays development progress.
  • FIG. 1 illustrates an exemplary flowchart of a method for generating a target deep learning model according to an embodiment of the present disclosure.
  • the instructions include the tasks that the target deep learning model is expected to perform.
  • the task may be a search task, such as searching for pictures with text, searching for text with text, searching for pictures with pictures, searching for text with pictures, and searching for sounds with sound.
  • the user-entered indication may include a desire to obtain a deep learning model capable of performing a specified search task.
  • the raw data entered by the user is associated with the tasks included in the instructions.
  • In the case where the search task is searching for pictures by picture, the user can input a picture set as the raw data for generating the target deep learning model.
  • In the case where the search task is searching for sounds by sound, the user can input an audio set as the raw data for generating the target deep learning model.
  • FIG. 2 shows an exemplary flowchart of steps for generating training data from raw data.
  • the type of raw data is determined.
  • Types of raw data can include: labeled categorical data, labeled session data, and unlabeled data.
  • the label of categorical data indicates the category of the categorical data.
  • the labels of the session data indicate the question-answer relevance of the session data.
  • The raw data may include a label indication field.
  • The label indication field indicates the label of the raw data. The type of the raw data can be determined through this label indication field.
  • the classification data may have more than one label.
  • the category of the categorical data can be determined based on some or all of the labels of the categorical data. Take a picture as an example to illustrate how the label of categorical data indicates the category of categorical data.
  • Suppose the labels of the categorical data include: cat, dog, cute, ugly. Then the pictures can be classified into cat pictures and dog pictures based on the partial labels "cat" and "dog". Alternatively, the pictures may be classified into cute cat pictures, ugly cat pictures, cute dog pictures, and ugly dog pictures based on all the labels "cat", "dog", "cute", and "ugly".
  • No matter how many labels a piece of categorical data has, each label is a unary label.
  • A unary label means that the label is related to only one piece of data.
  • the session data is, for example, historical interaction data between the e-commerce platform and the user.
  • Users can search on the e-commerce platform for the keyword of a certain product (equivalent to a "question"). Based on this keyword, the e-commerce platform can push several product links (equivalent to "answers") to the user. If the user clicks on a certain product link, the label of the session data formed by the keyword and the product link is set to, for example, "relevant" (for example, represented by the number 1). If the user does not click on a certain product link, the label of the session data formed by the keyword and the product link is set to, for example, "irrelevant" (for example, represented by the number 0). The labels of session data are related to both the question and the answer, so they are binary labels.
  • Such session data may, for example, come from search logs saved by the e-commerce platform for each user.
  • unlabeled data refers to data in which the label indication field is empty.
  • unlabeled data is, for example, product photos uploaded by users. Such product photos do not have labels and are therefore defined as unlabeled data.
  • training data is generated according to the type of original data.
  • Figure 3 shows an exemplary flowchart of steps for generating training data according to the type of original data.
  • whether the original data is categorical data can be determined by whether the label of the original data is a unary label or a binary label. If the label is a unary label, it is determined that the original data is categorical data. If the label is a binary label, the original data is determined to be session data.
  • Whether the original data is classification data may also be determined by whether the label indication field in the original data includes text. If the label indication field includes text, the original data is determined to be classification data. If the label indication field only includes the numbers 0 or 1, the original data is determined to be session data.
  • the above text may include Chinese, English, other language types, or a combination thereof.
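  • A simplified, hypothetical sketch of this type check is shown below; the record layout and the `"label"` key are assumptions made only for illustration:

```python
from typing import Any, Dict


def detect_raw_data_type(record: Dict[str, Any]) -> str:
    """Inspect a label indication field to decide the data type:
    empty -> unlabeled data; only 0/1 values (binary, question-answer
    labels) -> session data; textual labels -> classification data."""
    label = record.get("label")
    if label in (None, "", [], ()):
        return "unlabeled"
    values = label if isinstance(label, (list, tuple)) else [label]
    if all(str(v) in ("0", "1") for v in values):
        return "session"
    return "classification"


# Examples:
# detect_raw_data_type({"label": ["cat", "cute"]})  -> "classification"
# detect_raw_data_type({"label": 1})                -> "session"
# detect_raw_data_type({"label": ""})               -> "unlabeled"
```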
  • training data is generated according to the category indicated by the label of the categorical data.
  • For example, part or all of the classification data may be selected from the classification data as reference samples.
  • For each of the reference samples, the classification data having the same category as the reference sample can be determined as positive samples associated with the reference sample, and the classification data having a different category from the reference sample can be determined as negative samples associated with the reference sample.
  • The reference sample, the positive samples associated with the reference sample, and the negative samples associated with the reference sample are then combined into a set of training data. In this way, a corresponding set of training data is generated for each of the reference samples.
  • each piece of session data includes a reference sample and a plurality of matching samples.
  • the reference sample can be a keyword used by the user to search for a certain product on the e-commerce platform, and the multiple matching samples can be several product links pushed by the e-commerce platform to the user.
  • Each matching sample has a label that indicates whether the matching sample is related to the reference sample.
  • For each piece of session data, the matching samples with a label of "relevant" or 1 (the label indicates a positive question-answer correlation) can be regarded as positive samples, and the matching samples with a label of "irrelevant" or 0 (the label indicates a negative question-answer correlation) can be regarded as negative samples.
  • Reference samples, positive samples, and negative samples are combined into a set of training data. In this way, a corresponding set of training data is generated for each session data.
  • each of the original data may be used as a reference sample.
  • Data augmentation techniques can then be used to generate multiple positive samples from the reference sample and multiple negative samples from the original data except the reference sample.
  • During data augmentation, the augmentation method to adopt can be determined based on the task obtained at block S102 of Figure 1.
  • one or more operations such as flipping, mirroring, and cropping may be performed on the pictures as reference samples to generate multiple positive samples.
  • One or more operations such as flipping, mirroring, and cropping can be performed on the images in the original data except the reference samples to generate multiple negative samples.
  • a random masking operation can be performed on text or sound passages as reference samples to generate multiple positive samples.
  • a random masking operation can be performed on text or sound passages in the original data other than the reference sample to generate multiple negative samples.
  • any one or more characters in the text can be randomly masked or removed.
  • the number and position of the obscured or removed words in the text are random.
  • Performing a random masking operation on "I like Beijing very much" may result in "I like Beijing [Unknown] much", in which "very" is masked and identified as "[Unknown]".
  • Performing a random masking operation on "I like Beijing very much" may also result in "I like Beijing", in which "very much" is removed.
  • Performing a random masking operation on "I like Beijing very much" may also result in "[Unknown] like Beijing very much", in which "I" is masked and identified as "[Unknown]".
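  • A minimal sketch of such a random masking operation is given below (illustrative only); the mask token, span length, and removal probability are assumptions:

```python
import random


def random_mask(text: str, mask_token: str = "[Unknown]",
                max_span: int = 2, p_remove: float = 0.5) -> str:
    """Randomly cover or remove a short span of words; the number and
    the position of the affected words are random."""
    words = text.split()
    if not words:
        return text
    span = random.randint(1, min(max_span, len(words)))
    start = random.randint(0, len(words) - span)
    if random.random() < p_remove:
        del words[start:start + span]             # remove the words
    else:
        words[start:start + span] = [mask_token]  # cover them with one marker
    return " ".join(words)


# random_mask("I like Beijing very much") may return, for example,
# "I like Beijing [Unknown] much" or "I like Beijing".
```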
  • the operation at block S104 can automatically generate training data without user participation, reducing the user's work burden and improving work efficiency.
  • a first deep learning model corresponding to the task is determined.
  • the operations at block S106 may be performed in parallel with the operations at block S104. In other embodiments of the present disclosure, the operation at block S106 may be performed first and then the operation at block S104.
  • Figure 4 shows an exemplary flowchart of the steps of determining a first deep learning model corresponding to a task.
  • a plurality of candidate deep learning models corresponding to the task are determined.
  • a first mapping table of a plurality of pre-trained deep learning models and a plurality of tasks that can be performed by the plurality of deep learning models may be pre-established.
  • the plurality of pre-trained deep learning models may be existing pre-trained deep learning models, or may be pre-trained deep learning models developed in the future.
  • the first mapping table may be established based on empirical values.
  • multiple deep learning models can perform the same task. In other words, a task can be performed by any one of multiple deep learning models. Therefore, a plurality of candidate deep learning models corresponding to the task may be determined based on the first mapping table.
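  • As a sketch of such a first mapping table (the task names and model identifiers are placeholders, not values given in the disclosure), a lookup could be as simple as:

```python
from typing import Dict, List

# Hypothetical first mapping table: task -> candidate pre-trained models.
FIRST_MAPPING_TABLE: Dict[str, List[str]] = {
    "search-pictures-by-picture": ["resnet50-encoder", "vit-base-encoder"],
    "search-text-by-text": ["bert-base-encoder", "minilm-encoder"],
    "search-sounds-by-sound": ["wav2vec-encoder"],
}


def candidate_models(task: str) -> List[str]:
    """Look up the candidate pre-trained deep learning models for a task."""
    return FIRST_MAPPING_TABLE.get(task, [])
```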
  • multiple pre-trained deep learning models may be adjusted in advance, and the adjusted deep learning models may be added to the first mapping table.
  • the above-mentioned adjusted deep learning model can be used as a candidate deep learning model corresponding to the task.
  • The adjustment may include, for example: adding several layers of fully connected neurons as the output on top of the last layer of the pre-trained deep learning model; or changing the layer number of the output layer of the pre-trained deep learning model (for example, outputting from the penultimate layer instead of from the last layer).
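  • One possible form of the first kind of adjustment, sketched with PyTorch under assumed dimensions (the hidden size of 256 is arbitrary):

```python
import torch


def add_projection_head(backbone: torch.nn.Module,
                        in_dim: int, out_dim: int) -> torch.nn.Module:
    """Append several fully connected layers on top of the last layer of
    a pre-trained backbone."""
    return torch.nn.Sequential(
        backbone,
        torch.nn.Linear(in_dim, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, out_dim),
    )
```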
  • the plurality of pre-trained deep learning models may include large models and small models.
  • the total number of layers of a large model is greater than that of a small model.
  • different training parameters may be set for the plurality of candidate deep learning models. Training parameters may include one or more of the following: learning rate, training stop conditions, etc.
  • the plurality of candidate deep learning models are trained using a portion of the training data.
  • the purpose of using part of the training data here is to reduce the amount of calculations. Training of these multiple candidate deep learning models using partial training data is equivalent to test training.
  • the number of training rounds N (N is a positive integer) for test training execution may be set. After performing N rounds of training on the plurality of candidate deep learning models using part of the training data, the test training process of the plurality of candidate deep learning models ends.
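  • A rough sketch of this test-training step is shown below; it assumes PyTorch-style model/loss/optimizer interfaces and the triplet grouping built earlier, and is not the procedure of the disclosure itself:

```python
def test_train(candidates, train_subset, n_rounds=3):
    """Run N rounds of test training for each candidate model on a small
    subset of the training data and record the final loss per candidate.
    `candidates` maps a name to a (model, loss_fn, optimizer) tuple."""
    final_loss = {}
    for name, (model, loss_fn, optimizer) in candidates.items():
        model.train()
        loss_value = float("inf")
        for _ in range(n_rounds):
            for reference, positive, negative in train_subset:
                optimizer.zero_grad()
                loss = loss_fn(model(reference), model(positive), model(negative))
                loss.backward()
                optimizer.step()
                loss_value = loss.item()
        final_loss[name] = loss_value
    return final_loss
```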
  • a candidate deep learning model that performs the task best among the plurality of trained candidate deep learning models is determined.
  • the candidate deep learning model with the smallest value of the loss function may be determined as the best performing candidate deep learning model for performing the task.
  • data in the training data other than the portion of the training data used at block S404 may be determined as validation data. Validation data is then used to verify how well multiple trained candidate deep learning models perform on the task. In the case where the performance is search accuracy, the candidate deep learning model with the highest search accuracy may be determined as the best performing candidate deep learning model.
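  • A hedged sketch of selecting the best candidate on validation data follows; treating "the reference embedding is closer to the positive than to the negative" as a stand-in for search accuracy is an assumption for illustration:

```python
import torch


def search_accuracy(model, validation_data) -> float:
    """Fraction of validation triplets in which the reference embedding
    is closer to the positive than to the negative."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for reference, positive, negative in validation_data:
            r, p, n = model(reference), model(positive), model(negative)
            if torch.dist(r, p) < torch.dist(r, n):
                correct += 1
    return correct / max(len(validation_data), 1)


def select_best(accuracy_by_model: dict) -> str:
    """Return the name of the best performing candidate."""
    return max(accuracy_by_model, key=accuracy_by_model.get)
```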
  • the best performing candidate deep learning model is determined as the first deep learning model.
  • the first deep learning model may be the deep learning model most suitable for performing the user-specified task.
  • the training data is used to train the first deep learning model to obtain the target deep learning model.
  • a loss function and an optimizer corresponding to the first deep learning model may be determined. Wherein, the determined loss function and optimizer are used to train the first deep learning model.
  • a second mapping table of multiple pre-trained deep learning models and loss functions and optimizers corresponding to the multiple deep learning models may be pre-established. The second mapping table may be established based on empirical values. After the first deep learning model is determined, the loss function and optimizer corresponding to the first deep learning model may be determined based on the second mapping table.
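  • A sketch of such a second mapping table is shown below; the model identifiers, loss margins, and learning rates are illustrative assumptions, not values from the disclosure:

```python
import torch

# Hypothetical second mapping table: model -> (loss function, optimizer factory).
SECOND_MAPPING_TABLE = {
    "resnet50-encoder": (
        torch.nn.TripletMarginLoss(margin=1.0),
        lambda model: torch.optim.Adam(model.parameters(), lr=1e-4),
    ),
    "bert-base-encoder": (
        torch.nn.TripletMarginLoss(margin=0.5),
        lambda model: torch.optim.AdamW(model.parameters(), lr=2e-5),
    ),
}


def loss_and_optimizer(model_name: str, model: torch.nn.Module):
    """Look up the loss function and build the optimizer for the first
    deep learning model."""
    loss_fn, make_optimizer = SECOND_MAPPING_TABLE[model_name]
    return loss_fn, make_optimizer(model)
```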
  • the value of the loss function of the first deep learning model at each round may be displayed during the training of the first deep learning model.
  • the value of the loss function at each round can be plotted as a curve for the user to observe.
  • the training history of the first deep learning model may be recorded during the training of the first deep learning model.
  • the training history includes the model parameters of the first deep learning model obtained after each round of training. This allows users to review model training history.
  • The user can select the number of training rounds for training the first deep learning model based on the observed values of the loss function. If the user's selection of the number of training rounds of the first deep learning model is received, the first deep learning model trained for that number of rounds can be generated from the recorded model parameters corresponding to that number of rounds. Then, the generated first deep learning model may be determined as the target deep learning model.
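  • The per-round recording and roll-back could look roughly like the following PyTorch sketch; the history layout and the averaging of the per-round loss are assumptions:

```python
import copy


def train_with_history(model, loss_fn, optimizer, data, rounds):
    """Record the loss value and a snapshot of the model parameters after
    every training round so the user can later roll back to a chosen round."""
    history = []
    for round_idx in range(1, rounds + 1):
        total = 0.0
        for reference, positive, negative in data:
            optimizer.zero_grad()
            loss = loss_fn(model(reference), model(positive), model(negative))
            loss.backward()
            optimizer.step()
            total += loss.item()
        history.append({"round": round_idx,
                        "loss": total / max(len(data), 1),
                        "state_dict": copy.deepcopy(model.state_dict())})
    return history


def restore_round(model, history, chosen_round):
    """Rebuild the first deep learning model with the parameters recorded
    for the round selected by the user."""
    model.load_state_dict(history[chosen_round - 1]["state_dict"])
    return model
```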
  • The method for generating a target deep learning model is therefore very user-friendly: it can reduce the user's workload and speed up development progress.
  • The indication obtained at block S102 may also include one or more of the following: a model designation of the first deep learning model, a total number of layers of the first deep learning model, a layer number of the output layer of the first deep learning model, and the training parameters used to train the first deep learning model.
  • advanced developers of deep learning models can work flexibly using the method for generating a target deep learning model according to embodiments of the present disclosure.
  • Pre-trained deep learning models can have different deep learning frameworks (formats). Junior developers often start from a single deep learning framework when learning how to build deep learning models. If the pre-trained model that a junior developer wants to use is written in a deep learning framework the developer is not familiar with, the developer first needs to become familiar with that framework before fine-tuning the deep learning model.
  • the generated deep learning model can be made to have a format desired by the user (which may be alternatively referred to as a target format in this context).
  • the indication obtained at block S102 of FIG. 1 may include a target format of the target deep learning model.
  • the graph description and model parameters of the first deep learning model determined at block S106 may be respectively converted into the graph description and model parameters of the general format ONNX model, thereby converting the format of the first deep learning model into ONNX.
  • the first deep learning model in the ONNX format is converted into a first deep learning model in the target format.
  • the first deep learning model trained at block S108 of Figure 1 is the first deep learning model with the target format. After training the first deep learning model with the target format, a target deep learning model with the target format can be obtained.
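  • Assuming the first deep learning model is held in PyTorch format, the export to the general ONNX format could be sketched as follows; turning the ONNX graph and parameters into the user's target format would then rely on a separate, framework-specific converter, which is not shown here:

```python
import torch


def export_to_onnx(model: torch.nn.Module,
                   example_input: torch.Tensor,
                   path: str = "first_model.onnx") -> str:
    """Convert the first deep learning model to the general ONNX format
    (graph description plus model parameters)."""
    model.eval()
    torch.onnx.export(model, example_input, path,
                      input_names=["input"], output_names=["embedding"])
    return path
```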
  • FIG. 5 shows a schematic block diagram of an apparatus 500 for generating training data according to an embodiment of the present disclosure.
  • the device 500 includes: an obtaining module 510, a determining module 520, and a generating module 530.
  • the acquisition module 510 is used to acquire the original data input by the user for the target deep learning model.
  • the determining module 520 is used to determine the type of original data.
  • the generation module 530 is used to generate training data according to the type of original data.
  • the types of original data include labeled classification data, labeled conversation data, and unlabeled data.
  • the label of the classification data indicates the category of the classification data, and the label of the conversation data indicates the question-answer correlation of the conversation data.
  • FIG. 6 shows a schematic block diagram of an electronic device 600 executing a method for generating a target deep learning model according to an embodiment of the present disclosure.
  • the electronic device 600 may include a processor 610 and a memory 620 storing a computer program. When the computer program is executed by the processor 610, the electronic device 600 can perform the steps of the method 100 shown in FIG. 1 .
  • electronic device 600 may be a computer device or a cloud computing node.
  • the electronic device 600 may serve as a platform for providing services for generating training data from raw data.
  • the electronic device 600 may obtain raw data input by the user for the target deep learning model. Electronic device 600 may then determine the type of raw data.
  • Types of raw data include labeled categorical data, labeled session data, and unlabeled data.
  • the label of categorical data indicates the category of the categorical data.
  • the labels of the session data indicate the question-answer relevance of the session data. Then, the electronic device 600 can generate training data according to the type of original data.
  • the electronic device 600 may generate training data according to the category indicated by the label of the categorical data.
  • the electronic device 600 may select part or all of the classification data as reference samples from the classification data.
  • the electronic device 600 may use each of the reference samples as a target reference sample.
  • the electronic device 600 may determine classification data having the same category as the target reference sample as a positive sample associated with the target reference sample.
  • the electronic device 600 may determine classification data having a different category from the target reference sample as a negative sample associated with the target reference sample.
  • the electronic device 600 may then combine the target reference sample, the positive samples associated with the target reference sample, and the negative samples associated with the target reference sample into a set of training data.
  • the electronic device 600 may generate training data according to the question and answer correlation indicated by the tag of the session data.
  • each piece of session data includes a reference sample and a plurality of matching samples.
  • the electronic device 600 may treat, for each piece of session data, a matching sample whose label indicates a positive question-answer correlation as a positive sample, and a matching sample whose label indicates a negative question-answer correlation as a negative sample.
  • the electronic device 600 may then combine the reference samples, positive samples, and negative samples into a set of training data.
  • the electronic device 600 may use data augmentation techniques to generate training data.
  • the electronic device 600 may use each unlabeled data in the unlabeled data as a reference sample.
  • Electronic device 600 may generate a plurality of positive samples from the reference samples using data augmentation techniques.
  • the electronic device 600 may use data augmentation techniques to generate a plurality of negative samples from the unlabeled data in addition to the reference samples.
  • the processor 610 may be, for example, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a processor based on a multi-core processor architecture, or the like.
  • Memory 620 may be any type of memory implemented using data storage technology, including but not limited to random access memory, read-only memory, semiconductor-based memory, flash memory, disk memory, and the like.
  • the electronic device 600 may also include an input device 630, such as a keyboard, a mouse, etc., for obtaining raw data for generating training data.
  • the electronic device 600 may also include an output device 640, such as a display, for outputting the generated training data.
  • the methods and devices for generating training data can automatically generate training data for training a target deep learning model from raw data from users. In this way, users do not need to master relevant knowledge about generating training data from various types of raw data, which reduces the user's work burden and improves work efficiency.
  • Each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present disclosure provides a method and apparatus for generating training data. The training data is used to train a target deep learning model. In the method, raw data input by a user for the target deep learning model is obtained. Then, the type of the raw data is determined. The types of raw data include labeled classification data, labeled session data, and unlabeled data. The label of the classification data indicates the category of the classification data. The label of the session data indicates the question-answer relevance of the session data. Next, the training data is generated according to the type of the raw data.

Description

用于生成训练数据的方法以及装置 技术领域
本公开的实施例涉及计算机技术领域,具体地,涉及用于生成训练数据的方法以及装置。
背景技术
深度学习模型是一种机器学习模型,其目的在于建立、模拟人脑进行分析学习的神经网络,模仿人脑的机制来解释数据,如文本、图像、声音等。深度学习模型可以被广泛地应用于各个领域,执行各种各样的任务,例如计算机视觉、语言理解、语音识别、广告推荐、神经搜索等。
在深度学习技术发展的初始阶段,每个深度学习模型的开发者都需要编写大量的重复代码。为了提高工作效率,这些开发者将他们编写好的代码写成了深度学习框架发布到网络上供其他开发者一起使用。陆续地在网络上出现了不同的深度学习框架。目前流行的深度学习框架有PaddlePaddle、Tensorflow、Caffe、Theano、MXNet、Torch和PyTorch等。随着深度学习技术的发展,一些开发者会将预训练的深度学习模型发布在网络上。在其他开发者需要实现任务时,他们可使用任务数据对预训练的深度学习模型进行微调来获得期望的深度学习模型。在这个微调的过程中,开发者需要根据实际情况处理任务数据以构建训练数据集,并根据个人经验选择损失函数以及进行模型优化。
发明内容
本文中描述的实施例提供了一种用于生成训练数据的方法、装置、电子设备以及存储有计算机程序的计算机可读存储介质。
根据本公开的第一方面,提供了一种用于生成训练数据的方法。该训练数据用于训练目标深度学习模型。在该方法中,获取用户输入的用于目标深度学习模型的原始数据。然后,确定原始数据的类型。原始数据的类 型包括有标签的分类数据、有标签的会话数据、以及无标签数据。分类数据的标签指示分类数据的类别。会话数据的标签指示会话数据的问答相关性。接着,按照原始数据的类型来生成训练数据。
在本公开的一些实施例中,在按照原始数据的类型来生成训练数据的步骤中,如果原始数据是分类数据,则按照分类数据的标签所指示的类别来生成训练数据。
在本公开的一些实施例中,在按照分类数据的标签所指示的类别来生成训练数据的步骤中,从分类数据中选择部分或全部分类数据作为参考样本。将参考样本中的每个参考样本作为目标参考样本。将具有与目标参考样本相同的类别的分类数据确定为与目标参考样本相关联的正样本。将具有与目标参考样本不同的类别的分类数据确定为与目标参考样本相关联的负样本。然后,将目标参考样本、与目标参考样本相关联的正样本和与目标参考样本相关联的负样本组合成一组训练数据。
在本公开的一些实施例中,分类数据包括多个标签。分类数据的类别由分类数据的一个或多个标签来确定。
在本公开的一些实施例中,在按照原始数据的类型来生成训练数据的步骤中,如果原始数据是会话数据,则按照会话数据的标签所指示的问答相关性来生成训练数据。
在本公开的一些实施例中,每一条会话数据包括一个参考样本以及多个匹配样本。在按照会话数据的标签所指示的问答相关性来生成训练数据的过程中,针对每一条会话数据,将其标签指示肯定的问答相关性的匹配样本作为正样本,并将其标签指示否定的问答相关性的匹配样本作为负样本。然后,将参考样本、正样本和负样本组合成一组训练数据。
在本公开的一些实施例中,分类数据的标签为一元标签,会话数据的标签为二元标签。
在本公开的一些实施例中,在按照原始数据的类型来生成训练数据的步骤中,如果原始数据是无标签数据,则使用数据增强技术来生成训练数据。
在本公开的一些实施例中,在使用数据增强技术来生成训练数据的步骤中,将无标签数据中的每个无标签数据作为参考样本。使用数据增强技术从参考样本生成多个正样本。使用数据增强技术从除了参考样本的无标签数据生成多个负样本。
在本公开的一些实施例中,在无标签数据是图片的情况下,数据增强技术包括:对图片执行翻转、镜像、裁剪等操作中的一个或多个操作。
在本公开的一些实施例中,在无标签数据是文字的情况下,数据增强技术包括:对文字执行随机掩码操作。
在本公开的一些实施例中,在无标签数据是声音段落的情况下,数据增强技术包括:对声音段落执行随机掩码操作。
根据本公开的第二方面,提供了一种用于生成训练数据的装置。该装置包括:获取模块,用于获取用户输入的用于目标深度学习模型的原始数据;确定模块,用于确定原始数据的类型;以及生成模块,用于按照原始数据的类型来生成训练数据。原始数据的类型包括有标签的分类数据、有标签的会话数据、以及无标签数据,分类数据的标签指示分类数据的类别,会话数据的标签指示会话数据的问答相关性。
根据本公开的第三方面,提供了一种电子设备。该电子设备包括:至少一个处理器;以及存储有计算机程序的至少一个存储器。当计算机程序由至少一个处理器执行时,使得电子设备:获取用户输入的用于目标深度学习模型的原始数据;确定原始数据的类型,原始数据的类型包括有标签的分类数据、有标签的会话数据、以及无标签数据,分类数据的标签指示分类数据的类别,会话数据的标签指示会话数据的问答相关性;以及按照原始数据的类型来生成训练数据。
在本公开的一些实施例中,计算机程序在由至少一个处理器执行时使得电子设备通过以下操作来按照原始数据的类型来生成训练数据:响应于原始数据是分类数据,按照分类数据的标签所指示的类别来生成训练数据。
在本公开的一些实施例中,计算机程序在由至少一个处理器执行时使得电子设备通过以下操作来按照分类数据的标签所指示的类别来生成训练 数据:从分类数据中选择部分或全部分类数据作为参考样本;将参考样本中的每个参考样本作为目标参考样本;将具有与目标参考样本相同的类别的分类数据确定为与目标参考样本相关联的正样本;将具有与目标参考样本不同的类别的分类数据确定为与目标参考样本相关联的负样本;以及将目标参考样本、与目标参考样本相关联的正样本和与目标参考样本相关联的负样本组合成一组训练数据。
在本公开的一些实施例中,计算机程序在由至少一个处理器执行时使得电子设备通过以下操作来按照原始数据的类型来生成训练数据:响应于原始数据是会话数据,按照会话数据的标签所指示的问答相关性来生成训练数据。
在本公开的一些实施例中,每一条会话数据包括一个参考样本以及多个匹配样本。计算机程序在由至少一个处理器执行时使得电子设备通过以下操作来按照会话数据的标签所指示的问答相关性来生成训练数据:针对每一条会话数据,将其标签指示肯定的问答相关性的匹配样本作为正样本;将其标签指示否定的问答相关性的匹配样本作为负样本;以及将参考样本、正样本和负样本组合成一组训练数据。
在本公开的一些实施例中,计算机程序在由至少一个处理器执行时使得电子设备通过以下操作来按照原始数据的类型来生成训练数据:响应于原始数据是无标签数据,使用数据增强技术来生成训练数据。
在本公开的一些实施例中,计算机程序在由至少一个处理器执行时使得电子设备通过以下操作来使用数据增强技术来生成训练数据:将无标签数据中的每个无标签数据作为参考样本;使用数据增强技术从参考样本生成多个正样本;以及使用数据增强技术从除了参考样本的无标签数据生成多个负样本。
根据本公开的第四方面,提供了一种存储有计算机程序的计算机可读存储介质,其中,计算机程序在由处理器执行时实现根据本公开的第一方面所述的方法的步骤。
附图说明
为了更清楚地说明本公开的实施例的技术方案,下面将对实施例的附图进行简要说明,应当知道,以下描述的附图仅仅涉及本公开的一些实施例,而非对本公开的限制,其中:
图1是根据本公开的实施例的用于生成目标深度学习模型的方法的示例性流程图;
图2是图1所示的实施例中的从原始数据生成训练数据的步骤的示例性流程图;
图3是图2所示的实施例中的按照原始数据的类型来生成训练数据的步骤的示例性流程图;
图4是图1所示的实施例中的确定与任务相对应的第一深度学习模型的步骤的示例性流程图;
图5是根据本公开的实施例的用于生成训练数据的装置的示意性框图;
图6是根据本公开的实施例的执行用于生成训练数据的方法的电子设备的示意性框图。
需要注意的是,附图中的元素是示意性的,没有按比例绘制。
具体实施方式
为了使本公开的实施例的目的、技术方案和优点更加清楚,下面将结合附图,对本公开的实施例的技术方案进行清楚、完整的描述。显然,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。基于所描述的本公开的实施例,本领域技术人员在无需创造性劳动的前提下所获得的所有其它实施例,也都属于本公开保护的范围。
除非另外定义,否则在此使用的所有术语(包括技术和科学术语)具有与本公开主题所属领域的技术人员所通常理解的相同含义。进一步将理解的是,诸如在通常使用的词典中定义的那些的术语应解释为具有与说明书上下文和相关技术中它们的含义一致的含义,并且将不以理想化或过于 正式的形式来解释,除非在此另外明确定义。诸如“第一”和“第二”的术语仅用于将一个部件(或部件的一部分)与另一个部件(或部件的另一部分)区分开。
如上所述,深度学习模型的开发者可通过微调预训练的深度学习模型来获得目标深度学习模型。在对深度学习模型进行微调的过程中,需要进行训练数据准备、模型选择和训练参数选择等操作。这需要开发者具备大量的深度学习模型相关知识,因此对于初级开发者不够友好。这不仅需要初级开发者付出大量的劳动,还耽误开发进度。
本公开的实施例提出了一种用于生成目标深度学习模型的方法。图1示出了根据本公开的实施例的用于生成目标深度学习模型的方法的示例性流程图。
在该方法100中,在框S102处,获取用户输入的指示以及用于生成目标深度学习模型的原始数据。该指示包括期望目标深度学习模型执行的任务。在本公开的一些实施例中,该任务可以是搜索任务,例如,以文字搜图片、以文字搜文字、以图片搜图片、以图片搜文字、以及以声音搜声音等。在一个示例中,用户输入的指示可包括期望获得能够执行指定搜索任务的深度学习模型。用户输入的原始数据与指示中包括的任务相关联。在该搜索任务是以图片搜图片的情况下,用户可输入一个图片集作为用于生成目标深度学习模型的原始数据。在该搜索任务是以声音搜声音的情况下,用户可输入一个音频集作为用于生成目标深度学习模型的原始数据。
在框S104处,从原始数据生成训练数据。下文继续以任务是搜索任务为例进行说明。图2示出从原始数据生成训练数据的步骤的示例性流程图。在图2的框S202处,确定原始数据的类型。原始数据的类型可包括:有标签的分类数据,有标签的会话数据,以及无标签数据。分类数据的标签指示分类数据的类别。会话数据的标签指示会话数据的问答相关性。在本公开的一些实施例中,原始数据可包括标签指示字段。该标签指示字段指明原始数据的标签。可通过该标签指示字段来确定原始数据的类型。
在本公开的一些实施例中,分类数据的标签可超过一个。可基于分类 数据的部分或者全部标签来确定分类数据的类别。以图片为例来说明分类数据的标签如何指示分类数据的类别。假设分类数据的标签包括:猫、狗、可爱的、难看的。那么可基于部分标签“猫”和“狗”来将图片分类为猫的图片和狗的图片。可替代地,可基于全部标签“猫”、“狗”、“可爱的”和“难看的”来将图片分类为可爱的猫的图片、难看的猫的图片、可爱的狗的图片、以及难看的狗的图片。无论分类数据的标签有多少个,该标签都是一元标签。一元标签表示该标签仅与一个数据相关。
在本公开的一些实施例中,会话数据例如是电商平台与用户的历史交互数据。在一个示例中,用户可在电商平台中搜索某个商品的关键字(相当于“问”)。基于该关键字,电商平台可向用户推送若干商品链接(相当于“答”)。如果用户点击了某个商品链接,则将该关键字与该商品链接形成的会话数据的标签设置为例如“相关”(例如,用数字1来表示)。如果用户没有点击某个商品链接,则将该关键字与该商品链接形成的会话数据的标签设置为例如“不相关”(例如,用数字0来表示)。会话数据的标签与问和答二者相关,因此是二元标签。此类会话数据可例如来自电商平台针对各个用户保存的搜索日志。
在本公开的一些实施例中,无标签数据是指标签指示字段为空的数据。在电商平台的示例中,无标签数据例如是用户上传的商品照片。此类商品照片并不带有标签,因此被定义为无标签数据。
在框S204处,按照原始数据的类型来生成训练数据。图3示出按照原始数据的类型来生成训练数据的步骤的示例性流程图。在图3的框S302处,确定原始数据是否有标签。在本公开的一些实施例中,可通过原始数据中的标签指示字段是否为空来确定原始数据是否有标签。如果原始数据有标签(在框S302处为“是”),则在框S304处确定原始数据是否是分类数据。
在本公开的一些实施例中,可通过原始数据的标签是一元标签还是二元标签来确定原始数据是否是分类数据。如果标签是一元标签,则确定原始数据是分类数据。如果标签是二元标签,则确定原始数据是会话数据。
在本公开的另一些实施例中,可通过原始数据中的标签指示字段是否包括文字来确定原始数据是否是分类数据。如果标签指示字段包括文字,则确定原始数据是分类数据。如果标签指示字段只包括数字0或1,则确定原始数据是会话数据。上述文字可包括中文、英文、其他语言类型的文字或者它们的组合。
如果原始数据是分类数据(在框S304处为“是”),则在框S306处按照分类数据的标签所指示的类别来生成训练数据。在本公开的一些实施例中,可例如从分类数据中选择部分或全部分类数据作为参考样本。针对参考样本中的每一个,可将具有与该参考样本相同的类别的分类数据确定为与该参考样本相关联的正样本,将具有与该参考样本不同的类别的分类数据确定为与该参考样本相关联的负样本。该参考样本、与该参考样本相关联的正样本和与该参考样本相关联的负样本被组合成一组训练数据。这样针对参考样本中的每一个都生成了对应的一组训练数据。
如果原始数据不是分类数据(在框S304处为“否”),则在框S308处按照会话数据的标签所指示的问答相关性来生成训练数据。在本公开的一些实施例中,每一条会话数据包括一个参考样本以及多个匹配样本。在上述电商平台的示例中,参考样本可以是用户在电商平台中搜索某个商品的关键字,多个匹配样本可以是电商平台向用户推送的若干商品链接。每个匹配样本带有一个标签,用于指示该匹配样本与参考样本是否相关。针对每一条会话数据,可例如将标签为“相关”或1(标签指示肯定的问答相关性)的匹配样本作为正样本,将标签为“不相关”或0(标签指示否定的问答相关性)的匹配样本作为负样本。参考样本、正样本和负样本被组合成一组训练数据。这样针对每一条会话数据都生成了对应的一组训练数据。
如果原始数据没有标签(在框S302处为“否”),则在框S310处使用数据增强技术来生成训练数据。在本公开的一些实施例中,可将原始数据中的每个原始数据作为参考样本。然后可使用数据增强技术从该参考样本生成多个正样本,从除了该参考样本的原始数据生成多个负样本。在数据 增强的过程中,可通过在图1的框S102处获取的任务来确定所采用的数据增强方式。
在任务是以图片搜图片的搜索任务的示例中,可对作为参考样本的图片执行翻转、镜像、裁剪等操作中的一个或多个操作以生成多个正样本。可对原始数据中除了参考样本之外的图片执行翻转、镜像、裁剪等操作中的一个或多个操作以生成多个负样本。
在任务是以文字搜文字或者声音搜声音的搜索任务的示例中,可对作为参考样本的文字或声音段落进行随机掩码操作以生成多个正样本。可对原始数据中除了参考样本之外的文字或声音段落进行随机掩码操作以生成多个负样本。
在对文字进行随机掩码操作时,可随机地遮盖或去除文字中的任意一个或多个字。换句话说,文字中被遮盖或去除的字的个数和位置都是随机的。在一个示例中,对于“我很喜欢北京”进行随机掩码操作则可能得到“我【未知】喜欢北京”,其中“很”被遮盖,并被标识为“【未知】”。在一个替代示例中,对于“我很喜欢北京”进行随机掩码操作还可能得到“我喜欢北京”,其中“很”被去除。在另一个替代示例中,对于“我很喜欢北京”进行随机掩码操作还可能得到“【未知】喜欢北京”,其中“我很”被遮盖,并被标识为“【未知】”。
在对声音段落进行随机掩码操作时,可随机地遮盖或去除声音段落中的任意长度的声音片段。换句话说,声音段落中被遮盖或去除的声音片段的长度和位置都是随机的。
框S104处的操作可以在没有用户参与的情况下自动生成训练数据,减轻了用户的工作负担并提高了工作效率。
回到图1,在框S106处,确定与任务相对应的第一深度学习模型。在本公开的一些实施例中,框S106处的操作可与框S104处的操作并行地执行。在本公开的另一些实施例中,可先执行框S106处的操作再执行框S104处的操作。图4示出确定与任务相对应的第一深度学习模型的步骤的示例性流程图。
在图4的框S402处,确定与任务相对应的多个候选深度学习模型。在本公开的一些实施例中,可预先建立多个预训练的深度学习模型与该多个深度学习模型可执行的多个任务的第一映射表。该多个预训练的深度学习模型可以是现有的预训练的深度学习模型,也可以是未来开发的预训练的深度学习模型。第一映射表可基于经验值来建立。在一个示例中,多个深度学习模型可执行同一个任务。换句话说,一个任务可由多个深度学习模型中的任意一个来执行。因此,可基于第一映射表来确定与任务相对应的多个候选深度学习模型。
在本公开的一些实施例中,可事先对多个预训练的深度学习模型进行调整,并将调整后的深度学习模型加入第一映射表中。这样在接收到包括任务的指示之后,可将上述调整后的深度学习模型作为与该任务相对应的候选深度学习模型。该调整可例如包括:在预训练的深度学习模型的最后一层输出上添加若干层全连接神经元作为输出;改变预训练的深度学习模型的输出层的层号(例如,从倒数第二层输出而非从最后一层输出)。
在本公开的一些实施例中,该多个预训练的深度学习模型可包括大模型和小模型。大模型的总层数比小模型的总层数更多。在本公开的一些实施例中,可针对该多个候选深度学习模型设置不同的训练参数。训练参数可包括以下中的一个或多个:学习率、以及训练停止条件等。
在框S404处,使用训练数据中的部分训练数据来训练该多个候选深度学习模型。在这里使用部分训练数据的目的是减少计算量。使用部分训练数据对该多个候选深度学习模型进行的训练相当于测试训练。在本公开的一些实施例中,可设置测试训练执行的训练轮数N(N为正整数)。在使用部分训练数据分别对该多个候选深度学习模型执行N轮训练之后,结束对该多个候选深度学习模型的测试训练过程。
在框S406处,确定经训练的多个候选深度学习模型中执行任务的表现最好的候选深度学习模型。在本公开的一些实施例中,可将损失函数的值最小的候选深度学习模型确定为执行任务的表现最好的候选深度学习模型。在本公开的另一些实施例中,可将训练数据中除了在框S404处使用的 部分训练数据之外的数据确定为验证数据。然后使用验证数据来验证经训练的多个候选深度学习模型执行任务的表现。在该表现是搜索准确率的情况下,可将搜索准确率最高的候选深度学习模型确定为表现最好的候选深度学习模型。
在框S408处,将表现最好的候选深度学习模型确定为第一深度学习模型。这样,通过框S402至框S406的操作,第一深度学习模型可以是最适合执行用户指定的任务的深度学习模型。
回到图1,在框S108处,使用训练数据来训练第一深度学习模型以获得目标深度学习模型。在本公开的一些实施例中,可确定与第一深度学习模型相对应的损失函数和优化器。其中,所确定的损失函数和优化器用于训练第一深度学习模型。在本公开的一些实施例中,可预先建立多个预训练的深度学习模型与该多个深度学习模型对应的损失函数和优化器的第二映射表。第二映射表可基于经验值来建立。在确定了第一深度学习模型之后,可基于第二映射表来确定与第一深度学习模型相对应的损失函数和优化器。
在本公开的一些实施例中,可在训练第一深度学习模型的过程中显示第一深度学习模型的损失函数在每一轮的值。损失函数在每一轮的值可被绘制成曲线以便用户观察。
在本公开的一些实施例中,在训练第一深度学习模型的过程中可记录第一深度学习模型的训练历史。训练历史包括每一轮训练之后得到的第一深度学习模型的模型参数。这样用户可以回溯模型训练历史。用户可以基于观察到的损失函数的值来选择对第一深度学习模型进行训练的训练轮数。如果接收到用户对第一深度学习模型的训练轮数的选择,则可根据所记录的训练轮数对应的模型参数来生成经过训练轮数训练的第一深度学习模型。然后,可将所生成的第一深度学习模型确定为目标深度学习模型。
通过上述操作,用户无需了解各个深度学习模型的具体结构。其只需要输入包括待执行任务的指示以及用于生成目标深度学习模型的原始数据,就可以获得期望的目标深度学习模型。因此,根据本公开的实施例的 用于生成目标深度学习模型的方法对于用户十分友好,能够减轻用户的工作量并且加快开发进度。
进一步地,在本公开的一些实施例中,还允许用户指定目标深度学习模型的型号、模型参数和训练参数。这样有经验的深度学习模型开发者能够自己选择使用哪个深度学习模型并且设置关于目标深度学习模型的一个或多个参数,以便更灵活的开发目标深度学习模型。在这种情况下,在框S102处获取的指示还可包括以下中的一个或多个:第一深度学习模型的型号,第一深度学习模型的总层数,第一深度学习模型的输出层的层号,以及用于训练第一深度学习模型的训练参数。通过上述方式,深度学习模型的高级开发者可利用根据本公开的实施例的用于生成目标深度学习模型的方法来灵活地工作。
另外,预训练的深度学习模型可具有不同的深度学习框架(格式)。初级开发者往往从单个深度学习框架开始学习如何建立深度学习模型。如果初级开发者要使用的预训练模型是以其不擅长的深度学习框架来编写的,那么他需要先熟悉该深度学习框架,再进行微调深度学习模型的操作。
针对上述情况,本公开的实施例提出可使得生成的深度学习模型具有用户期望的格式(在上下文中可被替换地称为目标格式)。在本公开的一些实施例中,在图1的框S102处获取的指示可包括目标深度学习模型的目标格式。可将在框S106处确定的第一深度学习模型的图描述和模型参数分别转换为通用格式ONNX模型的图描述和模型参数,从而将第一深度学习模型的格式转换为ONNX。在将第一深度学习模型的格式转换为ONNX后,再将该ONNX格式的第一深度学习模型转换为具有目标格式的第一深度学习模型。在这种情况下,在图1的框S108处训练的第一深度学习模型是具有目标格式的第一深度学习模型。经过对具有目标格式的第一深度学习模型的训练,可获得具有目标格式的目标深度学习模型。
图5示出根据本公开的实施例的用于生成训练数据的装置500的示意性框图。该装置500包括:获取模块510、确定模块520、以及生成模块530。获取模块510用于获取用户输入的用于目标深度学习模型的原始数 据。确定模块520用于确定原始数据的类型。生成模块530用于按照原始数据的类型来生成训练数据。原始数据的类型包括有标签的分类数据、有标签的会话数据、以及无标签数据,分类数据的标签指示分类数据的类别,会话数据的标签指示会话数据的问答相关性。
图6示出根据本公开的实施例的执行用于生成目标深度学习模型的方法的电子设备600的示意性框图。如图6所示,该电子设备600可包括处理器610和存储有计算机程序的存储器620。当计算机程序由处理器610执行时,使得电子设备600可执行如图1所示的方法100的步骤。在一个示例中,电子设备600可以是计算机设备或云计算节点。电子设备600可作为用于提供从原始数据生成训练数据服务的平台。电子设备600可获取用户输入的用于目标深度学习模型的原始数据。然后,电子设备600可确定原始数据的类型。原始数据的类型包括有标签的分类数据、有标签的会话数据、以及无标签数据。分类数据的标签指示分类数据的类别。会话数据的标签指示会话数据的问答相关性。接着,电子设备600可按照原始数据的类型来生成训练数据。
在本公开的一些实施例中,如果原始数据是分类数据,则电子设备600可按照分类数据的标签所指示的类别来生成训练数据。
在本公开的一些实施例中,电子设备600可从分类数据中选择部分或全部分类数据作为参考样本。电子设备600可将参考样本中的每个参考样本作为目标参考样本。电子设备600可将具有与目标参考样本相同的类别的分类数据确定为与目标参考样本相关联的正样本。电子设备600可将具有与目标参考样本不同的类别的分类数据确定为与目标参考样本相关联的负样本。然后,电子设备600可将目标参考样本、与目标参考样本相关联的正样本和与目标参考样本相关联的负样本组合成一组训练数据。
在本公开的一些实施例中,如果原始数据是会话数据,则电子设备600可按照会话数据的标签所指示的问答相关性来生成训练数据。
在本公开的一些实施例中,每一条会话数据包括一个参考样本以及多个匹配样本。电子设备600可针对每一条会话数据将其标签指示肯定的问 答相关性的匹配样本作为正样本,并将其标签指示否定的问答相关性的匹配样本作为负样本。然后,电子设备600可将参考样本、正样本和负样本组合成一组训练数据。
在本公开的一些实施例中,如果原始数据是无标签数据,则电子设备600可使用数据增强技术来生成训练数据。
在本公开的一些实施例中,电子设备600可将无标签数据中的每个无标签数据作为参考样本。电子设备600可使用数据增强技术从参考样本生成多个正样本。电子设备600可使用数据增强技术从除了参考样本的无标签数据生成多个负样本。
在本公开的实施例中,处理器610可以是例如中央处理单元(CPU)、微处理器、数字信号处理器(DSP)、基于多核的处理器架构的处理器等。存储器620可以是使用数据存储技术实现的任何类型的存储器,包括但不限于随机存取存储器、只读存储器、基于半导体的存储器、闪存、磁盘存储器等。
此外,在本公开的实施例中,电子设备600也可包括输入设备630,例如键盘、鼠标等,用于获取用于生成训练数据的原始数据。另外,电子设备600还可包括输出设备640,例如显示器等,用于输出所生成的训练数据。
综上所述,根据本公开实施例的用于生成训练数据的方法和装置能够从来自用户的原始数据自动生成用于训练目标深度学习模型的训练数据。这样,用户无需掌握关于从各种类型的原始数据生成训练数据的相关知识,减轻了用户的工作负担并提高了工作效率。
附图中的流程图和框图显示了根据本公开的多个实施例的装置和方法的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有 时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
除非上下文中另外明确地指出,否则在本文和所附权利要求中所使用的词语的单数形式包括复数,反之亦然。因而,当提及单数时,通常包括相应术语的复数。相似地,措辞“包含”和“包括”将解释为包含在内而不是独占性地。同样地,术语“包括”和“或”应当解释为包括在内的,除非本文中明确禁止这样的解释。在本文中使用术语“示例”之处,特别是当其位于一组术语之后时,所述“示例”仅仅是示例性的和阐述性的,且不应当被认为是独占性的或广泛性的。
适应性的进一步的方面和范围从本文中提供的描述变得明显。应当理解,本申请的各个方面可以单独或者与一个或多个其它方面组合实施。还应当理解,本文中的描述和特定实施例旨在仅说明的目的并不旨在限制本申请的范围。
以上对本公开的若干实施例进行了详细描述,但显然,本领域技术人员可以在不脱离本公开的精神和范围的情况下对本公开的实施例进行各种修改和变型。本公开的保护范围由所附的权利要求限定。

Claims (15)

  1. 一种用于生成训练数据的方法,所述训练数据用于训练目标深度学习模型,所述方法包括:
    获取用户输入的用于所述目标深度学习模型的原始数据;
    确定所述原始数据的类型,所述原始数据的所述类型包括有标签的分类数据、有标签的会话数据、以及无标签数据,所述分类数据的标签指示所述分类数据的类别,所述会话数据的标签指示所述会话数据的问答相关性;以及
    按照所述原始数据的所述类型来生成所述训练数据。
  2. 根据权利要求1所述的方法,按照所述原始数据的所述类型来生成所述训练数据包括:
    响应于所述原始数据是所述分类数据,按照所述分类数据的标签所指示的类别来生成训练数据。
  3. 根据权利要求2所述的方法,其中,按照所述分类数据的标签所指示的类别来生成训练数据包括:
    从所述分类数据中选择部分或全部分类数据作为参考样本;
    将所述参考样本中的每个参考样本作为目标参考样本;
    将具有与所述目标参考样本相同的类别的分类数据确定为与所述目标参考样本相关联的正样本;
    将具有与所述目标参考样本不同的类别的分类数据确定为与所述目标参考样本相关联的负样本;以及
    将所述目标参考样本、与所述目标参考样本相关联的正样本和与所述目标参考样本相关联的负样本组合成一组训练数据。
  4. 根据权利要求2或3所述的方法,其中,所述分类数据包括多个标签,所述分类数据的类别由所述分类数据的一个或多个标签来确定。
  5. 根据权利要求1所述的方法,其中,按照所述原始数据的所述类型来生成所述训练数据包括:
    响应于所述原始数据是所述会话数据,按照所述会话数据的标签所指 示的问答相关性来生成训练数据。
  6. 根据权利要求5所述的方法,其中,每一条会话数据包括一个参考样本以及多个匹配样本,按照所述会话数据的标签所指示的问答相关性来生成训练数据包括:针对每一条会话数据,
    将其标签指示肯定的问答相关性的匹配样本作为正样本;
    将其标签指示否定的问答相关性的匹配样本作为负样本;以及
    将所述参考样本、所述正样本和所述负样本组合成一组训练数据。
  7. 根据权利要求1至3或5至6中任一项所述的方法,其中,所述分类数据的标签为一元标签,所述会话数据的标签为二元标签。
  8. 根据权利要求1所述的方法,其中,按照所述原始数据的所述类型来生成所述训练数据包括:
    响应于所述原始数据是所述无标签数据,使用数据增强技术来生成训练数据。
  9. 根据权利要求8所述的方法,其中,使用数据增强技术来生成训练数据包括:
    将所述无标签数据中的每个无标签数据作为参考样本;
    使用所述数据增强技术从所述参考样本生成多个正样本;以及
    使用所述数据增强技术从除了所述参考样本的无标签数据生成多个负样本。
  10. 根据权利要求8或9所述的方法,其中,在所述无标签数据是图片的情况下,所述数据增强技术包括:对所述图片执行翻转、镜像、裁剪等操作中的一个或多个操作。
  11. 根据权利要求8或9所述的方法,其中,在所述无标签数据是文字的情况下,所述数据增强技术包括:对所述文字执行随机掩码操作。
  12. 根据权利要求8或9所述的方法,其中,在所述无标签数据是声音段落的情况下,所述数据增强技术包括:对所述声音段落执行随机掩码操作。
  13. 一种用于生成训练数据的装置,包括:
    获取模块,用于获取用户输入的用于所述目标深度学习模型的原始数据;
    确定模块,用于确定所述原始数据的类型,所述原始数据的所述类型包括有标签的分类数据、有标签的会话数据、以及无标签数据,所述分类数据的标签指示所述分类数据的类别,所述会话数据的标签指示所述会话数据的问答相关性;以及
    生成模块,用于按照所述原始数据的所述类型来生成所述训练数据。
  14. 一种电子设备,包括:
    至少一个处理器;以及
    存储有计算机程序的至少一个存储器;
    其中,当所述计算机程序由所述至少一个处理器执行时,使得所述装置执行根据权利要求1至12中任一项所述的方法的步骤。
  15. 一种存储有计算机程序的计算机可读存储介质,其中,所述计算机程序在由处理器执行时实现根据权利要求1至12中任一项所述的方法的步骤。
PCT/CN2022/100583 2022-06-22 2022-06-22 用于生成训练数据的方法以及装置 WO2023245523A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202280005189.6A CN115836288A (zh) 2022-06-22 2022-06-22 用于生成训练数据的方法以及装置
EP22871054.7A EP4322066A4 (en) 2022-06-22 2022-06-22 METHOD AND DEVICE FOR GENERATING TRAINING DATA
PCT/CN2022/100583 WO2023245523A1 (zh) 2022-06-22 2022-06-22 用于生成训练数据的方法以及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/100583 WO2023245523A1 (zh) 2022-06-22 2022-06-22 用于生成训练数据的方法以及装置

Publications (1)

Publication Number Publication Date
WO2023245523A1 true WO2023245523A1 (zh) 2023-12-28

Family

ID=85520083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100583 WO2023245523A1 (zh) 2022-06-22 2022-06-22 用于生成训练数据的方法以及装置

Country Status (3)

Country Link
EP (1) EP4322066A4 (zh)
CN (1) CN115836288A (zh)
WO (1) WO2023245523A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271778B (zh) * 2023-11-17 2024-02-09 北京水滴科技集团有限公司 基于生成式大模型的保险外呼会话信息输出方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316049A (zh) * 2017-05-05 2017-11-03 华南理工大学 一种基于半监督自训练的迁移学习分类方法
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN110414622A (zh) * 2019-08-06 2019-11-05 广东工业大学 基于半监督学习的分类器训练方法及装置
CN112069302A (zh) * 2020-09-15 2020-12-11 腾讯科技(深圳)有限公司 会话意图识别模型的训练方法、会话意图识别方法及装置
CN112115995A (zh) * 2020-09-11 2020-12-22 北京邮电大学 一种基于半监督学习的图像多标签分类方法
CN112560912A (zh) * 2020-12-03 2021-03-26 北京百度网讯科技有限公司 分类模型的训练方法、装置、电子设备和存储介质
US20210342684A1 (en) * 2020-04-29 2021-11-04 International Business Machines Corporation Method and system for table retrieval using multimodal deep co-learning with helper query-dependent and query-independent relevance labels

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139119A (zh) * 2020-01-20 2021-07-20 微软技术许可有限责任公司 用于问题回答(qa)的对仗学习
CN114386503A (zh) * 2022-01-04 2022-04-22 京东科技信息技术有限公司 用于训练模型的方法和装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316049A (zh) * 2017-05-05 2017-11-03 华南理工大学 一种基于半监督自训练的迁移学习分类方法
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN110414622A (zh) * 2019-08-06 2019-11-05 广东工业大学 基于半监督学习的分类器训练方法及装置
US20210342684A1 (en) * 2020-04-29 2021-11-04 International Business Machines Corporation Method and system for table retrieval using multimodal deep co-learning with helper query-dependent and query-independent relevance labels
CN112115995A (zh) * 2020-09-11 2020-12-22 北京邮电大学 一种基于半监督学习的图像多标签分类方法
CN112069302A (zh) * 2020-09-15 2020-12-11 腾讯科技(深圳)有限公司 会话意图识别模型的训练方法、会话意图识别方法及装置
CN112560912A (zh) * 2020-12-03 2021-03-26 北京百度网讯科技有限公司 分类模型的训练方法、装置、电子设备和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4322066A4 *

Also Published As

Publication number Publication date
CN115836288A (zh) 2023-03-21
EP4322066A1 (en) 2024-02-14
EP4322066A4 (en) 2024-02-14

Similar Documents

Publication Publication Date Title
US10521463B2 (en) Answering questions via a persona-based natural language processing (NLP) system
US10984319B2 (en) Neural architecture search
US9582757B1 (en) Scalable curation system
JP2022153441A (ja) モデル事前訓練方法および装置、テキスト生成方法および装置、電子機器、記憶媒体並びにコンピュータプログラム
CN108595629B (zh) 用于答案选择系统的数据处理方法及应用
CN110795552A (zh) 一种训练样本生成方法、装置、电子设备及存储介质
CN108846138B (zh) 一种融合答案信息的问题分类模型构建方法、装置和介质
CN116127020A (zh) 生成式大语言模型训练方法以及基于模型的搜索方法
CN115495568B (zh) 一种对话模型的训练方法及装置、对话响应方法及装置
CN117149989A (zh) 大语言模型训练方法、文本处理方法及装置
WO2024011813A1 (zh) 一种文本扩展方法、装置、设备及介质
CN112506945A (zh) 基于知识图谱的自适应导学方法及系统
WO2023245523A1 (zh) 用于生成训练数据的方法以及装置
US11893990B2 (en) Audio file annotation
CN117521814A (zh) 一种基于多模态输入和知识图谱的问答方法及装置
KR20220066554A (ko) Qa 모델을 이용하여 지식 그래프를 구축하는 방법, 장치 및 컴퓨터 프로그램
Surendran et al. Conversational AI-A retrieval based chatbot
Hu et al. Dynamically retrieving knowledge via query generation for informative dialogue generation
WO2023245522A1 (zh) 用于生成目标深度学习模型的方法以及装置
Kumar et al. Building conversational Question Answer Machine and comparison of BERT and its different variants
Li et al. MOOC Guider: An End-to-End Dialogue System for MOOC Users
CN117453895B (zh) 一种智能客服应答方法、装置、设备及可读存储介质
US20240086768A1 (en) Learning device, inference device, non-transitory computer-readable medium, learning method, and inference method
WO2023026444A1 (ja) 要約学習支援装置、要約学習支援方法及びプログラム
JP7126682B2 (ja) 対話システム及びそのコンピュータプログラム

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022871054

Country of ref document: EP

Effective date: 20230328