WO2023071743A1 - Network model training method, device and computer-readable storage medium - Google Patents

Network model training method, device and computer-readable storage medium

Info

Publication number
WO2023071743A1
WO2023071743A1 · PCT/CN2022/124171 · CN2022124171W
Authority
WO
WIPO (PCT)
Prior art keywords
model
network
sample
training
data
Prior art date
Application number
PCT/CN2022/124171
Other languages
English (en)
French (fr)
Inventor
栗伟清
韩炳涛
屠要峰
王永成
刘涛
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2023071743A1 publication Critical patent/WO2023071743A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present application relate to but are not limited to the technical field of deep learning, and in particular relate to a network model training method, device, and computer-readable storage medium.
  • At present, artificial intelligence (AI) technology, with machine learning and especially deep learning at its core, has developed rapidly in application fields such as computer vision, speech, and natural language and has begun to empower various industries, although its industrial adoption still lags behind academic progress for many reasons. A very common shortcoming when industrial applications are deployed is that the generality and generalization of the model are relatively poor: for a problem in a specific field, it was generally necessary in the past to collect data, label it manually, design a model, train it, and repeatedly tune parameters before finally producing a usable model. The development cycle is long, and when a problem in a new field is encountered, the model cannot be reused and the whole process has to be repeated, consuming a great deal of manpower and material resources.
  • In a first aspect, an embodiment of the present application provides a network model training method, including: obtaining unlabeled data to train a pre-training model; modifying the output layer of the pre-training model into the output layer corresponding to a target task to generate a fine-tuning model; obtaining labeled data of the target task to train the fine-tuning model and generate a teacher network; constructing a student network according to the target task; performing knowledge distillation on the student network with a plurality of the teacher networks to determine a distillation loss function; and iteratively training the student network based on the distillation loss function to generate a target network model for the target task.
  • In a second aspect, an embodiment of the present application provides a network model training device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the network model training method of the first aspect when executing the computer program.
  • In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the network model training method of the first aspect when executing the computer program.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer-executable program, where the computer-executable program is used to cause a computer to perform the network model training method of the first aspect.
  • Fig. 1 is the main flowchart of a network model training method provided by an embodiment of the present application.
  • Fig. 2 is a schematic diagram of the three main stages provided by an embodiment of the present application.
  • Fig. 3 is a sub-flowchart of a network model training method provided by an embodiment of the present application.
  • Fig. 4 is a schematic diagram of pre-training a large model with self-supervised contrastive learning provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of the internal structure of the projection head provided by an embodiment of the present application.
  • Fig. 6 is a sub-flowchart of a network model training method provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of fine-tuning with domain data provided by an embodiment of the present application.
  • Fig. 8 is a sub-flowchart of a network model training method provided by an embodiment of the present application.
  • Fig. 9 is a schematic diagram of teacher networks distilling knowledge into a student network provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a network model training device provided by an embodiment of the present application.
  • Fig. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • It should be understood that, in the description of the embodiments of the present application, "multiple" means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the stated number, while "above", "below", "within", and the like are understood as including it. Terms such as "first" and "second" are used only to distinguish technical features and should not be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the order of the indicated technical features.
  • To address the high cost of manual labeling and the customized nature of AI application development, the embodiments of the present application provide a network model training method, device, and computer-readable storage medium: a very-large-scale network model is pre-trained on large-scale unlabeled data with a self-supervised contrastive learning method, the pre-trained very-large-scale network model is then fine-tuned and knowledge-distilled, and the result is finally applied to the target task.
  • FIG. 1 is a flowchart of a network model training method provided by an embodiment of the present application.
  • The network model training method includes, but is not limited to, the following steps:
  • Step 101, obtaining unlabeled data to train a pre-training model;
  • Step 102, modifying the output layer of the pre-training model into the output layer corresponding to the target task to generate a fine-tuning model;
  • Step 103, obtaining labeled data of the target task to train the fine-tuning model and generate a teacher network;
  • Step 104, constructing a student network according to the target task;
  • Step 105, performing knowledge distillation on the student network with multiple teacher networks to determine a distillation loss function;
  • Step 106, iteratively training the student network based on the distillation loss function to generate a target network model for the target task.
  • As shown in Fig. 2, this application consists of three stages performed in sequence: self-supervised contrastive learning training, domain data fine-tuning, and knowledge distillation. Unlabeled data is obtained to train the pre-training model; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task.
  • It should be pointed out that the pre-training model in this application is a very-large-scale neural network model. The pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: massive data is used to pre-train the very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation compresses the fine-tuned very large model into a target model that meets the deployment requirements of the target device, such as computing power, memory, prediction accuracy, and inference performance. On this basis, the dependence on labeled data and the cost of manual labeling can be reduced, the problem of the high cost of manually labeled data can be solved, and the generality and generalization of the model can be improved, so that the accuracy of the target model output by this application on the target task exceeds that of the original customized model.
  • In the self-supervised pre-training stage, massive unlabeled data and a self-supervised contrastive learning method are used to train the super-large pre-training model, which implicitly learns general visual representation knowledge. To reduce the dependence on labeled data, weakly supervised or unsupervised (e.g., self-supervised) methods can be used for training, which also overcomes an inherent drawback of supervised training: a model trained with labels often learns only task-specific knowledge rather than general knowledge, so the feature representations learned by supervised learning are difficult to transfer to other tasks. Self-supervised learning, in contrast, is based on the inherent characteristics of the data itself, which provide far richer information than labels, so its generalization performance is better than that of supervised methods; however, compared with supervised training, a model trained with self-supervised learning has somewhat lower accuracy and requires more data and larger model parameters, which means more computing power and longer training time.
  • Self-supervised learning can use unlabeled data to train the network: it treats self-defined pseudo-labels as the training signal and then uses the learned representations for the target task. Contrastive learning is regarded as a very important part of self-supervised learning. Its main principle is to pull the different augmented versions of the same sample as close as possible in the embedding space while pushing different samples as far apart as possible.
  • In the domain data fine-tuning stage, the network is adapted to the specific task. In the pre-training stage, a pre-training model unrelated to any specific task is obtained from large-scale data through self-supervised learning; in this stage, when the model is applied to a specific task, the output layer of the model is first modified into the output layer corresponding to the specific task, and then limited labeled data is used for supervised training, with the aim of using the labeled samples to adjust the parameters of the pre-trained network so that it gradually adapts to the characteristics of the target task.
  • It should be noted that the first stage of this application, i.e., the self-supervised pre-training stage, uses a large-scale dataset and a large-scale network model. Pre-training a large network model on a large-scale dataset requires huge computing resources and consumes a great deal of computing power; the estimated time required may be counted in weeks or even months, so the more hardware resources available (especially dedicated deep-learning accelerators such as GPUs), the stronger the computing power and the more the training time can be reduced. Training at this scale generally needs to be carried out on a dedicated AI platform with strong distributed parallel computing.
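  • As a concrete illustration of the kind of distributed data-parallel training such a platform provides, the following is a minimal PyTorch sketch; it assumes a `torchrun` launcher and the NCCL backend, and the tiny placeholder model and random dataset are illustrative assumptions rather than components defined by this application.

```python
# Minimal distributed data-parallel training sketch (launch with `torchrun`).
# The placeholder model and random dataset are illustrative only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                       # placeholder encoder
        torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 3, 32, 32))
    sampler = DistributedSampler(data)                 # shards data across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for (x,) in loader:
            z = model(x.cuda(local_rank))
            loss = z.pow(2).mean()                     # dummy loss for the sketch
            opt.zero_grad(); loss.backward(); opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```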
  • As shown in Fig. 3, step 101 may include, but is not limited to, the following sub-steps:
  • Step 1011, randomly sampling from the original images;
  • Step 1012, performing two different data augmentations on each sampled image to obtain a first sample and a second sample;
  • Step 1013, performing feature extraction and nonlinear transformation on the first sample and the second sample respectively to obtain a feature representation of the first sample and a feature representation of the second sample;
  • Step 1014, determining a contrastive loss function between the first sample and the second sample according to the feature representation of the first sample and the feature representation of the second sample;
  • Step 1015, training the pre-training model based on the contrastive loss function.
  • As shown in Fig. 4, images are randomly sampled from the original images, and two different data augmentations are applied to each sampled image x to obtain a first sample v1 and a second sample v2. v1 passes through encoder network F1 to output feature y1, and y1 is nonlinearly transformed by projection head g1 to obtain the feature representation z1 of the first sample; at the same time, v2 passes through encoder network F2 to output feature y2, and y2 is nonlinearly transformed by projection head g2 to obtain the feature representation z2 of the second sample. The contrastive loss between the first sample v1 and the second sample v2 is then computed: the contrastive loss function between v1 and v2 is determined from the feature representations z1 and z2, and the pre-training model is trained based on this contrastive loss function.
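  • The following is a minimal PyTorch sketch of this two-view forward pass of Fig. 4; the small convolutional encoder, the layer sizes, and the weight sharing between F1/F2 and g1/g2 are illustrative assumptions, not specifics fixed by this application.

```python
# Sketch of the two-view encoder + projection-head forward pass of Fig. 4.
# Encoder architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):                       # stands in for F1/F2 (e.g. ViT or ResNet)
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

class ProjectionHead(nn.Module):                # Dense + ReLU layers, as in Fig. 5
    def __init__(self, dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, out_dim))
    def forward(self, y):
        return self.net(y)

encoder, head = Encoder(), ProjectionHead()     # here F1 = F2 and g1 = g2 (shared weights)

x = torch.rand(8, 3, 32, 32)                    # a batch of sampled images
v1 = x + 0.1 * torch.randn_like(x)              # stand-ins for two different
v2 = torch.flip(x, dims=[3])                    # data augmentations of x

z1 = head(encoder(v1))                          # z1 = g1(F1(v1))
z2 = head(encoder(v2))                          # z2 = g2(F2(v2))
print(z1.shape, z2.shape)                       # torch.Size([8, 64]) each
```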
  • Self-supervised contrastive learning trains the network model on unlabeled data; the trained network can extract meaningful feature representations from images, which helps improve the performance of other target tasks. Its core idea is to compute the distance between sample representations, pulling positive samples closer and pushing negative samples farther apart; in other words, once the positive and negative examples of a sample can be distinguished, the resulting feature representation is good enough.
  • A batch of data is randomly sampled from the original images, and two different data augmentations, such as Random Crop and Color Distortion, are applied to each image in the batch. If the batch size is N, 2N images are obtained after augmentation. The images enter the encoder network for feature extraction to obtain embeddings; a commonly used encoder network is the ResNet-50 model widened by a factor of two or four, and experiments show that the larger the number of network parameters, the better the performance. In this application, a Transformer-based ViT network can be used as the basic encoder network. After encoding, as shown in Figure 5, the feature representation is projected into a new representation by the nonlinear transformation of a projection head composed of a series of Dense layers and ReLU layers. The contrastive loss is then computed: different data augmentations of the same image are regarded as a positive pair, and the other images in the same batch are negative samples. The contrastive loss function is the noise contrastive estimation (NCE) loss, which in its standard normalized-temperature form can be written as
    l(i, j) = -log[ exp(s_{i,j} / τ) / Σ_{k=1, k≠i}^{2N} exp(s_{i,k} / τ) ]
    L = (1 / (2N)) Σ_{k=1}^{N} [ l(2k-1, 2k) + l(2k, 2k-1) ]
  • Here s_{i,j} denotes the cosine similarity between the two samples of a pair, τ is a temperature parameter, l(i, j) denotes the contrastive loss between any one example and the other examples, and L denotes the average contrastive loss obtained by averaging the contrastive losses across all examples.
  • The currently common network models are generally CNNs, such as models based on ResNet-50 widened by a factor of two or four. In the self-supervised contrastive learning pre-training stage of this application, a Transformer network is introduced as the basic encoder network; compared with a CNN, its expressive power and knowledge-extraction ability are stronger.
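  • A minimal PyTorch implementation of this NCE-style contrastive loss (often called NT-Xent) is sketched below; the temperature value is an illustrative assumption rather than a value fixed by this application.

```python
# Sketch of the NT-Xent / NCE contrastive loss over a batch of N image pairs.
# z1, z2 are the projected representations of the two augmented views.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    n = z1.size(0)                                       # batch size N
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit norm
    sim = z @ z.t() / temperature                        # cosine similarities s_ij / tau
    sim.fill_diagonal_(float("-inf"))                    # a sample is not its own negative
    # for index i, its positive is i + N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)                 # average of l(i, j) over all 2N examples

z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
print(nt_xent_loss(z1, z2).item())
```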
  • As shown in Fig. 6, step 102 may include, but is not limited to, the following sub-steps:
  • Step 1021, copying all parameters of the pre-training model except the output layer to the fine-tuning model;
  • Step 1022, removing the output layer of the pre-training model and adding a new output layer according to the characteristics of the target task;
  • Step 1023, randomly initializing the parameters of the new output layer.
  • As shown in Fig. 7, fine-tuning is a way to transfer a task-independent pre-trained large model to a scenario where the task is known. All parameters of the pre-trained model except the output layer (Output Layer) are copied to the fine-tuning model; the output layer of the pre-trained model is removed, a new output layer is added according to the characteristics of the target task, and its parameters are randomly initialized; the labeled data of the target task is then used for training, usually with a small learning rate, so that the model gradually adapts to the characteristics of the target task.
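  • A minimal PyTorch sketch of this output-layer replacement and fine-tuning setup is given below; the toy pre-trained backbone, the number of target classes, and the learning rate are illustrative assumptions.

```python
# Sketch of steps 1021-1023: copy everything except the output layer, add a new
# randomly initialized head for the target task, and fine-tune with a small LR.
# The toy backbone, class count and learning rate are illustrative assumptions.
import copy
import torch
import torch.nn as nn

pretrained = nn.Sequential(                    # stands in for the pre-trained model
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256), nn.ReLU(),    # "backbone" layers
    nn.Linear(256, 1000),                      # original output layer
)

num_target_classes = 10
finetune_model = copy.deepcopy(pretrained)     # copy all parameters ...
finetune_model[-1] = nn.Linear(256, num_target_classes)  # ... except the output layer,
                                               # which is replaced and randomly initialized

optimizer = torch.optim.SGD(finetune_model.parameters(), lr=1e-3)  # small learning rate
criterion = nn.CrossEntropyLoss()

# one supervised step on (toy) labeled target-task data
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, num_target_classes, (16,))
loss = criterion(finetune_model(x), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```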
  • As shown in Fig. 8, step 105 may include, but is not limited to, the following sub-steps:
  • Step 1051, integrating the multiple teacher networks by weight and fusing their outputs;
  • Step 1052, determining the distillation loss function based on the correspondence between the fused output of the teacher networks and the back-propagation error of the student network.
  • The fine-tuned pre-training model still has a large number of parameters and a large amount of computation. To apply it well to the target task, especially on resource-constrained edge or end-side devices, the model must be compressed. Knowledge distillation embodies the idea of model compression: a larger, already-trained network is used step by step to teach a smaller network exactly what to do, and the small network is trained to learn the exact behavior of the large network by trying to replicate its outputs at each layer (not just the final loss). In practice, multiple teacher networks can be used to distill the small network, i.e., multi-model ensemble distillation.
  • The specific process of knowledge distillation may include: building the student network, i.e., designing the student network model according to the specific task and scenario requirements and establishing its correspondence with the teacher networks; teacher-network forward propagation, in which all teacher networks perform forward propagation to obtain all intermediate outputs and data augmentation is performed; integrating the outputs of all teacher networks by weight and fusing them into one output; and the student network performing back propagation to complete the parameter update. Using the correspondence between the fused output of the teacher networks and the back-propagation error of the student network, the student network learns to replicate the behavior of the teacher networks, finally forming an optimized student network.
  • The distillation loss function combines a distillation term and a supervised term and can be written in the form L = α·L_distill + (1-α)·L_student, where the supervised term (1-α)·L_student is computed only for labeled data and unlabeled data contributes only through L_distill.
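  • The following PyTorch sketch illustrates multi-teacher ensemble distillation with such a combined loss; the fusion weights, the temperature, the value of α, and the toy teacher and student architectures are illustrative assumptions rather than values fixed by this application.

```python
# Sketch of multi-teacher knowledge distillation: teacher outputs are fused by
# weight into one soft target, and the student combines a distillation term with
# a supervised term on labeled data (L = alpha * L_distill + (1 - alpha) * L_student).
# Architectures, weights, temperature and alpha are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teachers = [nn.Linear(32, 10) for _ in range(3)]          # toy fine-tuned teachers
weights = torch.tensor([0.5, 0.3, 0.2])                   # ensemble fusion weights
student = nn.Linear(32, 10)                               # smaller student network
optimizer = torch.optim.SGD(student.parameters(), lr=1e-2)
alpha, T = 0.7, 2.0                                       # loss weight and temperature

def fused_teacher_logits(x):
    with torch.no_grad():                                 # teachers are frozen
        outs = torch.stack([t(x) for t in teachers])      # (3, batch, classes)
    return (weights.view(-1, 1, 1) * outs).sum(dim=0)     # weighted fusion -> one output

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))                           # labels (labeled data case)

student_logits = student(x)
teacher_logits = fused_teacher_logits(x)

# distillation term: match the softened fused teacher distribution
l_distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                     F.softmax(teacher_logits / T, dim=1),
                     reduction="batchmean") * T * T
l_student = F.cross_entropy(student_logits, y)            # supervised term (labeled data only)

loss = alpha * l_distill + (1 - alpha) * l_student        # for unlabeled data: loss = l_distill
optimizer.zero_grad(); loss.backward(); optimizer.step()
```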
  • In addition, the teacher network is generally much larger than the student network, so its forward computation is correspondingly more time-consuming. After the forward computation, the student network needs to perform back propagation to update its model parameters, while the teacher network does not. Therefore, to speed up the knowledge-distillation computation, while the student network is performing back propagation, the teacher network can already perform the forward computation on the next batch of data instead of waiting until the student network's back-propagation computation is completed. In short, multiple teacher networks are used in the knowledge distillation stage; the output of each teacher network is integrated by weight and finally fused into one output to distill the student network, and to improve computational efficiency the teacher networks perform forward propagation during the student network's back-propagation phase.
  • The three-stage training optimization proposed in this application (self-supervised contrastive learning training, domain data fine-tuning, and knowledge distillation) reduces the dependence on labeled data and the cost of manual labeling, since unlabeled data is used to train the pre-trained large model. At the same time, relying on the powerful generalization capability of the pre-trained large model can change the previous development mode of one model per scenario, shorten the development cycle, reduce the development cost, and achieve better accuracy than the originally used model. The pre-trained model moves development from the stage of manual parameter tuning and reliance on experts to a stage of large-scale, reproducible industrial deployment, and ultimately accelerates the adoption of AI.
  • the embodiment of the present application also provides a network model training device.
  • The network model training device includes one or more processors and a memory; one processor and one memory are taken as an example in FIG. 10.
  • the processor and the memory may be connected through a bus or in other ways, and connection through a bus is taken as an example in FIG. 10 .
  • the memory can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the network model training method in the above-mentioned embodiments of the present application.
  • the processor implements the network model training method in the above-mentioned embodiments of the present application by running the non-transitory software program and the program stored in the memory.
  • The memory may include a program storage area and a data storage area, where the program storage area may store the operating system and the application required by at least one function, and the data storage area may store the data required for executing the network model training method in the above-mentioned embodiments of the present application, etc.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory may optionally include a memory that is remotely located relative to the processor, and these remote memories may be connected to the network model training device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and programs required to implement the network model training method in the above-mentioned embodiments of the present application are stored in the memory; when executed by one or more processors, they execute the network model training method in the above-mentioned embodiments, for example, method steps 101 to 106 in Fig. 1, method steps 1011 to 1015 in Fig. 3, method steps 1021 to 1023 in Fig. 6, and method steps 1051 to 1052 in Fig. 8 described above: the pre-training model is trained with unlabeled data; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task.
  • the embodiment of the present application also provides an electronic device.
  • The electronic device includes one or more processors and a memory; one processor and one memory are taken as an example in FIG. 11.
  • the processor and the memory may be connected through a bus or in other ways, and connection through a bus is taken as an example in FIG. 11 .
  • the memory can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the network model training method in the above-mentioned embodiments of the present application.
  • the processor implements the network model training method in the above-mentioned embodiments of the present application by running the non-transitory software program and the program stored in the memory.
  • The memory may include a program storage area and a data storage area, where the program storage area may store the operating system and the application required by at least one function, and the data storage area may store the data required for executing the network model training method in the above-mentioned embodiments of the present application, etc.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory may optionally include a memory that is remotely located relative to the processor, and these remote memories may be connected to the network model training device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and programs required to implement the network model training method in the above-mentioned embodiments of the present application are stored in the memory; when executed by one or more processors, they execute the network model training method in the above-mentioned embodiments, for example, method steps 101 to 106 in Fig. 1, method steps 1011 to 1015 in Fig. 3, method steps 1021 to 1023 in Fig. 6, and method steps 1051 to 1052 in Fig. 8 described above: the pre-training model is trained with unlabeled data; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer-executable program. When the computer-executable program is executed by one or more control processors, for example by one of the processors shown in FIG. 10, it can cause the one or more processors to execute the network model training method in the above-mentioned embodiments of the present application, for example, method steps 101 to 106 in FIG. 1, method steps 1011 to 1015 in FIG. 3, method steps 1021 to 1023 in FIG. 6, and method steps 1051 to 1052 in FIG. 8 described above: the pre-training model is trained with unlabeled data; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • Communication media typically embody computer-readable programs, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a network model training method, device, and computer-readable storage medium. A pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: for example, massive unlabeled data is used to pre-train a very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation is used to compress the fine-tuned very large model into a target model that meets the deployment requirements of the target device.

Description

Network model training method, device and computer-readable storage medium
Cross-Reference to Related Application
This application is filed on the basis of, and claims priority to, Chinese patent application No. 202111239621.1 filed on October 25, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to, but are not limited to, the technical field of deep learning, and in particular to a network model training method, device, and computer-readable storage medium.
Background
At present, artificial intelligence (AI) technology, with machine learning and especially deep learning at its core, has developed rapidly in application fields such as computer vision, speech, and natural language and has begun to empower various industries. However, the pace at which AI is adopted in industry is somewhat slow compared with the research progress achieved in academia, and there are of course many reasons for this. A very common shortcoming in current industrial deployments is that the generality and generalization of the model are relatively poor. For a problem in a specific field, it was generally necessary in the past to go through data collection, manual labeling, model design, model training, and repeated parameter tuning before finally producing a usable model, so the development cycle is long; when a problem in a new field is encountered, the model cannot be reused, the whole process has to be repeated, and a great deal of manpower and material resources are consumed. There are two main causes of this shortcoming: one is that manual labeling is too costly, and some fields, such as the medical industry, require domain experts to do the labeling; the other is this workshop-style mode of developing customized models, which relies heavily on scenario data, produces models whose generality and generalization are too poor to be reused, lacks the ability to rapidly extend to new tasks and scenarios, and has a long development cycle.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
In a first aspect, an embodiment of the present application provides a network model training method, including: obtaining unlabeled data to train a pre-training model; modifying the output layer of the pre-training model into the output layer corresponding to a target task to generate a fine-tuning model; obtaining labeled data of the target task to train the fine-tuning model and generate a teacher network; constructing a student network according to the target task; performing knowledge distillation on the student network with a plurality of the teacher networks to determine a distillation loss function; and iteratively training the student network based on the distillation loss function to generate a target network model for the target task.
In a second aspect, an embodiment of the present application provides a network model training device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the network model training method of the first aspect when executing the computer program.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the network model training method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer-executable program, where the computer-executable program is used to cause a computer to perform the network model training method of the first aspect.
Other features and advantages of the present application will be set forth in the following description and will in part become apparent from the description or be understood by practicing the present application. The objects and other advantages of the present application can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings are used to provide a further understanding of the technical solution of the present application and constitute a part of the description; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not constitute a limitation on it.
Fig. 1 is the main flowchart of a network model training method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the three main stages provided by an embodiment of the present application;
Fig. 3 is a sub-flowchart of a network model training method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of pre-training a large model with self-supervised contrastive learning provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of the internal structure of the projection head provided by an embodiment of the present application;
Fig. 6 is a sub-flowchart of a network model training method provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of fine-tuning with domain data provided by an embodiment of the present application;
Fig. 8 is a sub-flowchart of a network model training method provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of teacher networks distilling knowledge into a student network provided by an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a network model training device provided by an embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be understood that, in the description of the embodiments of the present application, "multiple" (or "a plurality of") means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the stated number, while "above", "below", "within", and the like are understood as including it. Terms such as "first" and "second" are used only to distinguish technical features and should not be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the order of the indicated technical features.
At present, artificial intelligence technology, with machine learning and especially deep learning at its core, has developed rapidly in application fields such as computer vision, speech, and natural language and has begun to empower various industries. However, the pace at which AI is adopted in industry is somewhat slow compared with the research progress achieved in academia, and there are many reasons for this. A very common shortcoming in current industrial deployments is that the generality and generalization of the model are relatively poor: for a problem in a specific field, it was generally necessary to go through data collection, manual labeling, model design, model training, and repeated parameter tuning before finally producing a usable model, so the development cycle is long; when a problem in a new field is encountered, the model cannot be reused and the whole process has to be repeated, consuming a great deal of manpower and material resources. There are two main causes: one is that manual labeling is too costly, and some fields, such as the medical industry, require domain experts for labeling; the other is the workshop-style development of customized models, which relies heavily on scenario data, produces models whose generality and generalization are too poor to be reused, lacks the ability to rapidly extend to new tasks and scenarios, and has a long development cycle.
To address the high cost of manual labeling and the customized nature of AI application development, the embodiments of the present application provide a network model training method, device, and computer-readable storage medium: a very-large-scale network model is pre-trained on large-scale unlabeled data with a self-supervised contrastive learning method, the pre-trained very-large-scale network model is then fine-tuned and knowledge-distilled, and the result is finally applied to the target task. The pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: massive data is used to pre-train the very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation compresses the fine-tuned very large model into a target model that meets the deployment requirements of the target device. On this basis, the dependence on labeled data and the cost of manual labeling can be reduced, the problem of the high cost of manually labeled data can be solved, and the generality and generalization of the model can be improved, so that the accuracy of the target model output by this application on the target task exceeds that of the original customized model.
As shown in Fig. 1, Fig. 1 is a flowchart of a network model training method provided by an embodiment of the present application. The network model training method includes, but is not limited to, the following steps:
Step 101, obtaining unlabeled data to train a pre-training model;
Step 102, modifying the output layer of the pre-training model into the output layer corresponding to the target task to generate a fine-tuning model;
Step 103, obtaining labeled data of the target task to train the fine-tuning model and generate a teacher network;
Step 104, constructing a student network according to the target task;
Step 105, performing knowledge distillation on the student network with multiple teacher networks to determine a distillation loss function;
Step 106, iteratively training the student network based on the distillation loss function to generate a target network model for the target task.
It can be understood that, as shown in Fig. 2, this application consists of three stages performed in sequence: self-supervised contrastive learning training, domain data fine-tuning, and knowledge distillation. Unlabeled data is obtained to train the pre-training model; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task. It should be pointed out that the pre-training model in this application is a very-large-scale neural network model. The pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: massive data is used to pre-train the very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation compresses the fine-tuned very large model into a target model that meets the deployment requirements of the target device, such as computing power, memory, prediction accuracy, and inference performance. On this basis, the dependence on labeled data and the cost of manual labeling can be reduced, the problem of the high cost of manually labeled data can be solved, and the generality and generalization of the model can be improved, so that the accuracy of the target model output by this application on the target task exceeds that of the original customized model. Therefore, using unlabeled data to train the pre-trained large model and relying on its powerful generalization capability can change the previous development mode of one model per scenario, shorten the development cycle, reduce the development cost, and achieve better accuracy than the originally used model; the pre-trained model moves development from the stage of manual parameter tuning and reliance on experts to a stage of large-scale, reproducible industrial deployment, and ultimately accelerates the adoption of AI.
It can be understood that, in the self-supervised pre-training stage, massive unlabeled data and a self-supervised contrastive learning method are used to train the super-large pre-training model, which implicitly learns general visual representation knowledge. To reduce the dependence on labeled data, weakly supervised or unsupervised (e.g., self-supervised) methods can be used for training, which also overcomes an inherent drawback of supervised training: a model trained with labels often learns only task-specific knowledge rather than general knowledge, so the feature representations learned by supervised learning are difficult to transfer to other tasks. Self-supervised learning, in contrast, is based on the inherent characteristics of the data itself, which provide far richer information than labels, so its generalization performance is better than that of supervised methods; however, compared with supervised training, a model trained with self-supervised learning has somewhat lower accuracy and requires more data and larger model parameters, which means more computing power and longer training time.
It can be understood that self-supervised learning can use unlabeled data to train the network: it treats self-defined pseudo-labels as the training signal and then uses the learned representations for the target task. Contrastive learning is regarded as a very important part of self-supervised learning. Its main principle is to pull the different augmented versions of the same sample as close as possible in the embedding space while pushing different samples as far apart as possible.
It can be understood that, in the domain data fine-tuning stage, the network is adapted to the specific task. In the pre-training stage, a pre-training model unrelated to any specific task is obtained from large-scale data through self-supervised learning. In this stage, the pre-training model is applied to a specific task: the output layer of the model is first modified into the output layer corresponding to the specific task, and then limited labeled data is used for supervised training, with the aim of using the labeled samples to adjust the parameters of the pre-trained network so that it gradually adapts to the characteristics of the target task.
It can be understood that the knowledge distillation stage uses the teacher-student framework: the feature-representation "knowledge" learned by a complex network with strong learning ability (the teacher) is distilled and passed to a network with fewer parameters and weaker learning ability (the student). The feature representations learned in the teacher network can serve as supervision information for training the student network to imitate the behavior of the teacher network.
It should be noted that the first stage of this application, i.e., the self-supervised pre-training stage, uses a large-scale dataset and a large-scale network model. Pre-training a large network model on a large-scale dataset requires huge computing resources and consumes a great deal of computing power; the estimated time required may be counted in weeks or even months. Therefore, the more hardware resources available (especially dedicated deep-learning accelerators such as GPUs), the stronger the computing power and the more the training time can be reduced. Training at this scale generally needs to be carried out on a dedicated AI platform with strong distributed parallel computing.
As shown in Fig. 3, step 101 may include, but is not limited to, the following sub-steps:
Step 1011, randomly sampling from the original images;
Step 1012, performing two different data augmentations on each sampled image to obtain a first sample and a second sample;
Step 1013, performing feature extraction and nonlinear transformation on the first sample and the second sample respectively to obtain a feature representation of the first sample and a feature representation of the second sample;
Step 1014, determining a contrastive loss function between the first sample and the second sample according to the feature representation of the first sample and the feature representation of the second sample;
Step 1015, training the pre-training model based on the contrastive loss function.
It can be understood that, as shown in Fig. 4, images are randomly sampled from the original images, and two different data augmentations are applied to each sampled image x to obtain a first sample v1 and a second sample v2. v1 passes through encoder network F1 to output feature y1, and y1 is nonlinearly transformed by projection head g1 to obtain the feature representation z1 of the first sample; at the same time, v2 passes through encoder network F2 to output feature y2, and y2 is nonlinearly transformed by projection head g2 to obtain the feature representation z2 of the second sample. The contrastive loss between the first sample v1 and the second sample v2 is then computed: the contrastive loss function between v1 and v2 can be determined from the feature representation z1 of the first sample and the feature representation z2 of the second sample, and the pre-training model is trained based on this contrastive loss function. Self-supervised contrastive learning trains the network model on unlabeled data; the trained network can extract meaningful feature representations from images, which helps improve the performance of other target tasks. Its core idea is to compute the distance between sample representations, pulling positive samples closer and pushing negative samples farther apart; in other words, once the positive and negative examples of a sample can be distinguished, the resulting feature representation is good enough.
It can be understood that a batch of data is randomly sampled from the original images, and two different data augmentations, such as Random Crop and Color Distortion, are applied to each image in the batch. If the batch size is N, 2N images are obtained after augmentation. The images enter the encoder network for feature extraction to obtain embeddings. A commonly used encoder network is generally the ResNet-50 model widened by a factor of two or four; experiments show that the larger the number of network parameters, the better the performance. In this example, a Transformer-based ViT network can be used as the basic encoder network. After encoding, as shown in Fig. 5, the feature representation is projected into a new representation by the nonlinear transformation of a projection head composed of a series of Dense layers and ReLU layers. The contrastive loss is then computed: different data augmentations of the same image are regarded as a positive pair, and the other images in the same batch are negative samples. The contrastive loss function is the noise contrastive estimation (NCE) loss, which in its standard normalized-temperature form can be written as
l(i, j) = -log[ exp(s_{i,j} / τ) / Σ_{k=1, k≠i}^{2N} exp(s_{i,k} / τ) ]
L = (1 / (2N)) Σ_{k=1}^{N} [ l(2k-1, 2k) + l(2k, 2k-1) ]
where s_{i,j} denotes the cosine similarity between the two samples of a pair, τ is a temperature parameter, l(i, j) denotes the contrastive loss between any one example and the other examples, and L denotes the average contrastive loss obtained by averaging the contrastive losses over all examples.
Experience shows that the combination of multiple data augmentation methods is crucial to producing effective feature representations; compared with supervised learning, unsupervised learning benefits more from data augmentation. The larger the batch size and the longer the training, the better.
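As a concrete illustration of such an augmentation combination, the following sketch builds a two-view augmentation pipeline with torchvision; the specific crop size, jitter strengths, and blur kernel are illustrative assumptions, not values fixed by this application.

```python
# Sketch of a two-view augmentation pipeline (random crop + color distortion etc.).
# Parameter values are illustrative assumptions.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                    # random crop
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),           # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

image = Image.new("RGB", (256, 256), color=(120, 60, 200))  # dummy input image
v1, v2 = augment(image), augment(image)                   # two different augmented views
print(v1.shape, v2.shape)                                 # torch.Size([3, 224, 224]) each
```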
It can be understood that the larger the model, the stronger its expressive power and generalization ability and the more general-purpose it is, so performance is better guaranteed when it is transferred to the target task. The currently common network models are generally CNNs, such as models based on ResNet-50 widened by a factor of two or four. In the self-supervised contrastive learning pre-training stage of this application, a Transformer network is introduced as the basic encoder network; compared with a CNN, its expressive power and knowledge-extraction ability are stronger.
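For example, assuming a recent torchvision release, a ViT backbone can be turned into such an encoder by discarding its classification head, as in the sketch below; the specific ViT variant, feature dimension, and projection sizes are illustrative assumptions.

```python
# Sketch: using a vision Transformer (ViT) as the base encoder, with a
# projection head of Dense + ReLU layers on top (assumes a recent torchvision).
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

encoder = vit_b_16(weights=None)        # randomly initialized ViT-B/16 backbone
encoder.heads = nn.Identity()           # drop the classification head -> 768-d features

projection_head = nn.Sequential(        # Dense + ReLU projection head (Fig. 5)
    nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 128))

x = torch.rand(2, 3, 224, 224)          # ViT-B/16 expects 224x224 inputs
z = projection_head(encoder(x))
print(z.shape)                          # torch.Size([2, 128])
```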
As shown in Fig. 6, step 102 may include, but is not limited to, the following sub-steps:
Step 1021, copying all parameters of the pre-training model except the output layer to the fine-tuning model;
Step 1022, removing the output layer of the pre-training model and adding a new output layer according to the characteristics of the target task;
Step 1023, randomly initializing the parameters of the new output layer.
It can be understood that, as shown in Fig. 7, fine-tuning is a way to transfer a task-independent pre-trained large model to a scenario where the task is known. All parameters of the pre-trained model except the output layer (Output Layer) are copied to the fine-tuning model; the output layer of the pre-trained model is removed, a new output layer is added according to the characteristics of the target task, and the parameters of the output layer are randomly initialized; the labeled data of the target task is then used for training, usually with a small learning rate, so that the model gradually adapts to the characteristics of the target task.
As shown in Fig. 8, step 105 may include, but is not limited to, the following sub-steps:
Step 1051, integrating the multiple teacher networks by weight and fusing their outputs;
Step 1052, determining the distillation loss function based on the correspondence between the fused output of the teacher networks and the back-propagation error of the student network.
It can be understood that the fine-tuned pre-training model still has a large number of parameters and a large amount of computation. To apply it well to the target task, especially on resource-constrained edge or end-side devices, the model must be compressed. Knowledge distillation embodies the idea of model compression: a larger, already-trained network is used step by step to teach a smaller network exactly what to do, and the small network is trained to learn the exact behavior of the large network by trying to replicate its outputs at each layer (not just the final loss). In practice, multiple teacher networks can be used to distill the small network, i.e., multi-model ensemble distillation.
It can be understood that, as shown in Fig. 9, the specific process of knowledge distillation may include: building the student network, i.e., designing the student network model according to the specific task and scenario requirements and establishing its correspondence with the teacher networks; teacher-network forward propagation, in which all teacher networks perform forward propagation to obtain all intermediate outputs and data augmentation is performed; integrating the outputs of all teacher networks by weight and fusing them into one output; and the student network performing the back-propagation process to complete the parameter update. Using the correspondence between the fused output of the teacher networks and the back-propagation error of the student network, the student network learns to replicate the behavior of the teacher networks, finally forming an optimized student network. The distillation loss function combines a distillation term and a supervised term and can be written in the form
L = α·L_distill + (1-α)·L_student
where, for labeled data, the supervised term (1-α)·L_student is used in the computation, and for unlabeled data only L_distill is used.
The large model serves as the teacher network, and a small model learns as the student network, approaching the capability of the large model while having far fewer parameters.
In addition, the teacher network is generally much larger than the student network, so its forward computation is correspondingly more time-consuming. After the forward computation, the student network needs to perform back propagation to update its model parameters, while the teacher network does not need to perform back propagation. Therefore, to speed up the knowledge-distillation computation, while the student network is performing back propagation, the teacher network can already perform the forward computation on the next batch of data instead of waiting until the student network's back-propagation computation is completed.
It can be understood that multiple teacher networks are used in the knowledge distillation stage; the output of each teacher network is integrated by weight and finally fused into one output to distill the student network. To improve computational efficiency, the teacher networks perform forward propagation during the student network's back-propagation phase.
Using the three-stage training optimization proposed in this application (self-supervised contrastive learning training, domain data fine-tuning, and knowledge distillation) can reduce the dependence on labeled data and the cost of manual labeling, since unlabeled data is used to train the pre-trained large model. At the same time, relying on the powerful generalization capability of the pre-trained large model can change the previous development mode of one model per scenario, shorten the development cycle, reduce the development cost, and achieve better accuracy than the originally used model; the pre-trained model moves development from the stage of manual parameter tuning and reliance on experts to a stage of large-scale, reproducible industrial deployment, and ultimately accelerates the adoption of AI.
As shown in Fig. 10, an embodiment of the present application further provides a network model training device.
In an embodiment, the network model training device includes one or more processors and a memory; one processor and one memory are taken as an example in Fig. 10. The processor and the memory may be connected through a bus or in other ways; connection through a bus is taken as an example in Fig. 10.
The memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the network model training method in the above embodiments of the present application. The processor implements the network model training method in the above embodiments of the present application by running the non-transitory software programs and programs stored in the memory.
The memory may include a program storage area and a data storage area, where the program storage area may store the operating system and the application required by at least one function, and the data storage area may store the data required for executing the network model training method in the above embodiments of the present application, etc. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the network model training device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and programs required to implement the network model training method in the above embodiments of the present application are stored in the memory; when executed by one or more processors, they execute the network model training method in the above embodiments, for example, method steps 101 to 106 in Fig. 1, method steps 1011 to 1015 in Fig. 3, method steps 1021 to 1023 in Fig. 6, and method steps 1051 to 1052 in Fig. 8 described above: the pre-training model is trained with unlabeled data; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task. The pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: massive data is used to pre-train the very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation compresses the fine-tuned very large model into a target model that meets the deployment requirements of the target device. On this basis, the dependence on labeled data and the cost of manual labeling can be reduced, the problem of the high cost of manually labeled data can be solved, and the generality and generalization of the model can be improved, so that the accuracy of the target model output by this application on the target task exceeds that of the original customized model.
As shown in Fig. 11, an embodiment of the present application further provides an electronic device.
In an embodiment, the electronic device includes one or more processors and a memory; one processor and one memory are taken as an example in Fig. 11. The processor and the memory may be connected through a bus or in other ways; connection through a bus is taken as an example in Fig. 11.
The memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the network model training method in the above embodiments of the present application. The processor implements the network model training method in the above embodiments of the present application by running the non-transitory software programs and programs stored in the memory.
The memory may include a program storage area and a data storage area, where the program storage area may store the operating system and the application required by at least one function, and the data storage area may store the data required for executing the network model training method in the above embodiments of the present application, etc. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the network model training device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and programs required to implement the network model training method in the above embodiments of the present application are stored in the memory; when executed by one or more processors, they execute the network model training method in the above embodiments, for example, method steps 101 to 106 in Fig. 1, method steps 1011 to 1015 in Fig. 3, method steps 1021 to 1023 in Fig. 6, and method steps 1051 to 1052 in Fig. 8 described above: the pre-training model is trained with unlabeled data; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task. The pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: massive data is used to pre-train the very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation compresses the fine-tuned very large model into a target model that meets the deployment requirements of the target device. On this basis, the dependence on labeled data and the cost of manual labeling can be reduced, the problem of the high cost of manually labeled data can be solved, and the generality and generalization of the model can be improved, so that the accuracy of the target model output by this application on the target task exceeds that of the original customized model.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing a computer-executable program. When the computer-executable program is executed by one or more control processors, for example by one of the processors in Fig. 10, it can cause the one or more processors to execute the network model training method in the above embodiments of the present application, for example, method steps 101 to 106 in Fig. 1, method steps 1011 to 1015 in Fig. 3, method steps 1021 to 1023 in Fig. 6, and method steps 1051 to 1052 in Fig. 8 described above: the pre-training model is trained with unlabeled data; the output layer of the pre-training model is modified into the output layer corresponding to the target task to generate a fine-tuning model; labeled data of the target task is obtained to train the fine-tuning model and generate a teacher network; a student network is constructed according to the target task; multiple teacher networks perform knowledge distillation on the student network to determine a distillation loss function; and the student network is iteratively trained based on the distillation loss function to generate the target network model for the target task. The pre-training model is subjected in turn to self-supervised pre-training, domain data fine-tuning, and knowledge distillation: massive data is used to pre-train the very-large-scale neural network model without supervision, limited labeled samples are used to fine-tune the pre-training model, and knowledge distillation compresses the fine-tuned very large model into a target model that meets the deployment requirements of the target device. On this basis, the dependence on labeled data and the cost of manual labeling can be reduced, the problem of the high cost of manually labeled data can be solved, and the generality and generalization of the model can be improved, so that the accuracy of the target model output by this application on the target task exceeds that of the original customized model.
Therefore, using unlabeled data to train the pre-trained large model and relying on its powerful generalization capability can change the previous development mode of one model per scenario, shorten the development cycle, reduce the development cost, and achieve better accuracy than the originally used model; the pre-trained model moves development from the stage of manual parameter tuning and reliance on experts to a stage of large-scale, reproducible industrial deployment, and ultimately accelerates the adoption of AI.
Those of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable programs, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable programs, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present application have been described in detail above, but the present application is not limited to the above embodiments. Those skilled in the art can also make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (10)

  1. A network model training method, comprising:
    obtaining unlabeled data to train a pre-training model;
    modifying the output layer of the pre-training model into the output layer corresponding to a target task to generate a fine-tuning model;
    obtaining labeled data of the target task to train the fine-tuning model and generate a teacher network;
    constructing a student network according to the target task;
    performing knowledge distillation on the student network with a plurality of the teacher networks to determine a distillation loss function; and
    iteratively training the student network based on the distillation loss function to generate a target network model for the target task.
  2. The method according to claim 1, wherein the obtaining unlabeled data to train a pre-training model comprises:
    randomly sampling from original images;
    performing two different data augmentations on each sampled image to obtain a first sample and a second sample;
    performing feature extraction and nonlinear transformation on the first sample and the second sample respectively to obtain a feature representation of the first sample and a feature representation of the second sample;
    determining a contrastive loss function between the first sample and the second sample according to the feature representation of the first sample and the feature representation of the second sample; and
    training the pre-training model based on the contrastive loss function.
  3. The method according to claim 2, wherein the performing feature extraction and nonlinear transformation on the first sample and the second sample respectively to obtain a feature representation of the first sample and a feature representation of the second sample comprises:
    inputting the first sample and the second sample respectively into an encoder network for feature extraction to obtain a first feature representation and a second feature representation; and
    inputting the first feature representation and the second feature representation into a projection head for nonlinear transformation and projection into the feature representation of the first sample and the feature representation of the second sample, wherein the projection head is composed of Dense layers and ReLU layers.
  4. The method according to claim 3, wherein the encoder network is a vision Transformer encoder network.
  5. The method according to claim 1, wherein the modifying the output layer of the pre-training model into the output layer corresponding to a target task to generate a fine-tuning model comprises:
    copying all parameters of the pre-training model except the output layer to the fine-tuning model;
    removing the output layer of the pre-training model and adding a new output layer according to the characteristics of the target task; and
    randomly initializing the parameters of the new output layer.
  6. The method according to claim 1, wherein the performing knowledge distillation on the student network with a plurality of the teacher networks to determine a distillation loss function comprises:
    integrating the plurality of teacher networks by weight and fusing their outputs; and
    determining the distillation loss function based on the correspondence between the fused output of the teacher networks and the back-propagation error of the student network.
  7. The method according to claim 6, wherein the teacher networks perform forward propagation and perform data augmentation.
  8. A network model training device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the network model training method according to any one of claims 1 to 7 when executing the computer program.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the network model training method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer-executable program, wherein, when the computer-executable program is executed by a processor, the processor is caused to perform the network model training method according to any one of claims 1 to 7.
PCT/CN2022/124171 2021-10-25 2022-10-09 网络模型训练方法、装置和计算机可读存储介质 WO2023071743A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111239621.1A CN113947196A (zh) 2021-10-25 2021-10-25 网络模型训练方法、装置和计算机可读存储介质
CN202111239621.1 2021-10-25

Publications (1)

Publication Number Publication Date
WO2023071743A1 true WO2023071743A1 (zh) 2023-05-04

Family

ID=79332117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124171 WO2023071743A1 (zh) 2021-10-25 2022-10-09 网络模型训练方法、装置和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN113947196A (zh)
WO (1) WO2023071743A1 (zh)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310667A (zh) * 2023-05-15 2023-06-23 鹏城实验室 联合对比损失和重建损失的自监督视觉表征学习方法
CN116665025A (zh) * 2023-07-31 2023-08-29 福思(杭州)智能科技有限公司 数据闭环方法和系统
CN116663678A (zh) * 2023-06-20 2023-08-29 北京智谱华章科技有限公司 面向超大规模模型的蒸馏优化方法、装置、介质及设备
CN117273821A (zh) * 2023-11-20 2023-12-22 阿里健康科技(杭州)有限公司 电子权益凭证的发放方法、训练方法以及相关装置
CN117373016A (zh) * 2023-10-20 2024-01-09 农芯(南京)智慧农业研究院有限公司 烟叶烘烤状态判别方法、装置、设备及存储介质
CN117436500A (zh) * 2023-12-19 2024-01-23 杭州宇谷科技股份有限公司 一种基于对比学习的电池数据处理模型的无监督训练方法
CN117493486A (zh) * 2023-11-10 2024-02-02 华泰证券股份有限公司 基于数据重放的可持续金融事件抽取系统及方法
CN117726884A (zh) * 2024-02-09 2024-03-19 腾讯科技(深圳)有限公司 对象类别识别模型的训练方法、对象类别识别方法及装置
CN117788836A (zh) * 2024-02-23 2024-03-29 中国第一汽车股份有限公司 图像处理方法、装置、计算机设备和存储介质
CN117892139A (zh) * 2024-03-14 2024-04-16 中国医学科学院医学信息研究所 基于层间比对的大语言模型训练和使用方法及相关装置
CN117992220A (zh) * 2024-01-29 2024-05-07 厦门渊亭信息科技有限公司 一种基于改进ZeRO-Offload技术的大模型训练方法
CN118093210A (zh) * 2024-04-29 2024-05-28 浙江鹏信信息科技股份有限公司 基于模型蒸馏的异构算力调度方法、系统及可读存储介质

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947196A (zh) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 网络模型训练方法、装置和计算机可读存储介质
CN116563660A (zh) * 2022-01-28 2023-08-08 华为云计算技术有限公司 一种基于预训练大模型的图像处理方法及相关装置
CN115147680B (zh) * 2022-06-30 2023-08-25 北京百度网讯科技有限公司 目标检测模型的预训练方法、装置以及设备
CN115511059B (zh) * 2022-10-12 2024-02-09 北华航天工业学院 一种基于卷积神经网络通道解耦的网络轻量化方法
CN115879535B (zh) * 2023-02-10 2023-05-23 北京百度网讯科技有限公司 一种自动驾驶感知模型的训练方法、装置、设备和介质
CN116091895B (zh) * 2023-04-04 2023-07-11 之江实验室 一种面向多任务知识融合的模型训练方法及装置
CN116681123B (zh) * 2023-07-31 2023-11-14 福思(杭州)智能科技有限公司 感知模型训练方法、装置、计算机设备和存储介质
CN117057413B (zh) * 2023-09-27 2024-03-15 传申弘安智能(深圳)有限公司 强化学习模型微调方法、装置、计算机设备及存储介质
CN117993468B (zh) * 2024-04-03 2024-06-28 杭州海康威视数字技术股份有限公司 一种模型训练方法、装置、存储介质和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598216A (zh) * 2020-04-16 2020-08-28 北京百度网讯科技有限公司 学生网络模型的生成方法、装置、设备及存储介质
CN112508169A (zh) * 2020-11-13 2021-03-16 华为技术有限公司 知识蒸馏方法和系统
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
CN113947196A (zh) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 网络模型训练方法、装置和计算机可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800053B (zh) * 2021-01-05 2021-12-24 深圳索信达数据技术有限公司 数据模型的生成方法、调用方法、装置、设备及存储介质
CN112507990A (zh) * 2021-02-04 2021-03-16 北京明略软件系统有限公司 视频时空特征学习、抽取方法、装置、设备及存储介质
CN113011427B (zh) * 2021-03-17 2022-06-21 中南大学 基于自监督对比学习的遥感图像语义分割方法
CN113470695B (zh) * 2021-06-30 2024-02-09 平安科技(深圳)有限公司 声音异常检测方法、装置、计算机设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210142164A1 (en) * 2019-11-07 2021-05-13 Salesforce.Com, Inc. Multi-Task Knowledge Distillation for Language Model
CN111598216A (zh) * 2020-04-16 2020-08-28 北京百度网讯科技有限公司 学生网络模型的生成方法、装置、设备及存储介质
CN112508169A (zh) * 2020-11-13 2021-03-16 华为技术有限公司 知识蒸馏方法和系统
CN113947196A (zh) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 网络模型训练方法、装置和计算机可读存储介质

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310667A (zh) * 2023-05-15 2023-06-23 鹏城实验室 联合对比损失和重建损失的自监督视觉表征学习方法
CN116310667B (zh) * 2023-05-15 2023-08-22 鹏城实验室 联合对比损失和重建损失的自监督视觉表征学习方法
CN116663678A (zh) * 2023-06-20 2023-08-29 北京智谱华章科技有限公司 面向超大规模模型的蒸馏优化方法、装置、介质及设备
CN116665025A (zh) * 2023-07-31 2023-08-29 福思(杭州)智能科技有限公司 数据闭环方法和系统
CN116665025B (zh) * 2023-07-31 2023-11-14 福思(杭州)智能科技有限公司 数据闭环方法和系统
CN117373016B (zh) * 2023-10-20 2024-04-30 农芯(南京)智慧农业研究院有限公司 烟叶烘烤状态判别方法、装置、设备及存储介质
CN117373016A (zh) * 2023-10-20 2024-01-09 农芯(南京)智慧农业研究院有限公司 烟叶烘烤状态判别方法、装置、设备及存储介质
CN117493486A (zh) * 2023-11-10 2024-02-02 华泰证券股份有限公司 基于数据重放的可持续金融事件抽取系统及方法
CN117273821B (zh) * 2023-11-20 2024-03-01 阿里健康科技(杭州)有限公司 电子权益凭证的发放方法、训练方法以及相关装置
CN117273821A (zh) * 2023-11-20 2023-12-22 阿里健康科技(杭州)有限公司 电子权益凭证的发放方法、训练方法以及相关装置
CN117436500A (zh) * 2023-12-19 2024-01-23 杭州宇谷科技股份有限公司 一种基于对比学习的电池数据处理模型的无监督训练方法
CN117436500B (zh) * 2023-12-19 2024-03-26 杭州宇谷科技股份有限公司 一种基于对比学习的电池数据处理模型的无监督训练方法
CN117992220A (zh) * 2024-01-29 2024-05-07 厦门渊亭信息科技有限公司 一种基于改进ZeRO-Offload技术的大模型训练方法
CN117726884A (zh) * 2024-02-09 2024-03-19 腾讯科技(深圳)有限公司 对象类别识别模型的训练方法、对象类别识别方法及装置
CN117726884B (zh) * 2024-02-09 2024-05-03 腾讯科技(深圳)有限公司 对象类别识别模型的训练方法、对象类别识别方法及装置
CN117788836A (zh) * 2024-02-23 2024-03-29 中国第一汽车股份有限公司 图像处理方法、装置、计算机设备和存储介质
CN117892139A (zh) * 2024-03-14 2024-04-16 中国医学科学院医学信息研究所 基于层间比对的大语言模型训练和使用方法及相关装置
CN117892139B (zh) * 2024-03-14 2024-05-14 中国医学科学院医学信息研究所 基于层间比对的大语言模型训练和使用方法及相关装置
CN118093210A (zh) * 2024-04-29 2024-05-28 浙江鹏信信息科技股份有限公司 基于模型蒸馏的异构算力调度方法、系统及可读存储介质

Also Published As

Publication number Publication date
CN113947196A (zh) 2022-01-18

Similar Documents

Publication Publication Date Title
WO2023071743A1 (zh) 网络模型训练方法、装置和计算机可读存储介质
CN110799992B (zh) 使用模拟和域适配以用于机器人控制
EP4018390A1 (en) Resource constrained neural network architecture search
JP2021518939A (ja) データ拡張方策の学習
CN111079532A (zh) 一种基于文本自编码器的视频内容描述方法
CN111386536A (zh) 语义一致的图像样式转换
CN112699247A (zh) 一种基于多类交叉熵对比补全编码的知识表示学习框架
CN113344206A (zh) 融合通道与关系特征学习的知识蒸馏方法、装置及设备
WO2023137911A1 (zh) 基于小样本语料的意图分类方法、装置及计算机设备
CN113010683A (zh) 基于改进图注意力网络的实体关系识别方法及系统
CN113987196A (zh) 一种基于知识图谱蒸馏的知识图谱嵌入压缩方法
Slijepcevic et al. Learning useful representations for radio astronomy" in the wild" with contrastive learning
Zhou et al. Advancing Incremental Few-Shot Semantic Segmentation via Semantic-Guided Relation Alignment and Adaptation
CN116188785A (zh) 运用弱标签的PolarMask老人轮廓分割方法
US20220172048A1 (en) Method and system for learning representations less prone to catastrophic forgetting
Joshi et al. Video object segmentation with self-supervised framework for an autonomous vehicle
CN114612961A (zh) 一种多源跨域表情识别方法、装置及存储介质
CN115482426A (zh) 视频标注方法、装置、计算设备和计算机可读存储介质
Rebanowako et al. Age-Invariant Facial Expression Classification Method Using Deep Learning
He et al. Action Recognition Method Based on Graph Neural Network
Wei et al. Entropy-minimization Mean Teacher for Source-Free Domain Adaptive Object Detection
JP7485028B2 (ja) 学習装置、方法及びプログラム
CN116665064B (zh) 基于生成蒸馏与特征扰动的城市变化图生成方法及其应用
Zhou et al. UWYOLOX: An Underwater Object Detection Framework Based on Image Enhancement and Semi-supervised Learning
CN116935102A (zh) 一种轻量化模型训练方法、装置、设备和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885624

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE