CN113139463B - Method, apparatus, device, medium and program product for training a model - Google Patents

Method, apparatus, device, medium and program product for training a model

Info

Publication number
CN113139463B
CN113139463B (application CN202110442612.6A)
Authority
CN
China
Prior art keywords
constraint
model
feature
training
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110442612.6A
Other languages
Chinese (zh)
Other versions
CN113139463A (en
Inventor
郭若愚
杜宇宁
李晨霞
杨烨华
赵乔
刘其文
毕然
胡晓光
于佃海
马艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110442612.6A priority Critical patent/CN113139463B/en
Publication of CN113139463A publication Critical patent/CN113139463A/en
Application granted granted Critical
Publication of CN113139463B publication Critical patent/CN113139463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

According to example embodiments of the present disclosure, a method, apparatus, device, medium, and program product for training a model are provided. The disclosure relates to the field of artificial intelligence, and in particular to the technical fields of deep learning and image processing. The specific implementation scheme is as follows: combining a first feature output by a first model for a training sample and a second feature output by a second model for the training sample to obtain a combined feature, the first model and the second model being initialized to have different model parameters; determining a first constraint, a second constraint, and a third constraint, respectively, based on differences between the first feature, the second feature, and the combined feature; and training the first model and the second model based on at least the first constraint, the second constraint, and the third constraint. According to the embodiments of the present disclosure, the performance of the trained model can be optimized.

Description

Method, apparatus, device, medium and program product for training a model
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to methods, apparatuses, devices, computer-readable storage media and computer program products for training models.
Background
With the development of information technology, neural networks are widely used for various machine learning tasks such as computer vision, speech recognition, and information retrieval. Optical Character Recognition (OCR) is a technology that converts picture information into text information that is easier to edit and store. OCR based on neural networks has proven to be an effective recognition method. However, the accuracy of the trained models still needs to be improved.
Disclosure of Invention
In accordance with example embodiments of the present disclosure, a method, apparatus, device, computer-readable storage medium, and computer program product for training a model are provided.
In a first aspect of the disclosure, a method for training a model is provided. The method comprises the following steps: combining a first feature output by the first model for the training sample and a second feature output by the second model for the training sample to obtain a combined feature, the first model and the second model being initialized to have different model parameters; determining a first constraint, a second constraint, and a third constraint, respectively, based on differences between the first feature, the second feature, and the combined feature; and training the first model and the second model based on at least the first constraint, the second constraint, and the third constraint.
In a second aspect of the present disclosure, an apparatus for training a model is provided. The device includes: a feature fusion module configured to combine a first feature output by the first model for the training sample and a second feature output by the second model for the training sample to obtain a combined feature, the first model and the second model being initialized to have different model parameters; a first constraint determination module configured to determine a first constraint, a second constraint, and a third constraint, respectively, based on differences between the first feature, the second feature, and the combined feature; and a first model training module configured to train the first model and the second model based on at least the first constraint, the second constraint, and the third constraint.
In a third aspect of the disclosure, an electronic device is provided that includes one or more processors; and storage means for storing the one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first aspect of the disclosure.
In a fourth aspect of the disclosure, an electronic device is provided that includes one or more processors; and storage means for storing the one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the second aspect of the disclosure.
In a fifth aspect of the present disclosure, a computer readable medium is provided, on which a computer program is stored, which program, when executed by a processor, performs the method according to the first aspect of the present disclosure.
In a sixth aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored which, when executed by a processor, implements a method according to the second aspect of the present disclosure.
In a seventh aspect of the present disclosure, a computer program product is provided, comprising computer program instructions to implement a method according to the first aspect of the present disclosure by a processor.
In an eighth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions to implement a method according to the second aspect of the present disclosure by a processor.
It should be understood that what is described in this summary section is not intended to define key or essential features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements. The accompanying drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure, in which:
FIG. 1A illustrates a schematic diagram of an example of an environment for data processing in which some embodiments of the present disclosure can be implemented;
FIG. 1B illustrates a schematic diagram of an example of an environment for training a model in which some embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a flow diagram of an example method for training a model, in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a flow diagram of an example method for processing data, in accordance with some embodiments of the present disclosure;
FIG. 4 shows a schematic block diagram of an apparatus for training a model according to an embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of an apparatus for processing data according to an embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the term "include" and its derivatives should be interpreted as being inclusive, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
In embodiments of the present disclosure, the term "model" refers to an entity that can process inputs and provide corresponding outputs. Taking a neural network model as an example, it typically includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. Models used in deep learning applications (also referred to as "deep learning models") typically include many hidden layers, extending the depth of the network. The layers of a neural network model are connected in sequence such that the output of a previous layer is used as the input of the next layer, where the input layer receives the input of the neural network model and the output of the output layer is the final output of the neural network model. Each layer of the neural network model includes one or more nodes (also referred to as processing nodes or neurons), each node processing input from the previous layer. The terms "neural network", "model", "network", and "neural network model" are used interchangeably herein.
As mentioned above, there is a need to improve the accuracy of trained models. In conventional schemes, typically only the outputs of a teacher model and a student model are determined, and training is completed by having the teacher model supervise the student model. The drawback of such schemes is that model training is supervised only by the difference between the output results of the student model and the teacher model, so the resulting accuracy is insufficient.
An example embodiment of the present disclosure proposes a scheme for training a model. In this scheme, training samples are first input into a first model and a second model, and the output first features and second features are combined to obtain combined features, the first model and the second model being initialized to have different model parameters. Then, based on the differences between the first feature, the second feature, and the combined feature, a first constraint, a second constraint, and a third constraint are respectively determined. Finally, the first model and the second model are trained based on at least the first constraint, the second constraint, and the third constraint. In this way, by considering the fusion result of the two models' outputs and using that fusion result to supervise model training, the trained models are made more accurate.
FIG. 1A illustrates a schematic diagram of an example of a data processing environment 100 in which some embodiments of the present disclosure can be implemented. As shown in fig. 1A, environment 100 includes a computing device 110. The computing device 110 may be any device with computing capabilities, such as a personal computer, tablet computer, wearable device, cloud server, mainframe, distributed computing system, and the like.
The computing device 110 obtains the input 120. For example, the input 120 may be an image, video, audio, text, and/or multimedia file, and the like. Computing device 110 may apply input 120 to network model 130 to generate, using network model 130, a processing result 140 corresponding to input 120. In some embodiments, the network model 130 may be, but is not limited to, an OCR recognition model, an image classification model, a semantic segmentation model, an object detection model, or another image-processing-related neural network model. The network model 130 may be implemented using any suitable network architecture, including but not limited to Support Vector Machine (SVM) models, Bayesian models, random forest models, and various deep learning/neural network models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), Deep Q-Networks (DQNs), and so forth. The scope of the present disclosure is not limited in this respect.
The environment 100 may also include a training data acquisition apparatus, a model training apparatus, and a model application apparatus (not shown). In some embodiments, the above-mentioned apparatuses may be implemented in different physical computing devices. Alternatively, at least some of the above-described apparatuses may be implemented in the same computing device. For example, the training data acquisition apparatus and the model training apparatus may be implemented in the same computing device, while the model application apparatus may be implemented in another computing device.
In some embodiments, during the model training phase, the training data acquisition apparatus may acquire input 120 and provide it to the model. The input 120 may be a raw sample and different augmented samples corresponding to the raw sample, and the network model 130 is the model to be trained. The model training apparatus may train the network model 130 based on the input. The processing results 140 may correspond to different constraints for the model, and the computing device 110 may adjust training parameters (e.g., weights and biases, etc.) of the network model 130 using the different constraints, such that the error of the model on the training samples is reduced.
Alternatively, in some embodiments, at the final stage of model training, the input may be a test sample, and the processing result 140 may be a characterization of a performance metric (e.g., accuracy) of the trained network model 130, which may be represented, for example, by a test loss.
The environment 150 for training the model is described in detail below with reference to FIG. 1B. The environment 150 may include a training sample 122 as an input 120; although illustrated as one training sample, there may also be multiple training samples, and the disclosure is not limited thereto. In some embodiments, the sample may be image data. The training samples 122 may be composed of original samples 124 and augmented samples 126, and the computing device 110 (e.g., the training data acquisition apparatus of the computing device) may be configured to perform data augmentation processing on the original samples 124 to obtain the augmented samples 126. In some embodiments, for an image sample, an augmented sample of the image may be obtained by cropping, rotating, and flipping the image. In other examples, for image samples, an automated sample augmentation strategy, such as automatic data augmentation (AutoAugment), may be applied to obtain augmented training samples for the images.
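The following is a minimal sketch of the data augmentation step described above, assuming PyTorch/torchvision as the framework (the patent does not name a specific library); the image path, crop size, and transform parameters are hypothetical.

```python
from PIL import Image
import torchvision.transforms as T

# Sketch of deriving an augmented sample 126 from an original sample 124 by
# cropping, rotating, and flipping. The transform parameters are illustrative only.
augment = T.Compose([
    T.RandomResizedCrop(size=(32, 320)),   # random crop, then resize
    T.RandomRotation(degrees=5),           # small random rotation
    T.RandomHorizontalFlip(p=0.5),         # random horizontal flip
])

original_sample = Image.open("sample.png").convert("RGB")  # hypothetical path
augmented_sample = augment(original_sample)
```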
Computing device 110 may use training samples 122 as input to first model 132 and second model 134 to determine first features 152 and second features 154, respectively. Computing device 110 may then determine first constraint 141, second constraint 143, third constraint 145, fourth constraint 147, and fifth constraint 149, respectively, from these outputs and the label of the training sample. Computing device 110 may then train first model 132 and second model 134 according to the constraints described above.
The first model 132 and the second model 134 are the models to be trained. The first model 132 and the second model 134 may have the same structure, i.e., they contain the same number of parameters, although the parameter values differ because of the different initializations.
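The following minimal sketch illustrates this setup, assuming PyTorch and a ResNet-18 backbone as stand-ins; the patent does not specify a framework, a network architecture, or the number of classes.

```python
import torch
import torchvision.models as models

# Two structurally identical networks initialized with different random seeds,
# so they start from different model parameters. The backbone and the number of
# classes (6000, from the OCR example later in the text) are assumptions.
torch.manual_seed(0)
first_model = models.resnet18(num_classes=6000)

torch.manual_seed(1)
second_model = models.resnet18(num_classes=6000)
```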
Referring back to fig. 1A, the trained network model may be provided to a model application device. The model application device may take the trained model along with the input 120 and determine a processing result 140 for the input 120. In the model application stage, the input 120 may be input data to be processed (e.g., image data), the network model 130 may be a trained model (e.g., a trained image classification model), and the processing result 140 may be a prediction result (e.g., a classification result of an image, a semantic segmentation result, or an object recognition result) corresponding to the input 120 (e.g., image data).
It should be understood that the environment 100 shown in FIG. 1A and the environment 150 shown in FIG. 1B are merely one example in which embodiments of the present disclosure may be implemented and are not intended to limit the scope of the present disclosure. Embodiments of the present disclosure are equally applicable to other systems or architectures.
The process of training the model in detail is further described below in conjunction with fig. 2-3. FIG. 2 illustrates a flow diagram of a process 200 for training a model according to an embodiment of the present disclosure. Process 200 may be implemented by computing device 110 in fig. 1. For ease of description, the process 200 will be described with reference to fig. 1A and 1B.
At block 210 of fig. 2, the computing device 110 combines the first features 152 output by the first model 132 for the training samples 122 and the second features 154 output by the second model 134 for the training samples 122 to obtain combined features 156, the first model 132 and the second model 134 being initialized to have different model parameters. For example, the computing device 110 may take the training samples 122 as input to the two models to obtain their output feature maps, and then fuse the feature maps.
In some embodiments, the training samples 122 may include at least one of original samples 124 and augmented samples 126 that are augmented based on the original samples. For example, the computing device 110 may randomly select an image from an image set as the original sample 124, and then perform data augmentation operations on the image, such as luminance transformation, random cropping, and random rotation, to form the augmented samples. The above examples of data augmentation are merely exemplary; the computing device 110 may also process video data, for example by combining different image frames of the video data in various ways, or may process text and voice data in a suitable manner, and the disclosure is not limited thereto.
In some embodiments, to reduce the computational load of the model, the computing device 110 may further process the training samples 122 composed of the original samples 124 and the augmented samples 126. For example, the computing device 110 may perform resizing and normalization operations on the pictures described above to form a pre-processed image.
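A minimal sketch of this preprocessing, again assuming torchvision; the target size and normalization statistics are assumed values, not ones given by the disclosure.

```python
from PIL import Image
import torchvision.transforms as T

# Resize and normalize a training image before feeding it to the models.
preprocess = T.Compose([
    T.Resize((32, 320)),                                      # assumed input size
    T.ToTensor(),                                             # HWC image -> CHW float tensor
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),   # assumed statistics
])

image_tensor = preprocess(Image.open("sample.png").convert("RGB"))
```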
In some embodiments, the training samples include at least one of: images, video, audio, and text.
After determining the training samples 122, the computing device 110 may input the training samples 122 into the first model 132 and the second model 134, respectively, to obtain first features 152 and second features 154, and then combine the first features 152 and the second features 154 to obtain combined features 156. For example, the output of a model may be a feature map representing the training samples 122, such as an 80 × 6000 vector matrix, where 80 indicates that the vectors output by the model have a fixed length of 80 (which may also be referred to as time steps), and 6000 may indicate 6000 classification results. For an OCR model, this may indicate that for any image, 80 characters are output, with 6000 classification results for each character. It will be appreciated that, since the first and second models 132, 134 are identical in structure and differ in model parameters, the first and second features 152, 154 may be vector matrices of the same dimensions with differing values. The combined features 156 may be derived from equation (1) below:
combined feature = α · first feature + (1 - α) · second feature    (1)
where 0 < α < 1.
Note that the above values and the feature combination are merely exemplary; other values and suitable features and combinations may be used depending on the scenario, and the present disclosure is not limited thereto.
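As a minimal sketch of equation (1), assuming PyTorch tensors and the 80 × 6000 shape from the example above; α = 0.5 is an assumed value.

```python
import torch

# Weighted fusion of the two models' feature maps, as in equation (1).
alpha = 0.5  # assumed value, 0 < alpha < 1
first_feature = torch.randn(4, 80, 6000)    # (batch, time steps, classes) stand-in
second_feature = torch.randn(4, 80, 6000)

combined_feature = alpha * first_feature + (1 - alpha) * second_feature
```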
At block 220 of fig. 2, computing device 110 determines first constraint 141, second constraint 143, and third constraint 145 based on differences between first feature 152, second feature 154, and combined feature 156, respectively. After determining the above features, computing device 110 processes them to determine constraints for supervised model training.
In some embodiments, computing device 110 may determine first constraint 141 based on a difference between first feature 152 and second feature 154; determine second constraint 143 based on a difference between the first feature 152 and the combined feature 156; and determine third constraint 145 based on a difference between second feature 154 and combined feature 156. For example, the features may indicate probability distributions, and the difference may indicate a divergence between the probability distributions. In this case, the computing device 110 may calculate the KL divergence, JS divergence, L1 distance, L2 distance, or the like between the first feature 152, the second feature 154, and the combined feature 156 as the differences between the features. Other suitable algorithms may also be utilized to calculate differences between features, and the disclosure is not limited thereto.
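The following is a minimal sketch of computing the three constraints with the KL divergence, one of the options named above; the softmax over the class dimension and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

first_feature = torch.randn(4, 80, 6000)     # stand-in outputs of the first model
second_feature = torch.randn(4, 80, 6000)    # stand-in outputs of the second model
combined_feature = 0.5 * first_feature + 0.5 * second_feature  # as in equation (1)

def kl_constraint(p_feature, q_feature):
    # Turn feature maps into probability distributions over the class dimension,
    # then measure the KL divergence between them.
    return F.kl_div(F.log_softmax(p_feature, dim=-1),
                    F.softmax(q_feature, dim=-1),
                    reduction="batchmean")

first_constraint = kl_constraint(first_feature, second_feature)     # model 1 vs model 2
second_constraint = kl_constraint(first_feature, combined_feature)  # model 1 vs fusion
third_constraint = kl_constraint(second_feature, combined_feature)  # model 2 vs fusion
```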
Since the fusion result (combined feature) of the model has higher accuracy, supervising the model training with the difference between the fusion result and the respective output results (first feature and second feature) of the model can improve the accuracy of the trained model.
At block 230 of FIG. 2, computing device 110 trains first model 132 and second model 134 based on at least first constraint 141, second constraint 143, and third constraint 145. For example, the computing device 110 may adjust the parameters of the first model and the second model according to the determined constraints described above.
In some embodiments, the training samples 122 have labels 160 that indicate the class of the training samples. For example, the label 160 may indicate that the number in the picture is 1, or that the color of the light that is lit is green. The computing device 110 may determine the fourth constraint 147 based on a difference between the first feature 152 and the label 160, and determine the fifth constraint 149 based on a difference between the second feature 154 and the label 160. Finally, first model 132 and second model 134 are trained based on first constraint 141, second constraint 143, third constraint 145, fourth constraint 147, and fifth constraint 149. For example, computing device 110 may determine a CTC loss between the first feature 152 and the label 160, and between the second feature 154 and the label 160, as the difference between the features and the label. Any other suitable algorithm may also be applied between the features and the labels to determine the differences therebetween, and the disclosure is not limited thereto.
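A minimal sketch of the fourth and fifth constraints using the CTC loss mentioned above for the OCR setting; the label encoding, sequence lengths, and tensor shapes are assumptions.

```python
import torch

batch, time_steps, num_classes = 4, 80, 6000
first_feature = torch.randn(batch, time_steps, num_classes)    # stand-in model outputs
second_feature = torch.randn(batch, time_steps, num_classes)

targets = torch.randint(1, num_classes, (batch, 10))            # stand-in for labels 160
target_lengths = torch.full((batch,), 10, dtype=torch.long)
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0, reduction="mean")

# CTCLoss expects (time, batch, classes) log-probabilities.
fourth_constraint = ctc(first_feature.permute(1, 0, 2).log_softmax(-1),
                        targets, input_lengths, target_lengths)
fifth_constraint = ctc(second_feature.permute(1, 0, 2).log_softmax(-1),
                       targets, input_lengths, target_lengths)
```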
After computing device 110 determines the above constraints, computing device 110 may determine weights associated with first constraint 141, second constraint 143, third constraint 145, fourth constraint 147, and fifth constraint 149, respectively, and train first model 132 and second model 134 based on first constraint 141, second constraint 143, third constraint 145, fourth constraint 147, fifth constraint 149, and the associated weights.
In one embodiment, the computing device 110 may determine the total constraints to train the first model 132 and the second model 134 according to the constraints and weights described above. For example, the computing device 110 may calculate the overall constraint according to equation (2) as follows:
total constraint = a · (fourth constraint + fifth constraint) + b · first constraint + c · (second constraint + third constraint)    (2)
where a, b, and c are the associated weights. The weights may be set by a user or dynamically adjusted by the computing device according to the type of model, the type of constraints, the results of model testing, and so forth. Each constraint may have the same weight or a different weight, and the disclosure is not limited thereto. The computing device 110 may finally adjust the parameters of the first model 132 and the second model 134 according to the total constraint so as to minimize the total constraint, thereby enabling training of the models.
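Continuing the sketches above, the following illustrates one optimization step on the total constraint of equation (2); the weights a, b, c, the optimizer, and the learning rate are assumed values, and the constraint and model names reuse those from the earlier sketches.

```python
import torch

# Reuses first_model / second_model and the five constraints from the sketches above.
a, b, c = 1.0, 1.0, 1.0   # assumed weights

total_constraint = (a * (fourth_constraint + fifth_constraint)
                    + b * first_constraint
                    + c * (second_constraint + third_constraint))

optimizer = torch.optim.Adam(
    list(first_model.parameters()) + list(second_model.parameters()), lr=1e-3)

optimizer.zero_grad()
total_constraint.backward()   # propagate the total constraint to both models
optimizer.step()              # adjust parameters to reduce the total constraint
```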
In some embodiments, the computing device 110 may continually adjust the weights based on the results of testing the first model 132. For example, if the computing device 110 determines during the testing phase that the difference between the output of the model and the ground-truth label is large, the value of weight a may be set much higher than the values of weights b and c. The model can therefore be trained in a targeted manner by adjusting the weights assigned to the different constraints, enabling efficient and accurate model training.
In some embodiments, after the first model 132 and the second model 134 are trained to convergence, the computing device 110 may determine the more accurate of the trained first model 132 and the trained second model 134 as the target model. For example, the computing device 110 may test the trained first model 132 and the trained second model 134 using the same test set, and take as the target model the one whose outputs differ least from the ground-truth labels. By further selecting among the trained models, the accuracy of the finally obtained model can be further improved.
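A minimal sketch of this selection step; the evaluate helper is hypothetical and stands for whatever accuracy metric is computed on the shared test set.

```python
def select_target_model(first_model, second_model, test_loader, evaluate):
    # Evaluate both trained models on the same test set and keep the more accurate one.
    first_score = evaluate(first_model, test_loader)
    second_score = evaluate(second_model, test_loader)
    return first_model if first_score >= second_score else second_model
```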
According to the embodiments of the present disclosure, training of a model can be supervised using the differences between the fusion result, which has stronger feature representation capability, and each model's own output, whereby the accuracy of the trained model can be improved. Adjusting the weights of the different constraints according to the type of model and the test results may further improve the accuracy of the trained model.
Fig. 3 shows a flowchart of an example method 300 for processing data, in accordance with an embodiment of the present disclosure. For example, the method 300 may be performed by a computing device as shown in FIG. 1A.
At block 310 of fig. 3, the computing device 110 may obtain input data. A trained model trained in the manner described above may be deployed at the computing device 110. In some embodiments, the input data may be image data to be image classified, and the trained model is one of an image classification model, a semantic segmentation model, and a target recognition model.
At block 320 of fig. 3, the computing device 110 may determine a prediction result for the input data using the trained model. For example, in an embodiment where the input data is image data to be classified and the trained model is an image classification model, the prediction result is a classification result of the image. In an embodiment where the input data is image data to be semantically segmented and the trained model is a semantic segmentation model, the prediction result is a semantic segmentation result. In an embodiment where the input data is image data on which target recognition is to be performed and the trained model is a target recognition model, the prediction result is a target recognition result. The scheme according to the present disclosure may also be applied to other tasks related to image processing, or tasks performed based on image processing techniques (e.g., automatic driving, autonomous parking, etc.).
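A minimal sketch of this inference step, assuming the selected target model was serialized with PyTorch; the file name, input shape, and preprocessing are placeholders rather than names from the disclosure.

```python
import torch

model = torch.load("target_model.pt")          # hypothetical path to the saved target model
model.eval()

with torch.no_grad():
    input_tensor = torch.randn(1, 3, 32, 320)  # stand-in for a preprocessed input image
    prediction = model(input_tensor)           # e.g. classification or recognition output
```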
Fig. 4 shows a schematic block diagram of an apparatus 400 for training a model according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes: a feature fusion module 410 configured to combine a first feature output by the first model for the training sample and a second feature output by the second model for the training sample to obtain a combined feature, the first model and the second model being initialized to have different model parameters; a first constraint determination module 420 configured to determine a first constraint, a second constraint, and a third constraint based on differences between the first feature, the second feature, and the combined feature, respectively; and a first model training module 430 configured to train the first model and the second model based on at least the first constraint, the second constraint, and the third constraint.
In some embodiments, the first constraint determining module 420 may include: a second constraint determination module configured to determine a first constraint based on a difference between the first feature and the second feature; a third constraint determination module configured to determine a second constraint based on a difference between the first feature and the combined feature; and a fourth constraint determination module configured to determine a third constraint based on a difference between the second feature and the combined feature.
In some embodiments, where the training samples have labels indicating the class of the training samples, the apparatus 400 may further comprise: a fourth constraint determination module configured to determine a fourth constraint based on a difference between the first feature and the tag; a fifth constraint determination module configured to determine a fifth constraint based on a difference between the second feature and the tag; and a second model training module configured to train the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, and the fifth constraint.
In some embodiments, wherein the second model training module may comprise: a weight determination module configured to determine weights associated with the first constraint, the second constraint, the third constraint, the fourth constraint, and the fifth constraint, respectively; and a third model training module configured to train the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, the fifth constraint, and the associated weights.
In some embodiments, the apparatus 400 may further comprise: a target model determination module configured to determine a model of higher precision of the trained first model and the trained second model as a target model.
In some embodiments, the training samples may include at least one of original samples and augmented samples augmented based on the original samples.
In some embodiments, the training samples may include at least one of: images, video, audio, and text.
Fig. 5 shows a schematic block diagram of an apparatus 500 for processing data according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 includes: a data acquisition module 510 configured to acquire input data; and a prediction module 520 configured to determine a prediction result for the input data using a trained model trained using the apparatus according to any one of claims 9-14.
In some embodiments, wherein the input data may be data of an image, the trained model may be one of an image classification model, a semantic segmentation model, and a target recognition model, and the prediction result may be a corresponding one of a classification result, a semantic segmentation result, and a target recognition result of the image.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the processes 200 and 300. For example, in some embodiments, processes 200 and 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of processes 200 and 300 described above may be performed. Alternatively, in other embodiments, computing unit 601 may be configured to perform processes 200 and 300 in any other suitable manner (e.g., by way of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system intended to overcome the drawbacks of difficult management and weak business scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A method for training a model for processing image data, comprising:
combining first features of an image output by a first model for a training sample comprising the image and second features of the image output by a second model for the training sample to obtain combined features for the image, the first model and the second model being initialized to have different model parameters;
determining a first constraint, a second constraint, and a third constraint, respectively, based on differences between the first feature, the second feature, and the combined feature; and
adjusting model parameters of the first model and the second model based at least on the first constraint, the second constraint, and the third constraint to obtain the trained first model and the trained second model;
determining a more accurate model of the trained first model and the trained second model as a target model for processing image data;
wherein determining the first constraint, the second constraint, and the third constraint, respectively, comprises:
determining the first constraint based on a difference between the first feature and the second feature;
determining the second constraint based on a difference between the first feature and the combined feature; and
determining the third constraint based on a difference between the second feature and the combined feature.
2. The method of claim 1, wherein the training samples have labels indicating categories of the training samples, the method further comprising:
determining a fourth constraint based on a difference between the first feature and the label;
determining a fifth constraint based on a difference between the second feature and the label; and
training the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, and the fifth constraint.
3. The method of claim 2, wherein training the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, the fifth constraint comprises:
determining weights associated with the first constraint, the second constraint, the third constraint, the fourth constraint, and the fifth constraint, respectively; and
training the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, the fifth constraint, and the associated weights.
4. The method of claim 1, wherein the training samples comprise at least one of original samples and augmented samples augmented based on the original samples.
5. The method of claim 1, wherein the training samples further comprise at least one of: video, audio, and text.
6. A method for processing data, comprising:
acquiring input data including image data; and
determining a prediction result for the input data using a trained model trained according to the method of any one of claims 1-5.
7. The method of claim 6, wherein the trained model is one of an image classification model, a semantic segmentation model, and a target recognition model, and the prediction result is a corresponding one of a classification result, a semantic segmentation result, and a target recognition result of the image.
8. An apparatus for training a model for processing image data, comprising:
a feature fusion module configured to combine a first feature of an image output by a first model for a training sample comprising the image and a second feature of the image output by a second model for the training sample to obtain a combined feature for the image, the first model and the second model being initialized to have different model parameters;
a first constraint determination module configured to determine a first constraint, a second constraint, and a third constraint, respectively, based on differences between the first feature, the second feature, and the combined feature; and
a first model training module configured to adjust model parameters of the first model and the second model based on at least the first constraint, the second constraint, and the third constraint to yield the trained first model and the trained second model;
a target model determination module configured to determine a higher-precision model of the trained first model and the trained second model as a target model for processing image data; wherein the first constraint determination module comprises:
a second constraint determination module configured to determine the first constraint based on a difference between the first feature and the second feature;
a third constraint determination module configured to determine the second constraint based on a difference between the first feature and the combined feature; and
a fourth constraint determination module configured to determine the third constraint based on a difference between the second feature and the combined feature.
9. The apparatus of claim 8, wherein the training samples have labels indicating categories of the training samples, the apparatus further comprising:
a fourth constraint determination module configured to determine a fourth constraint based on a difference between the first feature and the tag;
a fifth constraint determination module configured to determine a fifth constraint based on a difference between the second feature and the tag; and
a second model training module configured to train the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, the fifth constraint.
10. The apparatus of claim 9, wherein the second model training module comprises:
a weight determination module configured to determine weights associated with the first constraint, the second constraint, the third constraint, the fourth constraint, and the fifth constraint, respectively; and
a third model training module configured to train the first model and the second model based on the first constraint, the second constraint, the third constraint, the fourth constraint, the fifth constraint, and the associated weights.
11. The apparatus of claim 8, wherein the training samples comprise at least one of original samples and augmented samples augmented based on the original samples.
12. The apparatus of claim 8, wherein the training samples further comprise at least one of: video, audio, and text.
13. An apparatus for processing data, comprising:
a data acquisition module configured to acquire input data including image data; and
a prediction module configured to determine a prediction result for the input data using a trained model trained according to the apparatus of any one of claims 8-12.
14. The apparatus of claim 13, wherein the trained model is one of an image classification model, a semantic segmentation model, and a target recognition model, and the prediction result is a corresponding one of a classification result, a semantic segmentation result, and a target recognition result of the image.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 6-7.
17. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 6-7.
CN202110442612.6A 2021-04-23 2021-04-23 Method, apparatus, device, medium and program product for training a model Active CN113139463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110442612.6A CN113139463B (en) 2021-04-23 2021-04-23 Method, apparatus, device, medium and program product for training a model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110442612.6A CN113139463B (en) 2021-04-23 2021-04-23 Method, apparatus, device, medium and program product for training a model

Publications (2)

Publication Number Publication Date
CN113139463A CN113139463A (en) 2021-07-20
CN113139463B true CN113139463B (en) 2022-05-13

Family

ID=76812470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110442612.6A Active CN113139463B (en) 2021-04-23 2021-04-23 Method, apparatus, device, medium and program product for training a model

Country Status (1)

Country Link
CN (1) CN113139463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114389959B (en) * 2021-12-30 2023-10-27 深圳清华大学研究院 Network congestion control method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739046A (en) * 2020-06-19 2020-10-02 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for model update and image detection
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN112508120A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Student model training method, device, equipment, medium and program product
CN112560874A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Training method, device, equipment and medium for image recognition model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106685933B (en) * 2016-12-08 2020-06-19 腾讯科技(深圳)有限公司 Authorization policy recommendation and device
CN109978179A (en) * 2019-04-04 2019-07-05 拉扎斯网络科技(上海)有限公司 Model training method, device, electronic equipment and readable storage medium storing program for executing
CN111275133B (en) * 2020-02-24 2023-09-29 腾讯科技(深圳)有限公司 Fusion method, device and storage medium of classification model
CN111428639A (en) * 2020-03-24 2020-07-17 京东方科技集团股份有限公司 Training method of gesture recognition model, gesture recognition method and device
CN112668586B (en) * 2020-12-18 2024-05-14 北京百度网讯科技有限公司 Model training method, picture processing device, storage medium, and program product

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN111739046A (en) * 2020-06-19 2020-10-02 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for model update and image detection
CN112508120A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Student model training method, device, equipment, medium and program product
CN112560874A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Training method, device, equipment and medium for image recognition model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Training-Based Gradient LBP Feature Models for Multiresolution Texture Classification; Luping Ji et al.; IEEE Transactions on Cybernetics; 2017-09-18; Vol. 48, No. 9; pp. 2683-2696 *
Research on Business Card Recognition Based on Convolutional Neural Networks; Li Qiao; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2020-12-15; Vol. 2020, No. 12; full text *

Also Published As

Publication number Publication date
CN113139463A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN113326764B (en) Method and device for training image recognition model and image recognition
CN112561077B (en) Training method and device of multi-task model and electronic equipment
KR20220122566A (en) Text recognition model training method, text recognition method, and apparatus
CN113379627A (en) Training method of image enhancement model and method for enhancing image
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
CN112949767A (en) Sample image increment, image detection model training and image detection method
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN113393371B (en) Image processing method and device and electronic equipment
CN115456167B (en) Lightweight model training method, image processing device and electronic equipment
CN114202648B (en) Text image correction method, training device, electronic equipment and medium
CN116188907A (en) Image processing method, training method and device of semantic segmentation model
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN114266937A (en) Model training method, image processing method, device, equipment and storage medium
CN113888560A (en) Method, apparatus, device and storage medium for processing image
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
CN112990046B (en) Differential information acquisition method, related device and computer program product
CN113343979B (en) Method, apparatus, device, medium and program product for training a model
CN114913533A (en) Method and device for changing character weight
CN114445668A (en) Image recognition method and device, electronic equipment and storage medium
CN114282664A (en) Self-feedback model training method and device, road side equipment and cloud control platform
CN113688938A (en) Method for determining object emotion and method and device for training emotion classification model
CN113792804A (en) Training method of image recognition model, image recognition method, device and equipment
CN113205131A (en) Image data processing method and device, road side equipment and cloud control platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant