CN113052246B - Method and related apparatus for training classification model and image classification

Info

Publication number: CN113052246B
Application number: CN202110340913.8A
Other versions: CN113052246A (in Chinese)
Authority: CN (China)
Prior art keywords: model, loss value, scene, training, classification model
Inventor: 王美玲
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Active (granted)
Events: application filed by Beijing Baidu Netcom Science and Technology Co Ltd with priority to CN202110340913.8A; publication of CN113052246A; application granted and publication of CN113052246B

Classifications

    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/214: Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • H04N 21/2187: Selective content distribution, e.g. interactive television or video on demand [VOD]; source of audio or video content; live feed
    • Y02T 10/40: Climate change mitigation technologies related to transportation; internal combustion engine based vehicles; engine management systems

Abstract

The disclosure provides a method, an apparatus, an electronic device, a computer readable storage medium and a computer program product for training a classification model and for image classification, relates to the field of artificial intelligence technology such as computer vision and deep learning, and can be applied to live broadcast auditing scenarios. One embodiment of the method comprises the following steps: obtaining a first loss value determined by a first model after a first number of rounds of iterative training on a sample image marked with scene type information; obtaining a second loss value determined by a second model after a second number of rounds of iterative training on the sample image; determining a comprehensive loss value based on the first loss value and the second loss value; and finally training based on the comprehensive loss value to obtain a scene classification model, wherein the model scale of the first model is larger than that of the second model, and the first number of rounds is smaller than the second number of rounds. In this embodiment, the different models forming the classification model are trained differentially, which improves model training efficiency while preserving classification quality.

Description

Method and related apparatus for training classification model and image classification
Technical Field
The present disclosure relates to the field of image processing technology, and in particular to the field of artificial intelligence technology such as computer vision and deep learning, and more particularly to a method, an apparatus, an electronic device, a computer readable storage medium and a computer program product for training a classification model, and an image classification method, an apparatus, an electronic device, a computer readable storage medium and a computer program product.
Background
With the development of society and to meet users' demands for real-time social interaction and sharing, the live broadcast industry has grown rapidly and is attracting attention from practitioners in many fields.
However, the current live broadcast industry is of uneven quality and live content is varied. To supervise live content on the network and maintain a healthy network space, the live content and the scenes of users need to be classified and identified, and content that does not meet moral and legal requirements needs to be removed.
The scene classification methods provided in the prior art cannot achieve both good scene classification quality and good scene classification efficiency.
Disclosure of Invention
Embodiments of the present disclosure propose a method, apparatus, electronic device, computer readable storage medium and computer program product for training a classification model and image classification.
In a first aspect, an embodiment of the present disclosure proposes a method for training a classification model, comprising: acquiring a first loss value determined by a first model after a first round of iterative training on a sample image, wherein scene type information is marked in the sample image; obtaining a second loss value determined by a second model after iterative training of a second round number on the sample image, wherein the model scale of the first model is larger than that of the second model, and the first round number is smaller than that of the second round number; and determining a comprehensive loss value based on the first loss value and the second loss value, and training to obtain a scene classification model based on the comprehensive loss value.
In a second aspect, embodiments of the present disclosure propose an apparatus for training a classification model, comprising: the first loss value acquisition unit is configured to acquire a first loss value determined by a first model after a first round of iterative training on a sample image, wherein scene type information is marked in the sample image; a second loss value obtaining unit configured to obtain a second loss value determined by a second model after a second round of iterative training on the sample image, the model scale of the first model being larger than the model scale of the second model, the first round being smaller than the second round; a comprehensive loss value determination unit configured to determine a comprehensive loss value based on the first loss value and the second loss value; and the model training unit is configured to train to obtain a scene classification model based on the comprehensive loss value.
In a third aspect, an embodiment of the present disclosure provides an image classification method, including: receiving a scene image to be classified; invoking a scene classification model to identify the actual scene type of the scene image to be classified; the scene classification model is trained by the method for training the classification model described in any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides an image classification apparatus, including: an image receiving unit configured to receive an image of a scene to be classified; the image classification unit is configured to call a scene classification model to identify the actual scene type of the scene image to be classified; wherein the scene classification model is trained by the means for training a classification model described by any implementation of the second aspect.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to implement a method for training a classification model as described in any one of the implementations of the first aspect or an image classification method as described in any one of the implementations of the third aspect when executed.
In a sixth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for enabling a computer to implement a method for training a classification model as described in any one of the implementations of the first aspect or an image classification method as described in any one of the implementations of the third aspect when executed.
In a seventh aspect, the presently disclosed embodiments provide a computer program product comprising a computer program which, when executed by a processor, is capable of implementing a method for training a classification model as described in any of the implementations of the first aspect or an image classification method as described in any of the implementations of the third aspect.
The embodiments of the present disclosure provide a method, an apparatus, an electronic device, a computer readable storage medium and a computer program product for training a classification model: a first loss value determined by a first model after a first number of rounds of iterative training on a sample image marked with scene type information, and a second loss value determined by a second model after a second number of rounds of iterative training on the sample image, are acquired; a comprehensive loss value is determined based on the first loss value and the second loss value; and a scene classification model is finally obtained by training based on the comprehensive loss value, where the model scale of the first model is larger than that of the second model and the first number of rounds is smaller than the second number of rounds.
On the basis of this model training scheme, an image classification method, apparatus, electronic device, computer readable storage medium and computer program product are also provided, so that when a scene image to be classified is received, the scene classification model trained by the above scheme is invoked for recognition.
To improve the accuracy of scene classification results, the present disclosure jointly uses a scene classification model composed of multiple models, and to further improve the training efficiency of the scene classification model, the first model and the second model, which have different scales and together form the scene classification model, are trained differentially: the larger of the two models is trained for fewer rounds and the smaller model is trained for more rounds, which improves model training efficiency while preserving classification quality.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture in which the present disclosure may be applied;
FIG. 2 is a flow chart of a method for training a classification model provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of another method for training a classification model provided by an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an attention residual block for building a residual neural network model in an image classification method according to an embodiment of the disclosure;
FIG. 5 is a flowchart of an image classification method according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus for training a classification model according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an image classification apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device suitable for performing the method for training the classification model and/or the image classification method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.
In addition, in the technical solutions of the present disclosure, the acquisition, storage and use of any personal information involved (such as the sample images and any face information that may appear in the scene images to be classified) comply with the relevant laws and regulations and do not violate public order and good morals.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the methods, apparatus, electronic devices, and computer-readable storage media for training classification models and image classification of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various applications for implementing information communication between the terminal devices 101, 102, 103 and the server 105, such as a scene recognition application, a live broadcast audit application, an instant messaging application, and the like, may be installed on the terminal devices.
The terminal devices 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, laptop and desktop computers, etc.; when the terminal devices 101, 102, 103 are software, they may be installed in the above-listed electronic devices, which may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server; when the server is software, the server may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein.
The server 105 can provide various services through various built-in applications. Taking a scene recognition application capable of providing a scene classification service for the content in an image as an example, the server 105 can achieve the following effects when running this scene recognition application: first, a scene image to be classified is received from the terminal devices 101, 102 and 103 through the network 104; then a pre-trained scene classification model is obtained from a preset storage location and the scene image to be classified is input into the scene classification model as input data; finally, the result output by the scene classification model is returned to the terminal devices 101, 102 and 103.
The scene classification model may be obtained by a model training application built into the server 105, trained in advance on sample images marked with scene type information according to the following steps: first, a first loss value determined by a first model after a first number of rounds of iterative training on the sample images and a second loss value determined by a second model after a second number of rounds of iterative training on the sample images are obtained, where the model scale of the first model is larger than that of the second model and the first number of rounds is smaller than the second number of rounds; then a comprehensive loss value is determined based on the first loss value and the second loss value; and finally a scene classification model is obtained by training based on the comprehensive loss value.
It should be noted that the sample images marked with scene type information and/or the images to be classified may also be stored in advance in the server 105 in various ways, rather than being acquired from the terminal devices 101, 102, 103 through the network 104. Thus, when the server 105 detects that such data is already stored locally (e.g., sample images marked with scene type information and/or images to be classified collected before processing begins), it may choose to obtain the data directly from local storage, in which case the exemplary system architecture 100 may not include the terminal devices 101, 102, 103 and the network 104.
Since training the scene classification model requires considerable computing resources and strong computing power, the method for training the classification model provided in the following embodiments of the present disclosure is generally performed by the server 105, which has stronger computing power and more computing resources, and accordingly the apparatus for training the classification model is also generally disposed in the server 105. However, when the terminal devices 101, 102, 103 also have the required computing capability and computing resources, they may perform, through the model training application installed on them, the operations otherwise performed by the server 105 and output the same result as the server 105. In particular, when there are multiple terminal devices with different computing capabilities, and the model training application judges that the terminal device on which it runs has strong computing capability and ample idle computing resources, the terminal device may be allowed to perform the above computation, thereby appropriately relieving the computing pressure on the server 105; correspondingly, the apparatus for training the classification model may also be provided in the terminal devices 101, 102, 103. In this case, the exemplary system architecture 100 may also not include the server 105 and the network 104.
In particular, the scene classification model obtained through training on the server 105 can also be distilled into a lightweight scene classification model suitable for deployment on the terminal devices 101, 102 and 103. That is, depending on the recognition accuracy required in practice, either the lightweight scene classification model on the terminal devices 101, 102, 103 or the scene classification model on the server 105 can be flexibly selected for use.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 is a flowchart of a method for training a classification model according to an embodiment of the disclosure, wherein the flowchart 200 includes the following steps:
step 201, obtaining a first loss value determined by the first model after a first round of iterative training on the sample image.
In this embodiment, an execution subject (e.g., the server 105 shown in fig. 1) of the method for training the classification model acquires a first loss value determined by a first model after a first number of iterative training on a sample image marked with scene type information.
The first model is one of a plurality of models constituting the scene classification model. In general, the first model alone is capable of classifying the scene of the content in an image, for example an attention residual 101 network model (attention-Resnet101) built on a residual network model, a convolutional neural network model, a LeNet-5 model, and the like.
It should be noted that the training of the first model using the sample images marked with scene type information may be performed in the execution body or by another terminal device; when the first model is trained by another terminal device, only the configuration parameters of the first model obtained after the first number of rounds of iterative training and the determined first loss value need to be sent to the execution body.
In practice, since the model training process generally updates the model as the model is used and training work proceeds, it is preferable to perform iterative training on the first model in the execution body to obtain the first loss value.
It should be noted that the sample image marked with the scene type information may be obtained by the execution body directly from a local storage device or may be obtained from a non-local storage device (for example, the terminal devices 101, 102, 103 shown in fig. 1). The local storage device may be a data storage module, such as a server hard disk, provided in the execution body, in which case the sample image marked with scene type information may be read quickly locally; the non-local storage device may also be any other electronic device arranged to store data, such as some user terminals or the like, in which case the executing entity may acquire the required sample image marked with scene type information by sending an acquisition command to the electronic device.
It should be understood that the scene type information in the sample image may be obtained by manually pre-marking, marking the sample image after processing with other trained scene classification models, and so on.
Step 202, obtaining a second loss value determined by the second model after the sample image is subjected to iterative training of a second round number.
In this embodiment, building on step 201, the execution body obtains a second loss value determined by iteratively training, on the same sample images, a second model that differs from the first model and that together with it forms the scene classification model. The selected second model also generally has the ability to classify the scene of the content in an image on its own. After the specific first model has been determined in step 201, a second model whose model scale is smaller than that of the first model is selected; for example, when the first model is the attention residual 101 network model, the corresponding second model may be an attention lightweight model (anti-mobilet) whose model scale is smaller than that of the first model. The selected second model is then iteratively trained for a second number of rounds that is greater than the first number of rounds to obtain the second loss value.
In some optional embodiments of the present disclosure, the sample images used to train the first model and the second model may differ, so as to increase the independence of the training of the two models and thereby improve the recognition capability of the scene classification model composed of the trained first model and second model.
In some optional embodiments of the present disclosure, the first model and the second model may be iteratively trained for a preset number of rounds that is greater than or equal to the second number of rounds. During this iterative training, once the first model has been trained for the first number of rounds, its configuration parameters are fixed for the remaining rounds up to the preset number, and only the second model continues to be trained and to have its configuration parameters updated; this achieves the same effect as described above while improving the training efficiency of the model.
In this case, when the scene classification model is subsequently trained with other training samples, the configuration parameters of the first model in the classification model can remain fixed and only the second model is trained and updated, so that the scene classification model can be updated quickly.
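As an illustration only, this rapid-update strategy might be sketched as follows in Python, assuming PyTorch-style models named first_model and second_model (the framework, the names and the SGD optimizer are assumptions; the disclosure does not prescribe an implementation):

import torch

def freeze_first_model(first_model: torch.nn.Module,
                       second_model: torch.nn.Module) -> torch.optim.Optimizer:
    # Fix the configuration parameters of the larger first model so that, when
    # new training samples arrive, only the smaller second model is trained
    # and updated (assumed realisation of the rapid-update case above).
    for param in first_model.parameters():
        param.requires_grad = False
    return torch.optim.SGD(second_model.parameters(), lr=1e-3)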
Step 203, determining a comprehensive loss value based on the first loss value and the second loss value, and training to obtain a scene classification model based on the comprehensive loss value.
In this embodiment, the first loss value and the second loss value determined in steps 201 and 202 are obtained, and they may be combined by weighted addition, numerical averaging or similar means to obtain a comprehensive loss value; finally, the scene classification model based on the first model and the second model is obtained by training according to the resulting comprehensive loss value.
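For readability, the overall flow of steps 201 to 203 can be sketched in Python as below. This is a hedged illustration only: the use of PyTorch, cross-entropy as the per-model loss, SGD, and a fixed weight for the weighted addition are assumptions not taken from the disclosure.

import torch
import torch.nn.functional as F

def train_scene_classifier(first_model, second_model, data_loader,
                           first_rounds=10, second_rounds=30, weight=0.5):
    # The larger first model is trained for fewer rounds (first_rounds), the
    # smaller second model for more rounds (second_rounds), and their loss
    # values are fused into a comprehensive loss value by weighted addition.
    opt1 = torch.optim.SGD(first_model.parameters(), lr=1e-3)
    opt2 = torch.optim.SGD(second_model.parameters(), lr=1e-3)
    for round_idx in range(second_rounds):
        for images, scene_labels in data_loader:  # sample images marked with scene types
            second_loss = F.cross_entropy(second_model(images), scene_labels)
            if round_idx < first_rounds:
                first_loss = F.cross_entropy(first_model(images), scene_labels)
                comprehensive_loss = weight * first_loss + (1 - weight) * second_loss
                opt1.zero_grad(); opt2.zero_grad()
                comprehensive_loss.backward()
                opt1.step(); opt2.step()
            else:
                # after the first number of rounds only the second model keeps training
                opt2.zero_grad()
                second_loss.backward()
                opt2.step()
    return first_model, second_model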
In this embodiment, to improve the accuracy of scene classification results, a scene classification model composed of multiple models is used jointly, and the training efficiency of the scene classification model is further improved by the differential training described above: the larger first model is trained for fewer rounds and the smaller second model for more rounds, so that classification quality is preserved while training efficiency increases.
Referring to fig. 3, fig. 3 is a flowchart of another method for training a classification model according to an embodiment of the disclosure, wherein the flowchart 300 includes the following steps:
step 301, obtaining a first loss value determined by the first model after a first round of iterative training on the sample image.
Step 302, obtaining a second loss value determined by the second model after the sample image is subjected to iterative training of a second round number.
Step 303, determining a first ratio of the first model to the second model on the model scale and/or a second ratio between the first number of rounds and the second number of rounds.
In this embodiment, a first ratio of the model scale of the first model to the model scale of the second model and/or a second ratio of the first number of rounds to the second number of rounds is determined.
Step 304, determining a target weight value based on the first ratio and/or the second ratio.
In this embodiment, the target weight may be determined directly from the first ratio or from the second ratio, and it is positively correlated with whichever ratio is used; that is, the larger the first ratio (or the second ratio), the larger the target weight, and hence the larger the contribution of the first loss value obtained from the first model relative to the second loss value obtained from the second model. Alternatively, the target weight may be determined from the first ratio and the second ratio together, for example from their sum or their average, so that the weighting relationship between the first loss value and the second loss value is governed by the target weight.
Step 305, weighting the first loss value and the second loss value according to the target weight, and determining the integrated loss value.
In this embodiment, the first loss value and the second loss value are weighted based on the target weight determined in step 304, so as to obtain a comprehensive loss value.
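A minimal Python sketch of steps 303 to 305 follows, assuming the target weight is formed from the average of the two ratios and then normalised so that the two loss weights sum to one (both choices are assumptions; the disclosure only requires positive correlation with the ratios):

def weighted_comprehensive_loss(first_loss, second_loss,
                                first_model_size, second_model_size,
                                first_rounds, second_rounds):
    # first ratio: model scale of the first model relative to the second model
    first_ratio = first_model_size / second_model_size
    # second ratio: first number of rounds relative to the second number of rounds
    second_ratio = first_rounds / second_rounds
    target_weight = (first_ratio + second_ratio) / 2.0  # positively correlated with both ratios
    w = target_weight / (1.0 + target_weight)            # normalisation is an assumption
    return w * first_loss + (1.0 - w) * second_loss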
And step 306, training to obtain a scene classification model based on the comprehensive loss value.
Steps 301, 302 and 306 are similar to steps 201-203 shown in fig. 2, respectively, and the same parts are referred to the corresponding parts of the previous embodiment, and will not be described again.
Building on the implementation shown in fig. 2, this embodiment further determines reference values for the first model and the second model based on their model scales and/or training rounds, and uses them to generate the corresponding target weight. When the comprehensive loss value used to obtain the scene classification model is determined according to this target weight, the scales and training conditions of the two models are taken into account, so a higher-quality comprehensive loss value can be determined and the quality of the resulting scene classification model is improved.
In some optional implementations of this embodiment, weighting the first loss value and the second loss value according to a target weight, determining the composite loss value includes: weighting the first loss value and the second loss value according to the target weight value to obtain a target loss value; the integrated loss value is determined based on an average of the target loss value and the second loss value.
Specifically, after the first loss value and the second loss value are weighted according to the target weight, the weighted result may be further averaged with the second loss value obtained from the iterative training of the second model, so that the comprehensive loss value is closer to the results of the later training rounds. In particular, in the scenario where only the second model is trained after the sample images are updated, this increases the proportion of the training results obtained with higher quality and/or with the updated sample images, and thus improves the training quality of the scene classification model.
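Under the same assumptions as the sketch above, this averaging variant can be written as:

def comprehensive_loss_with_average(first_loss, second_loss, target_weight):
    w = target_weight / (1.0 + target_weight)
    target_loss = w * first_loss + (1.0 - w) * second_loss  # weighted target loss value
    # average the target loss value with the second loss value so that the
    # later-round training of the second model carries more weight
    return (target_loss + second_loss) / 2.0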
In some optional implementations of this embodiment, training to obtain a scene classification model based on the comprehensive loss value includes: training to obtain the scene classification model by controlling the comprehensive loss value to be minimal.
Specifically, corresponding comprehensive loss values may be determined according to the various implementations provided in this embodiment, the minimum available comprehensive loss value is determined from the multiple resulting comprehensive loss values, and the scene classification model is obtained by training based on this minimum loss value, so as to obtain a high-quality scene classification model.
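One plausible reading of this selection step, sketched below, is to keep the model state whose comprehensive loss value is the minimum among the candidates produced during training (the checkpointing mechanism and the PyTorch state_dict usage are assumptions):

import copy

def select_min_loss_state(model, comprehensive_losses_and_states):
    # comprehensive_losses_and_states: iterable of (loss_value, state_dict) pairs
    best_loss, best_state = float("inf"), None
    for loss_value, state in comprehensive_losses_and_states:
        if loss_value < best_loss:
            best_loss, best_state = loss_value, copy.deepcopy(state)
    if best_state is not None:
        model.load_state_dict(best_state)  # restore the state with the minimum comprehensive loss
    return model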
On the basis of any of the above embodiments, the first model includes a residual neural network model built from scene attention residual blocks.
Specifically, the residual neural network may be built from attention residual blocks, where an attention residual block includes: a convolution layer (conv), a rectified linear unit (relu), a dimension transformation unit (reshape), a transpose unit (transpose), a logistic regression unit (softmax) and a reverse mapping unit (1-matrix), and the corresponding intermediate results include a convolution kernel (kernel) and a feature matrix (matrix). Building the residual neural network from such attention residual blocks allows the attention mechanism to focus on more scene information, enabling finer recognition and classification of the scene information of the content in the image to be classified.
By way of example, the specific structure of the attention residual block and the corresponding data processing flow may be as shown in fig. 4, where the 1*1 and 3*3 labels before the convolution layers denote the sizes of the convolution kernels.
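Because fig. 4 is not reproduced here, the exact wiring of the block is not shown; the following Python sketch is only one plausible arrangement of the listed units (conv, relu, reshape, transpose, softmax and the 1-matrix reverse mapping), with the channel-attention formulation and the learnable mixing factor gamma being assumptions:

import torch
import torch.nn as nn

class AttentionResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # 3*3 convolution
        self.query = nn.Conv2d(channels, channels, kernel_size=1)               # 1*1 convolution
        self.key = nn.Conv2d(channels, channels, kernel_size=1)                 # 1*1 convolution
        self.relu = nn.ReLU(inplace=True)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable mixing factor (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feat = self.relu(self.conv3x3(x))                         # feature matrix
        q = self.query(feat).reshape(b, c, h * w)                 # dimension transformation (reshape)
        k = self.key(feat).reshape(b, c, h * w).transpose(1, 2)   # transpose unit
        attn = torch.softmax(torch.bmm(q, k), dim=-1)             # softmax attention matrix
        v = feat.reshape(b, c, h * w)
        attended = torch.bmm(attn, v)                             # attention-weighted scene features
        complementary = torch.bmm(1.0 - attn, v)                  # reverse mapping unit (1 - matrix)
        out = (attended + self.gamma * complementary).reshape(b, c, h, w)
        return self.relu(x + out)                                 # residual connection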
Further, on the basis of any of the above embodiments, in order to make full use of the scene classification model trained in the above manner, the scene classification model may be applied in an actual image classification operation according to the following steps:
Receiving a scene image to be classified;
invoking a scene classification model to identify the actual scene type of the scene image to be classified; wherein the scene classification model is trained by the method for training the classification model provided in fig. 2 and/or fig. 3.
On this basis, please refer to fig. 5, fig. 5 is a flowchart of an image classification method according to an embodiment of the disclosure, wherein the flowchart 500 includes the following steps:
step 501, receiving a scene image to be classified.
In the present embodiment, an execution subject of the image classification method (e.g., the server 105 shown in fig. 1) receives an image of a scene to be classified.
The scene image to be classified may be obtained directly from a local storage device by the execution subject, or may be obtained from a non-local storage device (for example, terminal devices 101, 102, 103 shown in fig. 1). The local storage device may be a data storage module, such as a server hard disk, provided in the execution body, in which case the scene images to be classified may be read out quickly locally; the non-local storage device may also be any other electronic device arranged to store data, such as some user terminal or the like, in which case the executing entity may acquire the desired scene image to be classified by sending an acquisition command to the electronic device.
Step 502, in response to the existence of the first scene classification model and the second scene classification model, respectively calling the first scene classification model and the second scene classification model to identify the image to be identified, and obtaining a first scene type and a second scene type.
In this embodiment, in response to the existence of a first scene classification model and a second scene classification model, the two models are respectively invoked to process the scene image to be classified obtained in step 501, yielding a corresponding first scene type and second scene type as processing results. The first scene classification model is obtained by training the scene classification model provided in any of the above embodiments on training samples labeled with large-category (coarse-grained) scene labels, and the second scene classification model is obtained by training it on training samples labeled with small-category (fine-grained) scene labels.
The large-category labels and the small-category labels are generally distinguished by the granularity of the recognition result; for example, large-category labels may be "bathroom", "bedroom" and "in car", while the corresponding small-category labels may be "bathroom shower", "bedroom bed" and "back seat in car".
In step 503, in response to the similarity between the large scene category to which the second scene type belongs and the first scene type meeting the preset requirement, the first scene type is determined as the actual scene type of the scene image to be classified.
In this embodiment, after the first scene type and the second scene type are obtained in step 502, the large scene category contained in the second scene type is determined as the large scene category to which the second scene type belongs; when the similarity between this large scene category and the first scene type meets a preset requirement, the first scene type is determined to be the actual scene type of the scene image to be classified.
In this embodiment, the first scene type result is verified against the finer-grained second scene type result. This avoids the larger recognition error that the second scene type alone would suffer because of its very fine granularity, while the mutual verification between the large category and the small category still validates the first scene type result, achieving high-quality image classification.
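A minimal sketch of this mutual verification follows, assuming the two models return label strings and simplifying the preset similarity requirement to exact equality; coarse_of, a mapping from a small-category label to its large scene category, is a hypothetical helper:

def classify_scene_image(image, first_scene_model, second_scene_model, coarse_of):
    first_scene_type = first_scene_model(image)    # e.g. "bathroom" (large category)
    second_scene_type = second_scene_model(image)  # e.g. "bathroom shower" (small category)
    # verify the coarse result against the large category of the fine result
    if coarse_of[second_scene_type] == first_scene_type:
        return first_scene_type                    # actual scene type of the image
    return None                                    # verification failed; handling is out of scope here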
To deepen understanding, the present disclosure further provides a specific implementation in combination with a specific application scenario:
In order to meet the requirement of high-quality classification of scenes where contents in images are located, a service provider of an image classification method firstly trains to obtain a scene classification model according to the following steps:
1) Acquiring a first loss value determined by a first model after a first number of rounds of iterative training on sample images, where the sample images carry small-category labels;
2) Obtaining a second loss value determined by a second model after iterative training of a second round number on the sample image, wherein the model scale of the first model is larger than that of the second model, and the first round number is smaller than that of the second round number;
3) And determining a comprehensive loss value based on the first loss value and the second loss value, and training to obtain a scene classification model based on the comprehensive loss value.
After training the scene classification model, the service provider tests it with 5 test images whose actual scenes are, respectively: bathroom shower area, in-car front seat, in-car rear seat, beside the kitchen stove, and beside the bedroom dressing table. The results returned by the scene classification model are: bathroom shower area, in-car front seat, in-car rear seat, beside the kitchen stove, and beside the bedroom wardrobe. Four of the five test images are classified correctly, which meets the predetermined accuracy requirement, so the scene classification model is determined to be usable and its training is complete.
The service provider jointly uses a scene classification model composed of multiple models, and trains the first model and the second model, which form the scene classification model and have different scales, differentially: the model with the larger model scale is trained for fewer rounds and the model with the smaller scale is trained for more rounds, which improves model training efficiency while ensuring classification quality.
With further reference to fig. 6 and fig. 7, as an implementation of the method shown in the foregoing figures, the present application further provides an embodiment of an apparatus for training a classification model and an image classification apparatus, where the apparatus embodiment corresponds to the foregoing method embodiment, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for training a classification model according to the present embodiment may include: a first loss value acquisition unit 601, a second loss value acquisition unit 602, a comprehensive loss value determination unit 603, and a model training unit 604. The first loss value obtaining unit 601 is configured to obtain a first loss value determined by a first model after a first round of iterative training on a sample image, where scene type information is marked in the sample image; a second loss value obtaining unit 602, configured to obtain a second loss value determined by a second model after a second number of rounds of iterative training on the sample image, where a model scale of the first model is greater than a model scale of the second model, and the first number of rounds is less than the second number of rounds; an integrated loss value determination unit 603 configured to determine an integrated loss value based on the first loss value and the second loss value; a model training unit 604 configured to train to obtain a scene classification model based on the integrated loss value.
In the present embodiment, in the apparatus 600 for training a classification model: specific processing of the first loss value obtaining unit 601, the second loss value obtaining unit 602, the comprehensive loss value determining unit 603, and the model training unit 604 and technical effects thereof may refer to the relevant descriptions of steps 201 to 203 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of this embodiment, the apparatus 600 for training a classification model further includes: a ratio determining unit configured to determine a first ratio of the first model to the second model in model scale and/or a second ratio between the first number of rounds and the second number of rounds; and a target weight generating unit configured to determine a target weight based on the first ratio and/or the second ratio, the target weight being positively correlated with both the first ratio and the second ratio. The comprehensive loss value determination unit is further configured to weight the first loss value and the second loss value according to the target weight and determine the comprehensive loss value.
In some optional implementations of the present embodiment, the comprehensive loss value determining unit 603 includes: the target loss value acquisition unit is configured to weight the first loss value and the second loss value according to the target weight value to obtain a target loss value; and a comprehensive loss value determining subunit configured to determine the comprehensive loss value based on an average of the target loss value and the second loss value.
In some optional implementations of this embodiment, the model training unit 604 is further configured to train to obtain the scene classification model by controlling the composite loss value to be minimum.
As shown in fig. 7, the image classification apparatus 700 of the present embodiment may include: an image receiving unit 701 configured to receive a scene image to be classified; an image classification unit 702 configured to invoke a scene classification model to identify an actual scene type of the scene image to be classified; wherein the scene classification model is trained by the apparatus for training a classification model shown in fig. 6.
This embodiment is the apparatus embodiment corresponding to the above method embodiment. To improve the accuracy of scene classification results, it not only jointly uses a scene classification model composed of multiple models to perform image classification, but also trains the first model and the second model, which form the scene classification model and have different scales, differentially: the model with the larger model scale is trained for fewer rounds and the model with the smaller scale is trained for more rounds, improving model training efficiency while ensuring classification quality.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as methods for training classification models. For example, in some embodiments, the method for training the classification model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the method for training a classification model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method for training the classification model by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of large management difficulty and weak service expansibility in the traditional physical host and virtual private server (VPS, virtual Private Server) service. Servers may also be divided into servers of a distributed system or servers that incorporate blockchains.
According to the technical solution of the embodiments of the present disclosure, to improve the accuracy of scene classification results, a scene classification model composed of multiple models is jointly used to perform image classification, and the first model and the second model, which form the scene classification model and have different model scales, are trained differentially: the model with the larger model scale is trained for fewer rounds and the model with the smaller scale is trained for more rounds, which improves model training efficiency while ensuring classification quality.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A method for training a classification model, comprising:
acquiring a first loss value determined by a first model after a first round of iterative training on a sample image, wherein scene type information is marked in the sample image;
obtaining a second loss value determined by a second model after iterative training of a second round number on the sample image, wherein the model scale of the first model is larger than that of the second model, and the first round number is smaller than that of the second round number;
determining a comprehensive loss value based on the first loss value and the second loss value, and training to obtain a scene classification model based on the comprehensive loss value;
wherein said determining a composite loss value based on said first loss value and said second loss value comprises:
determining a first ratio of the first model to the second model on a model scale and/or a second ratio between the first number of rounds and the second number of rounds;
determining a target weight based on the first ratio and/or the second ratio, wherein the target weight is positively correlated with the first ratio and the second ratio;
and weighting the first loss value and the second loss value according to the target weight to determine the comprehensive loss value.
2. The method of claim 1, wherein the weighting the first loss value and the second loss value according to the target weight to determine the comprehensive loss value comprises:
weighting the first loss value and the second loss value according to the target weight to obtain a target loss value;
and determining the comprehensive loss value based on an average of the target loss value and the second loss value.
3. The method of claim 1, wherein the training to obtain a scene classification model based on the comprehensive loss value comprises:
training to obtain the scene classification model by minimizing the comprehensive loss value.
4. An image classification method, comprising:
receiving a scene image to be classified;
invoking a scene classification model to identify the actual scene type of the scene image to be classified, wherein the scene classification model is trained by the method for training a classification model according to any one of claims 1-3.
5. The method of claim 4, wherein the invoking the scene classification model to identify the actual scene type of the scene image to be classified comprises:
in response to both a first scene classification model and a second scene classification model existing, calling the first scene classification model and the second scene classification model respectively to identify the scene image to be classified, so as to obtain a first scene type and a second scene type, wherein the first scene classification model is trained based on training samples annotated only with large-category labels, and the second scene classification model is trained based on training samples annotated only with small-category labels;
and determining the first scene type as the actual scene type of the scene image to be classified in response to the similarity between the large category to which the second scene type belongs and the first scene type meeting a preset requirement.
6. An apparatus for training a classification model, comprising:
a first loss value acquisition unit configured to acquire a first loss value determined by a first model after iterative training for a first number of rounds on a sample image, wherein scene type information is annotated in the sample image;
a second loss value acquisition unit configured to acquire a second loss value determined by a second model after iterative training for a second number of rounds on the sample image, wherein the model scale of the first model is larger than the model scale of the second model, and the first number of rounds is smaller than the second number of rounds;
a comprehensive loss value determination unit configured to determine a comprehensive loss value based on the first loss value and the second loss value;
a model training unit configured to train to obtain a scene classification model based on the comprehensive loss value;
a ratio determination unit configured to determine a first ratio of the first model to the second model on a model scale and/or a second ratio between the first number of rounds and the second number of rounds;
a target weight generation unit configured to determine a target weight based on the first ratio and/or the second ratio, the target weight being positively correlated with both the first ratio and the second ratio;
wherein the comprehensive loss value determination unit is further configured to weight the first loss value and the second loss value according to the target weight to determine the comprehensive loss value.
7. The apparatus of claim 6, wherein the comprehensive loss value determination unit comprises:
a target loss value obtaining unit configured to weight the first loss value and the second loss value according to the target weight to obtain a target loss value;
a comprehensive loss value determination subunit configured to determine the comprehensive loss value based on an average of the target loss value and the second loss value.
8. The apparatus of claim 6, wherein the model training unit is further configured to train to obtain the scene classification model by minimizing the comprehensive loss value.
9. An image classification apparatus comprising:
an image receiving unit configured to receive an image of a scene to be classified;
an image classification unit configured to call a scene classification model to identify the actual scene type of the scene image to be classified, wherein the scene classification model is trained by the apparatus for training a classification model according to any one of claims 6-8.
10. The apparatus of claim 9, wherein the image classification unit includes:
an image type generation subunit configured to, in response to both a first scene classification model and a second scene classification model existing, call the first scene classification model and the second scene classification model respectively to identify the scene image to be classified, so as to obtain a first scene type and a second scene type, wherein the first scene classification model is trained based on training samples annotated only with large-category labels, and the second scene classification model is trained based on training samples annotated only with small-category labels;
and a classification result determination subunit configured to determine the first scene type as the actual scene type of the scene image to be classified in response to the similarity between the large category to which the second scene type belongs and the first scene type meeting a preset requirement.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for training a classification model of any one of claims 1-3 and/or the image classification method of any one of claims 4-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for training a classification model of any one of claims 1-3 and/or the image classification method of any one of claims 4-5.
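The two-model inference recited in claims 5 and 10 can be read as the following hedged sketch. The class names, the mapping from small categories to large categories, the exact similarity check, and the fallback behaviour are illustrative assumptions; the claims only require comparing the large category of the second (fine-grained) prediction with the first (coarse) prediction against a preset requirement.

```python
# A hedged sketch of the two-model inference in claims 5 and 10.
# All names, the mapping table, and the fallback behaviour are assumptions.
from typing import Callable, Optional

# Hypothetical mapping from small-category (fine-grained) labels to large categories.
SMALL_TO_LARGE = {
    "concert": "entertainment",
    "talk_show": "entertainment",
    "football": "sports",
    "basketball": "sports",
}


def classify_scene(image,
                   first_model: Callable,    # trained on large-category labels only
                   second_model: Callable,   # trained on small-category labels only
                   ) -> Optional[str]:
    first_type = first_model(image)    # e.g. "sports"
    second_type = second_model(image)  # e.g. "football"
    # "Similarity meets a preset requirement" is read here as exact agreement between
    # the large category of the second prediction and the first prediction.
    if SMALL_TO_LARGE.get(second_type) == first_type:
        return first_type  # the first scene type becomes the actual scene type
    return None  # the claims do not specify the behaviour when the check fails


# Usage with stand-in models (illustrative only).
print(classify_scene("frame.jpg",
                     first_model=lambda img: "sports",
                     second_model=lambda img: "football"))
```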
CN202110340913.8A 2021-03-30 2021-03-30 Method and related apparatus for training classification model and image classification Active CN113052246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110340913.8A CN113052246B (en) 2021-03-30 2021-03-30 Method and related apparatus for training classification model and image classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110340913.8A CN113052246B (en) 2021-03-30 2021-03-30 Method and related apparatus for training classification model and image classification

Publications (2)

Publication Number Publication Date
CN113052246A CN113052246A (en) 2021-06-29
CN113052246B true CN113052246B (en) 2023-08-04

Family

ID=76516506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110340913.8A Active CN113052246B (en) 2021-03-30 2021-03-30 Method and related apparatus for training classification model and image classification

Country Status (1)

Country Link
CN (1) CN113052246B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642662B (en) * 2021-08-24 2024-02-20 凌云光技术股份有限公司 Classification detection method and device based on lightweight classification model
CN114647198B (en) * 2022-03-09 2023-02-03 深圳市经纬纵横科技有限公司 Intelligent home control method and system based on Internet of things and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019011093A1 (en) * 2017-07-12 2019-01-17 腾讯科技(深圳)有限公司 Machine learning model training method and apparatus, and facial expression image classification method and apparatus
CN109948478A (en) * 2019-03-06 2019-06-28 中国科学院自动化研究所 The face identification method of extensive lack of balance data neural network based, system
WO2020134102A1 (en) * 2018-12-29 2020-07-02 北京沃东天骏信息技术有限公司 Article recognition method and device, vending system, and storage medium
CN111553428A (en) * 2020-04-30 2020-08-18 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for training discriminant model
CN112163637A (en) * 2020-10-19 2021-01-01 平安国际智慧城市科技股份有限公司 Image classification model training method and device based on unbalanced data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI612433B (en) * 2016-11-17 2018-01-21 財團法人工業技術研究院 Ensemble learning prediction aparatus and method, and non-transitory computer-readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019011093A1 (en) * 2017-07-12 2019-01-17 腾讯科技(深圳)有限公司 Machine learning model training method and apparatus, and facial expression image classification method and apparatus
WO2020134102A1 (en) * 2018-12-29 2020-07-02 北京沃东天骏信息技术有限公司 Article recognition method and device, vending system, and storage medium
CN109948478A (en) * 2019-03-06 2019-06-28 中国科学院自动化研究所 The face identification method of extensive lack of balance data neural network based, system
CN111553428A (en) * 2020-04-30 2020-08-18 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for training discriminant model
CN112163637A (en) * 2020-10-19 2021-01-01 平安国际智慧城市科技股份有限公司 Image classification model training method and device based on unbalanced data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"面向不平衡数据集的机器学习分类策略";徐玲玲等;《计算机工程与应用》;第56卷(第24期);第12-27页 *

Also Published As

Publication number Publication date
CN113052246A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN112561078B (en) Distributed model training method and related device
CN113052246B (en) Method and related apparatus for training classification model and image classification
CN114723966B (en) Multi-task recognition method, training method, device, electronic equipment and storage medium
CN114428677B (en) Task processing method, processing device, electronic equipment and storage medium
CN112541124A (en) Method, apparatus, device, medium and program product for generating a multitask model
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN115690443B (en) Feature extraction model training method, image classification method and related devices
US20220398834A1 (en) Method and apparatus for transfer learning
CN113627361B (en) Training method and device for face recognition model and computer program product
CN113378855A (en) Method for processing multitask, related device and computer program product
CN116468479A (en) Method for determining page quality evaluation dimension, and page quality evaluation method and device
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
WO2022143505A1 (en) Group type identification method and apparatus, computer device, and medium
CN113032251B (en) Method, device and storage medium for determining service quality of application program
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113449778A (en) Model training method for quantum data classification and quantum data classification method
CN108510071B (en) Data feature extraction method and device and computer readable storage medium
CN113313196B (en) Labeling data processing method, related device and computer program product
CN114926447B (en) Method for training a model, method and device for detecting a target
CN113591709B (en) Motion recognition method, apparatus, device, medium, and product
CN114913187B (en) Image segmentation method, training method, device, electronic device and storage medium
CN113361575B (en) Model training method and device and electronic equipment
CN113408664B (en) Training method, classification method, device, electronic equipment and storage medium
CN113362428B (en) Method, apparatus, device, medium, and product for configuring color

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant