CN109816023B

CN109816023B - Method and device for generating picture label model

Info

Publication number: CN109816023B
Application number: CN201910084639.5A
Authority: CN
Inventors: 李伟健; 王长虎
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2022-01-04
Anticipated expiration: 2039-01-29
Also published as: CN109816023A

Abstract

The embodiment of the disclosure discloses a method and a device for generating a picture label model. One embodiment of the method comprises: acquiring at least two sample picture sets; selecting a sample picture set from at least two sample picture sets, and executing the following training steps by using the selected sample picture set: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model; determining whether at least two sample picture sets include a sample picture set that is not selected; in response to determining not to include, determining the initial model after the last training to be a picture label model. The embodiment improves the flexibility of model training and is beneficial to improving the accuracy of classifying the pictures by using the picture label model.

Description

Method and device for generating picture label model

Technical Field

The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for generating a picture label model.

Background

The multi-label classification is to classify certain information into a plurality of categories, that is, one information may have a plurality of labels. The existing method for multi-label classification of pictures usually adopts a multi-label classification model. The model includes a plurality of sigmoid activation functions, each activation function corresponding to a tag. When the model is trained, a single sample picture corresponds to a plurality of labeled labels, the trained model can output a plurality of labels, and each label corresponds to a picture category.

Disclosure of Invention

The embodiment of the disclosure provides a method and a device for generating a picture label model and a method and a device for generating a category label set of a picture.

In a first aspect, an embodiment of the present disclosure provides a method for generating a picture label model, where the method includes: acquiring at least two sample picture sets, wherein the sample picture sets correspond to preset picture categories, the sample picture sets comprise positive sample pictures belonging to the corresponding picture categories and negative sample pictures not belonging to the corresponding picture categories, the positive sample pictures correspond to pre-labeled positive category information, and the negative sample pictures correspond to pre-labeled negative category information; selecting a sample picture set from at least two sample picture sets, and executing the following training steps by using the selected sample picture set: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model; determining whether at least two sample picture sets include a sample picture set that is not selected; in response to determining not to include, determining the initial model after the last training to be a picture label model.

In some embodiments, the method further comprises: and in response to determining that the at least two sample picture sets include an unselected sample picture set, reselecting the sample picture set from the unselected sample picture set, and continuing to perform the training step using the reselected sample picture set and the initial model after the last training.

In some embodiments, the at least two sample picture sets are obtained in advance according to the following steps: acquiring at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets comprise positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information; for a sample video set in at least two sample video sets, extracting a video frame from a positive sample video included in the sample video set as a positive sample picture, and determining positive category information corresponding to the positive sample video as the positive category information of the extracted positive sample picture; extracting a video frame from a negative sample video included in the sample video set as a negative sample picture, and determining negative category information corresponding to the negative sample video as the negative category information of the extracted negative sample picture; determining the set of extracted positive sample pictures and negative sample pictures as a sample picture set.

In some embodiments, the positive sample video and the negative sample video are compressed videos, the positive sample picture is a key frame extracted from the positive sample video, and the negative sample picture is a key frame extracted from the negative sample video.

In some embodiments, the positive category information and the negative category information are vectors respectively including a preset number of elements, a target element in a vector corresponding to the positive sample picture is used to characterize that the positive sample picture belongs to a corresponding picture category, a target element in a vector corresponding to the negative sample picture is used to characterize that the negative sample picture does not belong to a corresponding picture category, the target element is an element located at an element position that previously establishes a correspondence relationship with the picture category corresponding to the vector among element positions in the vector, and the picture category corresponding to the vector is a picture category corresponding to a sample picture set to which the sample picture corresponding to the vector belongs.

In some embodiments, the initial model is a convolutional neural network model, and includes a feature extraction layer and a classification layer, where the classification layer includes a preset number of weight data, and the weight data corresponds to a preset picture category and is used to determine a probability of a category to which an input picture belongs.

In some embodiments, training the initial model comprises: and fixing other weight data in the preset number of weight data except the weight data corresponding to the picture category corresponding to the sample picture set, and adjusting the weight data corresponding to the picture category corresponding to the sample picture set so as to train the initial model.

In a second aspect, an embodiment of the present disclosure provides a method for generating a category label set of a picture, the method including: acquiring a picture to be classified; inputting a picture to be classified into a pre-trained picture label model, and generating a class label set, wherein the class label corresponds to a preset picture class and is used for representing that the picture to be classified belongs to the picture class corresponding to the class label, and the picture label model is generated according to the method described in any embodiment of the first aspect.

In a third aspect, an embodiment of the present disclosure provides an apparatus for generating a picture label model, including: an obtaining unit configured to obtain at least two sample picture sets, wherein the sample picture sets correspond to preset picture categories, the sample picture sets include positive sample pictures belonging to the corresponding picture categories and negative sample pictures not belonging to the corresponding picture categories, the positive sample pictures correspond to pre-labeled positive category information, and the negative sample pictures correspond to pre-labeled negative category information; a training unit configured to select a sample picture set from at least two sample picture sets, and with the selected sample picture set, perform the following training steps: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model; determining whether at least two sample picture sets include a sample picture set that is not selected; in response to determining not to include, determining the initial model after the last training to be a picture label model.

In some embodiments, the apparatus further comprises: a selecting unit configured to, in response to determining that the at least two sample picture sets include an unselected sample picture set, reselect the sample picture set from the unselected sample picture set, and continue to perform the training step using the reselected sample picture set and the initial model after the last training.

In some embodiments, the training unit is further configured to: and fixing other weight data in the preset number of weight data except the weight data corresponding to the picture category corresponding to the sample picture set, and adjusting the weight data corresponding to the picture category corresponding to the sample picture set so as to train the initial model.

In a fourth aspect, an embodiment of the present disclosure provides an apparatus for generating a category label set of a picture, the apparatus including: an acquisition unit configured to acquire a picture to be classified; a generating unit configured to input the picture to be classified into a pre-trained picture label model, and generate a class label set, wherein the class label corresponds to a preset picture class and is used for representing that the picture to be classified belongs to a picture class corresponding to the class label, and the picture label model is generated according to the method in one of claims 1 to 7.

In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when executed by one or more processors, cause the one or more processors to implement a method as described in any of the implementations of the first or second aspects.

In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.

According to the method and the device for generating the picture label model, at least two sample picture sets are obtained, wherein the sample picture sets correspond to preset picture categories and comprise positive sample pictures and negative sample pictures, the positive sample pictures correspond to positive category information, and the negative sample pictures correspond to negative category information; then, the positive sample picture is used as input, the positive category information is used as expected output, the negative sample picture is used as input, the negative category information is used as expected output, the initial model is trained, and finally the picture label model is obtained through training.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for generating a picture label model, according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for generating a picture tag model according to an embodiment of the present disclosure;

FIG. 4 is a flow diagram of one embodiment of a method for generating a category label set for a picture, in accordance with embodiments of the present disclosure;

FIG. 5 is a schematic diagram illustrating an embodiment of an apparatus for generating a photo label model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram illustrating an embodiment of an apparatus for generating a category label set for a picture according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant disclosure and are not limiting of the disclosure. It should be noted that, for the convenience of description, only the parts relevant to the related disclosure are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 of a method for generating a picture tag model or an apparatus for generating a picture tag model to which embodiments of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include

terminal devices

101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the

terminal devices

101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the

terminal devices

101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The

terminal devices

101, 102, 103 may have installed thereon various communication client applications, such as an image processing application, a web browser application, social platform software, and the like.

The

terminal apparatuses

101, 102, and 103 may be hardware or software. When the

terminal apparatuses

101, 102, 103 are hardware, various electronic apparatuses are possible. When the

terminal apparatuses

101, 102, 103 are software, they can be installed in the above-described electronic apparatuses. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.

The server 105 may be a server providing various services, such as a background model server performing model training using a sample picture set uploaded by the

terminal devices

101, 102, 103. The background model server can perform model training by using the obtained at least two sample picture sets to generate a picture label model, and can also send the picture label model to the terminal equipment, or process the picture to be classified by using the picture label model to obtain a label of the picture to be classified.

It should be noted that the method for generating the picture tag model provided in the embodiment of the present disclosure may be executed by the server 105, or may also be executed by the

terminal devices

101, 102, and 103, and accordingly, the apparatus for generating the picture tag model may be disposed in the server 105, or may also be disposed in the

terminal devices

101, 102, and 103. Furthermore, the method for generating the category label set of the picture provided by the embodiment of the present disclosure may be executed by the server 105, and may also be executed by the

terminal devices

101, 102, and 103, and accordingly, the apparatus for generating the category label set of the picture may be disposed in the server 105, and may also be disposed in the

terminal devices

101, 102, and 103.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case that the sample picture set required for training the model is not required to be acquired from a remote place, or the picture to be classified is not required to be acquired from a remote place, the system architecture may not include a network, and only a server or a terminal device is required.

With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating a picture tag model according to the present disclosure is shown. The method for generating the picture label model comprises the following steps:

step 201, at least two sample picture sets are obtained.

In this embodiment, an executing subject (for example, a server or a terminal device shown in fig. 1) of the method for generating a picture label model may remotely obtain at least two sample picture sets or locally obtain at least two sample picture sets through a wired connection manner or a wireless connection manner. The sample picture set corresponds to a preset picture category, the sample picture set comprises a positive sample picture (namely, the positive sample picture contains an image of an object belonging to the object category indicated by the corresponding picture category) belonging to the corresponding picture category and a negative sample picture (namely, the negative sample picture does not contain an image of an object belonging to the object category indicated by the corresponding picture category), the positive sample picture corresponds to pre-labeled positive category information, and the negative sample picture corresponds to pre-labeled negative category information.

Specifically, the positive category information and the negative category information may include information in at least one of the following forms: letters, numbers, symbols, etc. For example, when a positive sample picture in a certain sample picture set includes a seaside image, the positive category information may be "seaside" and the negative category information may be "non-seaside".

In some optional implementations of this embodiment, the at least two sample picture sets may be obtained by the executing subject or other electronic device in advance according to the following steps:

first, at least two sample video sets are obtained. In particular, the executing entity or other electronic device may obtain at least two sample video sets remotely or locally. The sample video set corresponds to a preset video category, the sample video set comprises a positive sample video belonging to the corresponding video category and a negative sample video not belonging to the corresponding video category, the positive sample video corresponds to pre-labeled positive category information, and the negative sample video corresponds to pre-labeled negative category information. It should be noted that the positive sample video and the negative sample video include an image sequence including at least two images.

Then, for a sample video set of the at least two sample video sets, the executing entity or other electronic device may perform the following steps:

step one, extracting a video frame from a positive sample video included in the sample video set as a positive sample picture, and determining positive category information corresponding to the positive sample video as the positive category information of the extracted positive sample picture.

And step two, extracting a video frame from the negative sample video included in the sample video set as a negative sample picture, and determining the negative category information corresponding to the negative sample video as the negative category information of the extracted negative sample picture.

Specifically, in the first step and the second step, the execution subject or other electronic device may extract the video frame from the sample video in various manners. As an example, video frames may be extracted from a sample video at preset playing time intervals. Alternatively, the video frame specified by the extraction video frame operation performed by the technician is extracted.

In some optional implementations of the present embodiment, the positive sample video and the negative sample video are compressed (i.e., encoding the original image sequence), the positive sample picture is a key frame extracted from the positive sample video, and the negative sample picture is a key frame extracted from the negative sample video. The key frame (also called I frame) is a frame that completely retains image data in the compressed video, and when decoding the key frame, decoding can be completed only by the image data of the frame. By extracting the key frames, the efficiency of extracting the video frames from the sample video as the sample pictures can be improved. Because the similarity between key frames in one video is small, sample pictures included in the sample picture set can be enriched.

It should be noted that the method for extracting key frames from videos is a well-known technology widely studied and applied at present, and is not described herein again.

And step three, determining the set of the extracted positive sample pictures and the negative sample pictures as a sample picture set.

By performing the above-mentioned steps one to three, a sample picture set can be generated from the sample video set, and since the sample video set is predetermined, the process of generating the sample picture set can be simplified. And each frame in the sample video is correlated, so that certain similarity exists between positive sample pictures and negative samples in a sample picture set, and the accuracy of the trained picture label model for classifying the pictures can be improved.

In some optional implementation manners of this embodiment, the positive category information and the negative category information are vectors that include a preset number of elements, respectively, a target element in a vector corresponding to the positive sample picture is used to represent that the positive sample picture belongs to a corresponding picture category, and a target element in a vector corresponding to the negative sample picture is used to represent that the negative sample picture does not belong to a corresponding picture category. The target element is an element located at an element position which is in the element position in the vector and is in a correspondence relation with the picture type corresponding to the vector in advance. And the picture category corresponding to the vector is the picture category corresponding to the sample picture set to which the sample picture corresponding to the vector belongs.

As an example, assuming that the preset number is 100, for a sample picture set, the picture category corresponding to the sample picture set is a seaside category, the positive category information corresponding to the positive sample picture in the sample picture set may be a vector (1,0,0,0, …, 0), where the vector includes 100 elements, where the first element corresponds to the seaside category. Here, the numeral 1 indicates that the picture belongs to the seaside class, and the other element 0 indicates that the picture does not belong to the picture class corresponding to the element position where 0 is located. Accordingly, the negative category information may be a vector (0,0,0,0, …, 0). The other elements may have other values and are not limited to 0. The class information labeled in a vector form is usually used for training a multi-label classification model, and since one vector is used for representing whether one picture belongs to one picture class, the class information of the implementation mode can be regarded as a single label. When a certain sample picture set is used for training, a training method of a single-label model can be adopted, and the training steps are simplified.

By using the vector to represent the category information, the picture category identified by the picture label model can be flexibly expanded. For example, assume that the preset number is 100, i.e., the model can identify 100 classes of pictures at most. In practical application, only 10 picture categories need to be identified, and the 1 st to 10 th elements in the vector respectively correspond to the preset picture categories. When the image tag model is required to identify more types of images, only the image types corresponding to other elements are required to be set, so that the identification capability of the image tag model can be flexibly expanded.

Step 202, selecting a sample picture set from at least two sample picture sets, and executing the following training steps by using the selected sample picture set: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model; determining whether at least two sample picture sets include a sample picture set that is not selected; in response to determining not to include, determining the initial model after the last training to be a picture label model.

In this embodiment, the executing body may execute the following sub-steps:

step 2021, select a sample picture set from at least two sample picture sets.

Specifically, the execution subject may select the sample picture sets in various manners, such as random selection, selection in a number order of respective sample picture sets set in advance, and the like.

Next, using the selected sample picture set, the following training steps (including steps 2022-2024) are performed.

Step 2022, using a machine learning method, training the initial model by taking the positive sample pictures included in the sample picture set as input, taking the positive category information corresponding to the input positive sample pictures as expected output, taking the negative sample pictures in the sample picture set as input, and taking the negative category information corresponding to the input negative sample pictures as expected output.

Specifically, the initial model may be various types of models, such as a recurrent neural network model, a convolutional neural network model, and the like. In the process of training the initial model, actual output can be obtained for the positive sample picture or the negative sample picture input in each training. And the actual output is data actually output by the initial model and used for representing the class information. Then, the executing entity may adopt a gradient descent method, adjust parameters of the initial model based on the actual output and the expected output, use the model obtained after each parameter adjustment as the initial model for the next training, and end the training for one sample picture set when a preset end condition is met. It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the loss value calculated using a predetermined loss function (e.g., a cross entropy loss function) is less than a predetermined loss value threshold.

As one example, the initial model may include at least two bi-classification models, each corresponding to a sample picture set. For a certain binary model, the binary model may be trained based on the positive sample pictures and the negative sample pictures included in the corresponding sample picture set. And finally, determining whether the input picture belongs to the picture category corresponding to the two classification models by the trained two classification models, and if so, generating a label for representing the picture category corresponding to the two classification models. Therefore, when the trained picture label model is finally used for picture classification, at least one label used for representing the picture category can be generated, and the effect of multi-label classification is achieved.

In some optional implementation manners of this embodiment, the initial model is a convolutional neural network model, and includes a feature extraction layer and a classification layer, where the classification layer includes a preset number of weight data, and the weight data corresponds to a preset picture category and is used to determine a probability of a category to which an input picture belongs. In general, the feature extraction layer may include a convolutional layer, a pooling layer, and the like, for generating feature data of the picture, which may be used to characterize features such as color, shape, and the like of images in the picture. The classification layer comprises a full-concatenation layer, and the full concatenation is used for generating a feature vector (for example, 2048-dimensional vector) according to feature data output by the feature extraction layer. The weight data comprises a weight coefficient, the weight coefficient can be multiplied by the feature data, the weight data can also comprise a bias value, and a probability value corresponding to the weight data can be obtained by using the weight coefficient and the bias value, and the probability value is used for representing the probability that the input picture belongs to the picture category corresponding to the weight data.

In some optional implementations of this embodiment, the executing entity may train the initial model according to the following steps:

and fixing other weight data in the preset number of weight data except the weight data corresponding to the picture category corresponding to the sample picture set, and adjusting the weight data corresponding to the picture category corresponding to the sample picture set so as to train the initial model.

Specifically, for a sample picture set, the other weight data except the weight data corresponding to the sample picture set is fixed, and the weight data corresponding to the sample picture set can be adjusted by adopting a training method of a binary classification model. Thereby optimizing the weight data corresponding to the sample picture set. It should be noted that the method for training the two-class model is a well-known technique widely studied and applied at present, and is not described herein again. By the implementation mode, the weight data included by the image label model can be independent from each other, and when a sample image set is used for training, other weight data are not influenced, so that the finally obtained image label model can classify the images more accurately. Due to the fact that the multiple weight data are adopted, the finally obtained picture label model can divide multiple picture categories of pictures input into the picture label model, and the effect of multi-label classification is achieved.

Step 2023, determine whether at least two sample picture sets include an unselected sample picture set.

Step 2024, in response to determining not to include, determining the initial model after the last training to be the picture label model.

In some optional implementations of this embodiment, the executing entity may, in response to determining that the at least two sample picture sets include the unselected sample picture set, reselect the sample picture set from the unselected sample picture set, and continue to execute the training step (i.e., step 2022-step 2024) using the reselected sample picture set and the initial model after the last training. The method for reselecting the sample picture set from the unselected sample picture set may be random selection or selection according to the number order of the sample picture sets, which is not limited herein.

The picture label model obtained by training according to the above steps can be used to determine probability values of the input picture belonging to each preset picture category, and if a certain probability value is greater than or equal to a preset probability threshold, a category label used for representing that the input picture belongs to the picture category corresponding to the probability value is generated. In practical applications, the picture tag model may output a set of class tags, where each class tag corresponds to a preset picture class, and is used to represent that a picture input to the picture tag model belongs to the picture class. Therefore, the image label model obtained through training is a multi-label classification model.

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating a picture tag model according to the present embodiment. In the application scenario of fig. 3, the electronic device 301 first acquires at least two sample picture sets 302. Each sample picture set corresponds to a preset picture category, each sample picture set comprises a positive sample picture belonging to the corresponding picture category and a negative sample picture not belonging to the corresponding picture category, the positive sample pictures correspond to pre-labeled positive category information, and the negative sample pictures correspond to pre-labeled negative category information. For example, sample picture set 3021 corresponds to the picture category "seaside" and sample picture set 3022 corresponds to the picture category "hotel". The positive category information corresponding to each positive sample picture included in the sample picture set 3021 is a vector (1,0,0, …), and the negative category information corresponding to each negative sample picture included is a vector (0,0,0, …). The positive category information corresponding to each positive sample picture included in the sample picture set 3022 is a vector (0,1,0, …), and the negative category information corresponding to each negative sample picture included is a vector (0,0,0, …). Wherein each element position in the vector corresponds to a picture category.

Then, the electronic device 301 sequentially selects sample picture sets from the at least two sample picture sets 302 according to a preset numbering sequence of the sample picture sets, and performs the following training steps using the selected sample picture sets: with the machine learning method, the initial model 303 is trained with the positive sample pictures included in the sample picture set as input, the positive category information corresponding to the input positive sample pictures as expected output, the negative sample pictures in the sample picture set as input, and the negative category information corresponding to the input negative sample pictures as expected output. The initial model 303 is shown trained using a sample picture set 3021. After the initial model 303 is trained using the sample picture each time, the adjusted parameters are retained, and training is continued using other sample pictures. After each training using the sample picture sets is finished, the electronic device 301 determines whether at least two sample picture sets 302 include unselected sample picture sets, and if not, that is, all sample picture sets are used for training, determines that the initial model after the last training is the picture label model 304.

In the method provided by the above embodiment of the present disclosure, at least two sample picture sets are obtained, where the sample picture sets correspond to preset picture categories, and each sample picture set includes a positive sample picture and a negative sample picture, the positive sample picture corresponds to positive category information, and the negative sample picture corresponds to negative category information; then, the positive sample picture is used as input, the positive category information is used as expected output, the negative sample picture is used as input, the negative category information is used as expected output, the initial model is trained, and finally the picture label model is obtained through training.

With further reference to fig. 4, a flow 400 of one embodiment of a method for generating a category label set for a picture is shown. The flow 400 of the method for generating a category label set for a picture comprises the steps of:

step 401, obtaining a picture to be classified.

In this embodiment, the execution subject (such as the server or the terminal device shown in fig. 1) of the method for generating the category label set of pictures may acquire the pictures to be classified from the local or from the remote. The pictures to be classified are pictures to be classified.

And 402, inputting the picture to be classified into a pre-trained picture label model to generate a class label set.

In this embodiment, the executing body may input the picture to be classified into a pre-trained picture label model, so as to generate a class label set. The category label corresponds to a preset picture category and is used for representing that the picture to be classified belongs to the picture category corresponding to the category label.

The category labels may be various forms of labels including, but not limited to, at least one of: letters, numbers, symbols, etc.

In this embodiment, the image tag model is generated according to the method described in the embodiment corresponding to fig. 2, which may specifically refer to each step described in the embodiment corresponding to fig. 2, and is not described herein again.

In general, the generated category label set may be stored in association with the picture to be classified. For example, the category label set may be stored as attribute information of the picture to be classified into the attribute information set of the picture to be classified. Thereby increasing the comprehensiveness of the attributes characterizing the pictures to be classified. The attribute information set may include, but is not limited to, at least one of the following attribute information: name, size, generation time, etc. of the picture to be classified.

Alternatively, the generated category label set may be output in various manners, for example, the category label set is displayed on a display screen included in the execution main body. Or sending the category label set to other electronic equipment in communication connection with the execution main body.

In the method provided by the embodiment of the disclosure, the picture to be classified is classified by using the picture label model generated by the embodiment corresponding to fig. 2, so as to generate the class label set of the picture to be classified, and thus the picture label model for multi-label classification is generated by using the training of the single label sample, thereby improving the accuracy and efficiency of picture classification.

With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present disclosure provides an embodiment of an apparatus for generating a picture label model, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in fig. 5, the apparatus 500 for generating a picture label model of the present embodiment includes: an obtaining unit 501 configured to obtain at least two sample picture sets, wherein the sample picture sets correspond to preset picture categories, each sample picture set includes a positive sample picture belonging to a corresponding picture category and a negative sample picture not belonging to the corresponding picture category, the positive sample picture corresponds to pre-labeled positive category information, and the negative sample picture corresponds to pre-labeled negative category information; a training unit 502 configured to select a sample picture set from the at least two sample picture sets, and with the selected sample picture set, perform the following training steps: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model; determining whether at least two sample picture sets include a sample picture set that is not selected; in response to determining not to include, determining the initial model after the last training to be a picture label model.

In this embodiment, the obtaining unit 501 may obtain at least two sample picture sets from a remote location or from a local location through a wired connection or a wireless connection. The sample picture set corresponds to a preset picture category, the sample picture set comprises a positive sample picture (namely, the positive sample picture contains an image of an object belonging to the object category indicated by the corresponding picture category) belonging to the corresponding picture category and a negative sample picture (namely, the negative sample picture does not contain an image of an object belonging to the object category indicated by the corresponding picture category), the positive sample picture corresponds to pre-labeled positive category information, and the negative sample picture corresponds to pre-labeled negative category information.

In this embodiment, the training unit 502 may perform the following sub-steps:

step 5021, a sample picture set is selected from at least two sample picture sets.

Specifically, the training unit 502 may select the sample picture sets in various manners, such as random selection, selection according to a preset numbering sequence of each sample picture set, and the like.

Next, using the selected sample picture set, the following training steps (including steps 5022-5024) are performed.

Step 5022, by using a machine learning method, the positive sample pictures included in the sample picture set are used as input, the positive category information corresponding to the input positive sample pictures is used as expected output, the negative sample pictures in the sample picture set are used as input, the negative category information corresponding to the input negative sample pictures is used as expected output, and the initial model is trained.

Specifically, the initial model may be various types of models, such as a recurrent neural network model, a convolutional neural network model, and the like. In the process of training the initial model, actual output can be obtained for the positive sample picture or the negative sample picture input in each training. And the actual output is data actually output by the initial model and used for representing the class information. Then, the training unit 502 may adjust parameters of the initial model based on the actual output and the expected output by using a gradient descent method, use the model obtained after each parameter adjustment as the initial model for the next training, and end the training for one sample picture set when a preset end condition is satisfied. It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the loss value calculated using a predetermined loss function (e.g., a cross entropy loss function) is less than a predetermined loss value threshold.

Step 5023, determining whether at least two sample picture sets comprise unselected sample picture sets.

Step 5024, responding to the fact that the initial model after the last training is not included, and determining that the initial model after the last training is the picture label model.

In some optional implementations of this embodiment, the apparatus 500 may further include: a selection unit (not shown in the figures) configured to, in response to determining that the at least two sample picture sets comprise an unselected sample picture set, reselect the sample picture set from the unselected sample picture set, and continue to perform the training step using the reselected sample picture set and the initial model after the last training.

In some optional implementations of this embodiment, the at least two sample picture sets are obtained in advance according to the following steps: acquiring at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets comprise positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information; for a sample video set in at least two sample video sets, extracting a video frame from a positive sample video included in the sample video set as a positive sample picture, and determining positive category information corresponding to the positive sample video as the positive category information of the extracted positive sample picture; extracting a video frame from a negative sample video included in the sample video set as a negative sample picture, and determining negative category information corresponding to the negative sample video as the negative category information of the extracted negative sample picture; determining the set of extracted positive sample pictures and negative sample pictures as a sample picture set.

In some optional implementations of the embodiment, the positive sample video and the negative sample video are compressed videos, the positive sample picture is a key frame extracted from the positive sample video, and the negative sample picture is a key frame extracted from the negative sample video.

In some optional implementation manners of this embodiment, the positive category information and the negative category information are vectors including a preset number of elements, respectively, a target element in a vector corresponding to the positive sample picture is used to represent that the positive sample picture belongs to a corresponding picture category, a target element in a vector corresponding to the negative sample picture is used to represent that the negative sample picture does not belong to a corresponding picture category, the target element is an element located at an element position in the vector, where a correspondence relationship is established in advance with the picture category corresponding to the vector, and the picture category corresponding to the vector is a picture category corresponding to a sample picture set to which the sample picture corresponding to the vector belongs.

In some optional implementation manners of this embodiment, the initial model is a convolutional neural network model, and includes a feature extraction layer and a classification layer, where the classification layer includes a preset number of weight data, and the weight data corresponds to a preset picture category and is used to determine a probability of a category to which an input picture belongs.

In some optional implementations of this embodiment, the training unit 502 may be further configured to: and fixing other weight data in the preset number of weight data except the weight data corresponding to the picture category corresponding to the sample picture set, and adjusting the weight data corresponding to the picture category corresponding to the sample picture set so as to train the initial model.

The apparatus 500 provided in the foregoing embodiment of the present disclosure obtains at least two sample picture sets, where the sample picture sets correspond to preset picture categories, and each sample picture set includes a positive sample picture and a negative sample picture, where the positive sample picture corresponds to positive category information, and the negative sample picture corresponds to negative category information; then, the positive sample picture is used as input, the positive category information is used as expected output, the negative sample picture is used as input, the negative category information is used as expected output, the initial model is trained, and finally the picture label model is obtained through training.

With further reference to fig. 6, as an implementation of the method shown in fig. 4, the present disclosure provides an embodiment of an apparatus for generating a category label set of a picture, where the apparatus embodiment corresponds to the method embodiment shown in fig. 4, and the apparatus may be applied to various electronic devices in particular.

As shown in fig. 6, the apparatus 600 for generating a picture label model of the present embodiment includes:

an acquisition unit 601 configured to acquire a picture to be classified; the generating unit 602 is configured to input a picture to be classified into a pre-trained picture tag model, and generate a class tag set, where the class tag corresponds to a preset picture class and is used to represent that the picture to be classified belongs to a picture class corresponding to the class tag, and the picture tag model is generated according to the method described in the embodiment corresponding to fig. 2.

In this embodiment, the acquiring unit 601 may acquire the picture to be classified from a local place or a remote place. The pictures to be classified are pictures to be classified.

In this embodiment, the generating unit 602 may input the picture to be classified into a pre-trained picture label model, and generate a category label set. The category label corresponds to a preset picture category and is used for representing that the picture to be classified belongs to the picture category corresponding to the category label.

Alternatively, the generated category label set may be output in various manners, for example, displaying the category label set on a display screen included in the apparatus 600. Alternatively, the set of category labels may be transmitted to other electronic devices communicatively coupled to the apparatus 600.

According to the device provided by the embodiment of the disclosure, the picture to be classified is classified by using the picture label model generated by the embodiment corresponding to fig. 2, so as to generate the class label set of the picture to be classified, and thus the picture label model for multi-label classification is generated by utilizing the training of the single label sample, so that the accuracy and the efficiency of picture classification are improved.

Referring now to fig. 7, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.

It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring at least two sample picture sets, wherein the sample picture sets correspond to preset picture categories, the sample picture sets comprise positive sample pictures belonging to the corresponding picture categories and negative sample pictures not belonging to the corresponding picture categories, the positive sample pictures correspond to pre-labeled positive category information, and the negative sample pictures correspond to pre-labeled negative category information; selecting a sample picture set from at least two sample picture sets, and executing the following training steps by using the selected sample picture set: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model; determining whether at least two sample picture sets include a sample picture set that is not selected; in response to determining not to include, determining the initial model after the last training to be a picture label model.

Further, the one or more programs, when executed by the electronic device, may further cause the electronic device to: acquiring a picture to be classified; and inputting the picture to be classified into a pre-trained picture label model to generate a class label set.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + +, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a training unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, the acquisition unit may also be described as a "unit acquiring at least two sample picture sets".

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims

1. A method for generating a picture label model, comprising:

acquiring at least two sample picture sets, wherein the sample picture sets correspond to preset picture categories, the preset picture categories corresponding to the sample picture sets in the at least two sample picture sets are different, the sample picture sets comprise positive sample pictures belonging to the corresponding picture categories and negative sample pictures not belonging to the corresponding picture categories, the positive sample pictures correspond to pre-labeled positive category information, and the negative sample pictures correspond to pre-labeled negative category information;

selecting a sample picture set from the at least two sample picture sets, and executing the following training steps by using the selected sample picture set: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model, wherein the initial model comprises at least two binary models, and each binary model corresponds to one sample picture set; determining whether the at least two sample picture sets include an unselected sample picture set; in response to determining that the at least two sample picture sets do not include an unselected sample picture set, determining that the initial model after the last training is a picture label model;

in response to determining that the at least two sample picture sets include an unselected sample picture set, reselecting the sample picture set from the unselected sample picture set, and continuing to perform the training step using the reselected sample picture set and the initial model after the last training.

2. The method of claim 1, wherein the at least two sample picture sets are obtained in advance as follows:

acquiring at least two sample video sets, wherein the sample video sets correspond to preset video categories, the sample video sets comprise positive sample videos belonging to the corresponding video categories and negative sample videos not belonging to the corresponding video categories, the positive sample videos correspond to pre-labeled positive category information, and the negative sample videos correspond to pre-labeled negative category information;

for a sample video set in the at least two sample video sets, extracting a video frame from a positive sample video included in the sample video set as a positive sample picture, and determining positive category information corresponding to the positive sample video as the positive category information of the extracted positive sample picture; extracting a video frame from a negative sample video included in the sample video set as a negative sample picture, and determining negative category information corresponding to the negative sample video as the negative category information of the extracted negative sample picture; determining the set of extracted positive sample pictures and negative sample pictures as a sample picture set.

3. The method of claim 2, wherein the positive sample video and the negative sample video are compressed videos, the positive sample picture is a key frame extracted from the positive sample video, and the negative sample picture is a key frame extracted from the negative sample video.

4. The method according to claim 1, wherein the positive category information and the negative category information are vectors each including a preset number of elements, a target element in a vector corresponding to the positive sample picture is used to characterize that the positive sample picture belongs to a corresponding picture category, a target element in a vector corresponding to the negative sample picture is used to characterize that the negative sample picture does not belong to a corresponding picture category, the target element is an element located at an element position in the vector, where a correspondence relationship with the picture category corresponding to the vector is established in advance, and the picture category corresponding to the vector is a picture category corresponding to a sample picture set to which the sample picture corresponding to the vector belongs.

5. The method according to one of claims 1 to 4, wherein the initial model is a convolutional neural network model comprising a feature extraction layer and a classification layer, the classification layer comprising a preset number of weight data, the weight data corresponding to a preset picture class for determining a probability of the class to which the input picture belongs.

6. The method of claim 5, wherein the training an initial model comprises:

7. A method for generating a category label set for a picture, comprising:

acquiring a picture to be classified;

inputting the picture to be classified into a pre-trained picture label model to generate a class label set, wherein class labels correspond to preset picture classes and are used for representing that the picture to be classified belongs to the picture class corresponding to the class labels, and the picture label model is generated according to the method of one of claims 1 to 6.

8. An apparatus for generating a picture tag model, comprising:

an obtaining unit configured to obtain at least two sample picture sets, wherein the sample picture sets correspond to preset picture categories, the preset picture categories corresponding to the sample picture sets in the at least two sample picture sets are different, the sample picture sets include positive sample pictures belonging to the corresponding picture categories and negative sample pictures not belonging to the corresponding picture categories, the positive sample pictures correspond to pre-labeled positive category information, and the negative sample pictures correspond to pre-labeled negative category information;

a training unit configured to select a sample picture set from the at least two sample picture sets, and with the selected sample picture set, perform the following training steps: by utilizing a machine learning method, taking a positive sample picture included in a sample picture set as input, taking positive category information corresponding to the input positive sample picture as expected output, taking a negative sample picture in the sample picture set as input, taking negative category information corresponding to the input negative sample picture as expected output, and training an initial model, wherein the initial model comprises at least two binary models, and each binary model corresponds to one sample picture set; determining whether the at least two sample picture sets include an unselected sample picture set; in response to determining that the at least two sample picture sets do not include an unselected sample picture set, determining that the initial model after the last training is a picture label model;

a selection unit configured to, in response to determining that the at least two sample picture sets include an unselected sample picture set, reselect the sample picture set from the unselected sample picture set, and continue performing the training step using the reselected sample picture set and the initial model after the last training.

9. The apparatus of claim 8, wherein the at least two sample picture sets are obtained in advance as follows:

10. The apparatus of claim 9, wherein the positive sample video and the negative sample video are compressed videos, the positive sample picture is a key frame extracted from the positive sample video, and the negative sample picture is a key frame extracted from the negative sample video.

11. The apparatus according to claim 8, wherein the positive category information and the negative category information are vectors each including a preset number of elements, a target element in a vector corresponding to the positive sample picture is used to characterize that the positive sample picture belongs to a corresponding picture category, a target element in a vector corresponding to the negative sample picture is used to characterize that the negative sample picture does not belong to a corresponding picture category, the target element is an element located at an element position in the vector, where a correspondence relationship with the picture category corresponding to the vector is established in advance, and the picture category corresponding to the vector is a picture category corresponding to a sample picture set to which the sample picture corresponding to the vector belongs.

12. The apparatus according to one of claims 8 to 11, wherein the initial model is a convolutional neural network model comprising a feature extraction layer and a classification layer, the classification layer comprising a preset number of weight data, the weight data corresponding to a preset picture class for determining a probability of the class to which the input picture belongs.

13. The apparatus of claim 12, wherein the training unit is further configured to:

14. An apparatus for generating a category label set for a picture, comprising:

an acquisition unit configured to acquire a picture to be classified;

a generating unit, configured to input the picture to be classified into a pre-trained picture label model, and generate a class label set, where a class label corresponds to a preset picture class and is used for representing that the picture to be classified belongs to a picture class corresponding to the class label, and the picture label model is generated according to the method in one of claims 1 to 6.

15. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

16. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.