CN117173511A - Category identification method, apparatus, device, storage medium, and program product - Google Patents

Category identification method, apparatus, device, storage medium, and program product

Info

Publication number
CN117173511A
CN117173511A (application CN202311160076.6A)
Authority
CN
China
Prior art keywords
picture
text
sample set
training
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311160076.6A
Other languages
Chinese (zh)
Inventor
陈祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Pte Ltd
Original Assignee
Bigo Technology Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bigo Technology Pte Ltd filed Critical Bigo Technology Pte Ltd
Priority to CN202311160076.6A priority Critical patent/CN117173511A/en
Publication of CN117173511A publication Critical patent/CN117173511A/en
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 - INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S - SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00 - Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 - Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a category identification method, apparatus, device, storage medium and program product, wherein the method comprises the following steps: acquiring a picture sample set and a text sample set, wherein picture samples in the picture sample set and text samples in the text sample set have different association relations; training a set recognition model based on the picture sample set and the text sample set; and inputting the picture to be identified and the set category information into the trained recognition model to obtain a matching picture of the category information among the pictures to be identified. The scheme can greatly save time cost and labor cost and improve category identification efficiency.

Description

Category identification method, apparatus, device, storage medium, and program product
Technical Field
Embodiments of the present application relate to the field of computer technologies, and in particular, to a class identification method, apparatus, device, storage medium, and program product.
Background
Currently, visual recognition systems applied to large-scale image auditing are mainly realized by deep-learning-based methods. Such a method can ensure that the model has good generalization capability and practical application value only by accumulating a large data volume: positive samples of interest must be obtained from a large amount of data, and when applied to large-scale audit data such positive samples often require accumulations on the order of tens or hundreds of thousands. For a newly added category that has not yet been well defined in the training data, the related art mitigates the data shortfall by manually collecting more data of interest and training a separate visual recognition model for it, and in this way achieves recognition of the new category.
The above scheme retrains the identification model, which requires a great deal of labor and time: a large amount of labeling time is consumed, and in particular, when the proportion of positive samples of interest is extremely small (for example, less than one in a million), millions or even tens of millions of data items need to be labeled to obtain a sufficient number of positive samples. In practical application the time and labor costs are therefore extremely high, and improvement is required.
Disclosure of Invention
The embodiment of the application provides a category identification method, apparatus, device, storage medium and program product, which can greatly save time cost and labor cost and improve category identification efficiency.
In a first aspect, an embodiment of the present application provides a class identification method, including:
acquiring a picture sample set and a text sample set, wherein picture samples in the picture sample set and text samples in the text sample set have different association relations;
training a set recognition model based on the picture sample set and the text sample set;
and inputting the picture to be identified and the set category information into the identification model after training is completed so as to obtain a matching picture of the category information in the picture to be identified.
In a second aspect, an embodiment of the present application further provides a class identification device, including:
the acquisition module is configured to acquire a picture sample set and a text sample set, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations;
the training module is configured to train the set recognition model based on the picture sample set and the text sample set;
the identification module is configured to input the picture to be identified and the set category information into the identification model after training is completed so as to obtain a matching picture of the category information in the picture to be identified.
In a third aspect, an embodiment of the present application further provides a class identification device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the class identification method described in the embodiments of the present application.
In a fourth aspect, embodiments of the present application also provide a non-volatile storage medium storing computer-executable instructions that, when executed by a computer processor, are configured to perform the class identification method of embodiments of the present application.
In a fifth aspect, an embodiment of the present application further provides a computer program product including a computer program stored in a computer-readable storage medium; at least one processor of a device reads the computer program from the computer-readable storage medium and executes it, so that the device performs the category identification method according to the embodiment of the present application.
According to the embodiment of the application, a picture sample set and a text sample set are obtained, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations; a set recognition model is trained based on the picture sample set and the text sample set; and the picture to be recognized and the set category information are input into the trained recognition model to obtain the matching picture of the category information among the pictures to be recognized. In this category recognition mode, the recognition model trained on picture-and-text training data is used to determine the matching pictures corresponding to the set category information, so no separate model training is needed for a specific category; meanwhile, the scheme requires no sample labeling during the training of the recognition model, which saves a great deal of time and labor cost. The recognition accuracy of the model is high, it is closer to the service scenario, and it is more universal.
Drawings
FIG. 1 is a flow chart of a class identification method provided in an embodiment of the present application;
FIG. 2 is a flowchart of an identification model training method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a network structure in an identification model according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for model training based on a generated association according to an embodiment of the present application;
FIG. 5 is a flowchart of another category identification method according to an embodiment of the present application;
FIG. 6 is a block diagram of a class identification device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a class identification device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in further detail below with reference to the drawings and examples. It should be understood that the particular embodiments described herein are illustrative only and are not limiting of embodiments of the application. It should be further noted that, for convenience of description, only some, but not all of the structures related to the embodiments of the present application are shown in the drawings.
The terms "first", "second" and the like in the description and in the claims are used to distinguish between similar objects and do not necessarily describe a particular order or sequence. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally means that the associated objects are in an "or" relationship.
The category identification method provided by the embodiment of the application can be applied to the auditing of videos and pictures in the live broadcast industry, for example, in the application scenario of finding matching pictures for a newly added category.
Fig. 1 is a flowchart of a class identification method according to an embodiment of the present application, as shown in fig. 1, specifically including the following steps:
step S101, a picture sample set and a text sample set are obtained, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations.
In one embodiment, prior to training of the recognition model, a stored picture sample set and a stored text sample set are obtained, wherein the picture sample set comprises a plurality of pictures and the text sample set comprises a plurality of texts. The texts contained in the text sample set may be natural-language descriptive texts composed of words, sentences and the like. The picture samples in the picture sample set and the text samples in the text sample set have different association relations. Optionally, the association relation may be a dichotomous associated/non-associated relation.
In one embodiment, before the picture sample set and the text sample set are stored, the method further includes acquiring pictures and texts through a network and generating association relations between the pictures and the texts. Optionally, this may be: acquiring pictures and text description information from website information, and generating the picture sample set and the text sample set, together with the association relations between the picture samples and the text samples, based on the pictures and the text description information. The website information may be post information of a live broadcast platform, where a post contains pictures and corresponding texts; that is, through network information collection, the pictures in the posts are stored as picture samples in the picture sample set and the texts are stored as text samples in the text sample set. The association relation between a picture sample and a text sample can be generated automatically according to the specific sources of the pictures and texts: a picture and a text from the same source are associated, and a picture and a text from different sources are non-associated. The criterion for same or different sources can be set by the developer; for the post information in the website information, texts and pictures under the same post can be determined to be of the same source, and texts and pictures under different posts of different sources. With this generation mode of the picture sample set and the text sample set, no sample labeling work is needed and the sets can be generated automatically.
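The following is a minimal sketch of this automatic pair generation in Python. The post structure and the field names post_id, image and text are illustrative assumptions, not names from the application:

    import random

    def build_pairs(posts, negatives_per_post=1):
        """posts: a list of dicts such as {"post_id": ..., "image": ..., "text": ...}.
        Returns (image, text, label) triples: label 1 when the picture and the
        text come from the same post (same source), label 0 otherwise."""
        pairs = []
        for post in posts:
            # Same source -> associated (label 1); no manual labeling is needed.
            pairs.append((post["image"], post["text"], 1))
            # Different source -> non-associated (label 0).
            for _ in range(negatives_per_post):
                other = random.choice(posts)
                if other["post_id"] != post["post_id"]:
                    pairs.append((post["image"], other["text"], 0))
        return pairs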
In this sample generation mode, one starts from a batch of initial data in which each item contains at least a picture and a short text. Although such data has no explicit labels, it is extremely plentiful and common, for example posts in a live broadcast room or on a social platform, which contain pictures and short text descriptions. Using such data as samples for subsequent model training significantly improves the efficiency of obtaining samples, requires no labeling, and enables efficient recognition when labeled data is scarce. It also avoids the drawbacks of approaches that describe possible labels or entity sets in advance: predefining attribute information for labels can realize zero-shot recognition to some extent, but such techniques often assume that different labels or categories share similar underlying attribute information (for example, different birds differ in attributes such as color, head and abdomen). This assumption is too strict for a practical recognition system, and an expert is required to predefine the attribute vectors corresponding to the labels of the different categories. That design process is very time-consuming, and collecting a sufficient number of samples still requires a great deal of time and labor, so the practicality of those approaches is greatly compromised.
And step S102, training the set recognition model based on the picture sample set and the text sample set.
In one embodiment, after the picture sample set and the text sample set are obtained, the set recognition model is trained based on them. Optionally, the set recognition model includes a picture coding network and a text coding network. An optional training manner is shown in fig. 2, a flowchart of a training method for an identification model according to an embodiment of the present application, where the method includes:
and S1021, performing picture normalization processing on the picture samples in the picture sample set to obtain a standard picture, and performing text normalization processing on the texts in the text sample set to obtain a standard text.
In one embodiment, before the picture and the text are input into the picture coding network and the text coding network respectively, picture standardization processing is performed on the picture samples to obtain standard pictures, and text standardization processing is performed on the texts in the text sample set to obtain standard texts. The standard picture obtained can be a tensor matrix of a preset size, and the standard text a matrix of preset dimensions. For example, the standard picture may be a 224x224x3 tensor matrix, and the standard text a 76x768-dimensional matrix. The picture and text standardization can be performed with a set function or with programming-language code, for example using an imread(filename) function to read the picture, and implementing the text standardization by writing text-processing code in the Python programming language. Optionally, during the picture standardization processing, the RGB image information is normalized by the means and variances of its three channels and the image is converted into a 2-dimensional matrix; the image can also be converted into a 1-dimensional vector.
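A minimal sketch of the picture standardization step follows; the exact per-channel mean and variance values are not given in the application and are assumptions here:

    import numpy as np
    from PIL import Image

    CHANNEL_MEAN = np.array([0.485, 0.456, 0.406])  # assumed RGB channel means
    CHANNEL_STD = np.array([0.229, 0.224, 0.225])   # assumed RGB channel stds

    def standardize_picture(filename):
        # read and resize to the preset size, analogous to imread(filename)
        img = Image.open(filename).convert("RGB").resize((224, 224))
        x = np.asarray(img, dtype=np.float32) / 255.0   # 224 x 224 x 3
        # normalize each channel by its mean and standard deviation
        x = (x - CHANNEL_MEAN) / CHANNEL_STD
        return x  # standard picture: a 224 x 224 x 3 tensor matrix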
Step S1022, inputting the standard picture and the standard text to a picture coding network and a text coding network, respectively, to obtain a picture vector corresponding to the standard picture and a text vector corresponding to the standard text.
After the standard picture and the standard text are obtained, the standard picture is input into the picture coding network under training and the standard text into the text coding network under training, to obtain a picture vector and a text vector respectively. For example, a 512-dimensional picture vector E_I(I) ∈ R^512 and a 512-dimensional text vector E_T(T) ∈ R^512 can be obtained. Through the standardization processing, the massive collected pictures and texts can finally be aligned for subsequent model training. This makes multi-modal model training, and subsequent recognition, possible even when labeled data is lacking.
The picture coding network and the text coding network can be Transformer networks, and can also adopt a visual network structure based on an RNN or a CNN.
Optionally, the picture coding network and the text coding network of this scheme adopt a Transformer architecture and comprise a self-attention module, a residual neural network module and a forward network module. The network structure may be 12 layers of self-attention and residual neural network. As shown in fig. 3, a schematic diagram of a network structure in an identification model provided by the embodiment of the application, the structure includes a multi-head self-attention module and a short-cut residual-link normalization module for information extraction, together with a learnable forward network; these components form an Encoder-Block, whose information input and output are, for example, matrices fixed to 768 dimensions, the dimensions remaining unchanged. The Transformer network structure comprises 12 Encoder-Blocks arranged in cascade; after the cascaded modules extract the high-level semantic information in the images and texts, a fully connected layer outputs the picture vectors/text vectors.
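A sketch of one such Encoder-Block and the cascaded encoder in PyTorch follows; the 768-dimensional width, the 12-block depth and the final fully connected layer follow the description above, while the head count, feed-forward width and token pooling are assumptions:

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, dim=768, num_heads=12, ffn_dim=3072):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x):                  # x: (batch, seq, 768)
            attn_out, _ = self.attn(x, x, x)   # multi-head self-attention
            x = self.norm1(x + attn_out)       # short-cut residual link + norm
            x = self.norm2(x + self.ffn(x))    # learnable forward network + residual
            return x                           # dimensions remain 768

    class Encoder(nn.Module):
        """12 cascaded Encoder-Blocks, then a fully connected layer mapping
        the extracted semantics to the 512-dimensional picture/text vector."""
        def __init__(self, depth=12, dim=768, out_dim=512):
            super().__init__()
            self.blocks = nn.Sequential(*[EncoderBlock(dim) for _ in range(depth)])
            self.proj = nn.Linear(dim, out_dim)

        def forward(self, x):
            x = self.blocks(x)
            return self.proj(x[:, 0])  # e.g. pool the first token's features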
Step S1023, calculating the similarity between the picture vector and the text vector, and training the picture coding network and the text coding network based on the association relation between the picture sample and the text sample to obtain a training recognition model.
In one embodiment, the similarity between the picture vector and the text vector may be obtained by calculating the Euclidean distance, or by calculating the cosine distance between the picture vector and the text vector; this scheme does not limit the choice. Optionally, the recognition model is trained as shown in fig. 4, a flowchart of a method for model training based on the generated association relationship according to an embodiment of the present application, where the method includes:
step S10231, calculating the similarity between the picture vector and the text vector through the set similarity calculation formula.
Step S10232, performing loss calculation based on the similarity and the association relation between the picture sample and the text sample to obtain a loss value.
Taking the dichotomous association relation as an example, the association relation includes associated and non-associated; the sample label value corresponding to associated is 1 and to non-associated is 0. For a picture vector E_I(I) and a text vector E_T(T) with label value y, the similarity calculation result is denoted Sim(I, T), and the loss value can be calculated as:
L(Sim(I, T), y) = -y · log(Sim(I, T))
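A sketch of the similarity and loss calculation (steps S10231-S10232) in PyTorch follows. The application leaves the similarity formula open, so cosine similarity is used here; squashing it into (0, 1) with a sigmoid, and the (1 - y) term that gives non-associated pairs a gradient, are assumed completions of the formula above:

    import torch
    import torch.nn.functional as F

    def similarity(img_vec, txt_vec):
        # cosine-distance-based similarity between E_I(I) and E_T(T)
        return F.cosine_similarity(img_vec, txt_vec, dim=-1)

    def pair_loss(img_vec, txt_vec, y):
        # map the cosine into (0, 1) so the log is defined for y = 0 and y = 1;
        # the sigmoid and the (1 - y) term are assumptions of this sketch
        sim = torch.sigmoid(similarity(img_vec, txt_vec))
        return -(y * torch.log(sim) + (1 - y) * torch.log(1 - sim)).mean()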
Step S10233, adjusting network parameters of the picture coding network and the text coding network based on the loss value to obtain a trained recognition model.
Through the input of different pictures and texts and the feedback of the calculated loss values, the network parameters of the picture coding network and the text coding network are continuously adjusted; after convergence, when the optimal network parameters have been obtained, the weights are no longer updated, and the coding networks with fixed parameters are used as feature extraction models for subsequent category identification, as sketched below.
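A hedged sketch of this adjustment loop, reusing pair_loss from the sketch above; the optimizer choice and learning rate are assumptions:

    import torch

    def train(picture_encoder, text_encoder, loader, epochs=10, lr=1e-4):
        params = list(picture_encoder.parameters()) + list(text_encoder.parameters())
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for img, txt, y in loader:      # y: 1 associated, 0 non-associated
                loss = pair_loss(picture_encoder(img), text_encoder(txt), y.float())
                opt.zero_grad()
                loss.backward()
                opt.step()
        # after convergence, stop updating the weights and use the coding
        # networks with fixed parameters as feature extractors
        for p in params:
            p.requires_grad_(False)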
Step S103, inputting the picture to be identified and the set category information into the identification model after training is completed, so as to obtain a matching picture of the category information in the picture to be identified.
The picture to be identified can be a picture generated for auditing during live broadcast, such as a live screenshot, or any other picture set in which pictures matching a category are to be found. The category information may be a newly set category for which corresponding matching pictures are to be found. For example, a previously entered category may be volleyball, while the newly set category may be beach volleyball or beach football. Through the trained recognition model, a picture with a high degree of matching to the category information can be obtained as the matching picture.
According to the method, a picture sample set and a text sample set are obtained, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations; a set recognition model is trained based on the picture sample set and the text sample set; and the picture to be recognized and the set category information are input into the trained recognition model to obtain the matching picture of the category information among the pictures to be recognized. In this category recognition mode, the recognition model trained on picture-and-text training data is used to determine the matching pictures corresponding to the set category information, so no separate model training is needed for a specific category; meanwhile, the scheme requires no sample labeling during the training of the recognition model, which saves a great deal of time and labor cost. The recognition accuracy of the model is high, it is closer to the service scenario, and it is more universal.
Fig. 5 is a flowchart of another category identification method according to an embodiment of the present application; as shown in fig. 5, the method includes:
Step S201, acquiring a picture sample set and a text sample set, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations.
Step S202, training the set recognition model based on the picture sample set and the text sample set.
Step S203, inputting the picture to be identified and the set category information into the identification model after training is completed, obtaining the similarity value of the category information and each picture in the picture to be identified, and determining the picture corresponding to the similarity value meeting the set similarity condition as the matching picture of the category information.
In one embodiment, the set category information may be a brief description meeting the category requirements; no strict restrictions are imposed on the wording. The category information may identify a category that has not been well defined before. The pictures to be identified and the set category information are input into the trained identification model to obtain a picture vector for each picture to be identified and a text vector corresponding to the category information; the similarity between the text vector and each picture vector is then calculated to obtain the similarity between the category information and each picture, and the pictures whose similarity values are greater than a set threshold serve as the matching pictures corresponding to the category information, i.e. they are directly used as the identification result.
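A sketch of this matching step follows; the threshold value is an assumption, and pictures and category_text are assumed to be already standardized single-item inputs:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def match_pictures(picture_encoder, text_encoder, pictures, category_text,
                       threshold=0.25):
        txt_vec = text_encoder(category_text)        # text vector of the category
        matches = []
        for pic in pictures:
            img_vec = picture_encoder(pic)           # picture vector
            sim = F.cosine_similarity(img_vec, txt_vec, dim=-1)
            if sim.item() > threshold:               # set similarity condition
                matches.append(pic)                  # matching picture
        return matches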
According to the method, a picture sample set and a text sample set are obtained, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations; a set recognition model is trained based on the picture sample set and the text sample set; and the picture to be recognized and the set category information are input into the trained recognition model to obtain the matching picture of the category information among the pictures to be recognized. In this category recognition mode, the recognition model trained on picture-and-text training data is used to determine the matching pictures corresponding to the set category information, so no separate model training is needed for a specific category; meanwhile, the scheme requires no sample labeling during the training of the recognition model, which saves a great deal of time and labor cost. The recognition accuracy of the model is high, it is closer to the service scenario, and it is more universal.
In this category identification scheme, in a multi-modal visual recognition scenario with scarce data, a large number of images of a target category can be obtained simply by defining a text label or text description of the target data, i.e. the category information, which greatly reduces the time and labor costs of data collection and labeling. In such a system, a multi-modal model whose text and image information are semantically aligned can be obtained by utilizing the massive image-text pair data in the service scenario and used for subsequent efficient recognition. Meanwhile, a matching recognition result can be obtained from a description over the test sample set, avoiding the inefficiency of having to develop an additional algorithm model for recognition within a limited set of categories. The text description can be based on natural language or single words, which greatly reduces the refinement time for category labeling, is closer to the user's usage scenario, is more universal, and can respond quickly to dynamic requirements such as standard changes in the auditing service.
Fig. 6 is a block diagram of a class identification device according to an embodiment of the present application. As shown in fig. 6, the device is configured to execute the class identification method of the foregoing embodiments and has the functional modules and beneficial effects corresponding to that method. The apparatus specifically includes: an acquisition module 101, a training module 102, and an identification module 103, wherein,
an obtaining module 101, configured to obtain a picture sample set and a text sample set, where picture samples in the picture sample set and text samples in the text sample set have different association relations;
a training module 102 configured to train the set recognition model based on the picture sample set and the text sample set;
the recognition module 103 is configured to input the picture to be recognized and the set category information into the recognition model after training is completed, so as to obtain a matching picture of the category information in the picture to be recognized.
According to the method, a picture sample set and a text sample set are obtained, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations; a set recognition model is trained based on the picture sample set and the text sample set; and the picture to be recognized and the set category information are input into the trained recognition model to obtain the matching picture of the category information among the pictures to be recognized. In this category recognition mode, the recognition model trained on picture-and-text training data is used to determine the matching pictures corresponding to the set category information, so no separate model training is needed for a specific category; meanwhile, the scheme requires no sample labeling during the training of the recognition model, which saves a great deal of time and labor cost. The recognition accuracy of the model is high, it is closer to the service scenario, and it is more universal.
In one possible embodiment, the apparatus further comprises a sample generation module configured to:
before the picture sample set and the text sample set are acquired, acquiring pictures and text description information in website information;
and generating a picture sample set and a text sample set based on the picture and the text description information, and the association relation between the picture sample in the picture sample set and the text sample in the text sample set.
In one possible embodiment, the training module 102 is configured to:
performing picture standardization processing on the picture samples in the picture sample set to obtain a standard picture, and performing text standardization processing on the text in the text sample set to obtain a standard text;
respectively inputting the standard picture and the standard text into a picture coding network and a text coding network which are arranged to obtain a picture vector corresponding to the standard picture and a text vector corresponding to the standard text;
and calculating the similarity of the picture vector and the text vector, and training the picture coding network and the text coding network based on the association relation between the picture sample and the text sample to obtain a trained recognition model.
In one possible embodiment, the standard picture includes a tensor matrix of a preset size, and the standard text includes a matrix of a preset dimension.
In one possible embodiment, the training module 102 is configured to:
calculating the similarity between the picture vector and the text vector through a set similarity calculation formula;
performing loss calculation based on the similarity and the association relation between the picture sample and the text sample to obtain a loss value;
and adjusting network parameters of the picture coding network and the text coding network based on the loss value to obtain a trained identification model.
In one possible embodiment, the picture coding network and the text coding network include a self-attention module, a residual neural network module, and a forward network module.
In a possible embodiment, the identification module 103 is configured to:
inputting the picture to be identified and the set category information into the identification model after training is completed, and obtaining a similarity value of the category information and each picture in the picture to be identified;
and determining the picture corresponding to the similarity value meeting the set similarity condition as the matching picture of the category information.
Fig. 7 is a schematic structural diagram of a class identification device according to an embodiment of the present application. As shown in fig. 7, the device includes a processor 201, a memory 202, an input device 203, and an output device 204; the number of processors 201 in the device may be one or more, one processor 201 being taken as an example in fig. 7; the processor 201, memory 202, input device 203, and output device 204 in the apparatus may be connected by a bus or other means, a bus connection being taken as an example in fig. 7. The memory 202 is a computer readable storage medium and may be used to store software programs, computer executable programs, and modules, such as the program instructions/modules corresponding to the category identification method in the embodiment of the present application. The processor 201 implements the various functional applications and data processing of the device, i.e. the above-described category identification method, by running the software programs, instructions, and modules stored in the memory 202. The input device 203 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the apparatus. The output device 204 may include a display device such as a display screen.
The embodiments of the present application also provide a non-volatile storage medium containing computer executable instructions which, when executed by a computer processor, are adapted to carry out a class identification method as described in the above embodiments, comprising:
acquiring a picture sample set and a text sample set, wherein picture samples in the picture sample set and text samples in the text sample set have different association relations;
training a set recognition model based on the picture sample set and the text sample set;
and inputting the picture to be identified and the set category information into the identification model after training is completed so as to obtain a matching picture of the category information in the picture to be identified.
It should be noted that, in the embodiment of the category identifying device, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present application.
In some possible embodiments, aspects of the method provided by the present application may also be implemented in the form of a program product comprising program code which, when the program product runs on a computer device, causes the computer device to carry out the steps of the method according to the various exemplary embodiments of the application described in this specification; for example, the computer device may carry out the category identification method described in the examples of the present application. The program product may be implemented using any combination of one or more readable media.

Claims (11)

1. A category identification method, characterized by comprising the following steps:
acquiring a picture sample set and a text sample set, wherein picture samples in the picture sample set and text samples in the text sample set have different association relations;
training a set recognition model based on the picture sample set and the text sample set;
and inputting the picture to be identified and the set category information into the identification model after training is completed so as to obtain a matching picture of the category information in the picture to be identified.
2. The category identification method of claim 1, further comprising, prior to the acquiring the picture sample set and the text sample set:
acquiring pictures and text description information in website information;
and generating a picture sample set and a text sample set based on the picture and the text description information, and the association relation between the picture sample in the picture sample set and the text sample in the text sample set.
3. The category identification method of claim 1, wherein the training the set identification model based on the picture sample set and the text sample set includes:
performing picture standardization processing on the picture samples in the picture sample set to obtain a standard picture, and performing text standardization processing on the text in the text sample set to obtain a standard text;
respectively inputting the standard picture and the standard text into a picture coding network and a text coding network which are arranged to obtain a picture vector corresponding to the standard picture and a text vector corresponding to the standard text;
and calculating the similarity of the picture vector and the text vector, and training the picture coding network and the text coding network based on the association relation between the picture sample and the text sample to obtain a trained recognition model.
4. A category identification method as claimed in claim 3, wherein the standard picture comprises a tensor matrix of a preset size and the standard text comprises a matrix of a preset dimension.
5. The method of claim 3, wherein the calculating the similarity between the picture vector and the text vector, and training the picture coding network and the text coding network based on the association between the picture sample and the text sample to obtain the trained recognition model comprises:
calculating the similarity between the picture vector and the text vector through a set similarity calculation formula;
performing loss calculation based on the similarity and the association relation between the picture sample and the text sample to obtain a loss value;
and adjusting network parameters of the picture coding network and the text coding network based on the loss value to obtain a trained identification model.
6. The category identification method of claim 3, wherein the picture coding network and the text coding network include a self-attention module, a residual neural network module, and a forward network module.
7. The method for identifying a category according to any one of claims 1 to 6, wherein inputting the picture to be identified and the set category information into the identification model after training is completed, to obtain a matching picture of the category information in the picture to be identified, includes:
inputting the picture to be identified and the set category information into the identification model after training is completed, and obtaining a similarity value of the category information and each picture in the picture to be identified;
and determining the picture corresponding to the similarity value meeting the set similarity condition as the matching picture of the category information.
8. A category recognition device, characterized by comprising:
the acquisition module is configured to acquire a picture sample set and a text sample set, wherein the picture samples in the picture sample set and the text samples in the text sample set have different association relations;
the training module is configured to train the set recognition model based on the picture sample set and the text sample set;
the identification module is configured to input the picture to be identified and the set category information into the identification model after training is completed so as to obtain a matching picture of the category information in the picture to be identified.
9. A class identification device, the device comprising: one or more processors; storage means for storing one or more programs that when executed by the one or more processors cause the one or more processors to implement the class identification method of any of claims 1-7.
10. A non-transitory storage medium storing computer executable instructions which, when executed by a computer processor, are for performing the class identification method of any one of claims 1-7.
11. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the class identification method of any of claims 1-7.
CN202311160076.6A 2023-09-08 2023-09-08 Category identification method, apparatus, device, storage medium, and program product Pending CN117173511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311160076.6A CN117173511A (en) 2023-09-08 2023-09-08 Category identification method, apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311160076.6A CN117173511A (en) 2023-09-08 2023-09-08 Category identification method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN117173511A 2023-12-05

Family

ID=88929654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311160076.6A Pending CN117173511A (en) 2023-09-08 2023-09-08 Category identification method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN117173511A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination