CN113836992A - Method for identifying label, method, device and equipment for training label identification model - Google Patents

Publication number: CN113836992A
Authority: China (CN)
Prior art keywords: features, feature, video, label, classifier
Prior art date
Legal status: Granted
Application number: CN202110662545.9A
Other languages: Chinese (zh)
Other versions: CN113836992B (en)
Inventor
尚焱
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110662545.9A
Publication of CN113836992A
Application granted
Publication of CN113836992B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/5866 Retrieval of still image data characterised by using metadata generated manually, e.g. tags, keywords, comments, manually generated location and time information
    • G06F16/7867 Retrieval of video data characterised by using metadata generated manually, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F18/00 Pattern recognition
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/253 Fusion techniques of extracted features

Abstract

A method for identifying labels, and a method, apparatus and device for training a label identification model are provided, relating to the field of video processing for network media. The method comprises the following steps: extracting a plurality of modal features of a target image or video; performing feature fusion on the plurality of modal features to obtain a fused feature; obtaining the intermediate feature of the i-th layer classifier based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers, until the intermediate feature of the M-th layer classifier is obtained; outputting a probability distribution feature with the M-th layer classifier based on the intermediate feature of the M-th layer classifier; and determining the label of the target image or video based on the probability distribution feature. By extracting and fusing features of multiple modalities and reusing the fused feature across the different levels of the M-layer classifier, the method improves the accuracy of the classifier, that is, the accuracy of the probability distribution feature output by the M-th layer classifier, and therefore the accuracy of the identified label.

Description

Method for identifying label, method, device and equipment for training label identification model
Technical Field
The embodiments of the present application relate to the field of video processing for network media, and in particular to a method for identifying a label, and a method, apparatus and device for training a label identification model.
Background
With the advent of the fifth-generation mobile communication technology (5th Generation, 5G) and the development of mobile internet platforms, more and more videos accumulate on internet platforms, and the consumption of short videos and images has grown explosively, so that intelligent understanding of image or video content has become indispensable in every link of the visual-content pipeline. The most basic intelligent image understanding task is to attach accurate and rich labels to images or videos, so that users or downstream tasks can quickly search for them, improving retrieval quality and efficiency.
To date, methods for identifying the tags of images or videos proceed as follows: the image or video is first represented by visual features, and the visual features are then fed to a classifier for classification-label identification, which outputs the identified labels. However, representing an image or video by visual features alone may lead to insufficient feature expression and, in turn, to low label identification accuracy; moreover, the identification accuracy of the related classifiers is too low to meet the accuracy requirements of scenarios that demand fast image or video search.
Therefore, a method for identifying tags is urgently needed in the field to improve the identification accuracy and effect of tags, in particular to meet the identification accuracy required for quickly searching images or videos, thereby improving the user experience.
Disclosure of Invention
The embodiments of the present application provide a method for identifying a label, and a method, apparatus and device for training a label identification model, which can improve the identification accuracy and effect of labels, in particular meet the label identification accuracy required for quickly searching image or video scenes, and thereby improve the user experience.
In one aspect, a method of identifying a tag is provided, including:
extracting a plurality of modal characteristics of a target image or video;
performing feature fusion on the plurality of modal features to obtain a fused feature after the plurality of modal features are fused;
obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier, based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers, until the intermediate feature of the M-th layer classifier is obtained; 1 < i ≤ M, wherein the feature output by the i-th layer classifier in the M layers of classifiers is used for identifying the label of the i-th level in the M levels, and the intermediate feature of the first-layer classifier in the M layers of classifiers is obtained based on the fused feature;
based on the intermediate features of the M-th layer classifier, outputting probability distribution features by using the M-th layer classifier;
and determining the label of the target image or video based on the probability distribution characteristics.
In another aspect, a method for training a label recognition model is provided, including:
acquiring an image or video to be trained;
extracting a plurality of modal characteristics of the image or video to be trained;
performing feature fusion on the plurality of modal features to obtain a fused feature after the plurality of modal features are fused;
acquiring an annotated label of the i-th level corresponding to the image or video to be trained;
and training the i-th layer classifier by taking the fused feature, the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers and the annotated label of the i-th level as input, to obtain a label identification model, wherein 1 < i ≤ M, the first-layer classifier in the M layers of classifiers is obtained by training based on the fused feature and the annotated label of the first level, and the intermediate feature of the first-layer classifier in the M layers of classifiers is obtained based on the fused feature.
In another aspect, the present application provides a tag identification apparatus, including:
the extraction unit is used for extracting a plurality of modal characteristics of the target image or video;
the fusion unit is used for performing feature fusion on the plurality of modal features to obtain a fusion feature after the plurality of modal features are fused;
the first determining unit is used for obtaining the intermediate feature of the i-th layer classifier by using the i-th layer classifier, based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers, until the intermediate feature of the M-th layer classifier is obtained; 1 < i ≤ M, wherein the feature output by the i-th layer classifier in the M layers of classifiers is used for identifying the label of the i-th level in the M levels, and the intermediate feature of the first-layer classifier in the M layers of classifiers is obtained based on the fusion feature;
the output unit is used for outputting probability distribution characteristics by utilizing the M-th layer classifier based on the intermediate characteristics of the M-th layer classifier;
and the second determining unit is used for determining the label of the target image or video based on the probability distribution characteristic.
In another aspect, the present application provides an apparatus for training a label recognition model, including:
the device comprises a first acquisition unit, a second acquisition unit and a control unit, wherein the first acquisition unit is used for acquiring an image or a video to be trained;
the extraction unit is used for extracting a plurality of modal characteristics of the image or video to be trained;
the fusion unit is used for performing feature fusion on the plurality of modal features to obtain a fusion feature after the plurality of modal features are fused;
the second acquisition unit is used for acquiring the annotated label of the i-th level corresponding to the image or video to be trained;
and the training unit is used for training the i-th layer classifier by taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers and the annotated label of the i-th level as input, to obtain a label recognition model, wherein 1 < i ≤ M, the first-layer classifier in the M layers of classifiers is obtained by training based on the fusion feature and the annotated label of the first level, and the intermediate feature of the first-layer classifier in the M layers of classifiers is obtained based on the fusion feature.
In another aspect, an embodiment of the present application provides an electronic device, including:
a processor adapted to execute a computer program;
a computer-readable storage medium, in which a computer program is stored, which, when being executed by the processor, implements the above-mentioned method of recognizing a tag or the method of training a tag recognition model.
In another aspect, an embodiment of the present application provides a computer-readable storage medium storing computer instructions, which when read and executed by a processor of a computer device, cause the computer device to perform the above-mentioned method for identifying a tag or method for training a tag identification model.
Based on this scheme, a plurality of modal features of the target image or video are extracted and fused, so that the feature expression of the target image or video rests on all of the fused modal features. The target image or video is therefore expressed more fully, and the identification accuracy of the label can be improved.
In addition, the intermediate feature of the i-th layer classifier is obtained with the i-th layer classifier based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers, until the intermediate feature of the M-th layer classifier is obtained. This allows a later-layer classifier to reuse the intermediate features of the earlier-layer classifiers, which improves the later-layer classifier's label identification accuracy.
In other words, the M-th layer classifier takes the intermediate features of the preceding M-1 layers of classifiers into account through layer-by-layer reuse, which improves the accuracy of the probability distribution feature output by the M-th layer classifier, i.e., the identification accuracy of the M-th level label, and hence the accuracy of the label identified for the target image or video.
Moreover, because the accuracy of the label of the target image or video is improved, the identified labels can be used to search images or videos on internet platforms, further improving search quality and efficiency; meanwhile, using the identified labels for video recommendation can greatly improve the user experience of the product.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic block diagram of a system framework provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of a method for identifying a tag provided by an embodiment of the present application.
Fig. 3 is a schematic block diagram of feature fusion provided by an embodiment of the present application.
Fig. 4 is a schematic block diagram of feature multiplexing of a three-layer classifier provided in an embodiment of the present application.
Fig. 5 is a schematic flow chart of a method for training a label recognition model provided in an embodiment of the present application.
FIG. 6 is a schematic block diagram of a device for identifying a tag provided in an embodiment of the present application.
FIG. 7 is a schematic block diagram of an apparatus for training a tag recognition model provided in an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The scheme provided by the application can relate to artificial intelligence technology.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
It should be understood that artificial intelligence is a comprehensive discipline spanning a wide range of fields, covering both hardware and software technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
With the research and progress of artificial intelligence technology, it has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The embodiments of the present application may involve Computer Vision (CV) technology within artificial intelligence. Computer vision is the science of studying how to make machines "see": using cameras and computers in place of human eyes to identify, track and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiments of the present application also relates to image and video processing technology in the field of network media. Network media works differently from traditional audio and video devices: it relies on technology and equipment provided by Information Technology (IT) device developers to transmit, store and process audio and video signals. The conventional Serial Digital Interface (SDI) transmission method lacks network switching characteristics in a true sense, and a great deal of work would be required to build on SDI even part of the network functionality that Ethernet and the Internet Protocol (IP) provide. Network media technology in the video industry has therefore developed. Further, the video processing technology of network media can include the transmission, storage and processing of audio and video signals as well as text recognition for audio and video. Among these, Automatic Speech Recognition (ASR) converts human speech into text, its greatest advantage being a more natural and easy-to-use human-computer interface, and Optical Character Recognition (OCR) obtains the text in an image by analyzing the position and character type of characters in a scanned or photographed image.
It should be noted that the apparatus provided in the embodiments of the present application may be integrated in a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data and artificial intelligence platforms, and the server may be directly or indirectly connected in a wired or wireless manner, which is not limited herein.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
It should be noted that the tag identification scheme provided in the present application can be applied to any scenario that requires intelligent understanding of image or video content, such as picture and video search, recommendation and audit scenarios. In practical applications, an image or video may be described from different angles, such as the text of the image or video title, a title image expressing the main content of the image or video, a plurality of image frames describing the detailed content of the video, and the audio carried by the video. The richer the description angles used, the more accurate the representation of the image or video. The system framework provided by the embodiments of the present application is illustrated with the extraction of visual features, audio features and text features of a target image or video; of course, in other alternative embodiments, other modal features of the target image or video, for example temporal features, may also be extracted, and the present application does not specifically limit the concrete form of the plurality of modal features.
The following will describe the system framework provided in the present application in detail by taking the example of extracting the visual feature, the audio feature and the text feature of the target image or video.
Fig. 1 is a schematic block diagram of a system framework 100 provided by an embodiment of the present application.
As shown in fig. 1, the system framework 100 may include a visual feature extraction module 111, an audio feature extraction module 112, a text feature extraction module 113, a feature fusion module 120, a hierarchical classification module 130, a candidate tag processing module 140, and a custom tag processing module 150, wherein the hierarchical classification module 130 may include a multi-layer classifier 131 and a tag identification module 132.
The visual feature extraction module 111 can be used to extract visual features of a target image or video, the audio feature extraction module 112 can be used to extract audio features of the target image or video, and the text feature extraction module 113 can be used to extract text features of the target image or video; the extracted visual, audio and text features are respectively sent to the feature fusion module 120. In addition, the text feature extraction module 113 can also send the extracted text information to the custom tag processing module 150. It should be noted that the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113 may be any modules having the corresponding feature extraction functions, which the present application does not limit. For example, the visual feature extraction module 111 may perform visual feature extraction based on the residual network (ResNet) backbone of the slow-fast (SlowFast) channel video classification algorithm, the audio feature extraction module 112 may perform audio feature extraction based on the VGGish framework, and the text feature extraction module 113 may perform text feature extraction using the BERT framework, optionally supplementing the text information with OCR or ASR technology; the present application does not limit the specific way in which the plurality of modal features are extracted.
The feature fusion module 120 is configured to receive the visual, audio and text features respectively sent by the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113, perform feature fusion on them to obtain the fused feature, and send the fused feature to the hierarchical classification module 130. It should be noted that the feature fusion module 120 may be any module having a feature fusion function, which the present application does not limit; for example, it may be a Transformer framework-based feature fusion module.
The multi-layer classifier 131 in the hierarchical classification module 130 is configured to receive the fused feature sent by the feature fusion module 120. A later-layer classifier in the multi-layer classifier 131, based on the received fused feature and the intermediate feature of the earlier-layer classifier, reuses the intermediate feature of the earlier-layer classifier to improve its own accuracy, until the probability distribution feature output by the last-layer classifier is obtained. That probability distribution feature is sent to the tag identification module 132 in the hierarchical classification module 130, so that the tag identification module 132 identifies the tags of the image or video according to the probability distribution feature output by the last-layer classifier, and finally sends the obtained tags of the image or video to the candidate tag processing module 140. It should be noted that the probability distribution feature may be a distribution of length or dimension N; each position or value in the probability distribution feature corresponds to one label, and the label corresponding to the maximum value, or to a value greater than a certain threshold, can be determined as the label of the image or video. In other words, the image or video may be labeled according to the maximum value in the probability distribution feature or the values greater than a certain threshold. It should also be noted that the multi-layer classifier 131 may be any multi-layer classifier, which is not limited in this application; for example, it may be a multi-layer classifier built from multilayer perceptron (MLP) units.
The custom tag processing module 150 may be configured to receive text information of a target image or video sent by the text feature extraction module 113, perform word segmentation on the text information, match a plurality of processed words with a custom tag to obtain a first tag set, and send the first tag set to the candidate tag processing module 140.
The candidate tag processing module 140 may be configured to receive the tags of the image or video sent by the tag identification module 132 in the hierarchical classification module 130 and the first tag set sent by the custom tag processing module 150, and supplement or deduplicate the tags of the image or video based on the received first tag set to obtain a final tag of the target image or video.
As can be seen from the above, first, the visual feature extraction module 111, the audio feature extraction module 112 and the text feature extraction module 113 respectively extract the visual, audio and text features of the target image or video, and the feature fusion module 120 fuses these features, which makes the feature expression of the target image or video more accurate and sufficient. Second, with the fused feature as input, the multi-layer classifier 131 in the hierarchical classification module 130 reuses the intermediate features of the different layers of classifiers level by level, improving the accuracy of the last-layer classifier, i.e., the accuracy of the probability distribution feature it outputs. Finally, the probability distribution feature output by the last-layer classifier is used as the input of the tag identification module 132, which identifies the tags of the target image or video, improving the tag identification accuracy. In addition, to further improve accuracy, the tag identification module 132 sends the obtained tags of the target image or video to the candidate tag processing module 140, and the text feature extraction module 113 sends the extracted text information to the custom tag processing module 150; the custom tag processing module 150 performs word segmentation on the received text information, matches the segmented words with tags predefined in the database to obtain a first tag set, and sends the first tag set to the candidate tag processing module 140; the candidate tag processing module 140 then uses the received first tag set to de-duplicate or supplement the received tags of the target image or video, obtaining the final tags of the target image or video and further improving their accuracy.
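For intuition, the following minimal Python sketch outlines the data flow of fig. 1 described above; the function names, signatures and the simple set-based merging of model tags and custom tags are illustrative assumptions rather than the patent's implementation.

```python
# Illustrative data flow of Fig. 1 (names and the set-based merge are assumptions).
from typing import Callable, Set


def identify_tags(video,
                  extract_visual: Callable, extract_audio: Callable, extract_text: Callable,
                  fuse: Callable, hierarchical_classify: Callable,
                  match_custom_tags: Callable) -> Set[str]:
    visual = extract_visual(video)                 # module 111
    audio = extract_audio(video)                   # module 112
    text_feat, text_info = extract_text(video)     # module 113 (features + raw text)
    fused = fuse(visual, audio, text_feat)         # module 120: feature fusion
    model_tags = hierarchical_classify(fused)      # modules 131 + 132
    custom_tags = match_custom_tags(text_info)     # module 150: segmentation + custom-tag matching
    # module 140: supplement / de-duplicate the model tags with the custom-tag set
    return set(model_tags) | set(custom_tags)
```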
It should be understood that fig. 1 is only an example of the present application and should not be construed as limiting the present application.
For ease of understanding, the relevant terms in this application are described below.
Identification of image or video tags: image or video tagging generally refers to the high-level semantic description of the content of an image or video. It is a basic task in computer vision; in the short-video era, image or video tags play an extremely important role in downstream tasks and are widely used in recommendation systems.
Multimodal: multimodal in this application refers to multimedia data, information such as text, video and speech describing the same object or object entities with the same semantics in the internet.
Character recognition technology (OCR): a text recognition technology that obtains the text in an image by analyzing the position and character type of the characters in a scanned or photographed image.
Automatic speech recognition technology (ASR): automatic speech recognition technology converts human speech into text by analyzing audio information.
SlowFast video classification algorithm: two parallel convolutional neural networks are applied to the same video segment, namely a Slow channel and a Fast channel. The Slow channel analyzes the static content in the video, while the Fast channel analyzes the dynamic content; both channels use a residual network (ResNet) model and perform convolution operations immediately after capturing a number of video frames.
The method for identifying a tag provided in the present application will be described in detail below with reference to fig. 2 to 4.
Fig. 2 is a schematic flow chart of a method 200 for identifying a tag provided by an embodiment of the present application.
It should be noted that the solutions provided in the embodiments of the present application can be implemented by any electronic device having data processing capability. For example, the electronic device may be implemented as a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data and artificial intelligence platforms, and the server may be directly or indirectly connected in a wired or wireless manner, which is not limited herein.
as shown in fig. 2, the method 200 may include some or all of the following:
s201, extracting a plurality of modal characteristics of a target image or video;
s202, performing feature fusion on the plurality of modal features to obtain fused features after the plurality of modal features are fused;
s203, obtaining the intermediate features of the i-th layer classifier by using the i-th layer classifier, based on the fused features and the intermediate features of the (i-1)-th layer classifier in the M layers of classifiers, until the intermediate features of the M-th layer classifier are obtained; 1 < i ≤ M, wherein the feature output by the i-th layer classifier in the M layers of classifiers is used for identifying the label of the i-th level in the M levels, and the intermediate feature of the first-layer classifier in the M layers of classifiers is obtained based on the fused features;
s204, based on the intermediate features of the M-th layer classifier, utilizing the M-th layer classifier to output probability distribution features;
and S205, determining the label of the target image or video based on the probability distribution characteristics.
In other words, the server extracts a plurality of modal features of the target image or video, performs feature fusion on the plurality of modal features to obtain fused features, and takes the fused features as the input of the M-layer classifier, so that the M-layer classifier uses the i-layer classifier to obtain the intermediate features of the i-layer classifier based on the fused features and the intermediate features of the i-1-th classifier in the M-layer classifier until obtaining the intermediate features of the M-layer classifier; taking the intermediate features of the M-th layer classifier as the input of the M-th layer classifier, outputting the probability distribution features of the target image or video, and determining the label of the target image or video based on the probability distribution features; wherein, i is more than 1 and less than or equal to M, the characteristics output by the ith layer classifier in the M layers of classifiers are used for identifying the label of the ith level in the M levels, and the intermediate characteristics of the first layer classifier in the M layers of classifiers are obtained based on the fusion characteristics.
Based on this scheme, a plurality of modal features of the target image or video are extracted and fused, so that the feature expression of the target image or video takes all of the fused modal features into account. The target image or video is therefore expressed more fully, and the identification accuracy of the label can be improved.
In addition, the intermediate feature of the i-th layer classifier is obtained with the i-th layer classifier based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers, until the intermediate feature of the M-th layer classifier is obtained. This allows a later-layer classifier to reuse the intermediate features of the earlier-layer classifiers, which improves the later-layer classifier's label identification accuracy.
In other words, the M-th layer classifier takes the intermediate features of the preceding M-1 layers of classifiers into account through layer-by-layer reuse, which improves the accuracy of the probability distribution feature output by the M-th layer classifier, i.e., the identification accuracy of the M-th level label, and hence the accuracy of the label identified for the target image or video.
Moreover, because the accuracy of the label of the target image or video is improved, the identified labels can be used to search images or videos on internet platforms, further improving search quality and efficiency; meanwhile, using the identified labels for video recommendation can greatly improve the user experience of the product.
It should be noted that the plurality of modal features may include, but are not limited to, visual features, audio features and text features. Ways to extract the visual features of the target image or video include, but are not limited to, extraction based on the residual network (ResNet) backbone of the slow-fast (SlowFast) channel video classification algorithm. Ways to extract the audio features of the target image or video include, but are not limited to, extraction based on the VGGish framework. Ways to extract the text features of the target image or video include, but are not limited to, extraction using the BERT framework, optionally supplementing the text information with character recognition (OCR) or automatic speech recognition (ASR) technology; for example, the audio in the target video may be separated and ASR used to obtain the spoken text in the audio, which is not limited in this application. The ways of fusing the plurality of modal features include, but are not limited to, feature fusion based on a Transformer framework, which is not specifically limited in this application. In addition, the M-layer classifier in the present application is preferably a multi-layer classifier built from multilayer perceptron (MLP) units, but may also be an M-layer classifier based on another framework, as long as a later-layer classifier can reuse the intermediate features of the earlier-layer classifiers to achieve level-by-level feature reuse; this is not specifically limited in the present application. It should be noted that the probability distribution feature may be a distribution of length or dimension N; each position or value in the probability distribution feature corresponds to one label, and the label corresponding to the maximum value, or to a value greater than a certain threshold, can be determined as the label of the image or video. In other words, the image or video may be labeled according to the maximum value in the probability distribution feature or the values greater than a certain threshold.
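As a hedged illustration of the extraction step, the sketch below uses a torchvision ResNet-50 as a stand-in frame-level visual backbone and stubs the audio and text branches with fixed VGGish-style and BERT-style dimensions; the backbones, pooling and dimensions are assumptions for illustration, not the patent's mandated choices.

```python
# Multi-modal feature extraction sketch (backbones and dimensions are assumptions).
import torch
import torchvision.models as models


def extract_visual_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, H, W) -> one 2048-d feature per frame."""
    backbone = models.resnet50(weights=None)   # stand-in for the SlowFast/ResNet branch
    backbone.fc = torch.nn.Identity()          # drop the classification head
    backbone.eval()
    with torch.no_grad():
        return backbone(frames)                # (num_frames, 2048)


# Audio (e.g. VGGish) and text (e.g. BERT, optionally supplemented by OCR/ASR text)
# extractors would plug in the same way; here they are stubbed with typical dimensions.
def extract_audio_features(num_segments: int) -> torch.Tensor:
    return torch.randn(num_segments, 128)      # VGGish-style 128-d segment embeddings (placeholder)


def extract_text_features(num_tokens: int) -> torch.Tensor:
    return torch.randn(num_tokens, 768)        # BERT-style 768-d token embeddings (placeholder)
```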
To verify the effectiveness of multiple modalities, 22,328 videos from the business were collected and manually annotated, and the data were split into a training set and a test set at a ratio of 7:1; the improvement in label identification accuracy achieved by the present method is explained with the experimental results in Table 1.
TABLE 1

Method | Highest classification error rate | Lowest classification error rate
Visual features only (Baseline) | 61.48% | 28.46%
Visual + speech features | 59.28% | 27.03%
Visual + speech + text features | 55.51% | 22.35%
As shown in Table 1, video frames are used as the visual feature input, the audio of the video as the audio feature input, and the title of the video as the text feature input in a comparative experiment, where the Baseline gives the highest and lowest classification error rates obtained with visual features alone. The experiment shows that as the number of modalities increases, the classification error rates all decrease to different degrees, indicating that multimodal information helps improve label accuracy.
The feature fusion method provided in the present application is explained in detail below, taking the extraction of the visual, audio and text features of the target image or video as an example. It should be noted that this is only an example; the plurality of modal features are not limited to visual, audio and text features and may, of course, also include other modal features such as temporal features.
Fig. 3 is a schematic block diagram of feature fusion provided by an embodiment of the present application.
As shown in fig. 3, the block diagram 300 may include a modal feature linear mapping module 301 and a modality and position encoding module 302. The linear mapping module 301 is configured to map the plurality of modal features into a plurality of first features of the same dimension, and the modality and position encoding module 302 is configured to perform modality and position encoding on the plurality of first features to obtain the fused feature.
It should be noted that the modality and position encoding module 302 may be a Transformer framework-based module.
In some embodiments of the present application, S202 may include:
mapping the modal features into first features with the same dimensionality respectively; and carrying out modal and position coding on the plurality of first features to obtain the fused feature.
As an example, the visual, audio and text features of the target image or video are respectively mapped to a plurality of first features of the same dimension, which facilitates fusing them; modality and position encoding is then performed on the plurality of first features of the same dimension, that is, the visual, audio and text features of the same dimension are fused to obtain the fused feature.
It should be noted that encoding the modality and position of the plurality of first features, that is, fusing the first features corresponding to the plurality of modalities, may be performed based on a feature fusion model of the Transformer framework or based on other feature fusion models, which is not limited in this application.
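A minimal sketch of this fusion step is given below: each modality is linearly mapped to a shared dimension (module 301), learned modality-type and position encodings are added, and a Transformer encoder mixes the resulting tokens (module 302). The dimensions, number of layers and mean pooling are assumptions for illustration.

```python
# Fusion sketch: per-modality projection + modality/position encoding + Transformer.
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    def __init__(self, dims=(2048, 128, 768), d_model=512, max_len=512):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])  # module 301
        self.modality_emb = nn.Embedding(len(dims), d_model)              # modality encoding
        self.position_emb = nn.Embedding(max_len, d_model)                # position encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)         # module 302

    def forward(self, visual, audio, text):
        tokens = []
        for m, (feat, proj) in enumerate(zip((visual, audio, text), self.proj)):
            x = proj(feat)                                    # "first features" of equal dimension
            pos = torch.arange(x.size(0), device=x.device)
            x = x + self.modality_emb(torch.tensor(m, device=x.device)) + self.position_emb(pos)
            tokens.append(x)
        seq = torch.cat(tokens, dim=0).unsqueeze(0)           # one token sequence over all modalities
        mixed = self.encoder(seq)                             # each token is corrected by the others
        return mixed.mean(dim=1).squeeze(0)                   # fused feature, shape (d_model,)


# Usage: fused = MultiModalFusion()(torch.randn(16, 2048), torch.randn(10, 128), torch.randn(32, 768))
```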
In some implementations, for a jth feature in the plurality of first features, modifying the jth feature based on other first features except the jth feature to obtain a second feature corresponding to the jth feature; and determining the fusion feature based on a plurality of second features respectively corresponding to the plurality of first features.
As an example, first, for a first feature of a visual feature map in a plurality of first features, modifying the first feature of the visual feature map based on a first feature of an audio feature map and a first feature of a text feature map except the first feature of the visual feature map to obtain a second feature corresponding to the first feature of the visual feature map; similarly, for a first feature of an audio feature map in the plurality of first features, modifying the first feature of the audio feature map based on a first feature of a visual feature map and a first feature of a text feature map except the first feature of the audio feature map to obtain a second feature corresponding to the first feature of the audio feature map; similarly, for a first feature of a text feature map in the plurality of first features, the first feature of the text feature map is corrected based on the first feature of the visual feature map and the first feature of the audio feature map except the first feature of the text feature map to obtain a second feature corresponding to the first feature of the text feature map, and then a fused feature is obtained based on the plurality of second features obtained respectively.
In the process of fusing the plurality of modal features, not only the relationship between the visual features and the audio and text features is considered, but also the relationships between the audio features and the visual and text features, and between the text features and the visual and audio features. In other words, the plurality of second features corresponding to the plurality of first features are obtained through interaction and fusion among the plurality of modal features, which improves the degree of fusion of the second features, i.e., the fusion effect, and in turn the accuracy of label identification.
It should be noted that the feature after fusion may be obtained by feature splicing, feature adding, or feature multiplying based on a plurality of second features, which is not specifically limited in this application. It should be understood that the present application is not limited to the specific form of visual, audio, or textual features. For example, the visual feature, the audio feature and the text feature may be vectors with specific dimensions or may be matrixes with specific dimensions, which is not particularly limited in the present application.
In some implementations, the weight corresponding to the jth feature is determined based on other first features except the jth feature; and determining the product of the jth feature and the weight corresponding to the jth feature as the second feature corresponding to the jth feature.
As an example, first, for a first feature of a visual feature map in a plurality of first features, determining a first weight corresponding to the first feature of the visual feature map based on the first feature of an audio feature map and the first feature of a text feature map, except the first feature of the visual feature map, and then modifying the first feature of the visual feature map by using the first weight to obtain a second feature corresponding to the first feature of the visual feature map; similarly, for a first feature of an audio feature map in the plurality of first features, first, a second weight corresponding to the first feature of the audio feature map is determined based on the first feature of the visual feature map and the first feature of the text feature map except the first feature of the audio feature map, and then, the first feature of the audio feature map is corrected by using the second weight to obtain a second feature corresponding to the first feature of the audio feature map; similarly, for a first feature of a text feature map in the plurality of first features, first, a third weight corresponding to the first feature of the text feature map is determined based on the first feature of the visual feature map and the first feature of the audio feature map, except the first feature of the text feature map, and then, the first feature of the text feature map is corrected by using the third weight to obtain a second feature corresponding to the first feature of the text feature map.
The weight corresponding to the j-th feature is determined based on the first features other than the j-th feature, and the j-th feature is then corrected with that weight. This amounts to a preliminary fusion with the other first features before the second feature corresponding to the j-th feature is determined, which improves the degree of fusion between the second feature corresponding to the j-th feature and the second features corresponding to the other first features and, accordingly, the accuracy of tag identification.
For example, the second feature corresponding to the first feature of the visual, audio or text feature map may be determined based on the following formula (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

wherein Q, K and V are the triplet vectors of the attention mechanism, and d_k represents the dimension of K in the triplet vector.
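Formula (1) is the standard scaled dot-product attention; a direct transcription into code is shown below as a generic sketch (it is not claimed to be the patent's exact fusion network).

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import math
import torch


def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # Q K^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # normalise scores into attention weights
    return weights @ V                                  # weighted sum of the values
```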
Taking the text feature and the audio feature as an example, assuming that the text feature is composed of at least one word feature vector, wherein each word feature vector has 512 dimensions, the at least one word feature vector may be represented as a matrix, i.e., a third matrix, and the third matrix may be mapped to a low-dimensional vector space, e.g., 64 dimensions, through three parameter matrices QM, KM, VM, to obtain a representation of the third matrix in the three low-dimensional vector space, i.e., Q, K, V of the third matrix. For example, the third matrix may be multiplied by QM, KM, VM, respectively, to obtain Q, K, V; assuming that the audio feature is composed of at least one audio feature vector, wherein each audio feature vector has 128 dimensions, the at least one audio feature vector may be represented as a matrix, i.e. a fourth matrix, which may be mapped to a low-dimensional vector space, e.g. 64 dimensions, by three parameter matrices QM, KM, VM, to obtain a representation of the fourth matrix in the three low-dimensional vector spaces, i.e. Q, K, V of the fourth matrix. For example, the fourth matrix may be multiplied by QM, KM, VM, respectively, to obtain Q, K, V of the fourth matrix.
Q of the third matrix is matrix-multiplied with K of the third matrix to obtain a matrix A, and Q of the fourth matrix is matrix-multiplied with K of the fourth matrix to obtain a matrix B; matrix A and matrix B are averaged to obtain a matrix A1, and A1 is scaled, for example by dividing each element by the square root of the dimension of the K vector, which prevents the inner products from becoming too large and entering a region with near-zero gradient during training.
In short, Q of the third matrix is matrix-multiplied with K of the third matrix, Q of the fourth matrix is matrix-multiplied with K of the fourth matrix, the two multiplication results are respectively normalized and then averaged to obtain the weight corresponding to the first feature of the visual feature map; the first feature of the visual feature map is then corrected with this weight to obtain the second feature corresponding to the first feature of the visual feature map.
It should be noted that "Multi-Head" Attention (Multi-Head Attention) may be used to obtain Q, K, V of the third matrix or the fourth matrix, and "Multi-Head" may refer to using multiple sets of initialization values when initializing parameter matrices QM, KM, VM.
In some embodiments of the present application, the M-layer classifier is an M-layer classifier built from multilayer perceptron (MLP) units, and S203 may include:
concatenating the fused feature with the intermediate feature output by the last hidden layer of the (i-1)-th layer classifier to obtain the spliced feature of the i-th layer classifier; and, taking the spliced feature of the i-th layer classifier as input, obtaining the intermediate feature of the i-th layer classifier with the i-th layer classifier.
The intermediate feature of the i-th layer classifier is obtained by the i-th layer classifier based on the fused feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers. In other words, the i-th layer classifier takes the intermediate features of the preceding i-1 layers of classifiers into account through layer-by-layer reuse, which improves its label identification accuracy and, ultimately, that of the M-th layer classifier, i.e., the accuracy of the label of the target image or video.
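A hedged sketch of this hierarchical structure follows: each layer is a small MLP whose input is the fused feature concatenated (spliced) with the previous layer's intermediate feature. The hidden sizes, activations and per-level label counts are illustrative assumptions.

```python
# M-layer classifier with level-by-level feature reuse (sizes are assumptions).
from typing import List

import torch
import torch.nn as nn


class HierarchicalClassifier(nn.Module):
    def __init__(self, fused_dim: int, hidden_dim: int, labels_per_level: List[int]):
        super().__init__()
        self.hidden_layers = nn.ModuleList()
        self.output_layers = nn.ModuleList()
        in_dim = fused_dim                                     # layer 1 sees only the fused feature
        for n_labels in labels_per_level:
            self.hidden_layers.append(nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU()))
            self.output_layers.append(nn.Linear(hidden_dim, n_labels))
            in_dim = fused_dim + hidden_dim                    # layer i > 1 also sees the previous intermediate feature

    def forward(self, fused: torch.Tensor) -> List[torch.Tensor]:
        logits_per_level = []
        x = fused
        for hidden, out in zip(self.hidden_layers, self.output_layers):
            intermediate = hidden(x)                           # intermediate feature of layer i
            logits_per_level.append(out(intermediate))         # pre-softmax probability distribution feature
            x = torch.cat([fused, intermediate], dim=-1)       # splice fused + intermediate for layer i+1
        return logits_per_level
```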
In some embodiments of the present application, S205 may include:
determining a first value in the probability distribution feature that is greater than a preset threshold, based on the probability distribution feature; identifying the label corresponding to the first value among at least one label; and determining the label corresponding to the first value as the label of the target image or video, wherein the number of the at least one label is equal to the dimension of the probability distribution feature.
It should be understood that the probability distribution characteristic may be a distribution of length or dimension N. And each bit or value in the probability distribution characteristics corresponds to a label, and the label corresponding to the first value which is greater than a preset threshold value in the probability distribution characteristics is determined as the label of the target image or video. In other words, the target image or video may be labeled with a label corresponding to a first value of the probability distribution characteristic greater than a preset threshold.
The preset threshold may be a range of values or a specific value. Of course, the preset thresholds corresponding to labels of different levels may also differ in part or entirely; for example, the preset threshold corresponding to an upper-level label may be greater than or equal to that of a lower-level label, e.g., a preset threshold of 8 or 9 for the label 'dog' and of 5 or 6 for the label 'husky'. Of course, the specific values mentioned above are merely examples, and the present application is not limited thereto. In addition, a value in the probability distribution feature can indicate the estimated confidence that the label corresponding to that value is a label of the target image or video. Furthermore, the labels corresponding to the positions or values of the probability distribution feature have semantic relationships, such as hypernym-hyponym relationships, similar semantic relationships, or opposite semantic relationships; for example, the labels 'dog' and 'husky' have a hypernym-hyponym relationship, the labels 'African elephant' and 'Asian elephant' have a similar semantic relationship, and the labels 'day' and 'night' have opposite semantic relationships.
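A minimal decoding sketch is shown below: every position of the probability distribution feature corresponds to one label, and labels whose value exceeds the threshold are kept. The label list, threshold value and use of a single shared threshold are assumptions for illustration.

```python
# Threshold-based label decoding (label list and threshold are assumptions).
from typing import List

import torch


def decode_labels(probs: torch.Tensor, labels: List[str], threshold: float = 0.5) -> List[str]:
    assert probs.numel() == len(labels)       # the label set and the distribution share one dimension
    keep = (probs > threshold).nonzero(as_tuple=True)[0]
    return [labels[i] for i in keep.tolist()]


# Example: decode_labels(torch.tensor([0.1, 0.8, 0.3]), ["cat", "dog", "husky"]) -> ["dog"]
```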
The following further describes the scheme in detail with reference to Fig. 4, taking a three-layer classifier as an example of the M-layer classifier. Fig. 4 is a schematic block diagram of feature multiplexing in a three-layer classifier provided in an embodiment of the present application.
As shown in Fig. 4, the block diagram includes a first-layer classifier 410, a second-layer classifier 420, and a third-layer classifier 430, where each layer classifier is built from an MLP as its unit, that is, each layer classifier includes an input layer, at least one hidden layer, and an output layer; a single hidden layer is used for illustration.
As shown in Fig. 4, the fused feature is first sent to each of the M layers of classifiers. After the input layer of the MLP in the first-layer classifier 410 receives the fused feature, the hidden layer of that MLP outputs the intermediate feature of the first-layer classifier 410. The intermediate feature of the first-layer classifier 410 is then spliced with the fused feature, and the spliced feature is sent to the second-layer classifier 420. After the input layer of the MLP in the second-layer classifier 420 receives the spliced feature, the hidden layer of that MLP outputs the intermediate feature of the second-layer classifier 420. This intermediate feature is in turn spliced with the fused feature, and the resulting spliced feature is sent to the third-layer classifier 430. After the input layer of the MLP in the third-layer classifier 430 receives the spliced feature, the output layer of that MLP outputs the probability distribution feature of the third-layer classifier 430, based on which a label of the target image or video is determined. Of course, the output layer of the MLP in each layer may output a corresponding probability distribution feature, i.e., the probability distribution feature of each layer corresponds to the labels of that level.
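A minimal PyTorch-style sketch of this three-layer cascade is given below. The hidden sizes, the number of labels per level, the ReLU activation, and the sigmoid outputs for multi-label prediction are assumptions made for illustration, not details taken from the embodiment.

```python
import torch
import torch.nn as nn

class CascadedMLPClassifiers(nn.Module):
    """Three MLP classifiers; each later level re-uses the fused feature spliced
    with the previous level's intermediate (hidden) feature, as in Fig. 4."""

    def __init__(self, fused_dim=1024, hidden_dim=256, num_labels=(20, 100, 500)):
        super().__init__()
        self.hidden1 = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        self.head1 = nn.Linear(hidden_dim, num_labels[0])
        self.hidden2 = nn.Sequential(nn.Linear(fused_dim + hidden_dim, hidden_dim), nn.ReLU())
        self.head2 = nn.Linear(hidden_dim, num_labels[1])
        self.hidden3 = nn.Sequential(nn.Linear(fused_dim + hidden_dim, hidden_dim), nn.ReLU())
        self.head3 = nn.Linear(hidden_dim, num_labels[2])

    def forward(self, fused):
        h1 = self.hidden1(fused)                       # intermediate feature, level 1
        h2 = self.hidden2(torch.cat([fused, h1], -1))  # splice fused + level-1 feature
        h3 = self.hidden3(torch.cat([fused, h2], -1))  # splice fused + level-2 feature
        # every level may emit its own probability distribution (multi-label: sigmoid)
        return [torch.sigmoid(self.head1(h1)),
                torch.sigmoid(self.head2(h2)),
                torch.sigmoid(self.head3(h3))]

probs1, probs2, probs3 = CascadedMLPClassifiers()(torch.randn(4, 1024))
print(probs1.shape, probs2.shape, probs3.shape)  # (4, 20) (4, 100) (4, 500)
```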
By designing the multi-layer classifier with MLPs as its units, the feature expression corresponding to upper-level labels is effectively utilized, that is, features are multiplexed across different levels, which improves the classification performance of the multi-layer classifier and the accuracy of label recognition.
For ease of understanding, the MLP is described below.
A multilayer perceptron (MLP), sometimes loosely referred to as an artificial neural network (ANN), may have multiple hidden layers between the input layer and the output layer; the simplest MLP has only one hidden layer, i.e., a three-layer structure. It should be noted that the number of hidden layers in an MLP is not fixed, so an appropriate number of hidden layers can be chosen according to actual needs, and there is likewise no limit on the number of output-layer neurons. The layers of a multilayer perceptron are fully connected, meaning that any neuron in one layer is connected to all neurons in the next layer. For the input layer, assuming the input is an n-dimensional vector, there are n neurons. Denoting the input layer by the vector X, the output of the hidden layer is f(W1X + b1), where W1 is the weight (also called the connection coefficient), b1 is the bias, and the function f can be a commonly used sigmoid or tanh function. The mapping from the hidden layer to the output layer can be regarded as multi-class logistic regression, i.e., softmax regression, so the output of the output layer is softmax(W2X1 + b2), where X1 denotes the hidden-layer output f(W1X + b1). Assuming the softmax function outputs a k-dimensional column vector, the number in each dimension represents the probability of the corresponding class.
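The two formulas above can be checked with a tiny numerical sketch; the toy sizes, the choice of tanh as f, and the random weights are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, h, k = 4, 8, 3                  # input dim, hidden units, output classes (toy sizes)
W1, b1 = rng.normal(size=(h, n)), np.zeros(h)
W2, b2 = rng.normal(size=(k, h)), np.zeros(k)

X = rng.normal(size=n)             # n-dimensional input vector
X1 = np.tanh(W1 @ X + b1)          # hidden-layer output f(W1X + b1), here f = tanh
y = softmax(W2 @ X1 + b2)          # output layer: softmax(W2X1 + b2)
print(y, y.sum())                  # k class probabilities summing to 1
```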
An MLP is typically trained in mini-batches (Mini-Batch), where a mini-batch is a subset randomly selected from the full training data set T. Assuming the training data set T contains N samples and the batch size of each mini-batch is b, the entire training data can be divided into N/b mini-batches. When the model is trained with stochastic gradient descent (SGD), processing one mini-batch is called completing one step of training, and after N/b steps the whole training data has completed one round of training, called one epoch. After an epoch is completed, the training data is randomly shuffled so that its order is disrupted, the above steps are repeated, and training of the next epoch begins; multiple epochs together constitute the complete and sufficient training of the model.
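A generic version of this mini-batch/epoch loop might look as follows; `model_step` is a placeholder callback standing in for whatever per-batch update (for example, one SGD step) the model actually performs.

```python
import numpy as np

def train(model_step, data, batch_size=32, epochs=5, seed=0):
    """Mini-Batch / epoch loop: shuffle once per epoch, then run about N/b steps."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for epoch in range(epochs):
        order = rng.permutation(n)               # randomly shuffle the training data
        for start in range(0, n, batch_size):    # about N/b steps per epoch
            batch = [data[i] for i in order[start:start + batch_size]]
            model_step(batch)                    # one training step on one Mini-Batch
```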
In some embodiments of the present application, the method 200 may further comprise:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following items: the text of the target image or video, the title of the target image or video, and the annotation text of the target image or video; and based on the text information, supplementing or removing the label of the target image or video to obtain a final label of the target image or video.
In this embodiment, the label of the target image or video is supplemented or deduplicated based on the text information of the target image or video to obtain the final label of the target image or video, so that the accuracy of the label of the target image or video can be further improved.
In one implementation, the text information is segmented to obtain a plurality of segmented words corresponding to the target image or video; matching the multiple word segmentations with a user-defined dictionary to obtain a first label set of the target image or video; and based on the first label set, the labels of the target image or video are supplemented or deduplicated.
For example, the text information may first be recognized and word-segmented according to a knowledge graph to obtain a plurality of entities of the text information; then, label matching is performed on each entity using the user-defined dictionary to obtain the first label set; finally, the labels of the target image or video are supplemented or de-duplicated according to the labels in the first label set. Of course, the way of segmenting the text information is not particularly limited in the present application.
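As a toy sketch of this matching step, the segmenter and the user-defined dictionary below are hypothetical stand-ins (a whitespace split and a tiny hand-written dictionary); any knowledge-graph- or jieba-style segmenter could be plugged in instead.

```python
def first_label_set(text, segment, custom_dictionary):
    """Match segmented words against a user-defined dictionary that maps
    surface forms (entities) to canonical tags; return the first label set."""
    words = segment(text)
    return {custom_dictionary[w] for w in words if w in custom_dictionary}

# Hypothetical dictionary and a whitespace "segmenter" for illustration only.
custom_dictionary = {"husky": "husky", "dog": "dog", "puppy": "dog"}
print(first_label_set("a husky puppy runs", str.split, custom_dictionary))
# {'dog', 'husky'} (set order may vary)
```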
In one implementation, the tags of the target image or video are supplemented or de-duplicated using semantic relevance of the tags of the target image or video and the first set of tags, and/or a de-duplication number threshold of the tags of the target image or video and the first set of tags.
As an example, the labels of the target image or video may be supplemented or de-duplicated according to the semantic relationships between those labels and the labels in the first label set, such as a hypernym-hyponym relationship. For instance, the labels 'dog' and 'husky' have a hypernym-hyponym relationship; when supplementing or de-duplicating, the lower (more specific) label may be preferred: if the target image or video does not yet carry the lower label, it is supplemented, and if both the upper and lower labels are present, the labels are de-duplicated, so that the labels of the target image or video are more accurate.
As another example, the labels of the target image or video may be supplemented or de-duplicated using a de-duplication number threshold between the labels of the target image or video and the first label set. For example, a threshold on the number of labels of the target image or video is designed manually in advance; if the number of labels of the target image or video exceeds this threshold, the redundant labels may be removed according to the semantic-relevance rule described above.
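Putting the two rules together, a rough sketch of the supplement/de-duplicate step could look like the following; the `child_to_parent` map encoding hypernym-hyponym pairs and the simplistic way extra tags are dropped are both assumptions.

```python
def refine_tags(video_tags, first_labels, child_to_parent, max_tags):
    """Supplement tags from the first label set, drop a hypernym when its
    hyponym is present, and cap the total number of tags."""
    tags = set(video_tags) | set(first_labels)        # supplement missing tags
    for child, parent in child_to_parent.items():
        if child in tags and parent in tags:
            tags.discard(parent)                      # keep the more specific tag
    return sorted(tags)[:max_tags]                    # enforce the tag-count threshold

print(refine_tags({"dog"}, {"husky"}, {"husky": "dog"}, max_tags=5))  # ['husky']
```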
It should be understood that the above manner of supplementing or removing the tag of the target image or video is only an example of the present application, and of course, the tag of the target image or video may also be supplemented or removed in other manners, and the present application is not limited to this.
In this way, a first label set is obtained based on the text information of the target image or video, and the labels of the target image or video are supplemented or de-duplicated using the first label set; that is, after the classification accuracy of the M-layer classifier is improved, the first label set further improves the accuracy of the finally generated labels, so that the resulting labels of the target image or video have higher practical application value.
Fig. 5 is a schematic flow chart of a method 500 for training a label recognition model provided in an embodiment of the present application.
As shown in fig. 5, the method 500 may include some or all of the following:
S501, acquiring an image or video to be trained;
S502, extracting a plurality of modal features of the image or video to be trained;
S503, performing feature fusion on the plurality of modal features to obtain a fused feature after the plurality of modal features are fused;
S504, acquiring the annotation label of the i-th level corresponding to the image or video to be trained;
S505, training the i-th layer classifier by taking the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M-layer classifiers, and the annotation label of the i-th level as input, to obtain a label recognition model, where 1 < i ≤ M, the first-layer classifier in the M-layer classifiers is obtained by training based on the fusion feature and the annotation label of the first level, and the intermediate feature of the first-layer classifier in the M-layer classifiers is obtained based on the fusion feature.
Based on this scheme, the plurality of modal features of the image or video to be trained are extracted and fused, which is equivalent to expressing the image or video to be trained through the fusion of its multiple modal features; the image or video to be trained is therefore expressed more fully in terms of features, and the recognition accuracy of the model for labels can be improved.
In addition, the i-th layer classifier is trained with the fusion feature, the intermediate feature of the (i-1)-th layer classifier in the M-layer classifiers, and the annotation label of the i-th level as input to obtain the label recognition model, which is equivalent to the later-level classifier multiplexing the intermediate features of the earlier-level classifiers. On the one hand, this improves the label-recognition accuracy of the later-level classifier, i.e., the accuracy of the labels output by the model for the image or video; on the other hand, because the later-level classifier reuses the intermediate features of the earlier-level classifier, the convergence speed of the model is increased, the model training time is shortened, and the model training efficiency is improved.
In addition, since the accuracy of the labels recognized by the model is improved, searching for images or videos on an internet platform using the labels output by the label recognition model can further improve search quality and efficiency; meanwhile, using the recognized labels for video recommendation can greatly improve the user experience of the product.
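For illustration, one training step for such a multi-level model might be sketched as below, assuming each level is a multi-label problem trained with a binary cross-entropy loss summed over levels (the application does not prescribe a particular loss). `model` is any module that returns one probability tensor per level, such as the cascade sketch shown earlier, and `level_targets` is a list of multi-hot annotation tensors, one per level.

```python
import torch.nn as nn

bce = nn.BCELoss()

def training_step(model, optimizer, fused_batch, level_targets):
    """One gradient step: sum the per-level BCE losses and back-propagate."""
    optimizer.zero_grad()
    level_probs = model(fused_batch)                  # one probability tensor per level
    loss = sum(bce(p, t) for p, t in zip(level_probs, level_targets))
    loss.backward()
    optimizer.step()
    return loss.item()
```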
It should be noted that the plurality of modal features may include, but are not limited to, visual features, audio features, and text features. The visual features of the target image or video may be extracted based on, but not limited to, the residual network (ResNet) backbone in the SlowFast (slow and fast pathway) video classification algorithm; the audio features may be extracted based on, but not limited to, a VGGish framework; and the text features may be extracted based on, but not limited to, a BERT framework. The BERT framework may also be combined with optical character recognition (OCR) or automatic speech recognition (ASR) to supplement the text information, for example, by separating the audio from the target video and obtaining its spoken text with ASR. It should also be noted that the feature fusion of the multiple modal features may be, but is not limited to, feature fusion based on a Transformer framework, which is not specifically limited in the present application. In addition, the label recognition model in the present application may be a model built from multilayer perceptrons (MLP) as units, and may of course be based on other frameworks, as long as the framework allows a later-level classifier to multiplex the intermediate features of an earlier-level classifier; this is not particularly limited in the present application.
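A heavily stubbed sketch of the extraction stage is shown below: only the text branch uses a real pretrained BERT (via the Hugging Face `transformers` package), while the visual and audio branches are placeholder functions returning random vectors instead of actual SlowFast/VGGish features; the vector sizes are illustrative.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_feature(text):
    """Text branch: BERT pooled output for the title/OCR/ASR text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return bert(**inputs).pooler_output.squeeze(0)    # (768,) text vector

def visual_feature(frames):
    return torch.randn(2048)     # stand-in for a SlowFast/ResNet visual branch

def audio_feature(waveform):
    return torch.randn(128)      # stand-in for a VGGish audio branch

modal_features = [visual_feature(None), audio_feature(None), text_feature("视频标题")]
print([f.shape for f in modal_features])
```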
In some embodiments of the present application, S503 may include:
mapping the modal features into first features with the same dimensionality respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fused feature.
In some embodiments of the present application, S503 may include:
for the jth feature in the multiple first features, modifying the jth feature based on other first features except the jth feature to obtain a second feature corresponding to the jth feature;
and determining the fusion feature based on a plurality of second features respectively corresponding to the plurality of first features.
In some embodiments of the present application, S503 may include:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the jth feature and the weight corresponding to the jth feature as the second feature corresponding to the jth feature.
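The three fusion embodiments above (same-dimension mapping, modality/position coding, and attention-style re-weighting of each feature by the others) can be sketched together with a small Transformer-encoder module; the model width, number of layers, and mean-pooling of the fused tokens are illustrative assumptions rather than details of the application.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Map each modal feature to the same dimension, add a modality embedding,
    and let a Transformer encoder correct every feature using the others."""

    def __init__(self, modal_dims=(2048, 128, 768), d_model=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in modal_dims)
        self.modal_embed = nn.Embedding(len(modal_dims), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, modal_features):
        # modal_features: list of (batch, dim_m) tensors, one per modality
        tokens = torch.stack(
            [proj(feat) + self.modal_embed.weight[m]
             for m, (proj, feat) in enumerate(zip(self.proj, modal_features))],
            dim=1)                                   # (batch, num_modalities, d_model)
        fused_tokens = self.encoder(tokens)          # each token re-weighted by the others
        return fused_tokens.mean(dim=1)              # single fused feature per sample

fused = SimpleFusion()([torch.randn(4, 2048), torch.randn(4, 128), torch.randn(4, 768)])
print(fused.shape)  # torch.Size([4, 512])
```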
In some embodiments of the present application, prior to S505, the method 500 may further comprise:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following items: the text of the target image or video, the title of the target image or video, and the annotation text of the target image or video; and based on the text information, supplementing or de-duplicating the annotation label of the i-th level to obtain a final annotation label of the i-th level;
wherein, S505 may include:
and training the ith layer classifier by taking the fusion characteristics, the intermediate characteristics of the ith-1 layer classifier in the M layers of classifiers and the final label of the ith level as input so as to obtain the identification label model.
In some embodiments of the present application, the text information is segmented to obtain a plurality of segments corresponding to the target image or video; matching the multiple word segmentations with a user-defined dictionary to obtain a first label set of the target image or video; and supplementing or removing the repeated label of the ith level based on the first label set to obtain the final label of the ith level.
In some embodiments of the present application, the method 500 may further comprise:
and supplementing or de-duplicating the annotation label of the i-th level by using the semantic relevance between the annotation label of the i-th level and the first label set, and/or a de-duplication number threshold between the annotation label of the i-th level and the first label set.
It should be noted that the scheme for fusing the plurality of modal features in the method for training the label recognition model may be the same as the scheme for fusing the plurality of modal features in the method for identifying a label, and is not repeated here. Likewise, the scheme for obtaining the intermediate feature of the (i-1)-th layer classifier in the method for training the label recognition model may be the same as the scheme for obtaining the intermediate feature of the (i-1)-th layer classifier in the method for identifying a label.
The preferred embodiments of the present application have been described in detail with reference to the accompanying drawings, however, the present application is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the technical idea of the present application, and these simple modifications are all within the protection scope of the present application. For example, the various features described in the foregoing detailed description may be combined in any suitable manner without contradiction, and various combinations that may be possible are not described in this application in order to avoid unnecessary repetition. For example, various embodiments of the present application may be arbitrarily combined with each other, and the same should be considered as the disclosure of the present application as long as the concept of the present application is not violated.
It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The method provided by the embodiment of the present application is explained above, and the device provided by the embodiment of the present application is explained below.
Fig. 6 is a schematic block diagram of an apparatus 600 for identifying a tag provided in an embodiment of the present application.
As shown in fig. 6, the apparatus 600 may include:
an extraction unit 610 for extracting a plurality of modal features of a target image or video;
a fusion unit 620, configured to perform feature fusion on the multiple modal features to obtain a fusion feature obtained after the multiple modal features are fused;
a first determining unit 630, configured to obtain an intermediate feature of an i-th layer classifier by using the i-th layer classifier based on the fusion feature and the intermediate feature of an (i-1)-th layer classifier in the M-layer classifiers, until an intermediate feature of the M-th layer classifier is obtained; 1 < i ≤ M, wherein the feature output by the i-th layer classifier in the M layers of classifiers is used for identifying the label of the i-th level in the M levels, and the intermediate feature of the first-layer classifier in the M layers of classifiers is obtained based on the fusion feature;
an output unit 640, configured to output a probability distribution feature by using the mth layer classifier based on the intermediate feature of the mth layer classifier;
a second determining unit 650, configured to determine a label of the target image or video based on the probability distribution characteristic.
In some embodiments of the present application, the fusion unit 620 is configured to:
mapping the modal features into first features with the same dimensionality respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fused feature.
In some embodiments of the present application, the fusion unit 620 is configured to:
for the jth feature in the multiple first features, modifying the jth feature based on other first features except the jth feature to obtain a second feature corresponding to the jth feature;
and determining the fusion feature based on a plurality of second features respectively corresponding to the plurality of first features.
In some embodiments of the present application, the fusion unit 620 is configured to:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the jth feature and the weight corresponding to the jth feature as the second feature corresponding to the jth feature.
In some embodiments of the present application, the first determining unit 630 is configured to:
obtain the intermediate feature of the i-th layer classifier by using the i-th layer classifier based on the fusion feature and the intermediate feature of the (i-1)-th layer classifier in the M layers of classifiers, specifically including the following steps:
splicing the fusion feature and the intermediate feature output by the last hidden layer in the i-1 layer classifier to obtain the spliced feature of the i-layer classifier;
and taking the spliced features of the ith layer classifier as input, and obtaining the intermediate features of the ith layer classifier by using the ith layer classifier.
In some embodiments of the present application, the second determining unit 650 is configured to:
determining a first numerical value which is larger than a preset threshold value in the probability distribution characteristics based on the probability distribution characteristics;
identifying a label corresponding to the first value in at least one label;
and determining the label corresponding to the first numerical value as the label of the target image or video, wherein the dimension of the at least one label is equal to that of the probability distribution feature.
In some embodiments of the present application, the extraction unit 610 is further configured to:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following items: the text of the target image or video, the title of the target image or video, and the annotation text of the target image or video;
and based on the text information, supplementing or removing the label of the target image or video to obtain a final label of the target image or video.
In some embodiments of the present application, the first determining unit 630 is further configured to:
performing word segmentation on the text information to obtain a plurality of word segments corresponding to the target image or video;
matching the multiple word segmentations with a user-defined dictionary to obtain a first label set of the target image or video;
and based on the first label set, the labels of the target image or video are supplemented or deduplicated.
In some embodiments of the present application, the first determining unit 630 is further configured to:
and supplementing or removing the labels of the target image or video by utilizing the semantic correlation between the labels of the target image or video and the first label set and/or the de-duplication number threshold of the labels of the target image or video and the first label set.
Fig. 7 is a schematic block diagram of an apparatus 700 for training a tag recognition model according to an embodiment of the present application.
A first obtaining unit 710, configured to obtain an image or a video to be trained;
an extracting unit 720, configured to extract a plurality of modal features of the image or video to be trained;
a fusion unit 730, configured to perform feature fusion on the multiple modal features to obtain a fusion feature obtained after the multiple modal features are fused;
a second obtaining unit 740, configured to obtain an i-th level annotation tag corresponding to the image or video to be trained;
a training unit 750, configured to train an i-th layer classifier with the fusion feature, the intermediate feature of the i-1 th layer classifier in the M-layer classifiers, and the label of the i-th level as inputs to obtain a label recognition model, where i is greater than 1 and less than or equal to M, a first layer classifier in the M-layer classifier is obtained by training based on the fusion feature and the label of the first level, and the intermediate feature of the first layer classifier in the M-layer classifier is obtained based on the fusion feature.
In some embodiments of the present application, the fusion unit 730 is configured to:
mapping the modal features into first features with the same dimensionality respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fused feature.
In some embodiments of the present application, the fusion unit 730 is configured to:
for the jth feature in the multiple first features, modifying the jth feature based on other first features except the jth feature to obtain a second feature corresponding to the jth feature;
and determining the fusion feature based on a plurality of second features respectively corresponding to the plurality of first features.
In some embodiments of the present application, the fusion unit 730 is configured to:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the jth feature and the weight corresponding to the jth feature as the second feature corresponding to the jth feature.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following items: the text of the target image or video, the title of the target image or video, and the annotation text of the target image or video; and based on the text information, supplementing or de-duplicating the annotation label of the i-th level to obtain a final annotation label of the i-th level;
wherein, the training unit 750 may be specifically configured to:
and training the ith layer classifier by taking the fusion characteristics, the intermediate characteristics of the ith-1 layer classifier in the M layers of classifiers and the final label of the ith level as input so as to obtain the identification label model.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
performing word segmentation on the text information to obtain a plurality of word segments corresponding to the target image or video; matching the multiple word segmentations with a user-defined dictionary to obtain a first label set of the target image or video; and supplementing or removing the repeated label of the ith level based on the first label set to obtain the final label of the ith level.
In some embodiments of the present application, the first obtaining unit 710 is further configured to:
and supplementing or de-duplicating the annotation label of the i-th level by using the semantic relevance between the annotation label of the i-th level and the first label set, and/or a de-duplication number threshold between the annotation label of the i-th level and the first label set.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 600 may correspond to a corresponding main body in executing the method 200 of the embodiment of the present application, and each unit in the apparatus 700 is respectively for implementing a corresponding flow in the method 500, and is not described herein again for brevity.
It should also be understood that the units in the apparatus 600 or the apparatus 700 related to the embodiments of the present application may be respectively or entirely combined into one or several other units to form one or several other units, or some unit(s) thereof may be further split into multiple functionally smaller units to form one or several other units, which may achieve the same operation without affecting the achievement of the technical effect of the embodiments of the present application. The units are divided based on logic functions, and in practical application, the functions of one unit can be realized by a plurality of units, or the functions of a plurality of units can be realized by one unit. In other embodiments of the present application, the apparatus 600 or the apparatus 700 may also include other units, and in practical applications, the functions may also be implemented by being assisted by other units, and may be implemented by cooperation of a plurality of units. According to another embodiment of the present application, the apparatus 600 or 700 related to the embodiment of the present application may be constructed by running a computer program (including program codes) capable of executing the steps involved in the corresponding method on a general-purpose computing device including a general-purpose computer such as a Central Processing Unit (CPU), a random access storage medium (RAM), a read only storage medium (ROM), and the like, and a storage element, and implementing the method of identifying a tag or the method of training a tag identification model of the embodiment of the present application. The computer program can be loaded on a computer-readable storage medium, for example, and loaded and executed in an electronic device through the computer-readable storage medium, so as to implement the corresponding method of the embodiments of the present application.
In other words, the above-mentioned units may be implemented in hardware, may be implemented by instructions in software, and may also be implemented in a combination of hardware and software. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software in the decoding processor. Alternatively, the software may reside in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 8 is a schematic structural diagram of an electronic device 800 provided in an embodiment of the present application.
As shown in fig. 8, the electronic device 800 includes at least a processor 810 and a computer-readable storage medium 820. Wherein the processor 810 and the computer-readable storage medium 820 may be connected by a bus or other means. The computer-readable storage medium 820 is used to store a computer program 821, the computer program 821 includes computer instructions, and the processor 810 is used to execute the computer instructions stored by the computer-readable storage medium 820. The processor 810 is a computing core and a control core of the electronic device 800, which is adapted to implement one or more computer instructions, in particular to load and execute the one or more computer instructions to implement a corresponding method flow or a corresponding function.
By way of example, processor 810 may also be referred to as a Central Processing Unit (CPU). Processor 810 may include, but is not limited to: general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
By way of example, computer-readable storage medium 820 may be a high-speed RAM memory or a Non-volatile memory (Non-volatile memory), such as at least one disk memory; optionally, there may be at least one computer readable storage medium located remotely from the processor 810. In particular, computer-readable storage medium 820 includes, but is not limited to: volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In one implementation, the electronic device 800 can be the apparatus 600 for recognizing tags shown in fig. 6 or the apparatus 700 for training a tag recognition model shown in fig. 7; the computer-readable storage medium 820 has stored therein computer instructions; computer instructions stored in the computer-readable storage medium 820 are loaded and executed by the processor 810 to implement the corresponding steps in the method embodiments shown in fig. 2-5; in a specific implementation, the computer instructions in the computer-readable storage medium 820 are loaded by the processor 810 and perform corresponding steps, which are not described herein again to avoid repetition.
According to another aspect of the present application, a computer-readable storage medium (Memory) is provided, which is a Memory device in the electronic device 800 and is used for storing programs and data. Such as computer-readable storage medium 820. It is understood that the computer-readable storage medium 820 herein may include both a built-in storage medium in the electronic device 800 and, of course, an extended storage medium supported by the electronic device 800. The computer readable storage medium provides a storage space that stores an operating system of the electronic device 800. Also stored in the memory space are one or more computer instructions, which may be one or more computer programs 821 (including program code), suitable for loading and execution by the processor 810.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium, such as the computer program 821. At this time, the electronic device 800 may be a computer; the processor 810 reads the computer instructions from the computer-readable storage medium 820 and executes them, so that the computer performs the method of identifying a label or the method of training a label recognition model provided in the various optional implementations described above.
In other words, when implemented in software, the above may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions of the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless connection (e.g., infrared, radio, microwave, etc.).
Those of ordinary skill in the art will appreciate that the various illustrative elements and process steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Finally, it should be noted that the above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of identifying a tag, comprising:
extracting a plurality of modal characteristics of a target image or video;
performing feature fusion on the plurality of modal features to obtain a fused feature after the plurality of modal features are fused;
based on the fusion features and the intermediate features of the i-1 th layer classifier in the M layers of classifiers, utilizing the i-th layer classifier to obtain the intermediate features of the i-th layer classifier until the intermediate features of the M-th layer classifier are obtained; i is more than 1 and less than or equal to M, the features output by the ith layer classifier in the M layers of classifiers are used for identifying the label of the ith level in the M levels, and the intermediate features of the first layer classifier in the M layers of classifiers are obtained based on the fusion features;
based on the intermediate features of the M-th layer classifier, outputting probability distribution features by using the M-th layer classifier;
and determining the label of the target image or video based on the probability distribution characteristics.
2. The method according to claim 1, wherein said feature fusing the plurality of modal features to obtain a fused feature of the fused plurality of modal features comprises:
mapping the modal features into first features with the same dimensionality respectively;
and carrying out modal and position coding on the plurality of first features to obtain the fusion features.
3. The method of claim 2, wherein said modality and location coding said plurality of first features resulting in said fused feature comprises:
for a jth feature in the plurality of first features, modifying the jth feature based on other first features except the jth feature to obtain a second feature corresponding to the jth feature;
and determining the fusion feature based on a plurality of second features respectively corresponding to the plurality of first features.
4. The method according to claim 3, wherein the modifying the jth feature based on other first features except the jth feature to obtain a second feature corresponding to the jth feature comprises:
determining the weight corresponding to the jth feature based on other first features except the jth feature;
and determining the product of the jth feature and the weight corresponding to the jth feature as a second feature corresponding to the jth feature.
5. The method of claim 1, wherein the M-layer classifier is an M-layer classifier based on a unit of multilayer perceptron (MLP);
wherein, the obtaining of the intermediate feature of the ith layer classifier by using the ith layer classifier based on the fusion feature and the intermediate feature of the ith-1 layer classifier in the M layers of classifiers comprises:
splicing the fusion features and the intermediate features output by the last hidden layer in the i-1 th layer classifier to obtain spliced features of the i-th layer classifier;
and taking the spliced features of the ith layer of classifier as input, and obtaining the intermediate features of the ith layer of classifier by using the ith layer of classifier.
6. The method of claim 1, wherein determining the label of the target image or video based on the probability distribution features comprises:
determining a first numerical value which is larger than a preset threshold value in the probability distribution characteristics based on the probability distribution characteristics;
identifying a label corresponding to the first value in at least one label;
and determining a label corresponding to the first numerical value as a label of the target image or video, wherein the dimension of at least one label is equal to that of the probability distribution feature.
7. The method of claim 1, further comprising:
acquiring text information of the target image or video, wherein the text information comprises at least one of the following items: the text of the target image or video, the title of the target image or video, and the annotation text of the target image or video;
and supplementing or removing the label of the target image or video based on the text information to obtain the final label of the target image or video.
8. The method of claim 7, wherein the supplementing or de-duplicating the tag of the target image or video based on the text information to obtain a final tag of the target image or video comprises:
performing word segmentation on the text information to obtain a plurality of word segments corresponding to the target image or video;
matching the multiple word segmentations with a user-defined dictionary to obtain a first label set of the target image or video;
and supplementing or removing the labels of the target image or video based on the first label set.
9. The method of claim 8, wherein the supplementing or de-duplicating the tag of the target image or video based on the first set of tags comprises:
and supplementing or removing the labels of the target image or video by using the semantic correlation between the labels of the target image or video and the first label set and/or the de-duplication number threshold of the labels of the target image or video and the first label set.
10. A method of training a label recognition model, comprising:
acquiring an image or video to be trained;
extracting a plurality of modal characteristics of the image or video to be trained;
performing feature fusion on the plurality of modal features to obtain a fused feature after the plurality of modal features are fused;
acquiring an ith level of label corresponding to the image or video to be trained;
and training an ith layer classifier by taking the fusion features, the intermediate features of an i-1 th layer classifier in the M layers of classifiers and the labeling labels of the ith level as input to obtain an identification label model, wherein i is more than 1 and less than or equal to M, a first layer classifier in the M layers of classifiers is obtained by training based on the fusion features and the labeling labels of the first level, and the intermediate features of the first layer classifier in the M layers of classifiers are obtained based on the fusion features.
11. An apparatus for identifying a tag, comprising:
the extraction unit is used for extracting a plurality of modal characteristics of the target image or video;
the fusion unit is used for performing feature fusion on the plurality of modal features to obtain fused features after the plurality of modal features are fused;
the first determining unit is used for obtaining the intermediate features of the ith layer classifier by using the ith layer classifier based on the fusion features and the intermediate features of the ith-1 layer classifier in the M layers of classifiers until the intermediate features of the M layers of classifiers are obtained; i is more than 1 and less than or equal to M, the features output by the ith layer classifier in the M layers of classifiers are used for identifying the label of the ith level in the M levels, and the intermediate features of the first layer classifier in the M layers of classifiers are obtained based on the fusion features;
an output unit, configured to output a probability distribution feature by using the mth layer classifier based on the intermediate feature of the mth layer classifier;
and the second determining unit is used for determining the label of the target image or video based on the probability distribution characteristic.
12. An apparatus for training a label recognition model, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a control unit, wherein the first acquisition unit is used for acquiring an image or a video to be trained;
the extraction unit is used for extracting a plurality of modal characteristics of the image or video to be trained;
the fusion unit is used for performing feature fusion on the plurality of modal features to obtain fused features after the plurality of modal features are fused;
the second acquisition unit is used for acquiring the marking label of the ith level corresponding to the image or video to be trained;
and the training unit is used for training the ith classifier by taking the fusion features, the intermediate features of the ith-1 layer classifier in the M layers of classifiers and the labeling labels of the ith level as input so as to obtain a label recognition model, wherein i is more than 1 and less than or equal to M, the first layer classifier in the M layers of classifiers is obtained by training based on the fusion features and the labeling labels of the first level, and the intermediate features of the first layer classifier in the M layers of classifiers are obtained based on the fusion features.
13. An electronic device, comprising:
a processor adapted to execute a computer program;
a computer-readable storage medium, in which a computer program is stored which, when executed by the processor, implements the method of any one of claims 1 to 9 or the method of claim 10.
14. A computer-readable storage medium for storing a computer program which causes a computer to perform the method of any one of claims 1 to 9 or the method of claim 10.