CN114817586A - Target object classification method and device, electronic equipment and storage medium - Google Patents

Target object classification method and device, electronic equipment and storage medium

Info

Publication number
CN114817586A
Authority
CN
China
Prior art keywords
text
image
features
target object
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210467968.XA
Other languages
Chinese (zh)
Inventor
赵家玉
彭冲
程兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202210467968.XA
Publication of CN114817586A
Legal status: Pending

Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/40 - Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
                        • G06F16/45 - Clustering; Classification
                        • G06F16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                            • G06F16/483 - Retrieval using metadata automatically derived from the content
                    • G06F16/90 - Details of database functions independent of the retrieved data types
                        • G06F16/95 - Retrieval from the web
                            • G06F16/953 - Querying, e.g. by the use of web search engines
                                • G06F16/9535 - Search customisation based on user profiles and personalisation
                • G06F18/00 - Pattern recognition
                    • G06F18/20 - Analysing
                        • G06F18/25 - Fusion techniques
                            • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention provide a method and an apparatus for classifying a target object. The method includes: acquiring a text label of a text object, and extracting text features of the text object according to the text object and the text label; acquiring image features and image statistical features of the non-text objects, and extracting fusion features of the non-text objects according to the image features and the image statistical features; and predicting a classification result of the target object according to the text features, the fusion features and content statistical features of the target object. Embodiments of the invention classify a target object that comprises a plurality of non-text objects and at least one text object, place no limit on the number or type of the non-text objects in the target object, and thus broaden the range of target objects that can be classified. When classifying a target object, embodiments of the invention use not only the target object itself but also its content statistical features, which enriches the feature data available for classification and improves the classification effect.

Description

Target object classification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method and an apparatus for classifying a target object, an electronic device, and a computer-readable storage medium.
Background
User Generated Content (UGC) is content that users of internet platforms publish on those platforms. For example, UGC includes users' review content for a platform's merchants, products, and the like. Because UGC is generated by users, it reflects different users' impressions and evaluations of certain merchants or products; many users therefore learn about a merchant or product by reading its UGC.
To surface UGC with reference value, internet platforms generally classify UGC with classification algorithms or classification models and display the UGC to users according to the classification results.
In the prior art, some models or algorithms can classify UGC containing images and text, but they can only handle UGC that contains a single image together with text; when classifying UGC that contains multiple images and text, or UGC that contains video and text, the classification results are inaccurate.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed in order to provide a method, an apparatus, an electronic device and a computer-readable storage medium for classifying a target object that overcome, or at least partially solve, the above problems.
In order to solve the above problems, according to a first aspect of the embodiments of the present invention, a method for classifying a target object is disclosed, where the target object includes at least one text object and a plurality of non-text objects, and the non-text objects include image objects and/or video objects; the method includes: acquiring a text label of the text object, and extracting text features of the text object according to the text object and the text label; acquiring image features of the non-text objects and image statistical features of the non-text objects, and extracting fusion features of the non-text objects according to the image features and the image statistical features; and predicting a classification result of the target object according to the text features, the fusion features and content statistical features of the target object.
Optionally, the extracting the text features of the text object according to the text object and the text label includes: converting the text object and the text label into text tokens; and querying a pre-trained dictionary for the embedded features corresponding to the text tokens, inputting the embedded features into a preset text model, and outputting the text features.
Optionally, the acquiring the image features of the non-text objects includes: sequentially performing convolution processing and global pooling processing on the image objects and/or the frame image objects of the video object to obtain image features with the same dimensionality as the text features.
Optionally, the acquiring the image statistical features of the non-text objects includes: acquiring integer-type image statistical features and floating-point-type image statistical features of the non-text objects; where the integer-type image statistical features include embedded features corresponding to image tags of the non-text objects, and the floating-point-type image statistical features include the click-through rate.
Optionally, the extracting the fusion features of the non-text objects according to the image features and the image statistical features includes: integrating the image statistical features of each non-text object into integrated image statistical features; converting the integrated image statistical features and the image features of each non-text object into a plurality of image tokens; and inputting the plurality of image tokens into a preset Transformer-based model, and outputting the fusion features.
Optionally, the predicting the classification result of the target object according to the text features, the fusion features and the content statistical features of the target object includes: performing integration processing on the text features, the fusion features and the content statistical features, and then executing a preset operation; and inputting the features produced by the preset operation into a fully connected layer, and outputting the classification result; where the integrated text features, fusion features and content statistical features have the same dimensionality, and the preset operation includes at least one of the following operations: a concatenation operation, a multiplication operation and an attention operation.
Optionally, the content statistical features include: a text count of the text objects, a non-text count of the non-text objects, whether the target object contains a point of interest, integer-type content statistical features of the target object, and floating-point-type content statistical features of the target object.
According to a second aspect of the embodiments of the present invention, a device for classifying a target object is also disclosed, where the target object includes at least one text object and a plurality of non-text objects, and the non-text objects include image objects and/or video objects; the device includes: a text feature extraction module, configured to acquire a text label of the text object and extract text features of the text object according to the text object and the text label; a fusion feature extraction module, configured to acquire image features of the non-text objects and image statistical features of the non-text objects and extract fusion features of the non-text objects according to the image features and the image statistical features; and a classification prediction module, configured to predict a classification result of the target object according to the text features, the fusion features and content statistical features of the target object.
Optionally, the text feature extraction module includes: a text token conversion module, configured to convert the text object and the text label into text tokens; and a text feature query module, configured to query a pre-trained dictionary for the embedded features corresponding to the text tokens, input the embedded features into a preset text model, and output the text features.
Optionally, the fusion feature extraction module includes: an image feature extraction module, configured to sequentially perform convolution processing and global pooling processing on the image objects and/or the frame image objects of the video object to obtain image features with the same dimensionality as the text features.
Optionally, the fusion feature extraction module includes: an image statistical feature extraction module, configured to acquire integer-type and floating-point-type image statistical features of the non-text objects; where the integer-type image statistical features include embedded features corresponding to image tags of the non-text objects, and the floating-point-type image statistical features include the click-through rate.
Optionally, the fusion feature extraction module includes: a feature integration module, configured to integrate the image statistical features of each non-text object into integrated image statistical features; an image token conversion module, configured to convert the integrated image statistical features and the image features of each non-text object into a plurality of image tokens; and a feature fusion module, configured to input the plurality of image tokens into a preset Transformer-based model and output the fusion features.
Optionally, the classification prediction module includes: a preset operation module, configured to perform integration processing on the text features, the fusion features and the content statistical features and then execute a preset operation; and a classification module, configured to input the features produced by the preset operation into a fully connected layer and output the classification result; where the integrated text features, fusion features and content statistical features have the same dimensionality, and the preset operation includes at least one of the following operations: a concatenation operation, a multiplication operation and an attention operation.
Optionally, the content statistical features include: a text count of the text objects, a non-text count of the non-text objects, whether the target object contains a point of interest, integer-type content statistical features of the target object, and floating-point-type content statistical features of the target object.
According to a third aspect of the embodiments of the present invention, there is also disclosed an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for classifying a target object according to the first aspect when executing the computer program.
According to a fourth aspect of the embodiments of the present invention, there is also disclosed a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements a method of classifying a target object according to the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the invention has the following advantages:
according to the classification scheme of the target object provided by the embodiment of the invention, the target object to be classified can comprise at least one text object and a plurality of non-text objects. The non-text object may include an image object and/or a video object. And acquiring a text label of the text object, and extracting text characteristics of the text object according to the text object and the text label. And simultaneously, acquiring the image characteristics of the non-text object and the image statistical characteristics of the non-text object, and extracting the fusion characteristics of the non-text object according to the image characteristics and the image statistical characteristics. And then, predicting the classification result of the target object according to the text feature, the fusion feature and the content statistical feature of the target object.
The embodiment of the invention can classify the target objects comprising a plurality of non-text objects and at least one text object, does not limit the number of the non-text objects in the target objects and the types of the non-text objects, and expands the classification range of the target objects. In addition, when the target object is classified, the embodiment of the invention not only makes full use of the target object, but also can make use of the content statistical characteristics of the target object to further enrich the classified characteristic data, thereby improving the classification effect of the target object.
Drawings
FIG. 1 is a flow chart of the steps of a method of classifying a target object in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a fused feature extraction procedure according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a target object classification model architecture based on a teletext multimodal modality according to an embodiment of the present invention;
fig. 4 is a block diagram showing a structure of a classification apparatus for a target object according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of a method for classifying a target object according to an embodiment of the present invention is shown. The classification method of the target object can be applied to a classification terminal or a classification server. The target object classification method can classify a target object comprising at least one textual object and a plurality of non-textual objects. The non-text object may include an image object and/or a video object. The method for classifying the target object specifically comprises the following steps:
Step 101: acquiring a text label of a text object, and extracting text features of the text object according to the text object and the text label.
In the embodiment of the present invention, the text object may contain text content, for example a text title and a text body input by a user. The text label of the text object can be predicted by any text prediction model or text prediction algorithm; for example, the text label of a certain text object is a "hot pot" label.
In practical applications, the text features of the text object may be extracted by a text model, for example BERT (Bidirectional Encoder Representations from Transformers).
Step 102: acquiring image features of the non-text objects and image statistical features of the non-text objects, and extracting fusion features of the non-text objects according to the image features and the image statistical features.
In the embodiment of the present invention, since a plurality of non-text objects exist in the target object, image features and image statistical features are extracted for each non-text object. It should be noted that a non-text object in the embodiment of the present invention may be an image object and/or a video object. When extracting the image features and the image statistical features of a video object, the video object can be treated as a plurality of frame image objects ordered in time; that is, a video object may be decomposed into a plurality of frame image objects.
After the image features and the image statistical features of each non-text object are extracted, the image features and the image statistical features of all the non-text objects are fused to obtain the fusion features.
It should be noted that step 101 and step 102 may be executed simultaneously, or step 101 may be executed before step 102, or step 102 before step 101; the embodiment of the present invention does not specifically limit their execution order.
Step 103: predicting the classification result of the target object according to the text features, the fusion features and the content statistical features of the target object.
In the embodiment of the present invention, besides the extracted text features and fusion features, the content statistical features of the target object are also introduced. Finally, the classification result of the target object is predicted according to the text features, the fusion features and the content statistical features.
According to the classification scheme for a target object provided by the embodiments of the present invention, the target object to be classified may include at least one text object and a plurality of non-text objects, where a non-text object may be an image object and/or a video object. A text label of the text object is acquired, and text features of the text object are extracted according to the text object and the text label. Meanwhile, image features and image statistical features of the non-text objects are acquired, and fusion features of the non-text objects are extracted according to the image features and the image statistical features. Then, the classification result of the target object is predicted according to the text features, the fusion features and the content statistical features of the target object.
The embodiments of the present invention can classify a target object that includes a plurality of non-text objects and at least one text object, place no limit on the number or type of the non-text objects in the target object, and thus expand the range of target objects that can be classified. In addition, when classifying the target object, the embodiments of the present invention make full use not only of the target object itself but also of its content statistical features, which further enriches the feature data used for classification and improves the classification effect.
In a preferred embodiment of the present invention, one implementation of extracting the text features of a text object according to the text object and the text label is: converting the text object and the text label into text tokens; querying a pre-trained dictionary for the embedded features corresponding to the text tokens, inputting the embedded features into a preset text model, and outputting the text features. In practical applications, after a text object (such as the body and title of a piece of review content) and its predicted text label (such as a "hot pot" label) are converted into text tokens, the embedding feature corresponding to each token is looked up in a trainable dictionary; the embedding features are used as input to a text model (such as a BERT model), which outputs the text features of the text object.
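As a concrete illustration, this text branch could be sketched in PyTorch as below. This is a minimal sketch under stated assumptions, not the patent's implementation: TextEncoder is an invented name, a plain Transformer encoder stands in for the pre-trained BERT, and the vocabulary and hidden sizes are placeholders.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text branch: token ids -> trainable dictionary lookup ->
    preset text model -> text feature taken at the cls position."""
    def __init__(self, vocab_size=21128, hidden_size=768, num_layers=12):
        super().__init__()
        # trainable dictionary: text token id -> embedding feature
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12,
                                           batch_first=True)
        # stand-in for the preset text model (e.g., a pre-trained BERT)
        self.text_model = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), built from the text object plus its
        # predicted text label (e.g., "hot pot"), with a cls token first
        emb = self.embedding(token_ids)      # (batch, seq_len, hidden)
        hidden = self.text_model(emb)        # (batch, seq_len, hidden)
        return hidden[:, 0]                  # text feature at the cls position
```

The vector returned at position 0 plays the role of the text feature H_cls(bert) described in the model architecture below.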
In a preferred embodiment of the present invention, one implementation of acquiring the image features of the non-text objects is to sequentially perform convolution processing and global pooling processing on the image objects and/or the frame image objects of a video object, obtaining image features with the same dimensionality as the text features. In practical applications, the image objects and/or the frame image objects are passed through a convolutional neural network (such as EfficientNet) and then through global pooling to obtain image features with the same dimensionality as the text features. The image features and the text features are later merged and fed into the next neural network layer, so their dimensionalities need to be kept consistent. For example, a text feature has three dimensions: a batch dimension, a text-count dimension, and a hidden-size dimension. An image feature initially has five dimensions: a batch dimension, an image-count dimension, a width dimension, a height dimension, and a channel dimension. The embodiment of the present invention applies global pooling over the width and height dimensions of the image features, yielding image features with three dimensions: a batch dimension, an image-count dimension, and a channel dimension.
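A sketch of this convolution-plus-global-pooling step follows, again under stated assumptions: ImageEncoder is an invented name, and torchvision's EfficientNet-B0 trunk is used only as a stand-in backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ImageEncoder(nn.Module):
    """Convolution followed by global pooling, so the image features end up
    with three dimensions (batch, image count, channels) like the text
    features' (batch, text count, hidden size)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # a CNN feature extractor, e.g. EfficientNet

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, num_images, 3, H, W); frames of a video object are
        # handled exactly like individual image objects
        b, n, c, h, w = images.shape
        feat = self.backbone(images.view(b * n, c, h, w))  # (b*n, C, H', W')
        # global pooling removes the width and height dimensions
        feat = F.adaptive_avg_pool2d(feat, 1).flatten(1)   # (b*n, C)
        return feat.view(b, n, -1)                         # (batch, n, C)

# usage sketch: EfficientNet-B0's convolutional trunk as the backbone
backbone = torchvision.models.efficientnet_b0(weights=None).features
encoder = ImageEncoder(backbone)
```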
In a preferred embodiment of the present invention, one implementation of acquiring the image statistical features of a non-text object is to acquire integer-type image statistical features and floating-point-type image statistical features of the non-text object. The integer-type image statistical features include embedded features corresponding to the image tags of the non-text object; the floating-point-type image statistical features include the click-through rate. In practical applications, the image statistical features of non-text objects may include integer (int) types and floating-point (float) types. The int-type image statistical features may be embedding features obtained by looking up predicted image tags (such as a portrait tag) in a table. The float-type image statistical features may be the predicted click-through rate (CTR) of the image, and so on.
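The two kinds of image statistical features could be assembled as in the sketch below. This is a hedged illustration: the tag vocabulary size and feature width are arbitrary, ImageStatFeatures is an invented name, and joining the tag embedding with the raw CTR by concatenation is an assumption, not a detail the patent specifies.

```python
import torch
import torch.nn as nn

class ImageStatFeatures(nn.Module):
    """Sketch: int-type statistics come from an embedding-table lookup,
    float-type statistics (e.g., CTR) are used as raw values."""
    def __init__(self, num_image_tags=1000, stat_dim=64):
        super().__init__()
        # int-type: embedding features for predicted image tags (table lookup)
        self.tag_embedding = nn.Embedding(num_image_tags, stat_dim)

    def forward(self, tag_ids: torch.Tensor, ctr: torch.Tensor) -> torch.Tensor:
        # tag_ids: (batch, num_images) int64; ctr: (batch, num_images) float
        tag_emb = self.tag_embedding(tag_ids)   # (batch, n, stat_dim)
        ctr_feat = ctr.unsqueeze(-1)            # (batch, n, 1), float-type
        return torch.cat([tag_emb, ctr_feat], dim=-1)  # per-object statistics
```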
In a preferred embodiment of the present invention, referring to fig. 2, one implementation of extracting the fusion features of the non-text objects according to the image features and the image statistical features includes the following steps.
Step 201: integrating the image statistical features of each non-text object into integrated image statistical features.
In the embodiment of the present invention, the image statistical features of each non-text object may be integrated into integrated image statistical features by a merge function such as concat().
Step 202: converting the integrated image statistical features and the image features of each non-text object into a plurality of image tokens.
In the embodiment of the present invention, the integrated image statistical features and the image features of each non-text object are separately converted into a plurality of image tokens.
Step 203: inputting the plurality of image tokens into a preset Transformer-based model, and outputting the fusion features.
In the embodiment of the present invention, the image tokens are input into a Transformer model, and the fusion features are extracted from the Transformer model.
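Steps 201 to 203 could be realized, for example, by the following sketch. The module name FusionEncoder, the fixed number of images, and the use of linear projections to turn features into tokens are assumptions made for the example; the patent only specifies concat() integration, token conversion, and a Transformer-based model.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Steps 201-203 as one module, assuming a fixed number of images."""
    def __init__(self, num_images, img_dim, stat_dim, hidden=768, num_layers=4):
        super().__init__()
        # step 201: the integrated statistics of all objects become one token
        self.stat_proj = nn.Linear(num_images * stat_dim, hidden)
        # step 202: each object's image feature becomes one token
        self.img_proj = nn.Linear(img_dim, hidden)
        self.cls = nn.Parameter(torch.zeros(1, 1, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, img_feats, img_stats):
        # img_feats: (b, n, img_dim); img_stats: (b, n, stat_dim)
        b = img_feats.size(0)
        # step 201: integrate per-object statistics by concatenation (concat())
        stat_token = self.stat_proj(img_stats.reshape(b, 1, -1))  # (b, 1, hidden)
        # step 202: convert image features into image tokens
        img_tokens = self.img_proj(img_feats)                     # (b, n, hidden)
        tokens = torch.cat([self.cls.expand(b, -1, -1),
                            stat_token, img_tokens], dim=1)
        # step 203: the Transformer output at the cls position is the
        # fusion feature H_cls(transformer)
        return self.transformer(tokens)[:, 0]                     # (b, hidden)
```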
In a preferred embodiment of the present invention, one implementation of predicting the classification result of the target object according to the text features, the fusion features and the content statistical features of the target object is: performing integration processing on the text features, the fusion features and the content statistical features, and then executing a preset operation; and inputting the features produced by the preset operation into a fully connected layer, and outputting the classification result. The integrated text features, fusion features and content statistical features have the same dimensionality. The preset operation includes at least one of the following operations: a concatenation operation, a multiplication operation and an attention operation. In practical applications, the content statistical features of the target object may include, but are not limited to: the text count of the text objects, the non-text count of the non-text objects, whether the target object contains a point of interest, integer-type content statistical features of the target object, floating-point-type content statistical features of the target object, and so on.
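As an illustration, the simplest preset operation, concatenation, followed by a fully connected layer could look like the sketch below; ClassificationHead is an invented name, and a multiplication or attention operation over the three features could replace the torch.cat call.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Concatenation shown as the preset operation; multiplication or an
    attention operation over the three features could be used instead."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(dim * 3, num_classes)  # fully connected layer

    def forward(self, text_feat, fusion_feat, content_feat):
        # all three features have been integrated to the same dimension `dim`
        x = torch.cat([text_feat, fusion_feat, content_feat], dim=-1)
        return self.fc(x)  # logits for the target object's classes
```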
Based on the above description of the embodiments of the target object classification method, an image-text multimodal target object classification model architecture is introduced below. The input features of this classification model architecture mainly comprise three parts: the text features, the fusion features, and the content statistical features of the content dimension.
Referring to fig. 3, a schematic structural diagram of an image-text multimodal target object classification model architecture according to an embodiment of the present invention is shown.
Text features can be produced mainly by a text model (e.g., a pre-trained BERT). Specifically, the classification token (cls) of the text object in the target object, the zero-level category (cat0), the first-level category (cat1), the second-level category (cat2), the model-predicted first-level category (cat1_pred), the video tag (video_tag), the store name (shop_name), the image names (img_names), and the title and body text (title content) are converted into tokens; the corresponding embedding features are looked up in the dictionary and input into the pre-trained BERT, which outputs the text feature H_cls(bert).
The fusion features may include image features and image statistical features. The image statistical features may include int-type and float-type image statistical features. The float-type image statistical features may be the CTR predicted by other models. The int-type image statistical features may be the portrait tag (human_id) predicted by other models, the image width (width), the image height (height), the aesthetic quality score mean (aes_mean), the aesthetic quality score standard deviation (aes_std), the bucketized CTR (for example, bucketing by 0.01 divides CTRs in [0, 1] into 100 buckets; if a CTR is 0.035, it is placed in the fourth bucket, and its value in the int-type image statistical features is 4), and so on, whose embedded features are obtained by table lookup. The image features may be obtained by passing the image objects through a convolutional neural network (e.g., EfficientNet) and then through global pooling, yielding image features with the same dimensionality as the text features.
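The bucketing of a float-type CTR into an int-type value can be written directly from the example in the text; the function name is illustrative.

```python
def bucketize_ctr(ctr: float, width: float = 0.01, num_buckets: int = 100) -> int:
    """Map a CTR in [0, 1] to a 1-based bucket id: with width 0.01 there are
    100 buckets, and a CTR of 0.035 lands in the fourth bucket, so its
    int-type value is 4."""
    return min(int(ctr / width) + 1, num_buckets)

assert bucketize_ctr(0.035) == 4
```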
The image statistical features (including but not limited to aes_mean, aes_std, CTR and the aesthetics embedding (aes_embedding)) of each image object or video object in the target object are integrated through the concat() function; the integrated image statistical features, together with the image features and a cls token for each image object or video object, are converted into a plurality of tokens; the tokens are input into the Transformer model, and the fusion feature H_cls(transformer) is extracted from the Transformer model. It should be noted that the target object classification model architecture uses aes_mean and aes_std more than once: where aes_mean and aes_std need no table lookup, they belong to the float-type image statistical features; where they need a table lookup, they are bucketized and belong to the int-type image statistical features.
Content statistical features include, but are not limited to: the image count (img_cnt), whether a point of interest is contained (has_poi), whether a title is contained (has_title), whether a video object is contained (has_video), the video object duration (video_duration), the word count (word_cnt), the paragraph count (paragraph_cnt), and int-type and float-type content statistical features of other model dimensions. The int-type content statistical features require a table lookup to obtain the corresponding embedding features.
When the text features, the fusion features and the content statistical features are integrated through concat(), they are unified to the same dimensionality. The text features, fusion features and content statistical features unified to the same dimensionality are then input into a multi-layer perceptron (MLP), and the MLP predicts the classification result of the target object.
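Putting the branches together, an end-to-end forward pass could look like the following sketch, which wires the illustrative modules above into one classifier; the content-feature width and class count are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn

class TargetObjectClassifier(nn.Module):
    """End-to-end sketch: text branch + fusion branch + content statistics,
    unified to one dimension, concatenated, and classified by an MLP."""
    def __init__(self, text_encoder, image_encoder, fusion_encoder,
                 content_dim=32, dim=768, num_classes=30):
        super().__init__()
        self.text_encoder = text_encoder      # outputs H_cls(bert)
        self.image_encoder = image_encoder    # per-image features
        self.fusion_encoder = fusion_encoder  # outputs H_cls(transformer)
        # unify the content statistical features to the same dimension
        self.content_proj = nn.Linear(content_dim, dim)
        # multi-layer perceptron (MLP) prediction head
        self.mlp = nn.Sequential(nn.Linear(dim * 3, dim), nn.ReLU(),
                                 nn.Linear(dim, num_classes))

    def forward(self, token_ids, images, img_stats, content_stats):
        text_feat = self.text_encoder(token_ids)                 # (b, dim)
        img_feats = self.image_encoder(images)                   # (b, n, C)
        fusion_feat = self.fusion_encoder(img_feats, img_stats)  # (b, dim)
        content_feat = self.content_proj(content_stats)          # (b, dim)
        x = torch.cat([text_feat, fusion_feat, content_feat], dim=-1)
        return self.mlp(x)   # classification result of the target object
```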
The image-text multimodal target object classification model architecture provided by the embodiment of the present invention is an end-to-end model architecture that supports one text with multiple images, and can conveniently incorporate features of various dimensions and types as well as incremental information output by other models, further improving the classification effect.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 4, a block diagram of a classification apparatus for a target object according to an embodiment of the present invention is shown, where the classification apparatus for a target object may be applied to a classification terminal or a classification server. The target object includes at least one text object and a plurality of non-text objects, the non-text objects include image objects and/or video objects, and the classification device of the target object may specifically include the following modules:
a text feature extraction module 41, configured to obtain a text label of the text object, and extract a text feature of the text object according to the text object and the text label;
a fusion feature extraction module 42, configured to obtain an image feature of the non-text object and an image statistical feature of the non-text object, and extract a fusion feature of the non-text object according to the image feature and the image statistical feature;
and a classification predicting module 43, configured to predict a classification result of the target object according to the text feature, the fusion feature, and the content statistical feature of the target object.
In a preferred embodiment of the present invention, the text feature extraction module 41 includes:
a text token conversion module, configured to convert the text object and the text label into text tokens;
and a text feature query module, configured to query a pre-trained dictionary for the embedded features corresponding to the text tokens, input the embedded features into a preset text model, and output the text features.
In a preferred embodiment of the present invention, the fused feature extraction module 42 includes:
an image feature extraction module, configured to sequentially perform convolution processing and global pooling processing on the image objects and/or the frame image objects of the video object to obtain image features with the same dimensionality as the text features.
In a preferred embodiment of the present invention, the fused feature extraction module 42 includes:
the image statistical characteristic extraction module is used for acquiring the image statistical characteristics of the integer type and the floating point type of the non-text object;
wherein the integer type of image statistical features comprise embedded features corresponding to image tags of the non-textual objects; the image statistical characteristics of the floating point type include click through rate.
In a preferred embodiment of the present invention, the fused feature extraction module 42 includes:
a feature integration module, configured to integrate the image statistical features of each non-text object into integrated image statistical features;
an image token conversion module, configured to convert the integrated image statistical features and the image features of each non-text object into a plurality of image tokens;
and a feature fusion module, configured to input the plurality of image tokens into a preset Transformer-based model and output the fusion features.
In a preferred embodiment of the present invention, the classification prediction module 43 includes:
a preset operation module, configured to perform integration processing on the text features, the fusion features and the content statistical features and then execute a preset operation;
and a classification module, configured to input the features produced by the preset operation into a fully connected layer and output the classification result;
where the integrated text features, fusion features and content statistical features have the same dimensionality, and the preset operation includes at least one of the following operations: a concatenation operation, a multiplication operation and an attention operation.
In a preferred embodiment of the present invention, the content statistical features include: a text count of the text objects, a non-text count of the non-text objects, whether the target object contains a point of interest, integer-type content statistical features of the target object, and floating-point-type content statistical features of the target object.
An embodiment of the present invention further provides an electronic device, referring to fig. 5, including: a processor 501, a memory 502 and a computer program 5021 stored on the memory 502 and operable on the processor 501, the processor 501 implementing the classification method of the target object of the foregoing embodiments when executing the program 5021.
Embodiments of the present invention also provide a readable storage medium, on which a computer program is stored, which when executed by a processor implements the classification method of the target object of the foregoing embodiments.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
It should be noted that all actions of acquiring signals, information or data in the embodiments of the present invention are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The method and the apparatus for classifying a target object provided by the present invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for classifying a target object, wherein the target object comprises at least one text object and a plurality of non-text objects, and the non-text objects comprise image objects and/or video objects; the method comprises the following steps:
acquiring a text label of the text object, and extracting text characteristics of the text object according to the text object and the text label;
acquiring image characteristics of the non-text object and image statistical characteristics of the non-text object, and extracting fusion characteristics of the non-text object according to the image characteristics and the image statistical characteristics;
and predicting the classification result of the target object according to the text feature, the fusion feature and the content statistical feature of the target object.
2. The method of claim 1, wherein extracting the text feature of the text object according to the text object and the text label comprises:
converting the text object and the text label into text tokens;
and querying a pre-trained dictionary for the embedded features corresponding to the text tokens, inputting the embedded features into a preset text model, and outputting the text features.
3. The method of claim 1, wherein the obtaining image features of the non-textual object comprises:
and sequentially performing convolution processing and global pooling processing on the image objects and/or the frame image objects of the video object to obtain image features with the same dimensionality as the text features.
4. The method of claim 1, wherein the acquiring the image statistical features of the non-text objects comprises:
acquiring integer-type image statistical features and floating-point-type image statistical features of the non-text objects;
wherein the integer-type image statistical features comprise embedded features corresponding to image tags of the non-text objects, and the floating-point-type image statistical features comprise a click-through rate.
5. The method of claim 1, wherein the extracting the fusion features of the non-text objects according to the image features and the image statistical features comprises:
integrating the image statistical features of each non-text object into integrated image statistical features;
converting the integrated image statistical features and the image features of each non-text object into a plurality of image tokens;
and inputting the plurality of image tokens into a preset Transformer-based model, and outputting the fusion features.
6. The method of claim 1, wherein predicting the classification result of the target object according to the text feature, the fusion feature and the content statistical feature of the target object comprises:
performing integration processing on the text features, the fusion features and the content statistical features, and then executing a preset operation;
inputting the features produced by the preset operation into a fully connected layer, and outputting the classification result;
wherein the integrated text features, fusion features and content statistical features have the same dimensionality, and the preset operation comprises at least one of the following operations: a concatenation operation, a multiplication operation and an attention operation.
7. The method of any one of claims 1 to 6, wherein the content statistical features comprise: a text count of the text objects, a non-text count of the non-text objects, whether the target object contains a point of interest, integer-type content statistical features of the target object, and floating-point-type content statistical features of the target object.
8. A device for classifying a target object, wherein the target object comprises at least one text object and a plurality of non-text objects, and the non-text objects comprise image objects and/or video objects; the device comprises:
the text feature extraction module is used for acquiring a text label of the text object and extracting the text feature of the text object according to the text object and the text label;
the fusion feature extraction module is used for acquiring the image features of the non-text object and the image statistical features of the non-text object and extracting the fusion features of the non-text object according to the image features and the image statistical features;
and the classification prediction module is used for predicting the classification result of the target object according to the text characteristic, the fusion characteristic and the content statistical characteristic of the target object.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of classifying a target object according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of classifying a target object according to any one of claims 1 to 7.
CN202210467968.XA 2022-04-29 2022-04-29 Target object classification method and device, electronic equipment and storage medium Pending CN114817586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210467968.XA CN114817586A (en) 2022-04-29 2022-04-29 Target object classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210467968.XA CN114817586A (en) 2022-04-29 2022-04-29 Target object classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114817586A true CN114817586A (en) 2022-07-29

Family

ID=82510087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210467968.XA Pending CN114817586A (en) 2022-04-29 2022-04-29 Target object classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114817586A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595978A (en) * 2023-07-14 2023-08-15 腾讯科技(深圳)有限公司 Object category identification method, device, storage medium and computer equipment
CN116595978B (en) * 2023-07-14 2023-11-14 腾讯科技(深圳)有限公司 Object category identification method, device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN111324769B (en) Training method of video information processing model, video information processing method and device
CN106878632B (en) Video data processing method and device
CN111767461B (en) Data processing method and device
CN111263238B (en) Method and equipment for generating video comments based on artificial intelligence
WO2018112696A1 (en) Content pushing method and content pushing system
CN112395412B (en) Text classification method, apparatus and computer readable medium
CN114339450A (en) Video comment generation method, system, device and storage medium
CN115098706A (en) Network information extraction method and device
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
Palash et al. Bangla image caption generation through cnn-transformer based encoder-decoder network
Tarride et al. A comparative study of information extraction strategies using an attention-based neural network
CN114611520A (en) Text abstract generating method
CN114817586A (en) Target object classification method and device, electronic equipment and storage medium
CN114281948A (en) Summary determination method and related equipment thereof
CN116958997B (en) Graphic summary method and system based on heterogeneous graphic neural network
WO2013022384A1 (en) Method for producing and using a recursive index of search engines
AU2019290658B2 (en) Systems and methods for identifying and linking events in structured proceedings
US20240020337A1 (en) Multimodal intent discovery system
KR102422844B1 (en) Method of managing language risk of video content based on artificial intelligence
CN112818687B (en) Method, device, electronic equipment and storage medium for constructing title recognition model
Chen et al. Video captioning via sentence augmentation and spatio-temporal attention
CN117648504A (en) Method, device, computer equipment and storage medium for generating media resource sequence
CN114298048A (en) Named entity identification method and device
CN110598038A (en) Painting label generation method and electronic equipment
CN118093792B (en) Method, device, computer equipment and storage medium for searching object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination