CN115482490A - Article classification model training method, article classification device and medium - Google Patents

Article classification model training method, article classification device and medium

Info

Publication number
CN115482490A
Authority
CN
China
Prior art keywords
text
feature vector
semantic information
image
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211160689.5A
Other languages
Chinese (zh)
Inventor
邓桂林
徐路
谢东霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211160689.5A priority Critical patent/CN115482490A/en
Publication of CN115482490A publication Critical patent/CN115482490A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0641Shopping interfaces
    • G06Q30/0643Graphical representation of items or shoppers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The present disclosure relates to the field of computer technology, and in particular to an article classification model training method, an article classification model training apparatus, an article classification method, an article classification apparatus, a computer-readable storage medium, and an electronic device. The method includes: acquiring video sample data; acquiring a feature vector and semantic information for each text in the sample data; determining a text feature vector corresponding to the video sample data; fusing the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; and obtaining a predicted article category according to the first fusion feature vector, then updating the neural network parameters of the model to be trained according to the article category label and the predicted article category. The technical solution of the embodiments of the present disclosure can solve the problem of inaccurate classification in the prior art.

Description

Article classification model training method, article classification device and medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an article classification model training method, an article classification model training apparatus, an article classification method, an article classification apparatus, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of the internet, short-video e-commerce services have become increasingly widespread. Merchants can display, introduce, and sell goods through short videos. In some scenarios, the goods in a short video may be identified and relevant links provided for potential buyers to view.
In the related art, an image feature vector can be extracted by an image encoder and a text feature vector by a text encoder; the two are fused into a commodity feature vector, and the commodity feature vector is classified to obtain the category of the commodity in the short video.
However, the image feature vectors and text feature vectors extracted from short videos are not uniformly distributed, which may make the classification results output by the model inaccurate and, in turn, cause errors in commodity recommendation.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to an article classification model training method, an article classification model training apparatus, an article classification method, an article classification apparatus, a computer-readable storage medium, and an electronic device, which can solve the problem of inaccurate classification in the prior art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided an item classification model training method, including: acquiring video sample data, where the video sample data includes text data, image data, and an item category label, and the text data includes a description text, a voice text, and an image text; inputting the video sample data into a model to be trained, and acquiring a feature vector of the description text and first semantic information of the description text, a feature vector of the voice text and second semantic information of the voice text, and a feature vector of the image text and third semantic information of the image text; determining a text feature vector corresponding to the video sample data according to the feature vector of the description text, the first semantic information of the description text, the feature vector of the voice text, the second semantic information of the voice text, the feature vector of the image text, and the third semantic information of the image text; acquiring an image feature vector corresponding to the video sample data according to the image data in the video sample data, and fusing the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; and obtaining a predicted item category according to the first fusion feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained according to the item category label and the predicted item category to obtain an item classification model.
Optionally, based on the foregoing scheme, determining the text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text includes: merging the first semantic information, the second semantic information, and the third semantic information into overall semantic information according to a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information, and a third weight corresponding to the third semantic information; and determining the text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text.
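As a minimal sketch of the weighted merge described above (written in PyTorch; the weight values and tensor shapes are illustrative assumptions, not taken from the disclosure):

    import torch

    # s1, s2, s3: semantic-information vectors of the description text,
    # voice text, and image text, each of shape (batch, dim).
    def merge_semantics(s1, s2, s3, w1=0.5, w2=0.3, w3=0.2):
        # Weighted combination into one overall semantic vector;
        # w1, w2, w3 are the first, second, and third fusion weights.
        return w1 * s1 + w2 * s2 + w3 * s3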
Optionally, based on the foregoing scheme, updating the neural network parameter of the model to be trained according to the item category label and the predicted item category includes: determining a first loss function of the model to be trained according to the item class label and the predicted item class; and updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained.
Optionally, based on the foregoing scheme, the first loss function is an asymmetric loss function; the video sample data comprises a positive sample and a negative sample, the exponential coefficient of the negative sample in the asymmetric loss function is larger than that of the positive sample, and the negative sample is removed when the prediction probability of the predicted article category corresponding to the negative sample is smaller than a preset threshold value.
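For illustration, a hedged sketch of such an asymmetric loss in the spirit of the asymmetric focal loss used in multi-label classification; the focusing exponents (with gamma_neg > gamma_pos, so the exponential coefficient of negative samples is larger) and the probability margin below which easy negatives are removed are assumed values, not taken from the disclosure:

    import torch

    def asymmetric_loss(logits, targets, gamma_pos=1.0, gamma_neg=4.0, margin=0.05):
        # logits, targets: (batch, num_classes); targets is multi-hot.
        p = torch.sigmoid(logits)
        # Probability shifting: negatives whose predicted probability is
        # already below the preset margin contribute zero loss (removed).
        p_neg = (p - margin).clamp(min=0)
        loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
        loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=1e-8))
        return -(loss_pos + loss_neg).mean()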
Optionally, based on the foregoing scheme, updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained includes: inputting the video sample data into a momentum model, where the neural network parameters of the momentum model are updated in a moving-average (sliding) manner as the neural network parameters of the model to be trained change during training; acquiring a momentum feature vector of the description text and fourth semantic information of the description text, a momentum feature vector of the voice text and fifth semantic information of the voice text, and a momentum feature vector of the image text and sixth semantic information of the image text; determining a momentum text feature vector corresponding to the video sample data according to the momentum feature vector of the description text, the fourth semantic information of the description text, the momentum feature vector of the voice text, the fifth semantic information of the voice text, the momentum feature vector of the image text, and the sixth semantic information of the image text; acquiring a momentum image feature vector corresponding to the video sample data, and fusing the momentum image feature vector and the momentum text feature vector to obtain a second fusion feature vector corresponding to the video sample data; determining a second loss function of the model to be trained according to the second fusion feature vector of the video sample data; and determining an overall loss function according to the first loss function of the model to be trained and the second loss function of the model to be trained, and updating the neural network parameters of the model to be trained through the overall loss function to obtain the item classification model.
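A minimal sketch of the moving-average parameter update for the momentum model, assuming an exponential moving average (EMA) with a decay factor close to 1 (the value 0.995 is an assumption):

    import torch

    @torch.no_grad()
    def update_momentum_model(model, momentum_model, m=0.995):
        # Slide the momentum model's parameters toward the current
        # parameters of the model being trained (EMA update).
        for p, p_m in zip(model.parameters(), momentum_model.parameters()):
            p_m.data.mul_(m).add_(p.data, alpha=1 - m)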
Optionally, based on the foregoing scheme, determining a second loss function of the model to be trained according to a second fusion feature vector of the video sample data includes: obtaining an article category pseudo label according to the second fusion feature vector of the video sample data; and determining a second loss function of the model to be trained according to the item class pseudo label and the predicted item class.
Optionally, based on the foregoing scheme, the second loss function is an asymmetric loss function; the video sample data comprises a positive sample and a negative sample, the exponential coefficient of the negative sample in the asymmetric loss function is larger than that of the positive sample, the negative sample is removed when the prediction probability of the predicted item type corresponding to the negative sample is smaller than that of the item type pseudo label corresponding to the negative sample, and the positive sample is removed when the prediction probability of the predicted item type corresponding to the positive sample is larger than that of the item type pseudo label corresponding to the positive sample.
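One plausible reading of this sample-removal rule, sketched with assumed shapes (p: the trained model's predicted probabilities, q: the momentum model's pseudo-label probabilities; this is an interpretation for illustration, not the disclosure's literal formula):

    import torch

    def pseudo_label_sample_mask(p, q, targets):
        # targets: multi-hot ground truth, shape (batch, num_classes).
        keep_neg = (1 - targets) * (p >= q).float()  # drop negatives with p < q
        keep_pos = targets * (p <= q).float()        # drop positives with p > q
        return keep_pos + keep_neg  # multiply elementwise into the second loss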
According to a second aspect of the present disclosure, there is provided an article classification method, including: acquiring video data, where the video data includes text data and image data, and the text data includes a description text, a voice text, and an image text; and inputting the video data into an article classification model to obtain an article category, where the article classification model is trained by the article classification model training method according to any one of the above embodiments.
According to a third aspect of the present disclosure, there is provided an article classification model training apparatus, including: a sample data acquisition unit configured to acquire video sample data, where the video sample data includes text data, image data, and an article category label, and the text data includes a description text, a voice text, and an image text; a semantic information acquisition unit configured to input the video sample data into a model to be trained, and acquire a feature vector of the description text and first semantic information of the description text, a feature vector of the voice text and second semantic information of the voice text, and a feature vector of the image text and third semantic information of the image text; a text feature acquisition unit configured to determine a text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text; a feature fusion unit configured to acquire an image feature vector corresponding to the video sample data according to the image data in the video sample data, and fuse the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; and a parameter updating unit configured to obtain a predicted article category according to the first fusion feature vector corresponding to the video sample data, and update the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain an article classification model.
Optionally, based on the foregoing scheme, the text feature vector corresponding to the video sample data is determined according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text, and the apparatus further includes: a semantic merging unit configured to merge the first semantic information, the second semantic information, and the third semantic information into overall semantic information according to a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information, and a third weight corresponding to the third semantic information; and a text feature vector determining unit configured to determine the text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text.
Optionally, based on the foregoing scheme, the neural network parameters of the model to be trained are updated according to the article category label and the predicted article category, and the apparatus further includes: a first loss function determination unit configured to determine a first loss function of the model to be trained according to the article category label and the predicted article category; and a first loss function training unit configured to update the neural network parameters of the model to be trained according to the first loss function of the model to be trained.
Optionally, based on the foregoing scheme, the first loss function is an asymmetric loss function; the video sample data comprises a positive sample and a negative sample, the exponential coefficient of the negative sample in the asymmetric loss function is larger than that of the positive sample, and the negative sample is removed when the prediction probability of the predicted article category corresponding to the negative sample is smaller than a preset threshold value.
Optionally, based on the foregoing scheme, the neural network parameters of the model to be trained are updated according to the first loss function of the model to be trained, and the apparatus further includes: a momentum model input unit configured to input the video sample data into a momentum model, where the neural network parameters of the momentum model are updated in a moving-average (sliding) manner as the neural network parameters of the model to be trained change during training; a second semantic information acquisition unit configured to acquire a momentum feature vector of the description text and fourth semantic information of the description text, a momentum feature vector of the voice text and fifth semantic information of the voice text, and a momentum feature vector of the image text and sixth semantic information of the image text; a second text feature vector determining unit configured to determine a momentum text feature vector corresponding to the video sample data according to the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, and the momentum feature vector of the image text and the sixth semantic information of the image text; a second fusion feature vector acquisition unit configured to acquire a momentum image feature vector corresponding to the video sample data, and fuse the momentum image feature vector and the momentum text feature vector to obtain a second fusion feature vector corresponding to the video sample data; a second loss function acquisition unit configured to determine a second loss function of the model to be trained according to the second fusion feature vector of the video sample data; and a training unit configured to determine an overall loss function according to the first loss function of the model to be trained and the second loss function of the model to be trained, and update the neural network parameters of the model to be trained through the overall loss function to obtain the article classification model.
Optionally, based on the foregoing scheme, a second loss function of the model to be trained is determined according to a second fusion feature vector of the video sample data, and the apparatus further includes: the item type pseudo label acquisition unit is configured to execute obtaining of an item type pseudo label according to a second fusion feature vector of the video sample data; and the second loss function determining unit is configured to execute second loss function determination of the model to be trained according to the item class pseudo label and the predicted item class.
Optionally, based on the foregoing scheme, the second loss function is an asymmetric loss function; the video sample data comprises a positive sample and a negative sample, the exponential coefficient of the negative sample in the asymmetric loss function is larger than that of the positive sample, the negative sample is removed when the prediction probability of the predicted item type corresponding to the negative sample is smaller than that of the item type pseudo label corresponding to the negative sample, and the positive sample is removed when the prediction probability of the predicted item type corresponding to the positive sample is larger than that of the item type pseudo label corresponding to the positive sample.
According to a fourth aspect of the present disclosure, there is provided an article classification apparatus, including: a video acquisition unit configured to acquire video data, where the video data includes text data and image data, and the text data includes a description text, a voice text, and an image text; and an article category acquisition unit configured to input the video data into an article classification model to obtain an article category, where the article classification model is trained by the article classification model training method according to any one of the above embodiments.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the item classification model training method of the first aspect and the item classification method of the second aspect as in the above embodiments.
According to a sixth aspect of the present disclosure, there is provided an electronic device comprising:
a processor; and
memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the item classification model training method of the first aspect and the item classification method of the second aspect as in the embodiments described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program/instructions which, when executed by a processor, implement the item classification model training method and the item classification method of any one of the above embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the article classification model training method provided by an embodiment of the present disclosure, video sample data may be acquired and input into a model to be trained; a feature vector of the description text and first semantic information of the description text, a feature vector of the voice text and second semantic information of the voice text, and a feature vector of the image text and third semantic information of the image text may be acquired; a text feature vector corresponding to the video sample data may be determined from these feature vectors and semantic information; an image feature vector corresponding to the video sample data may be acquired from the image data in the video sample data and fused with the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; a predicted article category may then be obtained from the first fusion feature vector, and the neural network parameters of the model to be trained may be updated according to the article category label and the predicted article category to obtain an article classification model.
According to the embodiments of the present disclosure, semantic information corresponding to different text feature vectors can be fused, so that the influence of the semantic information of each text on the classification result is taken into account; this improves the accuracy of the classification results output by the model and makes commodity recommendation more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It should be apparent that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived by those of ordinary skill in the art without inventive effort. In the drawings:
FIG. 1 schematically illustrates a schematic diagram of an example system architecture for training an item classification model in an example embodiment of the present disclosure;
FIG. 2 schematically illustrates a flowchart of an item classification model training method in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flowchart for determining a text feature vector corresponding to video sample data according to overall semantic information, a feature vector of a description text, a feature vector of a voice text, and a feature vector of an image text in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart for updating neural network parameters of a model to be trained according to a first loss function of the model to be trained in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flowchart for updating neural network parameters of a model to be trained by a whole loss function to obtain an item classification model in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart for determining a second loss function for a model to be trained based on item class pseudo labels and predicted item classes in an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a schematic diagram of a first fusion feature obtaining manner corresponding to video sample data in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of an item sorting method in an exemplary embodiment of the disclosure;
FIG. 9 is a schematic diagram illustrating components of an object classification model training apparatus according to an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a component schematic of an article sorting apparatus in an exemplary embodiment of the present disclosure;
FIG. 11 schematically shows a structural diagram of a computer system of an electronic device suitable for implementing an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more software and/or hardware modules, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 sets forth a schematic diagram of an exemplary system architecture to which the article classification model training method or the article classification method of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 1000 may include one or more of terminal devices 1001, 1002, 1003, a network 1004, and a server 1005. The network 1004 is a medium used to provide communication links between the terminal devices 1001, 1002, 1003 and the server 1005. Network 1004 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 1005 may be a server cluster composed of a plurality of servers.
A user can interact with a server 1005 via a network 1004 using terminal devices 1001, 1002, 1003 to receive or transmit messages or the like. The terminal devices 1001, 1002, 1003 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like. In addition, the server 1005 may be a server that provides various services.
In one embodiment, the execution subject of the article classification model training method of the present disclosure may be the server 1005. The server 1005 may acquire video sample data sent by the terminal devices 1001, 1002, and 1003 and input it into a model to be trained; acquire a feature vector of the description text and first semantic information of the description text, a feature vector of the voice text and second semantic information of the voice text, and a feature vector of the image text and third semantic information of the image text; determine a text feature vector corresponding to the video sample data from these feature vectors and semantic information; acquire an image feature vector corresponding to the video sample data from the image data in the video sample data, and fuse the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; obtain a predicted article category from the first fusion feature vector; and update the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain an article classification model. Alternatively, the article classification model training method of the present disclosure may be executed by the terminal devices 1001, 1002, 1003, and the like, to realize the process of fusing the image feature vector and the text feature vector to obtain the first fusion feature vector corresponding to the video sample data, obtaining the predicted article category from the first fusion feature vector, and updating the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain the article classification model.
In addition, the article classification model training method of the present disclosure may also be implemented by the terminal devices 1001, 1002, 1003 and the server 1005 together. For example, the terminal devices 1001, 1002, and 1003 may obtain video sample data and send it to the server 1005, so that the server 1005 may input the video sample data into a model to be trained; acquire a feature vector of the description text and first semantic information of the description text, a feature vector of the voice text and second semantic information of the voice text, and a feature vector of the image text and third semantic information of the image text; determine a text feature vector corresponding to the video sample data from these feature vectors and semantic information; acquire an image feature vector corresponding to the video sample data from the image data in the video sample data, and fuse the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; obtain a predicted article category from the first fusion feature vector; and update the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain an article classification model.
With the rapid development of the internet, short-video e-commerce services have become increasingly widespread. Merchants can display, introduce, and sell goods through short videos. In some scenarios, the items in a short video may be identified and relevant links provided for potential buyers to view.
In the related art, an image feature vector can be extracted by an image encoder and a text feature vector by a text encoder; the two are fused into a commodity feature vector, and the commodity feature vector is classified to obtain the category of the commodity in the short video. For example, the categories of the goods in the short video can be identified, and goods of the same category can be recommended to potential buyers according to those categories for them to choose from.
However, the image feature vectors and text feature vectors extracted from short videos are not uniformly distributed, which may make the classification results output by the model inaccurate and, in turn, cause errors in commodity recommendation.
According to the article classification model training method provided in this exemplary embodiment, video sample data may be acquired and input into a model to be trained; a feature vector of the description text and first semantic information of the description text, a feature vector of the voice text and second semantic information of the voice text, and a feature vector of the image text and third semantic information of the image text may be acquired; a text feature vector corresponding to the video sample data may be determined from these feature vectors and semantic information; an image feature vector corresponding to the video sample data may be acquired from the image data in the video sample data and fused with the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; and a predicted article category may be obtained from the first fusion feature vector, with the neural network parameters of the model to be trained updated according to the article category label and the predicted article category to obtain an article classification model. As shown in FIG. 2, the article classification model training method may include the following steps S210 to S250:
step S210, video sample data is obtained; the video sample data comprises text data, image data and an article type label, wherein the text data comprises a description text, a voice text and an image text;
step S220, inputting video sample data into a model to be trained, acquiring a feature vector for describing a text and first semantic information for describing the text, acquiring a feature vector for a voice text and second semantic information for the voice text, and acquiring a feature vector for an image text and third semantic information for the image text;
step S230, determining a text feature vector corresponding to video sample data according to the feature vector of the description text, the first semantic information of the description text, the feature vector of the voice text, the second semantic information of the voice text, the feature vector of the image text and the third semantic information of the image text;
step S240, acquiring an image feature vector corresponding to video sample data according to image data in the video sample data, and fusing the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data;
Step S250, obtaining a predicted article type according to the first fusion feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained according to the article type label and the predicted article type to obtain an article classification model.
According to the embodiments of the present disclosure, semantic information corresponding to different text feature vectors can be fused, so that the influence of the semantic information of each text on the classification result is taken into account; this improves the accuracy of the classification results output by the model and makes commodity recommendation more accurate.
Next, the steps S210 to S250 of the training method for the article classification model in the present exemplary embodiment will be described in more detail with reference to the drawings and the embodiments.
Step S210, video sample data is obtained; the video sample data comprises text data, image data and an article type label, wherein the text data comprises a description text, a voice text and an image text;
in an example embodiment of the present disclosure, video sample data may be obtained. The video sample data comprises text data, image data and an article type label, wherein the text data comprises description text, voice text and image text. Specifically, the video sample data is video data obtained by shooting through a camera module, and also can be video data obtained by artificial synthesis. It should be noted that the source of the video sample data is not particularly limited in the present disclosure.
Specifically, description text, voice text, and image text may be included in the text data.
In an example embodiment of the present disclosure, the description text may include text with which a video creator or video manager explains or introduces the video; for example, the description text may include a video title, a video summary, a video brief description, and the like. For example, when uploading a video, a video creator may add a video profile such as "this video mainly evaluates xx brand mobile phones"; that video profile is a description text of the video.
Further, multiple sets of description text may be included in the video sample data. For example, a video title and a video introduction corresponding to the video data may be used as the description text of the video sample data.
It is to be understood that the present disclosure is not limited to the particular type of descriptive text.
In an example embodiment of the present disclosure, the voice text refers to text obtained by performing speech recognition on audio information in the video data. By way of example, the voice text may include text resulting from speech recognition of audio information (e.g., human voice) in the video.
In an example embodiment of the present disclosure, the image text refers to text obtained by performing character recognition on textual information in the video data. For example, the video data may be divided into multiple frames of images and character recognition may be performed on those frames; the text obtained in this way is the image text.
In an example embodiment of the present disclosure, the video sample data further includes an item category tag. Specifically, the item category tag indicates the category of an item in the video data corresponding to the video sample data; for example, the item category labels may be hat, satchel, or sweater.
Further, video data may be used to indicate a plurality of items, and thus a plurality of item category labels may be configured for the video data.
Further, the items indicated in the video data may correspond to a plurality of item category tags, and thus a plurality of item category tags may be configured for the video data. For example, the items indicated in the video data are shoulder bags, and in this case, a plurality of item type labels, such as bags, satchels, lady bags, and the like, may be configured for the video data.
It should be noted that the present disclosure is not limited to the specific number and specific form of the article category labels.
Step S220, inputting video sample data into a model to be trained, acquiring a feature vector of a description text and first semantic information of the description text, acquiring a feature vector of a voice text and second semantic information of the voice text, and acquiring a feature vector of an image text and third semantic information of the image text;
in an example embodiment of the present disclosure, after the video sample data is obtained through the above steps, the video sample data may be input into the model to be trained. Specifically, the model to be trained is a model established for completing the object classification task, and the object classification task can be completed by training the model to be trained to obtain an object classification model. It should be noted that, the present disclosure does not make any special limitation on the specific structure of the model to be trained.
Further, after the video sample data is obtained, a description text, a voice text and an image text in the video sample data can be obtained through the model to be trained.
For example, the description text, the voice text and the image text corresponding to the video sample data may be obtained through a text recognition sub-model in the model to be trained. For example, a voice text corresponding to the video sample data may be obtained through the voice recognition submodel; or, the image text corresponding to the video sample data can be acquired through the character recognition sub-model.
It should be noted that, the present disclosure does not make any special limitation on the specific manner of obtaining the description text, the voice text and the image text in the video sample data.
In an example embodiment of the disclosure, after the description text, the voice text, and the image text in the video sample data are acquired, a feature vector of the description text and first semantic information of the description text may be acquired through a model to be trained, a feature vector of the voice text and second semantic information of the voice text may be acquired, and a feature vector of the image text and third semantic information of the image text may be acquired. Specifically, the first semantic information of the description text may be used to indicate the semantics of the description text of the video data corresponding to the video sample data, where the semantics fuses the meanings of each word in the description text; the second semantic information of the voice text can be used for indicating the semantics of the voice text of the video data corresponding to the video sample data, and the semantics fuses the meanings of all the characters/words in the voice text; the third semantic information of the image text may be used to indicate the semantics of the image text of the video data corresponding to the video sample data, which fuses the meanings of the respective words in the image text.
Specifically, the first semantic information of the description text, the second semantic information of the voice text, and the third semantic information of the image text are in the form of feature vectors. For example, the first semantic information of the description text is a [CLS] embedding (a vector containing semantic information).
For example, a first text encoder in the model to be trained may convert the description text in the video sample data into a feature vector of the description text and extract the first semantic information of the description text; a second text encoder may convert the voice text into a feature vector of the voice text and extract the second semantic information of the voice text; and a third text encoder may convert the image text into a feature vector of the image text and extract the third semantic information of the image text. For example, the feature vector of the description text is in the form of token embeddings.
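As a hedged sketch of this per-text encoding with a BERT-style encoder from the Hugging Face Transformers library (the model name and the use of the [CLS] position as semantic information are assumptions; the disclosure does not prescribe a specific encoder):

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    encoder = AutoModel.from_pretrained("bert-base-chinese")

    def encode_text(text):
        # Returns the token embeddings (feature vector of the text) and
        # the [CLS] embedding (semantic information of the text).
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        out = encoder(**inputs)
        token_embeddings = out.last_hidden_state       # (1, seq_len, dim)
        cls_embedding = out.last_hidden_state[:, 0]    # (1, dim)
        return token_embeddings, cls_embedding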
It should be noted that, the specific manner of acquiring the feature vector of the description text and the first semantic information of the description text, acquiring the feature vector of the voice text and the second semantic information of the voice text, and acquiring the feature vector of the image text and the third semantic information of the image text is not particularly limited in the present disclosure.
Step S230, determining a text feature vector corresponding to video sample data according to the feature vector of the description text, the first semantic information of the description text, the feature vector of the voice text, the second semantic information of the voice text, the feature vector of the image text and the third semantic information of the image text;
in an example embodiment of the present disclosure, after the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text are obtained through the above steps, a text feature vector corresponding to video sample data may be determined according to the feature vector of the description text and the first semantic information of the description text, and the feature vector of the voice text and the second semantic information of the voice text and the feature vector of the image text and the third semantic information of the image text. Specifically, the text feature vector corresponding to the video sample data may be used to indicate text information indicated by the video data corresponding to the video sample data, where the text information includes information describing text, voice text, and image text of the video data.
Specifically, the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text may be directly spliced with the feature vector of the image text and the third semantic information of the image text to obtain the text feature vector corresponding to the video sample data.
Or different fusion weights can be given to the feature vectors of different texts, and the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text are directly fused with the feature vector of the image text and the third semantic information of the image text according to the fusion weight corresponding to the description text, the fusion weight corresponding to the voice text and the fusion weight corresponding to the image text, so as to obtain the text feature vector corresponding to the video sample data.
Or different fusion weights can be given to semantic information of different texts, and the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text are directly fused with the feature vector of the image text and the third semantic information of the image text according to the fusion weight corresponding to the first semantic information, the fusion weight corresponding to the second semantic information and the fusion weight corresponding to the third semantic information, so as to obtain the text feature vector corresponding to the video sample data.
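As one hedged illustration of the last option (fusion weights assigned per semantic information), sketched in PyTorch with assumed weights and shapes:

    import torch

    def build_text_feature(tok_desc, tok_voice, tok_image, s1, s2, s3,
                           w1=0.4, w2=0.3, w3=0.3):
        # tok_*: token-embedding sequences, each (batch, L, dim);
        # s1..s3: semantic-information vectors, each (batch, dim).
        overall = w1 * s1 + w2 * s2 + w3 * s3                # overall semantics
        tokens = torch.cat([tok_desc, tok_voice, tok_image], dim=1)
        # Prepend the overall semantic vector to the concatenated tokens.
        return torch.cat([overall.unsqueeze(1), tokens], dim=1)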
It should be noted that, the specific manner in which the text feature vector corresponding to the video sample data is determined according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, the feature vector of the image text and the third semantic information of the image text is not particularly limited in the present disclosure.
Step S240, acquiring an image characteristic vector corresponding to video sample data according to image data in the video sample data, and fusing the image characteristic vector and the text characteristic vector to obtain a first fusion characteristic vector corresponding to the video sample data;
in an example embodiment of the present disclosure, an image feature vector corresponding to video sample data may be obtained according to image data in the video sample data. Specifically, the image feature vector corresponding to the video sample data refers to an image feature vector of video data corresponding to the video sample data. The image feature vector corresponding to the video sample data may be used to indicate image information of the video data corresponding to the video sample data.
For example, the video data corresponding to the video sample data may be converted into the image feature vector corresponding to the video sample data by an image encoder in the model to be trained.
Further, after the video sample data is obtained, the video data corresponding to the video sample data may be first divided into multiple frames of images, and then the image feature vector corresponding to the video sample data is obtained according to the multiple frames of images.
It should be noted that, the present disclosure does not make any special limitation on the specific manner of obtaining the image feature vector corresponding to the video sample data and the specific form of the image feature vector corresponding to the video sample data.
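A minimal sketch of one such pipeline — uniform frame sampling followed by a vision backbone with its classifier head removed (the ResNet-50 backbone, mean pooling over frames, and input size are assumptions):

    import torch
    import torchvision

    backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    backbone.fc = torch.nn.Identity()  # keep pooled features, drop the classifier
    backbone.eval()

    @torch.no_grad()
    def encode_video_frames(frames):
        # frames: (num_frames, 3, 224, 224), uniformly sampled and normalized.
        feats = backbone(frames)                 # (num_frames, 2048)
        return feats.mean(dim=0, keepdim=True)   # one image feature vector per video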
In an example embodiment of the present disclosure, after the text feature vector corresponding to the video sample data and the image feature vector corresponding to the video sample data are obtained through the above steps, the image feature vector and the text feature vector may be fused to obtain a first fusion feature vector corresponding to the video sample data. Specifically, the text feature vector corresponding to the video sample data and the image feature vector corresponding to the video sample data may be directly fused to obtain a first fusion feature vector corresponding to the video sample data. For example, the text feature vector corresponding to the video sample data and the image feature vector corresponding to the video sample data may be fused in a dot product manner.
For example, the image feature vector and the text feature vector may be fused through a decoder structure in a Transformer model to obtain the first fusion feature vector corresponding to the video sample data; for example, the text feature vector may be used as the query and the image feature vector as the key and value, and the first fusion feature vector corresponding to the video sample data is obtained by fusion.
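A hedged sketch of that cross-attention fusion, using a single torch.nn.MultiheadAttention layer as a stand-in for a full decoder block (the 768-dimensional shared space and head count are assumptions):

    import torch

    cross_attn = torch.nn.MultiheadAttention(embed_dim=768, num_heads=8,
                                             batch_first=True)

    def fuse(text_feats, image_feats):
        # text_feats: (batch, L_t, 768) used as query; image_feats:
        # (batch, L_i, 768) used as key and value (assumed already
        # projected into the shared 768-dim space).
        fused, _ = cross_attn(query=text_feats, key=image_feats,
                              value=image_feats)
        return fused  # first fusion feature vector sequence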
Or, the text feature vector corresponding to the video sample data and the image feature vector corresponding to the video sample data may be fused according to the fusion weight corresponding to the text feature vector and the fusion weight corresponding to the image feature vector to obtain the first fusion feature vector corresponding to the video sample data.
It should be noted that, the present disclosure does not specially limit a specific manner of obtaining the first fusion feature vector corresponding to the video sample data by fusing the image feature vector and the text feature vector.
Step S250, obtaining a predicted article type according to the first fusion feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained according to the article type label and the predicted article type to obtain an article classification model.
In an example embodiment of the present disclosure, after the first fusion feature vector corresponding to the video sample data is obtained through the above steps, the predicted item category may be obtained according to the first fusion feature vector corresponding to the video sample data. Specifically, the model to be trained may be configured to obtain a predicted item category according to the first fusion feature vector corresponding to the video sample data, where the predicted item category may be used to indicate a predicted category of an item in the video data corresponding to the video sample data.
Specifically, a plurality of predicted item categories can be obtained according to the first fusion feature vector corresponding to the video sample data. For example, the video data corresponding to the video sample data may include a plurality of articles, and the articles respectively correspond to different article categories, and at this time, a plurality of predicted article categories corresponding to the articles may be obtained according to the first fusion feature vector corresponding to the video sample data; alternatively, the item indicated in the video data may correspond to a plurality of item type tags, so a plurality of item type tags may be configured for the video data, and at this time, a plurality of predicted item types corresponding to one item may be obtained according to the first fusion feature vector corresponding to the video sample data.
In an example embodiment of the present disclosure, the model to be trained may include a plurality of hidden layers, and the hidden layers may include a convolutional layer, a normalization layer, an excitation layer, and the like. The first fusion feature vectors corresponding to the video sample data can be sequentially input into a plurality of hidden layers of the model to be trained to obtain hidden layer calculation results, and the predicted article category is obtained through the hidden layer calculation results.
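As an illustrative sketch of such a prediction head, the layer sizes, the use of Conv1d/BatchNorm1d/ReLU for the convolution, normalization, and excitation (activation) layers, and the class count below are all assumptions:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Hidden layers (convolution, normalization, excitation) plus a class scorer."""

    def __init__(self, dim: int = 768, num_classes: int = 1000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),  # convolutional layer
            nn.BatchNorm1d(dim),                            # normalization layer
            nn.ReLU(),                                      # excitation (activation) layer
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, seq_len, dim) first fusion feature vectors.
        h = self.hidden(fused.transpose(1, 2)).transpose(1, 2)
        # Pool the hidden-layer results over the sequence, then score each category.
        return self.classifier(h.mean(dim=1))

head = ClassificationHead()
fused = torch.randn(2, 32, 768)
logits = head(fused)  # (2, 1000) predicted item-category scores
```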
In addition, the present disclosure does not specifically limit the specific manner of obtaining the predicted item type according to the first fusion feature vector corresponding to the video sample data.
In an example embodiment of the present disclosure, after the predicted item category is obtained through the above steps, the neural network parameters of the model to be trained may be updated according to the item category label and the predicted item category to obtain an item classification model. In particular, the predicted item category may be used to indicate a predicted category of an item in video data corresponding to the video sample data. The predicted item type is a predicted value, and at this time, a true value in the video sample data, that is, an item type tag may be obtained, where the item type tag may be used to indicate a true type of an item in the video data corresponding to the video sample data. At this time, the predicted item type (predicted value) and the item type label (true value) may be compared to obtain a predicted difference value between the predicted item type (predicted value) and the item type label (true value), and the neural network parameter of the model to be trained is updated according to the predicted difference value to obtain the item classification model.
Specifically, the neural network parameters of the model to be trained may include the number of model layers, the number of feature vector channels, the learning rate, and the like, and when the neural network parameters of the model to be trained are updated according to the prediction difference, the number of model layers, the number of feature vector channels, and the learning rate of the model to be trained may be updated to train the article classification model.
In an example embodiment of the present disclosure, the neural network parameters of the model to be trained may be updated through a back propagation algorithm, and after training is finished, the article classification model is obtained.
It should be noted that, the specific manner of updating the neural network parameters of the model to be trained according to the item class label and the predicted item class is not particularly limited in this disclosure.
In an example embodiment of the present disclosure, the neural network parameters of the model to be trained may be updated according to the item category label and the predicted item category, and when the model to be trained satisfies a convergence condition, the model to be trained is determined as the item classification model. Specifically, the model to be trained satisfying the convergence condition means that its prediction accuracy is high enough for it to be applied. For example, the convergence condition may include a number of trainings, for example, training ends when the model to be trained has been trained N times; as another example, the convergence condition may include a training duration, for example, training ends when the model to be trained has been trained for a duration T.
It should be noted that the specific content of the convergence condition is not specifically limited in the present disclosure. By applying a convergence condition to the model, the training process of the model to be trained can be better controlled and overtraining of the neural network avoided, thereby improving the training efficiency of the model to be trained.
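A minimal sketch of applying such convergence conditions, assuming a placeholder training step and illustrative values for N and T:

```python
import time

N_MAX_STEPS = 10_000    # convergence by training count N (illustrative)
T_MAX_SECONDS = 3600.0  # convergence by training duration T (illustrative)

def train_one_step() -> None:
    pass  # placeholder for one parameter update of the model to be trained

def train_until_converged() -> None:
    start = time.monotonic()
    for step in range(N_MAX_STEPS):
        train_one_step()
        # Stop early once the training-duration budget is exhausted.
        if time.monotonic() - start > T_MAX_SECONDS:
            break

train_until_converged()
```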
In an example embodiment of the present disclosure, the first semantic information, the second semantic information, and the third semantic information may be merged into overall semantic information according to a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information, and a third weight corresponding to the third semantic information, and the text feature vector corresponding to the video sample data may be determined according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text. Referring to fig. 3, determining the text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text may include the following steps S310 to S320:
step S310, combining the first semantic information, the second semantic information and the third semantic information into integral semantic information according to a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information and a third weight corresponding to the third semantic information;
in an example embodiment of the present disclosure, after the first semantic information describing the text, the second semantic information describing the voice text, and the third semantic information describing the image text are obtained through the above steps, a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information, and a third weight corresponding to the third semantic information may be obtained, and the first semantic information, the second semantic information, and the third semantic information are merged into the overall semantic information according to the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information. Specifically, because the emphasis points of the different types of texts are different, different weights can be given to the semantic information corresponding to the different types of texts, so that the accuracy of the text feature vector corresponding to the video sample data is improved better.
For example, among the description text, the voice text, and the image text of the video sample data, the description text is generally the most representative, and its content is the most closely related to the video data, so the description text has a high degree of importance for the text feature vector corresponding to the video sample data; that is, a higher weight may be given to the first semantic information of the description text. The voice text, by contrast, may contain useless content (for example, voice text converted from background music in the video data), meaning it has a lower degree of relation to the video data and therefore a lower degree of importance for the text feature vector corresponding to the video sample data; that is, a lower weight may be given to the second semantic information of the voice text.
Different weights are respectively given to the first semantic information, the second semantic information and the third semantic information, so that different degrees of emphasis can be performed on the description text, the voice text and the image text, and the contribution degree of different texts to the text feature vector corresponding to the video sample data can be controlled.
For example, the first semantic information, the second semantic information, and the third semantic information may be merged into the overall semantic information by adding an attention weight.
It should be noted that the present disclosure does not specifically limit the values of the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, or the third weight corresponding to the third semantic information, nor the specific manner of merging the first semantic information, the second semantic information, and the third semantic information into the overall semantic information according to these weights.
Step S320, determining a text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text and the feature vector of the image text.
In an example embodiment of the present disclosure, after the overall semantic information is obtained through the above steps, the text feature vector corresponding to the video sample data may be determined according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text. Specifically, the overall semantic information can be used as the semantic information of the video texts (the description text, the voice text, and the image text), and the feature vector of the description text, the feature vector of the voice text, the feature vector of the image text, and the overall semantic information are then spliced to obtain the text feature vector corresponding to the video sample data.
Furthermore, different weights may be given to the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text; these three feature vectors may be combined into an overall feature vector according to their respective weights, and the overall semantic information and the overall feature vector may then be combined to obtain the text feature vector corresponding to the video sample data.
It should be noted that, the specific manner of determining the text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text is not particularly limited in the present disclosure.
Through the above steps S310 to S320, the first semantic information, the second semantic information, and the third semantic information may be merged into the overall semantic information according to the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information, and the text feature vector corresponding to the video sample data may be determined according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text. Through this embodiment of the disclosure, different weights can be given to different types of texts, so that the trained article classification model has higher accuracy.
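A minimal sketch of steps S310 to S320, assuming the weighted merge is a simple weighted sum and the final text feature vector is a concatenation (the disclosure leaves both choices open); the weights, dimensions, and sequence lengths are illustrative:

```python
import torch

def merge_semantics(cls1, cls2, cls3, w1=0.5, w2=0.2, w3=0.3):
    # cls1/cls2/cls3: (batch, dim) semantic vectors of the description, voice,
    # and image texts; w1..w3 are the first, second, and third weights.
    return w1 * cls1 + w2 * cls2 + w3 * cls3  # overall semantic information [cls4]

def build_text_feature(cls4, emb1, emb2, emb3):
    # emb1/emb2/emb3: (batch, len_i, dim) token embeddings of the three texts.
    # Splice the three feature vectors and the overall semantic vector.
    return torch.cat([cls4.unsqueeze(1), emb1, emb2, emb3], dim=1)

cls_vectors = [torch.randn(2, 768) for _ in range(3)]
embeddings = [torch.randn(2, 8, 768) for _ in range(3)]
text_feature = build_text_feature(merge_semantics(*cls_vectors), *embeddings)  # (2, 25, 768)
```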
In an example embodiment of the present disclosure, a first loss function of a model to be trained may be determined according to an item class label and a predicted item class, and a neural network parameter of the model to be trained may be updated according to the first loss function of the model to be trained. Referring to fig. 4, updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained may include the following steps S410 to S420:
step S410, determining a first loss function of the model to be trained according to the item type label and the predicted item type;
step S420, updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained.
In an example embodiment of the present disclosure, after the predicted item category is obtained through the above steps, a first loss function of the model to be trained may be determined according to the item category label and the predicted item category. Specifically, the predicted article type obtained by the model to be trained may be compared with the article type label in the video sample data, the prediction difference between the predicted article type obtained by the model to be trained and the article type label may be calculated, and the first loss function of the model to be trained may be determined according to the prediction difference.
In an example embodiment of the present disclosure, after the first loss function of the model to be trained is obtained through the above steps, the model to be trained may be trained through the first loss function of the model to be trained, for example, a training gradient may be calculated through the first loss function, and a neural network parameter of the model to be trained is updated through the training gradient, so as to obtain an article classification model.
It should be noted that, the specific form of the first loss function and the specific manner of determining the first loss function of the model to be trained according to the item class label and the predicted item class are not particularly limited in this disclosure.
Through the steps S410 to S420, the first loss function of the model to be trained can be determined according to the item class label and the predicted item class, and the neural network parameters of the model to be trained are updated according to the first loss function of the model to be trained.
In an example embodiment of the present disclosure, the first loss function is an asymmetric loss function. The video sample data includes positive samples and negative samples, the exponential coefficient of the negative samples in the asymmetric loss function is larger than that of the positive samples, and a negative sample is removed when the prediction probability of the predicted article category corresponding to that negative sample is smaller than a preset threshold. Specifically, the first loss function may include a loss function based on focal loss, in which the positive samples and the negative samples are given exponential coefficients of different values (the exponential coefficient of the negative samples being greater than that of the positive samples) and negative samples that are easy to distinguish are eliminated, so that when the neural network parameters of the model to be trained are updated according to the first loss function, more attention is paid to the difficult samples, improving the classification accuracy of the article classification model. The first loss function in this embodiment is expressed as follows, where L_classify is the first loss function, γ+ is the exponential coefficient of the positive samples, γ- is the exponential coefficient of the negative samples, y is the true value, p is the predicted value, and m is the preset threshold:

L_classify = -y·(1-p)^(γ+)·log(p) - (1-y)·max(p-m, 0)^(γ-)·log(1-p)
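A direct PyTorch rendering of this first loss function; the γ values and the threshold m below are illustrative, since the disclosure does not fix them:

```python
import torch

def asymmetric_loss(p, y, gamma_pos=1.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    # p: predicted probabilities in (0, 1); y: multi-hot ground-truth labels.
    pos = y * (1 - p).pow(gamma_pos) * torch.log(p.clamp_min(eps))
    # max(p - m, 0) zeroes out easy negatives whose probability is below m.
    neg = (1 - y) * (p - m).clamp_min(0).pow(gamma_neg) * torch.log((1 - p).clamp_min(eps))
    return -(pos + neg).mean()

logits = torch.randn(4, 10)
labels = torch.randint(0, 2, (4, 10)).float()
l_classify = asymmetric_loss(torch.sigmoid(logits), labels)
```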
In an example embodiment of the disclosure, video sample data may be input into a momentum model; the momentum feature vector of the description text and fourth semantic information of the description text may be obtained, the momentum feature vector of the voice text and fifth semantic information of the voice text may be obtained, and the momentum feature vector of the image text and sixth semantic information of the image text may be obtained; a momentum text feature vector corresponding to the video sample data may be determined according to the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, and the momentum feature vector of the image text and the sixth semantic information of the image text; a momentum image feature vector corresponding to the video sample data may be obtained, and the momentum image feature vector and the momentum text feature vector may be fused to obtain a second fusion feature vector corresponding to the video sample data; a second loss function of the model to be trained may be determined according to the second fusion feature vector of the video sample data; and an overall loss function may be determined according to the first loss function of the model to be trained and the second loss function of the model to be trained, the neural network parameters of the model to be trained being updated through the overall loss function to obtain the article classification model. Referring to fig. 5, updating the neural network parameters of the model to be trained through the overall loss function to obtain the article classification model may include the following steps S510 to S560:
step S510, inputting video sample data into a momentum model;
in an example embodiment of the present disclosure, video sample data may be input into a momentum model, and the neural network parameters of the momentum model are updated in a sliding manner according to the change of the neural network parameters during the training of the model to be trained. Specifically, the momentum model is a model established with reference to the model to be trained, and its neural network parameters are updated in a sliding manner along with the iterative training of the model to be trained.
For example, suppose both the momentum model and the model to be trained include a neural network parameter A, a neural network parameter B, and a neural network parameter C. After the first training, the neural network parameters of the model to be trained are updated to A1, B1, and C1; after the second training, to A2, B2, and C2; and after the third training, to A3, B3, and C3. In this case, the neural network parameters of the momentum model may be set to the momentum of the neural network parameters of the model to be trained over the multiple trainings, for example to their sliding average: parameter A of the momentum model may be updated to (A1 + A2 + A3)/3, parameter B to (B1 + B2 + B3)/3, parameter C to (C1 + C2 + C3)/3, and so on.
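In practice such a sliding update is often realized as an exponential moving average (EMA); the sketch below shows that variant as an assumption, with an illustrative decay of 0.999:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def update_momentum_model(model: nn.Module, momentum_model: nn.Module, alpha: float = 0.999) -> None:
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        # p_m <- alpha * p_m + (1 - alpha) * p : slides toward the trained weights.
        p_m.mul_(alpha).add_(p, alpha=1 - alpha)

net = nn.Linear(8, 4)              # stand-in for the model to be trained
momentum_net = copy.deepcopy(net)  # the momentum model starts as a copy
# ... after each optimizer step on net:
update_momentum_model(net, momentum_net)
```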
It should be noted that the present disclosure does not specifically limit the manner in which the neural network parameters of the momentum model are updated in a sliding manner according to the neural network parameters of the model to be trained.
Step S520, acquiring a momentum feature vector of the description text and fourth semantic information of the description text, acquiring a momentum feature vector of the voice text and fifth semantic information of the voice text, and acquiring a momentum feature vector of the image text and sixth semantic information of the image text;
in an example embodiment of the present disclosure, the momentum feature vector of the description text and the fourth semantic information of the description text may be obtained through a momentum model, the momentum feature vector of the voice text and the fifth semantic information of the voice text may be obtained, and the momentum feature vector of the image text and the sixth semantic information of the image text may be obtained. Specifically, text data (description text, voice text, and image text) corresponding to the video sample data may be converted into a momentum text feature vector by a text encoder in the momentum model.
Specifically, the fourth semantic information of the description text may be used to indicate the semantics of the description text of the video data corresponding to the video sample data, where the semantics fuses the meanings of each word in the description text; the fifth semantic information of the voice text can be used for indicating the semantics of the voice text of the video data corresponding to the video sample data, and the semantics fuses the meanings of all the characters/words in the voice text; the sixth semantic information of the image text may be used to indicate the semantics of the image text of the video data corresponding to the video sample data, which fuses the meanings of the respective words/words in the image text.
For example, the description text in the video sample data may be converted into the momentum feature vector of the description text and the fourth semantic information of the description text may be extracted by a fourth text encoder in the momentum model; the voice text in the video sample data may be converted into the momentum feature vector of the voice text and the fifth semantic information of the voice text may be extracted by a fifth text encoder in the momentum model; and the image text in the video sample data may be converted into the momentum feature vector of the image text and the sixth semantic information of the image text may be extracted by a sixth text encoder in the momentum model. For example, the momentum feature vector of the description text may take the form of token embeddings.
It should be noted that, the specific manner of acquiring the momentum feature vector of the description text and the fourth semantic information of the description text, acquiring the momentum feature vector of the voice text and the fifth semantic information of the voice text, and acquiring the momentum feature vector of the image text and the sixth semantic information of the image text is not particularly limited in the present disclosure.
Step S530, determining a momentum text feature vector corresponding to video sample data according to the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, the momentum feature vector of the image text and the sixth semantic information of the image text;
in an example embodiment of the present disclosure, after the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, and the momentum feature vector of the image text and the sixth semantic information of the image text are obtained through the above steps, the momentum text feature vector corresponding to the video sample data may be determined according to them. Specifically, the momentum text feature vector corresponding to the video sample data may be used to indicate the text information indicated by the video data corresponding to the video sample data, where the text information includes information of the description text, the voice text, and the image text of the video data.
Specifically, the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, and the momentum feature vector of the image text and the sixth semantic information of the image text may be directly spliced to obtain the momentum text feature vector corresponding to the video sample data.
Alternatively, different fusion weights may be given to the feature vectors of the different texts, and the above momentum feature vectors and semantic information may be fused according to the fusion weight corresponding to the description text, the fusion weight corresponding to the voice text, and the fusion weight corresponding to the image text to obtain the momentum text feature vector corresponding to the video sample data.
Alternatively, different fusion weights may be given to the semantic information of the different texts, and the above momentum feature vectors and semantic information may be fused according to the fusion weight corresponding to the fourth semantic information, the fusion weight corresponding to the fifth semantic information, and the fusion weight corresponding to the sixth semantic information to obtain the momentum text feature vector corresponding to the video sample data.
It should be noted that the specific manner of determining the momentum text feature vector corresponding to the video sample data according to the above momentum feature vectors and semantic information is not particularly limited in the present disclosure.
Step S540, acquiring a momentum image characteristic vector corresponding to the video sample data, and fusing the momentum image characteristic vector and the momentum text characteristic vector to obtain a second fusion characteristic vector corresponding to the video sample data;
in an example embodiment of the present disclosure, a momentum image feature vector corresponding to video sample data may be obtained. Specifically, the momentum image feature vector corresponding to the video sample data refers to an image feature vector of the video data corresponding to the video sample data. The momentum image feature vector corresponding to the video sample data may be used to indicate image information of the video data corresponding to the video sample data.
For example, the video data corresponding to the video sample data may be converted into the momentum image feature vector corresponding to the video sample data by an image encoder in the momentum model.
Further, after the video sample data is obtained, the video data corresponding to the video sample data may be first divided into multiple frames of images, and then the momentum image feature vector corresponding to the video sample data is obtained according to the multiple frames of images.
It should be noted that, in the present disclosure, a specific manner of obtaining the momentum image feature vector corresponding to the video sample data and a specific form of the momentum image feature vector corresponding to the video sample data are not particularly limited.
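A sketch of this frame-splitting-then-encoding step; the frame count is illustrative, and the flatten-plus-linear encoder is only a stand-in for whatever image backbone the momentum model actually uses:

```python
import torch
import torch.nn as nn

def encode_video(frames: torch.Tensor, image_encoder: nn.Module) -> torch.Tensor:
    # frames: (num_frames, 3, H, W) images split from the video data.
    return image_encoder(frames)  # (num_frames, dim) one feature vector per frame

# Toy stand-in encoder; a real system would use a pretrained vision backbone.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
frames = torch.randn(16, 3, 224, 224)  # 16 frames sampled from one video
momentum_image_feature = encode_video(frames, image_encoder)  # (16, 768)
```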
In an example embodiment of the present disclosure, after the momentum text feature vector corresponding to the video sample data and the momentum image feature vector corresponding to the video sample data are obtained through the above steps, the momentum image feature vector and the momentum text feature vector may be fused to obtain a second fusion feature vector corresponding to the video sample data. Specifically, the momentum text feature vector corresponding to the video sample data and the momentum image feature vector corresponding to the video sample data can be directly fused to obtain a second fusion feature vector corresponding to the video sample data. For example, the momentum text feature vector corresponding to the video sample data and the momentum image feature vector corresponding to the video sample data may be fused in a dot product manner.
For example, the momentum image feature vector and the momentum text feature vector may be fused through a decoder structure of a Transformer model to obtain the second fusion feature vector corresponding to the video sample data: the momentum text feature vector may be used as the query, the momentum image feature vector may be used as the key and value, and the second fusion feature vector corresponding to the video sample data is obtained through cross-attention fusion.
Or, the momentum text feature vector corresponding to the video sample data and the momentum image feature vector corresponding to the video sample data may be fused according to the fusion weight corresponding to the momentum text feature vector and the fusion weight corresponding to the momentum image feature vector to obtain a second fusion feature vector corresponding to the video sample data.
It should be noted that, in the present disclosure, a specific manner of obtaining the second fusion feature vector corresponding to the video sample data by fusing the momentum image feature vector and the momentum text feature vector is not particularly limited.
Step S550, determining a second loss function of the model to be trained according to the second fusion feature vector of the video sample data;
in an example embodiment of the disclosure, after the predicted item category is obtained by predicting through the model to be trained and the item category pseudo label is obtained through the momentum model, a second loss function of the model to be trained may be determined according to the item category pseudo label and the predicted item category. Specifically, the second loss function is a loss function for the model to be trained, which is obtained according to the training of the momentum model.
In an example embodiment of the present disclosure, after the second loss function of the model to be trained is obtained through the above steps, the model to be trained may be trained through the second loss function of the model to be trained.
It should be noted that, the specific form of the second loss function and the specific manner of determining the second loss function of the model to be trained according to the second fusion feature vector of the video sample data are not particularly limited in this disclosure.
Step S560, determining an overall loss function according to the first loss function of the model to be trained and the second loss function of the model to be trained, and updating the neural network parameters of the model to be trained through the overall loss function to obtain the article classification model.
In an example embodiment of the present disclosure, after the first loss function and the second loss function are obtained through the above steps, an overall loss function may be determined through the first loss function of the model to be trained and the second loss function of the model to be trained, and the neural network parameters of the model to be trained are updated through the overall loss function to obtain the item classification model. Specifically, the overall loss function may be obtained by adding a first loss function of the model to be trained and a second loss function of the model to be trained.
It should be noted that, the specific form of the overall loss function and the specific manner of determining the overall loss function according to the first loss function of the model to be trained and the second loss function of the model to be trained are not particularly limited in this disclosure.
In an example embodiment of the present disclosure, after the overall loss function of the model to be trained is obtained through the above steps, the model to be trained may be trained through the overall loss function of the model to be trained, for example, a training gradient may be calculated through the overall loss function, and a neural network parameter of the model to be trained is updated through the training gradient, so as to obtain an article classification model.
It should be noted that the present disclosure does not specifically limit the manner of updating the neural network parameters of the model to be trained through the overall loss function.
Through the above steps S510 to S560, video sample data may be input into the momentum model; the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, and the momentum feature vector of the image text and the sixth semantic information of the image text may be obtained; the momentum text feature vector corresponding to the video sample data may be determined according to these momentum feature vectors and semantic information; the momentum image feature vector corresponding to the video sample data may be obtained and fused with the momentum text feature vector to obtain the second fusion feature vector corresponding to the video sample data; the second loss function of the model to be trained may be determined according to the second fusion feature vector of the video sample data; and the overall loss function may be determined according to the first loss function of the model to be trained and the second loss function of the model to be trained, the neural network parameters of the model to be trained being updated through the overall loss function to obtain the article classification model.
In an example embodiment of the present disclosure, the item class pseudo label may be obtained according to the second fusion feature vector of the video sample data, and the second loss function of the model to be trained is determined according to the item class pseudo label and the predicted item class. Referring to fig. 6, determining a second loss function of the model to be trained according to the item class pseudo label and the predicted item class may include the following steps S610 to S620:
step S610, obtaining an article type pseudo label according to the second fusion feature vector of the video sample data;
in an example embodiment of the present disclosure, after the second fused feature vector of the video sample data is obtained through the above steps, the item category pseudo tag may be obtained according to the second fused feature vector of the video sample data. Specifically, the momentum model may be configured to obtain a predicted item category according to the second fusion feature vector corresponding to the video sample data, where the predicted item category may be used to indicate a category of an item in the video data corresponding to the video sample data, and the momentum model has training knowledge of the model to be trained because the neural network parameter of the momentum model is updated in a sliding manner according to a training process of the model to be trained, so that the predicted item category obtained by the momentum model may be used as an item category pseudo tag, and the model to be trained is trained through the item category pseudo tag.
Specifically, a plurality of article type pseudo labels can be obtained according to the second fusion feature vector corresponding to the video sample data. For example, the video data corresponding to the video sample data may include a plurality of articles that respectively correspond to different article types; in this case, a plurality of article type pseudo labels corresponding to the articles may be obtained according to the second fusion feature vector corresponding to the video sample data. Alternatively, the article indicated in the video data may correspond to a plurality of article types, so a plurality of article type pseudo labels may likewise be obtained according to the second fusion feature vector corresponding to the video sample data.
In an example embodiment of the present disclosure, the momentum model may include a plurality of hidden layers, and the hidden layers may include a convolution layer, a normalization layer, an excitation layer, and the like. And sequentially inputting the second fusion feature vectors corresponding to the video sample data into a plurality of hidden layers of the momentum model to obtain hidden layer calculation results, and obtaining the article type pseudo label through the hidden layer calculation results.
It should be noted that, the specific manner of obtaining the item type pseudo tag according to the second fused feature vector of the video sample data is not particularly limited in the present disclosure.
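As a sketch, pseudo labels may be derived by thresholding the momentum model's per-category probabilities; the stand-in head, the sizes, and the 0.5 threshold are assumptions:

```python
import torch

@torch.no_grad()
def pseudo_labels(momentum_head, second_fused, threshold: float = 0.5):
    # second_fused: (batch, dim) second fusion feature vectors from the momentum model.
    probs = torch.sigmoid(momentum_head(second_fused))  # hidden-layer results -> probabilities
    return (probs > threshold).float()                  # multi-hot item-category pseudo labels

momentum_head = torch.nn.Linear(768, 10)  # stand-in for the momentum model's hidden layers
second_fused = torch.randn(4, 768)
y_pseudo = pseudo_labels(momentum_head, second_fused)  # (4, 10)
```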
Step S620, determining a second loss function of the model to be trained according to the item type pseudo label and the predicted item type.
In an example embodiment of the present disclosure, after the item class pseudo tag is obtained through the above steps, a second loss function of the model to be trained may be determined according to the item class pseudo tag and the predicted item class. Specifically, the predicted article type and the article type pseudo label obtained by the model to be trained may be compared, the prediction difference between the predicted article type and the article type pseudo label obtained by the model to be trained is calculated, and the second loss function of the model to be trained is determined according to the prediction difference.
It should be noted that, the specific form of the second loss function and the specific manner of determining the second loss function of the model to be trained according to the item class pseudo label and the predicted item class are not particularly limited in this disclosure.
Through the steps S610 to S620, the item class pseudo label can be obtained according to the second fusion feature vector of the video sample data, and the second loss function of the model to be trained is determined according to the item class pseudo label and the predicted item class.
In an example embodiment of the present disclosure, the second loss function is an asymmetric loss function. The video sample data includes positive samples and negative samples, and the exponential coefficient of the negative samples in the asymmetric loss function is larger than that of the positive samples; a negative sample is removed when the prediction probability of the predicted item category corresponding to that negative sample is smaller than the prediction probability of the item category pseudo label corresponding to it, and a positive sample is removed when the prediction probability of the predicted item category corresponding to that positive sample is larger than the prediction probability of the item category pseudo label corresponding to it. Specifically, the second loss function may include a loss function based on focal loss, in which the positive samples and the negative samples are given exponential coefficients of different values (the exponential coefficient of the negative samples being greater than that of the positive samples). The prediction probability of the predicted item category corresponding to a negative sample is compared with the prediction probability of the item category pseudo label corresponding to that negative sample; if the former is smaller, the prediction of the model to be trained is already sufficiently accurate, so the negative sample may be eliminated from the second loss function. Likewise, if the prediction probability of the predicted item category corresponding to a positive sample is greater than the prediction probability of the item category pseudo label corresponding to that positive sample, the prediction of the model to be trained is already sufficiently accurate, so the positive sample may be eliminated from the second loss function, thereby improving the classification accuracy of the item classification model. The second loss function in this embodiment is expressed as follows, where L_classify_distill is the second loss function, γ+ is the exponential coefficient of the positive samples, γ- is the exponential coefficient of the negative samples, y is the true value, p is the predicted value of the model to be trained, p′ is the predicted value of the momentum model, and clamp() limits a value to a given interval [min, max]:

L_classify_distill = -y·(p′ - clamp(p, max=p′))^(γ+)·log(p) - (1-y)·(clamp(p, min=p′) - p′)^(γ-)·log(1-p)
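A PyTorch rendering of this second loss function, using torch.minimum/torch.maximum for the two clamp() terms; the γ values are illustrative, and the overall loss would add this to the first loss function sketched earlier:

```python
import torch

def asymmetric_distill_loss(p, p_prime, y, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    # clamp(p, max=p') == torch.minimum(p, p'); clamp(p, min=p') == torch.maximum(p, p').
    pos_w = (p_prime - torch.minimum(p, p_prime)).pow(gamma_pos)  # 0 when p >= p' (positive removed)
    neg_w = (torch.maximum(p, p_prime) - p_prime).pow(gamma_neg)  # 0 when p <= p' (negative removed)
    pos = y * pos_w * torch.log(p.clamp_min(eps))
    neg = (1 - y) * neg_w * torch.log((1 - p).clamp_min(eps))
    return -(pos + neg).mean()

p = torch.sigmoid(torch.randn(4, 10))        # predictions of the model to be trained
p_prime = torch.sigmoid(torch.randn(4, 10))  # predictions of the momentum model
y = torch.randint(0, 2, (4, 10)).float()
l_distill = asymmetric_distill_loss(p, p_prime, y)
# The overall loss adds the first loss function from the earlier sketch:
# overall = l_classify + l_distill
```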
by the embodiment of the disclosure, the robustness of the article classification model can be improved, and the generalization capability of the article classification model is improved.
In an exemplary embodiment of the disclosure, the model to be trained may be trained through the first loss function and the second loss function obtained through the above steps. Specifically, the overall loss function may be obtained according to the first loss function and the second loss function, and the neural network parameters of the model to be trained are updated through the overall loss function, so as to obtain the article classification model.
In an exemplary embodiment of the disclosure, as shown in fig. 7, the description text, the voice text, and the image text corresponding to the video sample data are obtained and input into their respective text encoders to obtain the feature vector token embedding1 of the description text and the first semantic information [cls1] of the description text, the feature vector token embedding2 of the voice text and the second semantic information [cls2] of the voice text, and the feature vector token embedding3 of the image text and the third semantic information [cls3] of the image text. The obtained feature vectors and semantic information are then subjected to attention fusion: the first semantic information, the second semantic information, and the third semantic information are merged into the overall semantic information [cls4] according to the first weight corresponding to the first semantic information, the second weight corresponding to the second semantic information, and the third weight corresponding to the third semantic information, and the feature vector token embedding1 of the description text, the feature vector token embedding2 of the voice text, the feature vector token embedding3 of the image text, and the overall semantic information [cls4] are spliced to obtain the text feature vector corresponding to the video sample data. Meanwhile, the video data corresponding to the video sample data is divided into multiple frames of images and input into an image encoder to obtain the image feature vector [cls5] corresponding to the video sample data.
In an example embodiment of the present disclosure, video data may be acquired and input into an item classification model to obtain an item classification. Referring to fig. 8, inputting video data into the item classification model to obtain the item category may include the following steps S810 to S820:
step S810, acquiring video data;
and step S820, inputting the video data into the article classification model to obtain the article classification.
In an example embodiment of the present disclosure, video data may be acquired. Specifically, the video data may be used to indicate one or more items; the video data may be input into the item classification model trained through the above steps, and the item category corresponding to the one or more items may be output.
For example, the item type corresponding to each item may be output for each of a plurality of items, or a plurality of item types may be output for the same item.
It should be noted that the number of types of articles is not specifically limited in this disclosure.
In an example embodiment of the present disclosure, after the item type is obtained through the above steps, the business function may be implemented according to the item type. For example, a plurality of item links associated with the item type may be retrieved and pushed to the user client so that the user clicks on the item link to learn about the related item.
It should be noted that the present disclosure is not limited to specific types of service functions.
Through the steps S810 to S820, the video data may be acquired, and the video data may be input into the article classification model to obtain the article type.
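A sketch of steps S810 to S820 at inference time, assuming the trained model outputs per-category logits for a multi-label decision; the stand-in model, category names, and 0.5 threshold are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def classify_items(model, video_batch, category_names, threshold: float = 0.5):
    probs = torch.sigmoid(model(video_batch))  # (batch, num_categories)
    # Return every category whose probability clears the threshold (multi-label).
    return [
        [category_names[i] for i, p in enumerate(sample) if p > threshold]
        for sample in probs
    ]

model = nn.Linear(768, 3)      # stand-in for the trained item classification model
videos = torch.randn(2, 768)   # stand-in for preprocessed video features
print(classify_items(model, videos, ["shoes", "phone", "book"]))
```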
According to the article classification model training method provided in this exemplary embodiment, video sample data may be acquired and input into a model to be trained; the feature vector of the description text and first semantic information of the description text may be acquired, the feature vector of the voice text and second semantic information of the voice text may be acquired, and the feature vector of the image text and third semantic information of the image text may be acquired; a text feature vector corresponding to the video sample data may be determined according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text; an image feature vector corresponding to the video sample data may be acquired according to the image data in the video sample data, and the image feature vector and the text feature vector may be fused to obtain a first fusion feature vector corresponding to the video sample data; a predicted article category may be obtained according to the first fusion feature vector corresponding to the video sample data, and the neural network parameters of the model to be trained may be updated according to the article category label and the predicted article category to obtain the article classification model.
According to the embodiments of the present disclosure, the semantic information corresponding to different text feature vectors can be fused, the influence of the semantic information corresponding to each text on the classification result can be taken into account, the accuracy of the classification result output by the model is improved, and commodity recommendation is made more accurate.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the disclosure and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
In addition, in an exemplary embodiment of the disclosure, an article classification model training device is also provided. Referring to fig. 9, an article classification model training apparatus 900 includes: a sample data acquisition unit 910, a semantic information acquisition unit 920, a text feature acquisition unit 930, a feature fusion unit 940, and a parameter update unit 950.
The sample data acquisition unit is configured to acquire video sample data, where the video sample data includes text data, image data, and an article category label, and the text data includes a description text, a voice text, and an image text. The semantic information acquisition unit is configured to input the video sample data into a model to be trained, acquire a feature vector of the description text and first semantic information of the description text, acquire a feature vector of the voice text and second semantic information of the voice text, and acquire a feature vector of the image text and third semantic information of the image text. The text feature acquisition unit is configured to determine a text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text. The feature fusion unit is configured to acquire an image feature vector corresponding to the video sample data according to the image data in the video sample data, and fuse the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data. The parameter updating unit is configured to obtain a predicted article category according to the first fusion feature vector corresponding to the video sample data, and update the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain an article classification model.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, when the text feature vector corresponding to the video sample data is determined according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text, the apparatus further includes: a semantic merging unit configured to merge the first semantic information, the second semantic information, and the third semantic information into overall semantic information according to a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information, and a third weight corresponding to the third semantic information; and a text feature vector determining unit configured to determine the text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text, and the feature vector of the image text.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the neural network parameters of the to-be-trained model are updated according to the item class label and the predicted item class, and the apparatus further includes: a first loss function determination unit configured to perform a first loss function determination of the model to be trained according to the item class label and the predicted item class; a first loss function training unit configured to perform updating of neural network parameters of the model to be trained according to a first loss function of the model to be trained.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the first loss function is an asymmetric loss function; the video sample data comprise positive samples and negative samples, the exponential coefficient of the asymmetric loss function for the negative samples is larger than that of the positive samples, and the negative samples are removed when the prediction probability of the predicted article types corresponding to the negative samples is smaller than a preset threshold value.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the neural network parameters of the model to be trained are updated according to the first loss function of the model to be trained, and the apparatus further includes: a momentum model input unit configured to perform input of video sample data into a momentum model; the neural network parameters of the momentum model are updated in a sliding mode according to the change of the neural network parameters in the training process of the model to be trained; the second semantic information acquisition unit is configured to execute acquisition of a momentum feature vector of the description text and fourth semantic information of the description text, acquisition of a momentum feature vector of the voice text and fifth semantic information of the voice text, and acquisition of a momentum feature vector of the image text and sixth semantic information of the image text; the second text feature vector determining unit is configured to determine a momentum text feature vector corresponding to the video sample data according to the momentum feature vector describing the text and fourth semantic information describing the text, the momentum feature vector of the voice text and fifth semantic information of the voice text, the momentum feature vector of the image text and sixth semantic information of the image text; the second fusion feature vector acquisition unit is configured to acquire a momentum image feature vector corresponding to the video sample data, and fuse the momentum image feature vector and the momentum text feature vector to obtain a second fusion feature vector corresponding to the video sample data; the integral loss function acquisition unit is configured to determine a second loss function of the model to be trained according to a second fusion feature vector of the video sample data; and the training unit is configured to determine an overall loss function according to the first loss function of the model to be trained and the second loss function of the model to be trained, and update the neural network parameters of the model to be trained through the overall loss function to obtain the article classification model.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, a second loss function of the model to be trained is determined according to a second fused feature vector of the video sample data, and the apparatus further includes: the item type pseudo label acquisition unit is configured to execute obtaining of an item type pseudo label according to a second fusion feature vector of the video sample data; and the second loss function determining unit is configured to execute second loss function determination of the model to be trained according to the item class pseudo label and the predicted item class.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the second loss function is an asymmetric loss function; the video sample data comprises positive samples and negative samples, the exponential coefficient of the asymmetric loss function for the negative samples is larger than that of the positive samples, when the prediction probability of the predicted item type corresponding to the negative samples is smaller than that of the item type pseudo label corresponding to the negative samples, the negative samples are removed, and when the prediction probability of the predicted item type corresponding to the positive samples is larger than that of the item type pseudo label corresponding to the positive samples, the positive samples are removed.
Since each functional module of the article classification model training apparatus in the exemplary embodiment of the present disclosure corresponds to a step of the above article classification model training method in the exemplary embodiment, please refer to the above article classification model training method in the embodiment of the present disclosure for details that are not disclosed in the embodiment of the apparatus in the present disclosure.
In addition, in an exemplary embodiment of the present disclosure, an article classification apparatus is also provided. Referring to fig. 10, an article classification apparatus 1000 includes: a video acquisition unit 1010 and an article category acquisition unit 1020.
Wherein the video acquisition unit is configured to acquire video data, the video data including text data and image data, and the text data including a description text, a voice text and an image text; and the article category acquisition unit is configured to input the video data into an article classification model to obtain an article category, wherein the article classification model is trained by the article classification model training method according to any of the above embodiments.
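As a usage illustration only, here is a minimal sketch of the apparatus flow (acquire video data, then obtain the article category from the trained model); get_description, speech_to_text, ocr_text and sample_frames are hypothetical helpers standing in for the three text sources and the image data, and the model's input signature is likewise an assumption.

```python
import torch


def classify_video(article_model, video):
    # Hypothetical helpers, not part of the disclosure: they stand in for
    # the description text, voice text (ASR), image text (OCR) and frames.
    text_data = {
        "description": get_description(video),  # description text
        "speech": speech_to_text(video),        # voice text
        "ocr": ocr_text(video),                 # image text
    }
    image_data = sample_frames(video)           # image data
    with torch.no_grad():
        logits = article_model(text_data, image_data)
    return logits.argmax(dim=-1)                # predicted article category
```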
Since each functional module of the article classification apparatus in the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the article classification method, for details not disclosed in this apparatus embodiment, please refer to the embodiment of the article classification method in the present disclosure.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method for training an article classification model is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 1100 according to such an embodiment of the disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the electronic device 1100 is embodied in the form of a general purpose computing device. The components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one storage unit 1120, a bus 1130 connecting different system components (including the storage unit 1120 and the processing unit 1110), and a display unit 1140.
The storage unit stores program code, which may be executed by the processing unit 1110 to cause the processing unit 1110 to perform the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification. For example, the processing unit 1110 may perform step S210 shown in fig. 2, acquiring video sample data, the video sample data including text data, image data and an article category label, and the text data including a description text, a voice text and an image text; step S220, inputting the video sample data into a model to be trained, acquiring a feature vector of the description text and first semantic information of the description text, acquiring a feature vector of the voice text and second semantic information of the voice text, and acquiring a feature vector of the image text and third semantic information of the image text; step S230, determining a text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text; step S240, acquiring an image feature vector corresponding to the video sample data according to the image data in the video sample data, and fusing the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data; and step S250, obtaining a predicted article category according to the first fusion feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain an article classification model.
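As a hedged illustration of step S230 (and of the weighted combination of the three kinds of semantic information described earlier), the following PyTorch sketch combines the first, second and third semantic information by fixed weights into overall semantic information and uses it to pool the three text feature vectors; the fixed weights, the shared dimensionality, and the attention-style pooling are all assumptions, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F


def text_feature_vector(desc_vec, speech_vec, ocr_vec,
                        desc_sem, speech_sem, ocr_sem,
                        weights=(0.5, 0.25, 0.25)):
    # Combine the first, second and third semantic information by their
    # weights into overall semantic information (assumed fixed weights).
    w1, w2, w3 = weights
    overall_sem = w1 * desc_sem + w2 * speech_sem + w3 * ocr_sem  # (B, D)

    # Use the overall semantic information to pool the three text feature
    # vectors into one text feature vector for the video sample data.
    feats = torch.stack([desc_vec, speech_vec, ocr_vec], dim=1)   # (B, 3, D)
    scores = torch.einsum("bnd,bd->bn", feats, overall_sem)       # relevance
    attn = F.softmax(scores, dim=1)
    return torch.einsum("bn,bnd->bd", attn, feats)                # (B, D)
```

Step S240 could then be realized, for example, by concatenating the resulting text feature vector with the image feature vector before the classification head; this, too, is an assumption rather than a stated design.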
Alternatively, the processing unit 1110 may also perform step S810 shown in fig. 8, acquiring video data; and step S820, inputting the video data into the article classification model to obtain the article category.
As another example, the electronic device may implement the steps shown in figs. 2 and 8.
The storage unit 1120 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 1121 and/or a cache memory unit 1122, and may further include a read only memory unit (ROM) 1123.
The storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus 1130 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, and a processor or a local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1170 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1100, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 over the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment, there is also provided a computer-readable storage medium comprising instructions, such as a memory comprising instructions, which are executable by a processor of an apparatus to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program/instructions which, when executed by a processor, implement the article classification model training method or the article classification method of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method for training an item classification model, the method comprising:
acquiring video sample data; the video sample data comprises text data, image data and an article category label, and the text data comprises a description text, a voice text and an image text;
inputting the video sample data into a model to be trained, acquiring a feature vector of the description text and first semantic information of the description text, acquiring a feature vector of the voice text and second semantic information of the voice text, and acquiring a feature vector of the image text and third semantic information of the image text;
determining a text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text;
acquiring an image feature vector corresponding to the video sample data according to image data in the video sample data, and fusing the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data;
and obtaining a predicted article category according to the first fusion feature vector corresponding to the video sample data, and updating the neural network parameters of the model to be trained according to the article category label and the predicted article category to obtain an article classification model.
2. The method according to claim 1, wherein the determining a text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text comprises:
combining the first semantic information, the second semantic information and the third semantic information into integral semantic information according to a first weight corresponding to the first semantic information, a second weight corresponding to the second semantic information and a third weight corresponding to the third semantic information;
and determining a text feature vector corresponding to the video sample data according to the overall semantic information, the feature vector of the description text, the feature vector of the voice text and the feature vector of the image text.
3. The method of claim 1, wherein updating the neural network parameters of the model to be trained according to the article category label and the predicted article category comprises:
determining a first loss function of the model to be trained according to the article category label and the predicted article category;
and updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained.
4. The method of claim 3, wherein the first loss function is an asymmetric loss function; the video sample data comprises a positive sample and a negative sample, the exponential coefficient of the asymmetric loss function for the negative sample is larger than that of the positive sample, and the negative sample is removed when the prediction probability of the predicted article category corresponding to the negative sample is smaller than a preset threshold value.
5. The method of claim 3, wherein the updating the neural network parameters of the model to be trained according to the first loss function of the model to be trained comprises:
inputting the video sample data into a momentum model; wherein the neural network parameters of the momentum model are updated in a sliding manner according to changes of the neural network parameters during the training of the model to be trained;
acquiring a momentum feature vector of the description text and fourth semantic information of the description text, acquiring a momentum feature vector of the voice text and fifth semantic information of the voice text, and acquiring a momentum feature vector of the image text and sixth semantic information of the image text;
determining a momentum text feature vector corresponding to the video sample data according to the momentum feature vector of the description text and the fourth semantic information of the description text, the momentum feature vector of the voice text and the fifth semantic information of the voice text, and the momentum feature vector of the image text and the sixth semantic information of the image text;
acquiring a momentum image feature vector corresponding to the video sample data, and fusing the momentum image feature vector and the momentum text feature vector to obtain a second fusion feature vector corresponding to the video sample data;
determining a second loss function of the model to be trained according to a second fusion feature vector of the video sample data;
and determining an overall loss function according to the first loss function of the model to be trained and the second loss function of the model to be trained, and updating the neural network parameters of the model to be trained through the overall loss function to obtain the article classification model.
6. The method of claim 5, wherein determining a second loss function of the model to be trained from a second fused feature vector of the video sample data comprises:
obtaining an article category pseudo label according to the second fusion feature vector of the video sample data;
and determining a second loss function of the model to be trained according to the article category pseudo label and the predicted article category.
7. The method of claim 6, wherein the second loss function is an asymmetric loss function; the video sample data comprises a positive sample and a negative sample, and the exponential coefficient of the asymmetric loss function for the negative sample is greater than that for the positive sample; the negative sample is removed when the prediction probability of the predicted article category corresponding to the negative sample is less than that of the article category pseudo label corresponding to the negative sample, and the positive sample is removed when the prediction probability of the predicted article category corresponding to the positive sample is greater than that of the article category pseudo label corresponding to the positive sample.
8. An article classification method, the method comprising:
acquiring video data; the video data comprises text data and image data, and the text data comprises description texts, voice texts and image texts;
inputting the video data into an article classification model to obtain an article category; wherein the article classification model is trained by the article classification model training method according to any one of claims 1 to 7.
9. An article classification model training apparatus, comprising:
a sample data acquisition unit configured to acquire video sample data; the video sample data comprises text data, image data and an article category label, and the text data comprises a description text, a voice text and an image text;
a semantic information obtaining unit configured to input the video sample data into a model to be trained, obtain a feature vector of the description text and first semantic information of the description text, obtain a feature vector of the voice text and second semantic information of the voice text, and obtain a feature vector of the image text and third semantic information of the image text;
a text feature obtaining unit configured to determine a text feature vector corresponding to the video sample data according to the feature vector of the description text and the first semantic information of the description text, the feature vector of the voice text and the second semantic information of the voice text, and the feature vector of the image text and the third semantic information of the image text;
the feature fusion unit is configured to execute obtaining an image feature vector corresponding to the video sample data according to image data in the video sample data, and fuse the image feature vector and the text feature vector to obtain a first fusion feature vector corresponding to the video sample data;
and the parameter updating unit is configured to execute obtaining of a predicted article category according to the first fusion feature vector corresponding to the video sample data, and update the neural network parameter of the model to be trained according to the article category label and the predicted article category to obtain an article classification model.
10. An article classification apparatus, comprising:
a video acquisition unit configured to perform acquisition of video data; the video data comprises text data and image data, and the text data comprises description text, voice text and image text;
an article category acquisition unit configured to input the video data into an article classification model to obtain an article category; wherein the article classification model is trained by the article classification model training method according to any one of claims 1 to 7.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the article classification model training method of any one of claims 1 to 7 or the article classification method of claim 8.
12. A computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the article classification model training method of any one of claims 1 to 7 or the article classification method of claim 8.
CN202211160689.5A 2022-09-22 2022-09-22 Article classification model training method, article classification device and medium Pending CN115482490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211160689.5A CN115482490A (en) 2022-09-22 2022-09-22 Article classification model training method, article classification device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211160689.5A CN115482490A (en) 2022-09-22 2022-09-22 Article classification model training method, article classification device and medium

Publications (1)

Publication Number Publication Date
CN115482490A true CN115482490A (en) 2022-12-16

Family

ID=84393966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211160689.5A Pending CN115482490A (en) 2022-09-22 2022-09-22 Article classification model training method, article classification device and medium

Country Status (1)

Country Link
CN (1) CN115482490A (en)

Similar Documents

Publication Publication Date Title
US11276099B2 (en) Multi-perceptual similarity detection and resolution
CN110555469A (en) Method and device for processing interactive sequence data
US11599927B1 (en) Artificial intelligence system using deep neural networks for pairwise character-level text analysis and recommendations
WO2021155691A1 (en) User portrait generating method and apparatus, storage medium, and device
CN113705299A (en) Video identification method and device and storage medium
CN111930915B (en) Session information processing method, device, computer readable storage medium and equipment
CN112231569A (en) News recommendation method and device, computer equipment and storage medium
CN111625645A (en) Training method and device of text generation model and electronic equipment
CN111831826A (en) Training method, classification method and device of cross-domain text classification model
CN107291774B (en) Error sample identification method and device
CN115203539B (en) Media content recommendation method, device, equipment and storage medium
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN110059172B (en) Method and device for recommending answers based on natural language understanding
CN113297525B (en) Webpage classification method, device, electronic equipment and storage medium
CN112989182A (en) Information processing method, information processing apparatus, information processing device, and storage medium
CN116756281A (en) Knowledge question-answering method, device, equipment and medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN115482490A (en) Article classification model training method, article classification device and medium
CN113688938A (en) Method for determining object emotion and method and device for training emotion classification model
CN112446738A (en) Advertisement data processing method, device, medium and electronic equipment
CN111062201A (en) Method and apparatus for processing information
CN111783808A (en) Method and apparatus for generating information
CN115935937A (en) Text processing method and device, readable medium and electronic equipment
CN115982425A (en) Recommendation model training method and device, recommendation method and device, electronic equipment and storage medium
CN115687624A (en) Text classification method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination