CN114357204A

CN114357204A - Media information processing method and related equipment

Info

Publication number: CN114357204A
Application number: CN202111414198.4A
Authority: CN
Inventors: 李和瀚; 孙万祺; 贺文嵩; 顾晓光
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2022-04-15
Anticipated expiration: 2041-11-25
Also published as: CN114357204B

Abstract

The application relates to the technical field of artificial intelligence, and discloses a media information processing method and related equipment, wherein the method comprises the following steps: acquiring a description text and a description image from the media information; performing feature extraction on the description image to obtain image features; splicing the image characteristics and the word vectors corresponding to all the words in the description text to obtain splicing characteristics corresponding to the media information; performing label prediction by the emotion label prediction model according to the splicing characteristics corresponding to the media information, and outputting an emotion label corresponding to the media information; according to the emotion labels corresponding to the media information in the media information set, the display sequence of the media information including the appointed emotion labels in the media information set is disordered to obtain a target display sequence corresponding to the media information set; the scheme enriches the mode of determining the display sequence of the media information.

Description

Media information processing method and related equipment

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a media information processing method and related devices.

Background

In the related art, when displaying a plurality of pieces of media information (e.g., news), the display order of the plurality of pieces of media information is generally determined according to the ranking of the media information in hot, and the display order of the plurality of pieces of media information is determined only in this way, which has a problem that the display order of the plurality of pieces of media information is determined in a single way.

Disclosure of Invention

In view of the foregoing problems, embodiments of the present application provide a method and an apparatus for processing media information, an electronic device, and a storage medium, so as to improve the foregoing problems.

According to an aspect of an embodiment of the present application, there is provided a method for processing media information, the method including: acquiring a description text and a description image from the media information; performing feature extraction on the description image to obtain image features; splicing the image characteristics and the word vectors corresponding to all the words in the description text to obtain splicing characteristics corresponding to the media information; performing label prediction by an emotion label prediction model according to the splicing characteristics corresponding to the media information, and outputting an emotion label corresponding to the media information; and according to the emotion labels corresponding to the media information in the media information set, disordering the display sequence of the media information including the appointed emotion labels in the media information set to obtain a target display sequence corresponding to the media information set.

According to an aspect of an embodiment of the present application, there is provided a media information processing apparatus, including: the acquisition module is used for acquiring the description text and the description image from the media information; the image feature extraction module is used for extracting features of the description image to obtain image features; the splicing module is used for splicing the image characteristics and the word vectors respectively corresponding to all the words in the description text to obtain splicing characteristics corresponding to the media information; the label prediction module is used for performing label prediction according to the splicing characteristics corresponding to the media information by the emotion label prediction model and outputting the emotion label corresponding to the media information; and the display sequence determining module is used for disordering the display sequence of the media information including the appointed emotion label in the media information set according to the emotion label corresponding to each media information in the media information set to obtain the target display sequence corresponding to the media information set.

In some embodiments of the present application, the display order determination module comprises: the sorting unit is used for sorting the media information in the media information set according to the initial information display sequence corresponding to the media information set to obtain an initial media information sorting; the adjusting unit is used for adjusting the initial media information sequence according to the emotion tags corresponding to the media information in the media information set, so that the number of the media information of the appointed emotion tags included in the emotion tags which are continuous in the adjusted sequence is not more than N, wherein N is a positive integer; and determining the adjusted initial media information sequence as a target display sequence corresponding to the media information set.

In some embodiments of the present application, the apparatus for processing media information further includes: the sorting score calculation module is used for calculating the sorting score corresponding to each piece of media information according to the emotion label corresponding to each piece of media information in the media information set to be sorted and the initial sorting parameter corresponding to each piece of media information; and the information display sequence determining module is used for sequencing the media information according to the sequencing scores corresponding to the media information to obtain the information display sequence corresponding to the media information set to be sequenced.

In some embodiments of the present application, the media information processing apparatus further comprises: and the control module is used for controlling the investigation feedback content not to be displayed in the detail page of the media information according to the first emotion label if the emotion label corresponding to the media information comprises the first emotion label.

In some embodiments of the present application, the emotion tag prediction model comprises a feature fusion network and a classification layer; a label prediction module comprising: the feature fusion unit is used for performing feature fusion by the feature fusion network according to the splicing features corresponding to the media information and outputting fusion features corresponding to the media information; and the emotion label classification unit is used for classifying emotion labels by the classification layer according to the fusion characteristics and outputting emotion labels corresponding to the media information.

In some embodiments of the present application, the apparatus for processing media information further includes: the pre-training module is used for pre-training the feature fusion network through first training data; the first training data comprises a plurality of first image-text pairs, wherein the first image-text pairs comprise a first sample description image and a first sample description text which are derived from the same first sample media information, and the first sample description text is obtained by performing partial mask processing on an initial sample description text; and the secondary training module is used for carrying out secondary training on the emotion label prediction model through second training data, wherein the second training data comprise a plurality of second sample media information and labeled emotion labels corresponding to the second sample media information.

In some embodiments of the present application, the pre-training module comprises: a first feature extraction unit, configured to perform feature extraction on the first sample description image to obtain an image feature of the first sample description image; the first splicing unit is used for splicing the image characteristics of the first sample description image and the word vectors corresponding to all the words in the first sample description text to obtain the splicing characteristics corresponding to the first sample media information; the first feature fusion unit is used for performing feature fusion by the feature fusion network according to the splicing feature corresponding to the first sample media information and outputting a fusion feature corresponding to the first sample media information; the first prediction unit is used for predicting the masked words by the appointed classification layer according to the fusion characteristics corresponding to the first sample media information to obtain predicted masked words; a first loss calculation unit configured to calculate a first loss based on the predicted masked word and an actual masked word in the first sample description text; and the first adjusting unit is used for reversely adjusting the parameters of the feature fusion network according to the first loss until a first training end condition is reached.

In some embodiments of the present application, the apparatus for processing media information further comprises: the second splicing unit is used for splicing the image characteristics of the first sample description image and the word vectors corresponding to all the words in the initial sample description text to obtain first splicing characteristics; the contribution weight determining unit is used for predicting the contribution weight of each word of the initial sample description text to emotion label prediction according to the first splicing characteristics by a key component labeling model; the key component labeling model is obtained through training of third training data, the third training data comprise a plurality of third sample description texts and label information corresponding to each third sample description text, and the label information is used for indicating whether each word in the corresponding third sample description text is a keyword of an emotion label prediction task or not; a target word determination unit for determining a target word to be masked from words whose corresponding contribution weights are not lower than a weight threshold; and the mask processing unit is used for performing mask processing on the target words in the initial sample description text to obtain the first sample description text.

In some embodiments of the present application, the secondary training module comprises: an acquisition unit configured to acquire a second sample description text and a second sample description image from second sample media information; the second feature extraction unit is used for performing feature extraction on the second sample description image to obtain image features corresponding to the second sample description image; a third splicing unit, configured to splice image features corresponding to the second sample description image and word vectors corresponding to all words in the second sample description text, respectively, to obtain splicing features corresponding to the second sample media information; the second prediction unit is used for performing label prediction by the emotion label prediction model according to the splicing characteristics corresponding to the second sample media information and outputting a predicted emotion label corresponding to the second sample media information; a second loss calculating unit, configured to calculate a second loss according to the predicted emotion tag corresponding to the second sample media information and the labeled emotion tag corresponding to the second sample media information; and the second adjusting unit is used for reversely adjusting the parameters of the emotion label prediction model according to the first loss until a second training end condition is reached.

In some embodiments of the present application, the apparatus for processing media information further comprises: the candidate emotion label acquisition module is used for acquiring at least two groups of candidate emotion labels corresponding to the second sample media information; the at least two groups of candidate emotion labels are obtained by carrying out emotion label labeling on the second sample media information by different labeling personnel, and each group of candidate emotion labels corresponds to one labeling personnel; and the marked emotion label determining module is used for determining the candidate emotion labels as the marked emotion labels corresponding to the second sample media information if the at least two groups of candidate emotion labels are the same.

According to an aspect of an embodiment of the present application, there is provided an electronic device including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method of processing media information as described above.

According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, implement a method for processing media information as described above.

According to an aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a method of processing media information as described above.

According to the scheme, after the emotion labels corresponding to the media information in the media information set are obtained through prediction, the display sequence of the media information including the appointed emotion labels in the media information set is disordered according to the emotion labels, the target display sequence corresponding to the media information set is obtained, and the mode of determining the display sequence of the media information is enriched. Because the display order of the media information including the designated emotion labels in the media information set is disturbed, the media information including the designated emotion labels is not continuously and intensively displayed according to the target display order, so that the user experience can be improved.

Moreover, in the application, the emotion label prediction is performed by combining the description text and the description image in the media information under the two modalities, compared with the information under one modality, such as only the description text or only the description image, the characteristics comprehensively expressed by the description text and the description image are richer and more comprehensive, and the characteristics under different modalities can be mutually verified and enhanced, so that the emotion label predicted by combining the description text and the description image under the two modalities is more accurate.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 is a schematic diagram illustrating an application scenario of the present solution according to an embodiment of the present application.

Fig. 2 is a flowchart illustrating a method of processing media information according to an embodiment of the present application.

FIG. 3 is a model diagram illustrating an emotion tag prediction model according to an embodiment of the present application.

FIG. 4 is a flowchart illustrating steps prior to step 240 according to one embodiment of the present application.

FIG. 5 is a schematic diagram illustrating training of an emotion label prediction model according to an embodiment of the present application.

FIG. 6 is a flowchart illustrating step 410 according to an embodiment of the present application.

FIG. 7 is a flowchart illustrating steps prior to step 410 according to one embodiment of the present application.

Fig. 8 is a flowchart illustrating a masking process according to an embodiment of the present application.

FIG. 9 is a schematic diagram illustrating a key component labeling model for contribution weight prediction according to an embodiment of the present application.

FIG. 10 is a flowchart illustrating step 420 according to an embodiment of the present application.

FIG. 11 is a flowchart illustrating emotion tag labeling according to an embodiment of the present application.

FIG. 12 is a block diagram illustrating an apparatus for processing media information according to one embodiment.

FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

It should be noted that: reference herein to "a plurality" means two or more. "and/or" describe the association relationship of the associated objects, meaning that there may be three relationships, e.g., A and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

Fig. 1 is a schematic diagram illustrating an application scenario of the present solution according to an embodiment of the present application. As shown in fig. 1, the application scenario may include a terminal 110 and a server 120, and the terminal 110 and the server 120 may establish a communication connection through a wired network or a wireless network. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.

The terminal 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted terminal, a smart television, or other electronic devices that can interact with a user. The terminal 110 may run an application program, which may be a news client program, a client program of an interactive aggregation platform, or other client programs integrating a media information display function. In the present embodiment, the media information may be news, discussion topics, and the like, and is not particularly limited herein.

The server 120 may be configured to perform the method of the present application, perform emotion tag prediction on each media information in the media information set, and according to the predicted emotion tag, scramble a display order of the media information in the media information set, where the emotion tag includes a specified emotion tag, and determine a target display order corresponding to the media information set; thereafter, the server 120 may push all media information in the media information set to the terminal 110, and sequentially display the media information in the media information set in the terminal 110 according to the determined target display order.

In some examples, the terminal 110 may send the media information request to the server 120 after detecting the media information request operation of the user, so that the server 120 sends the media information in the media information set to the terminal 110, and sends indication information indicating a target display order to the terminal 110, so that the terminal sequentially displays the media information in the media information set according to the target display order.

In some embodiments, the media information request operation may be a page refresh operation, an operation of entering a specified page, or an operation of triggering a request control, which is not specifically limited herein.

In some embodiments, the media information in the media information set may be the media information that is screened out according to the popularity of the media information and has popularity exceeding a set popularity threshold, and the media information in the media information set is the media information to be displayed.

In some implementations, the server 120 may perform media information filtering from a media information database according to the media information request after receiving the media information request sent by the terminal 110, and construct a media information set based on the filtered media information. The server 120 may obtain the user information of the corresponding user in combination with the user identifier carried in the media information request, and screen the media information from the media information database based on the user information. The user information may include the age, sex, location, user preference tag, historical browsing media information, historical collection media information, etc. of the user, so that media information with high matching degree with the user may be filtered out based on the user information and added to the media information set.

It is understood that in the embodiments of the present application, data related to user information (such as age, sex, location, preference label, etc.) and the like are involved, when the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.

It should be noted that the method of the present application is not limited to be executed by the server 120, and if the processing capability of the terminal 110 meets the requirement, the method of the present application may be executed by the terminal 110, or the terminal 110 and the server 120 may execute the method of the present application together, for example, after the server 120 determines the emotion tag corresponding to each media information in the media information set, each media information in the media information set and the corresponding emotion tag are sent to the terminal 110, the terminal 110 breaks up the display order of the media information in the media information set, in which the emotion tag includes the designated emotion tag, according to the corresponding emotion tag, determines the target display order corresponding to the media information set, and displays the media information in the media information set in the target display order in the display interface of the terminal 110.

The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:

fig. 2 is a flowchart illustrating a method for processing media information according to an embodiment of the present application, where the method may be performed by an electronic device with processing capability, and the electronic device may be the server or the terminal shown in fig. 1, that is, the method may be performed by the server or the terminal, or may be performed by both the server and the terminal, and is not limited in particular herein. Referring to fig. 2, the method includes at least steps 210 to 250, which are described in detail as follows:

in step 210, the description text and the description image are obtained from the media information.

The media information may be news, discussion topics, etc., and the form of the media information may be text-combined images, text-combined videos, text-combined images and videos, which is not particularly limited herein.

In some embodiments, the description image may be a cover image of the media information, an image embedded in the media information, or a frame of video extracted from a video included in the media information, and the description image may be one or two or more.

In some embodiments, the description text may be all text or partial text in the media information, and when the total text content of the media information is more, the description text may be at least one of title text of the media information, abstract text of the media information, and keywords of the media information, and may also be a combination of a classification tag corresponding to the media information and at least one of the title text, the abstract text, and the keywords of the media information.

In some embodiments, step 210 comprises: extracting a cover image of the media information as a description image; acquiring a title text and a category label corresponding to the media information from the media information; and combining the title text and the category label corresponding to the media information to obtain a description text.

The category tag is used to indicate a media category to which the media information belongs, and generally, for media information such as news, the media information is generally classified according to a content subject of the media information, and after the media category to which the media information belongs is determined, the media information is added with a corresponding category tag. The media information is classified according to the content subject, so that the category label of the media information can reflect the subject content of the media information to a certain extent. Media categories such as science category, education category, finance category, sports category, property category, health category, and the like. The description text may include one category label or a plurality of category labels.

In some embodiments, if the media information is classified in multiple levels, the category label of the media information corresponds to the category label included in the multiple levels of classification, and the category label included in the description text may only include one level of category label of the media information, or may include multiple category labels in the multiple levels of classification, which is not specifically limited herein.

And step 220, performing feature extraction on the description image to obtain image features.

In some embodiments, the feature extraction may be performed on the description image by an image feature extraction network, and the features extracted by the image feature extraction network for the description image are referred to as image features of the description image.

The image feature extraction network may be constructed by a convolutional neural network, a long-term memory neural network, a cyclic neural network, a fully-connected network, and the like, and is not particularly limited herein.

In some embodiments, the image feature extraction network may also be a neural network model that is disclosed in the related art and can be used for image feature extraction, for example, the image feature extraction network may be an effecentnet-4B (baseline model), the effecentnet-4B includes 7 convolution modules connected end to end, and an input module and an output module with fixed structures are further provided, the length and width of an output feature map of each convolution module is half of or the same as that of the input feature map, and an operator combination contained in each convolution module needs to be searched in a subspace; as another example, the image feature extraction Network may be FR-CNN (fast regional full Convolutional Neural Network).

And step 230, splicing the image features and the word vectors respectively corresponding to all the words in the description text to obtain splicing features corresponding to the media information.

Word vector, also known as word embedding (word embedding), refers to a vectorized representation of a character resulting from converting the character to a real vector space. In particular embodiments, Word vectors describing words in the text may be generated by a Word2Vec model (distributed Word vector model), a bag of words model, and the like. In other embodiments, the word vector may also be a one-hot encoded vector of words.

In the splicing feature, the sequence of each word vector in the splicing feature is the same as the sequence of the corresponding word in the description text.

And 240, performing label prediction by the emotion label prediction model according to the splicing characteristics corresponding to the media information, and outputting the emotion label corresponding to the media information.

The emotion tag prediction model may be constructed by a convolutional neural network, a full-connection network, a cyclic neural network, a feed-forward neural network, a long-term and short-term memory neural network, and the like, which is not specifically limited herein.

In some embodiments, the emotion tag prediction model comprises a feature fusion network and a classification layer; step 240, including: performing feature fusion by the feature fusion network according to the splicing features corresponding to the media information, and outputting fusion features corresponding to the media information; and the classification layer classifies the emotion labels according to the fusion characteristics and outputs the emotion labels corresponding to the media information.

The feature fusion network performs feature fusion on the image features of the description images and the word vectors of the words in the description texts, so that the information of the image modality and the information of the text model are fused, and the obtained fusion features can more accurately express the features of the media information.

In some embodiments, the feature fusion network may be constructed by a convolutional neural network, a fully-connected network, a cyclic neural network, a feed-forward neural network, a long-term memory neural network, or the like, and of course, the feature fusion may be performed by using a neural network model known in the related art.

In some embodiments, the feature fusion network may be a BERT model (a deep Bidirectional transducer neural network model, Bidirectional Encoder responses from transducers), a Transformer model (a transducer model), a Bidirectional long-term and short-term memory neural network, and the like, which are not particularly limited herein.

The classification layer may include a fully-connected network layer, and specifically may include one or more fully-connected network layers, where the number of neurons in each fully-connected network layer may be set according to actual needs. In a specific embodiment, the classification layer may include a fully connected network layer and an output layer, wherein a classification function, such as Sigmoid, may be disposed in the output layer.

Emotion tags are used to indicate a user's tendency or feelings to media information, such as emotion tags indicating positive energy, negative energy, emotional exaggeration, country wind, down-regulation, youth, and like tendency emotions. In the present application, the emotion tag may also be referred to as a somatosensory stroke tag for indicating a somatosensory stroke, which is a specific feeling given to a user by media information or a recognized style of the user.

In some embodiments, the emotion tag output by the emotion tag prediction model for the same media information may be one or more. When the emotion label prediction model outputs a plurality of emotion labels for the same media information, it can be understood that the emotion label prediction model predicts emotion labels under multiple categories, and for the emotion labels under each category, two categories are performed in the classification layer. In this case, each dimension in the multi-dimensional feature vector output by the last fully-connected network layer in the classification layer corresponds to an emotion tag for a category. For example, categories such as sentiment tags include a first category, a second category, a third category, where sentiment tags under the first category include sentiment tags indicative of negative energy and sentiment tags indicative of non-negative energy; the emotion labels in the second category comprise emotion labels indicating low key and emotion labels indicating non-low key; the emotion tags under the third category include emotion tags indicating rural wind and emotion tags indicating non-rural wind.

FIG. 3 is a model diagram illustrating an emotion tag prediction model according to an embodiment of the present application. In the embodiment shown in FIG. 3, the feature fusion network is a BERT model. In FIG. 3, [ CLS]To start the identifier, [ SEP]To separate identifiers, [ EOS]To end the identifier, E (E)_[CLS]、E_[IMG]、E_[SEP]、E_{Electric power}、E_Letter、E_{Fraud prevention}、E_Cheat、…E_Cheat、E_Money、E_{Society of society}、E_{Will be provided with}、E_—、E_[EOS]Etc.) represent the corresponding embedded representation, in particular, for text, E is the word vector for the corresponding word, E_[IMG]For the image features of the corresponding description image, since the Position of the word in the text has an important influence on the text semantics, the input of the BERT model includes, in addition to the splicing features obtained by splicing the word vector and the image features, Position encodings (Position encodings) of the word vector and the image features in the splicing features, and further, in order to facilitate the model to distinguish the features describing the image from the features describing the text, the input of the BERT model also includes segment encodings, for example, in the order shown in fig. 3, [ CLS "]]、[IMG]、[ESP]The corresponding segment is coded as 0, followed by a word vector describing each word in the text and an [ EOS]The corresponding segment is coded as 1.

After feature fusion is performed on the BERT model, the hidden layer features corresponding to each character in the description text (namely H in FIG. 3) are output_[EOS]、H_[CLS]、H_[IMG]、H_[SEP]、H_{Electric power}、H_Letter、H_{Fraud prevention}、H_Cheat、H_Cheat、H_Money、H_{Society of society}、H_{Will be provided with}、H_—、H_[EOS]) It will be appreciated that the hidden layer features are a characteristic representation of the word under the influence of image features and the contextual influence of the descriptive text. Then, the classification layer classifies the hidden layer characteristics and outputsSentiment tags under each category.

Please refer to fig. 2, in step 250, the display order of the media information including the designated emotion tag in the media information set is scrambled according to the emotion tag corresponding to each media information in the media information set, so as to obtain the target display order corresponding to the media information set.

The media information set comprises a plurality of pieces of media information to be displayed. In the related art, the display order of the media information is generally determined by the popularity of the information (e.g., popularity of news, popularity of discussion of topics), and in this case, there may be a case where the tendency feeling of a plurality of media information in succession to the user is the same or similar.

If the tendentiousness feelings of a plurality of continuous media information to the user are negative energy, namely the plurality of media information with negative energy are intensively displayed, if the situation occurs, the emotion of the user is easy to be excited, and then an overstimulation behavior occurs, such as the posting of overstimulated comments and the like, therefore, in the scheme of the application, the display sequence of the media information including the specified emotion label can be disordered according to the predicted emotion label corresponding to each media information.

The step of disordering the display sequence of the media information of which the emotion labels comprise the designated emotion labels means that the media information comprising the designated emotion labels is dispersed into other media information not comprising the designated emotion labels, so that the media information comprising the designated emotion labels in the display sequence is prevented from being displayed in a centralized manner. In the present embodiment, the step of scrambling the display order of the media information including the designated emotion label may be understood as the step of scattering the media information including the designated emotion label in the display order.

The designated emotion tags may be set according to actual needs, and may be one or multiple, and in some embodiments, the designated emotion tags may be emotion tags indicating negative energy, emotion tags indicating exaggerated emotion, and the like, which are not specifically limited herein.

And the target display sequence is used for determining the display sequence of each piece of media information in the media information set after the display sequence of the pieces of media information including the designated emotion labels is disturbed.

In some embodiments of the present application, step 250 comprises: sorting the media information in the media information set according to the initial information display sequence corresponding to the media information set to obtain an initial media information sorting; adjusting the initial media information sequence according to the emotion tags corresponding to the media information in the media information set, so that the number of the media information of which the emotion tags which are continuous in the adjusted sequence comprise the appointed emotion tags is not more than N, wherein N is a positive integer; and determining the adjusted initial media information sequence as a target display sequence corresponding to the media information set.

N may be set according to actual needs, for example, if N is 1, it indicates that when the media information in the media information set is displayed according to the target display order, the media information of the two emotion tags including the designated emotion tag is separated by one or more media information not including the designated emotion tag, so that the two media information including the designated emotion tag are not adjacent in the display order. Of course, N may also be set to other values, such as 2, 3 …, and the like.

In some embodiments, the initial information display order corresponding to the media information set is used to indicate the initial display order of the media information in the media information set. For example, when the media information is sorted according to the degree of hotness of the media information, the initial information display order may be an order in which all the media information in the media information set is determined according to the degree of hotness.

In some embodiments, the initial display order corresponding to the media information sets may also be a display order determined by randomly ordering the media information sets.

In some embodiments, the initial display order corresponding to the media information sets may also be an order determined by ranking the user characteristics of a user from high to low.

In some embodiments, the display order of the media information including the specified emotion tag in the media information set is disturbed, or the media information not including the specified emotion tag in the media information set is sorted according to a set rule, and then the media information interval including the specified emotion tag is inserted between the media information not including the specified emotion tag, so that the number of the continuous media information including the specified emotion tag does not exceed the number of the media information.

For example, if the media information in the media information set includes media information a1, media information a2, media information B1, media information B2, and media information B3, where the emotion tags corresponding to media information a1 and media information a2 include a specified emotion tag, and the emotion tags corresponding to media information B1, media information B2, and media information B3 do not include a specified emotion tag, the media information that does not include the specified emotion tag is sorted first to obtain the following sort: media information B2 → media information B1 → media information B3, and on the basis of this, media information a1 and media information a2 can be inserted between any two pieces of media information in the above-described ranking, respectively, and: the media information a1 → the media information B2 → the media information B1 → the media information a2 → the media information B3, however, the insertion positions of the media information a1 and the media information a2 are not limited to the above examples, which are only illustrative examples and should not be construed as limiting the scope of the application.

In some embodiments, the display order of the media information including the designated emotion tag in the media information set may be scrambled, or the media information including the designated emotion tag in the media information set may be sorted according to a set rule, a first sorting is assumed, and then each piece of media information not including the designated emotion tag is inserted into the first sorting, so that at least two adjacent pieces of media information including the designated emotion tag in the first sorting are spaced apart by the media information not including the designated emotion tag.

In some embodiments, after step 250, the media information in the set of media information and the indication information indicating the determined target display order are sent to the client to cause the media information in the set of media information to be displayed in the user interface of the client in the determined target display order.

In some embodiments, the scheme of the present application may be applied in a scenario of information stream pushing (e.g., topic pushing, news pushing), and the method of the present application is adopted to scramble a display order of media information including a specified emotion tag in a plurality of media information pushed each time.

In other embodiments of the present application, after step 240, the method further comprises: calculating a ranking score corresponding to each media information according to the emotion label corresponding to each media information in the media information set to be ranked and the initial ranking parameter corresponding to each media information; and sorting the media information according to the sorting scores corresponding to the media information to obtain the information display sequence corresponding to the media information set to be sorted.

In this embodiment, the predicted emotion tag corresponding to the media information may be applied to the media information sorting process, that is, the initial sorting parameter corresponding to the media information and the emotion tag corresponding to the media information are integrated to sort.

The initial ranking parameter corresponding to the media information may be a parameter representing the heat corresponding to the media information, such as the number of browsed times in unit time, the number of comments in unit time, and the like. The initial ranking parameter may also be a matching parameter of the media information to a user characteristic of the user.

In some embodiments, a mapping relationship between the emotion tag and the first ranking score may be preset, and a mapping relationship between the initial ranking parameter and the second ranking score may be set, so that the first ranking score corresponding to a piece of media information may be determined according to the emotion tag corresponding to the piece of media information; determining a second sorting score corresponding to the media information according to the initial sorting parameter corresponding to the media information; and then weighting the first ranking score corresponding to the media information and the second ranking score corresponding to the media information, and taking the weighting result as the ranking score corresponding to the media information. On the basis, all the media information in the media information set to be sorted can be sorted according to the sequence of the sorting scores from high to low or from low to high, and the information display sequence corresponding to the media information to be sorted is obtained.

In some embodiments, when the predicted emotion tags include multiple categories of emotion tags, the emotion tags under all categories of a media information may be applied to determine the information display sequence, and some categories (e.g., one category, two categories, etc.) may be selected from the emotion tags and applied to the information display sequence.

In some embodiments, the media information sets to be sorted may be the media information sets in the above embodiments, and the information display order corresponding to the determined media information sets to be sorted may be an initial information display order corresponding to the media information sets in the above embodiments. Of course, in other embodiments, the set of media information to be sorted may also be a different set than the above set of media information.

In this embodiment, the media information in the media information set to be sorted is sorted in combination with the emotion tag corresponding to the media information, and the media information is sorted in combination with the emotion tag capable of showing the tendency emotion of the user to the media information, so that the user experience can be improved.

In some embodiments of the present application, after step 240, the method further comprises: and if the emotion label corresponding to the media information comprises a first emotion label, controlling the investigation feedback content not to be displayed in the detail page of the media information according to the first emotion label.

The content of the research feedback in the detail page of the media information is used for the user to input feedback, such as a satisfaction research, etc., for the media information.

The first emotion tag may be set according to actual needs, and the first emotion tag may be an emotion tag corresponding to an emotion that easily causes a large fluctuation in the emotion of the user, for example, the first emotion tag may be an emotion tag that indicates negative energy. The first emotion tag may be the same as or different from the designated emotion tag, and is not particularly limited herein.

In some embodiments, if the emotion tag corresponding to the media information includes the first emotion tag, a specified mark is added to the media information according to the first emotion tag, so that after the media information is sent to the client, the investigation feedback content is not displayed in the detail page of the media information if the client detects that the media information includes the specified mark. Otherwise, if the emotion tag corresponding to the media information does not comprise the first emotion tag, controlling to display the research feedback content in the detail page of the media information.

Because the emotion fluctuation is large when a user sees some media information with large negative energy, and an overexcitation behavior may occur, for example, the research feedback content in the detail page inputs the feedback content that the utterance is overexcited, or expresses an irrational or overexcited viewpoint, and further, the reference value of the input feedback for the media information is not high, in the scheme, whether the research feedback content is displayed in the detail page of the media information can be flexibly determined based on the predicted emotion tag, the amount of collected inaccurate research feedback information is reduced, and the subsequent processing amount is reduced.

In some embodiments, if the emotion tag of the media information corresponding to the currently entered detail page includes a first emotion tag, a first quantity of the media information including the first emotion tag continuously browsed by the user history may be counted in combination with the browsing record of the user, and if the first quantity exceeds a quantity threshold, the investigation feedback content is not displayed in the detail page of the current media information; conversely, if the first number does not exceed the number threshold, then the content of the research feedback may be displayed in a detail page of the current media information.

In some embodiments of the present application, as shown in fig. 4, prior to step 240, the method further comprises:

step 410, pre-training the feature fusion network through the first training data; the first training data comprises a plurality of first image-text pairs, the first image-text pairs comprise a first sample description image and a first sample description text which are derived from the same first sample media information, and the first sample description text is obtained by performing partial mask processing on the initial sample description text.

In the present application, for the sake of distinction, the description image used for pre-training the feature fusion network is referred to as a first sample description image, the description text used for pre-training the feature fusion network is referred to as a first sample description text, and the media information from which the first sample description text (or the initial sample description text) and the first sample description image originate is referred to as first sample media information.

The partial masking processing is to add a mask to partial words in the initial sample description text or replace the partial words with other words, and predict masked words to be masked through a feature fusion network in the pre-training process. Through the pre-training process, the feature fusion network can learn general semantic knowledge information, and the semantic understanding performance of the feature fusion network is improved.

In a specific embodiment, the partial masking processing is performed on the initial sample description text, and may be to replace a word determined to be masked in the initial sample description text with a specified masking tag, where the specified masking tag is, for example, [ MASK ], or replace a word determined to be masked with an arbitrary word, or leave a word determined to be masked unchanged. Replacing the word which is determined to be subjected to mask processing with any word in order to enable the feature fusion network to learn automatic error correction according to the context information; it is determined that the words that need to be masked remain unchanged to mitigate the performance penalty of the deviation between the training text and the predicted text.

In some embodiments, the ratio of words to be masked may be preset, for example, 15%, and the ratio of samples corresponding to each partial masking may be set, for example, if the ratio is set for 80% of training samples: replacing the word determined to need masking with [ MASK ]; for 10% of training samples: keeping unchanged the words determined to need to be masked; for 10% of training samples: the words determined to need to be masked are replaced with arbitrary words. Of course, the set sample proportion may be adjusted according to actual needs, which is merely an illustrative example and is not considered to limit the application scope of the present application, and in other embodiments, the partial masking processing method may be selected to perform partial masking processing on the initial sample description text when the requirement is met.

In some embodiments, the words to be masked in the initial sample description text may be randomly determined, or the words that are critical to emotion tag prediction may be determined by a downstream task guidance, and the scheme of determining the words to be masked by a specific downstream task guidance is described below.

And 420, performing secondary training on the emotion label prediction model through second training data, wherein the second training data comprises a plurality of second sample media information and labeled emotion labels corresponding to the second sample media information.

The second sample media information is media information for performing secondary training on the emotion label prediction model. And the labeled emotion label is an emotion label labeled on the second sample media information. And performing secondary training on the emotion label prediction model to enable the emotion label prediction model to have the capability of accurately predicting the emotion labels of the media information according to the splicing characteristics.

In this embodiment, the feature fusion network in the emotion tag prediction model is pre-trained, so that the semantic feature extraction performance of the feature fusion network is improved, and then secondary training is performed, so that the prediction accuracy of the emotion tag prediction model after training is finished can be ensured.

In practice, the performance of the emotion label prediction model trained by the training method of this embodiment is tested, statistics are performed on the accuracy, recall rate and F1 score of emotion labels in four categories, i.e., "whether country wind is present", "whether the country is low", "whether the emotion is young" and "whether the emotion is exaggerated", and the statistical results are shown in table 1 below:

TABLE 1

	Rate of accuracy	Recall rate	F1 score
				All categories	85.4	85.4	85.4
Whether or not to village wind	85.8	78.4	81.9
				Whether or not to have low tone	76.2	66.7	71.1
Whether or not to youthful	94.3	97.3	95.8
				Whether the mood is exaggerated	86.2	91.6	88.8

As can be seen from table 1 above, by using the mode of combining pre-training with secondary training, the emotion label prediction accuracy, recall rate and F1 score of the emotion label prediction model under each category are all high, and it can be seen that the prediction effect of the emotion label prediction model can be effectively ensured by using the training mode.

FIG. 5 is a schematic diagram illustrating the training of an emotion label prediction model according to an embodiment of the present application, as shown in FIG. 5, including:

step 510, constructing data; in this step, the primary purpose is to construct first training data for pre-training and second training data for secondary training. And obtaining task marking data by marking the data. Specifically, the data labeling includes labeling words needing to be masked in the initial sample description text, and labeling emotion labels corresponding to the second media information. Therefore, the words which are determined by the marking initial sample description text and need to be masked are subjected to masking processing to obtain a first sample description text, and therefore first training data are constructed. And combining the second media information and the labeled emotion label corresponding to the second media information to construct second training data. It is to be understood that the task annotation data in fig. 5 includes first training data and second training data.

Step 520, pre-training. In the step, the feature fusion network is pre-trained through the first training data, and the pre-trained feature fusion network is obtained. In this embodiment, for the sake of distinction, a model composed of a classification layer and a feature fusion network at the end of pre-training is referred to as a pre-training model.

Step 530, training for the second time; in this step, the pre-training model obtained after the pre-training is finished is trained by the second training data. For the sake of easy discrimination, the pre-training model after the second training is called an on-line model. It is understood that, in this embodiment, the pre-training model and the online model both refer to the emotion label prediction model in the present application, that is, the pre-training model is the emotion label prediction model at the end of pre-training, and the online model is the emotion label prediction model at the end of secondary training. After the secondary training is completed, the emotion label prediction model can be applied on line.

In the application, the feature fusion network in the emotion tag prediction model is pre-trained to enable the feature fusion network in the emotion tag prediction model to have the capability of accurately extracting semantic features, and then the emotion tag prediction model is secondarily trained by combining second training data related to the emotion tag prediction task, so that after the secondary training is finished, the emotion tag prediction model can accurately extract semantic features in the emotion tag prediction task and conduct emotion tag prediction, and the prediction accuracy of the model is improved.

In some embodiments of the present application, as shown in fig. 6, step 410, comprises:

and step 610, performing feature extraction on the first sample description image to obtain the image features of the first sample description image. In a specific embodiment, the image feature of the first sample description image may be extracted through an image feature extraction network.

And step 620, splicing the image features of the first sample description image and the word vectors respectively corresponding to all the words in the first sample description text to obtain the splicing features corresponding to the first sample media information.

And 630, performing feature fusion by the feature fusion network according to the splicing feature corresponding to the first sample media information, and outputting a fusion feature corresponding to the first sample media information.

And step 640, performing masked word prediction by the specified classification layer according to the fusion characteristics corresponding to the first sample media information to obtain a predicted masked word.

In the application, the designated classification layer is different from a classification layer in the emotion label prediction model, and it can be understood that a classification task of the designated classification layer is different from a classification task corresponding to the classification layer in the emotion label prediction model, so in the application, in order to pre-train a feature fusion network in the emotion label prediction model, the designated classification layer for performing masked word prediction is additionally deployed.

Step 650 calculates a first penalty based on the predicted masked word and the actual masked word in the first sample description text.

In some embodiments, a loss function may be set in advance for the pre-training task, and for the purpose of distinguishing, the loss function is referred to as a first loss function, so that a function value of the first loss function is calculated based on the predicted masked word and the actual masked word in the first sample description text, and the first loss is obtained.

In a specific embodiment, the first loss function may be a cross entropy loss function, a logarithmic loss function, a square loss function, or the like, and is not particularly limited herein.

And 660, reversely adjusting parameters of the feature fusion network according to the first loss until a first training end condition is reached.

In the present application, the training end condition set by the pre-training process is referred to as a first training end condition. The first training end condition may be convergence of the first loss function, or may be that the number of iterations reaches a set first-time threshold. After the parameters of the feature fusion network are reversely adjusted, the feature fusion network after the parameters are adjusted and the specified classification layer predict the masked words again, and the process is repeated, so that the feature fusion network has the capability of semantic feature extraction, and the semantic features of the description text are extracted by better combining the image features of the description images in the media information.

In addition, in this embodiment, in the pre-training stage, the image feature of the first sample description image is also used as one of the inputs of the feature fusion network, so that the semantics of the first sample description text can be proved and jointly inferred in combination with the image feature of the first sample description image, and the training effect of the feature fusion network is improved.

In some embodiments of the present application, as shown in fig. 7, prior to step 410, the method further comprises:

and step 710, splicing the image features of the first sample description image and the word vectors corresponding to all the words in the initial sample description text to obtain first splicing features.

Step 720, predicting the contribution weight of each word of the initial sample description text to the emotion label prediction by the key component labeling model according to the first splicing characteristics; the key component labeling model is obtained through training of third training data, the third training data comprise a plurality of third sample description texts and label information corresponding to each third sample description text, and the label information is used for indicating whether each word in the corresponding third sample description text is a keyword of an emotion label prediction task or not.

In the text, the same word contributes differently to different tasks, and the contributions of different words to the same task are also different, for example, words such as "smile", "tear", "cry" are more critical for the emotion classification task, whereas in the relationship extraction task, the predicate and verb are relatively more important. Therefore, in this embodiment, the contribution weight of each word in the initial sample description text to emotion tag prediction is predicted, and the word to be masked is determined according to the contribution weight.

The key component labeling model is a neural network model used for predicting contribution weight of words to emotion label prediction tasks, and can be constructed by one or more of a convolutional neural network, a cyclic neural network, a long-time memory neural network, a full-connection network and a feedforward neural network. Of course, in other embodiments, a neural network model known in the related art may be applied to perform the contribution weight prediction in step 720 after being trained with the training data related to the key component labeling task.

In some embodiments, the key component labeling model may be categorized into two categories, such that the contribution weights that the key component labeling model may output include a first contribution weight indicating being a keyword (e.g., the first contribution weight may be 1), and a second contribution weight indicating not being a keyword (the second contribution weight may be 0).

In particular embodiments, the key component labeling model may include a depth feature extraction network, which may be a BERT model or a Transformer model, and a weight classification layer, which may be a fully-connected network layer or other classification layer with classification functions deployed.

In some embodiments, the key component annotation model may be trained by the following process: inputting text embedding features corresponding to the third sample description text (the text embedding features can be obtained by splicing word vectors corresponding to all words in the third sample description text) into a key component labeling model, outputting prediction contribution weights of all words in the third sample description text to the emotion label prediction task by the key component labeling model according to the text embedding features, calculating third loss according to the predicted prediction contribution weights and labeling information, and reversely adjusting model parameters of the key component labeling model according to the third loss until a third training end condition is reached.

In some embodiments, the third training data may further include a third sample description image derived from the same media information as the third sample description text, so that the image features of the third sample description image are also input to the key component annotation model, so that the trained key component annotation model accurately combines the image features and the text features to predict the contribution weight of each word in the text for emotion tag prediction.

In step 730, the target word to be masked is determined from the words whose corresponding contribution weights are not lower than the weight threshold.

And if the corresponding contribution weight is not lower than the weight threshold, the word is the key word for predicting the emotion label. The weight threshold may be set according to actual needs, and is not specifically limited herein. It will be appreciated that the determined target words may be words for which all or part of the initial sample description text contributes with a weight not below the weight threshold. Specifically, the total number of words to be masked may be determined according to the set ratio of the masked words in the text and the total number of words in the initial sample description text, and then, a corresponding number of words whose contribution weights are not lower than the weight threshold may be selected as the target words.

Step 740, performing mask processing on the target words in the initial sample description text to obtain a first sample description text. The specific process of masking is described above, and is not described herein again.

In this embodiment, a pre-training mode of a selective mask (selected mask) guided by a downstream task is adopted, and through predicting the contribution weight of each word in an initial sample description text to an emotion tag prediction task, a target word with a higher contribution weight is determined as a word to be masked, so that the training effect of a feature fusion network can be improved, and the emotion tag prediction accuracy of an emotion tag prediction model can be improved.

In practice, the performance of the emotion label prediction model obtained by adopting the pre-training mode of the selective mask guided by the downstream task is tested, statistics are carried out on the accuracy, recall rate and F1 score of emotion labels under four categories of 'country wind', 'down-regulation', 'youth' and 'emotion exaggeration', and the statistical results are shown in the following table 2:

TABLE 2

	Rate of accuracy	Recall rate	F1 score
				All categories	85.1	87.1	86.1
Whether or not to village wind	81.0	84.6	82.8
				Whether or not to have low tone	75.7	69.5	72.4
Whether or not to youthful	92.8	98.2	95.4
				Whether the mood is exaggerated	87.5	91.7	89.5
Whether negative energy is present	82.5	83.2	82.8

As can be seen from the above table 2, by adopting the pre-training mode of the selected mask guided by the downstream task, the accuracy of the emotion label prediction model can be improved by 2.3%, and the recall rate can be improved by 0.1%.

Fig. 8 is a flowchart illustrating a masking process according to an embodiment of the present application. In fig. 8, the emotion tag prediction task markup corpus includes a plurality of third sample description texts and markup emotion tags corresponding to media information from which the third sample descriptions are derived. The key component labeling corpus comprises a plurality of third sample description texts and label information corresponding to each third sample description text. The random corpus comprises a plurality of initial sample description texts to be subjected to mask processing. The key components are expected to include corpora that have determined the contribution weights of words in the initial sample description text to the emotion tag prediction.

Specifically, the contribution degree score of each word in the sample description text to the emotion tag prediction task can be calculated by labeling linguistic data of the emotion tag prediction task.

Assume that the third sample description text s includes n words, s ═ w₁，w₂，w₃，...，w_n) S (w) is used to predict the importance of each word to the emotion label_i) It is shown that then an auxiliary text sequence s 'is constructed (s' is initially empty), and each step sequentially throws a word in the third sample description text s into the auxiliary text sequence s ', e.g. if it is currently step i, then the current auxiliary sequence is s'_i-1w_iWherein, s'_i-1Is formed by w₁、w₂、w₃.. to w_i-1Formed sequence of words w_iScoring the contribution of emotion tag prediction by S (w)_i) Comprises the following steps:

S(w_i)＝P(y_t|s)-P(y_t|s′_i-1w_i)；

wherein, y_tLabel emotion label for third sample description text, P (y)_t| s) represents the confidence that the third sample description text s can obtain the correct emotion label; p (y)_t|s′_i-1w_i) Denotes a given inclusion w_iThe partial sequence inside can derive the confidence of the correct emotion label. Wherein, S (w)_i) The smaller, the more indicated is w_iThe more important, i.e. plus w_iThe closer the later auxiliary sequence contributes to the task to the original sequence.

After the contribution degree of each word in the third sample description text s to the emotion label prediction is calculated and scored, keyword labeling may be performed on each word in the third sample description text, that is, whether each word in the third sample description text is a keyword for emotion label prediction is labeled, so as to obtain label information corresponding to the third sample description text.

On the basis, the third sample description text and the corresponding label information are used for training the key component labeling model. After training is finished, inputting the initial sample description text in the random corpus into the key component labeling model, and outputting the contribution weight of each word of the initial sample description text to emotion label prediction.

FIG. 9 is a schematic diagram illustrating a key component labeling model for contribution weight prediction according to an embodiment of the present application. As shown in FIG. 9, for "motorcycle tow a girl tens of meters! Girl cry is an initial sample description text, and the associated first sample is used for describing the image characteristic E of the image_[IMG]Word vectors corresponding to the respective characters (i.e., E in FIG. 9)_[CLS]、E_[SEP]、E_{Massage device}、E_Support、E_{Vehicle with wheels}、E_Mop、E_{Line of}…、E_Woman、E_{Children's toy}、E_Tragedy、E_Cry、E_[EOS]) Inputting the corresponding position codes and the corresponding segmentation codes into a BERT model, and performing feature fusion by the BERT model to obtain hidden layer features (namely H in figure 9) corresponding to each character_[EOS]、H_[CLS]、H_[IMG]、H_[SEP]、H_{Massage device}、H_Mop、H_{Vehicle with wheels}、H_Mop、H_{Line of}、...、H_Woman、H_{Children's toy}、H_Tragedy、H_Cry、H_[EOS])。

And then, carrying out weight classification by the weight classification layer according to the hidden layer characteristics corresponding to the characters to obtain the contribution weight of each character of the initial sample description text to emotion label prediction. In this embodiment, the weight classification layer performs two classifications to obtain the contribution weights corresponding to the words, and as shown in fig. 9, the predicted contribution weights corresponding to the four words "drag", "go", "tragic" and "crying" are 1, and the other are all 0, which indicates that the keywords predicted for the emotion tag in the initial sample description text include the four words "drag", "go" and "tragic" and "crying".

Based on this, one or more words can be selected from the four keywords as the target word, in fig. 8, the "cry" word is selected as the target word, and after the target word is masked, as shown in fig. 8, the first sample description text is "tens of meters of motorcycle dragging girl! Girl tragedy [ MASK ] ".

In some embodiments of the present application, as shown in fig. 10, step 420, comprises:

step 1010, obtaining a second sample description text and a second sample description image from the second sample media information.

And step 1020, performing feature extraction on the second sample description image to obtain image features corresponding to the second sample description image.

And step 1030, splicing the image features corresponding to the second sample description image and the word vectors corresponding to all the words in the second sample description text to obtain the splicing features corresponding to the second sample media information.

Step 1040, label prediction is carried out by the emotion label prediction model according to the splicing characteristics corresponding to the second sample media information, and a predicted emotion label corresponding to the second sample media information is output.

And 1050, calculating a second loss according to the predicted emotion tag corresponding to the second sample media information and the labeled emotion tag corresponding to the second sample media information.

In some embodiments, a loss function may be set in advance for the secondary training task, and for convenience of distinction, the loss function is referred to as a second loss function, so that a function value of the second loss function is calculated based on the predicted emotion tag corresponding to the second sample media information and the labeled emotion tag corresponding to the second sample media information, so as to obtain a second loss.

In a specific embodiment, the second loss function may be a cross-entropy loss function, a logarithmic loss function, a square loss function, an nlllos function (Negative log-likelihood loss function), and the like, and is not limited herein.

And step 1060, reversely adjusting parameters of the emotion label prediction model according to the first loss until a second training end condition is reached.

In the present application, the training end condition set in the secondary training process is referred to as a second training end condition. The second training end condition may be that the second loss function converges, or that the number of iterations reaches a set second-iteration threshold.

Through the training process, the emotion label prediction model is trained by using the second training data related to the emotion label prediction task, so that the accuracy of the emotion label predicted by the emotion label prediction model in the application stage can be ensured.

In some embodiments, the emotion label prediction model may also be trained by using only the second training data, and of course, the emotion label prediction model obtained in this way predicts the emotion label with lower accuracy compared with the pre-training plus secondary training way.

In some embodiments of the present application, prior to step 420, the method further comprises: acquiring at least two groups of candidate emotion labels corresponding to the second sample media information; at least two groups of candidate emotion labels are obtained by carrying out emotion label labeling on second sample media information by different labeling personnel, and each group of candidate emotion labels corresponds to one labeling personnel; and if the at least two groups of candidate emotion labels are the same, determining the candidate emotion labels as the labeled emotion labels corresponding to the second sample media information.

In a specific embodiment, if the emotion tag prediction model predicts emotion tags in a category, a group of candidate emotion tags comprises a candidate emotion tag; if the emotion label prediction model predicts emotion labels under multiple categories, a group of candidate emotion labels comprises candidate emotion labels corresponding to each category.

In the related art, the label labeling is generally performed by a single person, but the inventors of the present application have found that, in particular, in an application scenario of the present application, training of an emotion label prediction model by labeling an emotion label by a single person is performed, and the accuracy of an output emotion label is low. The analysis shows that the main reasons are as follows: 1) the judgment of the emotion labels has strong subjectivity, the annotation standard of the emotion labels is possibly incomplete and cannot cover all situations, and annotating personnel can play the emotion labels freely without clear guidance, so that a large number of second sample media information which are similar are inconsistent in the corresponding annotation emotion labels; 2) and (4) under the condition of label missing, the positive examples of the labeled emotion labels corresponding to the obtained second sample media information are sparse, and the positive examples of the labeled emotion labels under each category are not more than 5%.

Based on the above reasons, in order to improve the training effect of the emotion label prediction model, in this embodiment, at least two annotators are used to annotate the same sample media information (i.e., the second sample media information) to obtain at least two sets of candidate emotion labels corresponding to the second sample media information, and when it is determined that the at least two sets of candidate emotion labels corresponding to the second sample media information are the same, the candidate emotion labels are determined as the annotated emotion labels corresponding to the second sample media information; therefore, the problem that the training effect of the emotion label prediction model is poor due to strong subjectivity and easy label leakage caused by single personnel carrying out emotion label marking is solved.

In a specific embodiment, the emotion labels may be labeled by two different labeling persons, and of course, in other embodiments, three or more labeling persons may also label the emotion labels.

FIG. 11 is a flowchart illustrating emotion tag labeling according to an embodiment of the present application. In the embodiment shown in FIG. 11, the same second sample media information is emotion labeled by annotator A and annotator B. In a specific embodiment, for the convenience of labeling, a labeling standard is provided, and the labeling standard is used for guiding a labeling person to perform a labeling rule of emotion label labeling. Therefore, the annotator A and the annotator B perform emotion label annotation on the second sample media information based on the annotation standard to obtain two groups of candidate emotion labels.

Then voting is carried out on the two candidate emotion labels, namely whether the two groups of candidate emotion labels are consistent or not is judged, and if so, one group of candidate emotion labels is used as a labeling emotion label corresponding to the second sample media information; otherwise, if the two candidate emotion labels are inconsistent, prompt information can be sent to the annotators A and B, and the prompt information is used for indicating that the two candidate emotion labels are inconsistent.

When the two sets of candidate emotion labels do not agree, possible scenarios include: firstly, a marking person understands errors of media information; secondly, the marked standard has a vulnerability, for example, which emotion labels are not marked clearly aiming at certain specific media information; and thirdly, different annotators have different understandings on the annotation standard.

For the first case, the emotion labels can be re-labeled by the labeling personnel, that is, the emotion labels labeled by the two labeling personnel on the same second sample media information can be ensured to be the same.

For the second case, the annotation standard may be updated to increase the media information covered by the annotation standard, or the annotation standard may be refined, so that after the annotation standard is updated, the annotating personnel re-annotates the emotion tags according to the updated annotation standard to ensure that the emotion tags annotated by the two annotating personnel on the same second sample media information are the same.

Aiming at the third situation, different annotators can discuss to re-understand the media information and the annotation standard and re-label the emotion labels, so that the annotation standard is aligned among the different annotators, and the situation that two groups of candidate emotion labels are inconsistent due to the difference in the understanding of the annotation standard is reduced.

Through the process, double-blind voting is carried out on a plurality of groups of candidate emotion labels marked by the same second sample media information by two marking personnel, if the double-blind voting determines that two groups of optional emotion labels are inconsistent, the two marking personnel are prompted through prompt information to carry out emotion label marking on the second sample media information again, the marking standards are refined and aligned as far as possible, the marking quality of the emotion labels can be guaranteed, and the problem that the training effect of the emotion label prediction model is poor due to the fact that the emotion label is marked wrongly is solved.

The method is used for solving the problem that the emotion label prediction model is not accurate, the accuracy is improved by 6.7%, and the recall rate is improved by 16.7%.

Embodiments of the apparatus of the present application are described below, which may be used to perform the methods of the above-described embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the above-described embodiments of the method of the present application.

Fig. 12 is a block diagram illustrating a media information processing apparatus according to an embodiment, the media information processing apparatus including, as shown in fig. 12: an obtaining module 1210, configured to obtain a description text and a description image from the media information; the image feature extraction module 1220 is configured to perform feature extraction on the description image to obtain an image feature; the splicing module 1230 is configured to splice the image features and the word vectors corresponding to all words in the description text, to obtain splicing features corresponding to the media information; the tag prediction module 1240 is used for performing tag prediction according to the splicing characteristics corresponding to the media information by the emotion tag prediction model and outputting the emotion tags corresponding to the media information; the display order determining module 1250 is configured to scramble a display order of the media information in the media information set, which includes the specified emotion tag, according to the emotion tag corresponding to each media information in the media information set, to obtain a target display order corresponding to the media information set.

In some embodiments of the present application, the display order determination module 1250 includes: the sorting unit is used for sorting the media information in the media information set according to the initial information display sequence corresponding to the media information set to obtain an initial media information sorting; the adjusting unit is used for adjusting the initial media information sequence according to the emotion tags corresponding to the media information in the media information set, so that the number of the media information of which the emotion tags which are continuous in the adjusted sequence include the appointed emotion tags is not more than N, wherein N is a positive integer; and determining the adjusted initial media information sequence as a target display sequence corresponding to the media information set.

In some embodiments of the present application, the apparatus for processing media information further comprises: the sorting score calculation module is used for calculating the sorting score corresponding to each media information according to the emotion label corresponding to each media information in the media information set to be sorted and the initial sorting parameter corresponding to each media information; and the information display sequence determining module is used for sequencing the media information according to the sequencing scores corresponding to the media information to obtain the information display sequence corresponding to the media information set to be sequenced.

In some embodiments of the present application, the apparatus for processing media information further comprises: and the control module is used for controlling the detail page of the media information not to display the investigation feedback content according to the first emotion label if the emotion label corresponding to the media information comprises the first emotion label.

In some embodiments of the present application, the emotion tag prediction model includes a feature fusion network and a classification layer; the tag prediction module 1240 includes: the characteristic fusion unit is used for performing characteristic fusion by the characteristic fusion network according to the splicing characteristics corresponding to the media information and outputting fusion characteristics corresponding to the media information; and the emotion label classification unit is used for classifying the emotion labels by the classification layer according to the fusion characteristics and outputting the emotion labels corresponding to the media information.

In some embodiments of the present application, the apparatus for processing media information further comprises: the pre-training module is used for pre-training the feature fusion network through first training data; the first training data comprises a plurality of first image-text pairs, wherein the first image-text pairs comprise a first sample description image and a first sample description text which are derived from the same first sample media information, and the first sample description text is obtained by performing partial mask processing on the initial sample description text; and the secondary training module is used for carrying out secondary training on the emotion label prediction model through second training data, and the second training data comprises a plurality of second sample media information and labeled emotion labels corresponding to the second sample media information.

In some embodiments of the present application, the pre-training module comprises: the first feature extraction unit is used for extracting features of the first sample description image to obtain image features of the first sample description image; the first splicing unit is used for splicing the image characteristics of the first sample description image and the word vectors respectively corresponding to all the words in the first sample description text to obtain splicing characteristics corresponding to the first sample media information; the first feature fusion unit is used for performing feature fusion by the feature fusion network according to the splicing feature corresponding to the first sample media information and outputting a fusion feature corresponding to the first sample media information; the first prediction unit is used for predicting the masked words by the appointed classification layer according to the fusion characteristics corresponding to the first sample media information to obtain predicted masked words; a first loss calculation unit for calculating a first loss based on the predicted masked word and an actual masked word in the first sample description text; and the first adjusting unit is used for reversely adjusting the parameters of the feature fusion network according to the first loss until a first training end condition is reached.

In some embodiments of the present application, the apparatus for processing media information further comprises: the second splicing unit is used for splicing the image characteristics of the first sample description image and the word vectors corresponding to all the words in the initial sample description text to obtain first splicing characteristics; the contribution weight determining unit is used for predicting the contribution weight of each word of the initial sample description text to the emotion label prediction according to the first splicing characteristics by the key component labeling model; the key component labeling model is obtained through training of third training data, the third training data comprise a plurality of third sample description texts and label information corresponding to each third sample description text, and the label information is used for indicating whether each word in the corresponding third sample description text is a keyword of an emotion label prediction task or not; a target word determination unit for determining a target word to be masked from words whose corresponding contribution weights are not lower than a weight threshold; and the mask processing unit is used for performing mask processing on the target words in the initial sample description text to obtain a first sample description text.

In some embodiments of the present application, the secondary training module comprises: an acquisition unit configured to acquire a second sample description text and a second sample description image from second sample media information; the second feature extraction unit is used for performing feature extraction on the second sample description image to obtain image features corresponding to the second sample description image; the third splicing unit is used for splicing the image characteristics corresponding to the second sample description image and the word vectors corresponding to all the words in the second sample description text to obtain the splicing characteristics corresponding to the second sample media information; the second prediction unit is used for performing label prediction by the emotion label prediction model according to the splicing characteristics corresponding to the second sample media information and outputting a predicted emotion label corresponding to the second sample media information; the second loss calculating unit is used for calculating second loss according to the predicted emotion tag corresponding to the second sample media information and the labeled emotion tag corresponding to the second sample media information; and the second adjusting unit is used for reversely adjusting the parameters of the emotion label prediction model according to the first loss until a second training end condition is reached.

In some embodiments of the present application, the apparatus for processing media information further comprises: the candidate emotion label acquisition module is used for acquiring at least two groups of candidate emotion labels corresponding to the second sample media information; at least two groups of candidate emotion labels are obtained by carrying out emotion label labeling on second sample media information by different labeling personnel, and each group of candidate emotion labels corresponds to one labeling personnel; and the marked emotion label determining module is used for determining the candidate emotion labels as the marked emotion labels corresponding to the second sample media information if the at least two groups of candidate emotion labels are the same.

In some embodiments of the present application, the obtaining module 1210 includes: a first extraction unit for extracting a cover image of the media information as a description image; the second extraction unit is used for acquiring the title text and the category label corresponding to the media information from the media information; and the combining unit is used for combining the title text and the category label corresponding to the media information to obtain the description text.

FIG. 13 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application. It should be noted that the computer system 1300 of the electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 13, computer system 1300 includes a processor, which may be a Central Processing Unit (CPU)1301, which may perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for system operation are also stored. The CPU1301, the ROM1302, and the RAM 1303 are connected to each other via a bus 1304. An Input/Output (I/O) interface 1305 is also connected to bus 1304.

The following components are connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output section 1307 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 1308 including a hard disk and the like; and a communication section 1309 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1309 performs communication processing via a network such as the internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1310 as necessary, so that a computer program read out therefrom is mounted into the storage portion 1308 as necessary.

In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications component 1309 and/or installed from removable media 1311. The computer program executes various functions defined in the system of the present application when executed by a Central Processing Unit (CPU) 1301.

It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.

As another aspect, the present application also provides a computer-readable storage medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the embodiments described above.

According to an aspect of the present application, there is also provided an electronic device, including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of any of the above embodiments.

According to an aspect of an embodiment of the present application, there is provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of any of the above embodiments.

It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for processing media information, comprising:

acquiring a description text and a description image from the media information;

performing feature extraction on the description image to obtain image features;

splicing the image characteristics and the word vectors corresponding to all the words in the description text to obtain splicing characteristics corresponding to the media information;

performing label prediction by an emotion label prediction model according to the splicing characteristics corresponding to the media information, and outputting an emotion label corresponding to the media information;

and according to the emotion labels corresponding to the media information in the media information set, disordering the display sequence of the media information including the appointed emotion labels in the media information set to obtain a target display sequence corresponding to the media information set.

2. The method of claim 1, wherein the step of scrambling the display sequence of the media information including the designated emotion tag according to the emotion tag corresponding to each piece of media information in the media information set to obtain the target display sequence corresponding to the media information set comprises:

sorting the media information in the media information set according to the initial information display sequence corresponding to the media information set to obtain an initial media information sorting;

adjusting the initial media information sequence according to the emotion tags corresponding to the media information in the media information set, so that the number of the media information of which the emotion tags which are continuous in the adjusted sequence and comprise the appointed emotion tags is not more than N, wherein N is a positive integer; and determining the adjusted initial media information sequence as a target display sequence corresponding to the media information set.

3. The method of claim 1 or 2, wherein tag prediction is performed by an emotion tag prediction model according to the splicing features corresponding to the media information, and after the emotion tag corresponding to the media information is output, the method further comprises:

calculating a ranking score corresponding to each media information according to the emotion label corresponding to each media information in the media information set to be ranked and the initial ranking parameter corresponding to each media information;

and sorting the media information according to the sorting scores corresponding to the media information to obtain the information display sequence corresponding to the media information set to be sorted.

4. The method of claim 1, wherein after the emotion tag prediction model performs tag prediction according to the splicing features corresponding to the media information and outputs the emotion tag corresponding to the media information, the method further comprises:

and if the emotion label corresponding to the media information comprises a first emotion label, controlling not to display the research feedback content in the detail page of the media information according to the first emotion label.

5. The method of claim 1, wherein the emotion tag prediction model comprises a feature fusion network and a classification layer;

the tag prediction is carried out by the emotion tag prediction model according to the splicing characteristics corresponding to the media information, and the emotion tag corresponding to the media information is output, and the tag prediction method comprises the following steps:

performing feature fusion by the feature fusion network according to the splicing features corresponding to the media information, and outputting fusion features corresponding to the media information;

and the classification layer classifies the emotion labels according to the fusion characteristics and outputs the emotion labels corresponding to the media information.

6. The method of claim 5, wherein before the label prediction is performed by the emotion label prediction model according to the splicing features corresponding to the media information and the emotion label corresponding to the media information is output, the method further comprises:

pre-training the feature fusion network through first training data; the first training data comprises a plurality of first image-text pairs, wherein the first image-text pairs comprise a first sample description image and a first sample description text which are derived from the same first sample media information, and the first sample description text is obtained by performing partial mask processing on an initial sample description text;

and performing secondary training on the emotion label prediction model through second training data, wherein the second training data comprise a plurality of second sample media information and labeled emotion labels corresponding to the second sample media information.

7. The method of claim 6, wherein the pre-training the feature fusion network with the first training data comprises:

performing feature extraction on the first sample description image to obtain image features of the first sample description image;

splicing the image features of the first sample description image and the word vectors respectively corresponding to all the words in the first sample description text to obtain splicing features corresponding to the first sample media information;

performing feature fusion by the feature fusion network according to the splicing feature corresponding to the first sample media information, and outputting a fusion feature corresponding to the first sample media information;

performing masked word prediction by the appointed classification layer according to the fusion characteristics corresponding to the first sample media information to obtain a predicted masked word;

calculating a first loss according to the predicted masked word and an actual masked word in the first sample description text;

and reversely adjusting the parameters of the feature fusion network according to the first loss until a first training end condition is reached.

8. The method according to claim 6 or 7, wherein prior to pre-training the feature fusion network with the first training data, the method further comprises:

splicing the image features of the first sample description image and the word vectors corresponding to all the words in the initial sample description text to obtain first splicing features;

predicting the contribution weight of each word of the initial sample description text to emotion label prediction by a key component labeling model according to the first splicing characteristics; the key component labeling model is obtained through training of third training data, the third training data comprise a plurality of third sample description texts and label information corresponding to each third sample description text, and the label information is used for indicating whether each word in the corresponding third sample description text is a keyword of an emotion label prediction task or not;

determining target words needing to be masked from the words of which the corresponding contribution weights are not lower than the weight threshold;

and performing mask processing on the target words in the initial sample description text to obtain the first sample description text.

9. The method of claim 6, wherein the secondary training of the emotion label prediction model through second training data comprises:

acquiring a second sample description text and a second sample description image from second sample media information;

performing feature extraction on the second sample description image to obtain image features corresponding to the second sample description image;

splicing image features corresponding to the second sample description image and word vectors corresponding to all words in the second sample description text respectively to obtain splicing features corresponding to the second sample media information;

performing label prediction by the emotion label prediction model according to the splicing characteristics corresponding to the second sample media information, and outputting a predicted emotion label corresponding to the second sample media information;

calculating a second loss according to the predicted emotion label corresponding to the second sample media information and the labeled emotion label corresponding to the second sample media information;

and reversely adjusting the parameters of the emotion label prediction model according to the first loss until a second training end condition is reached.

10. The method of claim 6 or 9, wherein before the second training of the emotion label prediction model by the second training data, the method further comprises:

acquiring at least two groups of candidate emotion labels corresponding to the second sample media information; the at least two groups of candidate emotion labels are obtained by carrying out emotion label labeling on the second sample media information by different labeling personnel, and each group of candidate emotion labels corresponds to one labeling personnel;

and if the at least two groups of candidate emotion labels are the same, determining the candidate emotion labels as the labeled emotion labels corresponding to the second sample media information.

11. The method of claim 1, wherein the obtaining the descriptive text and the descriptive image from the media information comprises:

extracting a cover image of the media information as the description image;

acquiring a title text and a category label corresponding to the media information from the media information;

and combining the title text and the category label corresponding to the media information to obtain the description text.

12. An apparatus for processing media information, comprising:

the acquisition module is used for acquiring the description text and the description image from the media information;

the image feature extraction module is used for extracting features of the description image to obtain image features;

the splicing module is used for splicing the image characteristics and the word vectors respectively corresponding to all the words in the description text to obtain splicing characteristics corresponding to the media information;

the label prediction module is used for performing label prediction according to the splicing characteristics corresponding to the media information by the emotion label prediction model and outputting the emotion label corresponding to the media information;

and the display sequence determining module is used for disordering the display sequence of the media information including the appointed emotion label in the media information set according to the emotion label corresponding to each media information in the media information set to obtain the target display sequence corresponding to the media information set.

13. An electronic device, comprising:

a processor;

a memory having computer-readable instructions stored thereon which, when executed by the processor, implement the method of any of claims 1-11.

14. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-11.

15. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the method of any of claims 1-11.