CN113010702A - Interactive processing method and device for multimedia information, electronic equipment and storage medium


Info

Publication number: CN113010702A
Application number: CN202110234973.1A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 陈小帅
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Prior art keywords: information, interactive, user, interaction, features
Legal status: Pending
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202110234973.1A; publication of CN113010702A

Classifications

    All within G (PHYSICS) → G06 (COMPUTING; CALCULATING OR COUNTING):
    • G06F16/435 — Information retrieval of multimedia data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/44 — Information retrieval of multimedia data; browsing; visualisation therefor
    • G06F16/483 — Information retrieval of multimedia data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/9535 — Retrieval from the web; querying, e.g. by web search engines; search customisation based on user profiles and personalisation
    • G06F16/9538 — Retrieval from the web; querying, e.g. by web search engines; presentation of query results
    • G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods

Abstract

The application provides an interactive processing method and apparatus for multimedia information, an electronic device, and a computer-readable storage medium, relating to machine learning technology in the field of artificial intelligence. The method comprises the following steps: acquiring a plurality of pieces of interaction information for multimedia information; constructing a user comment interaction feature of the multimedia information based on user portraits of a viewing user and an interactive user of the multimedia information and each piece of interaction information published by the interactive user for the multimedia information; performing fusion processing based on the user comment interaction feature and the multi-modal features of the multimedia information to obtain a fusion feature; determining an interaction probability of the viewing user for each piece of interaction information based on the fusion feature; and sorting the plurality of pieces of interaction information based on the interaction probabilities and displaying them based on the sorting result. Through the application, personalized display of the interaction information of multimedia information can be realized, thereby achieving accurate recommendation of the interaction information.

Description

Interactive processing method and device for multimedia information, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technologies and internet technologies, and in particular, to a method and an apparatus for interactive processing of multimedia information, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is the theory, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. As artificial intelligence has been researched and developed, it has come to be applied in a growing number of fields.
Taking information recommendation as an example: with the development of internet technology, there is more and more user interaction information (e.g., comments) on multimedia information (e.g., video or audio), through which users interact with the content of the multimedia information. Because the volume of interaction information is large, the related art generally sorts it by indexes such as word count or popularity, so a user cannot directly find the interaction information of interest and consequently does not interact with what is recommended. This leads to invalid recommendation of interaction information and unnecessary waste of the computing and communication resources used to recommend it.
Therefore, the related art offers no effective solution for recommending interaction information for multimedia information.
Disclosure of Invention
The embodiments of the present application provide an interactive processing method and apparatus for multimedia information, an electronic device, and a computer-readable storage medium, which enable personalized display of the interaction information of multimedia information and thereby accurate recommendation of the interaction information.
The technical scheme of the embodiment of the application is realized as follows:
An embodiment of the present application provides an interactive processing method for multimedia information, comprising:
acquiring a plurality of pieces of interaction information for multimedia information;
constructing a user comment interaction feature of the multimedia information based on user portraits of a viewing user and an interactive user of the multimedia information and each piece of interaction information published by the interactive user for the multimedia information;
performing fusion processing based on the user comment interaction feature and the multi-modal features of the multimedia information to obtain a fusion feature;
determining an interaction probability of the viewing user for each piece of interaction information based on the fusion feature;
and sorting the plurality of pieces of interaction information based on the interaction probabilities, and displaying them based on the sorting result.
In the above scheme, determining the interaction polarity information of each piece of interaction information comprises:
performing the following processing on each piece of interaction information through a neural network model:
performing feature extraction on the interaction information to obtain the text feature of the interaction information;
mapping the text feature into probabilities of belonging to different candidate interaction polarities;
determining the candidate interaction polarity corresponding to the maximum probability as the interaction polarity information of the interaction information;
wherein the neural network model is trained on sample interaction information and the interaction polarity information labeled for the sample interaction information.
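A minimal sketch of such a polarity classifier, assuming a 768-dimensional text feature and three candidate polarities (e.g., positive, neutral, negative); both the dimensions and the candidate set are illustrative assumptions, since the scheme above fixes neither:

```python
import torch
import torch.nn as nn

class PolarityClassifier(nn.Module):
    """Maps a comment's text feature to probabilities over candidate interaction polarities."""
    def __init__(self, feature_dim: int = 768, num_polarities: int = 3):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_polarities)

    def forward(self, text_feature: torch.Tensor) -> torch.Tensor:
        # Probabilities of belonging to each candidate interaction polarity.
        return torch.softmax(self.head(text_feature), dim=-1)

model = PolarityClassifier()
text_feature = torch.randn(1, 768)      # stand-in for the extracted text feature
probs = model(text_feature)
polarity = torch.argmax(probs, dim=-1)  # candidate polarity with the maximum probability
```

In training, the cross-entropy between these probabilities and the labeled polarity of the sample interaction information would drive the parameter updates.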
In the foregoing solution, determining the heat information of each piece of interaction information based on the interaction degree of the interaction information and the interaction degree of the interactive user comprises:
performing the following processing for each piece of interaction information:
determining the interaction degree of the interaction information according to the number of times the interaction information has been interacted with;
determining the interaction degree of the interactive user according to the number of interactions of the interactive user;
and performing weighted summation on the interaction degree of the interaction information and the interaction degree of the interactive user, and determining the result of the weighted summation as the heat information of the interaction information.
In the above scheme, determining the interaction degree of the interaction information according to the number of times the interaction information has been interacted with comprises:
performing weighted summation on the numbers of times the interaction information has been liked, replied to, and forwarded, and determining the ratio between the result of the weighted summation and the number of times the multimedia information has been played as a first ratio;
when the first ratio is larger than a first ratio threshold, determining the first ratio threshold as the interaction degree of the interaction information;
and when the first ratio is not larger than the first ratio threshold, determining the first ratio as the interaction degree of the interaction information.
In the above scheme, determining the interaction degree of the interactive user according to the number of interactions of the interactive user comprises:
performing weighted summation on the number of times the multimedia information published by the interactive user has been played, the number of times the user has published interaction information, and the number of times the published interaction information has been interacted with, and determining the ratio between the result of the weighted summation and an interaction parameter as a second ratio;
wherein the interaction parameter is the maximum value among a plurality of interaction results corresponding one-to-one to a plurality of interactive users, each interaction result being obtained by weighted summation of the number of times the multimedia information published by that user has been played, the number of times that user has published interaction information, and the number of times the published interaction information has been interacted with;
when the second ratio is larger than a second ratio threshold, determining the second ratio threshold as the interaction degree of the interactive user;
and when the second ratio is not larger than the second ratio threshold, determining the second ratio as the interaction degree of the interactive user.
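The two interaction degrees and the heat information above reduce to clipped, normalized weighted sums. A sketch in Python, in which all weights and thresholds are illustrative assumptions (the scheme leaves their values open):

```python
def interaction_degree_of_info(likes, replies, forwards, plays,
                               w=(0.5, 0.3, 0.2), threshold=1.0):
    """First ratio: weighted interaction counts over play count, clipped at a threshold."""
    ratio = (w[0] * likes + w[1] * replies + w[2] * forwards) / max(plays, 1)
    return min(ratio, threshold)

def interaction_degree_of_user(user_stats, all_user_stats,
                               w=(0.4, 0.3, 0.3), threshold=1.0):
    """Second ratio: the user's weighted activity over the maximum across all users."""
    # s = (plays of published media, comments published, interactions received)
    def weighted(s):
        return w[0] * s[0] + w[1] * s[1] + w[2] * s[2]
    interaction_parameter = max(weighted(s) for s in all_user_stats)  # max over all users
    ratio = weighted(user_stats) / max(interaction_parameter, 1e-9)
    return min(ratio, threshold)

def heat_information(info_degree, user_degree, alpha=0.6, beta=0.4):
    """Heat of a piece of interaction information: weighted sum of the two degrees."""
    return alpha * info_degree + beta * user_degree
```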
An embodiment of the present application provides an interactive processing apparatus for multimedia information, comprising:
an acquisition module, configured to acquire a plurality of pieces of interaction information for multimedia information;
a construction module, configured to construct a user comment interaction feature of the multimedia information based on user portraits of a viewing user and an interactive user of the multimedia information and each piece of interaction information published by the interactive user for the multimedia information;
a fusion module, configured to perform fusion processing based on the user comment interaction feature and the multi-modal features of the multimedia information to obtain a fusion feature;
an interaction determination module, configured to determine, based on the fusion feature, an interaction probability of the viewing user for each piece of interaction information;
and a sorting module, configured to sort the plurality of pieces of interaction information based on the interaction probabilities and to display them based on the sorting result.
In the foregoing solution, the construction module is further configured to perform the following processing for each piece of interaction information: splicing the text in the user portrait of the interactive user with the text of the interaction information, and performing feature extraction on the spliced text to obtain the text feature of the interaction information; performing feature extraction on the text in the user portrait of the viewing user to obtain the user feature of the viewing user; and fusing the text feature of the interaction information with the user feature of the viewing user to obtain the user comment interaction feature of the multimedia information.
In the above scheme, the text feature of the interaction information and the user feature of the viewing user are extracted through the same language processing model; the construction module is further configured to determine, through the language processing model, an attention weight for the text feature of the interaction information and an attention weight for the user feature of the viewing user, and to perform weighted summation on the two features according to their attention weights, so as to obtain the user comment interaction feature of the multimedia information.
In the above scheme, the fusion module is further configured to perform feature extraction on the multimedia information to obtain its text feature, audio feature, and video feature; to fuse the text feature, the audio feature, and the video feature to obtain the multi-modal features of the multimedia information; and to splice the user comment interaction feature with the multi-modal features to obtain the fusion feature.
In the above scheme, the fusion module is further configured to extract text information from the multimedia information and perform feature extraction on it to obtain the text feature of the multimedia information, the text information comprising at least one of: title, bullet comments (barrage), dialog text, type, and tags; to extract a plurality of audio frames from the multimedia information, perform feature extraction on them to obtain audio frame features in one-to-one correspondence with the audio frames, and fuse the audio frame features to obtain the audio feature of the multimedia information; and to extract a plurality of video frames from the multimedia information, perform feature extraction on them to obtain video frame features in one-to-one correspondence with the video frames, and fuse the video frame features to obtain the video feature of the multimedia information.
In the foregoing solution, the interaction determination module is further configured to perform the following processing for each piece of interaction information: mapping the fusion feature into interaction probabilities of the interaction information corresponding to different interaction types, wherein the interaction types comprise at least one of: like, forward, reply, and dislike.
In the above scheme, the sorting module is further configured to perform weighted summation on the interaction probabilities of the different interaction types corresponding to each piece of interaction information to obtain a ranking score for each piece of interaction information, to sort the pieces of interaction information in descending or ascending order of ranking score, and to display at least some of them according to the sorting result.
In the above solution, the interactive processing apparatus for multimedia information further comprises a relevance determination module, configured to: perform feature extraction on the multimedia information to obtain its multi-modal features; construct a text feature of each piece of interaction information based on the user portrait of the interactive user and each piece of interaction information published by the interactive user for the multimedia information; and determine the relevance between each piece of interaction information and the multimedia information based on the multi-modal features of the multimedia information and the text feature of each piece of interaction information. The sorting module is further configured to perform weighted summation on the interaction probability and the relevance corresponding to each piece of interaction information to obtain its ranking score, and to sort the pieces of interaction information in descending or ascending order of ranking score.
In the above scheme, the relevance determination module is further configured to construct the user portrait of the interactive user from the multimedia information the interactive user is interested in, and to perform the following processing for each piece of interaction information: splice the text in the user portrait of the interactive user with the text of the interaction information, and perform feature extraction on the spliced text to obtain the text feature of the interaction information.
In the foregoing solution, the relevance determination module is further configured to perform the following processing for each piece of interaction information: fuse the multi-modal features with the text feature of the interaction information to obtain a relevance fusion feature; map the relevance fusion feature into probabilities of belonging to different candidate relevance levels; and determine the candidate relevance level with the maximum probability as the relevance between the interaction information and the multimedia information.
In the foregoing solution, the relevance determination module is further configured to extract text information from the multimedia information and perform feature extraction on it to obtain the text feature of the multimedia information, the text information comprising at least one of: title, bullet comments (barrage), dialog text, type, and tags; to extract a plurality of audio frames from the multimedia information, perform feature extraction on them to obtain audio frame features in one-to-one correspondence with the audio frames, and fuse the audio frame features to obtain the audio feature of the multimedia information; to extract a plurality of video frames from the multimedia information, perform feature extraction on them to obtain video frame features in one-to-one correspondence with the video frames, and fuse the video frame features to obtain the video feature of the multimedia information; and to fuse the text feature, the audio feature, and the video feature to obtain the multi-modal features of the multimedia information.
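A sketch of this relevance determination, showing the fusion as concatenation followed by a linear layer and the candidate relevance levels as a small discrete set; both are illustrative assumptions not fixed by the scheme above:

```python
import torch
import torch.nn as nn

class RelevanceModel(nn.Module):
    """Fuses the video's multi-modal features with a comment's text feature and
    classifies the pair into one of several candidate relevance levels."""
    def __init__(self, video_dim: int = 512, text_dim: int = 768, num_levels: int = 5):
        super().__init__()
        self.classifier = nn.Linear(video_dim + text_dim, num_levels)

    def forward(self, video_feature, text_feature):
        fused = torch.cat([video_feature, text_feature], dim=-1)  # relevance fusion feature
        probs = torch.softmax(self.classifier(fused), dim=-1)     # probabilities per candidate level
        return torch.argmax(probs, dim=-1)                        # level with the maximum probability
```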
In the above solution, the interactive processing apparatus for multimedia information further comprises a polarity and heat determination module, configured to determine the interaction polarity information of each piece of interaction information, and to determine the heat information of each piece of interaction information based on the interaction degree of the interaction information and the interaction degree of the interactive user. The sorting module is further configured to perform weighted summation on the interaction polarity information, the relevance, and the heat information corresponding to each piece of interaction information to obtain its ranking score.
In the above scheme, the polarity and heat determination module is further configured to perform the following processing on each piece of interaction information through a neural network model: perform feature extraction on the interaction information to obtain its text feature; map the text feature into probabilities of belonging to different candidate interaction polarities; and determine the candidate interaction polarity with the maximum probability as the interaction polarity information of the interaction information; wherein the neural network model is trained on sample interaction information and the interaction polarity information labeled for it.
In the foregoing solution, the polarity and heat determination module is further configured to perform the following processing for each piece of interaction information: determine the interaction degree of the interaction information according to the number of times it has been interacted with; determine the interaction degree of the interactive user according to the user's number of interactions; and perform weighted summation on the two interaction degrees, determining the result of the weighted summation as the heat information of the interaction information.
In the above scheme, the polarity and heat determination module is further configured to perform weighted summation on the numbers of times the interaction information has been liked, replied to, and forwarded, and determine the ratio between the result and the number of times the multimedia information has been played as a first ratio; when the first ratio is larger than a first ratio threshold, determine the first ratio threshold as the interaction degree of the interaction information; and when the first ratio is not larger than the first ratio threshold, determine the first ratio as the interaction degree of the interaction information.
In the above scheme, the polarity and heat determination module is further configured to perform weighted summation on the number of times the multimedia information published by the interactive user has been played, the number of times the user has published interaction information, and the number of times the published interaction information has been interacted with, and determine the ratio between the result and an interaction parameter as a second ratio, the interaction parameter being the maximum value among a plurality of interaction results corresponding one-to-one to a plurality of interactive users, each obtained by the same weighted summation for that user; when the second ratio is larger than a second ratio threshold, determine the second ratio threshold as the interaction degree of the interactive user; and when the second ratio is not larger than the second ratio threshold, determine the second ratio as the interaction degree of the interactive user.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and a processor, configured to implement the interactive processing method for multimedia information provided by the embodiments of the present application when executing the computer-executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the interactive processing method for multimedia information provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
the interaction probability of the viewing user for each piece of interaction information is determined based on the interaction features of the viewing user and the interactive user together with the features of the multimedia information, and the interaction information is then displayed according to these probabilities, so that the display order matches the personalized information needs of the viewing user.
Drawings
Fig. 1 is a schematic diagram of an interactive processing system 100 for multimedia information according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application;
fig. 3 is a flowchart illustrating an interactive processing method for multimedia information according to an embodiment of the present application;
fig. 4 is a flowchart illustrating an interactive processing method for multimedia information according to an embodiment of the present application;
fig. 5 is a flowchart illustrating an interactive processing method for multimedia information according to an embodiment of the present application;
fig. 6A and fig. 6B are schematic flow charts of a method for interactive processing of multimedia information according to an embodiment of the present application;
FIG. 7 is a structural diagram of a model of the relevance between comments and the multi-dimensional content of a video, provided by an embodiment of the present application;
FIG. 8 is a structural diagram of a comment sentiment polarity classification model provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of a user personalized comment multi-interaction target prediction model provided in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other where there is no conflict.
In the following description, the terms "first", "second", and the like are only used to distinguish similar objects and do not denote a particular order or importance; where permissible, the specific order may be interchanged so that the embodiments of the present application described herein can be practiced in orders other than those illustrated or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in them are explained as follows.
1) Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. It studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
2) Video comment multi-interaction targets: a user watching a video (or its comments) can like, reply to, and forward the comments. By predicting the user's interaction targets for the comments, the comment list of a video can be ranked in a personalized way for that user, improving the interactive appeal of the video comments.
3) Parameters of a neural network model: parameters automatically updated or learned during training of the neural network model, including feature weights, biases, and the like.
In the related art, there are generally two ways of ranking interaction information. One ranks based on page content: the content of the interaction information is coarsely ranked and then re-ranked according to the match between the user's interests and the interaction information. The other ranks based on attributes of the interaction information and the user's interests, taking the user's interaction behavior toward the interaction information as a ranking feature, thereby achieving personalized ranking.
In the embodiments of the present application, it is found that the related art cannot accurately predict how a user will interact with the interaction information, so the interaction behavior after personalized ranking does not match the user's target behavior. This leads to invalid recommendation of interaction information and unnecessary waste of the computing and communication resources used to recommend it.
In view of the above technical problems, the embodiments of the present application provide an interactive processing method for multimedia information that displays the interaction information of multimedia information in a personalized way, thereby achieving accurate recommendation of the interaction information. Exemplary applications of this method are described below. The method may be implemented by various electronic devices: for example, by a terminal alone, where the terminal uses its own computing capability to determine the ranking of the multiple pieces of interaction information of the multimedia information and then displays at least some of them according to the ranking result; or by a terminal and a server in cooperation, where the terminal determines the ranking with the help of the server's computing power and then displays at least some of the pieces of interaction information according to the ranking result.
Next, an embodiment of the present application is described by taking cooperation between a server and a terminal as an example. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an interactive processing system 100 for multimedia information provided by an embodiment of the present application. The interactive processing system 100 comprises the server 200, the network 300, and the terminal 400, which are described separately below.
The server 200 is a background server of the client 410 and may send a plurality of pieces of interaction information to the client 410.
The network 300 mediates communication between the server 200 and the terminal 400 and may be a wide area network, a local area network, or a combination of the two.
The terminal 400 runs a client 410, which is a client with a multimedia information playing function, such as an instant messaging client, a video client, a microblog client, or a short-video client. The client 410 is configured to receive the plurality of pieces of interaction information sent by the server 200; it is further configured to determine the interaction probability of the viewing user for each piece of interaction information according to the user portraits of the viewing user and the interactive user, the interaction information, and the multi-modal features of the multimedia information, to sort the interaction information according to the interaction probabilities, and to display the pieces of interaction information on a human-computer interaction interface based on the sorting result.
In some embodiments, the server 200 may instead determine the interaction probability of the viewing user for each piece of interaction information according to the user portraits of the viewing user and the interactive user, the interaction information, and the multi-modal features of the multimedia information, sort the interaction information according to the interaction probabilities, and send the sorted result to the client 410; the client 410 then displays the pieces of interaction information on the human-computer interaction interface based on the sorting result.
The embodiments of the present application may be implemented by means of cloud technology, which refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or local area network to realize the computation, storage, processing, and sharing of data.
Cloud technology is a general term for the network, information, integration, management-platform, application, and other technologies applied on the basis of the cloud computing business model; these resources can form a pool and be used on demand, flexibly and conveniently. Cloud computing will become an important support, as the background services of technical network systems require large amounts of computing and storage resources.
As an example, the server 200 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal 400 may be any of various types of user terminal, such as a smartphone, tablet computer, vehicle-mounted terminal, smart wearable device, notebook computer, desktop computer, or smart television. The terminal 400 and the server 200 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present application.
The structure of the terminal 400 in fig. 1 is explained next. Referring to fig. 2, fig. 2 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application, where the terminal 400 shown in fig. 2 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in fig. 2.
The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor, any conventional processor, or the like.
The user interface 430 includes an output device 431 including one or more speakers and/or one or more visual displays that enable the presentation of multimedia and interactive information. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 may include volatile memory, nonvolatile memory, or both. The nonvolatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
The operating system 451 includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, and is used for implementing various basic services and processing hardware-based tasks.
A network communication module 452 is used for reaching other computing devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.
A presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430.
An input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the interactive processing apparatus for multimedia information provided by the embodiments of the present application may be implemented in software. FIG. 2 illustrates the interactive processing apparatus 455 for multimedia information stored in the memory 450, which may be software in the form of programs and plug-ins and includes the following software modules: an acquisition module 4551, a construction module 4552, a fusion module 4553, an interaction determination module 4554, and a sorting module 4555. These modules are logical, and may therefore be combined arbitrarily or further split depending on the functions implemented. The functions of the modules are explained below.
In the following, the interactive processing method for multimedia information provided by the embodiments of the present application is described by taking execution by the terminal 400 in FIG. 1 as an example. Referring to FIG. 3, FIG. 3 is a flowchart of the interactive processing method for multimedia information according to an embodiment of the present application, described with reference to the steps shown there.
It should be noted that the method shown in FIG. 3 can be executed by computer programs of various forms running on the terminal 400 and is not limited to the client described above; it may also be, for example, the operating system, a software module, or a script mentioned above.
In step S101, a plurality of pieces of interaction information for multimedia information are acquired.
In some embodiments, the multimedia information may be in the form of video, audio, text, and the like.
Taking a video as an example, the interaction information may be information expressing the user's attitude, such as comments, likes, or dislikes, published while the user watches the video.
In step S102, a user comment interaction feature of the multimedia information is constructed based on the user portraits of a viewing user and an interactive user of the multimedia information and each piece of interaction information published by the interactive user for the multimedia information.
In some embodiments, the user portraits of the interactive user and the viewing user are constructed respectively from the multimedia information each is interested in, and the following processing is performed for each piece of interaction information: splicing the text in the user portrait of the interactive user with the text of the interaction information, and performing feature extraction on the spliced text to obtain the text feature of the interaction information; performing feature extraction on the text in the user portrait of the viewing user to obtain the user feature of the viewing user; and fusing the text feature of the interaction information with the user feature of the viewing user to obtain the user comment interaction feature of the multimedia information.
By way of example, the multimedia information of interest includes: multimedia information played historically, multimedia information collected (favorited), and multimedia information interacted with historically (commented on, liked, or disliked).
For example, the user portrait of a user is a sequence of the user's interest tags, constructed by iterative learning over the multimedia information the user is interested in; for example, the following processing is performed on that multimedia information through a neural network model: extracting the feature vector of the multimedia information of interest; mapping the extracted feature vector into probabilities corresponding to a plurality of candidate interest tag sequences; and determining the candidate interest tag sequence with the maximum probability as the user portrait of the user.
Here, the neural network model is trained using sample multimedia information and the interest tag sequences labeled for it as samples. In this way, the model gains the ability to identify a user portrait from the multimedia information a user is interested in, enabling accurate determination of the user portraits of the interactive user and the viewing user.
As examples, the neural network model may be of various types, such as a Convolutional Neural Network (CNN) model, a Recurrent Neural Network (RNN) model, or a multi-layer feedforward neural network model. The neural network model can be trained in a supervised manner, where the loss function used for training represents the difference between the predicted value and the labeled data; it may be a 0-1 loss function, a perceptual loss function, a cross-entropy loss function, or the like.
As an example, the text feature of the interaction information and the user feature of the viewing user are extracted through the same language processing model. Fusing the two to obtain the user comment interaction feature of the multimedia information may comprise: determining, through the language processing model, an attention weight for the text feature of the interaction information and an attention weight for the user feature of the viewing user, and performing weighted summation on the two features according to their attention weights to obtain the user comment interaction feature of the multimedia information.
As an example, the language processing model may be any machine learning model with a text feature extraction function, for example an encoder-structured machine learning model, which may include: a Bidirectional Encoder Representations from Transformers (BERT) model, A Lite BERT (ALBERT) model, the encoder of an autoencoder network, the downsampling section of a U-shaped network, and the like.
In the example of FIG. 9, the text in the user portrait of the comment publisher (i.e., the interactive user above) is spliced with the text of the comment (i.e., the interaction information above), and feature extraction is performed on the spliced text through an ALBERT model to obtain the comment text representation (i.e., the text feature of the interaction information); feature extraction is performed on the text in the user portrait of the video viewer (i.e., the viewing user) through an ALBERT model to obtain the viewer's user representation (i.e., the user feature of the viewing user); and, based on an attention mechanism, the viewer's user representation and the comment text representation are weighted and summed to obtain the user-comment interaction representation (i.e., the user comment interaction feature described above).
In the embodiments of the present application, the attention mechanism extracts, from the interaction information and the user portraits of the viewing user and the interactive user, user comment interaction features that are strongly predictive of the viewing user's interaction behavior toward the interaction information, which improves the accuracy of the subsequent prediction of the viewing user's interaction probability for each piece of interaction information.
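A minimal sketch of this attention-weighted fusion; the 768-dimensional features and the single-layer scoring function are illustrative assumptions, since the passage only specifies that attention weights are determined for the two features and used in a weighted summation:

```python
import torch
import torch.nn as nn

class UserCommentFusion(nn.Module):
    """Attention-weighted summation of the comment text feature (publisher portrait
    text spliced with comment text, then encoded) and the viewer's user feature."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # produces one attention logit per feature

    def forward(self, comment_text_feature, viewer_user_feature):
        feats = torch.stack([comment_text_feature, viewer_user_feature], dim=1)  # (B, 2, dim)
        weights = torch.softmax(self.score(feats), dim=1)  # attention weights over the two features
        return (weights * feats).sum(dim=1)                # user comment interaction feature (B, dim)
```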
In step S103, fusion processing is performed based on the user comment interaction feature and the multi-modal features of the multimedia information to obtain a fusion feature.
In some embodiments, feature extraction is performed on the multimedia information to obtain its text feature, audio feature, and video feature (or image feature); the text, audio, and video features are fused to obtain the multi-modal features of the multimedia information; and the user comment interaction feature is spliced with the multi-modal features to obtain the fusion feature.
Continuing the example of FIG. 9, the user-comment interaction representation and the video multi-modal feature fusion representation (i.e., the multi-modal features of the multimedia information described above) are spliced to obtain the spliced video feature (i.e., the fusion feature described above).
As an example, performing feature extraction on the multimedia information to obtain its text, audio, and video features may include: extracting text information from the multimedia information and performing feature extraction on it to obtain the text feature, the text information comprising at least one of: title, bullet comments (barrage), dialog text, type, and tags; extracting a plurality of audio frames from the multimedia information, performing feature extraction on them to obtain audio frame features in one-to-one correspondence with the audio frames, and fusing the audio frame features to obtain the audio feature; and extracting a plurality of video frames from the multimedia information, performing feature extraction on them to obtain video frame features in one-to-one correspondence with the video frames, and fusing the video frame features to obtain the video feature.
As an example, feature extraction may be performed on the text information through a language processing model, which may be any machine learning model with a text feature extraction function, for example an encoder-structured model such as a BERT model, an ALBERT model, the encoder of an autoencoder network, or the downsampling section of a U-shaped network.
As an example, feature extraction may be performed on the audio frames through an audio feature extraction model, which may be any machine learning model with an audio feature extraction function, such as a VGGish model or a WaveNet model.
For example, an audio frame 1, an audio frame 2, and an audio frame 3 are extracted from multimedia information, feature extraction is performed on the audio frame 1, the audio frame 2, and the audio frame 3 through an audio feature extraction model, an audio frame feature 1 corresponding to the audio frame 1, an audio frame feature 2 corresponding to the audio frame 2, and an audio frame feature 3 corresponding to the audio frame 3 are obtained, and the audio frame feature 1, the audio frame feature 2, and the audio frame feature 3 are subjected to fusion processing, so that audio features are obtained.
As an example, feature extraction may be performed on the video frames through a video feature extraction model, which may be any machine learning model with a video feature extraction function, such as a deep residual network (ResNet) model or a multi-layer feed-forward neural network model.
For example, video frame 1, video frame 2, and video frame 3 are extracted from multimedia information, feature extraction is performed on video frame 1, video frame 2, and video frame 3 through a video feature extraction model, so as to obtain video frame feature 1 corresponding to video frame 1, video frame feature 2 corresponding to video frame 2, and video frame feature 3 corresponding to video frame 3, and video frame feature 1, video frame feature 2, and video frame feature 3 are subjected to fusion processing, so as to obtain video features.
In the example of FIG. 7, taking the multimedia information as a video: text information of the video is extracted, and feature extraction is performed on it through an ALBERT model to obtain the text feature of the video (i.e., the text representation in FIG. 7). Audio information of the video is extracted and split into a sequence of audio frames; each audio frame is processed by a VGGish model to obtain audio frame features in one-to-one correspondence with the audio frames, and the audio frame features are fused by average pooling to obtain the audio feature of the video (i.e., the audio representation in FIG. 7). Video frames are extracted from the video to obtain a video frame sequence; each video frame is processed by a ResNet model to obtain video frame features in one-to-one correspondence with the video frames, and the video frame features are fused by average pooling to obtain the video feature of the video (i.e., the image representation in FIG. 7). The text, audio, and video features are then fused to obtain the video multi-modal feature fusion representation.
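The FIG. 7 pipeline can be sketched as follows; average pooling implements the fusion of per-frame features as described, while the feature dimensions and the final concatenation-plus-linear-layer fusion are illustrative choices the passage leaves open:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Builds the video multi-modal feature from text, audio-frame, and video-frame features."""
    def __init__(self, text_dim: int = 768, audio_dim: int = 128,
                 video_dim: int = 2048, out_dim: int = 512):
        super().__init__()
        self.fuse = nn.Linear(text_dim + audio_dim + video_dim, out_dim)

    def forward(self, text_feature, audio_frame_features, video_frame_features):
        # Average pooling fuses the per-frame features into one feature per modality.
        audio_feature = audio_frame_features.mean(dim=1)  # (B, T_a, audio_dim) -> (B, audio_dim)
        video_feature = video_frame_features.mean(dim=1)  # (B, T_v, video_dim) -> (B, video_dim)
        fused = torch.cat([text_feature, audio_feature, video_feature], dim=-1)
        return self.fuse(fused)  # video multi-modal feature fusion representation
```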
In step S104, the interaction probability of the viewing user for each piece of interaction information is determined based on the fusion feature.
In some embodiments, the following processing is performed for each piece of interaction information: mapping the fusion feature into interaction probabilities of the interaction information corresponding to different interaction types, wherein the interaction types comprise at least one of: like, forward, reply, and dislike.
As an example, the following processing is performed for each piece of interaction information: mapping the fusion feature through activation functions to obtain the interaction probabilities corresponding to the different interaction types. Multiple activation functions may be used for this mapping, one per interaction type, and their types may be the same or different, for example a sigmoid function or a softmax function.
In the example of FIG. 9, the fusion feature is mapped respectively to the viewing user's probability of liking the comment (hereinafter the like probability), probability of replying to the comment (the reply probability), and probability of forwarding the comment (the forward probability).
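A sketch of this multi-interaction-target prediction head, assuming one sigmoid-activated output per interaction type; the head layout and dimensions are illustrative, consistent with the statement above that each interaction type has its own activation function:

```python
import torch
import torch.nn as nn

class InteractionProbabilityHead(nn.Module):
    """Maps the fusion feature to per-type interaction probabilities."""
    def __init__(self, fusion_dim: int = 1280, types=("like", "reply", "forward")):
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Linear(fusion_dim, 1) for t in types})

    def forward(self, fusion_feature: torch.Tensor) -> dict:
        # One activation function per interaction type, each yielding a probability.
        return {t: torch.sigmoid(head(fusion_feature)).squeeze(-1)
                for t, head in self.heads.items()}
```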
According to the method and the device, the personalized features of the interactive user and the watching user are combined, the video multi-dimensional content and the interactive information content are simultaneously learned, multiple possible interactive behaviors of the watching user on the interactive information are used as targets of personalized sequencing for modeling, the attraction of the interactive information on the watching user can be improved, the interactive information is effectively recommended, and further the computing resources and the communication resources for recommending the interactive information can be saved.
In step S105, the plurality of pieces of interaction information are sorted based on the interaction probability, and are displayed based on the sorting result.
In some embodiments, the plurality of interactive information may be sorted in an ascending order or a descending order according to the interaction probability, and all or part of the interactive information may be displayed according to the sorting result.
For example, the interaction probability represents the probability of the viewing user for implementing the interaction behavior with respect to the interaction information, so that part of the interaction information with higher interaction probability is preferentially displayed, the attraction of the interaction information to the viewing user can be improved, the effective recommendation of the interaction information is realized, and the computing resources and the communication resources for recommending the interaction information can be saved.
In some embodiments, the interaction probabilities of different interaction types corresponding to each piece of interaction information are weighted and summed to obtain a ranking score of each piece of interaction information; and sequencing the plurality of interactive information in a descending or ascending manner according to the sequencing scores, and displaying at least part of the plurality of interactive information according to the sequencing result.
As an example, the weights corresponding to the interaction probabilities of different interaction types may be the same or different, and the weights may be parameters obtained in the training process of the neural network model, or values set by a user, a client, or a server.
For example, the interaction probability includes a like probability, a reply probability, and a forwarding probability, where the like probability is weighted by 0.5, the reply probability by 0.2, and the forwarding probability by 0.3. The like probability of the interaction information 1 is 0.2, the reply probability is 0.8, and the forwarding probability is 0.6; the like probability of the interaction information 2 is 0.4, the reply probability is 0.2, and the forwarding probability is 0.5. Therefore, the ranking score of the interaction information 1 is 0.2 × 0.5 + 0.8 × 0.2 + 0.6 × 0.3 = 0.44, and the ranking score of the interaction information 2 is 0.4 × 0.5 + 0.2 × 0.2 + 0.5 × 0.3 = 0.39, so the interaction information 1 can be ranked before the interaction information 2 and presented to the user, so that the user preferentially sees the interaction information 1.
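A minimal sketch reproducing this worked example (the weights are the ones given above):

```python
# Weighted summation of per-type interaction probabilities, then a
# descending sort by ranking score, as in the example above.
WEIGHTS = {"like": 0.5, "reply": 0.2, "forward": 0.3}

comments = {
    "interaction_info_1": {"like": 0.2, "reply": 0.8, "forward": 0.6},
    "interaction_info_2": {"like": 0.4, "reply": 0.2, "forward": 0.5},
}

def ranking_score(probs):
    return sum(WEIGHTS[t] * p for t, p in probs.items())

ranked = sorted(comments, key=lambda c: ranking_score(comments[c]), reverse=True)
print([(c, round(ranking_score(comments[c]), 2)) for c in ranked])
# [('interaction_info_1', 0.44), ('interaction_info_2', 0.39)]
```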
In the embodiment of the application, the ranking score of the interactive information represents the possibility that the watching user implements the interactive behavior on the interactive information, the interactive information with the larger ranking score is preferentially displayed, the attraction of the interactive information to the watching user can be promoted, the interactive information is effectively recommended, and then the computing resources and the communication resources for recommending the interactive information can be saved.
In some embodiments, referring to fig. 4, fig. 4 is a schematic flowchart of an interactive processing method for multimedia information provided in the embodiment of the present application, based on fig. 3, steps S106 to S108 may further be included before step S105, and step S105 may be replaced with step S109.
It should be noted that steps S106 to S108 and steps S101 to S104 may be executed in parallel or sequentially, which is not limited in the embodiment of the present application, and steps S106 to S108 and steps S101 to S104 are illustrated as an example in fig. 4.
In step S106, feature extraction is performed on the multimedia information to obtain multi-modal features of the multimedia information.
In some embodiments, text information is extracted from the multimedia information, and feature extraction is performed on the text information to obtain text features of the multimedia information, wherein the text information includes at least one of: title, barrage, dialog text, type, label; extracting a plurality of audio frames from the multimedia information, performing feature extraction on the plurality of audio frames to obtain a plurality of audio frame features which are in one-to-one correspondence with the plurality of audio frames, and performing fusion processing on the plurality of audio frame features to obtain audio features of the multimedia information; extracting a plurality of video frames from the multimedia information, performing feature extraction on the plurality of video frames to obtain a plurality of video frame features which are in one-to-one correspondence with the plurality of video frames, and performing fusion processing on the plurality of video frame features to obtain video features of the multimedia information; and performing fusion processing on the text features, the audio features and the video features to obtain multi-modal features of the multimedia information. Here, examples of extracting text features, audio features, and video features are similar to those included in step S103, and will not be described again here.
As an example of fig. 7, taking the multimedia information as a video as an example, extracting text information of the video, and performing feature extraction processing on the text information through an ALBERT model to obtain text features of the video. Extracting audio information of a video, and performing audio frame extraction processing on the audio information to obtain an audio frame sequence containing a plurality of audio frames; and performing feature extraction processing on each audio frame in the audio frame sequence through a VGGish model to obtain a plurality of audio frame features which are in one-to-one correspondence with the plurality of audio frames, and performing fusion processing on the plurality of audio frame features by adopting average pooling to obtain the audio features of the video. Performing video frame extraction processing on a video to obtain a video frame sequence comprising a plurality of video frames; feature extraction processing is carried out on each video frame in the video frame sequence through a ResNet model so as to obtain a plurality of video frame features which are in one-to-one correspondence with the plurality of video frames, and the video frame features are subjected to fusion processing by adopting average pooling so as to obtain the video features of the video. And performing fusion processing on the text features, the audio features and the video features to obtain a video multi-modal feature fusion representation.
In step S107, a text feature of each interactive information is constructed based on the user profile of the interactive user and each interactive information published by the interactive user for the multimedia information.
In some embodiments, a user representation of an interactive user is constructed based on multimedia information of interest to the interactive user; the following processing is performed for each interactive information: and splicing the text in the user portrait of the interactive user and the text in the interactive information, and extracting the features of the text obtained by splicing to obtain the text features of the interactive information.
By way of example, the multimedia information of interest includes: historically played multimedia information, favorited (collected) multimedia information, and historically interacted (commented, liked, or disliked) multimedia information.
For example, the user representation of the interactive user is an interest tag sequence of the interactive user, which is constructed based on the iterative learning of the multimedia information of interest of the interactive user, for example, the following processing is performed on the multimedia information of interest of the interactive user through a neural network model: extracting a feature vector of the interested multimedia information; and mapping the extracted feature vectors into probabilities corresponding to a plurality of candidate interest tag sequences respectively, and determining the candidate interest tag sequence corresponding to the maximum probability as a user portrait of the interactive user.
Here, the neural network model is trained by using sample multimedia information and an interest tag sequence labeled for the sample multimedia information as samples. Therefore, the neural network model has the capability of identifying the user portrait of the interactive user from the multimedia information which is interested by the interactive user, so that the user portrait of the interactive user can be accurately determined.
As an example of fig. 7, the text in the user portrait of the comment publisher and the text in the comment (i.e., the above-mentioned interaction information) are spliced, and then feature extraction is performed on the spliced text through the ALBERT model to obtain a comment text representation.
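A minimal sketch of this splicing-and-encoding step using the Hugging Face transformers library; the checkpoint name, the space-joined splicing, the example texts, and the mean pooling are assumptions for illustration:

```python
from transformers import AlbertTokenizer, AlbertModel
import torch

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

portrait_text = "travel food photography"  # hypothetical interest tags of the publisher
comment_text = "The scenery in this video is stunning!"

# Splice the user-portrait text with the comment text, then encode.
inputs = tokenizer(portrait_text + " " + comment_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into a single comment text representation.
comment_feature = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(comment_feature.shape)  # torch.Size([768])
```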
In step S108, a correlation degree between each interactive information and the multimedia information is determined based on the multi-modal features of the multimedia information and the text features of each interactive information.
In some embodiments, the following is performed for each interaction information: performing fusion processing on the multi-modal characteristics and the text characteristics of the interactive information to obtain correlation fusion characteristics; and mapping the correlation degree fusion characteristics into probabilities respectively belonging to different candidate correlation degrees, and determining the candidate correlation degree corresponding to the maximum probability as the correlation degree between the interactive information and the multimedia information.
As an example of fig. 7, the video multi-modal feature fusion representation and the comment text representation are fused, the fused features are input into a fully connected network, and the fully connected network maps the fused features to the corresponding video-comment relevance (i.e., the relevance between the above-mentioned interactive information and the multimedia information).
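A minimal sketch of this relevance head, assuming six discrete candidate relevance levels, placeholder fully connected weights, and illustrative feature dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_LEVELS = 6                       # assumed candidate relevance degrees 0..5
VIDEO_DIM, TEXT_DIM = 384, 768
W = rng.standard_normal((NUM_LEVELS, VIDEO_DIM + TEXT_DIM))  # placeholder FC weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relevance(video_feat, comment_feat):
    fused = np.concatenate([video_feat, comment_feat])  # relevance fusion feature
    probs = softmax(W @ fused)                          # probability per candidate level
    return int(np.argmax(probs))                        # level with maximum probability

print(relevance(rng.standard_normal(VIDEO_DIM), rng.standard_normal(TEXT_DIM)))
```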
In step S109, the plurality of pieces of interaction information are sorted based on the interaction probability and the correlation, and are displayed based on the sorting result.
In some embodiments, the interaction probability and the correlation corresponding to each piece of interaction information are weighted and summed to obtain a ranking score of each piece of interaction information, the pieces of interaction information are ranked in a descending order or an ascending order according to the ranking score, and at least part of the pieces of interaction information are displayed according to the ranking result.
As an example, the interaction probability and the weight corresponding to the correlation degree may be the same or different, and the weight may be a parameter obtained in the training process of the neural network model, or may be a value set by a user, a client, or a server.
As an example, the interaction probability corresponding to the interaction information may be obtained by performing a weighted summation on the interaction probabilities corresponding to different interaction types.
For example, the interaction probability includes a like probability, a reply probability, and a forwarding probability, where the like probability is weighted by 0.5, the reply probability by 0.2, the forwarding probability by 0.3, and the correlation degree by 0.6. The like probability of the interactive information 1 is 0.2, the reply probability is 0.8, the forwarding probability is 0.6, and the correlation degree is 0.7; the like probability of the interactive information 2 is 0.4, the reply probability is 0.2, the forwarding probability is 0.5, and the correlation degree is 0.4. Therefore, the ranking score of the interactive information 1 is 0.2 × 0.5 + 0.8 × 0.2 + 0.6 × 0.3 + 0.7 × 0.6 = 0.86, and the ranking score of the interactive information 2 is 0.4 × 0.5 + 0.2 × 0.2 + 0.5 × 0.3 + 0.4 × 0.6 = 0.63, so the interactive information 1 can be ranked before the interactive information 2 and presented to the user, so that the user preferentially sees the interactive information 1.
In the embodiments of the present application, the higher the correlation between the interactive information and the multimedia information, the stronger the adhesion between the two; users watching the multimedia information are interested in the multimedia information and are therefore likely to also be interested in interactive information that is highly correlated with it. Therefore, sorting the interactive information based on the correlation degree and the interaction probability can further improve the possibility that the viewing user implements interactive behaviors on the preferentially displayed interactive information, thereby realizing effective recommendation of the interactive information and saving the computing resources and communication resources for recommending the interactive information.
In some embodiments, referring to fig. 5, fig. 5 is a schematic flowchart of an interactive processing method for multimedia information provided in the embodiments of the present application, based on fig. 4, step S110 to step S111 may be further included before step S109, and step S109 may be replaced with step S112.
It should be noted that steps S110 to S111 and steps S106 to S108 may be executed in parallel or sequentially, which is not limited in the embodiments of the present application; fig. 5 illustrates sequential execution of steps S110 to S111 and steps S106 to S108 as an example.
In step S110, interaction polar information of each interaction information is determined.
In some embodiments, the following is performed on each interaction information by the neural network model: extracting the characteristics of the interactive information to obtain the text characteristics of the interactive information; mapping the text features into probabilities respectively belonging to different candidate interaction polar information; and determining the candidate interaction polar information corresponding to the maximum probability as the interaction polar information of the interaction information.
As an example, the neural network model is trained with sample interaction information and interaction polar information labeled for the sample interaction information. Therefore, the neural network model has the capability of identifying the interaction polar information from the interaction information, and the interaction polar information of the interaction information can be accurately determined.
As an example, the neural network model may include various types, such as a CNN model, an RNN model, a multi-layer feedforward neural network model, and the like. The neural network model can be trained in a supervision mode, wherein a loss function used for training the neural network model is used for representing the difference between the predicted value and the actual labeled data, and the loss function can be a 0-1 loss function, a perception loss function, a cross entropy loss function or the like.
As an example of fig. 8, feature extraction is performed on the comment text through the ALBERT model to obtain a comment text representation, and the comment text representation is mapped to a corresponding comment emotion polarity (i.e., the above-mentioned interaction polar information).
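A minimal sketch of such a polarity classifier head, assuming the 0-5 grading described later in this document and placeholder classification weights:

```python
import numpy as np

rng = np.random.default_rng(3)
GRADES = 6                               # assumed polarity grades 0 (most negative) .. 5 (most positive)
W = rng.standard_normal((GRADES, 768))   # placeholder classification weights

def polarity_grade(text_feature):
    # Map the comment text feature to probabilities over polarity grades
    # and take the grade with maximum probability.
    logits = W @ text_feature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs))

print(polarity_grade(rng.standard_normal(768)))
```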
The interaction polar direction information in the embodiment of the application represents the emotion guidance of the interaction information to the watching user, the interaction information with the higher interaction polar direction grade is preferentially displayed to the watching user, and the watching user can be guided positively, so that the possibility that the watching user implements interaction behavior on the preferentially displayed interaction information is improved, the effective recommendation of the interaction information is realized, and further the computing resource and the communication resource for recommending the interaction information can be saved.
In step S111, heat information of each interactive information is determined based on the degree of interaction of each interactive information and the degree of interaction of the interactive user.
In some embodiments, the following is performed for each interaction information: determining the interaction degree of the interaction information according to the interacted times of the interaction information; determining the interaction degree of the interactive user according to the interaction times of the interactive user; and carrying out weighted summation on the interaction degree of the interaction information and the interaction degree of the interaction user, and determining the result of the weighted summation as the heat information of the interaction information.
As an example, the interaction degree of the interaction information and the weight corresponding to the interaction degree of the interaction user may be the same or different, and the weight may be a parameter obtained in the training process of the neural network model, or may be a value set by the user, the client, or the server.
Taking the example in which the multimedia information is a video and the interactive information is a comment, the comprehensive heat value (i.e., the above-mentioned heat information) of each comment is Ph = x1 × comment publisher interaction degree (i.e., the above-mentioned interaction degree of the interactive user) + x2 × comment interaction heat degree (i.e., the above-mentioned interaction degree of the interactive information), where x1 and x2 are weights.
As an example, determining the interaction degree of the interaction information according to the number of times of being interacted with the interaction information may include: carrying out weighted summation on the praise times, the replied times and the forwarded times of the interactive information, and determining the ratio between the result of the weighted summation and the playing times of the multimedia information as a first ratio; when the first ratio is larger than the first ratio threshold, determining the first ratio threshold as the interaction degree of the interaction information; and when the first ratio is not larger than the first ratio threshold, determining the first ratio as the interaction degree of the interaction information.
For example, the first proportional threshold may be a parameter obtained in the training process of the neural network model, or may be a value set by a user, a client, or a server, and the first proportional threshold may be any number.
Taking the first proportional threshold of 1, the multimedia information being a video, and the interactive information being a comment as an example, the comment interaction heat degree is min((w1 × number of times the comment is liked + w2 × number of times the comment is replied to + w3 × number of times the comment is forwarded) / total play count of the current video, 1), where w1, w2, and w3 are weights.
As an example, determining the interaction degree of the interactive user according to the number of interactions of the interactive user may include: carrying out weighted summation on the playing times of the multimedia information published by the interactive user, the times of publishing the interactive information and the times of the published interactive information being interacted, and determining the ratio between the result of the weighted summation and the interactive parameter as a second ratio; when the second ratio is larger than the second ratio threshold, determining the second ratio threshold as the interaction degree of the interactive user; and when the second ratio is not larger than the second ratio threshold, determining the second ratio as the interaction degree of the interactive user.
For example, the interaction parameter is a maximum value selected from a plurality of interaction results corresponding to a plurality of interaction users one to one, and the interaction result is obtained by performing weighted summation on the number of times that the multimedia information published by the interaction users is played, the number of times that the interaction information is published, and the number of times that the published interaction information is interacted. The second proportional threshold may be a parameter obtained in the training process of the neural network model, or may be a value set by a user, a client, or a server, and may be any number.
Taking the second proportional threshold of 1, the multimedia information being a video, and the interactive information being a comment as an example, the comment publisher interaction degree is min((u1 × total number of times the video works published by the comment publisher are played + u2 × total number of comments published by the comment publisher + u3 × total number of times the comments published by the comment publisher are liked/replied/forwarded) / T, 1).
Wherein u1, u2, and u3 are weights; T is a constant, and, for example, the maximum value of (u1 × total number of times the video works published by the comment publisher are played + u2 × total number of comments published by the comment publisher + u3 × total number of times the comments published by the comment publisher are liked/replied/forwarded) calculated over all comment publishers may be selected as T.
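A minimal sketch of these heat computations; the weights x1, x2, w1-w3, u1-u3, the constant T, and all the input counts below are illustrative values, not values specified by the embodiments:

```python
def comment_interaction_heat(likes, replies, forwards, video_plays,
                             w1=0.5, w2=0.3, w3=0.2):
    # Interaction degree of the comment, capped at the first proportional threshold 1.
    return min((w1 * likes + w2 * replies + w3 * forwards) / max(video_plays, 1), 1.0)

def publisher_interaction_degree(plays_of_works, comments_published,
                                 comment_interactions, T,
                                 u1=0.4, u2=0.3, u3=0.3):
    # Interaction degree of the publisher, capped at the second proportional threshold 1.
    return min((u1 * plays_of_works + u2 * comments_published
                + u3 * comment_interactions) / T, 1.0)

def comprehensive_heat(pub_degree, comment_heat, x1=0.5, x2=0.5):
    # Ph = x1 * publisher interaction degree + x2 * comment interaction heat.
    return x1 * pub_degree + x2 * comment_heat

heat = comprehensive_heat(
    publisher_interaction_degree(10_000, 120, 800, T=50_000),
    comment_interaction_heat(likes=30, replies=12, forwards=5, video_plays=2_000),
)
print(round(heat, 3))
```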
The comprehensive heat of the interactive information is not only related to the interactive condition of the interactive information, but also related to the interactive user who issues the interactive information, so that the comprehensive heat of the interactive information is determined by synthesizing the interactive condition of the interactive information and the interactive condition of the interactive user who issues the interactive information, and the accuracy of the determined comprehensive heat of the interactive information can be improved.
In step S112, the interaction information is sorted based on the interaction polar information, the heat information, the interaction probability, and the correlation.
In some embodiments, the interaction polar information, the relevancy, the interaction probability and the heat information corresponding to each interaction information are weighted and summed to obtain a ranking score of each interaction information; and sorting the plurality of interactive information in a descending order or an ascending order according to the sorting scores, and displaying the plurality of interactive information based on the sorting result.
As an example, the interaction polar direction represents the emotional orientation of the interaction information to the user, and the higher the interaction polar direction is, the more positive the corresponding emotional orientation of the interaction information to the user is, that is, the higher the user's perception of the interaction information is. The heat information represents the attention degree of the interactive information, and the higher the heat is, the higher the possibility that the corresponding interactive information is the interactive information which is interested by the user is represented.
As an example, the weights corresponding to the interaction polar information, the correlation degree, the interaction probability, and the heat information may be the same or different, and the weights may be parameters obtained in the training process of the neural network model, or values set by a user, a client, or a server.
As an example, the interaction probability corresponding to the interaction information may be obtained by performing a weighted summation on the interaction probabilities corresponding to different interaction types.
For example, the interaction probability includes a like probability, a reply probability, and a forwarding probability, where the like probability is weighted by 0.5, the reply probability by 0.2, the forwarding probability by 0.3, the correlation degree by 0.6, the interaction polar information by 0.2, and the heat information by 0.4. The like probability of the interactive information 1 is 0.2, the reply probability is 0.8, the forwarding probability is 0.6, the correlation degree is 0.7, the interaction polar information is 0.4, and the heat information is 0.3; the like probability of the interactive information 2 is 0.4, the reply probability is 0.2, the forwarding probability is 0.5, the correlation degree is 0.4, the interaction polar information is 0.6, and the heat information is 0.2. Therefore, the ranking score of the interactive information 1 is 0.2 × 0.5 + 0.8 × 0.2 + 0.6 × 0.3 + 0.7 × 0.6 + 0.4 × 0.2 + 0.3 × 0.4 = 1.06, and the ranking score of the interactive information 2 is 0.4 × 0.5 + 0.2 × 0.2 + 0.5 × 0.3 + 0.4 × 0.6 + 0.6 × 0.2 + 0.2 × 0.4 = 0.83, so the interactive information 1 can be ranked before the interactive information 2 and presented to the user, so that the user preferentially sees the interactive information 1.
In the embodiments of the present application, a higher interaction polar direction represents a more positive emotional guidance of the corresponding interactive information, that is, a better perception of the interactive information by the user, and a higher heat represents a higher possibility that the corresponding interactive information is of interest to the user. Therefore, sorting the interactive information based on the interaction polar information, the correlation degree, the interaction probability, and the heat information can further improve the possibility that the viewing user implements interactive behaviors on the preferentially displayed interactive information, thereby realizing effective recommendation of the interactive information and saving the computing resources and communication resources for recommending the interactive information.
In some embodiments, before step S112, the method may further include: performing weighted summation on the interaction polar information, the correlation degree, and the heat information of each piece of interactive information to obtain a pre-ranking score of each piece of interactive information; sorting the plurality of pieces of interactive information in ascending order according to the pre-ranking scores, and filtering out the front (lowest-scoring) part of the interactive information in the ascending sorting result.
In the embodiment of the present application, if the number of the interactive information is large, a part of the interactive information with high quality can be selected through pre-sorting, so that the computing resource for sorting the interactive information in step S112 can be saved.
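A minimal sketch of this pre-ranking filter; the weights g1-g3 and the keep ratio are illustrative placeholders:

```python
def pre_rank_filter(comments, g1=0.3, g2=0.4, g3=0.3, keep_ratio=0.5):
    # Score each comment by a weighted sum of polarity, relevance, and heat,
    # sort ascending, then drop the lowest-scoring fraction so that only the
    # better part enters the full ranking of step S112.
    scored = sorted(
        comments,
        key=lambda c: g1 * c["polarity"] + g2 * c["relevance"] + g3 * c["heat"],
    )
    drop = int(len(scored) * (1 - keep_ratio))
    return scored[drop:]  # filter out the front (lowest-scoring) part

pool = [{"id": i, "polarity": i % 6 / 5, "relevance": (i * 7) % 10 / 9,
         "heat": (i * 3) % 10 / 9} for i in range(10)]
print([c["id"] for c in pre_rank_filter(pool)])
```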
The following describes an interactive processing method of multimedia information provided in an embodiment of the present application, by taking an example that the multimedia information is a video and the interactive information is a video comment.
When a video is displayed in a video client or a webpage, a video comment list for the video is usually displayed below or on the right of a video playing window, however, the data volume of the video comment in the video comment list is large, in the embodiment of the present application, based on the personalized features of a current video watching user (or called a viewer, namely, the above-mentioned viewing user), video comments are subjected to praise, reply and forwarding multi-interaction target prediction, and based on the multi-interaction target prediction condition, video comments in the video comment list are subjected to personalized ranking, so that the interaction rate of the video watching user on the video comments is improved. In addition, the characteristics of the comment publishing user (or called comment publisher, namely the interactive user) and the characteristics of the current video watching user are simultaneously fused into the personalized ranking model, so that personalized ranking can be performed for different watching users.
According to the method and the device, the video multidimensional content and the comment content are simultaneously learned through combining the personalized features of the comment publishers and the current viewer users, multiple possible interaction behaviors of the viewers on the video comments are used as targets of personalized sequencing for modeling, the attraction of a video comment list to the users is improved, and the interaction rate of the users on the comments is improved.
Referring to fig. 6A, fig. 6A is a schematic flowchart of an interactive processing method for multimedia information according to an embodiment of the present disclosure.
In step S601, the terminal presents a video page, where the video page includes a video list, and the video list includes a plurality of videos.
In step S602, the terminal transmits a corresponding video acquisition request to the server in response to the video selection operation received in the video list.
In step S603, the server acquires, in response to the video acquisition request, a video (or a video cover) corresponding to the video acquisition request, and a plurality of comments for the video.
In step S604, the server sorts the plurality of comments.
Referring to fig. 6B, fig. 6B is a schematic flowchart of an interactive processing method for multimedia information according to an embodiment of the present application, and based on fig. 6A, step S604 may include step S6041 and step S6042.
In step S6041, a comment ranking feature is calculated based on the video content and the comment publisher feature.
In some embodiments, step S6041 mainly determines the relevance between the comment and the video multi-dimensional content, the comment emotion orientation (i.e., the above-mentioned interaction orientation information), and the comment popularity (i.e., the above-mentioned popularity information), so as to assist in subsequent personalized ranking of the video comments based on the user multi-interaction targets.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a model of a correlation between comments and multi-dimensional content of a video provided in an embodiment of the present application.
In some embodiments, video multidimensional content is modeled and represented by an ALBERT model; modeling and representing the user portrait of the comment publisher and the comment text through an ALBERT model; and then calculating the correlation between the comments and the multi-dimensional content of the video. According to the embodiment of the application, the user characteristics of the comment publishers are introduced, the representation capability of the comments is improved, and therefore the calculation accuracy of the relevancy of the user comments and the video content is improved.
By way of example, a user representation of a comment publisher is a sequence of interest tags of the comment publisher, and is constructed through iterative learning based on the tags of videos historically played by the comment publisher. The construction process of the comment text representation includes: splicing the comment publisher user portrait word text and the comment text, and then extracting the characteristics of the spliced text through an ALBERT model to obtain the corresponding comment text representation.
As an example, the video text features simultaneously take as input the category, semantic labels, title, and bullet comments of the video, as well as the spoken text obtained through Optical Character Recognition (OCR) or Automatic Speech Recognition (ASR). In addition, the audio features and the image features are modeled through multi-frame time-sequence representations, and a video multi-dimensional representation (namely the video multi-modal feature fusion representation in fig. 7, i.e., the multi-modal features of the multimedia information) is constructed together with the video text features.
As an example, performing model training based on the labeled video-comment relevance data to obtain a trained model; through the trained model, after the video content, the user portrait of the comment publisher and the comment content are input, the user comment and video relevancy score Pr (i.e. the video-comment relevancy in fig. 7, i.e. the relevancy between the interactive information and the multimedia information) is output.
In some embodiments, the comment emotion direction represents the emotion guidance of the comment to the user, and the comment with higher comment emotion direction is exposed to the user, so that the user can be guided more positively. By grading the comment emotion polar directions (for example, 0-5 grades, where 5 represents the most positive direction and 0 represents the most negative direction), a comment emotion polar direction data set is constructed in advance, and the comment emotion polar directions are learned through an ALBERT model, see fig. 8, where fig. 8 is a schematic structural diagram of a comment emotion polar direction classification model provided in an embodiment of the present application. The model shown in fig. 8 may output a corresponding comment emotion polar level Pv based on the comment text.
In some embodiments, the comprehensive heat value of the comment (i.e., the above-mentioned heat information) is also a basis for ranking the video comments, and is calculated by combining the interaction degree of the comment publisher (i.e., the interaction degree of the interactive user) and the comment interaction heat degree (i.e., the interaction degree of the interactive information). Specifically, the comprehensive heat value of each comment is Ph = x1 × comment publisher interaction degree + x2 × comment interaction heat degree, where x1 and x2 are weights.
As an example, the comment interaction heat degree is min((w1 × number of times the comment is liked + w2 × number of times the comment is replied to + w3 × number of times the comment is forwarded) / total play count of the current video, 1), where w1, w2, and w3 are weights.
As an example, the comment publisher interaction degree is min((u1 × total number of times the video works published by the comment publisher are played + u2 × total number of comments published by the comment publisher + u3 × total number of times the comments published by the comment publisher are liked/replied/forwarded) / T, 1).
Wherein u1, u2, and u3 are weights; T is a constant, and, for example, the maximum value of (u1 × total number of times the video works published by the comment publisher are played + u2 × total number of comments published by the comment publisher + u3 × total number of times the comments published by the comment publisher are liked/replied/forwarded) calculated over all comment publishers may be selected as T.
In the embodiments of the present application, the interaction degree of the comment publisher is introduced, and the characteristics of different comment publishers are taken into account, so that the calculation of the comprehensive heat value of comments is more accurate. Based on the user comment-video relevance score Pr, the comment emotion polarity level Pv, and the comprehensive heat value Ph of the comment, the video comment list can be preliminarily ranked, in preparation for the subsequent personalized ranking of video comments based on the user multi-interaction targets. The purpose of the preliminary ranking here is: if the number of video comments is too large, a better part of the comments is selected through the preliminary ranking, which can reduce the computation of the ranking model in step S6042.
In step S6042, video comment personalized ranking is performed based on the user multi-interaction targets and comment ranking features.
In some embodiments, when a user watches a video, comments are displayed to the watching user for the purpose of improving the interaction rate of the user, such as praise, reply, or forwarding of the comments, and by taking the interaction behavior of the watching user on the video comments as a learning target and predicting multiple possible interaction targets of the user, the comments with more interaction potential are displayed to the user, and the interaction rate of the watching user on the video comments is improved.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a user personalized comment multi-interaction target prediction model provided in an embodiment of the present application.
In some embodiments, the user-personalized comment multi-interaction target prediction model jointly models comments by taking the user portraits of the viewer and the comment publisher as input. The ALBERT models on the two sides may share parameters (i.e., the two models may be the same) or not share parameters, and meanwhile the constructed video multi-modal feature fusion representation is used as a model input feature to strengthen the influence of the video content background on comment interaction. Model learning is carried out on online comment interaction data, so that, given the input user portraits of the viewer and the comment publisher, the comment text, and the video multi-dimensional representation, the model outputs the prediction probabilities of the current viewing user for various interaction behaviors on the comment, including: the like probability Pc, the forwarding probability Pt, and the reply probability Pre.
As an example, the manner of determination of the user-comment interactive representation may include: performing feature extraction on user portrait word texts of video viewers through an ALBERT model to obtain user representation of the viewers; splicing a comment publisher user portrait word text and a comment text, and then performing feature extraction on the spliced text through an ALBERT model to obtain a comment text representation; and carrying out fusion processing (such as splicing and weighted summation) on the user representation of the viewer and the comment text representation to obtain a user-comment interactive representation.
In some embodiments, the video comment list is subjected to comprehensive personalized ranking based on the user comment-video relevance score Pr, the comment emotion polarity level Pv, the comprehensive heat value Ph of the comment, and the predicted probabilities (the like probability Pc, the forwarding probability Pt, and the reply probability Pre) of the viewing user for various interaction behaviors on the comment. For example, the final user personalized ranking score is f1 × Pr + f2 × Pv + f3 × Ph + f4 × Pc + f5 × Pt + f6 × Pre, where f1-f6 are ranking feature weights. The video comment list is ranked based on the final user personalized ranking score; for example, the higher the score, the higher the comment is ranked.
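A minimal sketch of this final scoring and ranking step; the weight values f1-f6 and the per-comment signal values below are illustrative placeholders:

```python
def personalized_score(Pr, Pv, Ph, Pc, Pt, Pre,
                       f1=0.2, f2=0.1, f3=0.1, f4=0.25, f5=0.15, f6=0.2):
    # Final user personalized ranking score, as defined above.
    return f1 * Pr + f2 * Pv + f3 * Ph + f4 * Pc + f5 * Pt + f6 * Pre

comments = [
    {"id": "c1", "Pr": 0.7, "Pv": 0.8, "Ph": 0.3, "Pc": 0.6, "Pt": 0.2, "Pre": 0.5},
    {"id": "c2", "Pr": 0.4, "Pv": 0.5, "Ph": 0.6, "Pc": 0.3, "Pt": 0.4, "Pre": 0.2},
]
# Higher score ranks higher in the comment list.
comments.sort(key=lambda c: personalized_score(**{k: v for k, v in c.items() if k != "id"}),
              reverse=True)
print([c["id"] for c in comments])
```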
In step S605, the server transmits the video (or the video cover) and the sorted plurality of comments to the terminal.
In step S606, the terminal displays a video (or a video cover) on the video page, and displays the sorted plurality of comments in the video comment list.
According to the method and the device, the personalized features of the current watching user and the comment publisher are comprehensively considered, the video comments are subjected to multi-interaction target prediction on the basis of fully understanding the multi-dimensional content of the video, and the comments are subjected to personalized sequencing on the basis of various possible implemented interaction behaviors of the current watching user on the comments, so that the interaction rate of the user on the comments is improved, and the comment community atmosphere is driven.
An exemplary structure of the interactive processing device for multimedia information provided by the embodiment of the present application implemented as a software module is described below with reference to fig. 2.
In some embodiments, as shown in fig. 2, the software modules stored in the interactive processing device 455 of multimedia information in the memory 450 may include:
an obtaining module 4551, configured to obtain a plurality of interactive information for multimedia information; a building module 4552, configured to build a user comment interaction feature of the multimedia information based on user figures of a viewing user and an interactive user of the multimedia information and each piece of interaction information issued by the interactive user for the multimedia information; the fusion module 4553 is configured to perform fusion processing based on the user comment interactive feature and the multi-modal feature of the multimedia information to obtain a fusion feature; an interaction determining module 4554 configured to determine an interaction probability of the viewing user for each piece of interaction information based on the fusion feature; and the sequencing module 4555 is configured to sequence the plurality of pieces of interaction information based on the interaction probability, and display the plurality of pieces of interaction information based on a sequencing result.
In the above solution, the building module 4552 is further configured to perform the following processing for each piece of interaction information: splicing the text in the user portrait of the interactive user and the text in the interactive information, and extracting the characteristics of the text obtained by splicing to obtain the text characteristics of the interactive information; extracting the characteristics of the text in the user portrait of the watching user to obtain the user characteristics of the watching user; and fusing the text characteristic of the interactive information and the user characteristic of the watching user to obtain the user comment interactive characteristic of the multimedia information.
In the scheme, the text characteristic of the interactive information and the user characteristic of the watching user are extracted through the same language processing model; the building module 4552 is further configured to determine, through the language processing model, an attention weight of a text feature of the interactive information and an attention weight of a user feature of the viewing user, and perform weighted summation on the text feature of the interactive information and the user feature of the viewing user according to the attention weight of the text feature of the interactive information and the attention weight of the user feature of the viewing user, so as to obtain a user comment interactive feature of the multimedia information.
In the above scheme, the fusion module 4553 is further configured to perform feature extraction on the multimedia information to obtain a text feature, an audio feature, and a video feature of the multimedia information; performing fusion processing on the text feature, the audio feature and the video feature to obtain a multi-modal feature of the multimedia information; and splicing the user comment interactive features and the multi-mode features of the multimedia information to obtain fusion features.
In the above scheme, the fusing module 4553 is further configured to extract text information from the multimedia information, and perform feature extraction on the text information to obtain text features of the multimedia information, where the text information includes at least one of the following: title, barrage, dialog text, type, label; extracting a plurality of audio frames from the multimedia information, performing feature extraction on the plurality of audio frames to obtain a plurality of audio frame features which are in one-to-one correspondence with the plurality of audio frames, and performing fusion processing on the plurality of audio frame features to obtain audio features of the multimedia information; the method comprises the steps of extracting a plurality of video frames from multimedia information, carrying out feature extraction on the plurality of video frames to obtain a plurality of video frame features which are in one-to-one correspondence with the plurality of video frames, and carrying out fusion processing on the plurality of video frame features to obtain the video features of the multimedia information.
In the above scheme, the interaction determining module 4554 is further configured to perform the following processing for each piece of interaction information: mapping the fusion characteristics into interaction probabilities of interaction information corresponding to different interaction types; wherein the interaction type comprises at least one of: like, forward, reply, dislike.
In the above scheme, the sorting module 4555 is further configured to perform weighted summation on the interaction probabilities of different interaction types corresponding to each piece of interaction information to obtain a sorting score of each piece of interaction information; and sequencing the plurality of interactive information in a descending or ascending manner according to the sequencing scores, and displaying at least part of the plurality of interactive information according to the sequencing result.
In the above solution, the interactive processing device 455 for multimedia information further includes: the relevancy determining module is used for extracting the characteristics of the multimedia information to obtain multi-modal characteristics of the multimedia information; constructing text characteristics of each interactive information based on a user portrait of the interactive user and each interactive information published by the interactive user aiming at the multimedia information; determining the correlation degree between each piece of interactive information and the multimedia information based on the multi-modal characteristics of the multimedia information and the text characteristics of each piece of interactive information; the ranking module 4555 is further configured to perform weighted summation on the interaction probability and the correlation corresponding to each piece of interaction information to obtain a ranking score of each piece of interaction information; and sorting the plurality of interactive information in a descending order or an ascending order according to the sorting scores.
In the scheme, the relevancy determining module is further used for constructing a user portrait of the interactive user according to the multimedia information interested by the interactive user; the following processing is performed for each interactive information: and splicing the text in the user portrait of the interactive user and the text in the interactive information, and extracting the features of the text obtained by splicing to obtain the text features of the interactive information.
In the foregoing solution, the relevancy determining module is further configured to perform the following processing for each piece of interaction information: performing fusion processing on the multi-modal characteristics and the text characteristics of the interactive information to obtain correlation fusion characteristics; and mapping the correlation degree fusion characteristics into probabilities respectively belonging to different candidate correlation degrees, and determining the candidate correlation degree corresponding to the maximum probability as the correlation degree between the interactive information and the multimedia information.
In the foregoing solution, the relevancy determining module is further configured to extract text information from the multimedia information, and perform feature extraction on the text information to obtain a text feature of the multimedia information, where the text information includes at least one of the following: title, barrage, dialog text, type, label; extracting a plurality of audio frames from the multimedia information, performing feature extraction on the plurality of audio frames to obtain a plurality of audio frame features which are in one-to-one correspondence with the plurality of audio frames, and performing fusion processing on the plurality of audio frame features to obtain audio features of the multimedia information; extracting a plurality of video frames from the multimedia information, performing feature extraction on the plurality of video frames to obtain a plurality of video frame features which are in one-to-one correspondence with the plurality of video frames, and performing fusion processing on the plurality of video frame features to obtain video features of the multimedia information; and performing fusion processing on the text features, the audio features and the video features to obtain multi-modal features of the multimedia information.
In the above solution, the interactive processing device 455 for multimedia information further includes: the polar direction heat determining module is used for determining interactive polar direction information of each piece of interactive information; determining heat information of each interactive information based on the interactive degree of each interactive information and the interactive degree of the interactive users; the ranking module 4555 is further configured to perform weighted summation on the interaction polar information, the correlation degree, and the heat degree information corresponding to each piece of interaction information to obtain a ranking score of each piece of interaction information.
In the above solution, the polar direction heat determination module is further configured to perform the following processing on each piece of interaction information through the neural network model: extracting the characteristics of the interactive information to obtain the text characteristics of the interactive information; mapping the text features into probabilities respectively belonging to different candidate interaction polar information; determining candidate interaction polar direction information corresponding to the maximum probability as interaction polar direction information of the interaction information; the neural network model is obtained by training sample interaction information and interaction polar information labeled according to the sample interaction information.
In the above solution, the extreme heat determining module is further configured to perform the following processing for each piece of interaction information: determining the interaction degree of the interaction information according to the interacted times of the interaction information; determining the interaction degree of the interactive user according to the interaction times of the interactive user; and carrying out weighted summation on the interaction degree of the interaction information and the interaction degree of the interaction user, and determining the result of the weighted summation as the heat information of the interaction information.
In the above scheme, the extreme heat determining module is further configured to perform weighted summation on the liked times, the replied times and the forwarded times of the interactive information, and determine a ratio between a result of the weighted summation and the playing times of the multimedia information as a first ratio; when the first ratio is larger than the first ratio threshold, determining the first ratio threshold as the interaction degree of the interaction information; and when the first ratio is not larger than the first ratio threshold, determining the first ratio as the interaction degree of the interaction information.
In the above scheme, the extreme heat determining module is further configured to perform weighted summation on the number of times that multimedia information published by an interactive user is played, the number of times that interactive information is published, and the number of times that published interactive information is interacted, and determine a ratio between a result of the weighted summation and an interaction parameter as a second ratio; the interactive parameters are the maximum values selected from a plurality of interactive results which are in one-to-one correspondence with a plurality of interactive users, and the interactive results are obtained by weighting and summing the playing times of multimedia information issued by the interactive users, the times of issuing interactive information and the times of interaction of issued interactive information; when the second ratio is larger than the second ratio threshold, determining the second ratio threshold as the interaction degree of the interactive user; and when the second ratio is not larger than the second ratio threshold, determining the second ratio as the interaction degree of the interactive user.
In some embodiments, the machine learning model (e.g., the neural network model or the language learning model) for implementing the interactive processing method for multimedia information provided by the embodiment of the present application may be stored in a blockchain network, so that when the server 200 or the terminal 400 in fig. 1 sequences the interactive information, the server 200 or the terminal may directly obtain the corresponding machine learning model through the blockchain network, and sequence the interactive information through the obtained machine learning model, so that the server or the terminal can implement the sequencing without training the machine learning model, thereby saving the consumption of training resources of the server or the terminal.
Taking an electronic device as an example of a computer device, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the method for interactive processing of multimedia information according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, cause the processor to perform a method for interactive processing of multimedia information provided in an embodiment of the present application, for example, the method for interactive processing of multimedia information shown in fig. 3, 4, 5, 6A and 6B, where the computer includes various computing devices including an intelligent terminal and a server.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, the computer-executable instructions may be in the form of programs, software modules, scripts or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and they may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, computer-executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, computer-executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the multi-mode characteristics of the user portrait, the interactive information and the multimedia information of the watching user and the interactive user, the interaction probability of the watching user for each piece of interactive information is determined, the interactive information is recommended according to the interaction probability, and compared with the prior art that the interactive information is sequenced according to single indexes such as the word number or the hot degree of the interactive information, the possibility that the interactive user implements the interaction behavior on the recommended interactive information can be improved, the recommendation efficiency can be improved, and further the computing resources and the communication resources for recommending the interactive information can be saved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. An interactive processing method for multimedia information, the method comprising:
acquiring a plurality of pieces of interactive information for multimedia information;
constructing a user comment interactive feature of the multimedia information based on user portraits of a viewing user and an interactive user of the multimedia information and each piece of interactive information published by the interactive user for the multimedia information;
performing fusion processing based on the user comment interactive feature and multi-modal features of the multimedia information to obtain a fusion feature;
determining an interaction probability of the viewing user for each piece of interactive information based on the fusion feature;
and sorting the plurality of pieces of interactive information based on the interaction probabilities, and displaying the plurality of pieces of interactive information based on a sorting result.
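For illustration only, the following is a minimal Python sketch (using PyTorch) of the overall pipeline of claim 1. The layer sizes, the ReLU/sigmoid choices, and the mean-probability sort rule are assumptions of this sketch, not limitations of the claims.

    import torch
    import torch.nn as nn

    class InteractionRanker(nn.Module):
        # Fuses the user comment interactive feature, the viewer feature and the
        # multi-modal media feature, then maps the fusion feature to interaction
        # probabilities -- one per interaction type.
        def __init__(self, dim: int = 128, num_types: int = 4):
            super().__init__()
            self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
            self.head = nn.Linear(dim, num_types)

        def forward(self, comment_feat, viewer_feat, media_feat):
            fusion = self.fuse(torch.cat([comment_feat, viewer_feat, media_feat], dim=-1))
            return torch.sigmoid(self.head(fusion))  # interaction probabilities

    ranker = InteractionRanker()
    probs = ranker(torch.randn(10, 128), torch.randn(10, 128), torch.randn(10, 128))
    display_order = probs.mean(dim=-1).argsort(descending=True)  # sort for display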
2. The method of claim 1, wherein the constructing the user comment interactive feature of the multimedia information based on the user portraits of the viewing user and the interactive user of the multimedia information and each piece of interactive information published by the interactive user for the multimedia information comprises:
performing the following processing for each piece of interactive information:
concatenating the text in the user portrait of the interactive user with the text in the interactive information, and performing feature extraction on the concatenated text to obtain a text feature of the interactive information;
performing feature extraction on the text in the user portrait of the viewing user to obtain a user feature of the viewing user;
and fusing the text feature of the interactive information with the user feature of the viewing user to obtain the user comment interactive feature of the multimedia information.
3. The method of claim 2, wherein
the text feature of the interactive information and the user feature of the viewing user are extracted through a same language processing model; and
the fusing the text feature of the interactive information with the user feature of the viewing user to obtain the user comment interactive feature of the multimedia information comprises:
determining, through the language processing model, attention weights of the text feature of the interactive information and the user feature of the viewing user; and
performing weighted summation on the text feature of the interactive information and the user feature of the viewing user according to their respective attention weights to obtain the user comment interactive feature of the multimedia information.
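A minimal sketch of the attention-weighted fusion of claim 3. Using a learned linear scorer to produce the attention scores is an assumption of this sketch; the claim only requires that the language processing model determine the attention weights.

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        # Scores each input feature, turns the scores into attention weights via
        # softmax, and returns the attention-weighted sum of the two features.
        def __init__(self, dim: int):
            super().__init__()
            self.scorer = nn.Linear(dim, 1)

        def forward(self, text_feat: torch.Tensor, user_feat: torch.Tensor) -> torch.Tensor:
            stacked = torch.stack([text_feat, user_feat])         # (2, dim)
            weights = torch.softmax(self.scorer(stacked), dim=0)  # (2, 1) attention weights
            return (weights * stacked).sum(dim=0)                 # weighted sum -> (dim,)

    fused = AttentionFusion(128)(torch.randn(128), torch.randn(128))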
4. The method of claim 1, wherein the performing fusion processing based on the user comment interactive feature and the multi-modal features of the multimedia information to obtain the fusion feature comprises:
performing feature extraction on the multimedia information to obtain text features, audio features and video features of the multimedia information;
performing fusion processing on the text features, the audio features and the video features to obtain the multi-modal features of the multimedia information;
and concatenating the user comment interactive feature with the multi-modal features of the multimedia information to obtain the fusion feature.
5. The method of claim 4, wherein the performing feature extraction on the multimedia information to obtain the text features, the audio features and the video features of the multimedia information comprises:
extracting text information from the multimedia information, and performing feature extraction on the text information to obtain text features of the multimedia information, wherein the text information comprises at least one of the following: title, barrage, dialog text, type, label;
extracting a plurality of audio frames from the multimedia information, performing feature extraction on the plurality of audio frames to obtain a plurality of audio frame features which are in one-to-one correspondence with the plurality of audio frames, and performing fusion processing on the plurality of audio frame features to obtain audio features of the multimedia information;
extracting a plurality of video frames from the multimedia information, performing feature extraction on the plurality of video frames to obtain a plurality of video frame features which are in one-to-one correspondence with the plurality of video frames, and performing fusion processing on the plurality of video frame features to obtain the video features of the multimedia information.
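The multi-modal extraction and fusion of claims 4-5 can be sketched as follows. Mean pooling over frame features and concatenation across modalities are assumptions of this sketch, since the claims only require "fusion processing".

    import torch

    def multimodal_features(text_feat, audio_frame_feats, video_frame_feats):
        # Pool the per-frame audio/video features into one clip-level feature
        # each, then fuse the three modalities by concatenation.
        audio_feat = audio_frame_feats.mean(dim=0)  # (T_a, D) -> (D,)
        video_feat = video_frame_feats.mean(dim=0)  # (T_v, D) -> (D,)
        return torch.cat([text_feat, audio_feat, video_feat], dim=-1)

    mm = multimodal_features(torch.randn(128), torch.randn(50, 128), torch.randn(30, 128))
    fusion = torch.cat([torch.randn(128), mm])  # concat with a (placeholder) user comment interactive feature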
6. The method of claim 1, wherein the determining the interaction probability of the viewing user for each piece of interactive information based on the fusion feature comprises:
performing the following processing for each piece of interactive information:
mapping the fusion feature into interaction probabilities of the interactive information corresponding to different interaction types;
wherein the interaction types comprise at least one of the following: like, forward, reply, and dislike.
7. The method of claim 6, wherein the sorting the plurality of pieces of interactive information based on the interaction probabilities and displaying the plurality of pieces of interactive information based on the sorting result comprises:
performing weighted summation on the interaction probabilities of the different interaction types corresponding to each piece of interactive information to obtain a ranking score of each piece of interactive information;
and sorting the plurality of pieces of interactive information in descending or ascending order according to the ranking scores, and displaying at least some of the plurality of pieces of interactive information according to the sorting result.
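A sketch of the scoring rule of claims 6-7: the per-type probabilities of each comment are combined into a single ranking score by a weighted sum. The weights below are illustrative assumptions, not values given in this application.

    import torch

    # One probability per interaction type: (like, forward, reply, dislike).
    type_weights = torch.tensor([0.4, 0.2, 0.3, -0.1])  # a dislike may lower the score
    probs = torch.rand(10, 4)                           # 10 comments x 4 interaction types
    scores = probs @ type_weights                       # (10,) ranking scores
    display_order = scores.argsort(descending=True)     # descending display order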
8. The method of claim 1, further comprising:
performing feature extraction on the multimedia information to obtain multi-modal features of the multimedia information;
constructing a text feature of each piece of interactive information based on the user portrait of the interactive user and each piece of interactive information published by the interactive user for the multimedia information;
determining a degree of correlation between each piece of interactive information and the multimedia information based on the multi-modal features of the multimedia information and the text feature of each piece of interactive information;
wherein the sorting the plurality of pieces of interactive information based on the interaction probabilities comprises:
performing weighted summation on the interaction probability and the degree of correlation corresponding to each piece of interactive information to obtain a ranking score of each piece of interactive information;
and sorting the plurality of pieces of interactive information in descending or ascending order according to the ranking scores.
9. The method of claim 8, wherein the constructing the text feature of each piece of interactive information based on the user portrait of the interactive user and each piece of interactive information published by the interactive user for the multimedia information comprises:
constructing the user portrait of the interactive user according to the multimedia information in which the interactive user is interested;
performing the following processing for each piece of interactive information:
concatenating the text in the user portrait of the interactive user with the text in the interactive information, and performing feature extraction on the concatenated text to obtain the text feature of the interactive information.
10. The method of claim 8, wherein the determining the degree of correlation between each piece of interactive information and the multimedia information based on the multi-modal features of the multimedia information and the text feature of each piece of interactive information comprises:
performing the following processing for each piece of interactive information:
fusing the multi-modal features with the text feature of the interactive information to obtain a correlation fusion feature;
and mapping the correlation fusion feature into probabilities of belonging to respective different candidate degrees of correlation, and determining the candidate degree of correlation corresponding to the maximum probability as the degree of correlation between the interactive information and the multimedia information.
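A sketch of the claim-10 mapping: a classifier produces a probability for each candidate degree of correlation and the argmax degree is kept. The five candidate degrees below are assumed for illustration; the claim does not fix them.

    import torch
    import torch.nn as nn

    class RelevanceHead(nn.Module):
        # Maps the correlation fusion feature to probabilities over a fixed set
        # of candidate degrees and returns the degree with maximum probability.
        def __init__(self, dim: int, degrees=(0.0, 0.25, 0.5, 0.75, 1.0)):
            super().__init__()
            self.register_buffer("degrees", torch.tensor(degrees))
            self.cls = nn.Linear(dim, len(degrees))

        def forward(self, fusion_feat: torch.Tensor) -> torch.Tensor:
            probs = torch.softmax(self.cls(fusion_feat), dim=-1)  # P(each candidate degree)
            return self.degrees[probs.argmax(dim=-1)]             # degree with max probability

    head = RelevanceHead(dim=256)
    degree = head(torch.randn(256))  # e.g. tensor(0.5000)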
11. The method of claim 8, wherein the performing feature extraction on the multimedia information to obtain the multi-modal features of the multimedia information comprises:
extracting text information from the multimedia information, and performing feature extraction on the text information to obtain text features of the multimedia information, wherein the text information comprises at least one of the following: title, barrage, dialog text, type, label;
extracting a plurality of audio frames from the multimedia information, performing feature extraction on the plurality of audio frames to obtain a plurality of audio frame features which are in one-to-one correspondence with the plurality of audio frames, and performing fusion processing on the plurality of audio frame features to obtain audio features of the multimedia information;
extracting a plurality of video frames from the multimedia information, performing feature extraction on the plurality of video frames to obtain a plurality of video frame features which are in one-to-one correspondence with the plurality of video frames, and performing fusion processing on the plurality of video frame features to obtain the video features of the multimedia information;
and performing fusion processing on the text feature, the audio feature and the video feature to obtain multi-modal features of the multimedia information.
12. The method of claim 8, further comprising:
determining interaction polarity information of each piece of interactive information;
determining heat information of each piece of interactive information based on the degree of interaction with each piece of interactive information and the degree of interaction of the interactive user;
wherein the performing weighted summation on the interaction probability and the degree of correlation corresponding to each piece of interactive information to obtain the ranking score of each piece of interactive information comprises:
performing weighted summation on the interaction polarity information, the degree of correlation, the interaction probability and the heat information corresponding to each piece of interactive information to obtain the ranking score of each piece of interactive information.
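A sketch of the claim-12 ranking score as a weighted sum of the four signals per comment. The four weights, and the scaling of every signal to [0, 1], are assumptions of this sketch.

    import torch

    def ranking_score(polarity, relevance, probability, heat,
                      w=(0.2, 0.3, 0.4, 0.1)):
        # Weighted sum of interaction polarity, degree of correlation,
        # interaction probability and heat -- one score per comment.
        return w[0] * polarity + w[1] * relevance + w[2] * probability + w[3] * heat

    n = 10  # ten candidate comments
    scores = ranking_score(torch.rand(n), torch.rand(n), torch.rand(n), torch.rand(n))
    display_order = scores.argsort(descending=True)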
13. An interactive processing device for multimedia information, the device comprising:
an acquisition module, configured to acquire a plurality of pieces of interactive information for multimedia information;
a construction module, configured to construct a user comment interactive feature of the multimedia information based on user portraits of a viewing user and an interactive user of the multimedia information and each piece of interactive information published by the interactive user for the multimedia information;
a fusion module, configured to perform fusion processing based on the user comment interactive feature and multi-modal features of the multimedia information to obtain a fusion feature;
an interaction determination module, configured to determine, based on the fusion feature, an interaction probability of the viewing user for each piece of interactive information;
and a sorting module, configured to sort the plurality of pieces of interactive information based on the interaction probabilities and to display the plurality of pieces of interactive information based on a sorting result.
14. An electronic device, comprising:
a memory, configured to store computer-executable instructions;
a processor, configured to implement the interactive processing method for multimedia information according to any one of claims 1 to 12 when executing the computer-executable instructions stored in the memory.
15. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the interactive processing method for multimedia information according to any one of claims 1 to 12.
CN202110234973.1A 2021-03-03 2021-03-03 Interactive processing method and device for multimedia information, electronic equipment and storage medium Pending CN113010702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110234973.1A CN113010702A (en) 2021-03-03 2021-03-03 Interactive processing method and device for multimedia information, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113010702A true CN113010702A (en) 2021-06-22

Family

ID=76403750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110234973.1A Pending CN113010702A (en) 2021-03-03 2021-03-03 Interactive processing method and device for multimedia information, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113010702A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156390A (en) * 2014-07-07 2014-11-19 乐视网信息技术(北京)股份有限公司 Comment recommendation method and system
US20160328384A1 (en) * 2015-05-04 2016-11-10 Sri International Exploiting multi-modal affect and semantics to assess the persuasiveness of a video
CN108090206A (en) * 2017-12-28 2018-05-29 北京小米移动软件有限公司 Sort method and device, the electronic equipment of comment information
CN111818370A (en) * 2020-06-30 2020-10-23 腾讯科技(深圳)有限公司 Information recommendation method and device, electronic equipment and computer-readable storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254711A (en) * 2021-06-29 2021-08-13 腾讯科技(深圳)有限公司 Interactive image display method and device, computer equipment and storage medium
CN113741759A (en) * 2021-11-06 2021-12-03 腾讯科技(深圳)有限公司 Comment information display method and device, computer equipment and storage medium
CN114048387A (en) * 2021-11-23 2022-02-15 赵运柱 Content recommendation method based on big data and AI prediction and artificial intelligence cloud system
CN114357204A (en) * 2021-11-25 2022-04-15 腾讯科技(深圳)有限公司 Media information processing method and related equipment
CN114357204B (en) * 2021-11-25 2024-03-26 腾讯科技(深圳)有限公司 Media information processing method and related equipment
CN115424108A (en) * 2022-11-08 2022-12-02 四川大学 Cognitive dysfunction evaluation method based on audio-visual fusion perception
CN116975654A (en) * 2023-08-22 2023-10-31 腾讯科技(深圳)有限公司 Object interaction method, device, electronic equipment, storage medium and program product
CN116975654B (en) * 2023-08-22 2024-01-05 腾讯科技(深圳)有限公司 Object interaction method and device, electronic equipment and storage medium

Similar Documents

Publication Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN112632385B (en) Course recommendation method, course recommendation device, computer equipment and medium
CN111291266B (en) Artificial intelligence based recommendation method and device, electronic equipment and storage medium
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN113010702A (en) Interactive processing method and device for multimedia information, electronic equipment and storage medium
CN111046275B (en) User label determining method and device based on artificial intelligence and storage medium
CN111382361B (en) Information pushing method, device, storage medium and computer equipment
CN112256537B (en) Model running state display method and device, computer equipment and storage medium
CN113762052A (en) Video cover extraction method, device, equipment and computer readable storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111831924A (en) Content recommendation method, device, equipment and readable storage medium
CN113705299A (en) Video identification method and device and storage medium
CN111723295A (en) Content distribution method, device and storage medium
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN114201516A (en) User portrait construction method, information recommendation method and related device
CN111563158A (en) Text sorting method, sorting device, server and computer-readable storage medium
CN112165639B (en) Content distribution method, device, electronic equipment and storage medium
CN115129849A (en) Method and device for acquiring topic representation and computer readable storage medium
CN116484085A (en) Information delivery method, device, equipment, storage medium and program product
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN115130453A (en) Interactive information generation method and device
CN115840796A (en) Event integration method, device, equipment and computer readable storage medium
CN115878839A (en) Video recommendation method and device, computer equipment and computer program product
CN112257517A (en) Scenic spot recommendation system based on scenic spot clustering and group emotion recognition
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40047261
Country of ref document: HK