CN113392265A - Multimedia processing method, device and equipment

Info

Publication number: CN113392265A
Authority: CN (China)
Application number: CN202110167706.7A
Prior art keywords: search, multimedia file, word, target, multimedia
Original language: Chinese (zh)
Inventor: 陈小帅
Applicant/Assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/73: Querying
    • G06F 16/732: Query formulation
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867: Retrieval using manually generated information, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

The embodiments of this application disclose a multimedia processing method, apparatus and device, relating to the field of computer technology. The method comprises: obtaining a multimedia file set; using a multimedia file label prediction model to perform prediction processing on the file content of each multimedia file, obtaining an alternative word set for each multimedia file; when an intention retrieval request carrying a target search word is received, searching the alternative word sets of the N multimedia files for search prompt words matching the target search word; and outputting the search prompt words. Constructing the alternative word sets of the multimedia file set in advance expands the coverage of the alternative words. Because the user is prompted from this pre-constructed alternative word set, the user can search by directly selecting the prompted content without inputting complete video search information, which simplifies the video search process and improves video search efficiency.

Description

Multimedia processing method, device and equipment
Technical Field
The present invention relates to the field of computer technology, and in particular to a multimedia processing method, device, and equipment.
Background
With the development of computer technology, the number of videos stored on video platforms keeps increasing. When a user searches a video platform for a desired video, the search is mainly performed by entering video search information (such as a video title), and a search prompt word function can assist the user in entering this information. In practice, the current search prompt word function is implemented mainly from a user's historical search word records; when the user has few historical search word records, the user has to input complete video search information (such as a full video title), which makes retrieval costly for the user.
Disclosure of Invention
The embodiment of the invention provides a multimedia processing method, a multimedia processing device and multimedia processing equipment, which can simplify a video search process and improve video search efficiency.
In one aspect, an embodiment of the present application provides a multimedia processing method, including:
acquiring a multimedia file set, wherein the multimedia file set comprises N multimedia files, and N is a positive integer;
performing prediction processing on the file content of each multimedia file by using a multimedia file label prediction model to obtain an alternative word set for each multimedia file;
when an intention retrieval request is received, searching a search prompt word matched with a target search word in an alternative word set of N multimedia files, wherein the intention retrieval request carries the target search word;
and outputting the search prompt words.
In one aspect, an embodiment of the present application provides a multimedia processing apparatus, including:
an acquisition unit, configured to acquire a multimedia file set, where the multimedia file set includes N multimedia files and N is a positive integer;
a processing unit, configured to perform prediction processing on the file content of each multimedia file by using a multimedia file label prediction model to obtain an alternative word set for each multimedia file; to search, when an intention retrieval request is received, the alternative word sets of the N multimedia files for a search prompt word matching a target search word, the intention retrieval request carrying the target search word; and to output the search prompt word.
In one aspect, the present application provides a multimedia processing apparatus, comprising:
a processor for loading and executing a computer program;
a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the multimedia processing method described above.
In one aspect, the present application provides a computer-readable storage medium storing a computer program adapted to be loaded by a processor and to perform the above-mentioned multimedia processing method.
In one aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the multimedia processing method.
In the embodiments of this application, a multimedia file set is obtained; a multimedia file label prediction model performs prediction processing on the file content of each multimedia file to obtain an alternative word set for each multimedia file; when an intention retrieval request carrying a target search word is received, the alternative word sets of the N multimedia files are searched for search prompt words matching the target search word, and the search prompt words are output. Constructing the alternative word sets of the multimedia file set in advance expands the coverage of the alternative words so that they can cover all multimedia files. When a user intends to retrieve a multimedia file, the user is prompted from the pre-constructed alternative word sets and can search by directly selecting the prompted content without inputting complete video search information, which simplifies the video search process and improves video search efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1a is a scene architecture diagram of multimedia processing according to an embodiment of the present disclosure;
FIG. 1b is a main flowchart of multimedia processing according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a multimedia processing method according to an embodiment of the present application;
fig. 3a is a schematic flowchart of generating an alternative word set according to an embodiment of the present application;
fig. 3b is a flowchart illustrating a process of searching for a search hint word matching a target search word according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another multimedia processing method according to an embodiment of the present application;
FIG. 5a is an architecture diagram of a multimedia file tag prediction model according to an embodiment of the present application;
FIG. 5b is a block diagram of another multimedia file tag prediction model according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another multimedia processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a multimedia processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a multimedia processing device according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning (ML). AI is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
AI technology is a comprehensive discipline covering a wide range of fields and involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and then performs further image processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition. The embodiments of this application mainly use image recognition technology from computer vision to extract features from the data of the image modality of a multimedia file, and use OCR to recognize subtitle text in images.
ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures, so as to keep improving its own performance. ML is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications pervade all fields of artificial intelligence. ML and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction. The embodiments of this application mainly involve optimally training an initial encoding and decoding model with corpus sample data to obtain the encoding and decoding model, and optimally training an initial multimedia file label prediction model with a training data set to obtain the multimedia file label prediction model.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising modes of human-computer interaction. The embodiments of this application mainly use speech technology to extract features from the data of the audio modality of a multimedia file, and use ASR to convert audio data into corresponding text data.
Furthermore, the present application relates to Natural Language Processing (NLP). NLP is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like. The embodiments of this application mainly use NLP to extract text features from the data of the text modality of a multimedia file and to perform sentence-splitting processing.
Referring to fig. 1a, fig. 1a is a scene architecture diagram for multimedia processing according to an embodiment of the present disclosure. As shown in fig. 1a, the scene architecture includes a terminal device 101 and a server 102. The terminal device 101 is a device used by a user and may include, but is not limited to: smart phones (e.g., Android phones, iOS phones, etc.), tablet computers, portable personal computers, Mobile Internet Devices (MID), and the like. The terminal device is usually configured with a display device, which may be a display, a display screen, or a touch screen; a touch screen may in turn be a touch display or a touch panel, among others.
The server 102 refers to a background device capable of providing video (search) service for the terminal device 101; in one embodiment, the server 102 may be a background server of a video client in the terminal device 101. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal and the server may be directly or indirectly connected through wired communication or wireless communication, and the application is not limited herein.
It should be noted that, the number of the terminal devices and the servers in the multimedia processing scenario shown in fig. 1a is only an example, for example, the number of the terminal devices and the servers may be multiple, and the application does not limit the number of the terminal devices and the servers.
Fig. 1b is a schematic main flowchart of multimedia processing according to an embodiment of the present disclosure. As shown in fig. 1b, the multimedia processing flow mainly includes the following steps: (1) The server 102 pre-constructs alternative words for the multimedia files in the multimedia file library. First, a multimedia file set is obtained from a video library, where the video library may be stored in the server 102 or stored independently in a database; the multimedia file set includes N multimedia files, where N is a positive integer, and each multimedia file (e.g., a short video uploaded by a user, a film distributed by a movie company, etc.) includes data in multiple modalities (media forms) such as text, sound and images. Then, the server 102 uses a multimedia file label prediction model to perform prediction processing on the file content of each multimedia file, obtaining an alternative word set for each multimedia file. The file content of a multimedia file includes one or more of: data of an image modality (e.g., video frames), data of an audio modality (e.g., audio stream data), and data of a text modality (e.g., subtitles, captions, etc.). It should be noted that the data of the text modality may be carried in the file content of the multimedia file, or may be extracted from the data of the image modality or the audio modality, for example by converting audio stream data into corresponding text or by extracting subtitles from video frames. The multimedia file label prediction model includes a multimedia key information extraction model and a multimedia candidate word generation model: the multimedia key information extraction model (including at least one of a keyword extraction model and an encoding and decoding model) generates a first alternative word subset from the data of the text modality of the multimedia file; the multimedia candidate word generation model generates a second alternative word subset from the data of the P modalities of the multimedia file; and the alternative word set is obtained by combining the first and second alternative word subsets. (2) When the terminal device 101 detects a user's intention to retrieve, it sends an intention retrieval request carrying the target search word to the server 102. (3) The server 102 determines the candidate search prompt words based on the target search word input by the user: it searches the alternative word sets of the N multimedia files for search prompt words matching the target search word (e.g., by prefix query, fuzzy query, etc.); the server 102 then obtains feature information of the current user and determines a weight for each candidate search prompt word according to the feature information, where the feature information includes at least one of user portrait information and user historical behavior information; the candidate search prompt words are sorted in descending order of weight, and the search prompt words are determined from the sorting result. (4) The server 102 returns the search prompt words corresponding to the target search word to the terminal device 101, so that the terminal device 101 outputs (e.g., displays on a screen) the search prompt words.
In the embodiments of this application, the server obtains a multimedia file set; a multimedia file label prediction model performs prediction processing on the file content of each multimedia file to obtain an alternative word set for each multimedia file; when an intention retrieval request carrying a target search word is received, the alternative word sets of the N multimedia files are searched for search prompt words matching the target search word, and the search prompt words are output. Constructing the alternative word sets of the multimedia file set in advance expands the coverage of the alternative words so that they can cover all multimedia files. When a user intends to retrieve a multimedia file, the user is prompted from the pre-constructed alternative word sets and can search by directly selecting the prompted content without inputting complete video search information, which simplifies the video search process and improves video search efficiency.
The multimedia processing scheme provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a multimedia processing method according to an embodiment of the present disclosure. The multimedia processing scheme may be performed by the server 102 described above; the scheme comprises steps S201-S204, wherein:
s201, the server acquires a multimedia file set.
The multimedia file set comprises N multimedia files, where N is a positive integer. The multimedia file set may be obtained by the server from a platform video library, or may consist of files uploaded by users within a certain time period; for example, multimedia file set 1 consists of the videos that users uploaded to a video platform within one minute.
S202, the server adopts a multimedia file label prediction model to respectively perform prediction processing on the file content of each multimedia file to obtain an alternative word set of each multimedia file.
The file content of the multimedia file comprises data of at least two modes of a text mode, an image mode and an audio mode; for example, a multimedia file may be a short video uploaded by a user, the short video including a title (i.e., data of a text modality of the multimedia file), a video frame (i.e., data of an image modality of the multimedia file), and a dubbing (i.e., data of an audio modality of the multimedia file). It should be noted that the data of the text modality of the multimedia file may be carried in the file content of the multimedia file (such as a title, a content summary, etc.), or may be extracted from the data of the image modality or the data of the audio modality by the server through a text recognition model; for example, converting audio stream data into data of a corresponding text modality by ASR, extracting subtitles in video frames, and the like.
The multimedia file label prediction model extracts features from each multimedia file in the multimedia file set and predicts the alternative word set of that multimedia file from its features. Specifically, the multimedia file label prediction model extracts features from the data of the P modalities of a target multimedia file (any multimedia file in the multimedia file set) through P feature extraction modules, obtaining P sub-features of the target multimedia file, where P is an integer greater than 1; the P sub-features are fused to obtain the fusion feature of the target multimedia file, and the alternative word set of the target multimedia file is predicted from the fusion feature. For example, suppose the file content of multimedia file 1 includes data of a text modality, data of an audio modality and data of an image modality. The model extracts features from the data of the text modality to obtain the text feature of multimedia file 1; similarly, it extracts features from the data of the audio modality to obtain the audio feature, and from the data of the image modality to obtain the image feature. The text feature, audio feature and image feature are then fused to obtain the fusion feature of multimedia file 1, and the alternative word set of multimedia file 1 is predicted from this fusion feature.
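As an illustration of the fusion step described above, the following is a minimal sketch in Python, assuming PyTorch; the layer sizes and the projection-then-concatenate design are illustrative assumptions, since the embodiment does not fix them:

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Fuse P per-modality sub-features into one fusion feature.

    Dimensions are illustrative assumptions; the patent does not fix them.
    """
    def __init__(self, modality_dims, fused_dim=512):
        super().__init__()
        # One linear projection per modality so all sub-features share a size.
        self.projections = nn.ModuleList(
            [nn.Linear(d, fused_dim) for d in modality_dims]
        )
        # Fully connected layer that merges the concatenated sub-features.
        self.fc = nn.Linear(fused_dim * len(modality_dims), fused_dim)

    def forward(self, sub_features):
        projected = [proj(f) for proj, f in zip(self.projections, sub_features)]
        return torch.relu(self.fc(torch.cat(projected, dim=-1)))

# Example: text (768-d), image (2048-d, e.g. Inception), audio (128-d, e.g. VGGish).
fusion = ModalityFusion([768, 2048, 128])
text_f, image_f, audio_f = torch.randn(1, 768), torch.randn(1, 2048), torch.randn(1, 128)
fused = fusion([text_f, image_f, audio_f])  # shape (1, 512)
```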
It can be understood that the set of candidate words of the target multimedia file is obtained based on the feature prediction of the data of the target multimedia file in P modalities, and is more comprehensive and accurate than the set of candidate words based on the feature prediction of the data of a single modality.
The candidate word set of the target multimedia file comprises at least one candidate word, and the at least one candidate word is obtained by predicting the file content of the target multimedia file based on a multimedia file label prediction model; for example, the alternative words are main events, main characters, content summaries, etc. of the video.
Fig. 3a is a schematic flowchart of generating an alternative word set according to an embodiment of the present application. As shown in fig. 3a, the multimedia file label prediction model includes a multimedia key information extraction model and a multimedia candidate generation model; the multimedia key information extraction model (comprising at least one of a keyword extraction model and a coding and decoding model) is used for generating a first alternative word subset according to the data of the text mode of the multimedia file; the multimedia candidate word generation model is used for generating a second candidate word subset according to P modal data of the multimedia file, and obtaining a candidate word set through combination of the first candidate word subset and the second candidate word subset.
S203, when the intention retrieval request is received, the server searches a search prompt word matched with the target search word in the alternative word set of the N multimedia files.
The intention retrieval request is sent to the server after the terminal device obtains the target search word input by the user, and it carries the target search word. Search prompt words that match the target search word include search prompt words that formally match it (e.g., for the target search word "new crown", a formally matching search prompt word is "novel coronavirus") and search prompt words that are semantically related to it (e.g., for "new crown", a semantically related search prompt word is "pneumonia"). In particular embodiments, the server may find the search prompt words matching the target search word through one or more of an index table, a prefix query, and a fuzzy query.
Fig. 3b is a flowchart of searching for a search prompt word matching a target search word according to an embodiment of the present application. As shown in fig. 3b: S301, when the terminal device detects that the user intends to search for a multimedia file, it obtains the target search word input by the user and sends an intention retrieval request carrying the target search word to the server; S302, after receiving the intention retrieval request, the server determines the candidate search prompt words based on the target search word input by the user; S303, the sorted search prompt words are output based on the user feature information and the target search word information.
In one embodiment, the server obtains a search vocabulary set (which may be constructed from vocabulary related to multimedia, or from historical search words) and builds a prompt index from the association relations between each search word in the search vocabulary set and all the alternative words in the N alternative word sets. Specifically, for each search word, the server builds a prompt index entry from the association relations between that word and all the alternative words in the N alternative word sets, and then determines the candidate search prompt word set from the target search word and the prompt index. Alternative words associated with a search word include alternative words formally associated with it (e.g., "new crown prevention" is formally associated with "new crown"); alternative words semantically associated with it (e.g., "pneumonia" is semantically associated with "new crown"); and alternative words indirectly associated with it (for example, "new crown" is associated with "new crown prevention", and "new crown prevention" and "epidemic situation" were split from the same sentence, so "new crown" is also associated with "epidemic situation"). Table 1 is a prompt index table provided in the embodiment of the present application:
TABLE 1

Search word      Associated alternative words
new crown        novel coronavirus, pneumonia, the number of new crown cases added today, new crown prevention, epidemic situation, ...
As can be seen from Table 1, when the target search word is "new crown", the server determines from "new crown" and Table 1 that the candidate search prompt word set is: novel coronavirus, pneumonia, the number of new crown cases added today, new crown prevention, epidemic situation, ....
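For illustration, a minimal sketch of how such a prompt index could be built and queried follows; the `related` predicate, the toy data and the prefix-matching strategy are assumptions standing in for the formal, semantic and indirect association tests described above:

```python
from collections import defaultdict

def build_hint_index(vocabulary, candidate_sets, related):
    """Build a prompt index: search word -> associated alternative words.

    `related(word, cand)` is a hypothetical association test standing in
    for the formal / semantic / indirect relations described above.
    """
    index = defaultdict(set)
    for candidates in candidate_sets:       # one alternative word set per file
        for cand in candidates:
            for word in vocabulary:
                if related(word, cand):
                    index[word].add(cand)
    return index

def lookup(index, target):
    """Return candidate prompt words for a target search word, here by the
    simplest strategy (exact key plus prefix match over index keys)."""
    hits = set(index.get(target, ()))
    for word, cands in index.items():
        if word.startswith(target):         # prefix query
            hits.update(cands)
    return hits

# Toy usage with stand-in data:
idx = build_hint_index(
    vocabulary=["new crown"],
    candidate_sets=[{"novel coronavirus", "pneumonia", "new crown prevention"}],
    related=lambda w, c: True,              # assume everything is related
)
print(lookup(idx, "new crown"))
```

In practice the index would be built offline over the N alternative word sets and served with a data structure suited to prefix queries, such as a trie.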
Further, the server obtains feature information of the current user and determines a weight for each candidate search prompt word in the candidate search prompt word set according to the feature information, where the feature information includes at least one of: user portrait information (used to mark the characteristics of the user) and user historical behavior information. For example, suppose the user portrait information indicates that the current user is a "student", that the order of attention for students is: new crown prevention > epidemic situation > pneumonia > novel coronavirus, and that the order of attention for doctors is: novel coronavirus > epidemic situation > pneumonia > new crown prevention; then, according to the students' order of attention, the weight of "new crown prevention" is set to 0.9, that of "epidemic situation" to 0.7, that of "pneumonia" to 0.6, and that of "novel coronavirus" to 0.5. As another example, if the user has searched "epidemic situation" 8 times, "new crown prevention" 10 times, "novel coronavirus" 2 times and "pneumonia" 0 times, then, according to the user's historical behavior information, the weight of "new crown prevention" is set to 0.8, that of "epidemic situation" to 0.7, that of "pneumonia" to 0.2, and that of "novel coronavirus" to 0.3. The server sorts the candidate search prompt words in descending order of weight and determines the search prompt words from the sorting result (for example, the top Q candidate search prompt words are determined as the search prompt words, where Q is a positive integer; or the candidate search prompt words whose weights are greater than a weight threshold are determined as the search prompt words).
Optionally, the server may further determine the weight of a candidate search prompt word according to its degree of association with the target search word. For example, suppose the target search word is "new crown prevention", candidate search prompt word 1 is "novel coronavirus prevention", and candidate search prompt word 2 is "pneumonia prevention"; since "new crown prevention" is contained in "novel coronavirus prevention", the degree of association between "novel coronavirus prevention" and "new crown prevention" is higher than that between "pneumonia prevention" and "new crown prevention".
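A minimal sketch of the weighting and sorting step, assuming a simple blend of portrait-based weights and normalized historical search counts (the blend and the numbers are illustrative, not prescribed by the embodiment):

```python
def rank_hints(candidates, profile_weights, history_counts, top_q=5):
    """Weight each candidate search prompt word and keep the top-Q.

    `profile_weights` (from user portrait information) and
    `history_counts` (from historical behavior) are assumed inputs;
    the 0.5/0.5 blend below is an illustrative choice.
    """
    def weight(c):
        profile = profile_weights.get(c, 0.0)
        history = history_counts.get(c, 0) / max(sum(history_counts.values()), 1)
        return 0.5 * profile + 0.5 * history

    # Sort in descending order of weight and keep the top Q.
    return sorted(candidates, key=weight, reverse=True)[:top_q]

hints = rank_hints(
    ["new crown prevention", "epidemic situation", "pneumonia", "novel coronavirus"],
    profile_weights={"new crown prevention": 0.9, "epidemic situation": 0.7,
                     "pneumonia": 0.6, "novel coronavirus": 0.5},
    history_counts={"new crown prevention": 10, "epidemic situation": 8,
                    "novel coronavirus": 2},
    top_q=3,
)
print(hints)
```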
And S204, the server outputs the search prompt words.
In one embodiment, the server sends the search prompt words matched with the target search words to the terminal equipment.
In the embodiments of this application, the server obtains a multimedia file set; a multimedia file label prediction model performs prediction processing on the file content of each multimedia file to obtain an alternative word set for each multimedia file; when an intention retrieval request carrying a target search word is received, the alternative word sets of the N multimedia files are searched for search prompt words matching the target search word, and the search prompt words are output. Constructing the alternative word sets of the multimedia file set in advance expands the coverage of the alternative words so that they can cover all multimedia files. When a user intends to retrieve a multimedia file, the user is prompted from the pre-constructed alternative word sets and can search by directly selecting the prompted content without inputting complete video search information, which simplifies the video search process and improves video search efficiency.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating another multimedia processing method according to an embodiment of the present disclosure. The multimedia processing scheme may be performed by the server 102 described above; the scheme comprises steps S401-S405, wherein:
s401, the server obtains a multimedia file set.
The specific implementation of step S401 can refer to the implementation of step S201 in fig. 2, and is not described herein again.
S402, the server obtains the fusion characteristics of the target multimedia file.
The fusion feature is obtained by separately extracting features from the data of at least two modalities of the target multimedia file and then fusing them; the file content of the target multimedia file includes data of at least two of a text modality, an image modality and an audio modality. Correspondingly, the multimedia file label prediction model includes at least two of an image feature extraction module, an audio feature extraction module and a text feature extraction module. The multimedia file label prediction model also includes a fully connected layer, which fuses the features of the data of the P modalities extracted by the P feature extraction modules to obtain the fusion feature.
The following describes how the alternative word set of a target multimedia file is determined, taking one multimedia file (called the target multimedia file) as an example:
Fig. 5a is an architecture diagram of a multimedia file tag prediction model according to an embodiment of the present application. As shown in fig. 5a, the multimedia file tag prediction model extracts the image frame sequence of the target multimedia file from the data of the image modality and extracts the image feature of the target multimedia file through the image feature extraction module: specifically, a single-frame representation is first constructed for each image frame (for example, based on the feature extraction network Inception), and multi-frame temporal feature fusion is then performed with a Transformer-based coding model to obtain the image content representation (i.e., the image feature). Similarly, the model extracts the audio frame sequence of the target multimedia file from the data of the audio modality and extracts the audio feature through the audio feature extraction module: a single-frame representation is first constructed for each audio frame (for example, based on the feature extraction model VGGish), and multi-frame temporal feature fusion is then performed with a Transformer-based coding model to obtain the audio content representation (i.e., the audio feature). The model also obtains the data of the text modality of the target multimedia file (such as the title, dialogue, etc.), performs word segmentation and vectorization on it, and performs vector splicing through a Transformer-based coding model to obtain the text content representation (i.e., the text feature). The encoded features are then fused through a fully connected layer, so that the resulting fusion feature (the initial decoding state) has the same dimensionality as the Transformer-based decoding model.
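The encoder side of fig. 5a might be sketched as follows. This is a non-authoritative sketch assuming PyTorch; Inception/VGGish frame features are assumed to be pre-extracted, and all dimensions, layer counts and the mean-pooling are illustrative choices:

```python
import torch
import torch.nn as nn

class TagPredictorBackbone(nn.Module):
    """Sketch of the encoder side of Fig. 5a (assumed dimensions).

    Frame and audio features are assumed pre-extracted by Inception /
    VGGish, so the inputs here are sequences of per-frame vectors.
    """
    def __init__(self, img_dim=2048, aud_dim=128, vocab=30000, d_model=512):
        super().__init__()
        def encoder():  # small Transformer-based coding model
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.img_proj, self.img_enc = nn.Linear(img_dim, d_model), encoder()
        self.aud_proj, self.aud_enc = nn.Linear(aud_dim, d_model), encoder()
        self.txt_emb, self.txt_enc = nn.Embedding(vocab, d_model), encoder()
        self.fuse = nn.Linear(3 * d_model, d_model)  # fully connected layer

    def forward(self, frames, audio, tokens):
        img = self.img_enc(self.img_proj(frames)).mean(1)   # image content repr.
        aud = self.aud_enc(self.aud_proj(audio)).mean(1)    # audio content repr.
        txt_seq = self.txt_enc(self.txt_emb(tokens))        # per-word vectors
        txt = txt_seq.mean(1)                               # text content repr.
        fused = self.fuse(torch.cat([img, aud, txt], dim=-1))
        return fused, txt_seq   # fused feature = initial decoding state

model = TagPredictorBackbone()
fused, txt_seq = model(torch.randn(1, 16, 2048),           # 16 video frames
                       torch.randn(1, 32, 128),            # 32 audio frames
                       torch.randint(0, 30000, (1, 20)))   # 20 text tokens
```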
S403, decoding the fusion features by the server to obtain one or more prediction retrieval phrases; and splicing one or more prediction retrieval phrases into a prediction retrieval text, and adding the prediction retrieval text into an alternative word set of the target multimedia file.
In one embodiment, the one or more predicted retrieval phrases include a first retrieval phrase and a second retrieval phrase. The server decodes the fusion feature to obtain the first retrieval phrase, and then decodes the first retrieval phrase together with the fusion feature to obtain the second retrieval phrase. The server decodes the fusion feature into the first retrieval phrase as follows. Suppose the data of the text modality of the target multimedia file contains P text phrases, where P is a positive integer. The server obtains the text feature of the data of the text modality, and from it the feature vectors of the P text phrases it contains. Based on an attention mechanism, an attention coefficient is determined between the fusion feature and the feature vector of each text phrase (for example, by computing the inner product of the fusion feature and each feature vector to obtain the attention coefficient of that feature vector, or by determining the attention coefficients through the K, Q and V matrices of the attention mechanism; the K matrix and V matrix generated while producing the P feature vectors may be the same). The feature vectors of the P text phrases are weighted and summed with their attention coefficients to obtain the feature to be decoded, which is decoded by the Transformer decoding model into probability values for K label words. The label word or text phrase corresponding to the maximum value among the K probability values and the P attention coefficients is selected as the first retrieval phrase.
The above decoding process is iterative: the input at each time step depends on the previously decoded retrieval phrases.
In another embodiment, the fusion feature is predicted on by the Transformer decoding model to obtain one or more predicted retrieval phrases, which are spliced into a predicted retrieval text that is added to the alternative word set of the target multimedia file. For example, if the predicted retrieval phrases are "today", "new crown", "newly added" and "number of people", the spliced predicted retrieval text is "the number of newly added new crown cases today".
Fig. 5b is an architecture diagram of another multimedia file tag prediction model according to an embodiment of the present application. As shown in fig. 5b, building on the multimedia file tag prediction model shown in fig. 5a, the text feature output by the Transformer-based coding model is a feature matrix in which each row represents a word vector; if the feature matrix has n rows, then n feature vector representations of word 1 to word n can be obtained from it (that is, the data of the text modality of the target multimedia file contains n phrases).
The Transformer decoding model involves an attention mechanism, which determines the contribution of the feature vector of each phrase to the current decoding position, so that different decoding positions can focus on the feature vectors of different phrases. Specifically, when decoding the i-th position, the attention mechanism determines the attention coefficients of the fusion feature (which can be regarded as the input feature of the 0-th decoding position) with respect to the feature vector of each phrase, the attention coefficients of the word vector of the predicted first retrieval phrase y1 with respect to the feature vector of each phrase, those of the predicted second retrieval phrase y2, and so on up to those of the predicted (i-1)-th retrieval phrase y(i-1). The feature to be decoded at the i-th position is then determined by a weighted sum of these attention coefficients and the feature vectors of the n phrases. The feature to be decoded computed this way not only contains context information but also reflects the degree to which each phrase's feature vector contributes to the phrase predicted at the i-th position. The Transformer decoding model decodes the feature to be decoded at the i-th position into probability values for the K label words, which can be regarded as the prediction result of a pure Transformer decoding model. The probability values of the K label words are compared with all the attention coefficients, the maximum value is selected, and the label word corresponding to that maximum value, or the phrase corresponding to that feature vector, is taken as the predicted word at the i-th position (i.e., the predicted i-th retrieval phrase yi).
This is because, if the attention coefficient of the feature vector of a certain phrase is particularly large, the phrase corresponding to that feature vector has a high probability of being the correct decoding result, so the phrase can be used directly as the predicted word.
Of course, besides comparing the probability values of the tag words with the attention coefficients, the tag word corresponding to the maximum probability value in the probability values of the K tag words can be directly used as the predicted word at the ith position.
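The selection step described above resembles a pointer/copy mechanism; a minimal sketch under assumed shapes (the inner-product attention and the softmax normalization are illustrative simplifications):

```python
import torch

def select_next_word(decoder_state, phrase_vecs, label_logits, phrases, labels):
    """Pick the next predicted retrieval phrase.

    Compares the K label-word probabilities from the decoder with the P
    attention coefficients over source text phrases and takes the overall
    maximum, as described above (a pointer/copy-style choice).
    """
    # Attention coefficients: inner product of the decoder state with each
    # phrase feature vector, normalized to sum to one.
    attn = torch.softmax(phrase_vecs @ decoder_state, dim=0)      # (P,)
    label_probs = torch.softmax(label_logits, dim=0)              # (K,)
    if attn.max() > label_probs.max():
        return phrases[int(attn.argmax())]    # copy a source text phrase
    return labels[int(label_probs.argmax())]  # emit a label word

word = select_next_word(
    decoder_state=torch.randn(512),
    phrase_vecs=torch.randn(4, 512),
    label_logits=torch.randn(3),
    phrases=["today", "new crown", "newly added", "number of people"],
    labels=["epidemic situation", "new crown prevention", "pneumonia"],
)
print(word)
```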
Therefore, the Attention module can be used to screen the important words in the important word set and the label words in the label word set, so that the keywords in the alternative word set of the target multimedia file conform better to the file content of the multimedia file.
In another embodiment, the multimedia file label prediction model is obtained by optimally training an initial multimedia file label prediction model with a training data set; the training data set is constructed by the server from user search records and includes training videos and the search words associated with each training video (for example, the search word associated with television series XX is actor YY). Specifically, the server extracts a first feature from the data of the image modality of a training video through the image feature extraction module of the initial multimedia file label prediction model; extracts a second feature from the data of the audio modality through the audio feature extraction module; and extracts a third feature from the data of the text modality through the text feature extraction module, obtaining a word vector set from the feature matrix corresponding to the third feature. The fully connected layer of the initial model fuses the first, second and third features to obtain a target fusion feature. Subsequently, the prediction layer of the initial model predicts on the target fusion feature to obtain a group of predicted retrieval words, each carrying a probability value; label words are determined from the probability value of each word vector and the probability value of each predicted retrieval word, and the label words are added to the predicted word set. A loss function then computes the loss between the predicted word set and the search words associated with the training video, the parameters of the initial model are adjusted according to the loss, and the multimedia file label prediction model is obtained from the adjusted parameters.
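A minimal sketch of this optimization loop, with a stand-in model and random stand-in data; in the real setting the inputs would be multimodal training-video features and the targets the associated search words:

```python
import torch
import torch.nn as nn

K = 1000                                   # assumed label vocabulary size
model = nn.Sequential(nn.Linear(512, K))   # stand-in for the tag predictor
criterion = nn.CrossEntropyLoss()          # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    fused = torch.randn(8, 512)            # stand-in target fusion features
    target = torch.randint(0, K, (8,))     # stand-in associated search words
    loss = criterion(model(fused), target)
    optimizer.zero_grad()
    loss.backward()                        # loss drives parameter adjustment
    optimizer.step()
```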
S404, when the intention retrieval request is received, the server searches the search prompt words matched with the target search words in the alternative word set of the N multimedia files.
S405, the server outputs search prompt words.
The specific implementation of step S404 and step S405 can refer to the implementation of step S203 and step S204 in fig. 2, and will not be described herein again.
In the embodiments of this application, building on the embodiment of fig. 2, the features of the data of the P modalities of the target multimedia file are extracted by the multimedia file tag prediction model, and the alternative word set is predicted from the features of the P modalities of data, making it more comprehensive and accurate than an alternative word set predicted from single-modality data. In addition, an Attention module is used to screen the important words in the important word set and the label words in the label word set, so that the keywords in the third keyword set conform better to the file content of the multimedia file, improving the prediction accuracy of the multimedia file label prediction model.
Referring to fig. 6, fig. 6 is a schematic flowchart illustrating another multimedia processing method according to an embodiment of the present disclosure. The multimedia processing scheme may be performed by the server 102 described above; the scheme comprises steps S601-S606, wherein:
s601, the server acquires a multimedia file set.
The specific implementation of step S601 may refer to the implementation of step S201 in fig. 2, and is not described herein again.
S602, the server extracts keywords from the data of the text mode of the target multimedia file to obtain a first keyword set.
In one embodiment, the file content of the target multimedia file includes data of a text modality. The server acquires the data of the text modality of the target multimedia file, which includes one or more of the following: text carried in the file content itself (e.g., the title or content summary of the target multimedia file), text recognized from the data of the image modality by an image recognition technique such as OCR (e.g., subtitles), and text recognized from the data of the audio modality by an audio recognition technique such as ASR (e.g., character dialogue).
First, the server splits the acquired data of the text modality of the target multimedia file (for example, splitting sentences with a maximum matching algorithm) to obtain a vocabulary set (that is, the words in the vocabulary set are obtained by splitting the titles, subtitles, dialogue and so on in the data of the text modality). Then, the server builds the nodes of a relational graph from the vocabulary set (each node corresponds to one word in the vocabulary set) and builds the connecting edges between nodes from the association relations among the words, where two words A and B are associated if they are split from the same sentence (e.g., "novel coronavirus" and "pneumonia" are both split from "the novel coronavirus causes pneumonia"), or if the separation distance between them (such as the number of words between them, or the time interval at which they appear in the video/audio) is smaller than a separation threshold. Next, the server iterates over the relational graph (for example, by constructing a similarity matrix and iterating it until convergence) to obtain the weight of each node (i.e., of each word in the vocabulary set). Finally, the words whose node weights are greater than a threshold are added to the first keyword set.
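This graph-based extraction is essentially a TextRank-style procedure. A minimal sketch follows, assuming networkx for the graph iteration and pre-split sentences; the edge rule shown covers only the same-sentence relation, and the score threshold is illustrative:

```python
import itertools
import networkx as nx  # assumed dependency for the graph iteration

def extract_keywords(sentences, threshold=0.15):
    """TextRank-style sketch of the graph-based extraction above.

    `sentences` is a list of already-split word lists; the word splitting
    (e.g. maximum matching) is assumed to have happened upstream.
    """
    graph = nx.Graph()
    for words in sentences:
        graph.add_nodes_from(words)
        # Words split from the same sentence get a connecting edge.
        graph.add_edges_from(itertools.combinations(set(words), 2))
    weights = nx.pagerank(graph)   # iterate until convergence
    return {w for w, score in weights.items() if score > threshold}

keywords = extract_keywords([
    ["novel", "coronavirus", "causes", "pneumonia"],
    ["coronavirus", "epidemic", "prevention"],
])
print(keywords)
```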
S603, the server carries out coding and decoding processing on the data of the text mode of the target multimedia file to obtain a second keyword set.
In one embodiment, the file content of the target multimedia file includes data of a textual modality. The server performs data conversion on the data in the text mode to obtain a word vector sequence of the data in the text mode, namely the data in the text mode is represented by the word vector sequence. After the word vector sequence is obtained, coding the word vector sequence based on the incidence relation of each word vector in the word vector sequence to obtain a hidden feature sequence; the sequence of hidden features may specifically be represented by a feature matrix. And after the hidden feature sequence is obtained, decoding the hidden feature sequence to obtain a second keyword set.
The encoding and decoding can be implemented by an encoding and decoding model, i.e., a model with an encoder-decoder structure; the encoding structure or decoding structure may be a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Gated Recurrent Unit (GRU), or a Long Short-Term Memory model (LSTM). The encoding and decoding model may specifically be a Transformer, which consists of an Encoder model and a Decoder model.
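A minimal sketch of such an encoding and decoding model, assuming PyTorch's nn.Transformer; vocabulary sizes, dimensions and layer counts are illustrative assumptions, and the causal decoder mask needed for real autoregressive decoding is omitted for brevity:

```python
import torch
import torch.nn as nn

class TextToTags(nn.Module):
    """Encoder-decoder sketch: word-vector sequence in, tag logits out."""
    def __init__(self, src_vocab=30000, tag_vocab=5000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tag_emb = nn.Embedding(tag_vocab, d_model)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tag_vocab)

    def forward(self, src_tokens, tag_tokens):
        hidden = self.transformer(self.src_emb(src_tokens),
                                  self.tag_emb(tag_tokens))
        return self.out(hidden)   # per-position tag logits

model = TextToTags()
logits = model(torch.randint(0, 30000, (1, 40)),   # text of the video
               torch.randint(0, 5000, (1, 6)))     # tag sequence so far
```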
The codec model is a model for converting text data of a multimedia file into a tag associated with the file content of the multimedia file.
In one embodiment, the encoding and decoding model is obtained by optimally training an initial encoding and decoding model with corpus sample data, where the corpus sample data includes the text content of annotated videos and the tag set annotated for each video. For example, the text content of video 1 is "Luo XX joked about his 6-billion debt on a talk show; his effort to repay the debt looks genuine", and the annotated tag set of video 1 is: "talk show", "Luo XX's debt", "funny", "variety". Specifically, the tags in a tag set are fine-grained semantic units of the video, such as "California epidemic", "talk show", "Luo XX's debt", and so on; they can be regarded as an intuitive reflection of the video content, and a user can search for videos through them, so the tags in the tag set are used as the training target of the encoding and decoding model. Taking a Transformer as the encoding and decoding model, for example, the text content of an annotated video is input into the Encoder model of the Transformer, a predicted tag set is generated by the Decoder model, the loss between the predicted tag set and the annotated tag set is computed by a loss function, and the parameters of the model are adjusted according to the loss, so that the model acquires the ability to take a video text as input and output a video tag sequence.
It should be noted that both the second keyword set determined here and the alternative word set determined from multimodal data involve Transformers, but the inputs and outputs of the two Transformers are completely different; that is, their model parameters are not shared, and the two models are merely similar in structure.
S604, the server performs fusion processing on the first keyword set and the second keyword set to obtain an alternative word set of the target multimedia file.
In an optional embodiment, the server combines the first keyword set and the second keyword set into the alternative word set and removes duplicate alternative words. In another embodiment, the server combines the first keyword set and the second keyword set into a candidate keyword set, counts the number of repetitions of each keyword in the candidate keyword set, and adds the keywords whose repetition count is greater than a repetition threshold to the alternative word set, obtaining the alternative word set of the target multimedia file. It can be understood that adding keywords that appear multiple times to the alternative word set makes the alternative words conform better to the file content of the multimedia file.
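Both fusion variants reduce to a small set operation; a minimal sketch (the helper name and threshold semantics are assumptions):

```python
from collections import Counter

def merge_keyword_sets(keyword_sets, min_repeats=None):
    """Combine keyword sets into one alternative word set.

    With `min_repeats=None` this is the first variant (union with
    duplicates removed); otherwise only keywords whose repetition count
    across the sets exceeds the threshold are kept.
    """
    counts = Counter(w for ks in keyword_sets for w in ks)
    if min_repeats is None:
        return set(counts)
    return {w for w, n in counts.items() if n > min_repeats}

first = {"new crown", "pneumonia", "epidemic situation"}
second = {"new crown", "epidemic situation", "vaccine"}
print(merge_keyword_sets([first, second]))                 # union, deduplicated
print(merge_keyword_sets([first, second], min_repeats=1))  # repeated keywords only
```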
S605, when receiving the intention retrieval request, the server searches the search prompt words matched with the target search words in the alternative word set of the N multimedia files.
And S606, the server outputs the search prompt words.
For specific implementation of step S605 and step S606, reference may be made to the implementation of step S203 and step S204 in fig. 2, which is not described herein again.
In an optional implementation, the alternative word set of the target multimedia file obtained in step S403 is taken as a third keyword set; the first, second and third keyword sets are combined into the alternative word set, and duplicate alternative words are removed to obtain the alternative word set of the target multimedia file. In another embodiment, the server combines the first, second and third keyword sets into a candidate keyword set, counts the number of repetitions of each keyword in the candidate keyword set, and adds the keywords whose repetition count is greater than a repetition threshold to the alternative word set, obtaining the alternative word set of the target multimedia file. As above, adding keywords that appear multiple times to the alternative word set makes the alternative words conform better to the file content of the multimedia file.
In the embodiment of the present application, the target multimedia file is processed in multiple ways to obtain multiple keyword sets, and these keyword sets are combined into the alternative word set, which improves the coverage of the alternative word set over the file content of the target multimedia file.
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a multimedia processing apparatus according to an embodiment of the present application. The multimedia processing apparatus 700 includes an obtaining unit 701 and a processing unit 702, and may be deployed on the server 102 shown in fig. 1a. The multimedia processing apparatus shown in fig. 7 may be used to perform some or all of the functions in the method embodiments described above with reference to fig. 2, fig. 4, and fig. 6. The details of each unit are as follows:
an obtaining unit 701, configured to obtain a multimedia file set, where the multimedia file set includes N multimedia files, and N is a positive integer;
a processing unit 702, configured to perform prediction processing on the file content of each multimedia file by using a multimedia file tag prediction model to obtain an alternative word set of each multimedia file; further configured to, when an intention retrieval request is received, search the alternative word sets of the N multimedia files for a search prompt word matching a target search word, where the intention retrieval request carries the target search word; and further configured to output the search prompt word.
In one embodiment, the target multimedia file is one of the N multimedia files, the file content of the target multimedia file includes data of P modalities, and P is a positive integer; the multimedia file tag prediction model includes a keyword extraction model, an encoding and decoding model, and a multi-modal model. In the process of performing prediction processing on the file content of the target multimedia file by using the multimedia file tag prediction model to obtain the alternative word set of the target multimedia file, the processing unit 702 is specifically configured to:
acquire a fusion feature of the target multimedia file, where the fusion feature is obtained by separately extracting features from data of at least two modalities of the target multimedia file and fusing them, and the file content of the target multimedia file includes data of at least two of a text modality, an image modality, and an audio modality;
decode the fusion feature to obtain one or more predicted retrieval phrases;
splice the one or more predicted retrieval phrases into a predicted retrieval text, and add the predicted retrieval text to the alternative word set of the target multimedia file.
In another embodiment, the processing unit 702 is specifically configured to:
process the data of the text modality of the target multimedia file with the keyword extraction model to obtain a first keyword set, where the text modality is one of the P modalities;
process the data of the text modality of the target multimedia file with the encoding and decoding model to obtain a second keyword set;
process the data of the P modalities of the target multimedia file with the multi-modal model to obtain a third keyword set;
combine the first keyword set, the second keyword set, and the third keyword set into the alternative word set of the target multimedia file (see the orchestration sketch below).
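The following is a minimal orchestration sketch of this three-model embodiment; the model objects and their method names are assumed stand-ins, not interfaces defined by the application.

```python
def build_alternative_words(text_data, modal_data,
                            keyword_model, codec_model, multimodal_model):
    first = keyword_model.extract(text_data)       # keyword extraction model
    second = codec_model.generate(text_data)       # encoding and decoding model
    third = multimodal_model.predict(modal_data)   # multi-modal model over P modalities
    # Combine the three keyword sets into the alternative word set.
    return set(first) | set(second) | set(third)
```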
In one embodiment, the one or more predicted retrieval phrases include a first retrieval phrase and a second retrieval phrase; the multi-modal model includes an image feature extraction module, an audio feature extraction module, a text feature extraction module, a fully connected layer, and a prediction layer; and the P modalities further include an image modality and an audio modality. When decoding the fusion feature to obtain the one or more predicted retrieval phrases, the processing unit 702 is specifically configured to:
decode the fusion feature to obtain the first retrieval phrase;
acquire a word vector of the first retrieval phrase, and decode the word vector of the first retrieval phrase together with the fusion feature to obtain the second retrieval phrase.
When processing the data of the P modalities of the target multimedia file with the multi-modal model to obtain the third keyword set, the processing unit 702 is specifically configured to:
extract image features from the data of the image modality of the target multimedia file through the image feature extraction module;
extract audio features from the data of the audio modality of the target multimedia file through the audio feature extraction module;
extract text features from the data of the text modality of the target multimedia file through the text feature extraction module;
fuse the image features, the audio features, and the text features through the fully connected layer to obtain the fusion feature;
obtain the third keyword set according to a prediction result, where the prediction result is obtained by the prediction layer performing prediction processing on the fusion feature (a model sketch follows).
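A minimal sketch of such a multi-modal model follows, assuming PyTorch; the feature dimensions, the use of linear projections as stand-ins for the three extraction modules, and the 0.5 probability cut-off are all assumptions.

```python
import torch
import torch.nn as nn

class MultiModalLabeler(nn.Module):
    def __init__(self, img_dim=2048, aud_dim=512, txt_dim=768,
                 hidden=512, num_labels=10000):
        super().__init__()
        # Stand-ins for the image/audio/text feature extraction modules.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.fuse = nn.Linear(3 * hidden, hidden)     # fully connected fusion layer
        self.predict = nn.Linear(hidden, num_labels)  # prediction layer

    def forward(self, img_feat, aud_feat, txt_feat):
        h = torch.cat([self.img_proj(img_feat),
                       self.aud_proj(aud_feat),
                       self.txt_proj(txt_feat)], dim=-1)
        fusion = torch.relu(self.fuse(h))             # the fusion feature
        return torch.sigmoid(self.predict(fusion))    # probability per label word

model = MultiModalLabeler()
probs = model(torch.randn(1, 2048), torch.randn(1, 512), torch.randn(1, 768))
third_keyword_ids = (probs > 0.5).nonzero()           # labels above a threshold
```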
In one embodiment, the data of the text modality of the target multimedia file includes P text phrases, where P is a positive integer, and the prediction result includes a label word set carrying a probability value for each label word. When decoding the fusion feature to obtain the first retrieval phrase, the processing unit 702 is specifically configured to:
acquire text features of the data of the text modality of the target multimedia file, where the text features include the feature vectors of the P text phrases;
determine an attention coefficient between the fusion feature and the feature vector of each text phrase based on an attention mechanism;
perform weighted summation of the feature vectors of the P text phrases with their attention coefficients to obtain a feature to be decoded, and decode the feature to be decoded to obtain probability values corresponding to K label words respectively;
select the label word or text phrase corresponding to the maximum value among the K probability values and the P attention coefficients as the first retrieval phrase (a selection sketch follows this embodiment).
The processing unit 702 is further configured to generate, through the text feature extraction module, an important word set for the data of the text modality of the target multimedia file, where the important word set carries a probability value for each important word. Obtaining the third keyword set according to the prediction result then includes: adding the important words whose probability values are higher than a first probability threshold in the important word set, and the label words whose probability values are higher than a second probability threshold in the label word set, to the third keyword set.
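The selection step can be read as a pointer-style choice between generating a label word and copying a text phrase from the file. A minimal sketch follows; tensor shapes and the dot-product attention form are assumptions.

```python
import torch
import torch.nn.functional as F

def pick_first_phrase(fusion, phrase_vecs, phrases, label_words, decoder):
    # fusion: (d,)   phrase_vecs: (P, d)   decoder: maps (d,) -> (K,) logits
    attn = F.softmax(phrase_vecs @ fusion, dim=0)     # P attention coefficients
    to_decode = attn @ phrase_vecs                    # weighted sum: feature to decode
    label_probs = F.softmax(decoder(to_decode), dim=0)  # K label-word probabilities
    scores = torch.cat([label_probs, attn])           # K + P candidate scores
    idx = int(scores.argmax())                        # overall maximum decides
    if idx < len(label_words):
        return label_words[idx]                       # generated label word
    return phrases[idx - len(label_words)]            # copied text phrase
```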
In one embodiment, the target multimedia file is one of the N multimedia files, and the file content of the target multimedia file includes data of a text modality. When performing prediction processing on the file content of the target multimedia file with the multimedia file tag prediction model to obtain the alternative word set of the target multimedia file, the processing unit 702 is specifically configured to:
extract keywords from the data of the text modality of the target multimedia file to obtain a first keyword set;
encode and decode the data of the text modality of the target multimedia file to obtain a second keyword set;
fuse the first keyword set and the second keyword set to obtain the alternative word set of the target multimedia file.
When extracting keywords with the keyword extraction model, the processing unit 702 is specifically configured to:
split the data of the text modality of the target multimedia file to obtain a vocabulary set;
construct a relation graph according to the vocabulary set and the association relations among the vocabularies in the vocabulary set, where the nodes of the relation graph correspond one-to-one to the vocabularies, the connecting edge between the i-th node and the j-th node is determined by the association relation between their corresponding vocabularies, and i and j are positive integers with i ≠ j;
iterate over the relation graph to obtain the weight of each node;
add the vocabularies corresponding to the nodes whose weights are greater than a weight threshold to the first keyword set (a graph-ranking sketch follows).
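This is essentially a TextRank-style computation. The following minimal sketch assumes a co-occurrence window as the association relation, and illustrative damping, iteration, and threshold values.

```python
from collections import defaultdict

def graph_keywords(words, window=3, damping=0.85, iters=30, threshold=1.0):
    # Build the relation graph: an edge links two words that co-occur
    # within `window` positions (the association relation assumed here).
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    # Iterate node weights (PageRank-style update) until stable.
    weight = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        weight = {w: (1 - damping) + damping * sum(
                      weight[n] / len(neighbors[n]) for n in neighbors[w])
                  for w in neighbors}
    # Keep vocabularies whose node weight exceeds the threshold.
    return {w for w, s in weight.items() if s > threshold}
```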
In one embodiment, the encoding and decoding model is constructed based on an attention mechanism. When processing the data of the text modality of the target multimedia file with the encoding and decoding model to obtain the second keyword set, the processing unit 702 is specifically configured to:
encode the data of the text modality of the target multimedia file to obtain a word vector sequence of the text-modality data, where the word vector sequence includes M subsets and M is a positive integer;
select k subsets from the M subsets based on the attention mechanism and decode them to obtain the second keyword set, where k is a positive integer and k ≤ M (a selection sketch follows).
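A minimal sketch of the attention-based subset selection follows; pooling each subset to a single vector and scoring it against a query vector by dot product are assumptions.

```python
import torch
import torch.nn.functional as F

def select_subsets(subset_vecs, query, k):
    # subset_vecs: (M, d), one pooled vector per subset; query: (d,)
    attn = F.softmax(subset_vecs @ query, dim=0)   # one attention weight per subset
    topk = torch.topk(attn, k).indices             # keep the k most relevant subsets
    return subset_vecs[topk]                       # decoded downstream into keywords
```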
In one embodiment, when encoding and decoding the data of the text modality of the target multimedia file to obtain the second keyword set, the processing unit 702 is specifically configured to:
convert the data of the text modality of the target multimedia file into a word vector sequence;
encode the word vector sequence to obtain a hidden feature sequence;
decode the hidden feature sequence to obtain the second keyword set (a pipeline sketch follows).
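The convert-encode-decode pipeline can be sketched as follows, assuming PyTorch and a GRU encoder; the per-position projection stands in for a full autoregressive decoder for brevity.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(30000, 128)                 # text tokens -> word vector sequence
encoder = nn.GRU(128, 256, batch_first=True)     # word vectors -> hidden feature sequence
decoder = nn.Linear(256, 30000)                  # hidden features -> keyword logits

token_ids = torch.randint(0, 30000, (1, 20))     # toy text-modality data
hidden_seq, _ = encoder(embed(token_ids))
keyword_ids = decoder(hidden_seq).argmax(-1)     # greedy per-position decode
```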
In one embodiment, when searching the alternative word sets of the N multimedia files for a search prompt word matching the target search word, the processing unit 702 is specifically configured to:
acquire a search vocabulary set, and construct a prompt index according to the association relation between each search vocabulary in the search vocabulary set and all the alternative words in the N alternative word sets;
determine a set of candidate search prompt words according to the target search word and the prompt index;
acquire feature information of the current user, and determine the weight of each candidate search prompt word in the candidate set according to the feature information, where the feature information includes at least one of user portrait information and user historical behavior information;
sort the candidate search prompt words in descending order of weight, and determine the search prompt word according to the sorting result (an index-and-ranking sketch follows).
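A minimal sketch of the prompt index and user-weighted ranking follows; the substring association relation and the toy weighting function are assumptions.

```python
from collections import defaultdict

def build_prompt_index(search_vocab, alternative_word_sets):
    # Map each search vocabulary to the alternative words it is associated
    # with; the association assumed here is simple substring containment.
    index = defaultdict(set)
    for word in search_vocab:
        for alt_words in alternative_word_sets:    # one set per multimedia file
            for alt in alt_words:
                if word in alt:
                    index[word].add(alt)
    return index

def suggest(target_word, index, user_weight):
    # user_weight: scores a candidate from portrait/history features.
    candidates = index.get(target_word, set())
    return sorted(candidates, key=user_weight, reverse=True)

index = build_prompt_index(
    ["talk show"],
    [{"talk show Luo XX debt", "comedy"}, {"talk show highlights"}])
print(suggest("talk show", index, user_weight=len))   # toy weight: string length
```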
According to an embodiment of the present application, some of the steps involved in the multimedia processing methods shown in fig. 2, fig. 4, and fig. 6 may be performed by the units of the multimedia processing apparatus shown in fig. 7. For example, step S201 in fig. 2 may be performed by the acquisition unit 701 and steps S202 to S204 by the processing unit 702; steps S401 and S402 in fig. 4 may be performed by the acquisition unit 701 and steps S403 to S405 by the processing unit 702; step S601 in fig. 6 may be performed by the acquisition unit 701 and steps S602 to S606 by the processing unit 702. The units of the multimedia processing apparatus shown in fig. 7 may be combined, individually or entirely, into one or several other units, or some unit(s) may be further split into functionally smaller units, without affecting the technical effects of the embodiments of the present application. The units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the multimedia processing apparatus may likewise include other units, and in practical applications these functions may be realized with the assistance of other units or through the cooperation of multiple units.
According to another embodiment of the present application, the multimedia processing apparatus shown in fig. 7 may be constructed by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2, fig. 4, and fig. 6 on a general-purpose computing device, such as a computer that includes processing elements and storage elements such as a Central Processing Unit (CPU), a random access memory (RAM), and a read-only memory (ROM), thereby implementing the multimedia processing method of the embodiments of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and executed by the above computing device via that medium.
Based on the same inventive concept, the principle and advantageous effects of the multimedia processing apparatus provided in the embodiments of the present application are similar to those of the multimedia processing method in the embodiments of the present application; for brevity, reference may be made to the implementation of the method, and details are not repeated here.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a multimedia processing device according to an embodiment of the present application. The multimedia processing device 800 includes at least a processor 801, a communication interface 802, and a memory 803, which may be connected by a bus or in other ways. The processor 801 (or Central Processing Unit, CPU) is the computing and control core of the terminal; it can parse various instructions in the terminal and process various data of the terminal. For example, the CPU can parse a power-on/off instruction sent to the terminal by a user and control the terminal to power on or off; as another example, the CPU can transmit various types of interactive data between the internal structures of the terminal. The communication interface 802 may optionally include a standard wired interface or a wireless interface (e.g., WI-FI or a mobile communication interface), may be controlled by the processor 801 to transmit and receive data, and may also be used for the transmission and interaction of data inside the terminal. The memory 803 (Memory) is a storage device in the terminal for storing programs and data. It is understood that the memory 803 here may include both the built-in memory of the terminal and the extended memory supported by the terminal. The memory 803 provides storage space storing the operating system of the terminal, which may include, but is not limited to, an Android system, an iOS system, a Windows Phone system, and the like; this application is not limited in this respect.
In the embodiment of the present application, the processor 801 executes the executable program code in the memory 803 to perform the following operations:
acquiring a multimedia file set through a communication interface 802, wherein the multimedia file set comprises N multimedia files, and N is a positive integer;
respectively predicting the file content of each multimedia file by adopting a multimedia file label prediction model to obtain an alternative word set of each multimedia file;
when an intention retrieval request is received, searching the alternative word sets of the N multimedia files for a search prompt word matching a target search word, wherein the intention retrieval request carries the target search word;
and outputting the search prompt words.
It should be understood that the multimedia processing device described in this embodiment may perform the multimedia processing method described in the embodiments corresponding to fig. 2, fig. 4, and fig. 6, and may also perform the functions of the multimedia processing apparatus 700 described in the embodiment corresponding to fig. 7, which are not repeated here. The beneficial effects of the same method are likewise not repeated.
Based on the same inventive concept, the principle and the advantageous effect of the multimedia processing device for solving the problem provided in the embodiment of the present application are similar to the principle and the advantageous effect of the multimedia processing method for solving the problem in the embodiment of the present application, and for brevity, the principle and the advantageous effect of the implementation of the method can be referred to, and are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where one or more instructions are stored in the computer-readable storage medium, and the one or more instructions are adapted to be loaded by a processor and to execute the multimedia processing method according to the foregoing method embodiment.
The present application further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the multimedia processing method described in the above method embodiments.
Embodiments of the present application also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method of multimedia processing described above.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of multimedia processing, the method comprising:
acquiring a multimedia file set, wherein the multimedia file set comprises N multimedia files, and N is a positive integer;
respectively predicting the file content of each multimedia file by adopting a multimedia file label prediction model to obtain an alternative word set of each multimedia file;
when an intention retrieval request is received, searching the alternative word sets of the N multimedia files for a search prompt word matching a target search word, wherein the intention retrieval request carries the target search word;
and outputting the search prompt words.
2. The method of claim 1, wherein the target multimedia file is one multimedia file of N multimedia files;
wherein performing prediction processing on the file content of the target multimedia file by using the multimedia file label prediction model to obtain the alternative word set of the target multimedia file comprises:
acquiring fusion characteristics of the target multimedia file, wherein the fusion characteristics are obtained by respectively extracting characteristics from data of at least two modalities of the target multimedia file and fusing the data, and the file content of the target multimedia file comprises data of at least two modalities of a text modality, an image modality and an audio modality;
decoding the fusion characteristics to obtain one or more prediction retrieval phrases;
and splicing one or more prediction retrieval phrases into a prediction retrieval text, and adding the prediction retrieval text into an alternative word set of the target multimedia file.
3. The method of claim 2, wherein the one or more predicted search phrases comprise a first search phrase and a second search phrase;
the decoding processing is performed on the fusion features to obtain one or more prediction retrieval phrases, and the method comprises the following steps:
decoding the fusion characteristics to obtain the first retrieval phrase;
and acquiring a word vector of the first search phrase, and decoding the word vector of the first search phrase and the fusion characteristics to obtain a second search phrase.
4. The method according to claim 3, wherein the data of the text modality of the target multimedia file comprises P text phrases, P being a positive integer;
the decoding the fusion feature to obtain the first search phrase includes:
acquiring text features of data of a text mode of the target multimedia file, wherein the text features comprise feature vectors of P text phrases;
determining an attention coefficient between the fused feature and a feature vector of each text phrase based on an attention mechanism;
carrying out weighted summation on the feature vectors of the P text phrases and their respective attention coefficients to obtain a feature to be decoded, and carrying out decoding processing on the feature to be decoded to obtain probability values respectively corresponding to K label words;
and selecting the label word or text phrase corresponding to the maximum value from the K probability values and the P attention coefficients as the first search phrase.
5. The method of claim 1, wherein the target multimedia file is one of N multimedia files, the file content of the target multimedia file including data of a textual modality;
adopting a multimedia file label prediction model to carry out prediction processing on the file content of the target multimedia file to obtain an alternative word set of the target multimedia file, wherein the alternative word set comprises the following steps:
extracting keywords from the data of the text mode of the target multimedia file to obtain a first keyword set;
coding and decoding the data of the text mode of the target multimedia file to obtain a second keyword set;
and fusing the first keyword set and the second keyword set to obtain an alternative word set of the target multimedia file.
6. The method of claim 5, wherein the extracting keywords from the data of the text modality of the target multimedia file to obtain a first keyword set comprises:
splitting the data of the text mode of the target multimedia file to obtain a vocabulary set;
constructing a relational graph according to the vocabulary set and the incidence relation of each vocabulary in the vocabulary set, wherein nodes in the relational graph correspond to the vocabularies in the vocabulary set one by one, a connecting edge of an ith node and a jth node in the relational graph is determined according to the incidence relation of the vocabulary corresponding to the ith node and the vocabulary corresponding to the jth node, i and j are positive integers, and i is not equal to j;
carrying out iterative processing on the relational graph to obtain the weight of each node;
and adding the vocabulary corresponding to the node with the weight value larger than the threshold value to the first keyword set.
7. The method according to claim 5, wherein said encoding and decoding the data of the text modality of the target multimedia file to obtain a second keyword set comprises:
converting the data of the text mode of the target multimedia file into a word vector sequence;
coding the word vector sequence to obtain a hidden feature sequence;
and decoding the hidden feature sequence to obtain the second keyword set.
8. The method of claim 1, wherein the searching for a search prompt word matching the target search word in the alternative word sets of the N multimedia files comprises:
acquiring a search vocabulary set, and constructing a prompt index according to the incidence relation between each search vocabulary in the search vocabulary set and all the alternative words in the N alternative word sets;
determining a search prompt word set to be selected according to the target search word and the prompt index;
acquiring feature information of a current user, and determining the weight of each search cue word to be selected in the search cue word set to be selected according to the feature information, wherein the feature information comprises one or more of user portrait information and user historical behavior information;
and sequencing the search cue words to be selected in the search cue word set to be selected according to the sequence of the weights from high to low, and determining the search cue words according to the sequencing result.
9. A multimedia processing apparatus, characterized in that the multimedia processing apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a multimedia file set, the multimedia file set comprises N multimedia files, and N is a positive integer;
the processing unit is configured to perform prediction processing on the file content of each multimedia file by using a multimedia file label prediction model to obtain an alternative word set of each multimedia file; further configured to, when an intention retrieval request is received, search the alternative word sets of the N multimedia files for a search prompt word matching a target search word, wherein the intention retrieval request carries the target search word; and further configured to output the search prompt word.
10. A multimedia processing apparatus, comprising: a storage device and a processor;
the storage device stores a computer program therein;
a processor for loading and executing said computer program to implement the multimedia processing method as claimed in any one of claims 1 to 8.
CN202110167706.7A 2021-02-05 2021-02-05 Multimedia processing method, device and equipment Pending CN113392265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110167706.7A CN113392265A (en) 2021-02-05 2021-02-05 Multimedia processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110167706.7A CN113392265A (en) 2021-02-05 2021-02-05 Multimedia processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN113392265A true CN113392265A (en) 2021-09-14

Family

ID=77617167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110167706.7A Pending CN113392265A (en) 2021-02-05 2021-02-05 Multimedia processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN113392265A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868519A (en) * 2021-09-18 2021-12-31 北京百度网讯科技有限公司 Information searching method and device, electronic equipment and storage medium
EP4145303A1 (en) * 2021-09-18 2023-03-08 Beijing Baidu Netcom Science Technology Co., Ltd. Information search method and device, electronic device, and storage medium
CN113868519B (en) * 2021-09-18 2023-11-14 北京百度网讯科技有限公司 Information searching method, device, electronic equipment and storage medium
CN114186093A (en) * 2021-12-13 2022-03-15 北京百度网讯科技有限公司 Multimedia data processing method, device, equipment and medium
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40053157; Country of ref document: HK)
SE01 Entry into force of request for substantive examination