CN111626049A

CN111626049A - Title correction method and device for multimedia information, electronic equipment and storage medium

Info

Publication number: CN111626049A
Application number: CN202010462562.3A
Authority: CN
Inventors: 陈小帅
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Yayue Technology Co ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2020-09-04
Anticipated expiration: 2040-05-27
Also published as: CN111626049B

Abstract

The invention provides a title correction method, a title correction device, electronic equipment and a computer-readable storage medium for multimedia information based on artificial intelligence; the method comprises the following steps: performing type identification processing on the multimedia information to obtain the type of the multimedia information; carrying out error identification processing on the title of the multimedia information to obtain an error position in the title; searching a candidate correction database corresponding to the type according to the text of the error position to obtain a plurality of candidate correction texts for correcting the text of the error position; and screening a plurality of candidate corrected texts, taking the candidate corrected texts obtained after screening as corrected texts, and replacing the texts at the wrong positions of the titles with the corrected texts to form the correct titles of the multimedia information. The invention can automatically and accurately correct the title of the multimedia information and improve the efficiency of title correction.

Description

Title correction method and device for multimedia information, electronic equipment and storage medium

Technical Field

The present invention relates to artificial intelligence technology, and in particular, to a method and an apparatus for modifying a title of multimedia information based on artificial intelligence, an electronic device, and a computer-readable storage medium.

Background

Artificial Intelligence (AI) is a comprehensive technique in computer science, and by studying the design principles and implementation methods of various intelligent machines, the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to a wide range of fields, for example, natural language processing technology and machine learning/deep learning, etc., and along with the development of the technology, the artificial intelligence technology can be applied in more fields and can play more and more important values.

Currently, titles are becoming more and more popular in various multimedia applications. However, an effective scheme for correcting the title of the multimedia information based on artificial intelligence is lacked in the related art, and the method mainly depends on examining and verifying the multimedia information manually to correct the wrong title of the multimedia information, so that the title correction of the multimedia information is realized. Since a large amount of multimedia information needs to be audited manually, the title correction efficiency of the related art is very low.

Disclosure of Invention

The embodiment of the invention provides a method and a device for correcting a title of multimedia information based on artificial intelligence, electronic equipment and a computer readable storage medium, which can automatically and accurately correct the title of the multimedia information and improve the efficiency of correcting the title.

The technical scheme of the embodiment of the invention is realized as follows:

the embodiment of the invention provides a title correction method of multimedia information based on artificial intelligence, which comprises the following steps:

performing type identification processing on multimedia information to obtain the type of the multimedia information;

carrying out error identification processing on the title of the multimedia information to obtain an error position in the title;

searching a candidate correction database corresponding to the type according to the text of the error position to obtain a plurality of candidate correction texts for correcting the text of the error position;

screening the candidate corrected texts, taking the candidate corrected texts obtained after screening as corrected texts, and

and replacing the text of the wrong position of the title with the corrected text to form the correct title of the multimedia information.

The embodiment of the invention provides a title correction device of multimedia information, which comprises:

the identification module is used for carrying out type identification processing on the multimedia information to obtain the type of the multimedia information; carrying out error identification processing on the title of the multimedia information to obtain an error position in the title;

the search module is used for searching a candidate correction database corresponding to the type according to the text of the error position to obtain a plurality of candidate correction texts for correcting the text of the error position;

a screening module for screening the candidate corrected texts, taking the candidate corrected texts obtained after screening as corrected texts, and

and the replacing module is used for replacing the text at the wrong position of the title with the corrected text so as to form the correct title of the multimedia information.

In the above technical solution, the apparatus further includes:

the extraction module is used for extracting the characteristics of a plurality of modes of the multimedia information;

wherein, when the multimedia information is a video, the characteristics of the plurality of modalities include: a video fusion feature, an audio fusion feature, and a text feature of a title of the multimedia information.

In the above technical solution, the extracting module is further configured to encode each video frame in the multimedia information to obtain a vector representation of each video frame, and perform fusion processing on the vector representation of each video frame to obtain the video fusion feature;

coding each audio frame in the multimedia information to obtain vector representation of each audio frame, and performing fusion processing on the vector representation of each audio frame to obtain the audio fusion characteristics;

and coding the text at each position in the title of the multimedia information to obtain a corresponding vector, and combining the vectors at each position into a vector sequence to be used as the text characteristic of the title.

In the above technical solution, the identification module is further configured to perform fusion processing on the video fusion feature, the audio fusion feature, and the text feature to obtain a multi-modal fusion feature of the multimedia information;

mapping the multi-modal fusion features to probabilities corresponding to a plurality of candidate multimedia information types, and

and determining the candidate multimedia information type with the maximum probability as the type of the multimedia information.

In the above technical solution, the identification module is further configured to map the text feature of the title to an error probability corresponding to each position in the title, and determine a position where the error probability is greater than an error threshold as the error position.

In the above technical solution, the identification module is further configured to perform the type identification processing by calling a video type classification submodel in a multitask identification model;

the error recognition process is performed by invoking an error classification submodel in the multitask recognition model.

In the above technical solution, the apparatus further includes:

a training module for performing type recognition processing on the multimedia information sample through the multi-task recognition model to obtain the prediction type of the multimedia information sample, and

carrying out error identification processing on the title of the multimedia information sample to obtain a prediction error position in the title;

constructing a loss function of the multi-task identification model according to the prediction type of the multimedia information sample, the multimedia information type label of the multimedia information sample, the prediction error position in the multimedia information sample and the error position label in the multimedia information sample;

and updating the parameters of the multi-task recognition model until the loss function is converged, and taking the updated parameters of the multi-task recognition model when the loss function is converged as the parameters of the trained multi-task recognition model.

In the above technical solution, the apparatus further includes:

the generating module is used for extracting partial text in the title of the multimedia information positive sample from the multimedia information positive sample set;

querying a text library for an error text corresponding to the partial text;

replacing part of the text in the title with the error text to generate a multimedia information negative sample containing the error text, and

and determining the position of the error text as the error position label of the multimedia information negative sample.

In the foregoing technical solution, the search module is further configured to, for a candidate modification database corresponding to the type of the multimedia information, perform at least one of the following processes:

inquiring the candidate correction text corresponding to the pinyin of the text at the error position;

inquiring the candidate corrected texts corresponding to the fonts of the texts at the wrong positions;

and inquiring the candidate corrected texts corresponding to partial texts in the texts at the error positions.

In the above technical solution, the screening module is further configured to, for any one of the candidate corrected texts, execute the following processing:

replacing the text of the error position of the title with the candidate corrected text to generate a corrected title;

carrying out smoothness degree prediction processing on the title before correction through a language model to obtain the smoothness degree of the title before correction;

carrying out smoothness degree prediction processing on the corrected title through the language model to obtain the smoothness degree of the corrected title;

taking the difference value of the smoothness degrees before and after title correction as the language smoothness degree of the candidate corrected text;

and when the language smoothness degree of the candidate corrected text is greater than the threshold value of the language smoothness degree corresponding to the type of the multimedia information, taking the candidate corrected text as the corrected text of the title.

In the above technical solution, the language model includes a type personalized language model and a general language model; the screening module is further used for conducting smoothness degree prediction processing on the corrected title through the type personalized language model to obtain a first smoothness degree of the corrected title;

conducting passing degree prediction processing on the corrected title through the universal language model to obtain a second passing degree of the corrected title;

carrying out weighted summation on the first compliance degree and the second compliance degree to obtain the final compliance degree of the corrected title;

the type personalized language model is obtained by training according to multimedia information samples corresponding to the types of the multimedia information, and the universal language model is obtained by training according to multimedia information samples including all types of the multimedia information.

In the above technical solution, the apparatus further includes:

the processing module is used for performing word segmentation processing on the title before correction so as to obtain the number of texts included in the title before correction;

performing word segmentation processing on the corrected title to obtain the number of texts included in the corrected title;

taking the difference value of the number of texts included before and after the title is corrected as a reference threshold value of the title;

and determining the difference value between the language type threshold value corresponding to the type of the multimedia information and the reference threshold value of the title as the language compliance degree threshold value corresponding to the type of the multimedia information.

The embodiment of the invention provides electronic equipment for title correction of multimedia information, which comprises:

a memory for storing executable instructions;

and the processor is used for realizing the title correction method of the multimedia information based on artificial intelligence provided by the embodiment of the invention when the executable instructions stored in the memory are executed.

The embodiment of the invention provides a computer-readable storage medium, which stores executable instructions and is used for causing a processor to execute the method for modifying the title of the multimedia information based on artificial intelligence provided by the embodiment of the invention.

The embodiment of the invention has the following beneficial effects:

the title of the multimedia information is subjected to error identification processing to obtain an error position in the title, and a text at the error position of the title is replaced by a correction text for correcting the text at the error position, so that the title of the multimedia information can be automatically corrected, and the efficiency of title correction is improved; and then, searching a candidate correction database corresponding to the type of the multimedia information according to the text of the error position to obtain a plurality of candidate correction texts for correcting the text of the error position, and screening the correction texts from the candidate correction texts, namely, the correction texts can be accurately searched in the candidate correction database by fully utilizing the knowledge of the specific type of the multimedia information, so that the title of the multimedia information can be accurately corrected according to the correction texts, and the accuracy of title correction is improved.

Drawings

Fig. 1 is a schematic view of an application scenario of a title modification system for multimedia information according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an electronic device for title modification of multimedia information according to an embodiment of the present invention;

FIGS. 3-6 are schematic flow charts illustrating methods for title modification of artificial intelligence-based multimedia information according to embodiments of the present invention;

fig. 7 is a flowchart illustrating a method for modifying a title of a video according to an embodiment of the present invention;

FIG. 8 is a process flow diagram of a multi-tasking recognition model provided by an embodiment of the invention;

FIG. 9 is a schematic diagram of identifying a fault location provided by an embodiment of the present invention;

fig. 10 is a schematic structural diagram of a title correction apparatus for multimedia information according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

In the description that follows, references to the terms "first", "second", and the like, are intended only to distinguish similar objects and not to indicate a particular ordering for the objects, it being understood that "first", "second", and the like may be interchanged under certain circumstances or sequences of events to enable embodiments of the invention described herein to be practiced in other than the order illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.

Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.

1) Multimedia information (Multimedia): a medium is a carrier for carrying and transporting some information or substance. Media can be divided into five major categories: sensory media, presentation media, storage media, and transmission media. In the field of computers, media are mainly carriers for transmitting and storing information, and transmitted information includes language characters, data, videos, audios and the like; carriers for storage include hard disks, floppy disks, magnetic tape, magnetic disks, optical disks, and the like. The multimedia is to scientifically integrate the functions of various media, provide various forms of information display for users, and obtain more intuitive and vivid information. Multimedia information may be a composite of multiple media, typically including text, sound, and images. That is, the multimedia information in the embodiment of the present invention may be in the form of text, audio, video, and other media.

2) Title correction of multimedia information: the title error correction is also called as multimedia information title error correction, the error of the title of the multimedia information is found, and the error is corrected in time, so that the difficulty in identifying the title error in the manual review stage is reduced, and the accuracy of title expression is improved. For example, if the title of the video is found to have errors, the errors in the title are corrected, so that the dependence on manual review is avoided, and the efficiency of title error correction is improved.

An exemplary application of the electronic device for title modification of multimedia information provided by the embodiment of the present invention is described below.

The electronic device for title correction of multimedia information provided by the embodiment of the invention can be various types of terminal devices or servers, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server for providing cloud computing service; the terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present invention is not limited thereto. Taking a server as an example, for example, the server may be a server cluster deployed in a cloud, and an artificial intelligence cloud service (AI asas) is opened to operation and maintenance personnel, the AI aas platform splits several types of common AI services, and provides an independent or packaged service at the cloud, this service mode is similar to an AI theme mall, and all the operation and maintenance personnel may access one or more artificial intelligence services provided by the AIaaS platform by using an application programming interface. For example, one of the services is a title modification service, that is, a title modification program of multimedia information is encapsulated in a server in the cloud. The operation and maintenance personnel call a title correction service in the cloud service through a terminal so that a server deployed at the cloud end calls massive multimedia information (in the form of videos, audios and other media) and corresponding titles in a multimedia database, the server calls a title correction program of encapsulated multimedia information, a text of an identified error position of the title is replaced by a corrected text, the title of the multimedia information is automatically corrected, the corrected multimedia information is stored in the multimedia database, namely the corrected multimedia information can be directly put into application, accurate multimedia information (corrected multimedia information) can be directly called from the multimedia database subsequently, corresponding application of the multimedia information is carried out, for example, in video application, the corrected video information is called from the multimedia database, the video title is displayed, and a user can conveniently use the corrected video title according to the accurate video title, the video to be played is selected.

Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a title modification system 10 for multimedia information according to an embodiment of the present invention, in which a terminal 200 is connected to a server 100 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.

The terminal 200 may be used to obtain the multimedia information and the corresponding title, for example, the operation and maintenance personnel inputs the multimedia information and the corresponding title through the terminal, and after the input is completed, the terminal obtains the multimedia information and the corresponding title from the address.

In some embodiments, the terminal 200 may execute the title correction method for multimedia information based on artificial intelligence provided by the embodiments of the present invention to automatically correct the title of the multimedia information, for example, after the terminal 200 is installed with a client, such as a title correction client dedicated for correcting the title, or other clients, such as a video client, an instant messaging client, a browser client, an education client, etc., the operation and maintenance staff inputs the multimedia information and the corresponding title at the title correction client, the terminal 200 searches a correction text for correcting the text at the error position in the candidate correction database according to the type of the identified multimedia information, and corrects the title of the multimedia information according to the correction text, so as to automatically correct the title of the multimedia information and display the corrected title of the multimedia information on the display interface 210 of the terminal 200, the operation and maintenance personnel can check the corrected multimedia information and store the corrected multimedia information into the multimedia database, and then can directly call the accurate multimedia information from the multimedia database to perform corresponding multimedia information application, such as video and audio playing.

In some embodiments, the terminal 200 may also send the multimedia information and the corresponding title input by the operation and maintenance staff on the terminal 200 to the server 100 through the network 300, and call a title correction function (a title correction program of the encapsulated multimedia information) of the multimedia information provided by the server 100, the server 100 corrects the title of the multimedia information through the artificial intelligence-based title correction method of the multimedia information provided by the embodiments of the present invention, for example, a title correction client is installed on the terminal 200, the operation and maintenance staff inputs certain multimedia information and the corresponding title in the title correction client, the terminal 200 sends the multimedia information and the corresponding title to the server 100 through the network 300, after receiving the multimedia information and the corresponding title, the server 100 calls the title correction program of the encapsulated multimedia information, according to the type of the identified multimedia information, searching out a corrected text of the text for correcting the error position from the candidate correction database, correcting the title of the multimedia information according to the corrected text, returning the corrected title of the multimedia information to the title correction client, and displaying the corrected title on the display interface 210 of the terminal 200, so that the operation and maintenance staff can check the corrected multimedia information, or storing the corrected multimedia information into the multimedia database by the server 100, and then directly calling the accurate multimedia information from the multimedia database to apply the corresponding multimedia information, such as playing video, audio and the like.

The following describes a structure of an electronic device for title correction of multimedia information according to an embodiment of the present invention, where the electronic device for title correction of multimedia information may be various terminals, such as a mobile phone, a computer, a television, a smart speaker, a smart watch, and the like, and may also be the server 100 shown in fig. 1.

Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for title correction of multimedia information according to an embodiment of the present invention, and taking the electronic device 500 as a server as an example for explanation, the electronic device 500 for title correction of multimedia information shown in fig. 2 includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.

The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.

The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in connection with embodiments of the invention is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.

In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.

An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;

a network communication module 552 for communicating to other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.

As can be understood from the foregoing, the title correction method for multimedia information based on artificial intelligence provided by the embodiments of the present invention can be implemented by various types of electronic devices for title correction of multimedia information, such as a terminal or a server.

The title correction method based on artificial intelligence multimedia information provided by the embodiment of the invention is described below by combining with the exemplary application and implementation of the server provided by the embodiment of the invention. Referring to fig. 3, fig. 3 is a flowchart illustrating a title modification method based on artificial intelligence multimedia information according to an embodiment of the present invention, which is described with reference to the steps shown in fig. 3.

In step 101, a type identification process is performed on the multimedia information to obtain the type of the multimedia information.

As an example of obtaining the multimedia information, the operation and maintenance staff may input the multimedia information and a corresponding title, for example, a certain entertainment video information and a title of the video information, on an input interface of the terminal, and after the input is completed, the terminal may forward the multimedia information and the corresponding title to the server, so that the server performs type identification processing according to the multimedia information and performs error identification processing according to the title of the multimedia information.

The multimedia information may be a video or a media such as an audio. For example, when the multimedia information is a video, the genre of the video may be a tv show, a movie, a variety, music, a game, a cartoon, or the like.

As a pre-processing link of the type recognition processing and the error recognition processing, the characteristics of multiple modes of the multimedia information can be extracted, so that the multimedia information can be subjected to multi-task joint recognition, namely the type recognition processing and the error recognition processing, according to the characteristics of the multiple modes. Therefore, the type identification processing and the error identification processing can be quickly carried out through the characteristics of a plurality of modes of the multimedia information, so that the type of the multimedia information and the error position of the corresponding title can be quickly identified, the requirement of carrying out the type identification processing and the error identification processing respectively depending on other characteristics of the multimedia information is avoided, the computing resource of a server can be saved, and the title of the multimedia information can be quickly corrected.

Referring to fig. 4, fig. 4 is an alternative flowchart of a title modification method based on artificial intelligence multimedia information according to an embodiment of the present invention, and fig. 4 shows that step 106 is further included before step 101 and step 102. In step 106, extracting features of a plurality of modes of the multimedia information; when the type of the multimedia information is a video, the characteristics of the plurality of modalities include: a video fusion feature, an audio fusion feature, and a text feature of a title of the multimedia information.

For example, when the category of the multimedia information is a video, before performing the type recognition processing and the error recognition processing, features of a plurality of modalities of the video may be extracted, the features of the plurality of modalities including a video fusion feature, an audio fusion feature, and a text feature of a title of the video; when the kind of the multimedia information is audio, before the type recognition processing and the error recognition processing are performed, features of a plurality of modalities of the audio, including an audio fusion feature and a text feature of a title of the audio, may be extracted.

Referring to FIG. 4, FIG. 4 shows that step 106 can be achieved by steps 1061-1063. In step 1061, encoding each video frame in the multimedia information to obtain a vector representation of each video frame, and performing fusion processing on the vector representation of each video frame to obtain video fusion characteristics; in step 1062, encoding each audio frame in the multimedia information to obtain a vector representation of each audio frame, and performing fusion processing on the vector representation of each audio frame to obtain an audio fusion feature; in step 1063, the text at each position in the title of the multimedia information is encoded to obtain a corresponding vector, and the vectors at each position are combined into a vector sequence as the text feature of the title.

Wherein the text of each position in the title may be a single word or a single word. When the type of the multimedia information is a video, extracting a video frame sequence of the video, and performing encoding processing on each video frame in the video frame sequence to obtain a Vector representation of the video frame, for example, encoding the video frame through an initial reset-r-eset 2 (initial-r-eset 2) module, constructing a Vector representation of the video frame, and fusing the Vector representation of each video frame to obtain a video fusion feature, that is, characterizing all video frame information of the video through the video fusion feature, for example, performing processing such as weighted summation on the Vector representation of each video frame through a network local Aggregated Vector (NetVLAD) model, so as to obtain the video fusion feature; the audio frame sequence of the audio can also be extracted, and each audio frame in the audio frame sequence is subjected to encoding processing to obtain a vector representation of the audio frame, for example, a neural network model (V GGish model) is used to extract a 128-dimensional feature vector with semantics from the waveform of the audio frame to construct a vector representation of the audio frame, and the vector representation of each audio frame is fused to obtain an audio fusion feature, that is, all audio frame information of the video is represented by the audio fusion feature, for example, a NetVLAD model is used to perform weighted summation and other processing on the vector representation of each audio frame to obtain the audio fusion feature; the text at each position in the title of the multimedia information can also be encoded to construct a vector corresponding to each text, and the vectors at each position are combined into a vector sequence to construct the text features of the title.

For example, when the type of the multimedia information is audio, an audio frame sequence of the audio may be extracted, and each audio frame in the audio frame sequence may be subjected to encoding processing to obtain a vector representation of the audio frame, for example, a 128-dimensional feature vector having a semantic meaning is extracted from a waveform of the audio frame through a VGGish model to construct a vector representation of the audio frame, and the vector representation of each audio frame is fused to obtain an audio fusion feature, for example, through a NetVLAD model, and the vector representation of each audio frame is subjected to weighted summation and other processing to obtain an audio fusion feature; the text at each position in the title of the multimedia information can also be encoded to construct a vector corresponding to each text, and the vectors at each position are combined into a vector sequence to construct the text features of the title.

In some embodiments, performing a type identification process on the multimedia information to obtain a type of the multimedia information includes: fusing the video fusion feature, the audio fusion feature and the text feature to obtain a multi-modal fusion feature of the multimedia information; and mapping the multi-modal fusion features into probabilities corresponding to a plurality of candidate multimedia information types, and determining the candidate multimedia information type with the maximum probability as the type of the multimedia information.

In order to realize multi-task joint identification of multimedia information, namely type identification processing and error identification processing. Before the type of the multimedia information is identified, the video fusion feature, the audio fusion feature and the text feature of the multimedia information can be fused to obtain the multi-mode fusion feature of the multimedia information, for example, the video fusion feature, the audio fusion feature and the text feature can be weighted and summed, and the result of weighted summation is used as the multi-mode fusion feature of the multimedia information, wherein through the weighted summation mode, when the contribution of the video fusion feature for identifying the type of the multimedia is larger, the weight value of the video fusion feature can be set to be larger, so that through the weighted summation mode, the multi-mode fusion feature can be accurately determined, and the type of the multimedia information can be accurately identified according to the multi-mode fusion feature; the video fusion feature, the audio fusion feature and the text feature can also be spliced, and the splicing result is used as the multi-mode fusion feature of the multimedia information, so that the multi-mode fusion feature can be quickly generated through simple splicing, and the computing resources of the server can be saved. After the multi-mode fusion features of the multimedia information are determined, the multi-mode fusion features are mapped to the probabilities corresponding to a plurality of candidate multimedia information types through the full connection layer, and the candidate multimedia information type with the maximum probability is determined as the type of the multimedia information, so that the type of the multimedia information is identified according to type identification processing, for example, the current video is identified as an integrated art video.

In step 102, the error identification process is performed on the title of the multimedia information to obtain the error position in the title.

In order to correct the title of the multimedia information, it is necessary to perform error recognition processing on the title of the multimedia information to obtain an error position in the title, for example, to recognize that the error position of the title is a position where the 2 nd to 5 th words are located. The error positions in the title may be continuous positions or discontinuous positions, for example, the 3 rd to 5 th positions in the title are error positions, or the 3 rd to 5 th positions and the 7 th to 8 th positions in the title are error positions

In some embodiments, performing error identification processing on a title of multimedia information to obtain an error location in the title includes: and mapping the text characteristics of the title to the error probability of each position in the corresponding title, and determining the position with the error probability larger than the error threshold value as the error position.

For example, after encoding processing is performed on a text at each position in a title of multimedia information to construct a vector corresponding to each text, and the vectors at each position are combined into a vector sequence to construct a text feature of the title, the text feature of the title is mapped to an error probability (error probability corresponding to each text in the title) corresponding to each position in the title through a Bidirectional Encoder characterization quantity (BERT) model from a transformer, and a position where the error probability is greater than an error threshold is determined as an error position, for example, the error threshold is 0.85, the error probability of the 3 rd position in the title is 0.9, and the 3 rd position in the title is an error position of the title.

In some embodiments, the type recognition process may be performed by invoking a video type classification submodel in the multitask recognition model; and carrying out error identification processing by calling an error classification submodel in the multitask identification model. I.e., type recognition processing and error recognition processing, are implemented by invoking a multitask recognition model.

Referring to fig. 5, fig. 5 is an optional flowchart of a title modification method for multimedia information based on artificial intelligence according to an embodiment of the present invention, and fig. 5 shows that before step 101 and step 102, further includes step 107 and step 109: in step 107, performing type identification processing on the multimedia information sample through the multitask identification model to obtain a prediction type of the multimedia information sample, and performing error identification processing on a title of the multimedia information sample to obtain a prediction error position in the title; in step 108, constructing a loss function of the multi-task recognition model according to the prediction type of the multimedia information sample, the multimedia information type label of the multimedia information sample, the prediction error position in the multimedia information sample and the error position label in the multimedia information sample; in step 109, the parameters of the multitask recognition model are updated until the loss function converges, and the updated parameters of the multitask recognition model at the time of the loss function convergence are used as the parameters of the trained multitask recognition model.

After the server obtains the multimedia information sample, determining the value of a loss function of the multi-task identification model according to the prediction type of the multimedia information sample, the multimedia information type label of the multimedia information sample, the prediction error position in the multimedia information sample and the error position label in the multimedia information sample, judging whether the value of the loss function exceeds a preset threshold value, determining an error signal of the multi-task identification model based on the loss function when the value of the loss function exceeds the preset threshold value, reversely propagating the error information in the multi-task identification model, and updating each layer in the propagation processThe model parameters of (1). Wherein the loss function is

Where y' represents the prediction type of the multimedia information sample, y_iA multimedia information type label representing a multimedia information sample, x' representing a prediction error position in the multimedia information sample, x_iAnd representing the label of the error position in the multimedia information sample, and N represents the total number of the multimedia information samples. Therein, relate to

Are applicable to embodiments of the present invention.

In addition, the multitask recognition model may include a type recognition model (video type classification submodel) and a title error recognition model (error classification submodel), i.e., training of the multitask recognition model is achieved by respectively training the type recognition model and the title error recognition model. Training for the type recognition model is as follows: performing type identification processing on the multimedia information sample through a type identification model to obtain a prediction type of the multimedia information sample; constructing a loss function of a type identification model according to the prediction type of the multimedia information sample and the multimedia information type label of the multimedia information sample; and updating the parameters of the type recognition model until the loss function is converged, and taking the updated parameters of the type recognition model when the loss function is converged as the parameters of the trained type recognition model. Training for the title misidentification model is as follows: carrying out error identification processing on the title in the multimedia information sample through a title error identification model to obtain a prediction error position in the multimedia information sample; constructing a loss function of a title error identification model according to the predicted error position in the multimedia information sample and the error position label in the multimedia information sample; and updating the parameters of the title error identification model until the loss function is converged, and taking the updated parameters of the title error identification model when the loss function is converged as the parameters of the trained title error identification model. After the trained type recognition model and the trained title error recognition model are obtained, the type recognition processing can be carried out on the multimedia information through the trained type recognition model to obtain the type of the multimedia information, and the error recognition processing can be carried out on the title of the multimedia information through the trained title error recognition model to obtain the error position of the title.

Here, by training the type recognition model and the title error recognition model respectively, the tasks of type recognition and error recognition can be realized respectively, that is, the type recognition and the error recognition are independent and do not affect each other, and it is avoided that when one task of the type recognition and the error recognition has an error, the other task also has an error, thereby improving the robustness of the type recognition and the error recognition.

Describing backward propagation, inputting training sample data into an input layer of a neural network model, passing through a hidden layer, finally reaching an output layer and outputting a result, which is a forward propagation process of the neural network model, wherein because the output result of the neural network model has an error with an actual result, an error between the output result and the actual value is calculated and is propagated backward from the output layer to the hidden layer until the error is propagated to the input layer, and in the process of backward propagation, the value of a model parameter is adjusted according to the error; and continuously iterating the process until convergence, wherein the multitask identification model, the type identification model and the title error identification model belong to a neural network model.

The parameters of the multi-task recognition model are solved in a machine learning model mode, and compared with the method that the parameters of the multi-task recognition model in the related art are set based on experience, the method has better precision.

In order to quickly and accurately generate a training sample in the training process, part of texts in the title of the multimedia information positive sample can be extracted from the positive sample set of the multimedia information; inquiring error texts corresponding to the partial texts from a text library; and replacing part of text in the title with error text to generate a multimedia information negative sample containing the error text, and determining the position of the error text as an error position label of the multimedia information negative sample.

In the embodiment of the present invention, it is found that most titles of multimedia information are correct, i.e., it is easier to obtain a positive sample of multimedia information. In order to quickly and accurately obtain the multimedia information negative sample, the multimedia information positive sample can be extracted from the multimedia information positive sample set, part of texts, namely part words, in the multimedia information positive sample can be randomly extracted, for example, if the title has 5 words, the part of texts are the first 3 words in the title, an error text corresponding to the part of texts can be inquired from a text library, the error text can be a text similar to part of text fonts, pinyin and the like, and the part of texts in the title can be replaced by the error text, so that the accurate multimedia information negative sample containing the error text is generated, and a multi-task recognition model can be trained according to the multimedia information negative sample.

In step 103, a candidate corrected text corresponding to the type is searched according to the text of the error position, and a plurality of candidate corrected texts for correcting the text of the error position are obtained.

In order to obtain a candidate corrected text for correcting the text at the error position in a targeted manner, after the server obtains the type of the multimedia information, a candidate corrected database corresponding to the type of the multimedia information is determined, and the candidate corrected database is searched according to the text at the error position of the multimedia information, so that a targeted candidate corrected text is obtained.

In some embodiments, searching a candidate corrected text database corresponding to the type according to the text of the error position to obtain a plurality of candidate corrected texts for correcting the text of the error position, including: for a candidate correction database corresponding to the type of the multimedia information, performing at least one of the following processes: inquiring candidate correction texts corresponding to the pinyin of the text at the error position; inquiring candidate corrected texts corresponding to the fonts of the texts at the wrong positions; and inquiring candidate corrected texts corresponding to partial texts in the texts at the error positions.

For example, after the type of the multimedia information is determined, a corresponding candidate correction database may be determined, for example, if the type of the multimedia information is determined to be a drama, the candidate correction database is determined to be a drama candidate database, and the drama candidate database includes information such as a drama name, an actor, a character, a common word and the like related to the drama. The candidate correction database supports pinyin indexes, font indexes and part similarity indexes, and can be inquired through font, pinyin and part similarity of the text at the wrong position so as to inquire candidate correction texts corresponding to the pinyin of the text at the wrong position, candidate correction texts corresponding to the font of the text at the wrong position and candidate correction texts corresponding to part of the text in the text at the wrong position.

In step 104, a plurality of candidate corrected texts are screened, and the candidate corrected texts obtained after screening are used as corrected texts.

For example, after the server obtains a plurality of candidate corrected texts, a corrected text can be screened from the plurality of candidate corrected texts, so that the title of the multimedia information can be corrected according to the corrected text.

Referring to fig. 6, fig. 6 is a schematic flow chart of an optional method for modifying a title of multimedia information based on artificial intelligence according to an embodiment of the present invention, and fig. 6 shows that step 104 of fig. 3 can be implemented by step 1041 and step 1045 of fig. 6: for any one of the plurality of candidate corrected texts, performing the following processing: in step 1041, replacing the text of the error position of the title with the candidate corrected text to generate a corrected title; in step 1042, conducting a smoothness degree prediction process on the title before correction through a language model to obtain the smoothness degree of the title before correction; in step 1043, performing a compliance degree prediction process on the modified title through the language model to obtain a compliance degree of the modified title; in step 1044, the difference between the compliance degrees before and after title correction is used as the language compliance degree of the candidate corrected text; in step 1045, when the language compliance degree of the candidate corrected text is greater than the threshold value of the language compliance degree corresponding to the type of the multimedia information, the candidate corrected text is taken as the corrected text of the title.

For example, after the title of the multimedia information is corrected by the text of the error position of the title, the title before correction and the title after correction are respectively predicted by the language model, and the compliance degree before and after the title correction is obtained, and when the difference value of the compliance degree before and after the title correction is larger than the language compliance degree threshold value corresponding to the type of the multimedia information, the corrected title is higher in compliance degree, namely, the corrected title is correct, wherein the language compliance degree threshold values corresponding to the type of the multimedia information are different due to different types of the multimedia information. In addition, the text of the error position of the title may be replaced with a candidate corrected text to generate a corrected title, and then the corrected title may be subjected to compliance prediction processing by the language model to obtain compliance of the corrected title, and when the compliance of the corrected title is greater than a language compliance threshold corresponding to the type of the multimedia information, the candidate corrected text may be used as the corrected text of the title. Wherein, the language model belongs to a neural network model.

In order to accurately obtain the compliance degree before and after title correction, compliance degree prediction processing can be respectively carried out through a type personalized language model and a general language model. The language model comprises a type personalized language model and a general language model; and carrying out the smoothness degree prediction processing on the corrected title through the language model to obtain the smoothness degree of the corrected title, wherein the smoothness degree prediction processing comprises the following steps: carrying out smoothness degree prediction processing on the corrected title through the type personalized language model to obtain a first smoothness degree of the corrected title; conducting passing degree prediction processing on the corrected title through a universal language model to obtain a second passing degree of the corrected title; and carrying out weighted summation on the first compliance degree and the second compliance degree to obtain the final compliance degree of the modified title.

The type personalized language model is obtained by training a multimedia information sample corresponding to the type of the multimedia information, and the general language model is obtained by training a multimedia information sample including all types of the multimedia information, namely the type personalized language model can carry out smoothness degree prediction processing on the title of a certain type of the multimedia information in a pertinence mode, and the general language model can carry out smoothness degree prediction processing on the title of all types of the multimedia information. When the contribution of the type personalized language model to the final compliance degree of the modified title is large, the weight value of the first compliance degree can be set to be larger. Wherein, the type personalized language model and the general language model belong to a neural network model.

In some embodiments, the language compliance threshold corresponding to the type of the multimedia information may be dynamic, that is, before the candidate corrected text is the corrected text of the title, further comprising: performing word segmentation processing on the title before correction to obtain the number of texts included in the title before correction; performing word segmentation processing on the corrected title to obtain the number of texts included in the corrected title; taking the difference value of the number of texts included before and after the title is corrected as a reference threshold value of the title; and determining the difference value between the language type threshold value corresponding to the type of the multimedia information and the reference threshold value of the title as the language compliance degree threshold value corresponding to the type of the multimedia information.

For example, the number of words fragmented after word segmentation is large, that is, the number of texts included in the title is large, due to an error in the title before correction, and when the error in the title is corrected, the number of texts included in the title after correction is very likely to be small, so that when the number of texts included in the title after correction is reduced relative to the number of texts included in the title before correction, it is determined that the error in the title is corrected, and it is necessary to keep the candidate corrected text as the final corrected text of the title. The embodiment of the present invention may use a difference between the number of texts included before and after the title is modified as a reference threshold of the title, and determine a difference between a language type threshold corresponding to the type of the multimedia information and the reference threshold of the title as a language compliance degree threshold corresponding to the type of the multimedia information, where the language type thresholds corresponding to the types of the multimedia information are different due to the different types of the multimedia information, and for example, when a request for a title of a variety is high relative to a request for a title of a tv series, the language compliance degree threshold corresponding to the variety may be set to 0.9, and the language compliance degree threshold corresponding to the tv series may be set to 0.7. Other variant calculation modes of the difference value between the language type threshold value corresponding to the type of the multimedia information and the reference threshold value of the title are also suitable for the embodiment of the invention.

In step 105, the text of the wrong position of the title is replaced with the corrected text to form the correct title of the multimedia information.

For example, after the server obtains the corrected text, the text of the wrong position of the title may be replaced with the corrected text to correct the title of the multimedia information. The server can return the modified title of the multimedia information to the terminal and display the modified title on a display interface of the terminal, so that operation and maintenance personnel can check the modified multimedia information, or the server stores the modified multimedia information into a multimedia database, and can subsequently directly call the accurate multimedia information from the multimedia database to apply the corresponding multimedia information, such as playing video, audio and the like.

In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.

The embodiment of the present invention may be applied to an application scenario of title correction of a video, as shown in fig. 1, a terminal 200 is connected to a server 100 deployed in a cloud via a network 300, a title correction client is installed on the terminal 200, an operation and maintenance worker inputs a certain video and a corresponding title in the title correction client, the terminal 200 sends the video and the corresponding title to the server 100 via the network 300, after receiving the video and the corresponding title, the server 100 searches a correction text of a text for correcting an error position in a candidate correction database according to a type of identified multimedia information, replaces the text of the error position of the identified title with the correction text, returns the title of the corrected multimedia information to the title correction client, and displays the title of the corrected multimedia information on a display interface 210 of the terminal 200, so that the operation and maintenance personnel can check the corrected multimedia information, or the server 100 can store the corrected multimedia information into the multimedia database.

In the related art, title correction is mainly realized through two ways, which are respectively: 1) judging whether the title is smooth or not based on the language model, and when judging that the title is not smooth, judging that the title has errors, and then correcting the title; 2) based on the generative model, an end-to-end generative model is constructed to correct the title.

Although, both of the above schemes may implement header correction. However, the title of the video is not corrected accurately. In order to solve the problem, the embodiment of the invention carries out multi-task joint learning on the video type identification and the error identification by thinning the type of the video, fully utilizes the domain knowledge of specific types to assist the correction of the title, improves the error correction capability of the video title, better assists the video auditing stage, further improves the efficiency of manual error correction, and reduces the influence of the error title of the video on a user.

As shown in fig. 7, the overall flow of the video title correction method according to the embodiment of the present invention includes five steps, which are respectively: 1) acquiring video information; 2) identifying the video type and title error multitask; 3) constructing a candidate correction text list based on type knowledge; 4) correcting the candidate corrected text list ordering; 5) and returning a correction result, wherein the specific processing procedure is as follows:

1) video information acquisition

The video information includes a title of a video, image frames, and audio frames, for example, a video-audio codec tool (ffmpeg) uniform frame extraction method is adopted to extract one image frame in the video every second, and a pixel value of the image is input as an original representation of each image frame. The audio frame can also adopt an ffmpeg uniform frame extraction method to extract the audio frame in the video, and the Mel frequency spectrogram characteristics are constructed for the audio frame and are used as the original representation input of the audio frame.

2) Video type, title error multi-task recognition

Because different video types have different type error characteristics, for example, the error of the video of the television series can be television series name error, actor name error and the like, the title error judgment and the video type judgment are subjected to multi-task combined learning, the representation capability of a multi-task recognition model representation layer can be improved, the accuracy of the title error judgment can be improved, and type guidance can be provided for subsequently constructing a candidate correction text list. After the video frame sequence of the video is extracted, the video frame sequence is encoded through an inclusion-rest 2 module to construct a video frame representation (vector representation of the video frame), and the video frame representation is subjected to feature fusion through a NetVLAD model to obtain video fusion features; after extracting the audio frame sequence of the video, coding the audio frame sequence through a VGGish model to construct an audio frame representation (vector representation of an audio frame), and performing feature fusion on the audio frame representation through a NetVLAD model to obtain audio fusion features; the title of the video is coded through a BERT model to construct a title representation so as to generate text features of the title, multi-mode feature fusion is carried out on the video fusion features, the audio fusion features and the text features, video classification probability is output through a full-connection layer network, and the type of the video is determined according to the video classification probability. After the text feature of the title is obtained by the BERT model, an error position of the title can be identified based on the text feature (output error determination).

The multitask identification model comprises a type identification model, and the type identification model can accurately judge the type of a video, such as a TV play, a movie, a synthesis art, music, a game, an animation and the like, by performing multimodal combined modeling on a video title text (text features), video image content (video frames) and audio content (audio frames). The multitask recognition model further comprises a title error determination model which can perform error recognition learning on the title text.

In order to train the multi-task recognition model sufficiently, two tasks of video type recognition and title error recognition can be trained respectively, for example, the type recognition model can be trained on a large number of video training samples with labeled types, so that the type recognition model achieves a good recognition level through multi-modal characterizations, wherein the format of the training samples is video-type: XX, for example, the training samples are (video 1-type: TV show), (video 2-type: movie), (video 3-type: art), (video 4-type: sports), (video 5-type: game), …, (video v-type: cartoon).

The title error recognition model can be independently trained on training samples which are manually marked and automatically generated in an error mode to strengthen text representation capacity, wherein an automatically constructed error data set can be obtained by randomly selecting a word segment (correct text) in a correct title of a video, determining words (error texts) similar to the word pattern and pinyin of the word segment, replacing the correct text with the error text, and generating a negative sample to construct the error data set, so that the constructed error data set is more consistent with real error distribution. The format of the training sample of the title error recognition model is video-title-error position: XX, for example, the training samples are (video 1-title 1-error position: 1-3), (video 2-title 2-error position: 3-4), (video 3-title 3-error position: 3-5), (video 4-title 4-error position: 7-8), (video 5-title 5-error position: 6-8), …, (video v-title v-error position: 4-6).

After the independent training of the type recognition model and the title error recognition model is finished, the joint training can be carried out again, the type recognition model and the title error recognition model are respectively initialized by adopting the model parameters of the independent training during the joint training, and the training sample format of the joint training is video-type: XX-title-error location: XX, for example, the training samples are (video 1-type: drama-title 1-error position: 1-3), (video 2-type: movie-title 2-error position: 3-4), (video 3-type: variety-title 3-error position: 3-5), (video 4-type: sports-title 4-error position: 7-8), (video 5-type: game-title 5-error position: 6-8), …, (video v-type: cartoon-title v-error position: 4-6).

After the multitask identification model is established by the method, when title error identification is carried out on a video, video information is input into the multitask identification model, the multitask identification model returns the video type, and an error position is identified, for example, the video title is 'AA scene movie theme song, the video title is fascinating the development course of the country', the multitask identification model is identified as a music type, as shown in figure 9, a BERT model (multitask identification model) identifies that a dislocation position is a 'scene', and the 'scene' is actually 'singing at present'; the title is "[ countermeasures", every moment when AA is the most beautiful, the most sweet and the most lovely, the multitask identification model is identified as a synthesis type, and the dislocation position is a "countermeasures" position.

3) Constructing a list of candidate revised texts based on type knowledge

Constructing a corresponding candidate correction database for each video type in advance, for example, counting the title, actors, roles and common words of a video of a television series to construct a candidate correction database of the television series; counting game names, role names, maps, duplicates and commonly used phrases aiming at videos of games to construct a game candidate correction database; for sports videos, athletes, sports stars, game names, fields, commentators and the like are counted in advance to build a sports candidate correction database, namely each type is built by only using video data of the type. The candidate correction database supports pinyin indexing, font indexing and partial similarity indexing (realized based on Elastic Search (ES) word granularity), and can be queried by performing font indexing, pinyin indexing and partial similarity on the text at the wrong position to obtain the candidate correction text.

Inquiring a candidate correction database of a corresponding type based on the identified video type and the error position, constructing a candidate correction text list, and searching based on similarity when inquiring the candidate correction database, such as pinyin similarity, font similarity and partial similarity, wherein for example, the text of the error position is 'anti-copy', inquiring the candidate correction text of the 'rice' comprehensive art name and the candidate correction text of the 'paradigm 37061' comprehensive figure based on the pinyin similarity by inquiring the synthesis candidate correction database, and inquiring the 'food' candidate correction text based on the partial similarity.

4) Correcting candidate revised text list ordering

The output result of the language model can reflect the currency degree of the statement, the embodiment of the invention uses a Long Short-Term Memory artificial neural network (LSTM) language model to score the original title and the corrected title, and carries out descending order sorting on the candidate corrected texts according to the promotion amplitude of the corrected language currency degree (the difference value of the currency degree before and after title correction), namely, the larger the promotion amplitude is, the more the sorting position of the candidate corrected texts is.

The language models may include a genre-personalized language model trained on specific types of video titles and a generic language model trained on all types of videos. And carrying out weighted summation on the scores (the smoothness degree) output by the type personalized language model and the general language model, and taking the result of the weighted summation as the score (the smoothness degree) of the final language model. The weighting coefficient may be adjusted according to actual requirements, for example, the weighting of the type personalized language model is 0.7, and the weighting of the general language model is 0.3.

5) Returning the corrected result

And performing threshold filtering on the candidate corrected text list ordering of the language model, wherein different thresholds can be set for each video type. In addition, the difference value of the number of the participles before and after the title correction is used as a reference index for setting different thresholds, for example, after the original title (the title before the correction) is participled due to errors, the number of the participles is large, and after the errors are corrected, the number of the participles of the corrected title is reduced, so that the filtering threshold can be reduced, the candidate corrected text which meets the threshold can be effectively reserved as the final corrected text, and the text at the error position in the title is replaced by the final corrected text, so that the title of the video can be corrected.

In summary, the embodiment of the invention provides a video type personalized title correction method for multitask joint learning, which, aiming at different video types, combines with specific types of knowledge data, improves the error correction capability of the video title, reduces the manual error correction cost of auditors, improves the expression accuracy of the video title, and improves the video quality in a platform.

The title correction method for multimedia information based on artificial intelligence provided by the embodiment of the present invention has been described in conjunction with the exemplary application and implementation of the server provided by the embodiment of the present invention, and the following continues to describe a scheme for implementing the title correction of multimedia information by matching each module in the title correction device for multimedia information provided by the embodiment of the present invention.

In some embodiments, the title correction apparatus for multimedia information provided by the embodiments of the present invention may be implemented in software, and fig. 10 illustrates a title correction apparatus 555 for multimedia information stored in a memory 550, which may be software in the form of programs and plug-ins, and includes a series of modules, including a recognition module 5551, a search module 5552, a filtering module 5553, a replacement module 5554, an extraction module 5555, a training module 5556, a generation module 5557, and a processing module 5558; the recognition module 5551, the search module 5552, the screening module 5553, the replacing module 5554, the extracting module 5555, and the processing module 5558 are configured to implement a function of title modification of multimedia information provided by an embodiment of the present invention, and the training module 5556 and the generating module 5557 are configured to implement training of a multitask recognition model.

The identification module 5551 is configured to perform type identification processing on the multimedia information to obtain a type of the multimedia information; carrying out error identification processing on the title of the multimedia information to obtain an error position in the title; a searching module 5552, configured to search a candidate correction database corresponding to the type according to the text of the error position, so as to obtain a plurality of candidate correction texts for correcting the text of the error position; a screening module 5553, configured to screen the multiple candidate corrected texts, and use the candidate corrected texts obtained after screening as corrected texts, and a replacing module 5554, configured to replace the text at the wrong position of the title with the corrected text, so as to form a correct title of the multimedia information.

In some embodiments, the apparatus further comprises: an extraction module 5555 for extracting features of a plurality of modalities of the multimedia information; wherein, when the multimedia information is a video, the characteristics of the plurality of modalities include: a video fusion feature, an audio fusion feature, and a text feature of a title of the multimedia information.

In some embodiments, the extracting module 5555 is further configured to perform encoding processing on each video frame in the multimedia information to obtain a vector representation of each video frame, and perform fusion processing on the vector representation of each video frame to obtain the video fusion feature; coding each audio frame in the multimedia information to obtain vector representation of each audio frame, and performing fusion processing on the vector representation of each audio frame to obtain the audio fusion characteristics; and coding the text at each position in the title of the multimedia information to obtain a corresponding vector, and combining the vectors at each position into a vector sequence to be used as the text characteristic of the title.

In some embodiments, the recognition module 5551 is further configured to perform fusion processing on the video fusion feature, the audio fusion feature and the text feature to obtain a multi-modal fusion feature of the multimedia information; and mapping the multi-modal fusion features into probabilities corresponding to a plurality of candidate multimedia information types, and determining the candidate multimedia information type with the maximum probability as the type of the multimedia information.

In some embodiments, the identification module 5551 is further configured to map the text features of the title to an error probability corresponding to each location in the title, and determine a location with an error probability greater than an error threshold as the error location.

In some embodiments, the recognition module 5551 is further configured to perform the type recognition process by invoking a video type classification submodel in a multitask recognition model; the error recognition process is performed by invoking an error classification submodel in the multitask recognition model.

In some embodiments, the apparatus further comprises: the training module 5556 is configured to perform type identification processing on a multimedia information sample through the multitask identification model to obtain a prediction type of the multimedia information sample, and perform error identification processing on a title of the multimedia information sample to obtain a prediction error position in the title; constructing a loss function of the multi-task identification model according to the prediction type of the multimedia information sample, the multimedia information type label of the multimedia information sample, the prediction error position in the multimedia information sample and the error position label in the multimedia information sample; and updating the parameters of the multi-task recognition model until the loss function is converged, and taking the updated parameters of the multi-task recognition model when the loss function is converged as the parameters of the trained multi-task recognition model.

In some embodiments, the apparatus further comprises: a generating module 5557, configured to extract a part of text in a header of a positive sample of multimedia information from the positive sample set of multimedia information; querying a text library for an error text corresponding to the partial text; replacing part of text in the title with the error text to generate a multimedia information negative sample containing the error text, and determining the position of the error text as an error position label of the multimedia information negative sample.

In some embodiments, the search module 5552 is further configured to perform at least one of the following for a candidate rework database corresponding to the type of multimedia information: inquiring the candidate correction text corresponding to the pinyin of the text at the error position; inquiring the candidate corrected texts corresponding to the fonts of the texts at the wrong positions; and inquiring the candidate corrected texts corresponding to partial texts in the texts at the error positions.

In some embodiments, the screening module 5553 is further configured to, for any one of the candidate corrected texts, perform the following: replacing the text of the error position of the title with the candidate corrected text to generate a corrected title; carrying out smoothness degree prediction processing on the title before correction through a language model to obtain the smoothness degree of the title before correction; carrying out smoothness degree prediction processing on the corrected title through the language model to obtain the smoothness degree of the corrected title; taking the difference value of the smoothness degrees before and after title correction as the language smoothness degree of the candidate corrected text; and when the language smoothness degree of the candidate corrected text is greater than the threshold value of the language smoothness degree corresponding to the type of the multimedia information, taking the candidate corrected text as the corrected text of the title.

In some embodiments, the language models include a type-personalized language model and a generic language model; the screening module 5553 is further configured to perform compliance degree prediction processing on the modified title through the type personalized language model to obtain a first compliance degree of the modified title; conducting passing degree prediction processing on the corrected title through the universal language model to obtain a second passing degree of the corrected title; carrying out weighted summation on the first compliance degree and the second compliance degree to obtain the final compliance degree of the corrected title; the type personalized language model is obtained by training according to multimedia information samples corresponding to the types of the multimedia information, and the universal language model is obtained by training according to multimedia information samples including all types of the multimedia information.

In some embodiments, the apparatus further comprises: a processing module 5558, configured to perform word segmentation on the title before the correction to obtain the number of texts included in the title before the correction; performing word segmentation processing on the corrected title to obtain the number of texts included in the corrected title; taking the difference value of the number of texts included before and after the title is corrected as a reference threshold value of the title; and determining the difference value between the language type threshold value corresponding to the type of the multimedia information and the reference threshold value of the title as the language compliance degree threshold value corresponding to the type of the multimedia information.

According to the embodiment of the invention, the error identification processing is carried out on the title of the multimedia information to obtain the error position in the title, and the text at the error position of the title is replaced by the correction text for correcting the text at the error position, so that the title of the multimedia information can be automatically corrected, and the title correction efficiency is improved; and then, searching a candidate correction database corresponding to the type of the multimedia information according to the text of the error position to obtain a plurality of candidate correction texts for correcting the text of the error position, and screening the correction texts from the candidate correction texts, namely, the correction texts can be accurately searched in the candidate correction database by fully utilizing the knowledge of the specific type of the multimedia information, so that the title of the multimedia information can be accurately corrected according to the correction texts, and the accuracy of title correction is improved.

Embodiments of the present invention also provide a computer-readable storage medium storing executable instructions, which, when executed by a processor, cause the processor to perform a title modification method for artificial intelligence based multimedia information provided by embodiments of the present invention, for example, the title modification method for artificial intelligence based multimedia information shown in fig. 3 to 6.

In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EP ROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a HyperText markup Language (H TML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

By way of example, executable instructions may be deployed to be executed on one computing device (a device that includes a smart terminal and a server), or on multiple computing devices located at one site, or distributed across multiple sites and interconnected by a communication network.

The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims

1. A title modification method for multimedia information based on artificial intelligence, the method comprising:

2. The method of claim 1, wherein prior to said performing type identification processing on the multimedia information, the method further comprises:

extracting features of a plurality of modalities of the multimedia information;

3. The method of claim 2, wherein said extracting features of a plurality of modalities of the multimedia information comprises:

coding each video frame in the multimedia information to obtain vector representation of each video frame, and performing fusion processing on the vector representation of each video frame to obtain the video fusion characteristics;

4. The method of claim 2, wherein performing type identification processing on the multimedia information to obtain the type of the multimedia information comprises:

fusing the video fusion feature, the audio fusion feature and the text feature to obtain a multi-modal fusion feature of the multimedia information;

5. The method of claim 2, wherein said performing error identification processing on the title of the multimedia information to obtain the error position in the title comprises:

and mapping the text features of the title to correspond to the error probability of each position in the title, and determining the position with the error probability larger than an error threshold value as the error position.

6. The method of claim 1,

the type identification processing of the multimedia information comprises the following steps:

the type recognition processing is carried out by calling a video type classification submodel in the multitask recognition model;

the error identification processing of the title of the multimedia information comprises the following steps:

7. The method of claim 6,

before the performing type identification processing on the multimedia information, the method further includes:

performing type recognition processing on a multimedia information sample through the multi-task recognition model to obtain a prediction type of the multimedia information sample, and obtaining the prediction type of the multimedia information sample

8. The method of claim 7, further comprising:

extracting partial text in the title of the multimedia information positive sample from the multimedia information positive sample set;

querying a text library for an error text corresponding to the partial text;

9. The method of claim 1, wherein searching a candidate correction database corresponding to the type according to the text of the error position to obtain a plurality of candidate correction texts for correcting the text of the error position comprises:

for a candidate correction database corresponding to the type of the multimedia information, performing at least one of the following processes:

10. The method according to claim 1, wherein the screening the candidate modified texts, and using the candidate modified texts obtained after the screening as modified texts comprises:

for any one of the candidate corrected texts, performing the following processing:

11. The method of claim 10,

the language model comprises a type personalized language model and a general language model;

the obtaining the smoothness degree of the modified title by performing smoothness degree prediction processing on the modified title through the language model comprises:

carrying out smoothness degree prediction processing on the corrected title through the type personalized language model to obtain a first smoothness degree of the corrected title;

12. The method of claim 10, wherein before the determining the candidate corrected text as the corrected text for the title, the method further comprises:

performing word segmentation processing on the title before the correction to obtain the number of texts included in the title before the correction;

13. An apparatus for title modification of multimedia information, the apparatus comprising:

14. An electronic device, characterized in that the electronic device comprises:

a memory for storing executable instructions;

a processor for implementing the artificial intelligence based multimedia information title modification method of any one of claims 1 to 12 when executing the executable instructions stored in the memory.

15. A computer-readable storage medium storing executable instructions for causing a processor to perform the method for title modification of artificial intelligence based multimedia information according to any one of claims 1 to 12 when executed.