CN113704508A - Multimedia information identification method and device, electronic equipment and storage medium

Info

Publication number: CN113704508A
Application number: CN202110385210.7A
Authority: CN
Inventors: 梁涛; 张晗; 马连洋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-04-09
Filing date: 2021-04-09
Publication date: 2021-11-26

Abstract

The invention provides a multimedia information identification method, which comprises the following steps: performing text extraction processing on a text in multimedia information to be identified, and determining a text feature vector matched with the multimedia information; performing image extraction processing on the multimedia information, and determining an image feature vector matched with the multimedia information; respectively filtering the text feature vector and the image feature vector; performing feature fusion processing on the filtering processing results of the text feature vector and the image feature vector to determine a corresponding fusion feature vector; and identifying the multimedia information to be identified based on the fusion feature vector to obtain an identification result of the multimedia information. The multimedia information is thereby identified by combining text information and image information; meanwhile, filtering the feature information reduces redundant information and erroneous information, which improves the accuracy of the identification result and reduces the degradation of user experience caused by misidentification.

Description

Multimedia information identification method and device, electronic equipment and storage medium

Technical Field

The present invention relates to information processing technologies, and in particular, to a multimedia information recognition method and apparatus, an electronic device, and a storage medium.

Background

In the conventional technology, various information recommendation systems identify the multimedia information to be recommended in the process of recommending corresponding information to a user. Taking short news items as an example, classification can be performed using only the text modality information in the news item: for example, content features are extracted from the vectorized text of the news item by a CNN-family model (TextCNN, DPCNN) or an RNN-family model (TextRNN, TextRCNN), and the news item is classified based on the extracted content feature information. However, classification that uses only the text modality information of a news item ignores the useful image modality information in the item, and the extracted content feature information may contain erroneous and redundant information, which interferes with the accuracy of information classification.

Disclosure of Invention

In view of this, an embodiment of the present invention provides a multimedia information identification method, an apparatus, an electronic device, and a storage medium, and a technical solution of the embodiment of the present invention is implemented as follows:

the invention provides a multimedia information identification method, which comprises the following steps:

acquiring multimedia information to be identified, wherein the multimedia information to be identified comprises texts and images;

performing text extraction processing on a text in the multimedia information to be identified, and determining a text characteristic vector matched with the multimedia information;

determining an image feature vector matched with the multimedia information by carrying out image extraction processing on the multimedia information;

respectively filtering the text feature vector and the image feature vector to obtain a text feature vector filtering result and an image feature vector filtering result;

determining corresponding fusion feature vectors by performing feature fusion processing on the result of the text feature vector filtering processing and the result of the image feature vector filtering processing;

and identifying the multimedia information to be identified based on the fusion feature vector to obtain an identification result of the multimedia information.
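
The steps above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not the model of the invention: the function names, the toy feature extractors, the hand-written gate values and the linear class weights are all hypothetical stand-ins for the trained networks that the description refers to.

```python
from typing import List, Sequence

Vector = List[float]


def extract_text_features(text: str, dim: int = 4) -> Vector:
    """Toy stand-in for a text encoder (assumption): count characters per bucket."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def extract_image_features(pixels: Sequence[float], dim: int = 4) -> Vector:
    """Toy stand-in for an image encoder (assumption): mean of `dim` chunks.

    Pixels left over after the last full chunk are ignored.
    """
    chunk = max(1, len(pixels) // dim)
    feats = []
    for i in range(dim):
        part = pixels[i * chunk:(i + 1) * chunk]
        feats.append(sum(part) / len(part) if part else 0.0)
    return feats


def filter_features(vec: Vector, gate: Vector) -> Vector:
    """Element-wise gating: a gate value near 0 suppresses a feature."""
    return [v * g for v, g in zip(vec, gate)]


def fuse(text_vec: Vector, image_vec: Vector) -> Vector:
    """Fuse the two filtered vectors; plain concatenation is assumed here."""
    return list(text_vec) + list(image_vec)


def identify(fused: Vector, class_weights: List[Vector]) -> int:
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: text and image features are filtered separately, then fused.
text_vec = filter_features(extract_text_features("ab"), [1.0, 1.0, 0.0, 0.0])
image_vec = filter_features(
    extract_image_features([1.0, 1.0, 0.0, 0.0], dim=2), [1.0, 1.0]
)
fused = fuse(text_vec, image_vec)
label = identify(fused, [[1.0] * 6, [-1.0] * 6])
```

The point of the sketch is the ordering: each modality is filtered before fusion, so that suppressed features never reach the classifier.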

The embodiment of the invention also provides a multimedia information identification device, which comprises:

the information transmission module is used for acquiring multimedia information to be identified, wherein the multimedia information to be identified comprises texts and images;

the information processing module is used for performing text extraction processing on the text in the multimedia information to be identified and determining a text characteristic vector matched with the multimedia information;

the information processing module is used for determining an image feature vector matched with the multimedia information by carrying out image extraction processing on the multimedia information;

the information processing module is used for respectively filtering the text characteristic vector and the image characteristic vector to obtain a text characteristic vector filtering result and an image characteristic vector filtering result;

the information processing module is used for performing feature fusion processing on the text feature vector filtering processing result and the image feature vector filtering processing result to determine corresponding fusion feature vectors;

and the information processing module is used for identifying the multimedia information to be identified based on the fusion characteristic vector to obtain an identification result of the multimedia information.

In the above scheme,

the information processing module is used for extracting a feature vector matched with the text content of the multimedia information through a word information processing network of a multimedia information identification model;

the information processing module is used for determining a statement vector corresponding to the text content according to the feature vector through the word information processing network;

the information processing module is used for determining at least one word-level hidden variable corresponding to the text content according to the feature vector through the word information processing network;

and the information processing module is used for determining a text characteristic vector matched with the multimedia information according to the at least one word-level hidden variable and the statement vector corresponding to the text content through the word information processing network.

In the above scheme,

the information processing module is used for triggering the corresponding word segmentation library according to the text type parameters included in the text content of the multimedia information;

the information processing module is used for carrying out word segmentation processing on the text content of the multimedia information through the word dictionary of the triggered word segmentation library to form different word-level feature vectors;

and the information processing module is used for denoising the different word-level characteristic vectors to form a set of characteristic vectors matched with the text content of the multimedia information.
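
The word segmentation steps above can be illustrated with a short sketch. It is a hypothetical example only: the library contents, the text type keys, the stop-word list and the use of greedy forward maximum matching are assumptions chosen for illustration, and the sketch stops at tokens rather than producing word-level feature vectors.

```python
from typing import Dict, List, Set

# Hypothetical word segmentation libraries, keyed by a text type parameter.
SEGMENTATION_LIBRARIES: Dict[str, Set[str]] = {
    "news": {"stock", "market", "stock market", "rises"},
    "ad": {"free", "shipping", "free shipping", "today"},
}

# Hypothetical stop words treated as noise.
STOP_WORDS: Set[str] = {"the", "a", "of"}


def select_library(text_type: str) -> Set[str]:
    """Trigger the word dictionary that corresponds to the text type."""
    return SEGMENTATION_LIBRARIES[text_type]


def segment(text: str, dictionary: Set[str], max_words: int = 2) -> List[str]:
    """Greedy forward maximum matching over whitespace-separated words.

    The longest dictionary phrase starting at the current position wins;
    a single word is always accepted as a fallback.
    """
    words = text.lower().split()
    tokens: List[str] = []
    i = 0
    while i < len(words):
        for span in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + span])
            if span == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += span
                break
    return tokens


def denoise(tokens: List[str]) -> List[str]:
    """Drop stop words and tokens without any alphanumeric character."""
    return [
        t for t in tokens
        if t not in STOP_WORDS and any(c.isalnum() for c in t)
    ]


tokens = denoise(segment("The stock market rises", select_library("news")))
```

In a real system each remaining token would then be mapped to a vector; that step is omitted here.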

In the above scheme,

the information processing module is used for determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model;

the information processing module is used for carrying out denoising processing on the different word-level feature vectors according to the dynamic noise threshold value and triggering a dynamic word segmentation strategy matched with the dynamic noise threshold value;

and the information processing module is used for performing word segmentation processing on the text content of the multimedia information according to a dynamic word segmentation strategy matched with the dynamic noise threshold value to form a corresponding dynamic word level feature vector set.

In the above scheme,

the information processing module is used for determining a fixed noise threshold value corresponding to the use environment of the multimedia information identification model;

the information processing module is used for denoising the different word-level feature vectors according to the fixed noise threshold and triggering a fixed word segmentation strategy matched with the fixed noise threshold;

and the information processing module is used for performing word segmentation processing on the target text of the multimedia information according to a fixed word segmentation strategy matched with the fixed noise threshold value to form a corresponding fixed word level feature vector set.
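
The difference between the dynamic and the fixed noise threshold can be sketched as follows. The text does not define how noise is measured, so this sketch assumes a hypothetical per-token noise score in [0, 1]; the environment names and threshold values are likewise invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds; the source does not give concrete values.
DYNAMIC_THRESHOLDS: Dict[str, float] = {"short_video": 0.3, "news": 0.6}
FIXED_THRESHOLD: float = 0.5


def noise_threshold(environment: str, dynamic: bool) -> float:
    """Pick a threshold for the use environment of the model.

    Dynamic mode looks the value up per environment and falls back to the
    fixed value for unknown environments; fixed mode ignores the environment.
    """
    if dynamic:
        return DYNAMIC_THRESHOLDS.get(environment, FIXED_THRESHOLD)
    return FIXED_THRESHOLD


def denoise_by_threshold(
    scored_tokens: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Keep tokens whose (assumed) noise score does not exceed the threshold."""
    return [token for token, score in scored_tokens if score <= threshold]


scored = [("sale", 0.2), ("!!!", 0.9), ("today", 0.5)]
dynamic_result = denoise_by_threshold(scored, noise_threshold("short_video", True))
fixed_result = denoise_by_threshold(scored, noise_threshold("short_video", False))
```

With these assumed values the dynamic strategy is stricter for short videos than the fixed one, which is one way the two strategies could yield different word-level feature vector sets.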

In the above scheme,

the information processing module is used for determining the number of the word-level hidden variables according to the type of the multimedia information to be identified;

the information processing module is used for extracting high-dimensional features through the statement vectors in the word information processing network;

and the information processing module is used for performing feature fusion on the sentence vectors extracted from the high-dimensional features in the word information processing network based on the number of the word-level hidden variables to obtain the text feature vectors matched with the multimedia information.

In the above scheme,

the information processing module is used for performing simplified extraction on the image of the multimedia information through a preprocessing subnetwork of the image information processing network;

the information processing module is used for carrying out noise reduction processing on the image of the multimedia information subjected to the simplified extraction through the image information processing network;

the information processing module is used for performing cross downsampling processing on the image of the multimedia information subjected to denoising processing through the image information processing network to obtain a downsampling result of the image of the multimedia information, performing normalization processing on the downsampling result, and determining an image feature vector matched with the image of the multimedia information.
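
The noise reduction, downsampling and normalization steps can be sketched on a grayscale image held as a list of rows. This is a minimal sketch under assumptions: a thresholded 3x3 median filter stands in for the noise reduction, plain strided subsampling stands in for the cross downsampling (whose exact form the text does not specify), and min-max scaling stands in for the normalization.

```python
from statistics import median
from typing import List

Image = List[List[float]]


def denoise(img: Image, threshold: float) -> Image:
    """Replace a pixel by its 3x3 neighbourhood median when it deviates
    from that median by more than `threshold`."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            neigh = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            m = median(neigh)
            if abs(img[y][x] - m) > threshold:
                out[y][x] = m
    return out


def downsample(img: Image, stride: int = 2) -> Image:
    """Keep every `stride`-th row and column (assumed stand-in)."""
    return [row[::stride] for row in img[::stride]]


def normalize(img: Image) -> List[float]:
    """Flatten and min-max scale to [0, 1]; a constant image maps to zeros."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0
    return [(p - lo) / scale for p in flat]


def image_feature_vector(
    img: Image, threshold: float = 50.0, stride: int = 2
) -> List[float]:
    """Chain the three steps in the order given in the text."""
    return normalize(downsample(denoise(img, threshold), stride))
```

The `threshold` argument is where a noise threshold chosen for the use environment would be passed in.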

In the above scheme,

the information processing module is used for determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model according to the type of the multimedia information to be identified; or

the information processing module is used for determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model according to the type of the image of the multimedia information;

and carrying out noise reduction processing on the image of the multimedia information through the image information processing network according to the dynamic noise threshold value so as to form the image of the multimedia information matched with the dynamic noise threshold value.

In the above scheme,

the information processing module is used for determining the word list length of the text feature vector and the number of images corresponding to the image feature vector;

the information processing module is used for responding to the length of the word list and the number of the images and acquiring a filter matrix matched with the multimedia information identification model through an activation function corresponding to the multimedia information identification model;

and the information processing module is used for respectively filtering the text characteristic vector and the image characteristic vector through the filtering matrix, deleting redundant characteristics and error characteristics, and obtaining a filtering processing result of the text characteristic vector and a filtering processing result of the image characteristic vector.
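
One way to read the filtering step above is as a gate computed by an activation function and applied element-wise. The sketch below assumes a sigmoid activation and a per-column weight with a shared bias; these choices, and the names `filter_matrix` and `apply_filter`, are hypothetical and not taken from the text. The number of rows plays the role of the word list length for text features, or of the number of images for image features.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def filter_matrix(
    features: Matrix, weights: List[float], bias: float = 0.0
) -> Matrix:
    """Build a gate matrix with the same shape as `features`.

    gate[i][j] = sigmoid(weights[j] * features[i][j] + bias)
    """
    return [
        [sigmoid(w * x + bias) for w, x in zip(weights, row)]
        for row in features
    ]


def apply_filter(features: Matrix, gates: Matrix) -> Matrix:
    """Element-wise product; gates near 0 remove a feature, near 1 keep it."""
    return [
        [x * g for x, g in zip(f_row, g_row)]
        for f_row, g_row in zip(features, gates)
    ]


# Usage: two rows (word list length 2) with two feature columns each.
text_features = [[2.0, -2.0], [0.0, 4.0]]
gates = filter_matrix(text_features, weights=[10.0, 10.0], bias=-5.0)
filtered = apply_filter(text_features, gates)
```

In a trained model the weights and bias would be learned, so that the gates close on redundant or erroneous features; here they are fixed by hand.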

An embodiment of the present invention further provides an electronic device, where the electronic device includes:

a memory for storing executable instructions;

and the processor is used for implementing the foregoing multimedia information identification method when running the executable instructions stored in the memory.

The embodiment of the invention also provides a computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the foregoing multimedia information identification method.

The embodiment of the invention has the following beneficial effects:

the method comprises obtaining multimedia information to be identified; performing text extraction processing on a text in the multimedia information to be identified, and determining a text feature vector matched with the multimedia information; performing image extraction processing on the multimedia information, and determining an image feature vector matched with the multimedia information; respectively filtering the text feature vector and the image feature vector; performing feature fusion processing on the filtering processing results of the text feature vector and the image feature vector through a feature fusion network in the multimedia information identification model to determine a corresponding fusion feature vector; and identifying the multimedia information to be identified based on the fusion feature vector to obtain an identification result of the multimedia information. Compared with the prior art, which directly identifies multimedia information that contains a large amount of redundant and erroneous information, this scheme filters the text feature vector and the image feature vector of the multimedia information to reduce redundant and erroneous information, fuses the filtered text feature vector and the filtered image feature vector into a fusion feature vector, and identifies the multimedia information based on the fusion feature vector. This improves the accuracy of multimedia information identification and reduces the identification error rate caused by redundant and erroneous information.

Drawings

Fig. 1 is a schematic view of a usage scenario of a multimedia information recognition method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a structure of a multimedia information recognition apparatus according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart illustrating an alternative multimedia information recognition method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of an alternative model structure of the multimedia information recognition method according to the embodiment of the present invention;

Fig. 5 is a schematic flow chart illustrating an alternative multimedia information recognition method according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of the filtering process in an embodiment of the present invention;

Fig. 7 is a schematic diagram of a multimodal feature fusion process in an embodiment of the invention;

Fig. 8 is a schematic diagram of a fully connected layer process according to an embodiment of the present invention;

Fig. 9 is a schematic diagram illustrating an operation process of a multimedia information recognition method according to an embodiment of the present invention;

Fig. 10 is a schematic view of advertisement identification in the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.

1) In response to: indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are performed.

2) Multimedia information: various forms of information available on the internet, such as video files presented in clients or smart devices, news information, WeChat advertisement information, short videos that include text information, video advertisements, and the like.

3) Convolutional Neural Networks (CNN): a class of feedforward neural networks that contain convolution computations and have a deep structure, and one of the representative algorithms of deep learning. A convolutional neural network has representation learning capability and can perform shift-invariant classification of input information according to its hierarchical structure.

4) Model training: performing multi-class learning on an image data set. The model can be constructed with a deep learning framework such as TensorFlow or Torch, and a multi-class model is formed by combining multiple neural network layers such as CNN layers. The input of the model is a three-channel or original-channel matrix obtained by reading an image with a tool such as OpenCV; the output of the model is a multi-class probability, and the webpage category is finally output through an algorithm such as softmax. During training, the model is driven toward correct predictions by an objective function such as cross entropy.
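
As a small illustration of the cross entropy objective mentioned in item 4, the loss for a single example is the negative logarithm of the probability that the model assigns to the correct class. The function name `cross_entropy` and the `eps` clamp are assumptions introduced for this sketch.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int, eps: float = 1e-12) -> float:
    """Cross entropy for one example with a one-hot target.

    `probs` is the predicted class distribution and `target` the index of
    the correct class; `eps` avoids log(0) when the probability is zero.
    """
    return -math.log(max(probs[target], eps))


confident_right = cross_entropy([0.9, 0.1], target=0)
confident_wrong = cross_entropy([0.1, 0.9], target=0)
```

The loss shrinks as the probability of the correct class grows, which is what drives the model toward correct predictions during training.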

5) Neural Network (NN): an Artificial Neural Network (ANN), referred to as a neural network for short, is a mathematical or computational model in the fields of machine learning and cognitive science that imitates the structure and function of biological neural networks (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions.

6) Encoder-decoder architecture: a network architecture commonly used for machine translation technology. The decoder receives the output result of the encoder as input and outputs a corresponding text sequence of another language.

7) Word segmentation: segmenting Chinese text with a Chinese word segmentation tool to obtain a set of fine-grained words. Stop words: characters or words that contribute nothing, or only negligibly, to the semantics of the text.

8) Token: a word unit. Before any actual processing, the input text needs to be divided into language units such as words, punctuation marks, numbers or pure alphanumerics. These units are called word units.

9) Softmax: the normalized exponential function is a generalization of the logistic function. It can "compress" a K-dimensional vector containing arbitrary real numbers into another K-dimensional real vector, such that each element ranges between [0, 1] and the sum of all elements is 1.
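
The definition in item 9 can be written out directly. The sketch below is a plain Python illustration; subtracting the maximum before exponentiating is a standard numerical safeguard added here and does not change the result.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Map K real numbers to K probabilities that sum to 1."""
    m = max(values)  # shift for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

Larger inputs receive larger probabilities, and equal inputs receive equal probabilities.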

Fig. 1 is a schematic view of a usage scenario of a multimedia information identification method according to an embodiment of the present invention. Referring to fig. 1, the terminals (including a terminal 10-1 and a terminal 10-2) are provided with clients of software capable of displaying different multimedia information, such as a client or a plug-in for video playing. Through the corresponding client, a user may obtain and display different multimedia information, such as videos with text, news reports that include both text and pictures, advertisement information in WeChat Moments that combines pictures and text, and short video recommendations in a short video product that consist of text copy and a video cover image. The terminal is connected to the server 200 through a network 300; the network 300 may be a wide area network, a local area network, or a combination of the two, and uses a wireless link to realize data transmission.

As an example, the server 200 is configured to deploy a corresponding multimedia information recognition model to implement the multimedia information recognition method provided by the present invention, or to deploy a multimedia information recognition apparatus to implement the multimedia information recognition method. Specifically, the multimedia information recognition processing includes: acquiring multimedia information to be identified, wherein the multimedia information comprises texts and images; performing text extraction processing on a text in the multimedia information to be identified, and determining a text feature vector matched with the multimedia information; carrying out image extraction processing on the multimedia information through an image information processing network in the multimedia information identification model, and determining an image feature vector matched with the multimedia information; respectively filtering the text feature vector and the image feature vector through a feature filtering network in the multimedia information identification model; performing feature fusion processing on the filtering processing results of the text feature vector and the image feature vector through a feature fusion network in the multimedia information identification model to determine a corresponding fusion feature vector; and identifying the multimedia information to be identified based on the fusion feature vector to obtain an identification result of the multimedia information. Further, different multimedia information can be recommended to the user according to the identification result, or the multimedia information to be recommended, which is matched with the target user, can be reordered and then displayed and output by the terminal (the terminal 10-1 and/or the terminal 10-2).

Taking video multimedia information with accompanying text as an example, the multimedia information identification model provided by the invention can be applied to video playing. Video playing usually processes different multimedia information from different data sources, and finally presents, on a User Interface (UI), the videos to be recommended that correspond to the different multimedia information together with the corresponding video recommendation processes; the accuracy and timeliness of the features of the different multimedia information directly affect the user experience. A background database for video playing receives a large amount of video data from different sources every day, and the obtained text information matched with the different multimedia information can be called by other application programs. Of course, a multimedia information identification model matched with corresponding user behavior characteristics can also be migrated to different video recommendation processes (for example, a web page video recommendation process, an applet video recommendation process, or a video recommendation process of a short video client).

As will be described in detail below, the multimedia information recognition apparatus according to the embodiment of the present invention may be implemented in various forms, such as a dedicated terminal with a multimedia information recognition processing function, or a server with a processing function of the multimedia information recognition apparatus, such as the server 200 in the foregoing fig. 1. Fig. 2 is a schematic diagram of a composition structure of a multimedia information recognition apparatus according to an embodiment of the present invention, and it can be understood that fig. 2 only shows an exemplary structure of the multimedia information recognition apparatus, and not a whole structure, and a part of the structure or the whole structure shown in fig. 2 may be implemented as needed.

The multimedia information identification device provided by the embodiment of the invention comprises: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components of the multimedia information recognition apparatus are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.

The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.

It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.

In some embodiments, the multimedia information recognition apparatus provided in the embodiments of the present invention may be implemented by a combination of hardware and software, and for example, the multimedia information recognition apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the training method of the video information processing model provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

As an example of the multimedia information recognition apparatus provided by the embodiment of the present invention implemented by combining software and hardware, the multimedia information recognition apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and completes the training method of the video information processing model provided by the embodiment of the present invention in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205).

By way of example, the Processor 201 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.

As an example in which the multimedia information recognition apparatus provided by the embodiment of the present invention is implemented by hardware, the apparatus may be implemented directly by the processor 201 in the form of a hardware decoding processor; for example, the training method of the video information processing model provided by the embodiment of the present invention is executed by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the multimedia information recognition apparatus. Examples of such data include any executable instructions for operating on the multimedia information recognition apparatus; a program implementing the training method of the video information processing model according to an embodiment of the present invention may be included in the executable instructions.

In other embodiments, the multimedia information recognition apparatus provided by the embodiment of the present invention may be implemented in software, and fig. 2 illustrates the multimedia information recognition apparatus stored in the memory 202, which may be software in the form of programs, plug-ins, and the like, and includes a series of modules, as examples of the programs stored in the memory 202, the multimedia information recognition apparatus may include the following software modules:

an information transmission module 2081 and an information processing module 2082. When the software modules in the multimedia information recognition device are read into the RAM by the processor 201 and executed, the method for training the video information processing model provided by the embodiment of the invention is implemented, wherein the functions of each software module in the multimedia information recognition device include:

the information transmission module 2081 is configured to obtain multimedia information to be identified, where the multimedia information to be identified includes a text and an image.

The information processing module 2082 is configured to perform text extraction processing on the text in the multimedia information to be identified, and determine a text feature vector matched with the multimedia information.

The information processing module 2082 is configured to determine an image feature vector matched with the multimedia information by performing image extraction processing on the multimedia information.

The information processing module 2082 is configured to obtain a result of the text feature vector filtering processing and a result of the image feature vector filtering processing by filtering the text feature vector and the image feature vector respectively.

The information processing module 2082 is configured to perform feature fusion processing on the filtering processing results of the text feature vector and the image feature vector through a feature fusion network in the multimedia information recognition model, and determine a corresponding fusion feature vector.

The information processing module 2082 is configured to identify the multimedia information to be identified based on the fusion feature vector, and obtain an identification result of the multimedia information.
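The division of work between the two software modules described above can be sketched as follows. This is only an illustrative sketch: the class names, the `MultimediaItem` container, and the toy callables passed in at the bottom are hypothetical and stand in for the trained networks, which the specification does not define at code level.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class MultimediaItem:
    """Multimedia information to be identified: a text plus its images."""
    text: str
    images: Sequence[object]  # placeholder for decoded image data


class InformationTransmissionModule:
    """Counterpart of module 2081: obtains the item to be identified."""

    def __init__(self, source: Callable[[], MultimediaItem]):
        self._source = source

    def acquire(self) -> MultimediaItem:
        return self._source()


class InformationProcessingModule:
    """Counterpart of module 2082: extract, filter, fuse, identify."""

    def __init__(
        self,
        text_net: Callable[[str], Vector],
        image_net: Callable[[Sequence[object]], Vector],
        filter_net: Callable[[Vector, Vector], Tuple[Vector, Vector]],
        fusion_net: Callable[[Vector, Vector], Vector],
        classifier: Callable[[Vector], str],
    ):
        self.text_net = text_net
        self.image_net = image_net
        self.filter_net = filter_net
        self.fusion_net = fusion_net
        self.classifier = classifier

    def identify(self, item: MultimediaItem) -> str:
        text_vec = self.text_net(item.text)
        image_vec = self.image_net(item.images)
        text_kept, image_kept = self.filter_net(text_vec, image_vec)
        fused = self.fusion_net(text_kept, image_kept)
        return self.classifier(fused)


# Toy stand-ins so the sketch runs end to end; real networks would replace them.
transmission = InformationTransmissionModule(
    lambda: MultimediaItem(text="hello", images=["cover.png"])
)
processing = InformationProcessingModule(
    text_net=lambda text: [float(len(text)), 1.0],
    image_net=lambda images: [float(len(images)), 2.0],
    filter_net=lambda t, v: (t, v),  # identity filter, for illustration only
    fusion_net=lambda t, v: [a + b for a, b in zip(t, v)],
    classifier=lambda fused: "long" if fused[0] > 5 else "short",
)
label = processing.identify(transmission.acquire())
print(label)
```

Passing the networks in as callables keeps the module boundaries of fig. 2 visible while leaving the choice of each network open.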

According to the electronic device shown in fig. 2, in one aspect of the present application, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute different embodiments and combinations of embodiments provided in various alternative implementations of the multimedia information identification method.

Referring to fig. 3, fig. 3 is an optional flowchart of the multimedia information recognition method according to the embodiment of the present invention, and it can be understood that the steps shown in fig. 3 can be executed by various electronic devices operating the multimedia information recognition apparatus, such as a dedicated terminal with a multimedia information recognition apparatus, a server, or a server cluster, where the dedicated terminal with a multimedia information recognition apparatus can be an electronic device with a multimedia information recognition apparatus according to the embodiment shown in fig. 2. The following is a description of the steps shown in fig. 3.

Step 301: the multimedia information identification device acquires multimedia information to be identified.

The obtained multimedia information comprises text and images. Examples include videos with accompanying text, news reports that contain both text and pictures, advertisement information combining pictures and text in a WeChat friend circle, and to-be-recommended short videos of a short video product, each composed of text and a video cover image.

Step 302: the multimedia information identification device performs text extraction processing on the text in the multimedia information to be identified, and determines a text feature vector matched with the multimedia information.

Referring to fig. 4, fig. 4 is a schematic diagram of an optional model structure of the multimedia information recognition method according to the embodiment of the present invention. Taking a short news article as the multimedia information, content feature information is extracted from two modalities, namely the text and the accompanying images of the short news article, and the feature information of the two modalities is filtered based on an M-Gate mechanism to obtain clean and useful multi-modal content feature information of the article. After the text modality is vectorized, it is fed into a Long Short-Term Memory (LSTM) network for text context feature extraction; the image modality is first sent to ResNet50 for single-image feature extraction, and then to another LSTM for image context feature extraction. The two extracted feature vectors are then passed to the M-Gate module to filter out erroneous and redundant features. Finally, the two filtered feature vectors are fused and used for recognition of the short news article. The different structures shown in fig. 4 are described below.

In some embodiments of the present invention, performing text extraction processing on a text in the multimedia information to be identified, and determining a text feature vector matching the multimedia information may be implemented by:

extracting a characteristic vector matched with the text content of the multimedia information through a word information processing network; determining a statement vector corresponding to the text content according to the feature vector through the word information processing network; determining at least one word-level hidden variable corresponding to the text content according to the feature vector through the word information processing network; and determining a text characteristic vector matched with the multimedia information according to the hidden variable of the at least one word level and the statement vector corresponding to the text content through the word information processing network. Specifically, the number of word-level hidden variables may be determined according to the type of the multimedia information to be identified; extracting high-dimensional features through the statement vectors in the word information processing network; and performing feature fusion on the sentence vectors extracted from the high-dimensional features in the word information processing network based on the number of the word-level hidden variables to obtain text feature vectors matched with the multimedia information.

In some embodiments of the present invention, in the process of forming the text feature vector, a corresponding word segmentation library is triggered according to a text category parameter included in the text content of the multimedia information; word segmentation processing is performed on the text content of the multimedia information through the word dictionary of the triggered word segmentation library to form different word-level feature vectors; and the different word-level feature vectors are denoised to form a set of feature vectors matched with the text content of the multimedia information. Here, the term "word segmentation" has both a verb sense and a noun sense. Each segmented word is a word or a phrase, namely the minimum semantic unit with a definite meaning. For different user environments or different text processing models, the minimum semantic units need to be divided into different types and adjusted in a timely manner; this process is called word segmentation, that is, word segmentation can refer to the process of dividing text into minimum semantic units. On the other hand, the minimum semantic unit obtained after division is also often called a word segmentation, that is, a word obtained by performing word segmentation. To distinguish the two senses, the minimum semantic unit referred to by the latter sense is sometimes called a segmented object (Term), and that term is used in this application. A segmented object corresponds to a keyword that serves as the index basis in an inverted list.
For Chinese, words serving as the minimum semantic unit are often composed of different numbers of characters, and there are no natural delimiters between words such as the spaces used in alphabetic writing systems, so the text must be segmented explicitly. Word segmentation processing can then effectively remove meaningless word vectors, reduce the text volume, and save computation on the terminal.

In some embodiments of the present invention, because the types of target text differ, the fields of text processing differ, and the content of text information varies greatly between fields, a dynamic noise threshold matched with the use environment of the multimedia information recognition model can be determined in order to improve the processing speed; the different word-level feature vectors are denoised according to the dynamic noise threshold, and a dynamic word segmentation strategy matched with the dynamic noise threshold is triggered; and word segmentation processing is performed on the text content of the multimedia information according to the dynamic word segmentation strategy matched with the dynamic noise threshold to form a corresponding dynamic word-level feature vector set. For example, in an academic translation environment, where the text information displayed by the terminal includes only the text of academic papers, the dynamic noise threshold matched with the use environment of the text information processing model needs to be smaller than the dynamic noise threshold in an entertainment information reading environment.
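As a rough illustration of a threshold chosen per use environment, the sketch below drops word-level feature vectors whose L2 norm falls below the threshold. The norm-based criterion, the threshold values, and the sample vectors are all assumptions made for illustration; the description only states that the academic threshold is smaller than the entertainment one.

```python
import math

# Hypothetical per-environment thresholds; only the ordering
# (academic < entertainment) comes from the description.
NOISE_THRESHOLDS = {"academic": 0.2, "entertainment": 0.6}


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def denoise(word_vectors, environment):
    """Keep only the word vectors whose norm reaches the environment's threshold."""
    threshold = NOISE_THRESHOLDS[environment]
    return {w: v for w, v in word_vectors.items() if l2_norm(v) >= threshold}


# Made-up two-dimensional word vectors.
word_vectors = {
    "experiment": [0.9, 0.4],
    "um": [0.1, 0.1],
    "result": [0.3, 0.4],
}
academic = denoise(word_vectors, "academic")
entertainment = denoise(word_vectors, "entertainment")
print(sorted(academic), sorted(entertainment))
```

With these sample values the academic setting keeps two of the three words, while the stricter entertainment setting keeps only one.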

In some embodiments of the invention, a fixed noise threshold corresponding to the use environment of the multimedia information recognition model may also be determined; the different word-level feature vectors are denoised according to the fixed noise threshold, and a fixed word segmentation strategy matched with the fixed noise threshold is triggered; and word segmentation processing is performed on the target text of the multimedia information according to the fixed word segmentation strategy matched with the fixed noise threshold to form a corresponding fixed word-level feature vector set. When the use environment involves professional-term text information (or text information in a certain field), the noise is relatively uniform, so a fixed noise threshold corresponding to the fixed text information processing model can effectively improve the processing speed of the text information processing model, reduce the waiting time of the user, and improve the user experience. Further, the text processed by the text information processing model may include not only text information in a single language but also complex text information in multiple languages (for example, a mixed Chinese-English academic paper). Unlike English, which directly uses spaces to separate words, Chinese text needs to be segmented accordingly, because it is words in Chinese that carry complete information. Correspondingly, the Chinese word segmentation tool Jieba can be used to segment Chinese text. In addition, stop-word removal needs to be performed on the segmented keyword set, since words such as 'yes' and 'can' carry no information for the corresponding abstract text evaluation.
For example, for the text "yes, I like doing experiments", word segmentation and stop-word removal yield a set consisting of two keywords, "like/doing experiments" (using / as a separator, the same below), thereby effectively improving the processing speed of the text information processing model.
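The segmentation and stop-word steps above can be sketched without external dependencies. The sketch below does not use Jieba; it substitutes a simple dictionary-based forward maximum matching segmenter. The dictionary, the stop-word list, and the Chinese sentence (an assumed back-translation of the example text) are all hypothetical.

```python
import re


def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching over a toy dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_keywords(text, dictionary, stop_words):
    # Drop punctuation and whitespace before segmenting.
    cleaned = re.sub(r"[^\w]", "", text)
    return [t for t in forward_max_match(cleaned, dictionary) if t not in stop_words]


# Hypothetical dictionary and stop-word list for the example sentence.
DICTIONARY = {"是的", "喜欢", "做实验", "实验"}
STOP_WORDS = {"是的", "我"}

keywords = extract_keywords("是的，我喜欢做实验", DICTIONARY, STOP_WORDS)
print("/".join(keywords))
```

A production system would more likely call a maintained segmenter such as Jieba; the stop-word filtering step would stay the same.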

Step 303: the multimedia information recognition device carries out image extraction processing on the multimedia information through an image information processing network in the multimedia information recognition model, and determines an image feature vector matched with the multimedia information.

Continuing to describe the multimedia information recognition method provided by the embodiment of the present invention with reference to the multimedia information recognition model structure shown in fig. 4, referring to fig. 5, fig. 5 is an optional flowchart of the multimedia information recognition method provided by the embodiment of the present invention, and it can be understood that the steps shown in fig. 5 may be executed by various electronic devices operating the multimedia information recognition apparatus, such as a dedicated terminal, a server or a server cluster with a multimedia information recognition function. The following is a description of the steps shown in fig. 5.

Step 501: performing simplification extraction on the image of the multimedia information through a preprocessing sub-network of the image information processing network.

In some embodiments of the present invention, the image resolution of image-text advertisement information in an instant messaging client differs from that of short video recommendation information composed of text and a video cover image in a short video product. Therefore, to identify the image content more accurately, the target resolution of the image of the multimedia information can be determined according to the type of the multimedia information to be identified; and based on the target resolution, resolution enhancement processing is performed on the image of the multimedia information through the image processing network, and a corresponding image feature vector is acquired, so that the image feature vector is adapted to the resolution of the multimedia information.

Step 502: performing noise reduction processing, through the image information processing network, on the image of the multimedia information that has undergone the simplification processing.

Step 503: determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model according to the type of the multimedia information to be identified; or determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model according to the type of the image of the multimedia information.

Step 504: performing noise reduction processing on the image of the multimedia information through the image information processing network according to the dynamic noise threshold, so as to form an image of the multimedia information matched with the dynamic noise threshold.

Step 505: performing cross downsampling processing, through the image information processing network, on the denoised image of the multimedia information to obtain a downsampling result of the image of the multimedia information, performing normalization processing on the downsampling result, and determining an image feature vector matched with the image of the multimedia information.
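Steps 503 to 505 can be illustrated on a small grayscale image held as nested lists. This is a sketch under several assumptions: the thresholds per information type are invented, denoising is modeled as a thresholded 3x3 median replacement, and "cross downsampling" is interpreted as keeping every other row and column. None of these specifics are fixed by the description.

```python
from statistics import median

# Hypothetical dynamic noise thresholds keyed by multimedia information type.
NOISE_THRESHOLDS = {"advertisement": 40, "short_video_cover": 80}


def denoise(image, threshold):
    """Replace a pixel by its 3x3 neighbourhood median when it deviates by more than threshold."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            med = median(window)
            if abs(image[r][c] - med) > threshold:
                out[r][c] = med
    return out


def downsample(image, stride=2):
    """Keep every stride-th row and column (assumed reading of cross downsampling)."""
    return [row[::stride] for row in image[::stride]]


def normalize(image):
    """Min-max normalize and flatten into a feature vector."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1
    return [(p - lo) / span for p in flat]


def image_feature_vector(image, info_type):
    threshold = NOISE_THRESHOLDS[info_type]
    return normalize(downsample(denoise(image, threshold)))


# 4x4 toy image with one noisy pixel (255).
image = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 20, 20],
    [10, 10, 20, 20],
]
features = image_feature_vector(image, "advertisement")
print(features)
```

In the embodiment the feature vector is produced by a trained network; the hand-written pipeline here only shows the order of operations.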

With reference to fig. 4, in the process of identifying multimedia information including text and images, wi is a word, wvi is the corresponding word vector, vi is a news image, and fvi is a single-image feature. For the text modality, word segmentation is first performed on the text content to form a word list, the words are then vectorized using Google's pre-trained word2vec word vectors, and finally the vectorized text content is fed into an LSTM network to extract the content feature information tvec of the news text. For the image modality, a series of preprocessing operations (resizing to a uniform size, denoising, and the like) are first performed on each image, each image is then fed into a ResNet50 network for single-image feature extraction, and finally the extracted single-image features are fed into another LSTM to extract the content feature information vec of the news images.
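The two LSTM encoders in this description can be sketched with a minimal pure-Python LSTM. The `TinyLSTM` class, its random untrained weights, and the input vectors (which stand in for word2vec word vectors and ResNet50 single-image features) are assumptions for illustration only; a real system would use a trained network from a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.weights = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for gate in "ifoc"
        }
        self.biases = {gate: [0.0] * hidden_size for gate in "ifoc"}

    def _affine(self, gate, xh):
        return [
            sum(w * v for w, v in zip(row, xh)) + b
            for row, b in zip(self.weights[gate], self.biases[gate])
        ]

    def encode(self, sequence):
        """Run the sequence through the cell and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            xh = list(x) + h
            i = [sigmoid(a) for a in self._affine("i", xh)]
            f = [sigmoid(a) for a in self._affine("f", xh)]
            o = [sigmoid(a) for a in self._affine("o", xh)]
            g = [math.tanh(a) for a in self._affine("c", xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


# Stand-ins for word2vec vectors (wv1..wv3) and ResNet50 features (fv1, fv2).
word_vectors = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
image_features = [[0.2, 0.1, 0.0, 0.3, 0.9], [0.4, 0.4, 0.1, 0.0, 0.2]]

tvec = TinyLSTM(input_size=3, hidden_size=4, seed=1).encode(word_vectors)
vec = TinyLSTM(input_size=5, hidden_size=4, seed=2).encode(image_features)
print(tvec, vec)
```

Here each encoder returns only its final hidden state, so both modalities end up with a vector of the same size regardless of the number of words or images.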

With continued reference to FIG. 3, step 304 is performed.

Step 304: the multimedia information identification device filters the text feature vector and the image feature vector respectively through a feature filtering network in the multimedia information identification model.

In some embodiments of the present invention, when a short news article is processed, the text portion contains several kinds of text content, such as the body, the title, and the labels. Therefore, the length of the word list of the text feature vector and the number of images corresponding to the image feature vector can be determined; in response to the length of the word list and the number of images, a filter matrix matched with the multimedia information recognition model is obtained through an activation function corresponding to the multimedia information recognition model; and the text feature vector and the image feature vector are respectively filtered through the filter matrix, deleting redundant features and erroneous features, to obtain a filtering result of the text feature vector and a filtering result of the image feature vector. Feature extraction is performed by a pre-trained convolutional neural network based on the deep residual network ResNet50, which extracts the cover image information of a video into a 128-dimensional feature vector. ResNet is currently widely used for image feature extraction and is well suited to representing cover image information.

With reference to fig. 4, during filtering, the content feature information extracted from the two modalities is passed to the M-Gate mechanism. The mechanism generates a filter matrix G through a Sigmoid function from the information of the two modalities (tvec, vec), and uses G to filter erroneous and redundant information out of the two feature vectors, obtaining correct and useful text feature information ctvec and image feature information civec and thereby reducing the recognition error rate caused by redundant and erroneous information.

With reference to fig. 4, after the M-Gate module filters the features to obtain the correct and useful text feature information ctvec and image feature information civec, the feature information of the two modalities is added to obtain multi-modal content feature information mvec, which includes both the text feature information and the image feature information. Based on the fused multi-modal content feature information mvec, the multimedia information can be recognized accurately, improving the accuracy of multimedia information recognition.

With reference to fig. 6, fig. 6 is a schematic diagram of the filtering process in the embodiment of the present invention. In the M-Gate-based feature filtering process, feature filtering is performed on the extracted text feature vector tvec and image feature vector vec. Here tvec = [ht1, ht2, …, htn] denotes the text feature information extracted through the LSTM network, where n is the word list length, and vec = [hv1, hv2, …, hvm] denotes the image feature information extracted through the LSTM network, where m is the number of images. The filter matrix G is obtained through a Sigmoid function from the text feature vector tvec and the image feature vector vec, as shown in formula 1. The filter matrix G is multiplied with the text modality tvec and the image modality vec respectively to obtain the filtered features ctvec and civec, where an element of 1 ensures that a correct and useful feature is retained and an element of 0 ensures that an erroneous or redundant feature is discarded.

G = S(tvec · Wt + vec · Wi + b)    (formula 1)

where S(·) denotes the Sigmoid function, Wt and Wi denote trainable parameters, and b denotes the bias.
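The gate in formula 1 and the element-wise filtering it drives can be sketched as follows. The sketch assumes that tvec and vec have already been brought to the same dimension and that Wt and Wi are square matrices of that dimension; the numeric values are made up to produce a gate close to 1 and a gate close to 0. The final addition follows the description of mvec as the sum of the two filtered features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Each row of the matrix produces one element of the result.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def m_gate(tvec, vec, w_t, w_i, b):
    """Return (G, ctvec, civec) with G = S(tvec * Wt + vec * Wi + b)."""
    pre = [
        t + i + bias
        for t, i, bias in zip(matvec(w_t, tvec), matvec(w_i, vec), b)
    ]
    gate = [sigmoid(p) for p in pre]
    ctvec = [g * t for g, t in zip(gate, tvec)]  # filtered text features
    civec = [g * v for g, v in zip(gate, vec)]   # filtered image features
    return gate, ctvec, civec


# Illustrative values only; trained parameters would come from the model.
tvec = [1.0, -2.0]
vec = [0.5, 0.5]
w_t = [[10.0, 0.0], [0.0, 10.0]]
w_i = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]

gate, ctvec, civec = m_gate(tvec, vec, w_t, w_i, b)
mvec = [t + v for t, v in zip(ctvec, civec)]  # fused multi-modal feature
print(gate, mvec)
```

In practice the Sigmoid output lies strictly between 0 and 1, so the gate attenuates features softly rather than switching them fully on or off.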

Through the filtering process shown in fig. 6, not only the text feature vectors and the image feature vectors are filtered, and redundant information and error information are reduced, but also the number of feature vectors required to be subjected to feature fusion processing is effectively reduced, and the processing rate of the multimedia information identification model is improved, so that the multimedia information identification model can identify more multimedia information.

Step 305: the multimedia information identification device performs feature fusion processing on the filtering results of the text feature vector and the image feature vector through a feature fusion network in the multimedia information identification model to determine a corresponding fusion feature vector.

Step 306: the multimedia information identification device identifies the multimedia information to be identified based on the fusion feature vector to obtain an identification result of the multimedia information.

Referring to fig. 7 and fig. 8 for the fusion stage, fig. 7 is a schematic diagram of multi-modal feature fusion processing in an embodiment of the present invention, and fig. 8 is a schematic diagram of the processing of a fully connected layer in an embodiment of the present invention. The spliced long vector is input into the fully connected layer (FC layer), which applies a nonlinear transformation to the input text feature vector and image feature vector: Y = f(WX + b), where X is the input to the network nodes of the fully connected layer, f is the activation function, W is the weight matrix, and b is the bias constant. Finally, the Softmax layer converts the output of the fully connected layer into the probability of each category of the multimedia information to be identified, calculated as shown in formula 2:

Softmax(z_j) = exp(z_j) / Σ_k exp(z_k)    (formula 2)

where z_j = W x_j + b, x_j is the output of the fully connected layer, and W and b are the parameters to be trained in this layer.
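The fully connected transformation Y = f(WX + b) followed by the Softmax layer can be sketched in a few lines. The choice of tanh as the activation f, the three output categories, and all weights are assumptions for illustration; the Softmax function itself is the standard normalized exponential.

```python
import math


def linear(x, weights, bias):
    """Compute W x + b with one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fully_connected(x, weights, bias, activation=math.tanh):
    """Y = f(WX + b); tanh is an assumed choice of f."""
    return [activation(z) for z in linear(x, weights, bias)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_layer(x, weights, bias):
    """z_j = W x_j + b followed by Softmax."""
    return softmax(linear(x, weights, bias))


# Spliced long vector: filtered text features followed by filtered image features.
ctvec = [1.0, 0.0]
civec = [0.0, 1.0]
x = ctvec + civec

# Illustrative weights only.
w_fc = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
b_fc = [0.0, 0.0]
w_sm = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
b_sm = [0.0, 0.0, 0.0]

hidden = fully_connected(x, w_fc, b_fc)
probs = softmax_layer(hidden, w_sm, b_sm)
predicted = probs.index(max(probs))
print(probs, predicted)
```

The index of the largest probability is taken as the predicted category of the multimedia information.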

The multimedia information identification method provided by the embodiment of the invention is described below, taking as an example multimedia information to be identified that is advertisement information comprising text and images. With reference to the schematic application environment shown in fig. 1, the terminals (including the terminal 130-1 and the terminal 130-2) are provided with clients capable of displaying the corresponding advertisement information, such as an applet that plays advertisement information, or clients or plug-ins that present different advertising news, and a user may obtain different advertisement information through the corresponding client. In the process of recommending advertisements to users, the advertisement information to be identified needs to be identified, and a better advertisement recommendation result is obtained from the identification result. Fig. 9 is a schematic diagram of a working process of the multimedia information identification method according to the embodiment of the present invention, which specifically includes the following steps:

Step 901: training the multimedia information recognition model, and determining the network parameters of the multimedia information recognition model.

Step 902: deploying the trained multimedia information recognition model in a server.

Step 903: text extraction processing is carried out on the text in the advertisement information to be identified in the data source through the multimedia information identification model, text characteristic vectors are determined, and meanwhile image extraction processing is carried out on the advertisement information, and image characteristic vectors are determined.

Step 904: filtering and fusing the text feature vector and the image feature vector to obtain a fused feature vector.

Step 905: identifying the advertisement information to be identified based on the fused feature vector, and determining the identification result of the advertisement information.

Further, to show the identification result of the advertisement information more intuitively, fig. 10 is an advertisement identification schematic diagram in the present application. As shown in fig. 10, the advertisement information to be identified can be identified by means of the fused feature vectors, and the advertisement information in the data source can be accurately classified into (1) science advertisements, (2) sports advertisements, (3) car advertisements, (4) literature advertisements, and (5) public service advertisements, where each advertisement includes text information and an image, so that a user obtains a richer advertisement browsing experience.

Step 906: recommending different advertisement information to different users based on the identification result of the advertisement information.

The beneficial technical effects are as follows:

The method obtains multimedia information to be identified, where the multimedia information comprises text and images; performs text extraction processing on the text in the multimedia information to be identified and determines a text feature vector matched with the multimedia information; performs image extraction processing on the multimedia information through an image information processing network in the multimedia information identification model and determines an image feature vector matched with the multimedia information; filters the text feature vector and the image feature vector respectively through a feature filtering network in the multimedia information identification model; and performs feature fusion processing on the filtering results of the text feature vector and the image feature vector through a feature fusion network in the multimedia information identification model to determine a corresponding fused feature vector. The scheme therefore not only filters the text feature vectors and image feature vectors contained in the multimedia information, reducing redundant and erroneous information, but also fuses the filtered text feature vectors and filtered image feature vectors into fused feature vectors, realizes accurate identification of the multimedia information based on the fused feature vectors, improves the accuracy of multimedia information identification, and reduces the identification error rate caused by redundant and erroneous information.

The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for identifying multimedia information, the method comprising:

2. The method according to claim 1, wherein the text extraction processing is performed on the text in the multimedia information to be identified, and determining the text feature vector matching the multimedia information comprises:

extracting a characteristic vector matched with the text content of the multimedia information through a word information processing network of a multimedia information identification model;

determining a statement vector corresponding to the text content according to the feature vector through the word information processing network;

determining at least one word-level hidden variable corresponding to the text content according to the feature vector through the word information processing network;

and determining a text characteristic vector matched with the multimedia information according to the hidden variable of the at least one word level and the statement vector corresponding to the text content through the word information processing network.

3. The method of claim 2, wherein extracting feature vectors matching text content of the multimedia information via a word processing network comprises:

triggering a corresponding word segmentation library according to the text category parameters included in the text content of the multimedia information;

performing word segmentation processing on the text content of the multimedia information through the triggered word segmentation library word dictionary to form different word level feature vectors;

and denoising the different word-level feature vectors to form a set of feature vectors matched with the text content of the multimedia information.

4. The method of claim 3, wherein the denoising the different word-level feature vectors to form a set of feature vectors matching text content of the multimedia information comprises:

determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model;

denoising the different word-level feature vectors according to the dynamic noise threshold, and triggering a dynamic word segmentation strategy matched with the dynamic noise threshold;

and performing word segmentation processing on the text content of the multimedia information according to a dynamic word segmentation strategy matched with the dynamic noise threshold value to form a corresponding dynamic word level feature vector set.

5. The method of claim 3, wherein the denoising the different word-level feature vectors to form a set of feature vectors matching text content of the multimedia information comprises:

determining a fixed noise threshold corresponding to the use environment of the multimedia information identification model;

denoising the different word-level feature vectors according to the fixed noise threshold, and triggering a fixed word segmentation strategy matched with the fixed noise threshold;

and performing word segmentation processing on the target text of the multimedia information according to a fixed word segmentation strategy matched with the fixed noise threshold value to form a corresponding fixed word level feature vector set.

6. The method of claim 2, wherein determining, by the word information processing network, a text feature vector matching the multimedia information according to the at least one word-level hidden variable and a sentence vector corresponding to the text content comprises:

determining the number of word-level hidden variables according to the type of the multimedia information to be identified;

extracting high-dimensional features through the statement vectors in the word information processing network;

and performing feature fusion on the sentence vectors extracted from the high-dimensional features in the word information processing network based on the number of the word-level hidden variables to obtain text feature vectors matched with the multimedia information.

7. The method of claim 1, wherein determining the image feature vector matching the multimedia information by performing an image extraction process on the multimedia information comprises:

performing simplification extraction on the image of the multimedia information through a preprocessing sub-network in an image information processing network of a multimedia information identification model;

performing noise reduction processing on the images of the multimedia information subjected to the simplification processing through the image information processing network;

and performing cross downsampling processing on the image of the multimedia information subjected to the denoising processing through the image information processing network to obtain a downsampling result of the image of the multimedia information, performing normalization processing on the downsampling result, and determining an image feature vector matched with the image of the multimedia information.

8. The method of claim 7, wherein said denoising the image of the multimedia information by the image information processing network comprises:

determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model according to the type of the multimedia information to be identified; or

determining a dynamic noise threshold value matched with the use environment of the multimedia information identification model according to the type of the image of the multimedia information;

9. The method according to claim 1, wherein the filtering the text feature vector and the image feature vector to obtain a text feature vector filtering result and an image feature vector filtering result respectively comprises:

determining the length of the text feature vector word list and the number of images corresponding to the image feature vectors;

in response to the length of the word list and the number of images, acquiring a filter matrix matched with the multimedia information identification model through an activation function corresponding to the multimedia information identification model;

and respectively filtering the text feature vector and the image feature vector through the filter matrix, and deleting redundant features and erroneous features, to obtain a filtering result of the text feature vector and a filtering result of the image feature vector.
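
By way of non-limiting illustration, the filtering of claim 9 can be sketched as a gating operation. The sketch assumes that the activation function is a sigmoid, that the filter matrix has one row per word-list entry (for text) or per image (for image features), and that features whose gate value falls below a cutoff are treated as redundant or erroneous and set to zero. The score matrices, the cutoff, and all function names are hypothetical; in the claimed method the filter matrix is obtained by the multimedia information identification model.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def filter_matrix(scores):
    """Map a matrix of (assumed, learned) scores to gate values in (0, 1)."""
    return [[sigmoid(s) for s in row] for row in scores]


def apply_filter(features, gate, cutoff=0.5):
    """Element-wise gating: scale each feature by its gate value and
    zero out features whose gate value is below `cutoff`."""
    if len(features) != len(gate):
        raise ValueError("filter matrix must have one row per feature row")
    return [[f * g if g >= cutoff else 0.0 for f, g in zip(frow, grow)]
            for frow, grow in zip(features, gate)]
```

The same two functions would be applied once to the text features, with as many rows as the word list has entries, and once to the image features, with as many rows as there are images.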

10. The method of claim 1, further comprising:

determining the target resolution of the image of the multimedia information according to the type of the multimedia information to be identified;

and based on the target resolution, performing resolution enhancement processing on the image of the multimedia information through an image processing network of a multimedia information identification model, and acquiring a corresponding image feature vector to realize the adaptation of the image feature vector and the resolution of the multimedia information.
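
By way of non-limiting illustration, the selection of a target resolution by type and the subsequent enhancement step can be sketched as follows. Nearest-neighbour upscaling is used here only as a stand-in for the resolution enhancement performed by the image processing network; the type names, the resolutions, and the function names are hypothetical.

```python
# Hypothetical mapping from multimedia type to target (height, width).
TARGET_RESOLUTION = {"short_video": (4, 4), "article": (2, 2)}


def target_resolution(media_type, default=(2, 2)):
    """Look up the target resolution for a type of multimedia information."""
    return TARGET_RESOLUTION.get(media_type, default)


def upscale_nearest(image, target_h, target_w):
    """Nearest-neighbour resampling of a 2-D image to the target size."""
    h, w = len(image), len(image[0])
    return [[image[y * h // target_h][x * w // target_w]
             for x in range(target_w)]
            for y in range(target_h)]
```

A 2 x 2 image mapped to the hypothetical "short_video" target of 4 x 4 has each source pixel repeated in a 2 x 2 block; the enhanced image would then be passed on to feature extraction.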

11. An apparatus for identifying multimedia information, the apparatus comprising:

12. The apparatus of claim 11,

13. The apparatus of claim 11,

the information processing module is configured to perform simplification extraction on the image of the multimedia information through a preprocessing sub-network in an image information processing network of a multimedia information identification model;

14. An electronic device, characterized in that the electronic device comprises:

a memory for storing executable instructions;

a processor for implementing the multimedia information recognition method of any one of claims 1 to 10 when executing the executable instructions stored in the memory.

15. A computer-readable storage medium storing executable instructions, wherein the executable instructions when executed by a processor implement the multimedia information recognition method of any one of claims 1 to 10.