CN114661951A - Video processing method and device, computer equipment and storage medium

Video processing method and device, computer equipment and storage medium

Info

Publication number: CN114661951A
Application number: CN202210286475.6A
Authority: CN (China)
Prior art keywords: video, modal, identified, feature, modality
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 刘刚 (Liu Gang)
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210286475.6A
Publication of CN114661951A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/75: Clustering; Classification
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval using metadata automatically derived from the content
    • G06F 16/7834: Retrieval using metadata automatically derived from the content, using audio features
    • G06F 16/7844: Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data


Abstract

The embodiment of the application discloses a video processing method and apparatus, a computer device and a storage medium. The method includes the following steps: acquiring a video to be identified, and determining image modal features, text modal features and audio modal features of the video to be identified; acquiring entity information of the video to be identified, and determining graph modal features of the video to be identified by using the entity information and a knowledge graph; determining video modal features of the video to be identified according to the image modal features and the graph modal features; and determining a category identification result of the video to be identified based on the video modal features, the text modal features and the audio modal features. By implementing the embodiment of the application, the accuracy of video identification can be effectively improved.

Description

Video processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of electronic and internet technologies, multimedia data has grown rapidly, and users can browse a wide variety of videos through various multimedia platforms. Videos are generally classified so that users can find videos of interest among this massive amount of content. Therefore, how to guarantee the accuracy of video classification (or identification) has become a hot research problem in current computer vision technology.
Disclosure of Invention
The embodiment of the application provides a video processing method and device, computer equipment and a storage medium, which can effectively improve the accuracy of video identification.
In a first aspect, the present application provides a video processing method, including:
acquiring a video to be identified, and determining image modal features, text modal features and audio modal features of the video to be identified;
acquiring entity information of the video to be identified, and determining graph modal features of the video to be identified by using the entity information and a knowledge graph;
determining video modal features of the video to be identified according to the image modal features and the graph modal features;
and determining a category identification result of the video to be identified based on the video modal features, the text modal features and the audio modal features.
In a second aspect, the present application provides a video processing apparatus comprising:
an acquisition unit, configured to acquire a video to be identified and determine image modal features, text modal features and audio modal features of the video to be identified;
a first determining unit, configured to acquire entity information of the video to be identified and determine graph modal features of the video to be identified by using the entity information and a knowledge graph;
a fusion unit, configured to determine video modal features of the video to be identified according to the image modal features and the graph modal features;
and a second determining unit, configured to determine a category identification result of the video to be identified based on the video modal features, the text modal features and the audio modal features.
In a third aspect, the present application provides a computer device comprising: a processor, a memory, a network interface;
the processor is connected with the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program so as to enable the computer device comprising the processor to execute the above-mentioned video processing method.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the above-mentioned video processing method.
In a fifth aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the video processing method provided in the various alternatives in the first aspect of the present application.
In the embodiment of the application, the computer device can acquire the video to be identified and determine the image modal features, text modal features and audio modal features of the video to be identified; it can acquire entity information of the video to be identified and determine the graph modal features of the video using the entity information and the knowledge graph; then, it can determine the video modal features of the video according to the image modal features and the graph modal features, and determine the category identification result of the video based on the video modal features, the text modal features and the audio modal features. According to this method, multiple modal features (image modal features, text modal features and audio modal features) can be obtained from an understanding of the video content, which makes it convenient to identify the video using multi-modal features subsequently. In addition, external knowledge of the video, namely the knowledge graph information here, can be introduced to jointly represent the video, so that the knowledge graph information is fully exploited for effective expansion and the graph modal features corresponding to the video are obtained. Furthermore, the image modal features and the graph modal features can be fused to obtain video-level features that represent the association between the images and the knowledge graph, so that subsequent video identification using these video-level features has a certain reasoning capability, which improves the video identification effect and effectively improves the identification accuracy.
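As an illustration only (the patent does not specify concrete layers or dimensions), the following Python sketch shows one possible way to realize the fusion described above: image modal features and graph modal features are fused into video-level features, which are then combined with text and audio modal features for category prediction. The module name, feature dimensions and number of classes are assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusionClassifier(nn.Module):
    """Sketch of the fusion described above: image + graph features are fused
    into video-level features, then concatenated with text and audio features
    for category classification. Dimensions and layer choices are assumptions."""

    def __init__(self, img_dim=2048, graph_dim=128, txt_dim=512, aud_dim=128,
                 video_dim=512, num_classes=42):
        super().__init__()
        # Fuse image modal features with graph modal features -> video-level features
        self.video_fusion = nn.Sequential(
            nn.Linear(img_dim + graph_dim, video_dim), nn.ReLU())
        # Final classifier over video-level + text + audio modal features
        self.classifier = nn.Linear(video_dim + txt_dim + aud_dim, num_classes)

    def forward(self, img_feat, graph_feat, txt_feat, aud_feat):
        video_feat = self.video_fusion(torch.cat([img_feat, graph_feat], dim=-1))
        fused = torch.cat([video_feat, txt_feat, aud_feat], dim=-1)
        return self.classifier(fused)  # category logits for the video


# Example with random tensors standing in for real modal features
model = MultiModalFusionClassifier()
logits = model(torch.randn(1, 2048), torch.randn(1, 128),
               torch.randn(1, 512), torch.randn(1, 128))
print(logits.shape)  # torch.Size([1, 42])
```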
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a schematic architecture diagram of a video processing system according to an embodiment of the present application;
FIG. 1b is a block diagram of another video processing system according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present application;
FIG. 3a is a schematic diagram of a knowledge-graph structure provided by an embodiment of the present application;
FIG. 3b is a schematic diagram of another knowledge-graph structure provided by an embodiment of the present application;
fig. 3c is a schematic structural diagram of a feature fusion module in a video recognition model according to an embodiment of the present application;
fig. 3d is a schematic structural diagram of determining a feature of a video modality according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another video processing method provided in the embodiment of the present application;
FIG. 5a is a schematic structural diagram of a hierarchical classification module in a video recognition model according to an embodiment of the present application;
fig. 5b is a schematic structural diagram of a video recognition model provided in an embodiment of the present application;
fig. 5c is a schematic flowchart of another video processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application relates to artificial intelligence and knowledge graphs, and the relevant terms and concepts of artificial intelligence and knowledge graphs are briefly introduced below:
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
A Knowledge Graph (KG) is a knowledge database built on entity relationships for describing textual semantics. In general, a knowledge graph can be represented as a relationship graph that accurately illustrates the relationships among people, things and objects; it is essentially a semantic network composed of nodes and edges. Knowledge graphs are also called knowledge domain visualizations or knowledge domain mapping maps; they describe knowledge resources and their carriers using visualization technology, and comprise a series of different graphs that display the relationship between the knowledge development process and its structure. Knowledge graph construction techniques mainly include top-down and bottom-up approaches. Top-down construction refers to extracting ontology and schema information from high-quality data sources such as structured encyclopedic websites and adding it to the knowledge database. Bottom-up construction extracts resource schemas from publicly available data by certain technical means, and selects information with higher confidence to add to the knowledge database.
Knowledge graph representation (also called knowledge representation or vector representation) is a description of and convention for knowledge data, and aims to enable a machine (such as a computer) to understand knowledge as a human does, so that the machine can further reason and calculate. Most knowledge graphs are represented symbolically; the Resource Description Framework (RDF) is a common symbolic semantic representation model in which one edge can generally be expressed as a triple <Subject, Predicate, Object> that states an objective fact.
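As a purely illustrative sketch (the entity names are taken from the examples in the figures discussed later, while the predicate names are assumptions), a small set of such <Subject, Predicate, Object> triples can be written as plain Python tuples:

```python
# Each triple <Subject, Predicate, Object> states one objective fact;
# together the triples form the nodes and labelled edges of a knowledge graph.
triples = [
    ("Zhang San", "occupation", "singer"),
    ("Zhang San", "nationality", "country X"),
    ("Sanya", "is_a", "tourist resort"),
    ("Sanya", "located_in", "Hainan"),
]

for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")
```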
Embedding algorithms based on knowledge graph representation have gradually developed: a representative vector can be trained for each entity and relation in the knowledge graph, which makes algorithmic learning easy, allows implicit knowledge to be characterized, and enables such knowledge to be explored further. Generally, Graph Embedding (also referred to as Network Embedding) can be used to incorporate the knowledge graph. Graph embedding can be understood as the process of mapping graph data (generally a high-dimensional dense matrix) into low-dimensional dense vectors, which well solves the problem that graph data is difficult to feed efficiently into machine learning algorithms.
The central idea of graph embedding is to find a mapping function that converts each node in a graph-structured network, composed of entities and their corresponding attributes, into a low-dimensional latent representation. This facilitates computation and storage, removes the need for manual feature extraction, and improves adaptivity. Several common graph embedding approaches can be grouped as follows. From the perspective of network structure, they include: DeepWalk, GraRep, struc2vec, LINE, node2vec, GraphSAGE, etc.; from the perspective of structural and other information, they include: CENE, CANE, Trans-Net, etc.; from the perspective of Deep Learning (DL), they include: GCN, SDNE, etc.; and from the perspective of Generative Adversarial Networks (GAN), they include: GraphGAN, ANE, etc.
Knowledge graph representation learning means learning a low-dimensional vector representation for the entities and relations in the knowledge graph while encoding some of their semantic information, so that the information in the knowledge graph can be more conveniently extracted and used in downstream tasks (such as video identification and video recommendation).
Based on the above-mentioned artificial intelligence and knowledge graph technologies, the embodiment of the application provides a video processing scheme whose principle is as follows: a video to be identified is obtained and its category is identified to obtain a corresponding category identification result. Specifically, after the video to be identified is acquired, features of the video in multiple modalities can be determined so that category identification can be performed according to these multi-modal features. For example, the multi-modal features may include features determined from an understanding of the video content, such as image modal features, text modal features and audio modal features. Optionally, in addition to category identification based on video content understanding, category identification may also draw on external knowledge; for example, the external knowledge may refer to information from a knowledge graph, in which case entity information of the video to be identified may be acquired and graph modal features of the video may be determined using the entity information and the knowledge graph, so that category identification is performed based on the image modal features, text modal features, audio modal features and graph modal features. Optionally, during category identification, the video modal features of the video to be identified may be determined from the image modal features and the graph modal features, and category identification is then performed according to the video modal features, the text modal features and the audio modal features to obtain the category identification result. With this implementation, on the basis of fully utilizing the images, text and audio of a video, the video can be jointly represented by introducing external high-quality knowledge (for example, using the knowledge graph to obtain entities and their corresponding attributes), so that the video is identified by combining features of multiple modalities, which improves the accuracy of video identification. By introducing knowledge graph information and further characterizing the video by combining video features with that information, the knowledge graph information is fully exploited for effective expansion, giving the category identification a certain reasoning capability, improving the video identification capability, and improving the accuracy of video identification.
In a specific implementation, the execution subject of the above-mentioned video processing scheme may be a computer device, and the computer device may be a terminal or a server. The terminal mentioned here may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or other devices, and may also be an external device such as a handle, a touch screen, or other devices; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform, and the like. For example, when the computer device is a server, the embodiment of the present application provides a video processing system, as shown in fig. 1a, which may include at least one terminal and at least one server; the terminal can acquire the video to be identified and upload the acquired video to be identified to a server (namely computer equipment) so that the server can acquire the video to be identified and perform category identification processing on the video to be identified to obtain a corresponding category identification result.
Alternatively, the above-mentioned video processing scheme may be performed jointly by the terminal and the server. For example, after the terminal acquires the video to be identified, the terminal may also determine the image modal features, text modal features and audio modal features of the video to be identified; the terminal then uploads the acquired video and the determined image modal features, text modal features and audio modal features to the server, so that the server can determine the corresponding graph modal features based on the video to be identified, and further perform category identification on the video based on the graph modal features, image modal features, text modal features and audio modal features to obtain a corresponding category identification result. For another example, after the terminal acquires the video to be identified, the terminal may determine the graph modal features, image modal features, text modal features and audio modal features of the video to be identified; the terminal then uploads the determined graph modal features, image modal features, text modal features and audio modal features to the server, so that the server can perform category identification on the video based on these features to obtain a corresponding category identification result. It should be noted that, when the video processing scheme is executed by the terminal and the server together, the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In an implementation scenario, the present application further provides another video processing system applying the video processing scheme. For example, fig. 1b is an architectural schematic diagram of a video processing system provided in an embodiment of the present application, which can be understood as the framework of a method and system flow for knowledge-graph-assisted multi-level classification of information-flow video content, or a method and system for information-flow video content distribution based on multi-modal machine learning. The video processing method may be specifically executed by the video processing module. For example, as shown in fig. 1b, the video processing module may include the file download module, the video content frame extraction and audio separation service module, the graph-assisted content classification model module, and the graph-assisted content classification service module shown in fig. 1b.
In an implementation manner, the required video to be identified may be downloaded and acquired through the file download module, and the video content frame extraction and audio separation service module is then used to process the video, for example by performing operations such as image frame extraction and audio separation, so as to obtain the modal information of the video in each modality, which serves as the per-modality input for the subsequent multi-level classification. Then, the graph-assisted content classification model module can be called to identify the video using the inputs of each modality, obtaining a corresponding category identification result. The graph-assisted content classification service module serves the graph-assisted content classification model described above and communicates with the scheduling center module to complete the category identification and marking of graph-assisted content classification on the main link of the video flow.
In one implementation manner, the video processing system may further include a content production end, a content consumption end, an uplink and downlink content interface service module, a content distribution outlet module, a content database, a scheduling center module, a manual review module, a content deduplication service module, and a content storage service module. The functions of these modules are explained below, where:
The content production end may be configured to provide the multimedia data required by a multimedia platform, for example graphics-and-text content or video. Producers of such multimedia data may include Professional Generated Content (PGC), User Generated Content (UGC), Multi-Channel Network (MCN), or Professional User Generated Content (PUGC) producers, and they may provide local or captured graphics-and-text, video, or image-gallery content through a mobile terminal or a back-end interface (API) system; these are the main content sources for content distribution. The content production end can also first obtain the interface address of the upload server by communicating with the uplink and downlink content interface service module, and then upload a local file (such as a video) through that interface address; during video shooting, matched music, filter templates, beautification functions for pictures and text, and the like can be selected for the local video content.
The content consumption end can communicate with the uplink and downlink content interface service module to obtain index information for accessing a video file, such as its download address, then download the corresponding video file according to the index information and play it for viewing through a local player. Meanwhile, behavior data generated during uploading, downloading and playback (such as playback stuttering, loading time, playback clicks and the like) can be reported to the server.
The uplink and downlink content interface service module can communicate directly with the content production end; content submitted by the front end (such as video files) enters the server side through this service module, and the related files are stored in the content database. For example, the content submitted by the front end may be the title, publisher, abstract, cover image and publication time of the video, or the shot video content itself. The uploaded video file can be submitted to the scheduling center module, so that the scheduling center module can perform subsequent content (such as video) processing and circulation. The content described below may specifically refer to video.
The content database can be used for data storage; it is the core database of the content, and the meta-information of all content released by producers can be stored in it, for example the file size of the content itself, cover image link, bit rate, file format, title, publication time, author, whether the content is original or first-published, and the like. Optionally, the classification results obtained when content is classified during manual review can also be stored, for example the multi-level classification of a video file and the corresponding tag information: for a video explaining a watch, the first-level classification may be science and technology, the second-level classification may be smart watch, the third-level classification may be domestic watch, and the tag information may be the specific brand and model of the watch. Optionally, during manual review, data in the content database may be read, and the results and status of the manual review may be returned to the content database to update the meta-information of the content. Optionally, the results obtained after the uplink and downlink content interface service module processes the video file may also be stored; for example, after receiving the video file, the module may perform a standard transcoding operation on the content and asynchronously return the meta-information once transcoding is completed, such as file size, bit rate, specification and captured cover image, which may be stored in the content database. Optionally, the results of the content processing performed by the scheduling center module may also be written into the content database; for example, this processing may include machine processing and manual review, where the core of the machine processing is to call the content deduplication service module to handle completely duplicated and highly similar content, and the deduplication result may be written into the content database, so that completely duplicated content is not given to human reviewers for repeated secondary processing.
The scheduling center module is responsible for the whole scheduling process of the content flow: it receives warehoused content through the uplink and downlink content interface service module and then obtains the meta-information from the content database. It can also schedule the manual review module and the machine processing system, and control their scheduling order and priority. It can further communicate with the content deduplication service module to filter out unnecessary duplicated or similar newly warehoused content, and for content that does not reach the duplicate-filtering threshold, output content similarity and similarity relation chains for use by the recommendation system. Content that passes the manual review module can be enabled to the content distribution outlet module so as to be provided to the content consumption ends of terminals through a recommendation engine, a search engine or an operational direct-presentation page. The scheduling center module is also responsible for communicating with the graph-assisted content classification service module to complete the multi-level classification and scheduling of video content.
The content distribution outlet module can communicate with the scheduling center module, acquire the videos provided by it, send them to the content consumption end, and display them in the feed (message source) list of the user terminal.
The manual review module can be used to review the data in the content database; it is generally a complex business system developed on the basis of a Web database. The manual review module can read the original information of graphic-and-text content in the content database, so as to manually standardize the content and perform a first round of preliminary filtering on content that does not meet the standards; a secondary review can be performed on the basis of the preliminary review, which mainly consists of classifying the content and labeling or confirming its tags. Since video content cannot be fully verified by machine learning (such as deep learning) alone, secondary manual review can be performed on top of the machine processing, so that the accuracy and efficiency of video labeling are improved through human-machine collaboration.
The content deduplication service module can provide a video deduplication service: it mainly vectorizes videos, builds an index over the vectors, and determines the degree of similarity between videos according to the distance between their vectors, so that deduplication is performed according to this similarity. When a large amount of content is released at the same time, the deduplication service is parallelized on infrastructure capable of massive deduplication, so that duplicated content is not processed repeatedly.
And the content storage service module can store the video and picture contents uploaded by the content producer through the uplink and downlink content interface service module. The content storage service module is usually a group of storage servers which are widely distributed and are close to the user side, so that the storage servers can be accessed nearby, and a CDN acceleration server is usually arranged at the periphery for distributed cache acceleration. Generally, after acquiring the content index information, the end consumer may directly access the content storage service module to download the corresponding content. The content storage service module can be used as a data source of the external service and also can be used as a data source of the internal service for the download file module to acquire the original video data for relevant processing. In which, the paths of the internal and external data sources are usually deployed separately to avoid mutual influence.
The file download module can be used to download and acquire the original video content from the content storage service module and to control the download speed and progress; it is usually a group of parallel servers, made up of the related task scheduling and distribution clusters. For a downloaded file, a frame extraction service can be called to acquire the necessary video content frames from the video source file, serving as a preprocessing service for the image-modality input data of subsequent video content processing.
The video content frame extraction and audio separation service module can perform frame extraction processing, audio feature extraction and other operations on the downloaded video content to serve as modal input of subsequent video content multi-level classification.
The graph-assisted content classification model module can, according to the video processing scheme, introduce external knowledge such as face recognition, named entity recognition and the knowledge graph to obtain the attributes corresponding to entities and to extract relations, represent the content using the multi-modal video classification features together with the relation network structure corresponding to the knowledge graph, and then combine this with a hierarchical classifier to realize the multi-level classification of video content.
The graph-assisted content classification service module serves the graph-assisted content classification model described above and communicates with the scheduling center module to complete the identification and marking of the graph-assisted content classification on the main link of the content flow. For the multi-level content classification service, a manual review step can be added, and content that passes manual review can be used directly. When the accuracy of the model identification reaches a certain threshold, the manual review step can be removed, so that automatic multi-level classification identification and marking are performed directly.
Based on the video processing scheme provided above, embodiments of the present application provide a video processing method, which can be executed by the above-mentioned computer device. Referring to fig. 2, the video processing method includes, but is not limited to, the following steps:
S201, obtaining a video to be identified, and determining image modal features, text modal features and audio modal features of the video to be identified.
In one implementation, a computer device may obtain a video to be identified so as to perform category identification on it, for example category identification based on an understanding of the video content. Optionally, a video has a title, video dialogue, audio, images (the multiple frames contained in the video, such as the video cover image), and so on; these pieces of information can be referred to as information in each modality. For example, the information of a video may include information in the image modality, the text modality and the audio modality, and the application can identify the category of the video based on the information in the text, image and audio modalities.
Optionally, after the video to be identified is obtained, several pieces of modality information may be obtained from it, for example image modality information, text modality information and audio modality information; after the modality information is determined, the corresponding modal features may be determined based on each piece of modality information, so that category identification of the video can be carried out using the multiple modal features. The modal features may include the image modal features corresponding to the image modality information, the text modal features corresponding to the text modality information, and the audio modal features corresponding to the audio modality information.
The following is a relevant explanation of the determination of the individual modal characteristics:
in one implementation, the specific implementation of determining the image modality features may be as follows.
For a video, the dynamic visual images are generally its more important information. In order to capture robust and discriminative information in the video to be identified, image-level features, i.e. the image modal features mentioned above, can be extracted using a neural network model. In a specific implementation, one or more frames of images can be extracted from the video to be identified, and these frames can be understood as the above-mentioned image modality information; after obtaining the one or more frames, the image modal features of each frame may be extracted, for example using a neural network model such as a BigTransfer model, a ResNet model, or another neural network model that can be used for extracting image-level features, which is not limited in this application. The BigTransfer model is a well-performing pre-trained image classification model whose performance is superior to that of a ResNet model, so the BigTransfer model can be preferentially adopted for extracting image modal features in order to improve the accuracy of video identification.
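For illustration, the following sketch extracts a few frames with OpenCV and computes image-level features with a pretrained ResNet-50 from torchvision; the patent prefers BigTransfer over ResNet, so ResNet-50 is used here only because it is readily available, and the frame count and preprocessing values are assumptions.

```python
# Hedged sketch: uniformly sample frames from a video and compute per-frame
# image modal features with a pretrained ResNet-50 (stand-in for BigTransfer).
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()          # keep the 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_modal_features(video_path, num_frames=8):
    """Return a (num_sampled_frames, 2048) tensor of frame-level features."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    feats = []
    for idx in range(0, max(total, 1), max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            feats.append(resnet(preprocess(rgb).unsqueeze(0)))
    cap.release()
    return torch.cat(feats) if feats else torch.empty(0, 2048)
```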
In one implementation, the specific implementation of determining the text modality feature may be as follows.
First, the text modality information may be obtained from the video to be identified; for example, it may include one or more of the title, OCR data and video dialogue of the video. The title of the video is usually the publisher's subjective description of what the video expresses and can usually cover the high-level semantics the video wants to convey. In practice, however, many videos have no title, or the title conveys insufficient information, and OCR data may be used to supplement what the title lacks and enrich the text modality information. OCR data may itself have some problems, such as: inaccurate OCR recognition during scene switches, fixed-position OCR data that needs to be de-duplicated, spoken-caption OCR data that needs to be retained, news-ticker OCR data that needs to be deleted, and so on. Denoising can therefore be performed on the OCR data to ensure its accuracy and further improve the accuracy of category identification. The denoising may include filtering OCR data consisting of a single character, pure digits or pure letters, filtering OCR data from two adjacent frames whose bbox (bounding box) positions barely shift and whose character repetition rate is high, filtering OCR data whose bbox lies at the bottom of the screen with a small height, and the like. The denoised OCR data can then be used as text modality information. Alternatively, if the video to be identified has no OCR data, considering that a video generally has dialogue, Automatic Speech Recognition (ASR) data may be used as the text modality information. Optionally, when the text modality information includes several items among the title, OCR data and video dialogue, these items may be spliced together and the spliced result used as the text modality information finally required for the video; for example, the denoised OCR data may be spliced onto the title and the spliced data used as the text modality information.
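For illustration only, the following sketch applies the denoising rules just described to per-frame OCR results; the data layout, thresholds and overlap measure are assumptions, not values taken from the patent.

```python
# Hedged sketch of the OCR denoising rules described above.
import re

def clean_ocr(frames_ocr, min_height=18, max_offset=10, min_overlap=0.8):
    """frames_ocr: list of (text, (x, y, w, h)) tuples, one per decoded frame."""
    kept, prev = [], None
    for text, (x, y, w, h) in frames_ocr:
        text = text.strip()
        # Filter single characters, pure digits and pure letters.
        if len(text) <= 1 or text.isdigit() or re.fullmatch(r"[A-Za-z]+", text):
            continue
        # Filter boxes with very small height (e.g. news tickers at the bottom).
        if h < min_height:
            continue
        # De-duplicate adjacent frames whose boxes barely move and whose text
        # largely repeats (fixed-position captions).
        if prev is not None:
            p_text, (px, py, _, _) = prev
            overlap = len(set(text) & set(p_text)) / max(len(set(text)), 1)
            if abs(x - px) + abs(y - py) < max_offset and overlap > min_overlap:
                prev = (text, (x, y, w, h))
                continue
        kept.append(text)
        prev = (text, (x, y, w, h))
    return " ".join(kept)
```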
After the text modality information is obtained, text feature extraction can be performed on it to obtain the corresponding text modal features. For example, a neural network model may be used to model the text, such as a TextRCNN model or another neural network model capable of extracting text modal features. The TextRCNN model has few parameters and short training and prediction time, so using it to extract text modal features shortens the feature extraction time, shortens the overall category identification time and thus increases the category identification speed.
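The patent names TextRCNN but gives no architectural details; the following is a minimal sketch of a TextRCNN-style text feature extractor in PyTorch, with all sizes chosen as assumptions.

```python
# Hedged sketch of a TextRCNN-style text modal feature extractor.
import torch
import torch.nn as nn

class TextRCNN(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Bidirectional RNN supplies left/right context for every token.
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Combine [context; embedding] and max-pool over time.
        self.proj = nn.Linear(emb_dim + 2 * hidden, feat_dim)

    def forward(self, token_ids):                      # (batch, seq_len)
        emb = self.embed(token_ids)                    # (batch, seq_len, emb)
        ctx, _ = self.rnn(emb)                         # (batch, seq_len, 2*hidden)
        rep = torch.tanh(self.proj(torch.cat([ctx, emb], dim=-1)))
        return rep.max(dim=1).values                   # text modal feature

feat = TextRCNN()(torch.randint(1, 30000, (2, 50)))
print(feat.shape)  # torch.Size([2, 512])
```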
In one implementation, the specific implementation of determining the audio modality characteristics may be as follows.
First, the audio file corresponding to the video to be identified may be preprocessed to obtain a spectrogram. For example, a target audio may be selected from the video to be identified, such as the audio extracted from the video and resampled to 16 kilohertz (kHz) over the first 10 minutes; after the target audio is obtained, it may be short-time Fourier transformed to obtain the spectrogram, for example using a Hamming window of 25 milliseconds (ms) and a frame shift of 10 ms. Then, a mel spectrum may be obtained from the spectrogram and used as the audio modality representation. For example, the spectrogram can be mapped onto a 64-band mel filter bank to obtain the mel spectrum, which is then framed into non-overlapping examples of 960 ms duration, each example consisting of 96 frames of 10 ms and covering 64 mel bands. Optionally, a neural network model may be used as the feature extractor, i.e. the extraction of the audio modal features may be implemented with a neural network model such as a VGGish model or another neural network model capable of extracting audio modal features. The VGGish model has strong representational power for scene-type sound events. In the application, adding the audio modality can significantly improve the classification accuracy for video content such as emotional or comedic videos.
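For illustration, the following sketch reproduces the audio preprocessing just described with librosa; the library choice, exact framing code and VGGish-style input shape are assumptions (the patent only specifies 16 kHz audio, a 25 ms Hamming window, a 10 ms frame shift, 64 mel bands and 960 ms examples).

```python
# Hedged sketch of the audio preprocessing described above.
import numpy as np
import librosa

def audio_modal_input(path, sr=16000, max_seconds=600):
    y, _ = librosa.load(path, sr=sr, duration=max_seconds)   # first 10 minutes
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
        window="hamming",
        n_mels=64)
    log_mel = librosa.power_to_db(mel)                 # shape (64, num_frames)
    # Split into non-overlapping 960 ms examples (96 frames of 10 ms each),
    # matching the VGGish-style input described above.
    frames_per_example = 96
    n = log_mel.shape[1] // frames_per_example
    examples = log_mel[:, :n * frames_per_example]
    return examples.reshape(64, n, frames_per_example).transpose(1, 0, 2)
```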
S202, acquiring entity information of the video to be identified, and determining graph modal features of the video to be identified by using the entity information and the knowledge graph.
In one implementation, in addition to performing subsequent category identification processing by using corresponding features (such as the image modality features, the text modality features, and the audio modality features described above) obtained based on understanding of video content, external knowledge may be introduced to perform category identification of a video by combining understanding of video content and external knowledge, so as to perform category identification by using features of multiple dimensions, and improve accuracy of category identification. For example, the external knowledge may refer to information of a knowledge graph, that is, the class of the video to be recognized may be recognized by using features corresponding to the external knowledge and the modal features in the respective modalities.
In one implementation, entity information of the video to be recognized may be obtained to determine, by using the entity information and the knowledge graph, graph modal characteristics of the video to be recognized, where the graph modal characteristics may be understood as characteristics corresponding to the above-mentioned external knowledge.
The entity information may refer to words describing entities in the video to be identified, such as names of people, places, organizations, product names, and the like; it can be understood as the words users pay more attention to. The entity information may be obtained from content related to the video to be identified, for example the title of the video, image frames of the video, text converted from the video's audio, and so on. Optionally, the entity information may be obtained from this related content through face recognition, Named Entity Recognition (NER), text mining, and the like. Named entity recognition is a classic problem in natural language processing and is widely applied, for example recognizing person and place names in a sentence, or recognizing product names (such as drug names) in e-commerce search. Named entity recognition can be implemented with a Conditional Random Field (CRF), and other processing algorithms can also be adopted, which is not limited in this application. A preferred algorithm in the NER domain is the conditional random field, a discriminative probabilistic undirected graphical model commonly used for labeling or analyzing sequence data, such as natural language text or biological sequences.
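As a stand-in illustration (the patent only says NER, for example with a CRF, may be used; spaCy and the model name below are assumptions, and the model must be installed separately), named entity recognition on a title might look like:

```python
# Hedged sketch: extract named entities from a video title with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")   # or a Chinese pipeline such as zh_core_web_sm
doc = nlp("Zhang San visited several high-end hotels along the coastline of Sanya")

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)   # e.g. [("Zhang San", "PERSON"), ("Sanya", "GPE")]
```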
As can be seen from the above description, the specific implementation of acquiring the entity information of the video to be identified may be: video content of the video to be recognized is obtained, the video content may include one or more of a title, audio text (text of audio conversion of the video to be recognized), and an image, and the video content herein may also include other video-related content, such as the above-mentioned OCR data, video dialog, etc., which are not illustrated herein. After the video content is acquired, the entity information of the video to be identified can be determined from the video content by using an entity information determination mode. The entity information determination mode may refer to the above-described modes of face recognition, named entity recognition, text mining, and the like.
Optionally, when the video content is one, the entity information may be determined from the video content according to the entity information determination manner adapted to the video content; in the case that the video content is multiple, the entity information may be determined from each video content according to the entity information determination manner in which each video content is respectively adapted, so as to determine the entity information of the video to be identified according to the entity information corresponding to each video content, for example, the entity information corresponding to each video content may be merged, thereby obtaining the entity information of the video to be identified.
For example, assuming that the video content includes a title, named entity recognition may be performed on the title to obtain the entity information of the video to be identified. For example, as shown in fig. 3a, the video content (i.e. the title) describes various high-end hotels along the coastline of Sanya and contains the topic tag "#Sanya#"; after named entity recognition is performed, the entity information in the title is determined to be "Sanya", i.e. the entity information of the video to be identified is "Sanya".
For example, assuming that the video content includes an image, where the image may be one frame or multiple frames (which is not limited in this application), the following description mainly takes a single image (such as the video cover image) as an example: face recognition may be performed on the image to obtain the entity information of the video to be identified. For example, as shown in fig. 3b, the frame image may be a face image of Zhang San; after face recognition is performed, the entity information corresponding to the frame image can be determined to be "Zhang San", i.e. the entity information of the video to be identified is "Zhang San". For another example, if the frame image is an image of a cat, after recognition it may be determined that the entity information corresponding to the frame image is "cat", i.e. the entity information of the video to be identified is "cat".
For example, assuming that the video content includes both a title and an image, named entity recognition may be performed on the title and face recognition may be performed on the frame image, and the entity information of the video to be identified is then determined from the two pieces of entity information. For example, as shown in fig. 3b, the video content includes a face image of Zhang San and a title containing the topic tag "#Zhang San#"; face recognition on the frame image yields the entity information "Zhang San" corresponding to the image, and named entity recognition on the title yields the corresponding entity information "Zhang San", i.e. the entity information of the video to be identified is "Zhang San".
It is understood that a large number of entities and attribute information having an association relationship with the entities may be stored in the knowledge graph, and after the entity information is obtained, the attribute information associated with the entity information may be obtained from the knowledge graph. For example, the attribute information corresponding to the entity information "Sanya" shown in fig. 3a may include "tourist resort", "Hainan", and "country X"; as another example, the attribute information corresponding to the entity information "Zhang San" in fig. 3b includes "singer", "woman", "25 years old", and "country X". After the entity information and the attribute information are obtained, the atlas modal features of the video to be identified can be determined according to the entity information and the corresponding attribute information. An atlas modal feature may refer to a knowledge graph representation (or vector representation, knowledge representation, etc.) of a certain piece of information.
Alternatively, the knowledge graph representation may be implemented using DeepWalk. DeepWalk may be understood as generating Embeddings by means of random walks, and its main idea may be: perform random walks on the graph structure consisting of entity and attribute nodes to generate a large number of sequences, and then input these sequences as training samples into a vectorization model (such as a word2vec model) for training to obtain a trained vectorization model. The trained vectorization model can be used to embed information. Mathematically, Embedding can be understood as mapping points of one space to another space by using a function, generally from a high-dimensional abstract space to a lower-dimensional space. The significance of Embedding is to convert high-dimensional data into low-dimensional data so as to facilitate processing by an algorithm, while also solving the problems that the length of One-Hot (one-hot code) vectors changes as the samples change and that the correlation between two entities cannot be represented. In general, the trained vectorization model can be used for knowledge graph representation, so that the atlas modal features can be obtained.
In one implementation, the specific implementation of determining the atlas modal features of the video to be identified may be as follows: a random walk may be performed on the graph structure formed by the entity information and the attribute information to obtain random walk sequences, where the graph structure is formed by nodes corresponding to the entity information and the attribute information respectively, and one node corresponds to one piece of entity information or one piece of attribute information; then, the random walk sequences are input into the trained vectorization model, so as to obtain the knowledge graph representation of each node on the graph structure, that is, the atlas modal feature of each node. A minimal sketch of this procedure is given below.
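The following DeepWalk-style sketch assumes the graph is an adjacency dict built from the entity node and its attribute nodes, and uses the gensim Word2Vec implementation as the vectorization model; the patent does not name a specific library or any of the parameter values used here, so they are assumptions for illustration only.

```python
# Minimal DeepWalk-style sketch for obtaining atlas (knowledge-graph) modal features.
import random
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=8):
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1], [])
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks

# Entity "Zhang San" and its attribute nodes taken from the knowledge graph
graph = {
    "Zhang San": ["singer", "woman", "25 years old", "country X"],
    "singer": ["Zhang San"], "woman": ["Zhang San"],
    "25 years old": ["Zhang San"], "country X": ["Zhang San"],
}
walks = random_walks(graph)
# Train a word2vec-style vectorization model on the walk sequences
model = Word2Vec(sentences=walks, vector_size=64, window=3, min_count=1, sg=1)
zhang_san_embedding = model.wv["Zhang San"]  # atlas modal feature for this node
```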
Optionally, the atlas modal characteristics of each node on the graph structure are not all required by the present application, and these atlas modal characteristics may be screened to obtain the atlas modal characteristics of the video to be identified. In a possible implementation manner, keywords corresponding to the map modal features of the video to be identified may be preset, for example, the keywords may include one or more of a name of a person (abbreviated as PER), an organization (abbreviated as ORG), a location (abbreviated as LOC), a content IP name (abbreviated as IP), an association description (abbreviated as PRD), and the like, that is, the map modal features corresponding to the keywords may be screened out from the map modal features of each node, and the screened map modal features are the map modal features of the video to be identified.
S203, determining the video mode characteristics of the video to be identified according to the image mode characteristics and the map mode characteristics.
In one implementation, the video modal feature of the video to be identified may be determined according to the image modal feature and the atlas modal feature, so that the video modal feature, together with the obtained text modal feature and audio modal feature, may subsequently be used to perform category identification of the video. Optionally, the video modal feature may be obtained by performing fusion processing based on a feature fusion module in the video recognition model, where the specific processing procedure of the feature fusion module may be as follows: feature splicing is performed based on the image modal feature and the atlas modal feature to obtain a splicing feature, and the video modal feature of the video to be identified is then obtained by using the splicing feature. Optionally, the video modal feature of the video to be identified may be directly determined according to the splicing feature, and the video modal feature may be used to represent the feature association information between the image modal feature and the atlas modal feature; optionally, after the video modal feature used for characterizing the feature association information is obtained, feature enhancement processing may further be performed on it to enhance its important features, so as to improve video identification accuracy. For example, an initial video modal feature of the video to be identified may be determined according to the splicing feature, where the initial video modal feature may be used to represent the feature association information between the image modal feature and the atlas modal feature; then, feature enhancement processing may be performed on the initial video modal feature to obtain a feature-enhanced feature, which is the video modal feature of the video to be identified.
Optionally, as can be seen from the above description, the feature fusion module may include a splicing module and a feature relation module, or, as shown in fig. 3c, the feature fusion module may include a splicing module, a feature relation module, and a feature enhancement module. The following description mainly takes the case in which the feature fusion module includes a splicing module, a feature relation module, and a feature enhancement module as an example. The image modal feature and the atlas modal feature can be spliced by the splicing module in the feature fusion module to obtain a splicing feature; then, the splicing feature may be input into the feature relation module, so that the feature relation module learns the feature association between the image modal feature and the atlas modal feature in the splicing feature to obtain the initial video modal feature; for example, the feature relation module may be a Transformer model, and the Transformer model can learn the relation between the image and the knowledge graph. Then, the initial video modal feature may be input to the feature enhancement module to perform feature enhancement processing on it and obtain the video modal feature of the video to be identified; for example, the feature enhancement module may be an SE Context Gating model that implements the feature enhancement processing on the initial video modal feature. A sketch of this fusion procedure follows.
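The following PyTorch sketch illustrates the fusion pipeline just described (splicing module, Transformer-based feature relation module, SE Context Gating feature enhancement module); the dimensions, layer settings, and class name are illustrative assumptions, not values specified by the patent.

```python
# Illustrative sketch of the feature fusion module: splice -> relation -> enhancement.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Feature relation module: a Transformer encoder layer learns the
        # association between image tokens and knowledge-graph tokens
        self.relation = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                   batch_first=True)
        # Feature enhancement module: SE Context Gating, Y = sigmoid(WX + b) * X
        self.gate = nn.Linear(dim, dim)

    def forward(self, image_feats, graph_feats):
        # Splicing module: concatenate along the token (sequence) dimension
        spliced = torch.cat([image_feats, graph_feats], dim=1)
        initial_video_feats = self.relation(spliced)       # initial video modal features
        weights = torch.sigmoid(self.gate(initial_video_feats))
        return weights * initial_video_feats                # enhanced video modal features

# Example: 4 image-frame tokens and 5 atlas tokens (PER/ORG/LOC/IP/PRD), dim 256
fusion = FeatureFusion()
video_modal_feats = fusion(torch.randn(1, 4, 256), torch.randn(1, 5, 256))
```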
To better understand the method for determining the video modal feature according to the embodiment of the present application, a further description is made below with reference to the schematic structural diagram shown in fig. 3d; the flow shown in the figure may be the processing procedure for determining the video modal feature proposed in the present application. As can be seen from fig. 3d, this processing procedure mainly includes a feature extraction stage and a Relation Network (RN) stage, where the feature extraction stage may include feature extraction of the image modality and feature extraction of the atlas modality to obtain the image modal features and the atlas modal features.
The image modal characteristics can be obtained by processing images in the video to be identified by using a BigTransfer model.
The determination of the atlas modal features may be described as follows: atlas modal information may be obtained from the video to be identified, and the atlas modal information may include the above-mentioned entity information obtained from the video to be identified (such as "brother XX" and "Li Si" in fig. 3d) and the attribute information obtained according to the entity information and the knowledge graph (such as the attribute information of "brother XX" in fig. 3d: video category - general skill, country X; and the attribute information of "Li Si": music category - singer, actor, country X); after the atlas modal information is obtained, knowledge graph representation may be performed to obtain the atlas modal features of the video to be identified, for example, the atlas modal features about PER, ORG, LOC, IP, and PRD shown in fig. 3d. The manner of determining the atlas modal features may refer to the above description and is not repeated here.
After the image modal feature and the atlas modal feature are obtained, they can be processed by the relation network to obtain the video modal feature. The processing of the relation network corresponds to the processing of the feature fusion module; in brief, the relation network can be understood as the feature fusion module. As shown in fig. 3d, the relation network may mainly comprise two parts of processing: first, a feature relation module (such as a Transformer model) is used to learn the relation between the image and the knowledge graph; second, a feature enhancement module (such as an SE Context Gating module) is used to learn weights for the relations, so that the model can learn the relative importance among the relations. In specific implementation, the image modal feature and the atlas modal feature can be spliced to obtain a splicing feature; then, the relation between the image and the knowledge graph is learned through the feature relation module to obtain the initial video modal feature; finally, the feature enhancement module may be used to obtain the feature-weighted (or feature-enhanced) initial video modal feature, that is, the video modal feature.
In the process of fusing the image modal feature and the atlas modal feature to obtain the fused feature expression, SE Context Gating is specifically adopted so that the modal features can be fused with different weights, and the fused feature expression is then reused for downstream processing of specific tasks such as video identification (classification). Alternatively, the processing of the SE Context Gating module may be understood using the formula Y = f(WX + b) ∘ X, where ∘ denotes element-wise multiplication, X is the initial video modal feature input to the feature enhancement module, Y is the output result (i.e., the video modal feature) of the feature enhancement module, f is the activation function, and W and b are model parameters obtained through model training. f(WX + b) may take any value from 0 to 1, and may indicate that the initial video modal feature input to the feature enhancement module is suppressed (for example, when f(WX + b) approaches 0) or activated (for example, when f(WX + b) approaches 1). It can be seen that, through the feature enhancement module, the initial video modal features obtain the feature association information among themselves, and a distinction in strength is introduced: the corresponding f(WX + b) values differ according to the importance of each initial video modal feature in video identification. If the importance of a certain initial video modal feature in video recognition is higher, the corresponding f(WX + b) value is larger; if the importance is lower, the corresponding f(WX + b) value is smaller. A numeric illustration is given below.
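The following small illustration of the gating formula uses made-up values chosen only to show the suppression and activation extremes; it is not data from the patent.

```python
# Numeric illustration of Y = f(WX + b) * X, with f the sigmoid activation.
import torch

x = torch.tensor([2.0, 2.0])                      # two initial video modal components
gate = torch.sigmoid(torch.tensor([-6.0, 6.0]))   # f(WX + b) ≈ [0.0025, 0.9975]
y = gate * x                                      # first suppressed, second activated
print(y)                                          # approximately tensor([0.0049, 1.9951])
```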
To better understand the processing role of the relation network in video recognition in the present application, the relation network is briefly described further below. The Relation Network (RN) has a structure suitable for relational reasoning and can be directly plugged into existing neural network architectures. For example, given the entity information and attribute information shown in fig. 3a, a relation network can be used to infer "tourism - country X - Hainan"; as another example, given the entity information and attribute information shown in fig. 3b, a relation network can be used to infer "entertainment - country X star - fan circle" based on this information. The logic behind this lies in the functional structure of the relation network, which allows it to grasp the key to relational reasoning, much as the structure of a Convolutional Neural Network (CNN) embodies the properties of spatial locality and translational invariance, and a Recurrent Neural Network (RNN) is suited to processing sequence data. The processing of a simple relation network can be illustrated using the following equation 1.
RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j) )    (equation 1)
Wherein the input to the model is a set of objects O = {o_1, o_2, …, o_n}, o_i is the i-th object and o_j is the j-th object; f_φ and g_θ are multi-layer perceptron (MLP) functions; g_θ can be used to quantify the relationship between two objects, or to calculate whether two objects are related, and f_φ weights the relationships so as to highlight the important ones. In the present application, an object may refer to an entity object in a video picture, such as a person's face, a hat, or a program name in an image; the function of the feature relation module (such as the Transformer model) can be regarded as equivalent to g_θ, and the feature enhancement module (such as SE Context Gating) can be regarded as equivalent to f_φ. Briefly, the RN has the following three characteristics (a minimal sketch is given after the list):
1. Reasoning can be learned. The RN may calculate the relationships between all pairs of objects, or only between some pairs of objects.
2. The data processing efficiency of the RN is high. The RN uses a single g_θ function to calculate all relationships, giving it strong generalization capability. In addition, the RN can take two objects at a time as input instead of taking all n objects as input simultaneously, so the learning of n^2 separate functions can be avoided and data processing is more efficient.
3. The RN may act on a set of objects, ensuring that the RN is order independent for both input and output.
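A minimal PyTorch sketch of equation 1 is given below; g_θ scores every ordered pair of objects and f_φ aggregates the summed pair relations, both as small MLPs. The object count, dimensions, and class name are illustrative assumptions.

```python
# Minimal relation network per equation 1: RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ).
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=64, hidden=128, out_dim=64):
        super().__init__()
        self.g_theta = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.f_phi = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim))

    def forward(self, objects):                   # objects: (batch, n, obj_dim)
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)
        o_j = objects.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([o_i, o_j], dim=-1)           # all (o_i, o_j) pairs
        relations = self.g_theta(pairs).sum(dim=(1, 2)) # sum over all pairs
        return self.f_phi(relations)

rn = RelationNetwork()
out = rn(torch.randn(2, 5, 64))   # e.g. 5 objects (faces, hat, program name, ...) per video
```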
In one implementation, after the image modal features are extracted, in addition to obtaining the video modal feature based on the image modal features and the atlas modal features as described above, the video modal feature may also be obtained directly based on the image modal features. For example, feature fusion may be performed on multi-frame image modal features to generate the video modal feature; a NeXtVLAD model and an SE Context Gating model may be employed for such multi-frame feature fusion. The NeXtVLAD model is a feature dimension reduction model that performed well in the second YouTube-8M video understanding competition and can aggregate multi-frame, image-level features into video-level features by means of feature clustering; the SE Context Gating model is a mainstream feature-weighted selection model in the vision field and is generally used for feature enhancement. A simplified illustration of the clustering idea follows.
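The sketch below is a highly simplified VLAD-style aggregation of frame-level features into a video-level feature, meant only to illustrate the clustering idea; it is not the actual NeXtVLAD architecture (which additionally uses grouping and attention), and all sizes are assumptions.

```python
# Simplified VLAD-style aggregation of frame-level features into a video-level feature.
import torch
import torch.nn as nn

class SimpleVLAD(nn.Module):
    def __init__(self, dim=256, clusters=8):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)           # soft cluster assignment
        self.centers = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, frames):                            # frames: (batch, T, dim)
        a = torch.softmax(self.assign(frames), dim=-1)    # (batch, T, clusters)
        residuals = frames.unsqueeze(2) - self.centers    # (batch, T, clusters, dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)   # aggregate over frames
        return vlad.flatten(1)                            # video-level feature

video_feat = SimpleVLAD()(torch.randn(2, 30, 256))        # e.g. 30 frames per video
```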
S204, determining a category identification result of the video to be identified based on the video modal characteristics, the text modal characteristics and the audio modal characteristics.
In an implementation manner, the video modal feature, the text modal feature and the audio modal feature may be fused to obtain a target fusion feature, so as to determine a category identification result of the video to be identified according to the target fusion feature.
Optionally, the category identification result of the video to be identified may be directly determined according to the target fusion feature; for example, the target fusion feature may be input to a classification module in the video recognition model to obtain the category identification result of the video to be identified.

Optionally, the category identification result of the video to be identified may be determined jointly according to the video modal feature, the text modal feature, the audio modal feature, and the target fusion feature; for example, the video modal feature, the text modal feature, the audio modal feature, and the target fusion feature may all be input to the classification module in the video recognition model to obtain the category identification result of the video to be identified.
Optionally, the classification module in the video recognition model may be a single-stage classification module or a multi-stage classification module. The single-stage classification module can be understood as having only one classification module; correspondingly, the multi-stage classification module can be understood as having a plurality of classification sub-modules, where each classification sub-module corresponds to one classification process, so the multi-stage classification module can perform a plurality of classification processes. The processing of the multi-stage classification module may be: when predicting the second-level class, use the hidden layer features of the first-level class and the Embedding of the predicted first-level class; when predicting the third-level class, use the hidden layer features of the first-level and second-level classes and the Embedding of the predicted second-level class; and so on, until the prediction class of the last-level class is obtained. Therefore, when the multi-stage classification module is used for class identification, the class hierarchy dependency relationship can be fully utilized to improve the prediction accuracy of the video recognition model. For the multi-stage classification module, the class identification result of the video to be identified may include only the prediction class of the last-level class, or the prediction classes of all levels.
In the embodiment of the application, the video to be identified can be obtained, and the image modal characteristics, the text modal characteristics and the audio modal characteristics of the video to be identified are determined; the entity information of the video to be identified can be obtained, and the atlas modal characteristics of the video to be identified are determined by utilizing the entity information and the knowledge atlas; then, the video modality characteristics of the video to be identified can be determined according to the image modality characteristics and the map modality characteristics. And determining a category identification result of the video to be identified based on the video modal characteristics, the text modal characteristics and the audio modal characteristics. By the implementation method, on the basis of fully utilizing the images, texts and audios of the videos, the relation network can be constructed by introducing external high-quality knowledge (such as face recognition, named entity recognition, a knowledge graph and the like to obtain entities and attributes corresponding to the entities), namely, the videos can be represented by utilizing the structure of the multi-modal video classification and knowledge graph relation network; the understanding of different vertical videos can be enhanced, so that the model has good knowledge expansibility; the relation network is introduced into the video recognition (or video classification), so that the video recognition model can learn the relation between the image and the knowledge graph, the video recognition model has reasoning capability, the model effect of the video recognition model is effectively improved, and the accuracy of the video recognition is also effectively improved.
Based on the above description, the embodiment of the present application further provides another video processing method; in the embodiment of the present application, a computer device is mainly used to execute the video processing method as an example for description. As shown in fig. 4, the video processing method includes, but is not limited to, the following steps:
s401, obtaining a video to be identified, and determining an image modal characteristic, a text modal characteristic and an audio modal characteristic of the video to be identified.
S402, acquiring entity information of the video to be identified, and determining the map modal characteristics of the video to be identified by using the entity information and the knowledge map.
And S403, determining the video modal characteristics of the video to be identified according to the image modal characteristics and the map modal characteristics.
For specific implementation of steps S401 to S403, reference may be made to the detailed description of steps S201 to S203 in the foregoing embodiment, which is not described herein again.
S404, generating a target fusion feature based on the video modal feature, the text modal feature and the audio modal feature.
S405, performing category identification according to the video modal characteristics, the text modal characteristics, the audio modal characteristics and the target fusion characteristics, and determining a category identification result of the video to be identified.
In steps S404 and S405, a target fusion feature may be generated according to the video modal feature, the text modal feature, and the audio modal feature; for example, a fusion model may be used to perform feature fusion to obtain the target fusion feature. After the target fusion feature is obtained, the category identification result of the video to be identified can be determined by using the video modal feature, the text modal feature, the audio modal feature, and the target fusion feature. For example, category identification processing can be performed by using the video modal feature, the text modal feature, the audio modal feature, and the target fusion feature, so as to obtain the category identification result of the video to be identified.
In one implementation, when performing the category identification processing based on these features, multi-level classification may be performed to identify the category corresponding to the video to be identified. The multi-level classification may also be referred to as hierarchical classification. Hierarchical classification may refer to: when predicting the second-level class, using the hidden layer features of the first-level class and the Embedding of the predicted first-level class; when predicting the third-level class, using the hidden layer features of the first-level and second-level classes and the Embedding of the predicted second-level class; and so on, until the prediction class of the last-level class is obtained.
In specific implementation, the classification processing for the ith time can be performed according to the video modal characteristic, the text modal characteristic, the audio modal characteristic and the target fusion characteristic to obtain an ith-level class identification result corresponding to the classification processing for the ith time, wherein i is a positive integer and is less than or equal to N, and N is a positive integer greater than or equal to 2. And then, carrying out classification processing for the (i + 1) th time according to the hidden layer characteristics and the (i) th class identification result corresponding to each classification processing between the 1 st classification processing and the (i) th classification processing to obtain the (i + 1) th class identification result corresponding to the (i + 1) th classification processing. And obtaining a category identification result corresponding to the Nth classification processing until the Nth classification processing is finished, and determining the category identification result of the video to be identified according to the category identification result corresponding to each classification processing in the N classification processing. For example, the category identification result of the video to be identified may be a category identification result corresponding to the nth classification processing; for another example, the category identification result of the video to be identified may include a category identification result corresponding to each classification process.
For example, assuming that a video to be recognized is a video for explaining a watch, when performing the category recognition processing on the video, the video may be classified three times by using three levels of classification. The first classification processing corresponds to the fact that the classification recognition result is science and technology, namely the first-level classification is science and technology; the second classification processing corresponds to the fact that the category identification result is the smart watch, namely the secondary classification is the smart watch; the third classification processing corresponds to the fact that the classification recognition result is the domestic watch, namely the third classification is the domestic watch. Then, the category identification result of the video may be a three-level category (i.e., a domestic watch), and may also include a first-level category, a second-level category, and a third-level category (i.e., a science and technology, a smart watch, and a domestic watch).
Optionally, the category identification process may be implemented by using a classification module in the video recognition model, where the classification module may be a hierarchical classification module (also referred to as a multi-stage classification module), and the hierarchical classification module may include N levels of classification sub-modules, where each level of classification sub-module among the N levels corresponds to one level of classification processing.
For example, the hierarchical classification module shown in fig. 5a is taken as an example to describe the hierarchical classification, wherein the hierarchical classification module includes 3-level classification sub-modules. As shown in fig. 5a, the video modal feature, the text modal feature, the audio modal feature, and the target fusion feature may be input into the level 1 classification sub-module of the hierarchical classification module to perform the 1 st classification processing, so as to obtain the class identification result corresponding to the level 1 classification sub-module (or the 1 st classification processing), and for convenience of subsequent description, the class identification result is referred to as the first result for short; after the first result is obtained, inputting the hidden layer features corresponding to the 1 st classification processing and the embedded vector corresponding to the first result into a 2 nd classification submodule of the hierarchical classification module to perform 2 nd classification processing, so as to obtain a class identification result corresponding to the 2 nd classification submodule (or the 2 nd classification processing), which may be referred to as a second result; after the second result is obtained, the hidden layer features corresponding to the 1 st classification processing, the hidden layer features corresponding to the 2 nd classification processing, and the embedded vector corresponding to the second result are input to the 3 rd classification submodule of the hierarchical classification module to perform the 3 rd classification processing, so as to obtain a class identification result corresponding to the 3 rd classification submodule (or the 3 rd classification processing), and the class identification result may be referred to as a third result. The category identification result of the video to be identified may be determined based on the first result, the second result, and the third result, for example, the category identification result of the video to be identified may be the third result, or the first result, the second result, and the third result.
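A PyTorch sketch of such a 3-level hierarchical classification module is given below; each level consumes the input features together with the hidden layer features of earlier levels and the Embedding of the previous level's predicted class. The class counts, feature sizes, and the use of argmax to pick the previous level's class are illustrative assumptions.

```python
# Sketch of a 3-level hierarchical classifier with class-embedding feedback.
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    def __init__(self, in_dim=512, hidden=256, n1=40, n2=200, n3=1000, emb=64):
        super().__init__()
        self.h1 = nn.Linear(in_dim, hidden)
        self.out1 = nn.Linear(hidden, n1)
        self.emb1 = nn.Embedding(n1, emb)
        self.h2 = nn.Linear(in_dim + hidden + emb, hidden)
        self.out2 = nn.Linear(hidden, n2)
        self.emb2 = nn.Embedding(n2, emb)
        self.h3 = nn.Linear(in_dim + 2 * hidden + emb, hidden)
        self.out3 = nn.Linear(hidden, n3)

    def forward(self, x):                  # x: concatenated modal + fusion features
        hid1 = torch.relu(self.h1(x))
        logits1 = self.out1(hid1)                           # first result (level-1 classes)
        e1 = self.emb1(logits1.argmax(dim=-1))
        hid2 = torch.relu(self.h2(torch.cat([x, hid1, e1], dim=-1)))
        logits2 = self.out2(hid2)                           # second result (level-2 classes)
        e2 = self.emb2(logits2.argmax(dim=-1))
        hid3 = torch.relu(self.h3(torch.cat([x, hid1, hid2, e2], dim=-1)))
        logits3 = self.out3(hid3)                           # third result (level-3 classes)
        return logits1, logits2, logits3
```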
To better understand the video processing method provided in the embodiment of the present application, the following further description is made with reference to the schematic structural diagram of the video recognition model shown in fig. 5b and the video processing flow shown in fig. 5 c. As shown in fig. 5b, the video recognition model may include a multi-modal feature extraction module, a multi-modal feature fusion module, and a hierarchical classification module. The multi-modal feature extraction module can comprise a video modal feature extraction module, a text modal feature extraction module and an audio modal feature extraction module; the video modal feature extraction module can comprise an image modal feature extraction module, a map modal feature extraction module and a feature fusion module; the feature fusion module may include a concatenation module, a feature relationship module, and a feature enhancement module.
In one implementation, as shown in fig. 5b and 5c, the multi-modal feature extraction module may include a video modal feature extraction module for performing video modal feature extraction, a text modal feature extraction module for performing text modal feature extraction, and an audio modal feature extraction module for performing audio modal feature extraction.
The processing procedure of the video modality feature extraction module may include: the image modal feature extraction module is used for extracting image modal features, for example, image modal information can be obtained from a video to be identified, and then the BigTransfer model is used for processing the image modal information to obtain the image modal features; the atlas modal feature extraction module is used for extracting the atlas modal features, for example, atlas modal information can be obtained from a video to be identified, and then the DeepWalk is used for processing the atlas modal information to obtain the atlas modal features; and finally, fusing the image modal characteristics and the map modal characteristics by using a characteristic fusion module to obtain the video modal characteristics. The processing procedure of the feature fusion module may refer to the corresponding description of the schematic diagram shown in fig. 3c or fig. 3 d.
The processing procedure of the text modal feature extraction module may include: and acquiring text modal information from the video to be identified, and processing the text modal information by using a TextRCNN model to obtain text modal characteristics.
The processing procedure of the audio modality feature extraction module may include: and acquiring audio modal information from the video to be identified, and then processing the audio modal information by using a Vggish model to obtain audio modal characteristics.
After the video modal characteristics, the text modal characteristics and the audio modal characteristics are obtained through the multi-modal characteristic extraction module, the multi-modal characteristic fusion module can be used for carrying out fusion processing on the plurality of modal characteristics so as to obtain target fusion characteristics. For example, the fusion of multiple modal features can be performed using the Teacher-Student model. Furthermore, a hierarchical classification module can be used for carrying out multi-level classification processing on the video modal characteristics, the text modal characteristics, the audio modal characteristics and the target fusion characteristics, so that a category identification result of the video to be identified is obtained.
The specific processing procedure of each module in the video identification model may refer to the foregoing related description, and is not described herein again.
In one implementation, in order to prevent the video recognition model from over-fitting to the modalities that are easy to learn, a hierarchical classifier (i.e., the hierarchical classification module described above) may be added to every modality, that is, the features of each modality may be input into the hierarchical classification module for processing. In this way, end-to-end training of the video recognition model is realized by jointly constraining the hierarchical classifiers with multiple loss functions. In the training process, each hierarchical classifier can generate a cross entropy loss for the multi-level classification (such as three-level classification) to constrain the classification training of its modality; this design can effectively improve the accuracy of the model and can also accelerate convergence.
However, in some cases, when a model classifies (recognizes) videos by using the different modality information of the same video as described above, the classification (recognition) effect may still be poor. For example, analysis of difficult samples (samples with a large error between the predicted classification result and the actual classification result) shows that classification judgments made from different modality information of the same video may yield opposite results; a similar problem may also be encountered during model training, where the model fits to the modality that is easiest to converge, thereby affecting the learning of model parameters. In order to avoid this situation, when the respective modal features are input to the hierarchical classification module for processing, the target fusion feature is also input to the hierarchical classification module for processing, and the synergy of the multiple modalities of the video recognition model during training is enhanced by adding a KLD (KL Divergence) loss. Optionally, KL Divergence (also called Kullback-Leibler Divergence or Information Divergence) is an asymmetric measure of the difference between two Probability Distributions.
Optionally, in the model training of the video recognition model, the cross entropy loss and the KLD loss may be used to train the video recognition model to obtain the trained video recognition model. For example, cross-entropy loss and KLD loss may be used to calculate a model loss value, and model parameters of the video recognition model may be trained in a direction that reduces the model loss value (or minimizes the model loss value).
For example, the video modal feature is processed by the hierarchical classification module to generate a cross entropy loss L1, the text modal feature is processed by the hierarchical classification module to generate a cross entropy loss L2, the audio modal feature is processed by the hierarchical classification module to generate a cross entropy loss L3, and the target fusion feature is processed by the hierarchical classification module to generate a KLD loss L4; the total loss L of the video recognition model in the training process can then be L = L1 + L2 + L3 + L4. In the model training of the video recognition model, the model loss value may be calculated based on L, and the video recognition model may be trained in the direction that reduces the model loss value. A sketch of this joint loss is given below.
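The following PyTorch sketch shows the joint loss L = L1 + L2 + L3 + L4 described above; the exact source of the soft-target distribution used for the KLD term is not spelled out in the text, so its presence as an input here is an assumption for illustration.

```python
# Sketch of the joint training loss: per-modality cross entropy plus a KLD term.
import torch
import torch.nn.functional as F

def total_loss(video_logits, text_logits, audio_logits, fusion_logits,
               labels, soft_targets):
    l1 = F.cross_entropy(video_logits, labels)
    l2 = F.cross_entropy(text_logits, labels)
    l3 = F.cross_entropy(audio_logits, labels)
    # KLD loss between the fused branch's distribution and the soft targets
    l4 = F.kl_div(F.log_softmax(fusion_logits, dim=-1), soft_targets,
                  reduction="batchmean")
    return l1 + l2 + l3 + l4
```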
In the embodiment of the application, on the basis of adopting multi-modal machine learning to extract information of different modalities of the video content and synthesize a stable multi-modal representation, external high-quality knowledge is introduced to construct the relation network, so that the video recognition model has reasoning capability, the video has reasoning capability during multi-level classification, the multi-level classification capability for videos is effectively enhanced, video recognition performance is improved, and the model also has good knowledge expansibility. When the category identification processing is carried out, the class hierarchy dependency relationship is fully utilized to improve the prediction accuracy of the video recognition model, and fine-grained multi-level classification can be realized. Meanwhile, on the basis of understanding the video content and fully utilizing the image (such as a video cover image), text (such as a title text), audio, and external knowledge data of the video content, the video recognition model can be further trained by combining the multi-modal features of the video with knowledge graph information, so that the accuracy and performance of the finally obtained video recognition model can be further improved compared with conventional video multi-modal models, and the description of the video is more comprehensive and accurate, which can improve the distribution efficiency of a subsequent recommendation system for the video.
The method of the embodiments of the present application is described above, and the apparatus of the embodiments of the present application is described below.
Referring to fig. 6, fig. 6 is a schematic diagram of a composition structure of a video processing apparatus provided in an embodiment of the present application, where the video processing apparatus may be a computer program (including program code) running in a computer device; the video processing device can be used for executing corresponding steps in the video processing method provided by the embodiment of the application. For example, the video processing apparatus 60 includes:
an obtaining unit 601, configured to obtain a video to be identified, and determine an image modal feature, a text modal feature, and an audio modal feature of the video to be identified;
a first determining unit 602, configured to obtain entity information of the video to be identified, and determine a map modal feature of the video to be identified by using the entity information and a knowledge map;
a second determining unit 603, configured to determine a video modal feature of the video to be identified according to the image modal feature and the atlas modal feature;
the identifying unit 604 is configured to determine a category identification result of the video to be identified based on the video modal feature, the text modal feature, and the audio modal feature.
In an implementation manner, the second determining unit 603 is specifically configured to:
performing feature splicing based on the image modal features and the map modal features to obtain splicing features;
determining an initial video mode characteristic of the video to be identified according to the splicing characteristic, wherein the initial video mode characteristic is used for representing characteristic association information between the image mode characteristic and the map mode characteristic;
and performing feature enhancement processing on the initial video modal characteristics to obtain the video modal characteristics of the video to be identified.
In one implementation, the initial video modal characteristics are obtained by processing the stitching characteristics through a characteristic relation module in a video identification model; the video modal characteristics are obtained by performing characteristic enhancement processing on the initial video modal characteristics through a characteristic enhancement module in the video identification model.
In an implementation manner, the identifying unit 604 is specifically configured to:
generating a target fusion feature based on the video modal feature, the text modal feature, and the audio modal feature;
and performing category identification according to the video modal characteristic, the text modal characteristic, the audio modal characteristic and the target fusion characteristic, and determining a category identification result of the video to be identified.
In an implementation manner, the identifying unit 604 is specifically configured to:
performing ith classification processing according to the video modal characteristic, the text modal characteristic, the audio modal characteristic and the target fusion characteristic to obtain an ith class identification result corresponding to the ith classification processing, wherein i is a positive integer and is less than or equal to N, and N is a positive integer greater than or equal to 2;
performing classification processing for the (i + 1) th time according to hidden layer characteristics and an i-th class identification result corresponding to each classification processing between the 1 st classification processing and the i-th classification processing to obtain an i + 1-th class identification result corresponding to the i + 1-th classification processing;
and obtaining a class identification result corresponding to the Nth classification processing until the Nth classification processing is finished, and determining the class identification result of the video to be identified according to the class identification result corresponding to each classification processing in the N times of classification processing.
In one implementation manner, the class identification result of the video to be identified is obtained by calling a hierarchical classification module in a video recognition model, where the hierarchical classification module includes N levels of classification sub-modules, and each level of classification sub-module among the N levels corresponds to one level of classification processing.
In an implementation manner, the first determining unit 602 is specifically configured to:
acquiring attribute information associated with the entity information from the knowledge graph;
and determining the map modal characteristics of the video to be identified based on the entity information and the attribute information.
It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 6, reference may be made to the description of the method embodiment, and details are not described here again.
In the embodiment of the application, a video to be identified can be obtained, and the image modal characteristic, the text modal characteristic and the audio modal characteristic of the video to be identified are determined; the entity information of the video to be identified can be obtained, and the atlas modal characteristics of the video to be identified are determined by utilizing the entity information and the knowledge atlas; then, the video modality characteristics of the video to be identified can be determined according to the image modality characteristics and the map modality characteristics. And determining a category identification result of the video to be identified based on the video modal characteristics, the text modal characteristics and the audio modal characteristics. On the basis of fully utilizing images, texts and audios of videos, the videos can be subjected to combined representation by introducing external high-quality knowledge (for example, knowledge maps are utilized to obtain entities and attributes corresponding to the entities), so that the videos are identified by combining multi-modal features, and the accuracy of video identification is improved; by introducing the knowledge graph information, the video feature joint knowledge graph information is used for further representing the video, so that the knowledge graph information is fully utilized for effective expansion, the video has reasoning capability during category identification, the video identification capability is improved, and the accuracy of video identification is improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 7, the computer device 70 may include: a processor 701, a memory 702, and a network interface 703. The processor 701 is connected to the memory 702 and the network interface 703, for example, the processor 701 may be connected to the memory 702 and the network interface 703 through a bus.
The processor 701 is configured to support the video processing apparatus to perform corresponding functions in the video processing method described above. The Processor 701 may be a Central Processing Unit (CPU), a Network Processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a Field-Programmable Gate Array (FPGA), General Array Logic (GAL), or any combination thereof.
The memory 702 is used for storing program codes and the like. The Memory 702 may include Volatile Memory (VM), such as Random Access Memory (RAM); the Memory 702 may also include a Non-Volatile Memory (NVM), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); the memory 702 may also comprise a combination of the above types of memory. In this embodiment, the memory 702 is used to store the program code and data related to the video processing method, and the like.
The network interface 703 is used to provide network communication functions.
The processor 701 may call the program code to perform the following operations:
acquiring a video to be identified, and determining image modal characteristics, text modal characteristics and audio modal characteristics of the video to be identified;
acquiring entity information of the video to be identified, and determining map modal characteristics of the video to be identified by using the entity information and a knowledge map;
determining the video modal characteristics of the video to be identified according to the image modal characteristics and the map modal characteristics;
and determining a category identification result of the video to be identified based on the video modal characteristic, the text modal characteristic and the audio modal characteristic.
It should be understood that the computer device 70 described in this embodiment may perform the description of the video processing method in the embodiment corresponding to fig. 2 and fig. 4, and may also perform the description of the video processing apparatus in the embodiment corresponding to fig. 6, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program comprises program instructions which, when executed by a computer, cause the computer to perform the method according to the aforementioned embodiments; the computer may be a part of the aforementioned computer device, such as the processor 701 described above. By way of example, the program instructions may be executed on one computer device, or on multiple computer devices located at one site, or distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions may be readable by a processor of a computer device from a computer-readable storage medium, and executable by the processor to cause the computer device to perform the steps performed in the embodiments of the methods described above.
It will be understood by those skilled in the art that all or part of the processes in the methods for implementing the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; therefore, equivalent variations made in accordance with the claims of the present application still fall within the scope of the present application.

Claims (10)

1. A video processing method, comprising:
acquiring a video to be identified, and determining image modal characteristics, text modal characteristics and audio modal characteristics of the video to be identified;
acquiring entity information of the video to be identified, and determining map modal characteristics of the video to be identified by using the entity information and a knowledge map;
determining the video modal characteristics of the video to be identified according to the image modal characteristics and the atlas modal characteristics;
and determining a category identification result of the video to be identified based on the video modal characteristic, the text modal characteristic and the audio modal characteristic.
2. The method according to claim 1, wherein the determining the video modality feature of the video to be identified according to the image modality feature and the atlas modality feature comprises:
performing feature splicing based on the image modal features and the atlas modal features to obtain splicing features;
determining an initial video modality feature of the video to be identified according to the splicing feature, wherein the initial video modality feature is used for representing feature association information between the image modality feature and the map modality feature;
and performing feature enhancement processing on the initial video modal characteristics to obtain the video modal characteristics of the video to be identified.
3. The method according to claim 2, wherein the initial video modality features are obtained by processing the stitched features through a feature relation module in a video recognition model; the video modal characteristics are obtained by performing characteristic enhancement processing on the initial video modal characteristics through a characteristic enhancement module in the video identification model.
4. The method according to any one of claims 1-3, wherein the determining a category identification result of the video to be identified based on the video modality feature, the text modality feature, and the audio modality feature comprises:
generating a target fusion feature based on the video modality feature, the text modality feature, and the audio modality feature;
and performing category identification according to the video modal characteristics, the text modal characteristics, the audio modal characteristics and the target fusion characteristics, and determining a category identification result of the video to be identified.
5. The method according to claim 4, wherein the performing category identification according to the video modality feature, the text modality feature, the audio modality feature and the target fusion feature to determine a category identification result of the video to be identified comprises:
performing ith classification processing according to the video modal characteristics, the text modal characteristics, the audio modal characteristics and the target fusion characteristics to obtain an ith class identification result corresponding to the ith classification processing, wherein i is a positive integer and is less than or equal to N, and N is a positive integer greater than or equal to 2;
performing classification processing for the (i + 1) th time according to hidden layer characteristics and an ith-level class identification result corresponding to each classification processing between the 1 st classification processing and the ith classification processing to obtain an (i + 1) th-level class identification result corresponding to the (i + 1) th classification processing;
and obtaining a class identification result corresponding to the Nth classification processing until the Nth classification processing is finished, and determining the class identification result of the video to be identified according to the class identification result corresponding to each classification processing in the N times of classification processing.
6. The method according to claim 5, wherein the class identification result of the video to be identified is obtained by calling a hierarchical classification module in a video identification model, the hierarchical classification module comprises N levels of classification sub-modules, and each level of classification sub-module in the N levels of classification sub-modules corresponds to one level of classification processing.
7. The method according to claim 1, wherein the determining the atlas modal characteristics of the video to be identified by using the entity information and the knowledge atlas comprises:
acquiring attribute information associated with the entity information from a knowledge graph;
and determining the map modal characteristics of the video to be identified based on the entity information and the attribute information.
8. A video processing apparatus, comprising:
the device comprises an acquisition unit, a recognition unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be recognized and determining an image modal characteristic, a text modal characteristic and an audio modal characteristic of the video to be recognized;
the first determining unit is used for acquiring entity information of the video to be identified and determining the map modal characteristics of the video to be identified by using the entity information and the knowledge map;
the second determining unit is used for determining the video modal characteristics of the video to be identified according to the image modal characteristics and the map modal characteristics;
and the identification unit is used for determining the category identification result of the video to be identified based on the video modal characteristic, the text modal characteristic and the audio modal characteristic.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to cause the computer device to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-7.
CN202210286475.6A 2022-03-22 2022-03-22 Video processing method and device, computer equipment and storage medium Pending CN114661951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210286475.6A CN114661951A (en) 2022-03-22 2022-03-22 Video processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210286475.6A CN114661951A (en) 2022-03-22 2022-03-22 Video processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114661951A true CN114661951A (en) 2022-06-24

Family

ID=82031391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210286475.6A Pending CN114661951A (en) 2022-03-22 2022-03-22 Video processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114661951A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168650A (en) * 2022-09-07 2022-10-11 杭州笔声智能科技有限公司 Conference video retrieval method, device and storage medium
CN115690615A (en) * 2022-10-11 2023-02-03 杭州视图智航科技有限公司 Deep learning target identification method and system for video stream
CN115690615B (en) * 2022-10-11 2023-11-03 杭州视图智航科技有限公司 Video stream-oriented deep learning target recognition method and system

Similar Documents

Publication Publication Date Title
CN110489395B (en) Method for automatically acquiring knowledge of multi-source heterogeneous data
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN111026842A (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN112836487B (en) Automatic comment method and device, computer equipment and storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN112528053A (en) Multimedia library classified retrieval management system
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN114339450A (en) Video comment generation method, system, device and storage medium
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN117011737A (en) Video classification method and device, electronic equipment and storage medium
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN116541492A (en) Data processing method and related equipment
CN116975615A (en) Task prediction method and device based on video multi-mode information
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN113407778A (en) Label identification method and device
CN112712056A (en) Video semantic analysis method and device, storage medium and electronic equipment
CN117197569A (en) Image auditing method, image auditing model training method, device and equipment
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN117011745A (en) Data processing method, device, computer equipment and readable storage medium
CN116980665A (en) Video processing method, device, computer equipment, medium and product
CN115169472A (en) Music matching method and device for multimedia data and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination