CN113539299A - Multimedia information processing method and device, electronic equipment and storage medium - Google Patents

Multimedia information processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113539299A
CN113539299A (application CN202110036472.2A)
Authority
CN
China
Prior art keywords
multimedia information
audio
target
information processing
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110036472.2A
Other languages
Chinese (zh)
Inventor
常德丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110036472.2A priority Critical patent/CN113539299A/en
Publication of CN113539299A publication Critical patent/CN113539299A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65: Clustering; Classification
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides a multimedia information processing method and apparatus, an electronic device, and a storage medium. The method includes: analyzing target multimedia information to separate the target audio contained in the multimedia information; converting the target audio to form a Mel spectrogram matched with the time-domain features and frequency-domain features of the target audio; determining, through a first sub-model network in a multimedia information processing model, a first audio feature vector corresponding to the target audio based on the Mel spectrogram matched with the time-domain and frequency-domain features of the target audio; determining, through a second sub-model network in the multimedia information processing model, a second audio feature vector corresponding to the target audio; and determining the type of the target audio in the target multimedia information based on the two feature vectors. In this way the type of the target audio in the target multimedia information can be determined, which reduces the workload of manual review, improves the speed and accuracy of multimedia information review, and improves the user experience.

Description

Multimedia information processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to multimedia information processing technologies, and in particular, to a multimedia information processing method and apparatus, an electronic device, and a storage medium.
Background
In the related art, multimedia information takes many forms and demand for it is growing explosively, so the number and variety of items received by a multimedia information server keep increasing. Taking long-video applications as an example, a video server can identify similarity relationships between videos through a corresponding matching algorithm. However, with the popularization and development of video editing tools, more and more audio editing tricks are used to evade copyright review, such as speeding up and pitch-shifting the audio or overlaying multiple audio tracks. Algorithmic identification of such videos is increasingly difficult, audio-edited infringing content is hard to detect, and manual review is slow, which affects the user experience.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a multimedia information processing method and apparatus, an electronic device, and a storage medium, which can classify the target audio in multimedia information with a multimedia information processing model and thereby determine the type of the target audio in the target multimedia information.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a multimedia information processing method, which comprises the following steps:
acquiring target multimedia information, and analyzing the target multimedia information to realize separation of target audio contained in the multimedia information;
converting the target audio to form a Mel frequency spectrogram matched with the time domain characteristics and the frequency domain characteristics of the target audio;
determining a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with time domain features and frequency domain features of the target audio through a first sub-model network in a multimedia information processing model;
determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a second sub-model network in a multimedia information processing model;
determining a type of a target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector.
An embodiment of the present invention further provides a multimedia information processing apparatus, where the apparatus includes:
the information transmission module is used for acquiring target multimedia information and analyzing the target multimedia information to realize separation of target audio contained in the multimedia information;
the information processing module is used for carrying out conversion processing on the target audio to form a Mel frequency spectrogram matched with the time domain characteristics and the frequency domain characteristics of the target audio;
the information processing module is used for determining a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a first sub-model network in a multimedia information processing model;
the information processing module is used for determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a second sub-model network in a multimedia information processing model;
the information processing module is configured to determine a type of a target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector.
In the above scheme, the information processing module is configured to analyze the target multimedia information to obtain timing information of the target multimedia information;
the information processing module is used for analyzing the video parameters corresponding to the target multimedia information according to the time sequence information of the target multimedia information, and acquiring the playing time length parameter and the audio track information parameter corresponding to the target multimedia information;
the information processing module is used for extracting the target multimedia information based on the playing time length parameter and the audio track information parameter corresponding to the target multimedia information so as to obtain the target audio corresponding to the target multimedia information.
In the above scheme, the information processing module is configured to perform channel conversion processing on the target audio to form monaural audio data;
the information processing module is used for carrying out short-time Fourier transform on the single-channel audio data based on a windowing function corresponding to a multimedia information processing model to form a corresponding spectrogram;
the information processing module is used for determining a duration parameter corresponding to the multimedia information processing model;
and the information processing module is used for processing the spectrogram according to the duration parameter to form a Mel spectrogram matched with the time domain characteristic and the frequency domain characteristic of the target audio.
In the above scheme, the information processing module is configured to convert a mel frequency spectrum chart matched with a time domain feature and a frequency domain feature of the target audio into a corresponding grayscale image;
the information processing module is used for extracting a feature vector of the Mel frequency spectrum diagram through a convolutional neural network in a first sub-model network in a multimedia information processing model according to the gray level image;
the information processing module is used for processing the feature vector of the Mel spectrogram through a gated recurrent unit in the first sub-model network, and determining a first audio feature vector corresponding to the target audio.
In the foregoing solution, the information processing module is configured to determine, based on the number of the Mel spectrograms, the number of gated recurrent unit channels in the first sub-model network;
the information processing module is used for determining time sequence parameters according to the time domain characteristics and the frequency domain characteristics of the target audio;
the information processing module is used for determining a recurrent neural network in the first sub-model network based on the number of gated recurrent unit channels in the first sub-model network and the time sequence parameter;
the information processing module is used for determining a first audio feature vector corresponding to the target audio through a recurrent neural network in the first sub-model network.
In the foregoing solution, the information processing module is configured to determine, based on a Mel spectrogram matched with the time domain feature and frequency domain feature of the target audio, the output information of an average pooling layer network through a residual network in a second sub-model network in the multimedia information processing model;
the information processing module is used for adjusting parameters of an image classification network in the second sub-model network according to the output information of the average pooling layer network;
the information processing module is used for determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through an image classification network in a second sub-model network.
In the above scheme,
the information processing module is used for establishing data storage mapping according to the information source of the target multimedia information;
and the information processing module is used for responding to the established data storage mapping and adjusting the file format of the target audio so as to realize matching with the information source.
In the above scheme, the training module is configured to obtain a first training sample set, where the first training sample set is an audio sample in video information acquired through a terminal;
the training module is used for adding noise to the first training sample set to form a corresponding second training sample set;
the training module is used for processing the second training sample set through a multimedia information processing model so as to determine initial parameters of the multimedia information processing model;
the training module is used for responding to the initial parameters of the multimedia information processing model, processing the second training sample set through the multimedia information processing model and determining the updating parameters of the multimedia information processing model;
and the training module is used for iteratively updating the network parameters of the multimedia information processing model through the second training sample set according to the update parameters of the multimedia information processing model.
In the above scheme, the training module is configured to determine a dynamic noise type matched with a use environment of the multimedia information processing model;
and the training module is used for adding noise to the first training sample set according to the dynamic noise type so as to change the background noise, volume or sampling rate of the audio samples in the first training sample set and form a corresponding second training sample set.
In the above scheme, the training module is configured to substitute different audio samples in the second training sample set into loss functions respectively corresponding to the first sub-model network and the second sub-model network of the multimedia information processing model;
the training module is used for determining parameters respectively corresponding to a first submodel network and a second submodel network in the multimedia information processing model when the loss function meets corresponding convergence conditions;
and the training module is used for taking the parameters respectively corresponding to the first sub-model network and the second sub-model network as the update parameters of the multimedia information processing model.
In the above scheme, the training module is configured to determine convergence conditions respectively matched with a first sub-model network and a second sub-model network in the multimedia information processing model;
the training module is used for carrying out iterative updating on the parameters respectively corresponding to the first sub-model network and the second sub-model network until the loss functions respectively corresponding to the first sub-model network and the second sub-model network meet the corresponding convergence conditions.
In the above scheme, the information processing module is configured to perform vector fusion processing on the first audio feature vector and the second audio feature vector;
the information processing module is configured to determine a type of a target audio in the target multimedia information based on a result of the vector fusion process, where the type of the target audio includes at least one of:
compliant audio, accelerated pitch-shifted audio, and multi-layer audio track overlay audio.
In the above solution, the information processing module is configured to determine source multimedia information corresponding to the target multimedia information;
the information processing module is used for determining a corresponding inter-frame similarity parameter set through the first audio feature vector and the second audio feature vector based on a target audio of the target multimedia information and a source audio of the source multimedia information;
the information processing module is used for acquiring the number of audio frames reaching a similarity threshold value in the interframe similarity parameter set;
the information processing module is used for determining the similarity between the target multimedia information and the source multimedia information based on the number of the audio frames reaching the similarity threshold.
In the above scheme, the information processing module is configured to, when it is determined that the target multimedia information is similar to the source multimedia information, obtain copyright information of the target multimedia information;
the information processing module is used for determining the legality of the target multimedia information according to the copyright information of the target multimedia information and the copyright information of the source multimedia information;
and the information processing module is used for sending out warning information when the copyright information of the target multimedia information is inconsistent with the copyright information of the source multimedia information.
In the above solution, the information processing module is configured to add the target multimedia information to a multimedia information source when it is determined that the target multimedia information is not similar to source multimedia information;
the information processing module is used for sequencing the recall sequence of the multimedia information to be recommended in the multimedia information source;
and the information processing module is used for recommending the multimedia information to the target user based on the sequencing result of the recall sequence of the multimedia information to be recommended.
In the foregoing solution, the information processing module is configured to send the target multimedia information identifier, the first audio feature vector of the target multimedia information, and the type of the target audio of the target multimedia information to the blockchain network, so that
a node of the blockchain network fills the target multimedia information identifier, the first audio feature vector of the target multimedia information, and the type of the target audio of the target multimedia information into a new block, and, when consensus on the new block is reached, appends the new block to the tail of the blockchain.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and a processor, configured to implement the aforementioned multimedia information processing method when running the executable instructions stored in the memory.
The embodiment of the invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the aforementioned multimedia information processing method.
The embodiment of the invention has the following beneficial effects:
the method comprises the steps of obtaining target multimedia information, analyzing the target multimedia information to separate target audio contained in the multimedia information; converting the target audio to form a Mel frequency spectrogram matched with the time domain characteristics and the frequency domain characteristics of the target audio; determining a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with time domain features and frequency domain features of the target audio through a first sub-model network in a multimedia information processing model; determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a second sub-model network in a multimedia information processing model; and determining the type of the target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector, so that the type of the target audio in the target multimedia information can be reduced, the workload of manual review is reduced, the speed and accuracy of multimedia information review are improved, and the use experience of a user is improved.
Drawings
FIG. 1 is a schematic diagram of an environment for processing multimedia information according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a short video playback in accordance with an embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating an alternative method for processing multimedia information according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the processing of audio by the multimedia information processing model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart illustrating an alternative method for processing multimedia information according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart illustrating an alternative method for processing multimedia information according to an embodiment of the present invention;
FIG. 8 is an alternative diagram illustrating the type determination of target audio in an embodiment of the present invention;
fig. 9 is a schematic block chain network architecture provided in the embodiment of the present invention;
fig. 10 is a schematic structural diagram of a block chain in the block chain network 200 according to an embodiment of the present invention;
fig. 11 is a functional architecture diagram of a blockchain network 200 according to an embodiment of the present invention;
FIG. 12 is a diagram illustrating a usage scenario of a multimedia information processing method according to an embodiment of the present invention;
FIG. 13 is a diagram illustrating a process of using the method for processing multimedia information according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments of the present invention are explained; the following explanations apply to these terms and expressions.
1) In response to: used to indicate the condition or state on which a performed operation depends. When the condition or state on which it depends is satisfied, the one or more operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are executed.
2) Target video: various forms of video information available in the internet, such as video files, multimedia information, etc. presented in a client or smart device.
3) A client: the bearer in the terminal that implements the specific function, for example, the mobile client (APP), is the bearer for the specific function in the mobile terminal, for example, the function of performing live online (video push streaming) or the playing function of online video.
4) Short-time Fourier transform: the Short-Time Fourier Transform (STFT) is a Fourier-related transform that determines the frequency and phase of local sections of a time-varying signal.
5) Mel spectrogram (MBF, Mel Bank Features): because the raw spectrogram is large, it is usually passed through Mel-scale filter banks to obtain a sound feature of suitable size, i.e., a Mel-scale spectrogram.
6) And the information flow is a content organization form which is arranged up and down according to a specific specification and style. From the perspective of presentation ordering, there are common chronological, thermal, algorithmic orderings.
7) Audio feature vector: i.e., an audio 0/1 vector, a binarized feature vector generated from audio.
8) Transaction: equivalent to the computer term "transaction". A transaction includes an operation that needs to be committed to a blockchain network for execution and does not refer solely to a transaction in the business context; in view of the convention in blockchain technology of colloquially using the term "transaction", the embodiments of the present invention follow this convention.
For example, a deployment (deployment) transaction is used to install a specified smart contract to a node in a blockchain network and is ready to be invoked; the Invoke (Invoke) transaction is used to append records of the transaction in the blockchain by invoking the smart contract and to perform operations on the state database of the blockchain, including update operations (including adding, deleting, and modifying key-value pairs in the state database) and query operations (i.e., querying key-value pairs in the state database).
9) Block chain (Block chain): is the storage structure of an encrypted, chained transaction formed by blocks (blocks).
For example, the header of each block may include hash values of all transactions in the block, and also include hash values of all transactions in the previous block, so as to achieve tamper resistance and forgery resistance of the transactions in the block based on the hash values; newly generated transactions, after being filled into the tiles and passing through the consensus of nodes in the blockchain network, are appended to the end of the blockchain to form a chain growth.
10) Blockchain Network: the set of nodes that incorporate new blocks into the blockchain by way of consensus.
11) Ledger: a general term for the blockchain (also known as ledger data) and the state database kept in sync with the blockchain.
Wherein, the blockchain records the transaction in the form of a file in a file system; the state database records the transactions in the blockchain in the form of different types of Key (Key) Value pairs for supporting fast query of the transactions in the blockchain.
12) Smart Contracts (Smart Contracts): also known as Chain code (Chain code) or application code, a program deployed in a node of a blockchain network, the node executing an intelligent contract invoked in a received transaction to update or query key-value data of an account database.
13) Consensus: a process in a blockchain network by which the nodes involved agree on the transactions in a block; the agreed block is appended to the end of the blockchain. Mechanisms for achieving consensus include Proof of Work (PoW), Proof of Stake (PoS), Delegated Proof of Stake (DPoS), Proof of Elapsed Time (PoET), etc.
14) Multimedia information: including but not limited to: long video (video uploaded by the user), short video (video uploaded by the user with a length of less than 1 minute), audio (e.g., mv or album with a fixed picture).
Fig. 1 is a schematic usage environment diagram of a multimedia information processing method according to an embodiment of the present invention. Referring to fig. 1, terminals (including a terminal 10-1 and a terminal 10-2) are provided with clients capable of executing different functions. The terminals obtain different video information for browsing from the corresponding server 200 through the network 300 using different service processes; the network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission. The types of multimedia information obtained by the terminals (including the terminal 10-1 and the terminal 10-2) from the corresponding server 200 through the network 300 differ, and the multimedia information includes but is not limited to: long videos (e.g., videos uploaded by users, or existing videos whose copyright needs to be verified), short videos (e.g., user-uploaded videos shorter than 1 minute), and audio (e.g., an MV or an album with a fixed picture). For example, the terminals may obtain long videos (i.e., video information or corresponding video links) from the corresponding server 200 through the network 300, and may also obtain short videos for browsing from the corresponding server 400 through the same video client or a WeChat applet over the network 300. Different types of videos may be stored in the server 200 and the server 400; this application no longer distinguishes the playing environments of different types of videos. In this process, the video information pushed to the user's client should be copyright-compliant, so for a large number of videos it is necessary to judge which videos are similar and then perform compliance detection on the copyright information of the similar videos, so as to avoid pushing repeated or infringing video information.
Taking short videos as an example, the multimedia information processing model provided by the present invention can be applied to short-video playing. In short-video playing, different short videos from different data sources are usually processed, and the multimedia information to be recommended to the corresponding user is finally presented on a user interface; if the recommended video is a pirated video with non-compliant copyright, the user experience is directly affected. A background database for video playing receives a large amount of video data from different sources every day. The different videos obtained for recommending multimedia information to a target user can also be called by other application programs (for example, a recommendation result of a short-video recommendation process is migrated to a long-video recommendation process or a news recommendation process), and of course the multimedia information processing model matched with the corresponding target user can also be migrated to different multimedia information recommendation processes (for example, a web-page, applet, or long-video-client multimedia information recommendation process).
The multimedia information processing method provided by the embodiment of the application is realized based on Artificial Intelligence (AI), which is a theory, a method, a technology and an application system for simulating, extending and expanding human Intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
In the embodiments of the present application, the artificial intelligence software technologies mainly involved include the above-mentioned speech processing technology, machine learning, and other directions. For example, the present invention may relate to Automatic Speech Recognition (ASR) in Speech Technology, which includes speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and the like.
For example, Machine Learning (ML) may be involved. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and so on. It specializes in studying how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning generally includes techniques such as Deep Learning, which includes artificial neural networks, e.g., Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Deep Neural Networks (DNN).
As described in detail below with respect to the structure of the electronic device according to the embodiment of the present invention, the electronic device may be implemented in various forms, such as a terminal with a multimedia information processing function, for example, a mobile phone running a video client, where the trained multimedia information processing model may be packaged in a storage medium of the terminal, or may be a server or a group of servers with a multimedia information processing function, where the trained multimedia information processing model may be deployed in a server, for example, the server 200 in fig. 1. Fig. 2 is a schematic diagram of a composition structure of an electronic device according to an embodiment of the present invention, and it is understood that fig. 2 only shows an exemplary structure of the electronic device, and not a whole structure, and a part of the structure or the whole structure shown in fig. 2 may be implemented as needed.
The electronic device provided by the embodiment of the invention can comprise: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the electronic device 20 are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the multimedia information processing apparatus provided in the embodiments of the present invention may be implemented by a combination of hardware and software, and by way of example, the multimedia information processing apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the multimedia information processing method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of the multimedia information processing apparatus provided by the embodiment of the present invention implemented by combining software and hardware, the multimedia information processing apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and completes the multimedia information processing method provided by the embodiment of the present invention in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus system 205).
By way of example, the Processor 201 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
As an example of the multimedia information processing apparatus provided by the embodiment of the present invention implemented by hardware, the apparatus provided by the embodiment of the present invention may be implemented by directly using the processor 201 in the form of a hardware decoding processor, for example, by being executed by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components, to implement the multimedia information processing method provided by the embodiment of the present invention.
The memory 202 in embodiments of the present invention is used to store various types of data to support the operation of the electronic device 20. Examples of such data include any executable instructions for operating on the electronic device 20; the programs implementing the multimedia information processing method of embodiments of the present invention may be included in these executable instructions.
In other embodiments, the multimedia information processing apparatus provided by the embodiment of the present invention may be implemented by software, and fig. 2 shows the multimedia information processing apparatus 2020 stored in the memory 202, which may be software in the form of programs, plug-ins, and the like, and includes a series of modules, and as an example of the programs stored in the memory 202, the multimedia information processing apparatus 2020 may include the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the multimedia information processing apparatus 2020 are read into the RAM by the processor 201 and executed, the functions of the software modules in the multimedia information processing apparatus 2020 are described as follows:
the information transmission module 2081 is configured to obtain target multimedia information and analyze the target multimedia information to separate a target audio included in the multimedia information.
The information processing module 2082 is configured to perform conversion processing on the target audio to form a mel spectrum map matched with the time domain feature and the frequency domain feature of the target audio.
The information processing module 2082 is configured to determine, through a first sub-model network in the multimedia information processing model, a first audio feature vector corresponding to the target audio based on a mel-frequency spectrogram matched with a time-domain feature and a frequency-domain feature of the target audio.
The information processing module 2082 is configured to determine, through a second sub-model network in the multimedia information processing model, a second audio feature vector corresponding to the target audio based on the mel-frequency spectrogram matched with the time-domain feature and the frequency-domain feature of the target audio.
The information processing module 2082 is configured to determine a type of a target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector.
According to the electronic device shown in fig. 2, in one aspect of the present application, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes different embodiments and combinations of embodiments provided in various alternative implementations of the multimedia information processing method.
Before the multimedia information processing method provided by the embodiment of the present invention is described with reference to the electronic device 20 shown in fig. 2, the defects of the related art are described first. Although an existing video server can identify the similarity relationship between videos through a corresponding matching algorithm, with the popularization and development of video editing tools the types of attack on video pictures become more complicated. Referring to fig. 3, fig. 3 is a schematic diagram of short-video playing in the related art of the embodiment of the present invention. For the cropped video shown in fig. 3, relying solely on video image fingerprints cannot handle repeated/infringing content whose pictures have changed substantially, and audio that has been speeded up and pitch-shifted or overlaid with multiple audio tracks is difficult to identify. In the related art, audio information in videos can be compared through an audio fingerprint algorithm to determine whether the videos are similar, but for audio subjected to accelerated pitch shifting or multi-layer audio track superposition, accurate identification cannot be achieved. For example, in a pitch-shifting attack, because the landmark fingerprint depends on frequency peak points and the frequency of the audio is changed in the pitch-shifted video, the generated hash differs and the similarity retrieval fails; similarly, for double-speed/slow-down attacks, since the combined hash in the landmark fingerprint depends on dt (t2-t1), speeding up or slowing down changes dt, and the generated hashes differ.
In order to overcome the above drawbacks, referring to fig. 4, fig. 4 is an optional flowchart of a multimedia information processing method according to an embodiment of the present invention. It can be understood that the steps shown in fig. 4 may be executed by various electronic devices running a multimedia information processing apparatus, for example, a terminal, a server, or a server cluster having a multimedia information processing function. When the multimedia information processing apparatus runs in a terminal, a WeChat applet in the terminal may be triggered to perform multimedia information similarity detection; when the apparatus runs in a long-video copyright detection server or a music playing software server, the corresponding long-video copyright or music copyright may be detected. The following description refers to the steps shown in fig. 4.
Step 401: the multimedia information processing device acquires target multimedia information and analyzes the target multimedia information to realize separation of target audio included in the multimedia information.
Wherein, a data storage mapping can be established according to the information source of the target multimedia information; and adjusting the file format of the target audio in response to the established data storage mapping so as to match the information source.
In some embodiments of the present invention, obtaining target multimedia information, and analyzing the target multimedia information to separate a target audio included in the target multimedia information may be implemented by:
analyzing the target multimedia information to acquire time sequence information of the target multimedia information; analyzing the video parameters corresponding to the target multimedia information according to the time sequence information of the target multimedia information, and acquiring a playing duration parameter and an audio track information parameter corresponding to the target multimedia information; and extracting from the target multimedia information based on the playing duration parameter and the audio track information parameter to obtain the target audio corresponding to the target multimedia information. Specifically, the multimedia information to be processed is first sent to the multimedia information processing apparatus through a client. Taking long-video information as an example, an audio synchronization packet in the video data may be obtained first; the corresponding playing duration parameter and audio track information parameter can then be obtained by parsing the audio header decoding data AACDecoderSpecificInfo and the audio data configuration information AudioSpecificConfig in the audio synchronization packet. The AudioSpecificConfig is used to generate the ADTS header (including the sampling rate, the number of channels, and the frame length in the audio data). The other audio packets in the video data are acquired based on the audio track information, the original audio data are parsed, and finally the AAC ES stream is packaged into the ADTS format by adding a 7-byte ADTS header in front of the AAC ES stream, thereby completing the extraction and obtaining the target audio corresponding to the target multimedia information.
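The patent parses the AAC packets and prepends the ADTS header itself; as a hedged sketch of the same audio-separation step, the snippet below simply shells out to ffmpeg (an assumption, the patent names no tool) to copy the audio track of a video file without re-encoding. The file names are hypothetical.

```python
import subprocess

def separate_target_audio(video_path: str, audio_path: str = "target_audio.aac") -> str:
    """Separate the audio track from a video container without re-encoding.

    Illustrative sketch only: the patent itself extracts the AAC ES stream and
    adds the 7-byte ADTS header manually instead of calling an external tool.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "copy", audio_path],
        check=True,
    )
    return audio_path

# Hypothetical usage:
# separate_target_audio("target_video.mp4")
```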
Step 402: and the multimedia information processing device converts the target audio to form a Mel frequency spectrogram matched with the time domain characteristics and the frequency domain characteristics of the target audio.
In some embodiments of the present invention, the converting process is performed on the target audio to form a mel frequency spectrum corresponding to the target audio, and may be implemented by:
carrying out channel conversion processing on the target audio to form single-channel audio data; performing short-time Fourier transform on the single-channel audio data based on the windowing function corresponding to the multimedia information processing model to form a corresponding spectrogram; determining a duration parameter corresponding to the multimedia information processing model; and processing the spectrogram according to the duration parameter to form a Mel spectrogram corresponding to the target audio. Taking the processing of audio information in a video as an example, the audio can first be resampled into 16 kHz single-channel audio; then a short-time Fourier transform is applied to the audio using a 25 ms periodic Hann window with a 10 ms frame shift to obtain the corresponding spectrogram; the Mel spectrum is computed by mapping the spectrogram onto a 64-band Mel filter bank, where the Mel bins cover 125-7500 Hz; log(mel-spectrum + 0.01) is computed to obtain a stable Mel spectrum, the 0.01 offset being added to avoid taking the logarithm of 0; and the obtained features are framed into non-overlapping examples of 0.96 s, each example containing 96 frames of 10 ms with 64 Mel bands per frame, so that the corresponding Mel spectrogram is extracted.
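A minimal sketch of this conversion, assuming the librosa library (the patent does not prescribe an implementation); the window, hop, filter-bank, log-offset and framing values follow the figures quoted above.

```python
import librosa
import numpy as np

def audio_to_log_mel_patches(audio_path: str) -> np.ndarray:
    """Convert target audio into non-overlapping 0.96 s log-Mel patches of shape (N, 96, 64)."""
    # 1) resample to 16 kHz single-channel audio
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    # 2) STFT with a 25 ms Hann window and 10 ms frame shift, mapped onto a 64-band Mel filter bank
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400, win_length=400, hop_length=160, window="hann",
        n_mels=64, fmin=125, fmax=7500, power=2.0,
    )                                   # shape: (64, frames)
    # 3) stabilised log-Mel spectrum; the 0.01 offset avoids log(0)
    log_mel = np.log(mel + 0.01).T      # shape: (frames, 64)
    # 4) frame into non-overlapping 0.96 s examples of 96 frames x 64 Mel bands
    n = log_mel.shape[0] // 96
    return log_mel[: n * 96].reshape(n, 96, 64)
```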
Further, when audio data are converted into a Mel spectrogram, note that frequency is measured in hertz (Hz) and the range audible to the human ear is 20-20000 Hz, but the ear does not perceive the Hz scale linearly. For example, after adapting to a 1000 Hz tone, if the tone frequency is increased to 2000 Hz, the ear perceives only a small increase in pitch, by no means a doubling. If the ordinary frequency scale is converted into the Mel scale, the human perception of frequency becomes linear: under the Mel scale, if the Mel frequencies of two pieces of speech differ by a factor of two, the pitch perceived by the human ear also differs by roughly a factor of two. This makes it possible to visualize the audio data in a perceptually meaningful way.
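The patent does not spell out the Hz-to-Mel mapping; a commonly used formula that produces this approximately linear perceptual scale is:

```latex
m = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
f = 700\left(10^{\,m/2595} - 1\right)
```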
Step 403: and the multimedia information processing device determines a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a first sub-model network in a multimedia information processing model.
In some embodiments of the present invention, determining, by a first sub-model network in a multimedia information processing model, a first audio feature vector corresponding to the target audio based on a mel frequency spectrogram matching with a time domain feature and a frequency domain feature of the target audio may be implemented by:
converting the Mel spectrogram matched with the time domain feature and frequency domain feature of the target audio into a corresponding grayscale image; extracting a feature vector of the Mel spectrogram from the grayscale image through a convolutional neural network in the first sub-model network of the multimedia information processing model; and processing the feature vector of the Mel spectrogram through a gated recurrent unit in the first sub-model network to determine a first audio feature vector corresponding to the target audio. The number of gated recurrent unit channels in the first sub-model network is determined based on the number of Mel spectrograms; time sequence parameters are determined according to the time domain features and frequency domain features of the target audio; the recurrent neural network in the first sub-model network is determined based on the number of gated recurrent unit channels in the first sub-model network and the time sequence parameters; and the first audio feature vector corresponding to the target audio is determined through the recurrent neural network in the first sub-model network.
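As a hedged illustration of this CNN-plus-gated-recurrent-unit structure (the layer sizes and hidden dimension below are assumptions, not values taken from the patent), a minimal PyTorch sketch could look like this:

```python
import torch
import torch.nn as nn

class FirstSubModel(nn.Module):
    """Illustrative first sub-model: a small CNN extracts a feature vector from each
    Mel-spectrogram patch and a gated recurrent unit summarises the patches over time."""
    def __init__(self, gru_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-patch convolutional feature extractor
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.gru = nn.GRU(input_size=32, hidden_size=gru_hidden, batch_first=True)

    def forward(self, mel_patches):                     # (batch, T, 1, 96, 64) sequence of Mel patches
        b, t = mel_patches.shape[:2]
        feats = self.cnn(mel_patches.flatten(0, 1)).flatten(1)   # (b*t, 32) per-patch features
        feats = feats.view(b, t, -1)                             # restore the time sequence
        _, h_n = self.gru(feats)                                 # gated recurrent unit over time
        return h_n[-1]                                           # first audio feature vector (batch, gru_hidden)
```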
Referring to fig. 5, fig. 5 is a schematic diagram of the processing of audio by the multimedia information processing model in the embodiment of the present invention. Feature extraction of the multimedia information processing model may be implemented through a VGGish (Visual Geometry Group-style audio) network. For example, for the audio information in a video, the audio file may first be extracted, the corresponding Mel spectrogram obtained for the audio file, and then audio features extracted from the Mel spectrogram through the VGGish network; the extracted vectors are cluster-encoded through NetVLAD (Net Vector of Locally Aggregated Descriptors) to obtain the audio feature vector. NetVLAD stores the distance between each feature point and its nearest cluster center and uses it as a new feature.
Continuing with the example of long-video processing, the VGGish network supports extracting semantically meaningful 128-dimensional embedding feature vectors from the corresponding audio information. Specifically, the audio clip is converted into Mel-spectrogram samples as the input of the VGGish model: the spectrogram of the audio clip is computed using the signal magnitude, the spectrogram is mapped onto a 64-band Mel filter bank to compute the Mel spectrum, and N Mel-spectrogram samples are obtained, with feature dimension N x 96 x 64; then a TensorFlow-based VGGish model is used as the audio feature extractor, the samples are taken as input, and feature extraction is performed with the VGGish network model to obtain the N x 128 audio feature vectors corresponding to the audio segments.
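A compact sketch of the NetVLAD-style aggregation mentioned above, which turns the N x 128 clip-level VGGish embeddings into one fixed-length audio feature vector; the number of clusters and the linear soft-assignment layer are assumptions rather than values given in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Minimal NetVLAD pooling layer (illustrative sketch, not the patent's exact design)."""
    def __init__(self, num_clusters=32, dim=128):
        super().__init__()
        self.assignment = nn.Linear(dim, num_clusters)                 # soft-assignment weights
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))  # learnable cluster centers

    def forward(self, x):
        # x: (batch, N, dim) frame/clip-level embeddings, e.g. N x 128 VGGish vectors
        soft_assign = F.softmax(self.assignment(x), dim=-1)            # (batch, N, K)
        # residuals between each descriptor and each cluster center
        residual = x.unsqueeze(2) - self.centroids.unsqueeze(0).unsqueeze(0)  # (batch, N, K, dim)
        vlad = (soft_assign.unsqueeze(-1) * residual).sum(dim=1)       # (batch, K, dim)
        vlad = F.normalize(vlad, p=2, dim=-1)                          # intra-normalization
        vlad = vlad.flatten(1)                                         # (batch, K*dim)
        return F.normalize(vlad, p=2, dim=-1)                          # final L2 normalization

# Hypothetical usage: aggregate 10 clip-level 128-d embeddings into one audio feature vector
# audio_vec = NetVLAD(num_clusters=32, dim=128)(torch.randn(1, 10, 128))
```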
Step 404: and the multimedia information processing device determines a second audio feature vector corresponding to the target audio based on the Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a second sub-model network in the multimedia information processing model.
In some embodiments of the present invention, determining, by a second sub-model network in a multimedia information processing model, a second audio feature vector corresponding to the target audio based on a mel-frequency spectrogram matching with a time-domain feature and a frequency-domain feature of the target audio may be implemented by:
determining output information of an average pooling layer through a residual network in the second sub-model network of the multimedia information processing model, based on the Mel spectrogram matched with the time-domain and frequency-domain features of the target audio; adjusting parameters of an image classification network in the second sub-model network according to the output information of the average pooling layer; and determining, through the image classification network in the second sub-model network, the second audio feature vector corresponding to the target audio based on the Mel spectrogram matched with the time-domain and frequency-domain features of the target audio. A minimal sketch of such a network follows.
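The sketch below illustrates one way such a second sub-model network could be assembled in Keras: a residual backbone with average pooling feeding a classification-style head that outputs the second audio feature vector. Using ResNet50 as the residual network and a 128-dimensional output are assumptions; the document does not fix the architecture or dimensions.

```python
# Sketch of a second sub-model: a residual network whose average-pooling output
# feeds an image-classification head that yields the second audio feature vector.
# ResNet50 as a stand-in residual network is an assumption.
import tensorflow as tf

def build_second_submodel(num_mel_frames=96, num_mel_bins=64, embedding_dim=128):
    inputs = tf.keras.Input(shape=(num_mel_frames, num_mel_bins, 1))
    # Replicate the single grey channel so the ResNet backbone accepts it
    x = tf.keras.layers.Concatenate()([inputs, inputs, inputs])
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None, pooling="avg",
        input_shape=(num_mel_frames, num_mel_bins, 3))
    x = backbone(x)                                   # average-pooling layer output
    # "Image classification network" head driven by the pooled features
    outputs = tf.keras.layers.Dense(embedding_dim, activation="relu")(x)
    return tf.keras.Model(inputs, outputs, name="second_submodel")

second_model = build_second_submodel()
second_model.summary()
```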
Step 405: the multimedia information processing device determines the type of the target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector.
The first audio feature vector and the second audio feature vector may be subjected to vector fusion processing, and the type of the target audio in the target multimedia information is determined based on the result of the vector fusion processing (a minimal sketch follows the list below), wherein the type of the target audio includes at least one of:
compliant audio, accelerated change audio, and multi-layer soundtrack overlay audio.
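A minimal sketch of the fusion and type decision is given below. Concatenation followed by a softmax head is only one possible form of "vector fusion processing" and is an assumption; the class names simply mirror the three types listed above.

```python
# Sketch: fuse the two audio feature vectors and classify the target-audio type.
# Concatenation + a small softmax head is an assumption; the document only says
# the two vectors undergo "vector fusion processing".
import tensorflow as tf

AUDIO_TYPES = ["compliant", "accelerated_change", "multi_layer_soundtrack_overlay"]

first_vec = tf.keras.Input(shape=(128,), name="first_audio_feature_vector")
second_vec = tf.keras.Input(shape=(128,), name="second_audio_feature_vector")
fused = tf.keras.layers.Concatenate()([first_vec, second_vec])      # vector fusion
hidden = tf.keras.layers.Dense(64, activation="relu")(fused)
type_probs = tf.keras.layers.Dense(len(AUDIO_TYPES), activation="softmax")(hidden)
type_head = tf.keras.Model([first_vec, second_vec], type_probs,
                           name="audio_type_head")
```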
With continuing reference to fig. 6, fig. 6 is an optional flowchart of the multimedia information processing method according to the embodiment of the present invention. It can be understood that the steps shown in fig. 6 may be executed by various electronic devices running a multimedia information processing apparatus, for example a terminal, a server, or a server cluster with a multimedia information processing function. When the multimedia information processing apparatus runs in a terminal, a WeChat applet in the terminal may be triggered to perform multimedia information similarity detection; when the multimedia information processing apparatus runs in a short video copyright detection server or a music playing software server, the corresponding short video copyright or music copyright can be detected. The steps shown in fig. 6 are described below.
Step 601: source multimedia information corresponding to the target multimedia information is determined.
Step 602: and determining a corresponding inter-frame similarity parameter set through the first audio feature vector and the second audio feature vector based on the target audio of the target multimedia information and the source audio of the source multimedia information.
Step 603: and acquiring the number of audio frames reaching the similarity threshold in the interframe similarity parameter set.
Step 604: determining whether the number of audio frames reaching the similarity threshold exceeds a quantity threshold; when the number exceeds the quantity threshold, executing step 605, otherwise executing step 606 (a minimal sketch of this decision logic is given after step 606).
Step 605: and determining that the target multimedia information is similar to the source multimedia information, and prompting to provide copyright information.
Step 606: and determining that the target multimedia information is different from the source multimedia information, and entering a corresponding recommendation process.
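The following sketch illustrates steps 602-606 under stated assumptions: frame-aligned feature vectors are compared with cosine similarity, the frames reaching a similarity threshold are counted, and the count is compared with a quantity threshold. Both thresholds and the choice of cosine similarity are illustrative, not values given in this document.

```python
# Sketch of steps 602-606: per-frame similarity between target and source audio
# features, the number of frames above a similarity threshold, and a
# quantity-threshold decision. Thresholds are illustrative assumptions.
import numpy as np

def is_similar(target_feats: np.ndarray, source_feats: np.ndarray,
               sim_threshold: float = 0.9, count_threshold: int = 100) -> bool:
    """target_feats, source_feats: (N, D) frame-aligned feature vectors."""
    n = min(len(target_feats), len(source_feats))
    t, s = target_feats[:n], source_feats[:n]
    # Inter-frame similarity parameter set (cosine similarity per frame)
    sims = np.sum(t * s, axis=1) / (
        np.linalg.norm(t, axis=1) * np.linalg.norm(s, axis=1) + 1e-12)
    # Number of audio frames reaching the similarity threshold
    num_hits = int(np.sum(sims >= sim_threshold))
    return num_hits > count_threshold   # step 605 if True, step 606 otherwise
```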
In some embodiments of the present invention, when it is determined that the target multimedia information is similar to the source multimedia information, acquiring copyright information of the target multimedia information; determining the legality of the target multimedia information according to the copyright information of the target multimedia information and the copyright information of the source multimedia information; and when the copyright information of the target multimedia information is inconsistent with the copyright information of the source multimedia information, sending out warning information.
In some embodiments of the present invention, when it is determined that the target multimedia information is not similar to the source multimedia information, adding the target multimedia information to a multimedia information source; sequencing the recall sequence of the multimedia information to be recommended in the multimedia information source; and recommending the multimedia information to the target user based on the sequencing result of the recall sequence of the multimedia information to be recommended.
Continuing to describe the multimedia information processing method provided by the embodiment of the present invention with reference to the electronic device 20 shown in fig. 2, referring to fig. 7, fig. 7 is an optional flowchart of the multimedia information processing method provided by the embodiment of the present invention. It can be understood that the steps shown in fig. 7 may be executed by various electronic devices running a multimedia information processing apparatus, for example a terminal, a server, or a server cluster with a multimedia information processing function. When the multimedia information processing apparatus runs in a long video copyright detection server or a music playing software server to detect the corresponding long video copyright or music copyright, a trained multimedia information processing model can be deployed in the server to detect the similarity of uploaded videos and determine whether compliance detection of the videos' copyright information is required. Of course, the multimedia information processing model needs to be trained before it is deployed, which specifically includes the following steps:
step 701: the method comprises the steps of obtaining a first training sample set, wherein the first training sample set is an audio sample in video information collected through a terminal.
Step 702: noise adding is performed on the first set of training samples to form a corresponding second set of training samples.
In some embodiments of the present invention, noise adding the first set of training samples to form the corresponding second set of training samples may be implemented by:
determining a dynamic noise type matched with the use environment of the multimedia information processing model; and, according to the dynamic noise type, adding noise to the first training sample set to change the background noise, the volume, or the sampling rate of the audio samples in the first training sample set, so as to form the corresponding second training sample set. Audio information can be attacked in many forms; therefore, during construction of the training sample set, an audio-enhanced data set can be built according to common audio attack types. Common audio enhancement forms include voice changing, adding background noise, changing the volume, changing the sampling rate, changing the sound quality, and so on; setting different parameters yields different enhanced audio. It should be noted that, in some embodiments of the present invention, the construction of the training sample set does not cover cases where the video duration changes or where a frame shift causes frame misalignment.
A training sample set is built from the audio-enhanced data. For example, one original audio corresponds to 20 attack audios, where each attack audio has the same duration as the original audio and no frame shift (i.e., the audio at the corresponding time point is the same). If the audio duration is dur and 0.96 s is used as the step, each group of audios (the original audio plus its attack audios) generates dur/0.96 labels, and labels at the same time point are identical. A sketch of this construction is given below.
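A minimal sketch of this sample construction is shown below, using librosa and NumPy. The particular augmentation parameters (pitch-shift amount, noise level, gain, intermediate sampling rate) are illustrative assumptions; only roughly duration-preserving transformations are used, consistent with the no-frame-shift constraint described above.

```python
# Sketch of building the "attack"/enhancement training set: voice change via
# pitch shift, background noise, volume change, sampling-rate change, plus one
# label per 0.96 s step. Parameter values are illustrative assumptions.
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> dict:
    return {
        "pitch_shift": librosa.effects.pitch_shift(y=y, sr=sr, n_steps=2),
        "background_noise": y + 0.005 * np.random.randn(len(y)),
        "volume_change": 0.5 * y,
        # Down-sample then up-sample back, approximately preserving duration
        "resampled": librosa.resample(
            librosa.resample(y, orig_sr=sr, target_sr=8000),
            orig_sr=8000, target_sr=sr),
    }

def segment_labels(duration_s: float, label: int, step: float = 0.96) -> list:
    # One label per 0.96 s step; the original audio and its attack audios share
    # the same label at the same time point.
    return [label] * int(duration_s // step)
```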
Step 703: processing the second set of training samples by a multimedia information processing model to determine initial parameters of the multimedia information processing model.
Step 704: and responding to the initial parameters of the multimedia information processing model, processing the second training sample set through the multimedia information processing model, and determining the updating parameters of the multimedia information processing model.
In some embodiments of the present invention, the second training sample set is processed by the multimedia information processing model in response to the initial parameters of the multimedia information processing model, and the determination of the updated parameters of the multimedia information processing model may be performed by:
substituting different audio samples in the second training sample set into loss functions respectively corresponding to a first sub-model network and a second sub-model network of the multimedia information processing model; determining parameters respectively corresponding to a first sub-model network and a second sub-model network in the multimedia information processing model when the loss function meets corresponding convergence conditions; and taking the parameters respectively corresponding to the first sub-model network and the second sub-model network as the update parameters of the multimedia information processing model.
Step 705: and according to the updating parameters of the multimedia information processing model, iteratively updating the network parameters of the multimedia information processing model through the second training sample set.
Specifically, determining convergence conditions respectively matched with a first sub-model network and a second sub-model network in the multimedia information processing model; and iteratively updating the parameters respectively corresponding to the first sub-model network and the second sub-model network until the loss functions respectively corresponding to the first sub-model network and the second sub-model network meet the corresponding convergence conditions.
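The iterative update of steps 703-705 can be sketched as follows. The optimiser, learning rate, loss functions, and the specific convergence test (a small change in loss between epochs) are assumptions; the same routine would be run separately for the first and the second sub-model network, each with its own loss function and convergence condition.

```python
# Sketch of the iterative update in steps 703-705: train a sub-model network on
# the second (noise-augmented) training sample set until its loss function meets
# a convergence condition. Optimiser, loss, and tolerance are assumptions.
import tensorflow as tf

def train_until_convergence(model, dataset, loss_fn, max_epochs=50, tol=1e-4):
    optimizer = tf.keras.optimizers.Adam(1e-4)
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = tf.keras.metrics.Mean()
        for x, y in dataset:                        # second training sample set
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            epoch_loss.update_state(loss)
        cur = float(epoch_loss.result())
        # Convergence condition: the loss change between iterations is small
        if abs(prev_loss - cur) < tol:
            break
        prev_loss = cur
    return model
```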
Fig. 8 is an alternative diagram illustrating the type determination of the target audio according to the embodiment of the present invention. As the video plays, the displayed picture area of the video changes along the time axis and contains different video targets. The audio information is processed through the multimedia information processing model to judge whether accelerated voice change or multi-layer soundtrack overlay has occurred, which enables assisted review of short videos at scale; the detection result determines whether the video to be detected is compliant, or whether it meets the requirements of copyright information, so as to prevent videos uploaded by users from being misappropriated.
Because the amount of multimedia information on the server keeps increasing, the copyright information of the multimedia information can be stored in a blockchain network or a cloud server, where the similarity of the multimedia information can be judged. The embodiment of the present invention may be implemented in combination with cloud technology or blockchain network technology. Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize the calculation, storage, processing, and sharing of data; it can also be understood as a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied on the basis of the cloud computing business model. Background services of technical network systems require a large amount of computing and storage resources, for example multimedia information websites, picture websites, and portal websites, so cloud technology needs to be supported by cloud computing.
It should be noted that cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space, and information services as required. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user to be infinitely expandable, available on demand, and paid for according to use. As a basic capability provider of cloud computing, a cloud computing resource pool platform, referred to as Infrastructure as a Service (IaaS), is established, and multiple types of virtual resources are deployed in the resource pool for external clients to use as needed. The cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices, and network devices.
In some embodiments of the present invention, the target multimedia information identifier, the first audio feature vector and the second audio feature vector of the target multimedia information, and the type of the target audio of the target multimedia information may further be sent to a blockchain network, so that a node of the blockchain network fills the target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio into a new block, and when consensus is reached on the new block, the new block is appended to the tail of the blockchain.
In the above scheme, the method further comprises:
receiving data synchronization requests of other nodes in the blockchain network; responding to the data synchronization requests by verifying the authority of the other nodes; and, when the authority of the other nodes passes verification, controlling data synchronization between the current node and the other nodes, so that the other nodes obtain the target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio of the target multimedia information.
In the above scheme, the method further comprises: responding to a query request and parsing the query request to obtain a corresponding user identifier; acquiring authority information in a target block of the blockchain network according to the user identifier; checking whether the authority information matches the user identifier; when the authority information matches the user identifier, acquiring from the blockchain network the corresponding target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio of the target multimedia information; and, in response to the query request, pushing the acquired target multimedia information identifier, first audio feature vector, second audio feature vector, and type of the target audio to the corresponding client, so that the client obtains the corresponding data stored in the blockchain network.
With continued reference to fig. 9, fig. 9 is a schematic architecture diagram of a blockchain network provided in the embodiment of the present invention, which includes a blockchain network 200 (exemplarily illustrating a consensus node 210-1 to a consensus node 210-3), an authentication center 300, a service agent 400, and a service agent 500, which are respectively described below.
The type of blockchain network 200 is flexible and may be, for example, any of a public chain, a private chain, or a federation chain. Taking a public link as an example, electronic devices such as user terminals and servers of any service entity can access the blockchain network 200 without authorization; taking a federation chain as an example, an electronic device (e.g., a terminal/server) under the jurisdiction of a service entity after obtaining authorization may access the blockchain network 200, and at this time, become a client node in the blockchain network 200.
In some embodiments, the client node may act as a mere watcher of the blockchain network 200, i.e., provides functionality to support a business entity to initiate a transaction (e.g., for uplink storage of data or querying of data on a chain), and may be implemented by default or selectively (e.g., depending on the specific business requirements of the business entity) with respect to the functions of the consensus node 210 of the blockchain network 200, such as a ranking function, a consensus service, and an accounting function, etc. Therefore, the data and the service processing logic of the service subject can be migrated into the block chain network 200 to the maximum extent, and the credibility and traceability of the data and service processing process are realized through the block chain network 200.
Consensus nodes in blockchain network 200 receive transactions submitted from client nodes (e.g., client node 410 attributed to business entity 400, and client node 510 attributed to database operator systems, shown in the preamble embodiments) of different business entities (e.g., business entity 400 and business entity 500, shown in the preamble implementation), perform the transactions to update the ledger or query the ledger, and various intermediate or final results of performing the transactions may be returned for display in the business entity's client nodes.
For example, the client node 410/510 may subscribe to events of interest in the blockchain network 200, such as transactions occurring in a particular organization/channel in the blockchain network 200, and the corresponding transaction notifications are pushed by the consensus node 210 to the client node 410/510, thereby triggering the corresponding business logic in the client node 410/510.
An exemplary application of the blockchain network is described below, taking an example that a plurality of service agents access the blockchain network to implement management of instruction information and service processes matched with the instruction information.
Referring to fig. 9, a plurality of business entities are involved in the management link. For example, the business entity 400 may be a multimedia information processing apparatus, and the business entity 500 may be a display system with the function of the multimedia information processing apparatus. Each business entity registers with the certificate authority 300 to obtain its digital certificate, which includes the public key of the business entity and a digital signature issued by the certificate authority 300 over the public key and the identity information of the business entity. The digital certificate is attached to a transaction together with the business entity's digital signature on the transaction and sent to the blockchain network, so that the blockchain network can take the digital certificate and the signature out of the transaction, verify the authenticity of the message (i.e., whether it has been tampered with) and the identity of the business entity that sent it, and then check, according to that identity, whether the business entity has the right to initiate the transaction. Clients running on electronic devices (e.g., terminals or servers) hosted by the business entity may request access to the blockchain network 200 to become client nodes.
The client node 410 of the business entity 400 is configured to send the target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio of the target multimedia information to the blockchain network, so that the nodes of the blockchain network fill these data into a new block and, when consensus is reached on the new block, append the new block to the tail of the blockchain.
When the corresponding target multimedia information identifier, first audio feature vector, second audio feature vector, and type of the target audio are sent to the blockchain network 200, business logic can be set in the client node 410 in advance: when the target multimedia information is determined to be dissimilar to the source multimedia information, the client node 410 automatically sends the target multimedia information identifier to be processed, the first audio feature vector, the second audio feature vector, and the type of the target audio to the blockchain network 200; alternatively, a member of staff of the business entity 400 logs in to the client node 410 and manually packages these data together with the corresponding conversion progress information and sends them to the blockchain network 200. During sending, the client node 410 generates a transaction corresponding to the update operation according to the target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio; the transaction specifies the smart contract that needs to be invoked to implement the update operation and the parameters passed to the smart contract, and also carries the digital certificate of the client node 410 and a signed digital signature (for example, a digest of the transaction encrypted with the private key in the digital certificate of the client node 410); the transaction is then broadcast to the consensus nodes 210 in the blockchain network 200.
When the transaction is received by a consensus node 210 in the blockchain network 200, the digital certificate and the digital signature carried by the transaction are verified; after the verification succeeds, whether the business entity 400 has transaction authority is determined according to the identity of the business entity 400 carried in the transaction; if either the digital signature verification or the authority check fails, the transaction fails. After successful verification, the consensus node 210 appends its own digital signature (e.g., by encrypting a digest of the transaction with the private key of the consensus node 210-1) and continues to broadcast the transaction in the blockchain network 200.
After receiving a successfully verified transaction, a consensus node 210 in the blockchain network 200 fills the transaction into a new block and broadcasts the block. When a consensus node 210 in the blockchain network 200 broadcasts a new block, a consensus process is performed on it; if the consensus succeeds, the new block is appended to the tail of the locally stored blockchain, the state database is updated according to the transaction result, and the transaction in the new block is executed: for a transaction submitting an update of the target multimedia information identifier to be processed, the first audio feature vector, the second audio feature vector, the type of the target audio, and the corresponding process trigger information, the corresponding key-value pairs are added to the state database.
A member of staff of the business entity 500 logs in to the client node 510 and enters a query request for the target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio of the target multimedia information. The client node 510 generates a transaction corresponding to the update/query operation according to the request, specifying the smart contract that needs to be invoked to implement the update/query operation and the parameters passed to the smart contract; the transaction also carries the digital certificate of the client node 510 and a signed digital signature (for example, a digest of the transaction encrypted with the private key in the digital certificate of the client node 510), and the transaction is broadcast to the consensus nodes 210 in the blockchain network 200.
After receiving the transaction, the consensus nodes 210 in the blockchain network 200 verify it, fill it into a block, and reach consensus; the filled new block is appended to the tail of the locally stored blockchain, the state database is updated according to the transaction result, and the transaction in the new block is executed: for a submitted transaction that updates the manual identification result corresponding to the copyright information of certain multimedia information, the key-value pair corresponding to that copyright information in the state database is updated according to the manual identification result; for a submitted transaction that queries the copyright information of certain multimedia information, the key-value pairs corresponding to the target multimedia information identifier, the first audio feature vector, the second audio feature vector, and the type of the target audio are queried from the state database, and the transaction result is returned.
It should be noted that fig. 9 exemplarily shows the process of directly putting the target multimedia information identifier, the first audio feature vector, the second audio feature vector, the type of the target audio, and the corresponding process trigger information on the chain. In other embodiments, when the data volume of these items is large, the client node 410 may instead put only a hash of the record on the chain and store the target multimedia information identifier, the first audio feature vector, the second audio feature vector, the type of the target audio, and the corresponding process trigger information in a distributed file system or a database. After obtaining the record from the distributed file system or the database, the client node 510 can verify it against the corresponding hash in the blockchain network 200, thereby reducing the workload of the on-chain operation.
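The hash-on-chain pattern described above can be sketched as follows: only a digest of the record is written to the blockchain, the full record is stored off-chain, and a reader re-hashes the retrieved record to verify integrity. The record fields and the use of SHA-256 are illustrative assumptions.

```python
# Sketch of "hash on chain, payload off chain": a digest of the feature record
# goes to the blockchain, the full record is kept in a file system or database,
# and readers re-hash to verify integrity. Field names are illustrative only.
import hashlib
import json

record = {
    "media_id": "video-0001",
    "first_audio_feature_vector": [0.12, 0.34, 0.56],
    "second_audio_feature_vector": [0.21, 0.43, 0.65],
    "audio_type": "compliant",
}

payload = json.dumps(record, sort_keys=True).encode("utf-8")
digest_on_chain = hashlib.sha256(payload).hexdigest()   # value stored in the block

def verify(record_from_storage: dict, digest: str) -> bool:
    data = json.dumps(record_from_storage, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest() == digest

assert verify(record, digest_on_chain)
```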
As an example of a blockchain, referring to fig. 10, fig. 10 is a schematic structural diagram of a blockchain in the blockchain network 200 according to an embodiment of the present invention. The header of each block may include the hash values of all transactions in the block as well as the hash values of all transactions in the previous block; records of newly generated transactions are filled into a block and, after consensus among the nodes in the blockchain network, appended to the tail of the blockchain to form chain growth; the hash-based chain structure between blocks ensures that transactions in the blocks are tamper-proof and forgery-proof.
An exemplary functional architecture of a block chain network provided in the embodiment of the present invention is described below, referring to fig. 11, fig. 11 is a functional architecture schematic diagram of a block chain network 200 provided in the embodiment of the present invention, which includes an application layer 201, a consensus layer 202, a network layer 203, a data layer 204, and a resource layer 205, which are described below respectively.
The resource layer 205 encapsulates the computing, storage, and communication resources that implement each of the consensus nodes 210 in the blockchain network 200.
The data layer 204 encapsulates various data structures that implement the ledger, including blockchains implemented in files in a file system, state databases of the key-value type, and presence certificates (e.g., hash trees of transactions in blocks).
The network layer 203 encapsulates the functions of the peer-to-peer (P2P) network protocol, the data propagation mechanism and data verification mechanism, the access authentication mechanism, and business entity identity management.
Wherein, the P2P network protocol implements communication between the consensus nodes 210 in the blockchain network 200, the data propagation mechanism ensures propagation of transactions in the blockchain network 200, and the data verification mechanism is used for implementing reliability of data transmission between the consensus nodes 210 based on cryptography methods (e.g., digital certificates, digital signatures, public/private key pairs); the access authentication mechanism is used for authenticating the identity of the service subject added into the block chain network 200 according to an actual service scene, and endowing the service subject with the authority of accessing the block chain network 200 when the authentication is passed; the business entity identity management is used to store the identity of the business entity that is allowed to access blockchain network 200, as well as the permissions (e.g., the types of transactions that can be initiated).
The consensus layer 202 encapsulates the mechanisms by which the consensus nodes 210 in the blockchain network 200 agree on a block (i.e., a consensus mechanism), transaction management, and ledger management. The consensus mechanism comprises consensus algorithms such as POS, POW and DPOS, and the pluggable consensus algorithm is supported.
The transaction management is configured to verify a digital signature carried in the transaction received by the consensus node 210, verify identity information of the service entity, and determine whether the service entity has the right to perform the transaction according to the identity information (read related information from the identity management of the service entity); for the service agents authorized to access the blockchain network 200, the service agents all have digital certificates issued by the certificate authority, and the service agents sign the submitted transactions by using private keys in the digital certificates of the service agents, so that the legal identities of the service agents are declared.
The ledger administration is used to maintain the blockchain and the state database. For a block on which consensus has been reached, the block is appended to the tail of the blockchain; the transactions in the agreed block are executed, the key-value pairs in the state database are updated when a transaction includes an update operation, and the key-value pairs in the state database are queried and the result returned to the client node of the business entity when a transaction includes a query operation. Query operations over multiple dimensions of the state database are supported, including: querying a block according to the block sequence number (e.g., the hash value of the transaction); querying a block according to the block hash value; querying a block according to the transaction sequence number; querying a transaction according to the transaction sequence number; querying the account data of a business entity according to the business entity's account number; and querying the blockchain in a channel according to the channel name.
The application layer 201 encapsulates various services that the blockchain network can implement, including tracing, crediting, and verifying transactions.
Therefore, the copyright information of the target multimedia information identified by the similarity can be stored in the blockchain network, and when a new user uploads the multimedia information to the multimedia information server, the multimedia information server can call the copyright information in the blockchain network (at this moment, the target multimedia information uploaded by the user can be used as the source multimedia information) to verify the copyright compliance of the multimedia information.
Fig. 12 is a schematic diagram of a usage scenario of the multimedia information processing method according to an embodiment of the present invention, in which the multimedia information is a short video. A client of software capable of displaying the corresponding short video, such as a client or plug-in for playing short videos, is installed on the terminals (including the terminal 10-1 and the terminal 10-2), and a user can obtain and display the target video through the corresponding client. The terminals are connected to the short video server 200 through a network 300; the network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission. Of course, a user can also upload videos through the WeChat applet in the terminal for other users in the network to watch. In this process, the operator's video server needs to detect the videos uploaded by users, compare and analyse different video information, determine whether the copyright of the uploaded videos is compliant, recommend compliant videos to different users, and prevent users' short videos from being broadcast illegally.
The present invention provides an information processing method; the use process of the multimedia information processing method provided by the present invention is described below. Referring to fig. 13, fig. 13 is a schematic diagram of an optional use process of the multimedia information processing method in the embodiment of the present invention, which specifically includes the following steps:
step 1301: and acquiring audio information corresponding to the target short video, and preprocessing the audio information through a preprocessing process.
Step 1302: and acquiring a training sample set of the multimedia information processing model.
Step 1303: and training the multimedia information processing model and determining corresponding model parameters.
Step 1304: and deploying the trained multimedia information processing model in a corresponding video detection server.
Step 1305: and detecting the audio in different video information through a multimedia information processing model to determine whether the audio of the target short video is in compliance.
When the audio of a target short video is determined to be compliant, the copyright information of the target short video is acquired; for example, the video uploader uploads the corresponding copyright information through the WeChat applet running on the terminal 10-1, or provides the storage location of the video copyright information in a cloud server network. The legality of the target short video is then determined according to the copyright information of the target short video and the copyright information of the source video, and when the copyright information of the target short video is inconsistent with that of the source video, warning information is sent out. Meanwhile, when the target short video is determined to be dissimilar to the source video, the target short video is added to a video source, the recall sequences of all videos to be recommended in the video source are sorted, and videos are recommended to the target user based on the sorting result of the recall sequences, which facilitates the promotion of original videos.
The beneficial technical effects are as follows:
according to the embodiments of the invention, target multimedia information is obtained and parsed to separate the target audio contained in the multimedia information; the target audio is converted to form a Mel spectrogram matched with the time-domain and frequency-domain features of the target audio; a first audio feature vector corresponding to the target audio is determined, through a first sub-model network in a multimedia information processing model, based on the Mel spectrogram; a second audio feature vector corresponding to the target audio is determined, through a second sub-model network in the multimedia information processing model, based on the Mel spectrogram; and the type of the target audio in the target multimedia information is determined based on the first audio feature vector and the second audio feature vector. In this way, the type of the target audio in the target multimedia information can be identified automatically, which reduces the workload of manual review, improves the speed and accuracy of multimedia information review, and improves the user experience.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (15)

1. A method for processing multimedia information, the method comprising:
acquiring target multimedia information, and analyzing the target multimedia information to realize separation of target audio contained in the multimedia information;
converting the target audio to form a Mel frequency spectrogram matched with the time domain characteristics and the frequency domain characteristics of the target audio;
determining a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with time domain features and frequency domain features of the target audio through a first sub-model network in a multimedia information processing model;
determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a second sub-model network in a multimedia information processing model;
determining a type of a target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector.
2. The method of claim 1, wherein the obtaining the target multimedia information and parsing the target multimedia information to separate the target audio included in the target multimedia information comprises:
analyzing the target multimedia information to acquire time sequence information of the target multimedia information;
analyzing the video parameter corresponding to the target multimedia information according to the time sequence information of the target multimedia information, and acquiring a playing time length parameter and an audio track information parameter corresponding to the target multimedia information;
and extracting the target multimedia information based on the playing time length parameter and the audio track information parameter corresponding to the target multimedia information to obtain the target audio corresponding to the target multimedia information.
3. The method of claim 1, wherein the transforming the target audio to form a Mel frequency spectrogram matching time-domain features and frequency-domain features of the target audio comprises:
carrying out sound channel conversion processing on the target audio to form single-channel audio data;
performing short-time Fourier transform on the single-channel audio data based on a windowing function corresponding to a multimedia information processing model to form a corresponding spectrogram;
determining a duration parameter corresponding to the multimedia information processing model;
and processing the spectrogram according to the duration parameter to form a Mel spectrogram matched with the time domain characteristic and the frequency domain characteristic of the target audio.
4. The method of claim 1, wherein the determining, by a first sub-model network in the multimedia information processing model, a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matching time-domain features and frequency-domain features of the target audio comprises:
converting the Mel frequency spectrogram matched with the time domain characteristic and the frequency domain characteristic of the target audio frequency into corresponding gray level images;
extracting a feature vector of a Mel frequency spectrum diagram through a convolutional neural network in a first sub-model network in a multimedia information processing model according to the gray level image;
and processing the feature vector of the Mel frequency spectrogram through a gating circulation unit in a first sub-model network, and determining a first audio feature vector corresponding to the target audio.
5. The method of claim 4, wherein the processing the feature vectors of the Mel frequency spectrogram by a gating loop unit in a first submodel network to determine a first audio feature vector corresponding to the target audio comprises:
determining a number of channels of gated cyclic units in the first sub-model network based on the number of mel-frequency spectrograms;
determining time sequence parameters according to the time domain characteristics and the frequency domain characteristics of the target audio;
determining a recurrent neural network in the first submodel network based on the number of gated recurrent unit channels in the first submodel network and the time series parameter;
and determining a first audio feature vector corresponding to the target audio through a recurrent neural network in the first sub-model network.
6. The method of claim 1, wherein the determining, by a second sub-model network in the multimedia information processing model, a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matching time-domain features and frequency-domain features of the target audio comprises:
determining output information of an average pooling layer network through a residual network in a second sub-model network in the multimedia information processing model based on a Mel frequency spectrogram matched with time domain features and frequency domain features of the target audio;
adjusting parameters of an image classification network in the second sub-model network according to the output information of the average pooling layer network;
and determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through an image classification network in a second sub-model network.
7. The method of claim 1, further comprising:
establishing data storage mapping according to the information source of the target multimedia information;
and adjusting the file format of the target audio in response to the established data storage mapping so as to match the information source.
8. The method of claim 1, further comprising:
acquiring a first training sample set, wherein the first training sample set is an audio sample in video information acquired through a terminal;
noise adding the first training sample set to form a corresponding second training sample set;
processing the second training sample set through a multimedia information processing model to determine initial parameters of the multimedia information processing model;
responding to initial parameters of the multimedia information processing model, processing the second training sample set through the multimedia information processing model, and determining updating parameters of the multimedia information processing model;
and according to the updating parameters of the multimedia information processing model, iteratively updating the network parameters of the multimedia information processing model through the second training sample set.
9. The method of claim 8, wherein the noise adding the first set of training samples to form a corresponding second set of training samples comprises:
determining a dynamic noise type matched with the use environment of the multimedia information processing model;
and according to the dynamic noise type, adding noise to the first training sample set to change the background noise, the volume or the sampling rate of the audio samples in the first training sample set to form a corresponding second training sample set.
10. The method of claim 8, wherein the determining updated parameters of the multimedia information processing model by processing the second set of training samples with the multimedia information processing model in response to initial parameters of the multimedia information processing model comprises:
substituting different audio samples in the second training sample set into loss functions respectively corresponding to a first sub-model network and a second sub-model network of the multimedia information processing model;
determining parameters respectively corresponding to a first sub-model network and a second sub-model network in the multimedia information processing model when the loss function meets corresponding convergence conditions;
and taking the parameters respectively corresponding to the first sub-model network and the second sub-model network as the update parameters of the multimedia information processing model.
11. The method of claim 8, wherein iteratively updating the network parameters of the multimedia information processing model with the second set of training samples according to the updated parameters of the multimedia information processing model comprises:
determining convergence conditions respectively matched with a first sub-model network and a second sub-model network in the multimedia information processing model;
and iteratively updating the parameters respectively corresponding to the first sub-model network and the second sub-model network until the loss functions respectively corresponding to the first sub-model network and the second sub-model network meet the corresponding convergence conditions.
12. The method of claim 1, wherein the determining the type of the target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector comprises:
performing vector fusion processing on the first audio feature vector and the second audio feature vector;
determining a type of target audio in the target multimedia information based on a result of the vector fusion process, wherein the type of target audio includes at least one of:
compliant audio, accelerated change audio, and multi-layer soundtrack overlay audio.
13. A multimedia information processing apparatus, characterized in that the apparatus comprises:
the information transmission module is used for acquiring target multimedia information and analyzing the target multimedia information to realize separation of target audio contained in the multimedia information;
the information processing module is used for carrying out conversion processing on the target audio to form a Mel frequency spectrogram matched with the time domain characteristics and the frequency domain characteristics of the target audio;
the information processing module is used for determining a first audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a first sub-model network in a multimedia information processing model;
the information processing module is used for determining a second audio feature vector corresponding to the target audio based on a Mel frequency spectrogram matched with the time domain feature and the frequency domain feature of the target audio through a second sub-model network in a multimedia information processing model;
the information processing module is configured to determine a type of a target audio in the target multimedia information based on the first audio feature vector and the second audio feature vector.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the multimedia information processing method of any one of claims 1 to 12 when executing the executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the multimedia information processing method of any one of claims 1 to 12.
CN202110036472.2A 2021-01-12 2021-01-12 Multimedia information processing method and device, electronic equipment and storage medium Pending CN113539299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110036472.2A CN113539299A (en) 2021-01-12 2021-01-12 Multimedia information processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110036472.2A CN113539299A (en) 2021-01-12 2021-01-12 Multimedia information processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113539299A true CN113539299A (en) 2021-10-22

Family

ID=78124258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110036472.2A Pending CN113539299A (en) 2021-01-12 2021-01-12 Multimedia information processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113539299A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449339A (en) * 2022-02-16 2022-05-06 深圳万兴软件有限公司 Background sound effect conversion method and device, computer equipment and storage medium
CN114449339B (en) * 2022-02-16 2024-04-12 深圳万兴软件有限公司 Background sound effect conversion method and device, computer equipment and storage medium
CN115114475A (en) * 2022-08-29 2022-09-27 成都索贝数码科技股份有限公司 Audio retrieval method for matching short video sounds with music live original soundtracks
CN115292545A (en) * 2022-10-08 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method, device and equipment and readable storage medium
CN115292545B (en) * 2022-10-08 2022-12-20 腾讯科技(深圳)有限公司 Audio data processing method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN112104892B (en) Multimedia information processing method and device, electronic equipment and storage medium
CN110598651B (en) Information processing method, device and storage medium
CN111931678B (en) Video information processing method and device, electronic equipment and storage medium
CN113539299A (en) Multimedia information processing method and device, electronic equipment and storage medium
AU2019265827B2 (en) Blockchain-based music originality analysis method and apparatus
US11102162B2 (en) Systems and methods of facilitating live streaming of content on multiple social media platforms
CN110198432B (en) Video data processing method and device, computer readable medium and electronic equipment
US11323407B2 (en) Methods, systems, apparatuses, and devices for facilitating managing digital content captured using multiple content capturing devices
CN111739521A (en) Electronic equipment awakening method and device, electronic equipment and storage medium
CN113573161B (en) Multimedia data processing method, device, equipment and storage medium
CN112989186A (en) Information recommendation model training method and device, electronic equipment and storage medium
WO2023082830A1 (en) Video editing method and apparatus, computer device, and storage medium
CN112989074A (en) Multimedia information recommendation method and device, electronic equipment and storage medium
CN110602553B (en) Audio processing method, device, equipment and storage medium in media file playing
KR102199967B1 (en) Method for preventing falsification data from being stored in network and system performing the method
CN113192520B (en) Audio information processing method and device, electronic equipment and storage medium
CN111860597B (en) Video information processing method and device, electronic equipment and storage medium
US20200112755A1 (en) Providing relevant and authentic channel content to users based on user persona and interest
CN115310977A (en) Payment method, system, equipment and storage medium based on payment electronic system
CN112597321A (en) Multimedia processing method based on block chain and related equipment
CN117271832A (en) Cross-platform batch video number management method, system, equipment and medium
US9936230B1 (en) Methods, systems, and media for transforming fingerprints to detect unauthorized media content items
CN113762040A (en) Video identification method and device, storage medium and computer equipment
Sweet MEDIA AUTHENTICATION VIA BLOCKCHAIN
CN115329122A (en) Audio information processing method, audio information presenting method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40054031

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination