CN110598651A - Information processing method, device and storage medium


Info

Publication number
CN110598651A
CN110598651A
Authority
CN
China
Prior art keywords
video
target video
information
image
feature vector
Prior art date
Legal status
Granted
Application number
CN201910878129.5A
Other languages
Chinese (zh)
Other versions
CN110598651B (en)
Inventor
俄万有
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911296704.7A (CN110991391B)
Priority to CN201910878129.5A (CN110598651B)
Publication of CN110598651A
Application granted
Publication of CN110598651B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47: Detecting features for summarising video content

Abstract

The invention provides an information processing method, which comprises the following steps: acquiring a target video, and parsing the target video to obtain video parameters of the target video; extracting image feature vectors and voice feature vectors of the target video according to the video parameters of the target video; and performing content recognition on the target video based on the image feature vectors and the voice feature vectors to obtain text information matched with the target video. The invention also provides an information processing apparatus, an electronic device, and a storage medium. The method and apparatus can promptly and accurately convert video content into text in the corresponding natural language for output, effectively improve the speed at which video content is shared, expand the scenarios in which it can be shared, and solve the prior-art problem that video content could not be described in natural language through machine processing.

Description

Information processing method, device and storage medium
Technical Field
The present invention relates to information processing technologies, and in particular, to an information processing method, an information processing apparatus, and a storage medium.
Background
The demand for multimedia information has grown explosively, and traditional information processing technology cannot meet the labeling, description, and similar requirements of multimedia data. Describing videos, images, and the like in natural language is trivial for humans but very difficult for machines, which must correctly combine computer vision and natural language processing techniques to cross the semantic gap in understanding images. In recent years, many deep-learning-based methods and systems, such as deep attractor networks, have been developed to handle the separation and recognition of mixed speech signals. To this end, Artificial Intelligence (AI) technology provides a solution: training a suitable speech recognition network to support the above applications. Artificial intelligence is the theory, method, and technology of using a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. Applied artificial intelligence studies the design principles and implementation methods of various intelligent machines so that machines can perceive, reason, and make decisions; in the field of speech processing, it enables speech recognition by a digital computer or a machine controlled by one. Research in this direction has received extensive attention and has found effective application in fields such as security, smart home, healthcare, and education. In this process, however, the distribution of video files still suffers from long playback times, little information conveyed per unit time, and inefficient content sharing.
Disclosure of Invention
In view of this, embodiments of the present invention provide an information processing method, an information processing apparatus, and a storage medium, which can solve the conventional-technology problem that video content cannot be described in natural language through machine processing, and can promptly and accurately convert video content into text in the corresponding natural language for output.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an information processing method, which comprises the following steps:
acquiring a target video, and analyzing the target video to acquire video parameters of the target video;
extracting image characteristic vectors and voice characteristic vectors of the target video according to the video parameters of the target video;
performing content identification on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video, wherein the text information is used for describing the content of the target video through a natural language;
and outputting the text information matched with the target video.
An embodiment of the present invention further provides an information processing apparatus, including:
the information transmission module is used for acquiring a target video;
the information processing module is used for analyzing the target video to acquire video parameters of the target video;
the information processing module is used for extracting image feature vectors and voice feature vectors of the target video according to the video parameters of the target video;
the information processing module is used for performing content identification on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video, wherein the text information is used for describing the content of the target video through a natural language;
and the information transmission module is used for outputting the text information matched with the target video.
In the above scheme, the information transmission module is further configured to send the target video and the corresponding text information matched with the target video to a blockchain network, so that the nodes of the blockchain network fill the target video and the corresponding matched text information into a new block and, when consensus is reached on the new block, append the new block to the tail of the blockchain.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and a processor, configured to implement the aforementioned information processing method when executing the executable instructions stored in the memory.
The embodiment of the invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the aforementioned information processing method.
The embodiment of the invention has the following beneficial effects: a target video is acquired and parsed to obtain its video parameters; image feature vectors and voice feature vectors of the target video are extracted according to those video parameters; content recognition is performed on the target video based on the image feature vectors and voice feature vectors to obtain text information matched with the target video, the text information describing the content of the target video in natural language; and the matched text information is output. Video content is thus promptly and accurately converted into text in the corresponding natural language for output, the speed of sharing video content is effectively improved, the scenarios in which it can be shared are expanded, and the prior-art problem that video content could not be described in natural language through machine processing is solved.
Drawings
FIG. 1 is a schematic diagram of a usage scenario of an information processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a component structure of an information processing apparatus according to an embodiment of the present invention;
FIG. 3 is an optional flowchart of an information processing method according to an embodiment of the present invention;
FIG. 4 is an optional flowchart of an information processing method according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an optional processing procedure of the image semantic recognition network according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an optional processing procedure of the speech semantic recognition network according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an optional procedure for semantic integration processing according to an embodiment of the present invention;
FIG. 8 is a schematic architecture diagram of an information processing apparatus 100 according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a blockchain in the blockchain network 200 according to an embodiment of the present invention;
FIG. 10 is a functional architecture diagram of a blockchain network 200 according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of an optional usage environment of the information processing method according to an embodiment of the present invention;
FIG. 12 is a block diagram illustrating an information processing method according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a usage process of the information processing method according to an embodiment of the present invention;
FIG. 14 is a schematic diagram of a usage process of the information processing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) "In response to": indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may occur in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are performed.
2) Target video: video information in the various forms available on the Internet, such as video files and multimedia information presented in a client or smart device.
3) Hidden Markov Model (HMM): a statistical model that describes a Markov process with hidden, unknown parameters. In a hidden Markov model, the states are not directly visible, but some variables affected by the states are visible. The states are the basic components of the HMM; the transition probabilities of the HMM represent the probability of transitions between its states; and each state has a probability distribution over the symbols it may output, i.e., the output probabilities of the HMM. A Markov process is a memoryless stochastic process: given the current state and all past states, the conditional probability distribution of its future states depends only on the current state.
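As a concrete illustration of these quantities, the following is a minimal sketch of the forward algorithm over a two-state HMM; the transition matrix, output matrix, and initial distribution are illustrative toy values, not taken from the patent.

```python
# Minimal HMM sketch: states, transition probabilities A, output probabilities B.
import numpy as np

A = np.array([[0.7, 0.3],   # A[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],   # B[i, o] = P(output symbol o | state i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])   # initial state distribution

def forward(observations):
    """Likelihood of an observed symbol sequence under the HMM."""
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by output probability
    return alpha.sum()

print(forward([0, 1, 0]))  # probability of observing symbols 0, 1, 0
```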
4) Gaussian Mixture Model (GMM): a model that quantizes an object accurately using Gaussian probability density functions (normal distribution curves), decomposing the object into a weighted combination of several components, each formed from a Gaussian probability density function.
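For illustration, here is a minimal sketch of a one-dimensional Gaussian mixture density as just defined; the weights, means, and standard deviations are toy assumptions.

```python
# Minimal GMM sketch: the density is a weighted sum of Gaussian densities.
import numpy as np

weights = np.array([0.6, 0.4])   # mixture weights, summing to 1
means   = np.array([0.0, 3.0])
stds    = np.array([1.0, 0.5])

def gmm_pdf(x: float) -> float:
    """Density of x under the mixture: sum_k w_k * N(x; mu_k, sigma_k^2)."""
    comps = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(np.dot(weights, comps))

print(gmm_pdf(1.0))
```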
5) Convolutional Neural Network (CNN): a class of feedforward neural networks that contain convolution computations and have a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have a representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure.
6) Model training: multi-class learning on an image data set. The model can be built with deep learning frameworks such as TensorFlow or PyTorch, combining multiple neural network layers such as CNN layers into a multi-class model. The input of the model is a three-channel or original-channel matrix produced by reading an image with a tool such as OpenCV; the output of the model is the multi-class probabilities, and the final category is output through an algorithm such as softmax. During training, the model is driven toward correct predictions through an objective function such as cross-entropy.
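The following is a minimal PyTorch sketch of such a training step under the assumptions above (a small CNN, softmax probabilities, a cross-entropy objective); the architecture, sizes, and random stand-in data are illustrative, not the patent's actual model.

```python
# Minimal multi-class training sketch: CNN layers -> logits -> cross-entropy.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # three-channel image input
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),                           # ten hypothetical categories
)
criterion = nn.CrossEntropyLoss()                # cross-entropy objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(8, 3, 64, 64)               # stand-in for images read via OpenCV
labels = torch.randint(0, 10, (8,))

logits = model(images)
loss = criterion(logits, labels)                 # drives the model toward correct output
optimizer.zero_grad()
loss.backward()
optimizer.step()

probs = logits.softmax(dim=-1)                   # multi-class probabilities
```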
7) Transaction: equivalent to the computer term "transaction". It covers the operations that need to be submitted to the blockchain network for execution and does not refer solely to commercial transactions; embodiments of the present invention follow this convention, which is colloquially established in blockchain technology.
8) Blockchain: a storage structure of encrypted, chained transactions, formed from blocks.
9) Blockchain Network: the set of nodes that incorporates new blocks into a blockchain by consensus.
10) Ledger: the collective term for the blockchain (also called ledger data) and the state database synchronized with the blockchain.
11) Smart Contracts, also known as chaincodes or application code: programs deployed in the nodes of a blockchain network; the nodes execute the smart contracts called in received transactions to update or query the key-value data of the state database.
12) Consensus: a process in a blockchain network used to reach agreement on the transactions in a block among the multiple nodes involved; the agreed block is appended to the tail of the blockchain. Mechanisms for achieving consensus include Proof of Work (PoW), Proof of Stake (PoS), Delegated Proof of Stake (DPoS), Proof of Elapsed Time (PoET), and so on.
Fig. 1 is a schematic diagram of a usage scenario of an information processing method according to an embodiment of the present invention. Referring to fig. 1, each terminal (including the terminal 10-1 and the terminal 10-2) is provided with a client capable of displaying the corresponding target video, such as a video-playback client or plug-in, through which a user may obtain and display the target video. The terminal is connected to the server 200 through a network 300, which may be a wide area network, a local area network, or a combination of the two, using wireless links for data transmission.
As an example, the server 200 is configured to deploy the information processing apparatus and implement the information processing method provided by the present invention: acquiring a target video and parsing it to obtain its video parameters; extracting image feature vectors and voice feature vectors of the target video according to those video parameters; and performing content recognition on the target video based on the image feature vectors and voice feature vectors to obtain text information matched with the target video, the text information describing the content of the target video in natural language and being displayed and output through a terminal (the terminal 10-1 and/or the terminal 10-2). The information processing apparatus provided by the invention can of course be applied to video playback, where target videos from different data sources are usually processed and the text information matched with each target video is finally presented on a User Interface (UI), so that the accuracy and timeliness of the text information directly affect the user experience. The background database for video playback receives a large amount of video data from different sources every day, and the resulting text information matched with a target video can be called by other applications.
Specifically, the processing of the target video by the information processing apparatus to obtain the matched text information includes: acquiring a target video, and parsing the target video to obtain its video parameters; extracting image feature vectors and voice feature vectors of the target video according to those video parameters; performing content recognition on the target video based on the image feature vectors and voice feature vectors to obtain text information matched with the target video, the text information describing the content of the target video in natural language; and outputting the text information matched with the target video.
The structure of the information processing apparatus according to the embodiment of the present invention is described in detail below. The information processing apparatus may be implemented in various forms, such as a dedicated terminal with the processing functions of the information processing apparatus, or a server provided with those processing functions, for example the server 200 in the foregoing fig. 1. Fig. 2 is a schematic diagram of the composition structure of an information processing apparatus according to an embodiment of the present invention; it is understood that fig. 2 shows only an exemplary structure, not the whole structure, and that part or all of the structure shown in fig. 2 may be implemented as needed.
The information processing apparatus provided by the embodiment of the invention includes: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the information processing apparatus are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the information processing apparatus provided in the embodiments of the present invention may be implemented by a combination of hardware and software, and by way of example, the information processing apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the information processing method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of the information processing apparatus provided by the embodiment of the present invention implemented by combining software and hardware, the information processing apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and completes the information processing method provided by the embodiment of the present invention in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205).
By way of example, the processor 201 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, discrete gate or transistor logic, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
As an example of the information processing apparatus provided by the embodiment of the present invention implemented by hardware, the apparatus provided by the embodiment of the present invention may be implemented by directly using the processor 201 in the form of a hardware decoding processor, for example, by being executed by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components, to implement the information processing method provided by the embodiment of the present invention.
The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the information processing apparatus. Examples of such data include any executable instructions for operating on the information processing apparatus; a program implementing the information processing method of the embodiment of the present invention may be contained in those executable instructions.
In other embodiments, the information processing apparatus provided in the embodiments of the present invention may be implemented in software. Fig. 2 shows the information processing apparatus stored in the memory 202, which may be software in the form of programs, plug-ins, and the like, and which includes a series of modules; as an example of the programs stored in the memory 202, the information processing apparatus may include the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the information processing apparatus are read into RAM by the processor 201 and executed, the information processing method provided by the embodiment of the present invention is implemented. The functions of each software module in the information processing apparatus include:
the information transmission module 2081, which is used for obtaining a target video;
the information processing module 2082 is used for analyzing the target video to obtain video parameters of the target video;
the information processing module 2082 is configured to extract an image feature vector and a voice feature vector of the target video according to the video parameter of the target video;
the information processing module 2082 is configured to perform content identification on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video, where the text information is used to describe the content of the target video through a natural language;
the information transmission module 2081 is configured to output text information matched with the target video.
Referring to fig. 3, fig. 3 is an optional flowchart of the information processing method provided by the embodiment of the present invention. It can be understood that the steps shown in fig. 3 may be executed by various electronic devices running the information processing apparatus, such as a server or server cluster equipped with the information processing apparatus; a dedicated terminal with the information processing apparatus may also be packaged in the server shown in fig. 1 to execute the corresponding software modules of the information processing apparatus shown in the foregoing fig. 2. The steps shown in fig. 3 are described below.
Step 301: and acquiring a target video, and analyzing the target video to acquire video parameters of the target video.
In some embodiments of the present invention, parsing the target video to obtain the video parameters of the target video may be implemented by:
parsing the target video to obtain timing information of the target video; and parsing the video information corresponding to the target video according to that timing information to obtain the playing-duration parameter and storage-location parameter corresponding to the target video. The acquired timing information can be used to decompose the target video into video image frames and the corresponding audio file. Because the source of the target video is uncertain (it may be a video resource on the Internet or a local video file stored by the electronic device), obtaining the playing-duration parameter and storage-location parameter means that, when the original target video is stored in a corresponding blockchain network, these parameters can be stored along with it, enabling the provenance of the target video to be traced.
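As an illustration, a playing-duration parameter of this kind can be derived from a video file's frame rate and frame count; the sketch below assumes OpenCV, and the helper name is hypothetical.

```python
# Minimal sketch: parse a target video for timing/duration parameters (OpenCV).
import cv2

def parse_target_video(path: str):
    """Return (fps, frame_count, duration_seconds, storage_path) for a video file."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    duration = frame_count / fps if fps else 0.0  # playing-duration parameter
    return fps, frame_count, duration, path       # path as a storage-location parameter
```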
Step 302: and extracting the image characteristic vector and the voice characteristic vector of the target video according to the video parameter of the target video.
In some embodiments of the present invention, extracting the image feature vector and the voice feature vector of the target video according to the video parameter of the target video may be implemented by:
in response to the timing information of the target video, processing the image frames included in the target video through an image classification network according to the playing-duration parameter and storage-location parameter corresponding to the target video, to form image feature vectors matched with the timing information of the target video; and, in response to that same timing information, processing the audio information included in the target video through a convolutional neural network for voice classification according to the playing-duration parameter and storage-location parameter, to form voice feature vectors matched with the timing information of the target video. The target video is composed of at least one image frame; processing the image frames through the image classification network yields the image feature vectors to be recognized, and the voice-classification convolutional neural network yields the voice feature vectors to be recognized. Further, the target video may be a video stored in the storage medium of the local terminal, or a video stored across different servers in different networks in a distributed manner (or in streaming-media form, in which case the terminal must splice the segmented streaming-media videos into the complete target video). By responding to the timing information of the target video and processing it according to the corresponding playing-duration and storage-location parameters, the image classification network can form image feature vectors and voice feature vectors matched with the timing information of the target video, avoiding the loss of feature vectors that would otherwise occur when video frames are lost for lack of timing information.
The extracted voice features may include sound intensity, loudness, pitch period, pitch frequency, signal-to-noise ratio, and harmonic-to-noise ratio. Sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of propagation of the sound wave; it is denoted by I, measured in watts per square meter, and commonly expressed as a sound intensity level, whose usual unit is the decibel (dB). Loudness is a measure of how strong or weak a sound is, expressed as a loudness level. Pitch is the human auditory system's perception of how high or low sound frequencies are; its unit is the mel. The pitch period reflects the time interval between two successive openings and closings of the glottis, or equivalently the frequency of that opening and closing. The signal-to-noise ratio is calculated as the ratio between the power of the signal and that of the noise. The harmonic-to-noise ratio characterizes the ratio of the harmonic components to the noise components in the speech.
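For illustration, two of these quantities can be computed per audio frame as follows; the frame conventions and the autocorrelation pitch search range are assumptions, not prescribed by the patent.

```python
# Minimal sketch: sound intensity level (dB) and pitch period of one audio frame.
import numpy as np

def intensity_db(frame: np.ndarray, eps: float = 1e-12) -> float:
    """Sound intensity level in decibels, relative to full scale."""
    power = np.mean(frame ** 2)
    return 10.0 * np.log10(power + eps)

def pitch_period(frame: np.ndarray, sr: int, fmin: int = 50, fmax: int = 400) -> float:
    """Pitch period in seconds, estimated from the autocorrelation peak."""
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)       # plausible lag range for speech
    lag = lo + int(np.argmax(corr[lo:hi]))
    return lag / sr                               # pitch frequency = 1 / pitch period
```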
Step 303: and performing content recognition on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video.
Wherein the text information is used for describing the content of the target video through a natural language.
Referring to fig. 4, fig. 4 is an optional flowchart of the information processing method provided by the embodiment of the present invention. It is understood that the steps shown in fig. 4 may be executed by an electronic device running the information processing apparatus, such as a server or server cluster with the information processing function; a dedicated terminal with the information processing apparatus may also be packaged in the server shown in fig. 1 to execute the corresponding software modules of the information processing apparatus shown in the foregoing fig. 2. Performing content recognition on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video specifically comprises the following steps:
step 3031: and carrying out semantic recognition processing on the image feature vector through an image semantic recognition network to form an image semantic recognition result matched with the image feature vector.
Here, the server may receive corresponding requests from terminal A and terminal B, respectively, to serve their requests for the corresponding text information.
In some embodiments of the present invention, the semantic recognition processing is performed on the image feature vector through an image semantic recognition network to form an image semantic recognition result matched with the image feature vector, which may be implemented by:
performing semantic recognition processing on the image feature vector through a recurrent convolutional neural network based on an attention mechanism to form a corresponding first image semantic recognition result; performing semantic recognition processing on the image feature vector through a convolutional neural network based on a memory mechanism to form a corresponding second image semantic recognition result; and performing semantic integration processing on the first and second image semantic recognition results to form an image semantic recognition result matched with the image feature vector. Fusing the first result, formed by the attention-based recurrent convolutional neural network, with the second result, formed by the memory-based convolutional neural network, can effectively improve the accuracy of the image semantic recognition result.
Referring to fig. 5, fig. 5 is a schematic diagram of an optional processing procedure of the image semantic recognition network in the embodiment of the present invention. An encoder may include a convolutional neural network; after an image feature vector is input into the encoder, a frame-level video feature vector corresponding to the image feature vector is output. Specifically, the image feature vector is input into the encoder, that is, into the convolutional neural network in the encoder; the convolutional neural network extracts the frame-level video feature vector corresponding to the image feature vector and outputs it as the output of the encoder, and the corresponding image semantic recognition is then performed using the video feature vector output by the encoder. Alternatively, the encoder may include both a convolutional neural network and a recurrent neural network; after the image feature vector is input into the encoder, a frame-level video feature vector carrying timing information corresponding to the image feature vector is output, as shown by the encoder in fig. 5. Specifically, the image feature vector is input into the convolutional neural network in the encoder (e.g., the CNN in fig. 5), which extracts and outputs the frame-level video feature vector; the recurrent neural network in the encoder (corresponding to h_{i-1}, h_i, etc. in fig. 5) then extracts and fuses timing information from the convolutional features, and outputs the video feature vector carrying timing information as the output of the encoder, after which the corresponding processing steps are performed using that vector.
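The following is a minimal PyTorch sketch of an encoder of the second kind described above: a CNN extracts a frame-level feature vector per image frame, and a recurrent network (the h_{i-1}, h_i chain in fig. 5) fuses timing information across frames. All dimensions are illustrative assumptions.

```python
# Minimal sketch: CNN per frame + GRU over frames = encoder with timing information.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # frame-level vectors
        outputs, _ = self.rnn(feats)              # fuses timing information
        return outputs                            # (B, T, hidden_dim)

enc = VideoEncoder()
video = torch.randn(2, 8, 3, 64, 64)              # two clips of eight frames
print(enc(video).shape)                           # torch.Size([2, 8, 256])
```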
Step 3032: and performing semantic recognition processing on the voice feature vector through a voice semantic recognition network to form a voice semantic recognition result matched with the voice feature vector.
Referring to fig. 6, fig. 6 is a schematic diagram of an optional processing procedure of the speech semantic recognition network according to the embodiment of the present invention, where the first layer is the speech frame data, the second layer is the DNN model, and the third layer is the HMM model. S represents an HMM state in the HMM model; a denotes a transition probability in the HMM model, with a_{k-1,k} denoting the probability of transitioning from state s_{k-1} to state s_k; h represents a neuron in the DNN model; w represents a weight in the DNN model; and M represents the number of layers of the DNN model. Each h computes a function: in the first layer, its inputs are one or more frames of data and their respective weights; from the second layer to the last, its inputs are the outputs of the previous layer and the weight corresponding to each output. Each DNN output corresponds to the output probability of one HMM state, and each DNN output corresponds to a speech frame; the processing shown in fig. 6 thus forms a speech semantic recognition result matched with the voice feature vector.
In some embodiments of the present invention, the information processing apparatus may calculate the matrix weight values and matrix bias values between output-layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state. Preferably, the information processing apparatus may extract the training Filter bank feature corresponding to each frame of audio data in the training audio corpus based on the foregoing method, and use each training Filter bank feature and its corresponding posterior probability feature as a training sample pair, so that the training audio corpus yields a plurality of training sample pairs; the matrix weight values and matrix bias values between output-layer nodes in the DNN model are then calculated from these training sample pairs using a back-propagation algorithm under the maximum likelihood criterion. The information processing apparatus adds the matrix weight values and matrix bias values to the DNN model to generate a trained DNN model that can be packaged in a corresponding software module or solidified in corresponding hardware, so that a user can use the DNN model directly without model training.
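In the hybrid DNN-HMM setup this implies, the DNN's posterior probabilities over phoneme states are typically converted into the scaled likelihoods the HMM uses as output probabilities by dividing out the state priors; the sketch below illustrates that standard step, with toy priors as assumptions.

```python
# Minimal sketch: DNN state posteriors -> scaled log-likelihoods for the HMM.
import numpy as np

def scaled_log_likelihoods(posteriors: np.ndarray, state_priors: np.ndarray):
    """log p(x | s) up to a constant: log p(s | x) - log p(s), per frame and state."""
    return np.log(posteriors + 1e-12) - np.log(state_priors + 1e-12)

posteriors = np.array([[0.7, 0.2, 0.1]])   # one frame, three phoneme states (from DNN)
priors = np.array([0.5, 0.3, 0.2])         # state priors counted from training alignments
print(scaled_log_likelihoods(posteriors, priors))
```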
Step 3033: and performing semantic integration processing on the image semantic recognition result and the voice semantic recognition result to form text information matched with the target video.
Referring to fig. 7, fig. 7 is a schematic diagram of an optional procedure for semantic integration processing according to an embodiment of the present invention. The dual-stream long short-term memory network may include a bidirectional vector model, an attention model, fully connected layers, and a sigmoid classifier. The bidirectional vector model recursively processes the RGB image feature vectors and the voice feature vectors of the input video to be recognized; the attention model then combines the recursively processed image feature vectors into one longer vector and the voice feature vectors into another, and the two combined vectors are joined again into a single longer vector. Two fully connected layers then map the learned distributed feature representation to the corresponding sample label space, improving the accuracy of the final classification, and the sigmoid classifier finally determines the probability value of the video to be recognized for each label. In this way, semantic integration processing is performed on the image semantic recognition result and the voice semantic recognition result to form text information matched with the target video.
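The following is a minimal PyTorch sketch of that dual-stream fusion: each modality passes through a bidirectional LSTM, the two streams are concatenated into one longer vector, and two fully connected layers feed a sigmoid over labels. The dimensions and the mean-pooling step stand in for details the text leaves open.

```python
# Minimal sketch: dual-stream BiLSTM fusion with a sigmoid label classifier.
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    def __init__(self, img_dim=128, aud_dim=64, hidden=128, n_labels=20):
        super().__init__()
        self.img_lstm = nn.LSTM(img_dim, hidden, batch_first=True, bidirectional=True)
        self.aud_lstm = nn.LSTM(aud_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(                  # two fully connected layers
            nn.Linear(4 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_labels),
        )

    def forward(self, img_seq, aud_seq):
        img, _ = self.img_lstm(img_seq)           # (B, T_img, 2*hidden)
        aud, _ = self.aud_lstm(aud_seq)           # (B, T_aud, 2*hidden)
        fused = torch.cat([img.mean(1), aud.mean(1)], dim=-1)  # one longer vector
        return torch.sigmoid(self.fc(fused))      # per-label probability values

model = DualStreamFusion()
probs = model(torch.randn(2, 8, 128), torch.randn(2, 20, 64))
print(probs.shape)                                # torch.Size([2, 20])
```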
In some embodiments of the invention, the method further comprises:
extracting video samples from the target video, each video sample comprising at least two video image frames; extracting at least two local features from each video image frame in the video sample; performing feature integration processing on the local features of each video image frame according to the weight parameter corresponding to each local feature, to obtain the local feature vector of each video image frame; performing content recognition on the video sample according to the local feature vector of each video image frame and the timing information of the images in the video sample, to obtain the video content information of the target video, the video content information comprising a video content description and the part-of-speech information of that description; and updating the parameters of the image semantic recognition network according to the video content information and the real video content information of the video sample.
In some embodiments of the present invention, data preprocessing may further be performed on the training audio corpus. The preprocessing may include: framing, pre-emphasis, and windowing of the training audio corpus to obtain at least one frame of audio data in the time domain; a fast Fourier transform converting that audio data into the frequency domain to obtain the corresponding power spectrum data; passing the power spectrum data through Mel-frequency filters with triangular filtering characteristics to obtain Mel power spectrum data; and taking the log energy of the Mel power spectrum data to obtain Mel log-energy spectrum data, whose data correlation is removed by a DCT to obtain the MFCC features. The MFCC features are used as the DNN input vectors to train the DNN and the HMM, yielding the likelihood probability feature of each phoneme state output by the trained DNN and the phoneme transition probabilities of the trained HMM. It can be understood that, for the same frame of audio data in the training audio corpus, there is a one-to-one correspondence between the Filter bank features and the MFCC features.
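The following is a minimal numpy sketch of that preprocessing chain (framing, pre-emphasis, windowing, FFT power spectrum, triangular Mel filtering, log energy, DCT); the frame length, hop, and filter counts are illustrative assumptions, not the patent's parameters.

```python
# Minimal sketch: audio signal -> MFCC features via the chain described above.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)                     # framing + windowing
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft        # FFT power spectrum
    # triangular Mel filter bank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_mels + 2)
    bins = np.floor((n_fft + 1) * (700 * (10 ** (mel_pts / 2595) - 1)) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)                     # Mel log-energy spectrum
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_ceps] # DCT decorrelates -> MFCC
```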
Step 304: and outputting the text information matched with the target video.
Here, terminal A may receive the text information output by the server and display the corresponding text information A, and terminal B may likewise receive the text information output by the server and display the corresponding text information B; alternatively,
the terminals a and B may receive the text information output by the server and display the same text information.
Wherein the text information may have a global part of speech.
In some embodiments of the present invention, the content output by the electronic device may be adjusted according to a received user instruction; when the user can only view text, the text information matched with the target video is output, adapting to different user requirements.
It should be noted that the global part of speech of the text information matched with the target video in the present invention refers to the parts of speech of the natural-language description information of the video to be described; the global part-of-speech vector is the vector combining those parts of speech, and the global part-of-speech vector feature is the feature of that combined vector. A part of speech is an attribute of a word or phrase, and different languages define different parts of speech. By way of example, Chinese includes, but is not limited to, nouns, verbs, adjectives, quantifiers, adverbs, and prepositions; English includes, but is not limited to, nouns, verbs, gerunds, adjectives, adverbs, articles, and prepositions; other languages may include other parts of speech, which are not detailed here. A part-of-speech vector is relative to a sentence described in natural language: a sentence is usually composed of two or more words, and the part-of-speech vector feature is the combination of the part-of-speech features of each word in the sentence.
In some embodiments of the present invention, to facilitate the storage and querying of text information once text information matching a target video has been formed, the target video and its matched text information may be sent to a blockchain network, so that the nodes of the blockchain network fill them into a new block and, when consensus is reached on the new block, append the new block to the tail of the blockchain.
Referring to fig. 8, fig. 8 is a schematic architecture diagram of an information processing apparatus 100 according to an embodiment of the present invention, which includes a blockchain network 200 (with consensus nodes 210-1 to 210-3 shown as an example), a certificate authority 300, a business entity 400, and a business entity 500, each described below.
The type of the blockchain network 200 is flexible; it may be, for example, any of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, electronic devices such as user terminals and servers of any business entity can access the blockchain network 200 without authorization; taking a consortium chain as an example, an electronic device (e.g., a terminal or server) under the jurisdiction of a business entity can access the blockchain network 200 after obtaining authorization, at which point it becomes a client node in the blockchain network 200.
In some embodiments, a client node may act merely as an observer of the blockchain network 200, i.e., provide the functionality that supports a business entity in initiating transactions (e.g., for uplink storage of data or querying of on-chain data), and may implement the functions of a consensus node 210 of the blockchain network 200, such as ranking, consensus service, and ledger functions, by default or selectively (e.g., depending on the specific business requirements of the business entity). Therefore, the data and the business processing logic of the business entity can be migrated into the blockchain network 200 to the maximum extent, and credibility and traceability of the data and of business processing are achieved through the blockchain network 200.
Consensus nodes in the blockchain network 200 receive transactions submitted by the client nodes of different business entities (e.g., the client node 410 belonging to the business entity 400 and the client node 510 belonging to the business entity 500, both shown in fig. 8), execute the transactions to update or query the ledger, and may return the various intermediate or final results of executing the transactions for display in the business entities' client nodes.
For example, the client node 410/510 may subscribe to events of interest in the blockchain network 200, such as transactions occurring in a particular organization/channel in the blockchain network 200, and the corresponding transaction notifications are pushed by the consensus node 210 to the client node 410/510, thereby triggering the corresponding business logic in the client node 410/510.
An exemplary application of the blockchain network is described below, taking as an example a plurality of business entities accessing the blockchain network to manage target videos and their corresponding text information.
Referring to fig. 8, a plurality of business entities are involved in the management chain; for example, the business entity 400 may be an artificial-intelligence-based information processing apparatus and the business entity 500 may be a display system with a video playback function. Each registers with the certificate authority 300 to obtain its own digital certificate, which includes the business entity's public key and the digital signature issued by the certificate authority 300 over that public key and the business entity's identity information. The digital certificate, together with the business entity's digital signature over a transaction, is attached to the transaction and sent to the blockchain network, so that the blockchain network can take the digital certificate and signature out of the transaction, verify the reliability of the message (i.e., that it has not been tampered with) and the identity of the business entity sending it, and perform verification according to that identity, for example whether the business entity has the right to initiate the transaction. A client running on an electronic device (e.g., a terminal or server) hosted by the business entity may request access to the blockchain network 200 to become a client node.
The client node 410 of the business entity 400 is configured to obtain a target video and parse it to obtain its video parameters; extract the image feature vectors and voice feature vectors of the target video according to those video parameters; perform content recognition on the target video based on the image feature vectors and voice feature vectors to obtain text information matched with the target video, the text information describing the content of the target video in natural language; and output the text information matched with the target video while sending the target video and the corresponding text information to the blockchain network 200.
To send the target video and the corresponding text information to the blockchain network 200, business logic may be set in the client node 410 in advance so that, when the corresponding text information is formed, the client node 410 automatically sends the target video and the corresponding text information to the blockchain network 200; alternatively, a business person of the business entity 400 may log in to the client node 410 and manually package and send them. When sending, the client node 410 generates a transaction corresponding to the update operation from the target video and the corresponding text information, specifies in the transaction the smart contract that must be invoked to implement the update operation and the parameters passed to it, carries the digital certificate of the client node 410 and a signed digital signature (for example, a digest of the transaction encrypted using the private key in the digital certificate of the client node 410), and broadcasts the transaction to the consensus nodes 210 in the blockchain network 200.
When a consensus node 210 in the blockchain network 200 receives the transaction, it verifies the digital certificate and digital signature carried by the transaction; after that verification succeeds, it determines whether the business entity 400 has the transaction authority according to the identity of the business entity 400 carried in the transaction. If either the digital-signature verification or the authority verification fails, the transaction fails. After successful verification, the consensus node 210 appends its own digital signature (for example, by encrypting a digest of the transaction using the private key of the node 210-1) and continues to broadcast the transaction in the blockchain network 200.
After the consensus nodes 210 in the blockchain network 200 receive the successfully verified transaction, they fill the transaction into a new block and broadcast the block. When a consensus node 210 broadcasts a new block, the other nodes perform a consensus process on it; if the consensus succeeds, each node appends the new block to the tail of the blockchain it stores, updates the state database according to the transaction result, and executes the transactions in the new block: for a transaction submitting an updated target video with corresponding text information, a key-value pair comprising the target video and the corresponding text information is added to the state database.
A business person of the business entity 500 logs in to the client node 510 and inputs a target video or a text information query request. The client node 510 generates a transaction corresponding to the update or query operation from that request, specifies in the transaction the smart contract that must be invoked to implement the operation and the parameters passed to it, and broadcasts the transaction to the consensus nodes 210 in the blockchain network 200; the transaction also carries the digital certificate of the client node 510 and a signed digital signature (for example, a digest of the transaction encrypted using the private key in the digital certificate of the client node 510).
After a consensus node 210 in the blockchain network 200 receives the transaction, it verifies the transaction, fills a block, and reaches consensus; the filled new block is then appended to the tail of the blockchain the node stores, the state database is updated according to the transaction result, and the transactions in the new block are executed: for a submitted transaction updating the manual identification result of a certain target video, the key-value pair corresponding to that target video in the state database is updated according to the manual identification result; for a submitted transaction querying a certain target video, the key-value pair corresponding to that target video is queried from the state database and the transaction result is returned.
It is noted that fig. 8 illustrates the process of directly putting the target video and the corresponding text information on the chain, but in other embodiments, when the data size of the target video is large, the client node 410 may put the hash of the target video and the hash of the corresponding text information on the chain in pairs and store the original target video and text information in a distributed file system or database. After the client node 510 obtains the target video and the corresponding text information from the distributed file system or database, it can verify them against the corresponding hashes in the blockchain network 200, thereby reducing the workload of the uplink operation.
As an example of a blockchain, referring to fig. 9, fig. 9 is a schematic structural diagram of a blockchain in the blockchain network 200 according to an embodiment of the present invention. The header of each block may include the hash values of all transactions in the block and also the hash values of all transactions in the previous block; records of newly generated transactions are filled into a block and, after consensus by the nodes in the blockchain network, the block is appended to the tail of the blockchain, forming chained growth. The hash-based chain structure between blocks ensures that the transactions in the blocks are tamper-proof and forgery-proof.
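A minimal sketch of this hash-linked structure, with illustrative field names (the patent does not prescribe a concrete layout):

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class Block:
    transactions: list   # records of newly generated transactions
    prev_tx_hash: str    # hash over the previous block's transactions
    tx_hash: str = ""    # hash over this block's transactions

    def __post_init__(self):
        # The header commits to all transactions in the block.
        payload = json.dumps(self.transactions, sort_keys=True).encode()
        self.tx_hash = hashlib.sha256(payload).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    """Append a new block whose header also references the previous
    block's transaction hash; altering any past block breaks the link."""
    prev = chain[-1].tx_hash if chain else "0" * 64
    chain.append(Block(transactions=transactions, prev_tx_hash=prev))

# Usage: grow a chain of two blocks and observe the hash linkage.
chain: list = []
append_block(chain, [{"op": "update", "key": "video-1", "value": "text-1"}])
append_block(chain, [{"op": "query", "key": "video-1"}])
assert chain[1].prev_tx_hash == chain[0].tx_hash
```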
An exemplary functional architecture of the blockchain network provided in the embodiment of the present invention is described below. Referring to fig. 10, fig. 10 is a schematic functional architecture diagram of the blockchain network 200 provided in the embodiment of the present invention, which includes an application layer 201, a consensus layer 202, a network layer 203, a data layer 204, and a resource layer 205; each is described below.
The resource layer 205 encapsulates the computing, storage, and communication resources that implement each node 210 in the blockchain network 200.
The data layer 204 encapsulates the various data structures that implement the ledger, including the blockchain implemented as files in a file system, the key-value state database, and proofs of existence (e.g., hash trees of the transactions in blocks).
The network layer 203 encapsulates the functions of the Peer-to-Peer (P2P) network protocol, the data propagation and data verification mechanisms, the access authentication mechanism, and service agent identity management.
The P2P network protocol implements communication between the nodes 210 in the blockchain network 200; the data propagation mechanism ensures the propagation of transactions in the blockchain network 200; and the data verification mechanism implements reliable data transmission between nodes 210 based on cryptographic methods (e.g., digital certificates, digital signatures, public/private key pairs). The access authentication mechanism authenticates the identity of a service agent joining the blockchain network 200 according to the actual service scenario and grants the service agent the authority to access the blockchain network 200 when the authentication passes. Service agent identity management stores the identities of the service agents that are allowed to access the blockchain network 200, as well as their permissions (e.g., the types of transactions they may initiate).
The consensus layer 202 encapsulates the mechanism by which the nodes 210 in the blockchain network 200 agree on a block (i.e., the consensus mechanism), transaction management, and ledger management. The consensus mechanism comprises consensus algorithms such as PoS, PoW, and DPoS, and supports pluggable consensus algorithms.
Transaction management is configured to verify the digital signature carried in a transaction received by a node 210, verify the identity information of the service agent, and determine, according to the identity information (read from service agent identity management), whether the agent has the authority to perform the transaction. Every service agent authorized to access the blockchain network 200 possesses a digital certificate issued by a certificate authority, and signs its submitted transactions with the private key in its digital certificate, thereby declaring its legal identity.
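For illustration, a sketch of the sign-and-verify step using the widely available Python `cryptography` package (the package choice and the elliptic curve are assumptions for the example; the patent only requires encrypting the transaction digest with the agent's private key):

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Key pair standing in for a service agent's digital certificate
# (a real deployment would load the certified key, not generate one).
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

transaction = b'{"op": "update", "key": "video-1", "value": "text info"}'

# The agent signs the transaction with its private key ...
signature = private_key.sign(transaction, ec.ECDSA(hashes.SHA256()))

# ... and a consensus node verifies the signature before accepting it.
try:
    public_key.verify(signature, transaction, ec.ECDSA(hashes.SHA256()))
    print("signature valid: transaction proceeds to the authority check")
except InvalidSignature:
    print("signature invalid: the transaction fails")
```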
Ledger management is used to maintain the blockchain and the state database. For a block on which consensus has been reached, the block is appended to the tail of the blockchain; the transactions in the consensus block are then executed: the key-value pairs in the state database are updated when a transaction comprises an update operation, and the key-value pairs in the state database are queried, with the query result returned to the client node of the service agent, when a transaction comprises a query operation. Query operations along multiple dimensions of the state database are supported, comprising: querying a block according to the block sequence number; querying a block according to the block hash value; querying a block according to a transaction sequence number; querying a transaction according to the transaction sequence number; querying the account data of a service agent according to its account number; and querying the blockchain in a channel according to the channel name.
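A minimal sketch of the update/query execution against the key-value state database (an in-memory dict stands in for the real store; the names are illustrative):

```python
from typing import Optional

# Minimal in-memory stand-in for the key-value state database
# (illustrative; a real node would use a persistent store).
state_db: dict = {}

def execute_transaction(tx: dict) -> Optional[str]:
    """Execute a transaction from a consensus block: an update operation
    writes a key-value pair (e.g. target video -> text information); a
    query operation reads the pair so it can be returned to the client."""
    if tx["op"] == "update":
        state_db[tx["key"]] = tx["value"]
        return None
    if tx["op"] == "query":
        return state_db.get(tx["key"])
    raise ValueError(f"unsupported operation: {tx['op']}")
```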
The application layer 201 encapsulates the various services that the blockchain network can implement, including transaction tracing, evidence deposit, and transaction verification.
The architecture supporting the information processing method provided by the present invention is described below.
Referring to fig. 11, fig. 11 is a schematic view of an optional use environment of the information processing method according to an embodiment of the present invention, in which a video playing function module and a corresponding information processing apparatus are installed in a kitchen range hood. While the range hood is operating, sharing content by playing a video is not an appropriate way of presenting information; by comparison, sharing the content as output text yields higher reading efficiency (for example, with a cooking tutorial video, consulting a text description during actual cooking is significantly more efficient than watching the video). However, a target video obtained from the Internet often does not carry corresponding text information.
Referring to fig. 12, fig. 12 is a schematic architecture diagram of the information processing method according to an embodiment of the present invention, wherein:
the client is used for supporting APP (application) for switching video file playing and text display or an application program integrated on the intelligent equipment.
The interface layer mainly comprises a video playing interface and a text display interface, and processes the video playing requests and text display requests from the client.
The logic layer corresponds to the interface layer and comprises a video playing service and a text display service. The video playing service receives a user's video playing request, obtains the corresponding video playing address and playback control information according to the requested video id, and downloads the video medium for playback. The text display service receives a user's text display request, obtains the corresponding text information storage address and text display control information according to the video id in the request, and reads the text content for display to the user.
The storage layer mainly comprises a video media storage service, a video media asset storage service, and a description text storage service. The video media storage service stores the encoded video media used for playback (commonly MP4 files or HLS video files); the video media asset storage service stores the basic data of a video (such as video id, video title, video playing address, and text reading address); the description text storage service stores the video text information organized according to an agreed format (such as text information corresponding to frame intervals).
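The media asset record and the agreed text format can be sketched as follows (the field names and schema are illustrative assumptions, not the patent's prescribed layout):

```python
from dataclasses import dataclass

@dataclass
class VideoAsset:
    """Basic media asset data kept by the video media asset storage
    service; the field names here are illustrative assumptions."""
    video_id: str
    title: str
    play_address: str    # where the encoded medium (MP4/HLS) is fetched
    text_address: str    # where the organized description text is stored

# Description text organized by frame interval, per the agreed format
# (the exact schema and the sample text are assumptions):
description_text = [
    {"frame_start": 0, "frame_end": 1500, "text": "Heat the pan ..."},
    {"frame_start": 1500, "frame_end": 3000, "text": "Add the oil ..."},
]
```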
The offline training layer performs offline text extraction. The feature extraction module extracts video features and audio features from a video file, respectively. The video content understanding network is a memory-based video description model (MARN) comprising a video encoding module, an attention-based recurrent neural network decoding module, and a memory-based decoding module. The memory module establishes a mapping from each word to multiple related videos, so that the several meanings of each word can be understood and stored semantically, producing video descriptions with richer content and more accurate semantics. The speech recognition network consists of an ASR network based on an HMM-DNN model and an NLP network based on a BERT model: the speech in the video is converted into text by the ASR network, and the NLP network removes the idle chatter in the audio and extracts the core information. The text fusion network is responsible for fusing the text description based on video understanding with the text description based on audio analysis, organizing the formatted text information, and uploading it to the storage service for storage.
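The probability fusion of the two decoders might look like the following sketch (the linear interpolation and its weight are assumptions; the patent only states that the two decoding results are probability-fused):

```python
import numpy as np

def fuse_decoder_probabilities(p_attention: np.ndarray,
                               p_memory: np.ndarray,
                               weight: float = 0.5) -> np.ndarray:
    """Fuse the per-word probability distributions produced by the
    attention-based decoder and the memory-based decoder. The linear
    interpolation (and the 0.5 weight) is an illustrative assumption."""
    fused = weight * p_attention + (1.0 - weight) * p_memory
    return fused / fused.sum()   # renormalize over the vocabulary
```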
The following describes a usage flow of the information processing method provided by the present invention. Referring to fig. 13, fig. 13 is a schematic usage flow diagram of the information processing method in an embodiment of the present invention, in which realizing seamless switching between video playing and text display comprises the following steps (a condensed sketch in code follows the step list):
step 1301: a user requests to switch from video playing to text display by clicking the switch key on the terminal, triggering a text display request;
step 1302: after receiving the client request, the text display interface verifies the validity of the request and initiates a request to the text display service;
step 1303: after receiving the request, the text display service parses the request parameters (video id, current playing frame sequence number), and then requests the relevant video media asset information (video duration, video text description storage address, etc.) from the video media asset storage service according to the video id;
step 1304: after receiving the request, the video media asset storage service returns the corresponding video media asset information according to the video id and the requested media asset fields;
step 1305: after obtaining the video media asset information, the text display service judges whether the request is valid, and if so, returns the text description address and the text offset corresponding to the video;
step 1306: after receiving the response message from the text display service, the text display interface returns the text display response information to the terminal;
step 1307: the text display SDK integrated in the terminal APP reads the corresponding text information according to the text storage address and the text offset;
step 1308: the description text storage service returns the video description text content;
step 1309: the terminal APP receives the description text content and displays it to the user;
step 1310: the user requests to switch from text display back to video playing by clicking the switch key on the terminal again, triggering a video playing request;
step 1311: after receiving the client request, the video playing interface verifies the validity of the request and initiates a request to the video playing service;
step 1312: after receiving the request, the video playing service parses the request parameters (video id, text display content offset), and then requests the relevant video media asset information (video duration, video download address, etc.) from the video media asset storage service according to the video id;
step 1313: after receiving the request, the video media asset storage service returns the corresponding video media asset information according to the video id and the requested media asset fields;
step 1314: after obtaining the video media asset information, the video playing service judges whether the request is valid, and if so, returns the video media download address and the playing start frame number corresponding to the video;
step 1315: after receiving the response message from the video playing service, the video playing interface returns the video playing response message to the terminal;
step 1316: the video playing SDK integrated in the terminal APP downloads the video media content according to the video download address;
step 1317: the video media storage service returns the video media file;
step 1318: after the terminal APP finishes downloading the video media file, it plays the video content to the user starting from the playing start frame.
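A condensed sketch of the two switching paths above (the `services` object and its methods are hypothetical stand-ins for the interfaces named in the steps):

```python
def switch_to_text(video_id: str, playing_frame: int, services) -> str:
    """Steps 1301-1309 condensed: resolve the text address and offset
    for the current playing frame, then read the description text."""
    asset = services.media_assets.get(video_id)               # 1303-1304
    offset = services.text_display.frame_to_offset(
        asset, playing_frame)                                 # 1305
    return services.text_store.read(asset.text_address, offset)  # 1307-1308

def switch_to_video(video_id: str, text_offset: int, services) -> tuple:
    """Steps 1310-1318 condensed: resolve the download address and the
    playing start frame for the current text position, then fetch."""
    asset = services.media_assets.get(video_id)               # 1312-1313
    start_frame = services.video_play.offset_to_frame(
        asset, text_offset)                                   # 1314
    medium = services.video_store.download(asset.play_address)  # 1316-1317
    return medium, start_frame   # playback begins at start_frame (1318)
```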
The following describes another usage flow of the information processing method provided by the present invention. Referring to fig. 14, fig. 14 is a schematic usage flow diagram of the information processing method in an embodiment of the present invention, in which extracting corresponding text information based on the content of a video file comprises the following steps (a condensed sketch in code follows the step list):
step 1401: the feature extraction module requests the video media storage service to download the video medium according to the video id to be processed;
step 1402: the video media storage service delivers the video medium to the feature extraction module;
step 1403: after finishing downloading the video medium, the feature extraction module performs audio and video feature extraction on it: for video, 2D features of each image frame and 3D features of each short video segment are extracted; for audio, MFCC features are extracted;
step 1404: the extracted video features are input to the attention-based recurrent neural network decoding module for feature decoding;
step 1405: the extracted video features are input to the memory-based decoding module for decoding;
step 1406: the decoding result of the attention-based recurrent neural network is input to the probability fusion module;
step 1407: the decoding result of the memory-based decoding module is input to the probability fusion module;
step 1408: the decoding results of the two modules are probability-fused, and a text description result based on video content recognition is output to the text fusion network;
step 1409: meanwhile, the feature extraction module inputs the extracted audio features into the ASR network for speech recognition;
step 1410: the ASR network parses the speech recognition result into text information and inputs it into the NLP network for processing;
step 1411: the NLP network interprets the text information, removes the chatter from the text, extracts the key information, and inputs it into the text fusion network;
step 1412: the text fusion network fuses the description text generated from video content understanding with the text information generated from audio analysis to produce the final description text;
step 1413: the text fusion network initiates a text content upload request to the text upload interface;
step 1414: after receiving the upload request, the text upload interface initiates a request to the description text storage service to store the description text;
step 1415: the description text storage service stores the corresponding description text information, generates the corresponding file address, and returns the storage result to the text upload interface;
step 1416: the text upload interface initiates a request to the video media asset storage service to update the media asset attribute that associates the video with its description text;
step 1417: after updating the media asset information corresponding to the video, the video media asset storage service returns the update result to the text upload interface.
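A condensed sketch of steps 1401-1412 (all component names are hypothetical stand-ins for the modules named above):

```python
def extract_description_text(video_id: str, nets, services) -> str:
    """Steps 1401-1412 condensed into one pass."""
    medium = services.video_store.download(video_id)            # 1401-1402
    frames_2d, clips_3d, mfcc = nets.features.extract(medium)   # 1403
    p_attn = nets.attention_decoder.decode(frames_2d, clips_3d) # 1404
    p_mem = nets.memory_decoder.decode(frames_2d, clips_3d)     # 1405
    visual_text = nets.prob_fusion.fuse(p_attn, p_mem)          # 1406-1408
    raw_speech = nets.asr.transcribe(mfcc)                      # 1409-1410
    key_info = nets.nlp.extract_key_info(raw_speech)            # 1411
    return nets.text_fusion.fuse(visual_text, key_info)         # 1412

# Steps 1413-1417 would then store the result and update the media asset,
# e.g. services.text_store.save(...) and services.media_assets.update(...).
```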
In this way, by processing the video content and the audio content, the key information of the video is extracted, the chatter in the video is removed, and text information that is convenient to read is organized; seamless switching between video playing and text display is realized, so that in specific scenarios the user can freely choose to consume the content as video or as text, which optimizes the user experience and enables convenient content sharing.
The invention has the following beneficial technical effects:
Image feature vectors and voice feature vectors of the target video are extracted according to the video parameters of the target video, and content recognition is performed on the target video based on the image feature vectors and the voice feature vectors to obtain text information matched with the target video. The video information content is thereby converted, timely and accurately, into text in the corresponding natural language for output, which effectively improves the sharing speed of the video information content, expands the sharing scenarios of the video information content, and solves the problem in the conventional technology that video information content cannot be described in natural language through machine processing.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (15)

1. An information processing method, characterized in that the method comprises:
acquiring a target video, and analyzing the target video to acquire video parameters of the target video;
extracting image characteristic vectors and voice characteristic vectors of the target video according to the video parameters of the target video;
performing content identification on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video, wherein the text information is used for describing the content of the target video through a natural language;
and outputting the text information matched with the target video.
2. The method of claim 1, wherein the parsing the target video to obtain the video parameters of the target video comprises:
analyzing the target video to acquire time sequence information of the target video;
and analyzing the video information corresponding to the target video according to the time sequence information of the target video, and acquiring a playing duration parameter and a storage position parameter corresponding to the target video.
3. The method according to claim 2, wherein the extracting image feature vectors and voice feature vectors of the target video according to the video parameters of the target video comprises:
responding to the time sequence information of the target video, processing image frames included in the target video through an image classification network according to the playing duration parameter and the storage position parameter corresponding to the target video, and forming an image feature vector matched with the time sequence information of the target video;
and responding to the time sequence information of the target video, processing the audio information included in the target video through a convolutional neural network for voice classification according to the playing duration parameter and the storage position parameter corresponding to the target video, and forming a voice feature vector matched with the time sequence information of the target video.
4. The method of claim 1, wherein the performing content recognition on the target video based on the image feature vector and the voice feature vector to obtain text information matching the target video comprises:
performing semantic recognition processing on the image feature vector through an image semantic recognition network to form an image semantic recognition result matched with the image feature vector;
performing semantic recognition processing on the voice feature vector through a voice semantic recognition network to form a voice semantic recognition result matched with the voice feature vector;
and performing semantic integration processing on the image semantic recognition result and the voice semantic recognition result to form text information matched with the target video.
5. The method according to claim 4, wherein the performing semantic recognition processing on the image feature vector through an image semantic recognition network to form an image semantic recognition result matched with the image feature vector comprises:
performing semantic recognition processing on the image feature vector through a recurrent convolutional neural network based on an attention mechanism to form a corresponding first image semantic recognition result;
performing semantic recognition processing on the image feature vector through a convolutional neural network based on a memory mechanism to form a corresponding second image semantic recognition result;
and performing semantic integration processing on the first image semantic recognition result and the second image semantic recognition result to form an image semantic recognition result matched with the image feature vector.
6. The method of claim 4, further comprising:
extracting corresponding video samples in the target video, wherein the video samples comprise at least two video image frames;
extracting at least two local features of each frame of video image frame in the video sample;
according to the weight parameters corresponding to the local features in each frame of video image frame, performing feature integration processing on the local features in each frame of video image frame to obtain local feature vectors of each frame of video image frame;
according to the local feature vector of each frame of video image frame and the time sequence information of the images in the video sample, performing content identification on the video sample to obtain video content information of the target video, wherein the video content information comprises video content description and part-of-speech information of the video content description;
and updating the parameters of the image semantic recognition network according to the video content information and the real video content information of the video sample.
7. The method according to any one of claims 1 to 6, further comprising:
sending the target video and the corresponding text information matched with the target video to a blockchain network, so that
a node of the blockchain network fills the target video and the corresponding text information matched with the target video into a new block, and when consensus on the new block is reached, appends the new block to the tail of the blockchain.
8. An information processing apparatus characterized in that the apparatus comprises:
the information transmission module is used for acquiring a target video;
the information processing module is used for analyzing the target video to acquire video parameters of the target video;
the information processing module is used for extracting image feature vectors and voice feature vectors of the target video according to the video parameters of the target video;
the information processing module is used for performing content identification on the target video based on the image feature vector and the voice feature vector to obtain text information matched with the target video, wherein the text information is used for describing the content of the target video through a natural language;
and the information transmission module is used for outputting the text information matched with the target video.
9. The apparatus of claim 8,
the information processing module is used for analyzing the target video and acquiring the time sequence information of the target video;
and the information processing module is used for analyzing the video information corresponding to the target video according to the time sequence information of the target video, and acquiring the playing duration parameter and the storage position parameter corresponding to the target video.
10. The apparatus of claim 9,
the information processing module is used for responding to the time sequence information of the target video, processing image frames included in the target video through an image classification network according to the playing duration parameter and the storage position parameter corresponding to the target video, and forming an image feature vector matched with the time sequence information of the target video;
and the information processing module is used for responding to the time sequence information of the target video, processing the audio information included in the target video through a convolutional neural network for voice classification according to the playing duration parameter and the storage position parameter corresponding to the target video, and forming a voice feature vector matched with the time sequence information of the target video.
11. The apparatus of claim 8,
the information processing module is used for carrying out semantic recognition processing on the image feature vector through an image semantic recognition network to form an image semantic recognition result matched with the image feature vector;
the information processing module is used for carrying out semantic recognition processing on the voice feature vector through a voice semantic recognition network to form a voice semantic recognition result matched with the voice feature vector;
and the information processing module is used for performing semantic integration processing on the image semantic recognition result and the voice semantic recognition result to form text information matched with the target video.
12. The apparatus of claim 11,
the information processing module is used for carrying out semantic recognition processing on the image feature vector through a recurrent convolutional neural network based on an attention mechanism to form a corresponding first image semantic recognition result;
the information processing module is used for carrying out semantic recognition processing on the image feature vector through a convolutional neural network based on a memory mechanism to form a corresponding second image semantic recognition result;
and the information processing module is used for performing semantic integration processing on the first image semantic recognition result and the second image semantic recognition result to form an image semantic recognition result matched with the image feature vector.
13. The apparatus of claim 11,
the information processing module is used for extracting corresponding video samples in the target video, wherein the video samples comprise at least two video image frames;
the information processing module is used for extracting at least two local features of each frame of video image frame in the video sample;
the information processing module is used for performing feature integration processing on the local features in each frame of video image frame according to the weight parameters corresponding to the local features in each frame of video image frame to obtain local feature vectors of each frame of video image frame;
the information processing module is used for carrying out content identification on the video samples according to the local feature vector of each frame of video image frame and the time sequence information of the images in the video samples to obtain the video content information of the target video, wherein the video content information comprises video content description and part-of-speech information of the video content description;
and the information processing module is used for updating the parameters of the image semantic recognition network according to the video content information and the real video content information of the video sample.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the information processing method of any one of claims 1 to 7 when executing the executable instructions stored by the memory.
15. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the information processing method of any one of claims 1 to 7.
CN201910878129.5A 2019-09-17 2019-09-17 Information processing method, device and storage medium Active CN110598651B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911296704.7A CN110991391B (en) 2019-09-17 2019-09-17 Information processing method and device based on block chain network
CN201910878129.5A CN110598651B (en) 2019-09-17 2019-09-17 Information processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910878129.5A CN110598651B (en) 2019-09-17 2019-09-17 Information processing method, device and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201911296704.7A Division CN110991391B (en) 2019-09-17 2019-09-17 Information processing method and device based on block chain network

Publications (2)

Publication Number Publication Date
CN110598651A true CN110598651A (en) 2019-12-20
CN110598651B CN110598651B (en) 2021-03-12

Family

ID=68860401

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910878129.5A Active CN110598651B (en) 2019-09-17 2019-09-17 Information processing method, device and storage medium
CN201911296704.7A Active CN110991391B (en) 2019-09-17 2019-09-17 Information processing method and device based on block chain network

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201911296704.7A Active CN110991391B (en) 2019-09-17 2019-09-17 Information processing method and device based on block chain network

Country Status (1)

Country Link
CN (2) CN110598651B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112104892B (en) * 2020-09-11 2021-12-10 腾讯科技(深圳)有限公司 Multimedia information processing method and device, electronic equipment and storage medium
CN113674509B (en) * 2021-10-21 2022-06-21 北京博华信智科技股份有限公司 Edge node disaster monitoring and early warning system and method based on block chain

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146683A (en) * 2018-08-01 2019-01-04 江苏恒宝智能系统技术有限公司 A kind of information management and Verification System using block chain
CN109241753A (en) * 2018-08-09 2019-01-18 南京简诺特智能科技有限公司 A kind of data sharing method and system based on block chain
CN109347941A (en) * 2018-10-10 2019-02-15 南京简诺特智能科技有限公司 A kind of data sharing platform and its implementation based on block chain
CN109636402B (en) * 2018-12-14 2021-08-03 泰康保险集团股份有限公司 Credit information processing method, device, equipment and storage medium based on block chain
CN109993130A (en) * 2019-04-04 2019-07-09 哈尔滨拓博科技有限公司 One kind being based on depth image dynamic sign language semantics recognition system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180160200A1 (en) * 2016-12-03 2018-06-07 Streamingo Solutions Private Limited Methods and systems for identifying, incorporating, streamlining viewer intent when consuming media
CN110110145A (en) * 2018-01-29 2019-08-09 腾讯科技(深圳)有限公司 Document creation method and device are described
CN109117777A (en) * 2018-08-03 2019-01-01 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109168024A (en) * 2018-09-26 2019-01-08 平安科技(深圳)有限公司 A kind of recognition methods and equipment of target information
CN109961041A (en) * 2019-03-21 2019-07-02 腾讯科技(深圳)有限公司 A kind of video frequency identifying method, device and storage medium

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN111225237A (en) * 2020-04-23 2020-06-02 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
CN111225237B (en) * 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
US11972778B2 (en) 2020-04-23 2024-04-30 Tencent Technology (Shenzhen) Company Limited Sound-picture matching method of video, related apparatus, and storage medium
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111695422B (en) * 2020-05-06 2023-08-18 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111741326A (en) * 2020-06-30 2020-10-02 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111897960B (en) * 2020-07-17 2021-05-18 南京擎盾信息科技有限公司 Method, device, equipment and storage medium for reasoning between dynamic legal events
CN111897960A (en) * 2020-07-17 2020-11-06 南京擎盾信息科技有限公司 Method, device, equipment and storage medium for reasoning between dynamic legal events
CN111968647A (en) * 2020-08-26 2020-11-20 北京字节跳动网络技术有限公司 Voice recognition method, device, medium and electronic equipment
CN111968647B (en) * 2020-08-26 2021-11-12 北京字节跳动网络技术有限公司 Voice recognition method, device, medium and electronic equipment
CN111984825A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method and apparatus for searching video
CN112102843A (en) * 2020-09-18 2020-12-18 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN112163560A (en) * 2020-10-22 2021-01-01 腾讯科技(深圳)有限公司 Video information processing method and device, electronic equipment and storage medium
CN112163560B (en) * 2020-10-22 2024-03-05 腾讯科技(深圳)有限公司 Video information processing method and device, electronic equipment and storage medium
CN112383748A (en) * 2020-11-02 2021-02-19 中国联合网络通信集团有限公司 Video information storage method and device
CN112383748B (en) * 2020-11-02 2023-05-02 中国联合网络通信集团有限公司 Video information storage method and device
CN112307513A (en) * 2020-11-23 2021-02-02 从法信息科技有限公司 Uplink document security management method and device based on time domain consensus and electronic equipment
CN112633110A (en) * 2020-12-16 2021-04-09 中国联合网络通信集团有限公司 Data processing method and device
CN112633110B (en) * 2020-12-16 2024-02-13 中国联合网络通信集团有限公司 Data processing method and device
CN113408208A (en) * 2021-06-25 2021-09-17 成都欧珀通信科技有限公司 Model training method, information extraction method, related device and storage medium
CN113807845A (en) * 2021-08-05 2021-12-17 北京房江湖科技有限公司 Real estate transaction system based on block chain and implementation method thereof
CN116320622B (en) * 2023-05-17 2023-08-18 成都索贝数码科技股份有限公司 Broadcast television news video-to-picture manuscript manufacturing system and manufacturing method
CN116320622A (en) * 2023-05-17 2023-06-23 成都索贝数码科技股份有限公司 Broadcast television news video-to-picture manuscript manufacturing system and manufacturing method
CN116385030A (en) * 2023-06-06 2023-07-04 北京农夫铺子技术研究院 Source tracing method based on block chain
CN116385030B (en) * 2023-06-06 2023-09-15 北京农夫铺子技术研究院 Source tracing method based on block chain

Also Published As

Publication number Publication date
CN110991391B (en) 2021-06-29
CN110598651B (en) 2021-03-12
CN110991391A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110598651B (en) Information processing method, device and storage medium
CN111739521B (en) Electronic equipment awakening method and device, electronic equipment and storage medium
CN111401558B (en) Data processing model training method, data processing device and electronic equipment
US11887619B2 (en) Method and apparatus for detecting similarity between multimedia information, electronic device, and storage medium
CN111026858B (en) Project information processing method and device based on project recommendation model
CN111026319B (en) Intelligent text processing method and device, electronic equipment and storage medium
CN111552799A (en) Information processing method, information processing device, electronic equipment and storage medium
CN111552797B (en) Name prediction model training method and device, electronic equipment and storage medium
CN111026320B (en) Multi-mode intelligent text processing method and device, electronic equipment and storage medium
CN113420128B (en) Text matching method and device, storage medium and computer equipment
CN109979439B (en) Voice recognition method, device, medium and electronic equipment based on block chain
Cao et al. Generative steganography based on long readable text generation
CN111552798B (en) Name information processing method and device based on name prediction model and electronic equipment
CN112989074A (en) Multimedia information recommendation method and device, electronic equipment and storage medium
Ren et al. A bi-directional lstm model with attention for malicious url detection
Atkinson et al. What Gets Echoed? Understanding the "Pointers" in Explanations of Persuasive Arguments
CN113539299A (en) Multimedia information processing method and device, electronic equipment and storage medium
CN112989186A (en) Information recommendation model training method and device, electronic equipment and storage medium
CN110781676B (en) Text processing method, device and system, electronic equipment and storage medium
CN113609866A (en) Text marking method, device, equipment and storage medium
CN111860597B (en) Video information processing method and device, electronic equipment and storage medium
CN111142728A (en) Vehicle-mounted environment intelligent text processing method and device, electronic equipment and storage medium
CN113033209B (en) Text relation extraction method and device, storage medium and computer equipment
CN111552890B (en) Name information processing method and device based on name prediction model and electronic equipment
Pandit et al. How good is your model ‘really’? on ‘wildness’ of the in-the-wild speech-based affect recognisers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40015682; Country of ref document: HK)
GR01 Patent grant