CN113792166B - Information acquisition method and device, electronic equipment and storage medium - Google Patents

Information acquisition method and device, electronic equipment and storage medium

Info

Publication number
CN113792166B
Authority
CN
China
Prior art keywords
information
video
result corresponding
key frame
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110951049.5A
Other languages
Chinese (zh)
Other versions
CN113792166A (en)
Inventor
高泽洲
周湘阳
伍星
黄伟航
肖秋实
梅丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110951049.5A priority Critical patent/CN113792166B/en
Publication of CN113792166A publication Critical patent/CN113792166A/en
Application granted granted Critical
Publication of CN113792166B publication Critical patent/CN113792166B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/20Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to an information acquisition method and apparatus, an electronic device, and a storage medium. The method includes: obtaining multi-modal information of a video, the multi-modal information including main text information of the video, auxiliary text information of the video, and multimedia information, where the multimedia information includes visual information and/or speech information, the visual information includes a plurality of key frame images of the video, and the speech information includes a speech signal in the video; and generating summary information of the video based on the multi-modal information of the video. Because the relevance of multiple types of information of the video (its main text information, auxiliary text information, and multimedia information) to the summary information to be generated is considered together, and the summary information is generated based on all of these types of information, the information of the video is fully utilized when obtaining its summary information.

Description

Information acquisition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of videos, and in particular, to an information obtaining method, an information obtaining apparatus, an electronic device, and a storage medium.
Background
The summary information of a video reflects the main content of the video and is matched against a search expression entered by a user during a video search to determine the search results returned to the user's terminal. In the related art, the summary information of a video is obtained by processing the description information that the video author enters when publishing the video. Because some terms in that description information are often only weakly associated with the main content of the video, the resulting summary information often includes such weakly associated terms, which lowers its accuracy.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides an information acquisition method, an information acquisition apparatus, an electronic device, and a storage medium, so as to at least solve the problem, in the related art, of the low accuracy of the obtained summary information of a video.
The technical solutions of the present disclosure are as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an information acquisition method, including:
obtaining multi-modal information of a video, the multi-modal information including: main text information, auxiliary text information, and multimedia information, where the multimedia information includes: visual information and/or speech information, the visual information including: a plurality of key frame images of the video, and the speech information including: a speech signal in the video;
generating summary information of the video based on the multimodal information.
According to a second aspect of the embodiments of the present disclosure, there is provided an information acquisition apparatus including:
an acquisition module configured to acquire multi-modal information of a video, the multi-modal information including: main text information, auxiliary text information, and multimedia information, where the multimedia information includes: visual information and/or speech information, the visual information including: a plurality of key frame images of the video, and the speech information including: a speech signal in the video;
a generating module configured to generate summary information of the video based on the multimodal information.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
Because the relevance of multiple types of information of the video (its main text information, auxiliary text information, and multimedia information) to the summary information to be generated is considered together, and the summary information of the video is generated based on all of these types of information, the information of the video is fully utilized when obtaining its summary information.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating one embodiment of an information acquisition method in accordance with an exemplary embodiment;
fig. 2 is a block diagram showing a structure of an information acquisition apparatus according to an exemplary embodiment;
fig. 3 is a block diagram illustrating a structure of an electronic device according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flow diagram illustrating one embodiment of an information acquisition method in accordance with an example embodiment. The method comprises the following steps:
Step 101: obtaining multi-modal information of a video.
In the present disclosure, the type of the video may be a short video. The multi-modal information of the video includes: main text information, auxiliary text information, and multimedia information. The multimedia information includes visual information and/or speech information.
The visual information of the video may include: a plurality of key frame images of the video, and the speech information of the video may include: the speech signal in the video.
In the present disclosure, the main text information of the video may include: the description text of the video that the video publisher enters when publishing the video to describe its content, and the character recognition result obtained by performing character recognition on the cover image of the video.
The auxiliary text information of the video may include: the character recognition results, for example OCR results, obtained by performing character recognition on each key frame image in a subset of the key frame images extracted from the video, one result per key frame image. This subset of key frame images may be selected at random from the plurality of key frame images and does not include the cover image of the video. The plurality of key frame images extracted from the video may be used as the visual information of the video.
In this disclosure, the speech signal in the video may be: the signal whose first speech frame is the first of all speech frames belonging to the video, and whose last speech frame is the speech frame separated from that first frame by a preset number of speech frames.
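By way of illustration only, the selection of such a speech segment can be expressed as a slice over the video's speech frames; how the frames are stored in memory is an assumption made here for the sketch:

```python
# Minimal sketch of taking the speech signal described above: starting from the
# first speech frame of the video and ending a preset number of frames later.
# The in-memory representation of the speech frames is an assumption.

def extract_speech_signal(speech_frames, preset_number):
    """Return the segment from the first speech frame up to the frame `preset_number` frames later."""
    return speech_frames[: preset_number + 1]
```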
In some embodiments, the main text information of the video includes: the description text of the video and a target character recognition result corresponding to each first key frame image among the plurality of key frame images. Acquiring the multi-modal information of the video includes: for each first key frame image, performing character recognition on the first key frame image to obtain a preliminary character recognition result corresponding to that image; and performing preset semantic reduction processing on the preliminary character recognition result to obtain the target character recognition result corresponding to that image.
In the present disclosure, when the multi-modal information of the video includes visual information, the plurality of key frame images in the visual information of the video may include at least one first key frame image. The cover image of the video and the first frame image in the video can be used as first key frame images.
For each first key frame image, character recognition, for example OCR, is performed on the first key frame image to obtain a preliminary character recognition result corresponding to that image. The preliminary character recognition result may include a plurality of character recognition sub-results, each containing one or more characters, and the sub-results occupy different positions within the image. For each first key frame image, the preset semantic reduction processing of the preliminary character recognition result may be: splicing the plurality of character recognition sub-results in a preset order, for example from top to bottom, to obtain the target character recognition result corresponding to that image.
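As an illustration of the splicing step only, the following sketch orders OCR sub-results by their vertical position and concatenates them; the field names of the OCR output are assumptions, since the patent does not prescribe a data structure:

```python
# Minimal sketch of the preset semantic reduction step described above.
# Assumption: each OCR sub-result carries its text and the top y coordinate of
# its bounding box within the key frame image.

def semantic_reduction(sub_results):
    """Splice OCR sub-results top-to-bottom into one target character recognition result."""
    ordered = sorted(sub_results, key=lambda r: r["top"])  # top-to-bottom order
    return " ".join(r["text"] for r in ordered)            # concatenate in that order

# Example usage with hypothetical sub-results of a cover image.
sub_results = [
    {"text": "name of a scene in the game", "top": 40},
    {"text": "name of a character in the game", "top": 120},
]
target_result = semantic_reduction(sub_results)
```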
In some embodiments, the auxiliary text information includes: a target character recognition result corresponding to each second key frame image among the plurality of key frame images, and a target speech recognition result corresponding to the video. Acquiring the multi-modal information of the video includes: for each second key frame image, performing character recognition on the second key frame image to obtain a preliminary character recognition result corresponding to that image; performing preset semantic reduction processing on the preliminary character recognition result to obtain the target character recognition result corresponding to that image; performing speech recognition on the speech signal of the video to obtain a preliminary speech recognition result corresponding to the video; and performing preset filtering processing on the preliminary speech recognition result to obtain the target speech recognition result corresponding to the video, where the preset filtering processing removes text related to background music and text related to noise from the preliminary speech recognition result.
In the present disclosure, when the multi-modal information of the video includes visual information and speech information, the plurality of key frame images in the visual information of the video may include at least one second key frame image. The key frame images other than the first key frame images among the plurality of key frame images may be used as second key frame images.
For each second key frame image, character recognition, for example OCR, is performed on the second key frame image to obtain a preliminary character recognition result corresponding to that image. The preliminary character recognition result may include a plurality of character recognition sub-results, each containing one or more characters, and the sub-results occupy different positions within the image. For each second key frame image, the preset semantic reduction processing of the preliminary character recognition result may be: splicing the plurality of character recognition sub-results in a preset order, for example from top to bottom, to obtain the target character recognition result corresponding to that image.
In the present disclosure, speech recognition (ASR) may be performed on the speech signal of the video, that is, the speech signal in the speech information of the video, to obtain a preliminary speech recognition result corresponding to the video. A preset filtering process may then be applied to the preliminary speech recognition result to obtain the target speech recognition result corresponding to the video; this filtering removes text related to background music and text related to noise from the preliminary speech recognition result.
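The patent does not specify how background-music text and noise text are identified. Purely as an assumption for illustration, the sketch below supposes that an upstream audio classifier has tagged each ASR segment with its source, so the filtering reduces to keeping only the speech segments:

```python
# Hedged sketch of the preset filtering step. The per-segment "source" tag is an
# assumed output of an upstream audio classifier, not something the patent defines.

def filter_preliminary_asr(segments):
    """Keep only text from genuine speech; drop background-music and noise text."""
    kept = [seg["text"] for seg in segments if seg["source"] == "speech"]
    return " ".join(kept)

segments = [
    {"text": "the character enters the scene", "source": "speech"},
    {"text": "la la la", "source": "music"},   # background-music text, removed
    {"text": "sh- kkk", "source": "noise"},    # noise text, removed
]
target_speech_result = filter_preliminary_asr(segments)
```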
Step 102: generating summary information of the video based on the multi-modal information of the video.
In the present disclosure, for one video, when summary information of the video is generated based on multi-modal information of the video, association information of each of a plurality of preset videos may be acquired. The type of each preset video is the same as that of the video, and the release time of each preset video is earlier than that of the video.
For each preset video, the association information of the preset video may include: sentences in the main text information of the preset video that have a low degree of association with the main content of the preset video. The multi-modal information of each preset video can be acquired in advance. For each preset video, the multi-modal information of the preset video includes: the main text information, the auxiliary text information, and the multimedia information of the preset video. For each preset video, the sentences in its main text information that have a low degree of association with its main content may be determined manually in advance, based on the auxiliary text information and the multimedia information of the preset video.
For a video, when the summary information of the video is generated based on its multi-modal information, the similarity between the video and each preset video can be calculated: the similarity between the main text information of the video and the main text information of each preset video, the similarity between the auxiliary text information of the video and the auxiliary text information of each preset video, and the similarity between the multimedia information of the video and the multimedia information of each preset video.
For the video and one preset video, the maximum or the median of the following three similarities may be taken as the similarity between the video and the preset video: the similarity between their main text information, the similarity between their auxiliary text information, and the similarity between their multimedia information.
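As a small illustration of this aggregation, the sketch below takes the maximum or median of the three per-modality similarities; how each per-modality similarity is computed (for example, cosine similarity of embeddings) is an assumption outside the scope of the text above:

```python
# Sketch of the video-to-preset-video similarity described above. Only the
# aggregation (max or median of three similarities) follows the text; the values
# themselves are illustrative.
from statistics import median

def video_similarity(sim_main, sim_aux, sim_media, mode="max"):
    """Aggregate the three per-modality similarities into one score."""
    sims = [sim_main, sim_aux, sim_media]
    return max(sims) if mode == "max" else median(sims)

# The preset video with the highest similarity is taken as the target preset video.
preset_scores = {"preset_1": video_similarity(0.62, 0.48, 0.71),
                 "preset_2": video_similarity(0.55, 0.80, 0.40)}
target_preset = max(preset_scores, key=preset_scores.get)
```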
The preset video having the highest similarity to the video may be set as the target preset video. The association information of the target preset video comprises sentences with low association degree with the main content of the target preset video.
For each sentence in the main text information of the video, the sentence can be regarded as having a low degree of association with the main content of the video if its similarity to any sentence in the association information of the target preset video (that is, a sentence with a low degree of association with the main content of the target preset video) is greater than a similarity threshold, or if more than a threshold number of the keywords appearing in such a low-association sentence also appear in this sentence.
In the present disclosure, the sentences in the main text information of the video other than those with a low degree of association with the main content of the video may be determined as sentences with a high degree of association with the main content. If there is only one such highly associated sentence, it is determined as the summary information of the video. If there are several, all of the highly associated sentences can be spliced together to obtain the summary information of the video.
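The rule-based path above can be summarized in a short sketch: drop the sentences matching the target preset video's low-association sentences, then splice the remainder. The sentence-similarity and keyword-extraction helpers (`sim`, `keywords`) are assumed to be supplied by the caller, since the patent does not specify them:

```python
# Hedged sketch of the rule-based summary path described above.
# `sim(a, b)` returns a similarity score; `keywords(s)` returns a set of keywords.

def summarize(main_sentences, low_assoc_sentences, sim, keywords,
              sim_threshold=0.8, kw_threshold=3):
    def is_low_association(sentence):
        for ref in low_assoc_sentences:
            if sim(sentence, ref) > sim_threshold:
                return True
            if len(keywords(ref) & keywords(sentence)) > kw_threshold:
                return True
        return False

    kept = [s for s in main_sentences if not is_low_association(s)]
    return "".join(kept)  # splice the highly associated sentences into the summary
```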
In some embodiments, generating the summary information of the video based on the multi-modal information of the video includes: processing the multi-modal information with a preset neural network to obtain the summary information of the video. The preset neural network is trained in advance with training data, and each piece of training data includes: the multi-modal information of a video used for training and the annotated summary information of that video. When the preset neural network is trained in advance, its parameters are updated based on the loss corresponding to the video used for training; this loss indicates the degree of difference between the predicted summary information of the video used for training and its annotated summary information, where the predicted summary information is obtained based on the multi-modal information of that video input into the preset neural network.
In the present disclosure, when generating the summary information of a video based on multi-modal information of the video, the multi-modal information of the video may be input into a preset neural network, and the summary information of the video may be output by the preset neural network.
The preset neural network is trained in advance with a data set comprising a plurality of pieces of training data; one piece of training data is used for each training step, and different steps use different pieces. Each piece of training data includes: the multi-modal information of a video used for training and the annotated summary information of that video, where the annotated summary information is summary information set in advance by an annotator according to the main content of the video used for training.
At each training step, the multi-modal information of the video used for training is input into the preset neural network, and the network predicts the summary information of that video. A preset loss function is used to calculate the loss corresponding to the video used for training, which indicates the degree of difference between the predicted summary information and the annotated summary information, and the parameter values of the parameters of the preset neural network are updated according to this loss.
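A PyTorch-style sketch of one such training step is given below. The network architecture, the loss function, and the optimizer are assumptions; the text above only requires that a preset loss measures the gap between the predicted and annotated summaries and that the parameters are updated from that loss:

```python
# Hedged sketch of one training step of the preset neural network.
# `model`, `optimizer`, and `loss_fn` are assumed, PyTorch-style objects.

def train_step(model, optimizer, loss_fn, sample):
    # sample["multimodal"]: main text, auxiliary text, key frame images, speech signal
    predicted = model(sample["multimodal"])               # predicted summary (logits)
    loss = loss_fn(predicted, sample["annotated_summary"])
    optimizer.zero_grad()
    loss.backward()                                       # the difference degree drives the update
    optimizer.step()                                      # update the preset network's parameters
    return loss.item()
```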
After training, the preset neural network has learned the association between multi-modal information and summary information. When generating the summary information of a video from its multi-modal information, the preset neural network can therefore process the multi-modal information directly according to this learned association and obtain the summary information quickly, which improves the speed of obtaining the summary information of the video.
In some embodiments, the preset neural network includes: an encoder and a decoder. Processing the multi-modal information with the preset neural network to obtain the summary information of the video includes: encoding the main text information of the video with the encoder, based on the auxiliary text information of the video and the multimedia information of the video, to obtain a target encoding result corresponding to the main text information; decoding the target encoding result with the decoder to obtain a decoding result; and obtaining the summary information of the video based on the decoding result.
The preset neural network in the present disclosure may be a neural network of the encoder-decoder type, such as a U-Net network, and the structures of the encoder and the decoder in the present disclosure may be those of the encoder and the decoder in such an encoder-decoder neural network.
When the preset neural network includes an encoder and a decoder, the pre-training proceeds as follows. The encoder encodes the main text information of the video used for training, based on the auxiliary text information and the multimedia information of that video, to obtain a predicted encoding result. While doing so, the encoder predicts the key sentences in the main text information that have a high degree of association with the auxiliary text information and with the multimedia information of the video used for training; such predicted key sentences can be regarded as sentences with a high degree of association with the main content of the video used for training. A predicted encoding is then generated for each predicted key sentence, and the predicted encodings of the key sentences are spliced into the predicted encoding result. During training, the predicted encoding result is input to the decoder, which generates a decoding result corresponding to the video used for training; this decoding result may include a plurality of candidate summary information items, and the predicted summary information of the video used for training is obtained based on this decoding result.
Through this training, the encoder learns how to determine, in the main text information of a given video, the sentences with a high degree of association with the main content of that video and how to generate the corresponding encoding result from those sentences, and the decoder learns how to obtain the summary information of the video from the encoding result corresponding to the main text information of the given video.
For a video, when the multi-modal information is processed by a preset neural network comprising an encoder and a decoder to obtain the summary information of the video, the main text information, the auxiliary text information, and the multimedia information of the video can be input into the encoder at the same time. The encoder encodes the main text information based on the auxiliary text information and the multimedia information to obtain the target encoding result corresponding to the main text information: it predicts the sentences in the main text information that have a high degree of association with the main content of the video and splices the encodings of those sentences into the target encoding result. The target encoding result is then input into the decoder, which outputs a decoding result. The decoding result may include: a plurality of candidate summary information items and a confidence level of each candidate. When obtaining the summary information of the video from the decoding result, the candidate with the highest confidence level may be determined as the summary information of the video.
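The inference path just described can be sketched as follows; the interfaces of the pre-trained encoder and decoder, and the shape of the decoder's output, are assumptions made for illustration:

```python
# Hedged sketch of the encoder-decoder inference path described above.

def generate_summary(encoder, decoder, main_text, aux_text, multimedia):
    # Encode the main text conditioned on the auxiliary text and multimedia information.
    target_encoding = encoder(main_text, aux_text, multimedia)
    # Decode into candidate summaries, each paired with a confidence level.
    candidates = decoder(target_encoding)           # assumed: [(summary, confidence), ...]
    # Pick the candidate with the highest confidence as the video's summary.
    best_summary, _ = max(candidates, key=lambda c: c[1])
    return best_summary
```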
In the present disclosure, when the encoder encodes the main text information of the video based on the auxiliary text information and the multimedia information, it encodes the sentences in the main text information that have a high degree of association with the main content of the video, and the decoder obtains the summary information of the video from that target encoding result. This is equivalent to obtaining the summary information from the highly associated sentences of the main text information, so the accuracy of the obtained summary information is high. At the same time, the summary information is obtained only from the main text information; the auxiliary text information is used only to determine which part of the main text information is highly associated with the main content of the video. This avoids the situation where parts of the auxiliary text information that are weakly associated with the main content of the video would degrade the accuracy of the summary information.
For example, consider a short video about a game published by a user, where the main text information of the video is the text in the cover image. The cover image includes text such as the name of a scene in the game and the name of a character in the game, but it also includes content unrelated to the main content of the video, such as "pay attention to my account for more wonderful videos". If the summary information of the video were generated directly from the main text information, it would include not only the name of the scene and the name of the character in the game, but also the unrelated content such as "pay attention to my account for more wonderful videos".
By contrast, with the information acquisition method provided by the present disclosure, the sentence "pay attention to my account for more wonderful videos" in the main text information can be determined to have a low degree of association with the main content of the video by means of the auxiliary text information (for example, sentences extracted from key frame images other than the cover image, such as sentences including the name of the character and the name of the scene in the game), the visual information of the video (for example, objects representing the game characters in the key frame images), and the voice information of the video (for example, the audio corresponding to the utterances of the game characters).
At the same time, the sentences in the main text information that include the name of the scene and the name of the character in the game are determined to have a high degree of association with the main content of the video, and the summary information of the video is obtained from those highly associated sentences. The summary information therefore includes only sentences highly associated with the main content of the video and excludes the weakly associated sentence "pay attention to my account for more wonderful videos", so the accuracy of the obtained summary information is high.
In some embodiments, encoding the main text information of the video with the encoder based on the auxiliary text information and the multimedia information of the video, to obtain the target encoding result corresponding to the main text information, includes: generating a main information vector and an auxiliary information vector, where the main information vector includes: a vector representing the main text information, a position vector, and a vector representing the identification of the main text information, the position vector being a vector representing the positions of the words in the main text information, and the auxiliary information vector includes: a vector representing the auxiliary text information and a vector representing the multimedia information; and inputting the main information vector and the auxiliary information vector into the encoder to obtain the target encoding result corresponding to the main text information output by the encoder.
In the present disclosure, the vector representing the main text information may be referred to as the embedded representation (embedding) of the main text information. The position vector includes a plurality of components, each corresponding to one word in the main text information, and each component represents the position of its word within the main text information. The vector representing the auxiliary text information may be referred to as the embedding of the auxiliary text information, and the vector representing the multimedia information may be referred to as the embedding of the multimedia information. The main information vector and the auxiliary information vector may be input into the encoder at the same time.
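The assembly of these two information vectors can be sketched as below. The embedding layers and the fixed segment identifier mirror the token/position/segment embeddings of Transformer-style encoders, which is an assumption; the text above only specifies which vectors each information vector contains:

```python
# Hedged sketch of building the main and auxiliary information vectors.
import torch
import torch.nn as nn

class InfoVectorBuilder(nn.Module):
    def __init__(self, vocab_size, max_len, dim, main_segment_id=0):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)   # words -> vectors
        self.pos_emb = nn.Embedding(max_len, dim)        # word positions -> vectors
        self.seg_emb = nn.Embedding(2, dim)              # identifies the main text information
        self.main_segment_id = main_segment_id

    def main_info_vector(self, main_token_ids):
        positions = torch.arange(main_token_ids.size(-1))
        segment = torch.full_like(main_token_ids, self.main_segment_id)
        return (self.token_emb(main_token_ids)
                + self.pos_emb(positions)
                + self.seg_emb(segment))

    def aux_info_vector(self, aux_token_ids, multimedia_embedding):
        # Combine the auxiliary text embedding with the multimedia embedding.
        return torch.cat([self.token_emb(aux_token_ids), multimedia_embedding], dim=0)
```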
The encoder can distinguish the main information vector from the auxiliary information vector based on the vector representing the identification of the main text information, so it does not need to perform extra operations to determine which is which. Likewise, the encoder can determine the position of each word in the main text information from the position vector without extra operations, which improves the speed of obtaining the target encoding result corresponding to the main text information.
In some embodiments, encoding the main text information of the video with the encoder of the preset neural network, based on the auxiliary text information and the multimedia information of the video, to obtain the target encoding result corresponding to the main text information includes: performing word segmentation on the main text information with the encoder to obtain a plurality of words in the main text information; determining, with the encoder, a weight for each of the plurality of words based on the auxiliary text information and the multimedia information of the video; and encoding each target word with the encoder to obtain an encoding result for each target word, and combining the encoding results of the target words into the target encoding result corresponding to the main text information, where a target word is a word whose weight is greater than a weight threshold.
In the present disclosure, when the preset neural network includes an encoder and a decoder, the training data used to pre-train the network may include an annotated weight for each word in the main text information of the video used for training; the higher the annotated weight of a word, the higher its degree of association with the main content of the video used for training.
During training, the encoder may predict a weight for each word in the main text information of the video used for training based on the auxiliary text information and the multimedia information (that is, the visual information and/or the voice information) of that video, a loss between the predicted weights and the annotated weights is calculated, and the parameter values of the encoder parameters used for predicting word weights are updated according to that loss. The encoder thus learns to determine the weight of each word in the main text information of a given video based on the auxiliary text information and the multimedia information of that video.
In this disclosure, when an encoder in a preset neural network is used to encode main text information of a video based on auxiliary text information of the video and multimedia information of the video to obtain a target encoding result corresponding to the main text information, the encoder may be used to perform word segmentation on the main text information of the video to obtain a plurality of words in the main text information.
Since the encoder has learned to determine the weight of each word in the main text information of a given video from its auxiliary text information and multimedia information, the encoder can be used to determine the weight of each of the plurality of words in the main text information based on the auxiliary text information and the multimedia information of the video. The target words among these words can then be determined, a target word being a word whose weight is greater than the weight threshold, that is, a word strongly associated with the main content of the video. Each target word is encoded with the encoder to obtain its encoding result, and the encoding results of the target words are combined into the target encoding result corresponding to the main text information of the video.
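A sketch of this weight-based selection and encoding is given below. The method names `segment`, `predict_weights`, and `encode_word` are hypothetical stand-ins for parts of the trained encoder; only the thresholding and the combination of per-word encodings follow the text above:

```python
# Hedged sketch of the target-word encoding described above.
import torch

def encode_main_text(encoder, main_text, aux_text, multimedia, weight_threshold=0.5):
    words = encoder.segment(main_text)                         # word segmentation
    weights = encoder.predict_weights(words, aux_text, multimedia)
    target_words = [w for w, wt in zip(words, weights) if wt > weight_threshold]
    # Encode each target word and combine the per-word encoding results.
    word_encodings = [encoder.encode_word(w) for w in target_words]
    return torch.cat(word_encodings, dim=0)                    # target encoding result
```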
In the present disclosure, when the encoder of the preset neural network encodes the main text information of the video in this way, it determines the target words that are strongly associated with the main content of the video, encodes only those target words, and combines their encoding results into the target encoding result corresponding to the main text information. The target encoding result is therefore strongly associated with the main content of the video, and its accuracy is high.
In some embodiments, the decoding result includes: a plurality of candidate summary information items and an initial confidence level of each candidate. Obtaining the summary information of the video based on the decoding result includes: determining a final confidence level of each candidate based on the reference information of that candidate and its initial confidence level, where the reference information of a candidate includes: the ratio of the length of the candidate to the main text information and the repetition degree of the candidate; and selecting the summary information of the video from the plurality of candidates based on the final confidence level of each candidate.
In this disclosure, the confidence of the candidate summary information output by the decoder may be referred to as an initial confidence.
For each candidate summary information item, the length of the candidate, that is, the number of characters in it, may be divided by the length of the main text information, that is, the number of characters in the main text information, to obtain the ratio of the length of the candidate to the main text information.
For each candidate summary information item, the repetition degree of the candidate is the number of words that appear more than once in the candidate.
For each candidate summary information item, the product of the candidate's length ratio to the main text information and a preset coefficient corresponding to the candidate's repetition degree is calculated, and this product is multiplied by the candidate's initial confidence level to obtain its final confidence level. The preset coefficient corresponding to a repetition degree is a value greater than 0 and less than 1, and its size is negatively correlated with the repetition degree: the larger the repetition degree of the candidate, the smaller the corresponding preset coefficient.
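This computation reduces to one multiplication per candidate, as in the sketch below. The specific mapping from repetition degree to its preset coefficient is an assumption; the text above only requires coefficients in (0, 1) that decrease as the repetition degree grows:

```python
# Sketch of the final-confidence computation described above.

def final_confidence(initial_conf, cand_len, main_len, repetition_degree):
    length_ratio = cand_len / main_len
    coeff = 1.0 / (2.0 + repetition_degree)   # assumed decreasing coefficient in (0, 1)
    return length_ratio * coeff * initial_conf
```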
For example, suppose 3 candidate summary information items are obtained. The 1st candidate includes 1 repeatedly occurring word, the 2nd candidate includes 2 repeatedly occurring words, and the 3rd candidate includes 3 repeatedly occurring words, so their repetition degrees are 1, 2, and 3 respectively. Repetition degree 1 corresponds to preset coefficient 1, repetition degree 2 to preset coefficient 2, and repetition degree 3 to preset coefficient 3, where all three coefficients are values greater than 0 and less than 1. Since the preset coefficient is negatively correlated with the repetition degree, preset coefficient 1 is greater than preset coefficient 2, and preset coefficient 2 is greater than preset coefficient 3.
The final confidence level of the 1st candidate is obtained by multiplying the product of its length ratio to the main text information and preset coefficient 1 by its initial confidence level. Likewise, the final confidence level of the 2nd candidate is obtained using preset coefficient 2, and the final confidence level of the 3rd candidate is obtained using preset coefficient 3.
In the present disclosure, after the final confidence level of each candidate summary information item is obtained, the summary information of the video may be selected from the plurality of candidates based on these final confidence levels. All candidates can be ranked from high to low final confidence, and the longest candidate among the first N candidates can be selected as the summary information of the video.
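The top-N selection just described is sketched below; the value of N is a preset that the text does not fix, so N = 3 here is an assumption:

```python
# Sketch of selecting the summary from the top-N candidates by final confidence.

def select_summary(candidates, n=3):
    """candidates: list of (summary_text, final_confidence) pairs."""
    top_n = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    return max(top_n, key=lambda c: len(c[0]))[0]   # longest among the top N
```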
In the present disclosure, the initial confidence level of a candidate summary information item can represent its accuracy; the repetition degree of a candidate is negatively correlated with its brevity (the lower the repetition degree, the more concise the candidate); and the ratio of the candidate's length to the main text information reflects how fully the main text information was used when generating the candidate. The final confidence level of a candidate is therefore related at the same time to its accuracy, its brevity, and how fully the main text information was used when it was generated.
Selecting the summary information of the video from the plurality of candidates based on their final confidence levels is therefore equivalent to considering, at the same time, the accuracy of each candidate, its brevity, and how fully the main text information was used when it was generated, so that the candidate best suited to serve as the summary information of the video can be determined more comprehensively.
In some embodiments, selecting summary information of the video from the plurality of candidate summary information based on the final confidence level of each candidate summary information comprises: and selecting the candidate summary information with the highest final confidence as the summary information of the video.
The final confidence level of a candidate is related at the same time to its accuracy, its brevity, and how fully the main text information was used when it was generated. Only a candidate that is accurate, concise, and generated while making full use of the main text information can become the candidate with the highest final confidence level. Selecting the candidate with the highest final confidence level as the summary information of the video therefore makes the finally obtained summary information more accurate and more concise, while ensuring that the main text information was fully used when generating it.
Fig. 2 is a block diagram illustrating a structure of an information acquisition apparatus according to an exemplary embodiment. Referring to fig. 2, the information acquisition apparatus includes: an obtaining module 201 and a generating module 202.
The obtaining module 201 is configured to obtain multi-modal information of the video, the multi-modal information including: main text information, auxiliary text information, and multimedia information, where the multimedia information includes: visual information and/or speech information, the visual information including: a plurality of key frame images of the video, and the speech information including: a speech signal in the video;
the generation module 202 is configured to generate summary information for the video based on the multimodal information.
In some embodiments, the generation module 202 includes:
a processing sub-module configured to process the multi-modal information with a preset neural network to obtain the summary information of the video, where the preset neural network is trained in advance with training data, the training data including: the multi-modal information of a video used for training and the annotated summary information of that video; when the preset neural network is trained in advance, its parameters are updated based on the loss corresponding to the video used for training, the loss indicating the degree of difference between the predicted summary information of the video used for training and its annotated summary information, the predicted summary information being obtained based on the multi-modal information of that video input into the preset neural network.
In some embodiments, the preset neural network includes: an encoder and a decoder; the processing sub-module is further configured to encode the main text information with the encoder, based on the auxiliary text information and the multimedia information, to obtain a target encoding result corresponding to the main text information; and to decode the target encoding result with the decoder to obtain a decoding result, and obtain the summary information of the video based on the decoding result.
In some embodiments, the decoding result includes: a plurality of candidate summary information items and an initial confidence level of each candidate; the processing sub-module is further configured to determine a final confidence level of each candidate based on the reference information of that candidate and its initial confidence level, the reference information of a candidate including: the ratio of the length of the candidate to the main text information and the repetition degree of the candidate, where the repetition degree of a candidate is the number of repeatedly occurring words it includes; and to select the summary information of the video from the plurality of candidates based on the final confidence level of each candidate.
In some embodiments, the processing sub-module is further configured to select the candidate summary information with the highest final confidence as the summary information of the video.
In some embodiments, the processing sub-module is further configured to perform word segmentation on the main text information with the encoder to obtain a plurality of words in the main text information; determine, with the encoder, a weight for each of the plurality of words based on the auxiliary text information and the multimedia information; and encode each target word with the encoder to obtain an encoding result for each target word, and combine the encoding results of the target words into the target encoding result corresponding to the main text information, where a target word is a word whose weight is greater than a weight threshold.
In some embodiments, the processing sub-module is further configured to generate a main information vector and an auxiliary information vector, the main information vector including: a vector representing the main text information, a position vector, and a vector representing the identification of the main text information, the position vector being a vector representing the positions of the words in the main text information, and the auxiliary information vector including: a vector representing the auxiliary text information and a vector representing the multimedia information; and to input the main information vector and the auxiliary information vector into the encoder to obtain the target encoding result corresponding to the main text information output by the encoder.
In some embodiments, the main text information includes: the description text of the video and a target character recognition result corresponding to each first key frame image among the plurality of key frame images; the obtaining module 201 is further configured to perform character recognition on each first key frame image to obtain a preliminary character recognition result corresponding to that image, and to perform preset semantic reduction processing on the preliminary character recognition result to obtain the target character recognition result corresponding to that image.
In some embodiments, the auxiliary text information includes: a target text recognition result corresponding to each second key frame image in the plurality of key frame images and a target speech recognition result corresponding to the video; the obtaining module 201 is further configured to perform text recognition on each second key frame image to obtain a preliminary text recognition result corresponding to the second key frame image; perform preset semantic reduction processing on the preliminary text recognition result corresponding to the second key frame image to obtain the target text recognition result corresponding to the second key frame image; perform speech recognition on the speech signal of the video to obtain a preliminary speech recognition result corresponding to the video; and perform preset filtering processing on the preliminary speech recognition result corresponding to the video to obtain the target speech recognition result corresponding to the video, wherein the preset filtering processing removes text related to background music and text related to noise from the preliminary speech recognition result corresponding to the video.
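The speech side of the auxiliary-text pipeline can be sketched in the same spirit; the speech recognizer, the lyric/noise vocabulary, and the filtering rule below are illustrative placeholders for the actual components and the preset filtering processing.

```python
# Minimal sketch: run speech recognition on the audio track and filter out text
# attributed to background music or noise. All components are placeholders.
def run_asr(audio):                       # placeholder for a real speech recognizer
    return ["the cat is playing", "la la la", "static hiss"]

KNOWN_LYRIC_OR_NOISE = {"la la la", "static hiss"}  # assumed filter vocabulary

def preset_filtering(utterances):
    # keep only text that comes from speech rather than background music or noise
    return [u for u in utterances if u not in KNOWN_LYRIC_OR_NOISE]

preliminary = run_asr(audio=b"...")
target_speech_result = " ".join(preset_filtering(preliminary))
print(target_speech_result)  # 'the cat is playing'
```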
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 3 is a block diagram illustrating a structure of an electronic device according to an example embodiment. Referring to fig. 3, the electronic device includes a processing component 322 that further includes one or more processors and memory resources, represented by memory 332, for storing instructions, such as application programs, that are executable by the processing component 322. The application programs stored in memory 332 may include one or more modules that each correspond to a set of instructions. Further, the processing component 322 is configured to execute instructions to perform the information acquisition methods described above.
The electronic device may also include a power component 326 configured to perform power management of the electronic device, a wired or wireless network interface 350 configured to connect the electronic device to a network, and an input/output (I/O) interface 358. The electronic device may operate based on an operating system stored in memory 332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, there is also provided a storage medium comprising instructions, for example a memory comprising instructions, which are executable by an electronic device to perform the above method. Alternatively, the computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, the present application further provides a computer program product comprising computer readable code which, when run on an electronic device, causes the electronic device to perform an information acquisition method.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (18)

1. An information acquisition method, characterized in that the method comprises:
obtaining multi-modal information of a video, the multi-modal information comprising: primary text information, auxiliary text information, and multimedia information, the multimedia information comprising: visual information and/or speech information, wherein the visual information comprises: a plurality of key frame images of the video, and the speech information comprises: a speech signal in the video;
generating summary information of the video based on the multi-modal information;
wherein generating the summary information of the video based on the multi-modal information comprises:
processing the multi-modal information by using a preset neural network to obtain the summary information of the video;
the primary text information comprises: the description text of the video and a target text recognition result corresponding to each first key frame image in the plurality of key frame images; the auxiliary text information comprises: a target text recognition result corresponding to each second key frame image in the plurality of key frame images and a target speech recognition result corresponding to the video;
the preset neural network comprises: an encoder and a decoder; and processing the multi-modal information by using the preset neural network to obtain the summary information of the video comprises:
encoding the primary text information by using the encoder, based on the auxiliary text information and the multimedia information, to obtain a target encoding result corresponding to the primary text information;
and decoding the target encoding result corresponding to the primary text information by using the decoder to obtain a decoding result, and obtaining the summary information of the video based on the decoding result.
2. The method of claim 1, wherein the preset neural network is pre-trained with training data comprising: multi-modal information of a video used for training and annotated summary information of the video used for training; when the preset neural network is pre-trained, parameters of the preset neural network are updated based on a loss corresponding to the video used for training, the loss corresponding to the video used for training indicating a degree of difference between predicted summary information of the video used for training and the annotated summary information, wherein the predicted summary information is obtained based on the multi-modal information of the video used for training input into the preset neural network.
3. The method of claim 1, wherein the decoding result comprises: a plurality of candidate summary information and an initial confidence of each candidate summary information; and obtaining the summary information of the video based on the decoding result comprises:
determining a final confidence of each candidate summary information based on reference information of each candidate summary information and the initial confidence of each candidate summary information, wherein the reference information of the candidate summary information comprises: the length of the candidate summary information, the proportion of the primary text information, and the repetition degree of the candidate summary information, wherein the repetition degree of the candidate summary information is the number of words that appear repeatedly in the candidate summary information;
and selecting the summary information of the video from the plurality of candidate summary information based on the final confidence of each candidate summary information.
4. The method of claim 3, wherein selecting the summary information of the video from the plurality of candidate summary information based on the final confidence of each candidate summary information comprises:
selecting the candidate summary information with the highest final confidence as the summary information of the video.
5. The method of claim 1, wherein encoding the primary text information with the encoder based on the auxiliary text information and the multimedia information to obtain a target encoding result corresponding to the primary text information comprises:
performing word segmentation on the primary text information by using the encoder to obtain a plurality of words in the primary text information;
determining, with the encoder, a weight for each of the plurality of words based on the auxiliary text information and the multimedia information;
and encoding each target word by using the encoder to obtain an encoding result of each target word, and combining the encoding results of the target words into the target encoding result corresponding to the primary text information, wherein a target word is a word whose weight is greater than a weight threshold.
6. The method of claim 1, wherein encoding the primary text information with the encoder based on the auxiliary text information and the multimedia information to obtain a target encoding result corresponding to the primary text information comprises:
generating a primary information vector and an auxiliary information vector, the primary information vector comprising: a vector representing the primary text information, a position vector, and a vector representing an identifier of the primary text information, the position vector representing the positions of words in the primary text information, and the auxiliary information vector comprising: a vector representing the auxiliary text information and a vector representing the multimedia information;
and inputting the primary information vector and the auxiliary information vector into the encoder to obtain the target encoding result corresponding to the primary text information output by the encoder.
7. The method of claim 1, wherein obtaining the multi-modal information of the video comprises:
for each first key frame image, performing text recognition on the first key frame image to obtain a preliminary text recognition result corresponding to the first key frame image; and performing preset semantic reduction processing on the preliminary text recognition result corresponding to the first key frame image to obtain the target text recognition result corresponding to the first key frame image.
8. The method of claim 1, wherein obtaining the multi-modal information of the video comprises: for each second key frame image, performing text recognition on the second key frame image to obtain a preliminary text recognition result corresponding to the second key frame image; performing preset semantic reduction processing on the preliminary text recognition result corresponding to the second key frame image to obtain the target text recognition result corresponding to the second key frame image;
performing speech recognition on the speech signal of the video to obtain a preliminary speech recognition result corresponding to the video;
and performing preset filtering processing on the preliminary speech recognition result corresponding to the video to obtain the target speech recognition result corresponding to the video, wherein the preset filtering processing removes text related to background music and text related to noise from the preliminary speech recognition result corresponding to the video.
9. An information acquisition apparatus, characterized in that the apparatus comprises:
an acquisition module configured to acquire multi-modal information of a video, the multi-modal information comprising: primary text information, auxiliary text information, and multimedia information, the multimedia information comprising: visual information and/or speech information, wherein the visual information comprises: a plurality of key frame images of the video, and the speech information comprises: a speech signal in the video;
a generation module configured to generate summary information of the video based on the multimodal information;
the generation module comprises:
the processing sub-module is configured to process the multi-modal information by using a preset neural network to obtain the summary information of the video;
the primary text information includes: the description text of the video and a target text recognition result corresponding to each first key frame image in the plurality of key frame images; the auxiliary text information includes: a target text recognition result corresponding to each second key frame image in the plurality of key frame images and a target speech recognition result corresponding to the video;
the preset neural network comprises: an encoder and a decoder; the processing sub-module is further configured to encode the primary text information by using the encoder, based on the auxiliary text information and the multimedia information, to obtain a target encoding result corresponding to the primary text information; and to decode the target encoding result corresponding to the primary text information by using the decoder to obtain a decoding result, and to obtain the summary information of the video based on the decoding result.
10. The apparatus of claim 9, wherein the preset neural network is pre-trained with training data comprising: multi-modal information of a video used for training and annotated summary information of the video used for training; when the preset neural network is pre-trained, parameters of the preset neural network are updated based on a loss corresponding to the video used for training, the loss corresponding to the video used for training indicating a degree of difference between predicted summary information of the video used for training and the annotated summary information, wherein the predicted summary information is obtained based on the multi-modal information of the video used for training input into the preset neural network.
11. The apparatus of claim 9, wherein the decoding result comprises: a plurality of candidate summary information and an initial confidence of each candidate summary information; the processing sub-module is further configured to determine a final confidence of each candidate summary information based on reference information of each candidate summary information and the initial confidence of each candidate summary information, the reference information of the candidate summary information comprising: the length of the candidate summary information, the proportion of the primary text information, and the repetition degree of the candidate summary information, wherein the repetition degree of the candidate summary information is the number of words that appear repeatedly in the candidate summary information; and to select the summary information of the video from the plurality of candidate summary information based on the final confidence of each candidate summary information.
12. The apparatus of claim 11, wherein the processing sub-module is further configured to select the candidate summary information with the highest final confidence as the summary information of the video.
13. The apparatus of claim 9, wherein the processing sub-module is further configured to perform word segmentation on the primary text information by using the encoder to obtain a plurality of words in the primary text information; to determine, with the encoder, a weight for each of the plurality of words based on the auxiliary text information and the multimedia information; and to encode each target word by using the encoder to obtain an encoding result of each target word, and to combine the encoding results of the target words into the target encoding result corresponding to the primary text information, wherein a target word is a word whose weight is greater than a weight threshold.
14. The apparatus of claim 9, wherein the processing sub-module is further configured to generate a primary information vector and an auxiliary information vector, the primary information vector comprising: a vector representing the primary text information, a position vector, and a vector representing an identifier of the primary text information, the position vector representing the positions of words in the primary text information, and the auxiliary information vector comprising: a vector representing the auxiliary text information and a vector representing the multimedia information; and to input the primary information vector and the auxiliary information vector into the encoder to obtain the target encoding result corresponding to the primary text information output by the encoder.
15. The apparatus according to claim 9, wherein the acquisition module is further configured to perform text recognition on each first key frame image to obtain a preliminary text recognition result corresponding to the first key frame image; and to perform preset semantic reduction processing on the preliminary text recognition result corresponding to the first key frame image to obtain the target text recognition result corresponding to the first key frame image.
16. The apparatus according to claim 9, wherein the acquisition module is further configured to perform, for each second key frame image, text recognition on the second key frame image to obtain a preliminary text recognition result corresponding to the second key frame image; perform preset semantic reduction processing on the preliminary text recognition result corresponding to the second key frame image to obtain the target text recognition result corresponding to the second key frame image; perform speech recognition on the speech signal of the video to obtain a preliminary speech recognition result corresponding to the video; and perform preset filtering processing on the preliminary speech recognition result corresponding to the video to obtain the target speech recognition result corresponding to the video, wherein the preset filtering processing removes text related to background music and text related to noise from the preliminary speech recognition result corresponding to the video.
17. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 8.
18. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-8.
CN202110951049.5A 2021-08-18 2021-08-18 Information acquisition method and device, electronic equipment and storage medium Active CN113792166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951049.5A CN113792166B (en) 2021-08-18 2021-08-18 Information acquisition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110951049.5A CN113792166B (en) 2021-08-18 2021-08-18 Information acquisition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113792166A CN113792166A (en) 2021-12-14
CN113792166B true CN113792166B (en) 2023-04-07

Family

ID=78876093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951049.5A Active CN113792166B (en) 2021-08-18 2021-08-18 Information acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113792166B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283851B (en) * 2021-12-21 2023-03-14 天翼爱音乐文化科技有限公司 Method, system, device and storage medium for identifying client based on video color ring

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI553494B (en) * 2015-11-04 2016-10-11 創意引晴股份有限公司 Multi-modal fusion based Intelligent fault-tolerant video content recognition system and recognition method
CN111723937A (en) * 2019-03-21 2020-09-29 北京三星通信技术研究有限公司 Method, device, equipment and medium for generating description information of multimedia data
CN109960747B (en) * 2019-04-02 2022-12-16 腾讯科技(深圳)有限公司 Video description information generation method, video processing method and corresponding devices
CN110147467A (en) * 2019-04-11 2019-08-20 北京达佳互联信息技术有限公司 A kind of generation method, device, mobile terminal and the storage medium of text description
CN109874029B (en) * 2019-04-22 2021-02-12 腾讯科技(深圳)有限公司 Video description generation method, device, equipment and storage medium
CN110213668A (en) * 2019-04-29 2019-09-06 北京三快在线科技有限公司 Generation method, device, electronic equipment and the storage medium of video title
CN110263218B (en) * 2019-06-21 2022-02-25 北京百度网讯科技有限公司 Video description text generation method, device, equipment and medium
CN110781345B (en) * 2019-10-31 2022-12-27 北京达佳互联信息技术有限公司 Video description generation model obtaining method, video description generation method and device
CN112069361A (en) * 2020-08-27 2020-12-11 新华智云科技有限公司 Video description text generation method based on multi-mode fusion

Also Published As

Publication number Publication date
CN113792166A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111339283B (en) Method and device for providing customer service answers aiming at user questions
CN110148400B (en) Pronunciation type recognition method, model training method, device and equipment
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN113705315B (en) Video processing method, device, equipment and storage medium
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN116166827B (en) Training of semantic tag extraction model and semantic tag extraction method and device
CN112668333A (en) Named entity recognition method and device, and computer-readable storage medium
CN117217233A (en) Text correction and text correction model training method and device
CN110738061B (en) Ancient poetry generating method, device, equipment and storage medium
CN113792166B (en) Information acquisition method and device, electronic equipment and storage medium
CN115408488A (en) Segmentation method and system for novel scene text
CN113192534A (en) Address search method and device, electronic equipment and storage medium
CN116524915A (en) Weak supervision voice-video positioning method and system based on semantic interaction
CN113486160B (en) Dialogue method and system based on cross-language knowledge
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115129843A (en) Dialog text abstract extraction method and device
CN112183114A (en) Model training and semantic integrity recognition method and device
CN115081459B (en) Spoken language text generation method, device, equipment and storage medium
US20240205520A1 (en) Method for coherent, unsupervised, transcript-based, extractive summarisation of long videos of spoken content
CN113254587B (en) Search text recognition method and device, computer equipment and storage medium
CN113593574B (en) Speech recognition method, computer program product and electronic equipment
CN111767727B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant