CN114880496A - Multimedia information topic analysis method, device, equipment and storage medium - Google Patents
- Publication number
- CN114880496A (application number CN202210471183.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- topic
- information
- determining
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
- G06F16/433—Query formulation using audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
- G06F16/434—Query formulation using image data, e.g. images, photos, pictures taken by a user
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/435—Filtering based on additional data, e.g. user or group profiles
- G06F16/436—Filtering based on additional data, e.g. user or group profiles using biological or physiological data of a human being, e.g. blood pressure, facial expression, gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
The disclosure relates to a multimedia information topic analysis method, device, equipment, and computer-readable storage medium. A speech recognition text is obtained from the audio data, and a subtitle text is obtained from key frames of the video data. For the speech recognition text and the subtitle text, topic information is extracted from three sources (entities, keywords, and semantic tags), realizing comprehensive, multi-granularity text topic extraction based on text data. For the key frames of the video data, topic information is extracted from two sources (face tags and picture tags), realizing visual topic extraction based on image data. The multi-modal characteristics of the video data are thus fully considered, the video topics are comprehensively analyzed, and the accuracy of topic analysis is improved. Furthermore, accurate topic analysis lets audiences quickly and effectively grasp the main information of a video, improving efficiency, and the method can be widely applied to scenarios such as personalized video recommendation and video content retrieval.
Description
Technical Field
The present disclosure relates to the fields of deep learning, computer vision, and natural language processing, and in particular to a multimedia information topic analysis method, apparatus, device, and computer-readable storage medium.
Background
In the internet environment, multimedia information plays an increasingly important role in daily life, and accurately analyzing the topics of multimedia information is of great importance for screening and reading multimedia information, retrieving multimedia content, and making personalized multimedia recommendations. Video is an important form of multimedia information, so accurately analyzing the topic of a video is particularly important.
Video is a typical multi-modal composition that includes features of different modalities such as text, images, and audio. However, in the prior art, topic analysis technologies for video generally analyze the video topic directly from the video title and similar key frames of the video. Because of the multi-modal characteristics of video, performing topic analysis only on the video title and similar key frames makes the analysis incomplete and lowers the accuracy of the topic analysis results, which in turn lowers the accuracy of video screening and reading, video content retrieval, personalized video recommendation, and the like.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, the present disclosure provides a multimedia information topic analysis method, apparatus, device, and computer-readable storage medium that fully consider the multi-modal characteristics of video data, comprehensively analyze video topics, and improve the accuracy of topic analysis. Furthermore, accurate topic analysis lets audiences quickly and effectively grasp the main information of a video and improves efficiency, and the method can be widely applied to scenarios such as personalized video recommendation and video content retrieval.
In a first aspect, an embodiment of the present disclosure provides a multimedia information topic analysis method, including:
extracting audio data and video data in the multimedia information;
extracting voice information from the audio data to obtain a voice recognition text;
extracting subtitle information from the key frame of the video data to obtain a subtitle text;
extracting entities and key words from the voice recognition text and the subtitle text, and determining semantic labels of the voice recognition text and the subtitle text;
extracting a face label in the key frame and determining a picture label of the key frame;
and determining the target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag and the picture tag.
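The overall flow claimed above can be sketched as a minimal data structure that fuses the five tag sources into one target topic set. This is an illustrative sketch only; all names here (`TopicResult`, `target_topics`) are hypothetical and not used in the disclosure, and the fusion rule (order-preserving de-duplication) is an assumption.

```python
# Hypothetical sketch of the claimed flow: five tag sources fused into
# the target topics of the multimedia information. Names and fusion rule
# are illustrative assumptions, not the disclosure's own.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicResult:
    entities: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    semantic_tags: List[str] = field(default_factory=list)
    face_tags: List[str] = field(default_factory=list)
    picture_tags: List[str] = field(default_factory=list)

    def target_topics(self) -> List[str]:
        # Fuse all five sources, de-duplicated in first-seen order.
        seen, fused = set(), []
        for tag in (self.entities + self.keywords + self.semantic_tags
                    + self.face_tags + self.picture_tags):
            if tag not in seen:
                seen.add(tag)
                fused.append(tag)
        return fused
```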
In a second aspect, an embodiment of the present disclosure provides a multimedia information topic analysis device, including:
the first extraction module is used for extracting audio data and video data in the multimedia information;
the second extraction module is used for extracting voice information from the audio data to obtain a voice recognition text;
the third extraction module is used for extracting subtitle information from the key frames of the video data to obtain subtitle texts;
a fourth extraction module, configured to extract entities and keywords from the speech recognition text and the subtitle text;
the first determining module is used for determining semantic tags of the voice recognition text and the subtitle text;
the fifth extraction module is used for extracting the face labels in the key frames;
a second determining module, configured to determine a picture tag of the key frame;
and the third determining module is used for determining the target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag and the picture tag.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the method of the first aspect.
In a fifth aspect, the disclosed embodiments also provide a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the method for analyzing the multimedia information topic is implemented as described above.
According to the multimedia information topic analysis method, device, equipment, and computer-readable storage medium provided by the embodiments of the present disclosure, a speech recognition text is obtained from the audio data and a subtitle text is obtained from key frames of the video data. For the speech recognition text and the subtitle text, topic information is extracted from the entities, the keywords, and the semantic tags, realizing comprehensive, multi-granularity text topic extraction based on the text data; for the key frames of the video data, topic information is extracted from the face tags and the picture tags, realizing visual topic extraction based on the image data. The multi-modal information of the video data is thus fully considered, the video topics are comprehensively analyzed, and the accuracy of topic analysis is improved. Furthermore, accurate topic analysis lets audiences quickly and effectively grasp the main information of a video, improving efficiency, and the method can be widely applied to scenarios such as personalized video recommendation and video content retrieval.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be apparent to those skilled in the art that other drawings can be derived from these drawings without inventive effort.
Fig. 1 is a flowchart of a multimedia information topic analysis method provided in an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating an implementation of a multimedia information topic analysis method according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a topic information extraction implementation based on text data according to an embodiment of the present disclosure;
fig. 4 is a flowchart of implementing topic information extraction based on image data according to an embodiment of the present disclosure;
fig. 5 is a flowchart of another multimedia information topic analysis method provided by the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a multimedia information topic analysis device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In the internet environment, multimedia information plays an increasingly important role in daily life, and accurately analyzing the topics of multimedia information is of great importance for screening and reading multimedia information, retrieving multimedia content, and making personalized multimedia recommendations. Video is an important form of multimedia information, so accurately analyzing the topic of a video is particularly important.
In general, most video-oriented topic analysis technologies analyze the video topic directly from the video title or similar key frames of the video. However, because of the multi-modal characteristics of video, performing topic analysis only on the video title and similar key frames makes the analysis incomplete and lowers the accuracy of the topic analysis results, which in turn lowers the accuracy of video screening and reading, video content retrieval, personalized video recommendation, and the like.
To solve this problem, the embodiments of the present disclosure provide a multimedia information topic analysis method, which is introduced below with reference to specific embodiments.
Fig. 1 is a flowchart of a multimedia information topic analysis method provided in an embodiment of the present disclosure; fig. 2 is a flowchart illustrating an implementation of the multimedia information topic analysis method according to an embodiment of the present disclosure. The multimedia information topic analysis method can be executed by a multimedia information topic analysis device; the device can be implemented in software and/or hardware and can be integrated into any electronic equipment with computing capability, such as a terminal, including a smartphone, a palmtop computer, a tablet computer, a wearable device with a display screen, a notebook computer, and the like.
At present, video has become an important form of multimedia information, and the multimedia information topic analysis method provided in this embodiment is suitable for multimedia information presented in video form; the method is therefore introduced below from the perspective of video.
As shown in fig. 1, the multimedia information topic analysis method provided by the embodiment of the present disclosure includes the following steps:
and S101, extracting audio data and video data in the multimedia information.
Specifically, in this embodiment, extracting the audio data and the video data in the multimedia information means separating the multimedia information and splitting it into two parts: audio data and video data. The audio data comprises the voice content of the multimedia information, and the video data comprises all of the visual content, i.e., everything other than the audio data.
It should be noted that any available method, tool, or device may be used for the multimedia data separation; the embodiments of the present disclosure are not limited in this respect.
S102, extracting voice information from the audio data to obtain a voice recognition text.
Specifically, the main purpose of this step is to obtain a speech recognition text. The speech recognition text is obtained by extracting the voice information from the audio data using Automatic Speech Recognition (ASR).
For example, in some embodiments, extracting the voice information from the separated audio data to obtain the speech recognition text mainly involves feature extraction, an acoustic model, a language model, and a dictionary with decoding. Specifically, the audio data is essentially a set of sound signals; feature extraction converts the sound signals from the time domain to the frequency domain and provides suitable feature vectors for the acoustic model. The acoustic model then scores each feature vector against its acoustic features; next, the language model computes, from linguistic knowledge, the probability of each candidate phrase sequence corresponding to the sound signal; finally, the phrase sequence is decoded against an existing dictionary to obtain the speech recognition text.
It should be noted that the above implementation is only one feasible technical solution; the present application does not limit how the speech recognition text is obtained from the audio data.
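The decode flow described above (acoustic scores combined with language-model scores over an in-dictionary vocabulary) can be illustrated with a toy sketch. The models here are stubs with made-up scores; a real ASR system would use trained acoustic and language models.

```python
# Toy illustration of the ASR decode flow: acoustic score * language-model
# score, restricted to dictionary words. All scores are illustrative stubs.
ACOUSTIC = {"f1": {"news": 0.7, "noose": 0.3}}   # P(word | frame features)
LANGUAGE = {"news": 0.9, "noose": 0.1}           # unigram language-model prior
DICTIONARY = {"news", "noose"}


def decode(frames):
    """For each frame, pick the in-dictionary word maximising the
    combined acoustic and language-model score."""
    out = []
    for f in frames:
        candidates = [(w, ACOUSTIC[f][w] * LANGUAGE[w])
                      for w in ACOUSTIC[f] if w in DICTIONARY]
        out.append(max(candidates, key=lambda c: c[1])[0])
    return " ".join(out)
```

Here the language model resolves the acoustically ambiguous frame toward the more probable word, mirroring how the language-model stage reweights acoustic hypotheses.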
And S103, extracting subtitle information from the key frame of the video data to obtain a subtitle text.
Specifically, in this embodiment, key frames are extracted from the video data as the minimum units of video processing. The text information in a key frame is recognized using Optical Character Recognition (OCR).
Alternatively, a text recognition model may be used to recognize the text information in the key frames; illustratively, a character recognition model such as the PaddleOCR model may be used.
Because the video title best represents the topic of a video (for example, in a news broadcast video the title is displayed on the video page to summarize the central content of the news), the subtitle text of the video title plays an important role in video topic analysis.
Therefore, in this embodiment, the character recognition model is used to obtain the subtitle text of the video title in the video data; the input of the PaddleOCR character recognition model is a key frame, and its output is a subtitle text.
To avoid interference from regions unrelated to the subtitle text of the video title, a mask needs to be added to the key frames extracted from the video data before text recognition with the PaddleOCR character recognition model. In this embodiment, the mask hides the areas of the key frame that are irrelevant to the subtitle text of the video title so that they do not take part in text recognition; text recognition is then performed on the unmasked region with the PaddleOCR character recognition model to obtain the subtitle text.
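The masking step can be sketched as follows. The frame is represented as a plain 2-D grid of pixel values and the title region as a bounding box; the box coordinates and the follow-on OCR call are illustrative assumptions (the disclosure does not fix how the title region is located).

```python
def mask_keyframe(frame, title_box):
    """Zero out every pixel outside the assumed title region so that
    only the title area takes part in subsequent text recognition.
    title_box = (top, left, bottom, right), half-open row/col ranges."""
    top, left, bottom, right = title_box
    return [[px if top <= r < bottom and left <= c < right else 0
             for c, px in enumerate(row)]
            for r, row in enumerate(frame)]


# A real pipeline would then run OCR on the masked frame, e.g. with
# PaddleOCR (hypothetical usage, not verified here):
# subtitle_text = ocr_model.ocr(masked_frame)
```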
And S104, extracting entities and key words from the voice recognition text and the subtitle text, and determining semantic labels of the voice recognition text and the subtitle text.
Specifically, in this embodiment, the topic information is obtained from three aspects: extracting the entities in the speech recognition text and the subtitle text, extracting the keywords in these texts, and determining their semantic tags. The implementation of this step is shown in fig. 3 and is described below with reference to fig. 3:
In the first aspect, in the natural language field, entities are structured attributes within unstructured text data. Entities including person names, place names, organization names, times, and so on are obtained through Named Entity Recognition (NER). From a semantic perspective, entities such as a specific time, a specific location, or a designated person or organization constrain the text content, and introducing these features into topic detection can improve the accuracy of topic analysis.
Specifically, in this embodiment, the obtaining entity may be implemented by:
First, sentence segmentation, word segmentation, stop-word removal, and part-of-speech tagging are performed on the speech recognition text and the subtitle text, i.e., the text preprocessing in fig. 3.
Sentence segmentation and word segmentation refer to dividing the speech recognition text and the subtitle text into separate sentences, separate words, and the like.
Stop-word removal refers to removing the words or phrases in the speech recognition text and the subtitle text that need to be automatically filtered out. Removing stop words improves text processing efficiency, saves storage space, and improves the accuracy of topic analysis.
Part-of-speech tagging refers to labeling the words in the speech recognition text and the subtitle text with their parts of speech, which include, but are not limited to, nouns, verbs, and adjectives.
Through these steps, a word segmentation set with stop words removed and parts of speech labeled is obtained.
For example, sentence segmentation, word segmentation, and part-of-speech tagging can be implemented with the open-source Chinese word segmentation library jieba. Stop words can be removed by matching the speech recognition text and the subtitle text against an existing stop-word list: if a word in either text appears in the stop-word list, it is removed. The stop-word list may be, for example, the Baidu stop-word list or a custom stop-word list, which the embodiments of the present disclosure do not limit.
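The preprocessing and noun-screening steps can be sketched as below. In practice jieba (e.g. `jieba.posseg.cut`) would segment and POS-tag the Chinese text; here a pre-tagged toy token sequence stands in so the filtering logic itself is runnable, and the tag names (`"n"`, `"v"`) follow jieba's convention as an assumption.

```python
# Toy stand-in for jieba-based preprocessing: stop-word removal followed
# by part-of-speech screening (nouns by default). Tokens arrive already
# segmented and tagged, which jieba would do in a real pipeline.
STOPWORDS = {"the", "a", "of", "is"}


def preprocess(tagged_tokens, keep_pos=frozenset({"n"})):
    """Drop stop words, then keep only tokens whose POS tag is in
    keep_pos, mirroring the noun screening step described above."""
    kept = [(w, p) for w, p in tagged_tokens if w not in STOPWORDS]
    return [w for w, p in kept if p in keep_pos]
```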
Secondly, based on the processed speech recognition text and subtitle text, the words whose part of speech is noun are screened out to obtain a noun word segmentation set.
Since most entities are nouns, in this embodiment the words or phrases tagged as nouns are filtered out to form the noun word segmentation set. In other embodiments, words of other parts of speech may also be screened, which the embodiments of the present disclosure do not limit.
Then, entities such as a person name, a place name, an organization, a time, and the like are identified, i.e., the entity identification in fig. 3.
Optionally, entities in the speech recognition text and the subtitle text may be extracted using a deep learning model.
For example, entities such as person names, place names, organizations, and times in the noun word segmentation set can be identified with a Long Short-Term Memory network (LSTM) combined with a Conditional Random Field (CRF). In this embodiment, these two models take a word segmentation sequence as input and output an entity recognition sequence.
Finally, entity filtering and entity de-duplication are performed on the output entity recognition sequence to remove entities of no interest and repeated entities, i.e., the entity filtering shown in fig. 3, so as to obtain the required entities.
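The final filtering and de-duplication pass can be sketched directly; the entity type names used here are illustrative placeholders, since the disclosure does not fix a type inventory.

```python
def filter_entities(entity_sequence, ignored_types=frozenset({"OTHER"})):
    """Drop entities of un-wanted types, then de-duplicate while keeping
    first-occurrence order, mirroring the entity filtering in fig. 3.
    entity_sequence: list of (text, entity_type) pairs."""
    seen, result = set(), []
    for text, etype in entity_sequence:
        if etype in ignored_types or text in seen:
            continue
        seen.add(text)
        result.append((text, etype))
    return result
```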
In the second aspect, keywords are important words that represent a text and, to some extent, express its central theme, so extracting the keywords of a text is also important for topic analysis.
Specifically, in this embodiment, extracting the keywords of the speech recognition text and the subtitle text may be implemented in the following manner:
first, sentence segmentation, word segmentation and stop word removal processing are performed on the speech recognition text and the subtitle text, that is, the text preprocessing shown in fig. 3. In this implementation manner, the implementation manner is the same as the implementation manner of sentence segmentation, word segmentation and stop word removal in the above entity, and is not described herein again.
Through this step, a set of segmented words from which stop words are removed is obtained.
Next, words that are more frequently appeared in the segmented word set from which stop words are removed and are unusual in the text set are extracted as keywords, that is, keyword extraction shown in fig. 3.
Optionally, the keyword extraction model may be used to extract keywords in the speech recognition text and the subtitle text.
Illustratively, a Term Frequency-Inverse Document Frequency (TF-IDF) keyword extraction model may be used. The main logic of the TF-IDF model is: if a word appears frequently in one text but rarely in other documents, the word is considered to have good discriminating power for representing that document; a word with a high term frequency in one text and a low document frequency across the whole document collection is given a high weight. Applied to this embodiment, the model takes the word segmentation set with stop words removed as input and outputs the keywords and their weights.
Optionally, the output keywords may be sorted by weight, i.e., the keyword sorting shown in fig. 3; the higher a keyword ranks, the more important it is, and the top K keywords may be selected as the final keywords as needed.
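The TF-IDF scoring and top-K selection described above can be sketched from scratch; the smoothing term in the IDF formula is one common variant and an assumption here, not specified by the disclosure.

```python
# From-scratch sketch of TF-IDF keyword extraction with top-K selection.
# Input documents are pre-segmented word lists (stop words already removed).
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Score each word of docs[doc_index] by tf * idf and return the
    top_k highest-weighted words, i.e. the keyword sorting step."""
    doc = docs[doc_index]
    tf = Counter(doc)
    n_docs = len(docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)          # document frequency
        idf = math.log(n_docs / (1 + df)) + 1           # smoothed IDF (assumed variant)
        scores[word] = (count / len(doc)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A word frequent in one document but spread across the whole collection ("flower" below) is out-ranked by a word concentrated in that document ("rose"), which is exactly the discriminating behavior the model relies on.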
In the third aspect, because entities and keywords capture only part of the text content and lack global information, the speech recognition text and the subtitle text are fused and the semantic tags of the fused text are determined, so that the video topic is analyzed from a global perspective.
In this embodiment, determining the semantic tags of the speech recognition text and the subtitle text may specifically mean determining the primary tag and the secondary tags corresponding to them. Specifically, determining the semantic tags of the fused speech recognition text and subtitle text may be implemented as follows:
first, primary labels of the speech recognition text and the subtitle text are determined, and the primary labels exemplarily include: finance, military, science and technology, sports, entertainment and the like.
Alternatively, the multi-classification model may be used to determine the primary labels of the speech recognition text and the subtitle text.
For example, the multi-classification model may be a pre-trained BERT model. Before the BERT model is used, it needs to be trained; the classification data for the training stage may be a currently published Chinese text classification dataset or a custom Chinese text classification dataset.
After the model is trained, the model is applied to the embodiment, the input of the model is a voice recognition text and a subtitle text, and the output of the model is a primary label of the voice recognition text and the subtitle text.
Secondly, secondary labels of the speech recognition text and the subtitle text are determined; the secondary labels under each primary label need to be preset before this determination. For example, for the sports category among the above primary labels, its corresponding secondary labels may be set to include: basketball, football, sports stars, and the like.
Optionally, for the secondary label, a multi-label classification model may be used to determine the secondary label of the speech recognition text and the subtitle text.
For example, the multi-label classification model may be a BERT model. Before the model is used, it also needs to be trained; the classification data for the training stage may be a published or custom Chinese multi-label classification dataset.
After the model is trained, the model is applied to the embodiment, the input of the model is a voice recognition text and a subtitle text, and the output of the model is a plurality of secondary labels of the voice recognition text and the subtitle text.
Based on the above implementation scheme, a primary label and a plurality of secondary labels of the speech recognition text and the subtitle text, that is, semantic labels shown in fig. 3, can be determined.
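The final step of such a multi-label classifier, turning per-label scores into the set of secondary labels, can be sketched as follows (an illustrative sketch; the logits and label names are assumed example values, and the sigmoid-plus-threshold step is standard multi-label practice rather than a detail stated in this embodiment):

```python
import math

def secondary_labels_from_logits(logits, label_names, threshold=0.5):
    """Turn a multi-label classifier's raw per-label logits into a set of
    secondary labels: apply a sigmoid per label and keep every label whose
    probability clears the threshold (unlike single-label argmax, several
    labels can be returned at once)."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(label_names, probs) if p >= threshold]
```

This is what distinguishes the secondary-label step from the primary-label step: the primary label is a single class, while the secondary labels form a set of any size.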
It should be noted that when the video only contains the speech recognition text and does not contain the subtitle text, only the primary tag and the secondary tag corresponding to the speech recognition text need to be determined. When the video only contains the subtitle text and does not contain the voice recognition text, only the primary label and the secondary label corresponding to the subtitle text need to be determined.
It can be understood that, because the speech recognition text and the subtitle text both belong to text data, this step implements text data-based topic information extraction, i.e., the text data-based topic information shown in fig. 3.
And S105, extracting the face label in the key frame and determining the picture label of the key frame.
Specifically, a video includes key frame images in addition to the speech recognition text and the subtitle text. To some extent, a key frame image carries more information than text and is more representative of the real topic.
For a key frame image included in a video, in this embodiment, topic information is obtained through two aspects of a face tag and a picture tag, and an implementation flowchart of this step is shown in fig. 4, which is specifically described below with reference to fig. 4.
In this embodiment, the face label may specifically be a person's name; exemplarily, the face label is Zhang San, Li Si, or the like. The picture labels may be classification labels obtained by classifying pictures; exemplarily, the picture labels are people, animals, plants, and the like.
In the first aspect, the face label can be implemented in the following manner.
Specifically, the face label extraction includes several steps of face detection, face feature extraction and feature comparison.
Firstly, key points of a human face are detected from the key frame data, namely the face detection shown in fig. 4.
For example, a face detection model (MTCNN, Multi-task Cascaded Convolutional Networks) may be used for face key point detection; the model's input is the key frame data and its output is the position information of the face key points.
Secondly, based on the detection result of the key points of the face, the face features contained in the corresponding positions are extracted, namely the feature extraction shown in fig. 4 is carried out.
For example, a face feature extraction model, FaceNet, may be used to extract the face features contained at the corresponding positions; the model's input is the position information of the face key points and its output is the corresponding face feature vectors.
Finally, the face feature vectors extracted from the key frame are compared with the face feature vectors in the face feature library one by one to obtain a face label extraction result, namely the feature comparison shown in fig. 4.
It can be understood that before comparing the face feature vectors extracted from the key frames with the face feature vectors in the face feature library one by one, the face feature library needs to be constructed in advance. The face feature library is derived from face images with different gestures, illumination and expressions in the face library, and the face library can be automatically established according to a service scene.
Specifically, after the face library is established, the face feature vectors in the face feature library are obtained from the face images in the face library by using the face feature extraction model FaceNet. The face feature vectors extracted from the key frames are then compared with the face feature vectors in the face feature library one by one to extract the face labels; exemplarily, the name of the person in the key frame obtained from the model is Zhang San.
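The one-by-one feature comparison can be sketched as follows (a minimal illustration assuming cosine similarity with a fixed threshold; the patent does not specify the similarity measure or threshold, so both are assumptions, and the names are placeholder values):

```python
import math

def match_face(query_vec, feature_library, threshold=0.6):
    """Compare a query face embedding against a {name: embedding} library
    one by one using cosine similarity; return the best-matching name,
    or None if no entry clears the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    best_name, best_sim = None, threshold
    for name, vec in feature_library.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```

Returning None for sub-threshold matches avoids tagging faces that are not in the library at all.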
In a second aspect, the picture tagging for key frames can be implemented as follows.
Optionally, determining the picture labels of the key frames may be implemented by classifying the key frames with a picture classification model. Illustratively, the picture classification model is a ResNet model, i.e., the picture classification shown in fig. 4.
Optionally, before the picture classification model ResNet is used, it is trained in advance on the large-scale image dataset ImageNet, that is, the model training shown in fig. 4. After training is completed, the model's input is a key frame and its output is a picture label. Illustratively, the picture label of the key frame obtained from the model is animal, or the like.
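The last step of the picture classification, mapping the model's raw outputs to a single picture label, can be sketched as follows (an illustrative sketch; softmax-plus-argmax is standard single-label classifier practice, and the logits and class names are hypothetical example values):

```python
import math

def picture_label(logits, class_names):
    """Final step after a classifier's forward pass: softmax over the raw
    logits, then argmax to pick the single picture label for the key frame."""
    shift = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]
```

The returned probability can also serve as a confidence score if low-confidence picture labels should be discarded.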
It can be understood that, since the key frame belongs to the image data, this step implements the visual topic extraction based on the image data, i.e., the visual topic information based on the image data shown in fig. 4.
And S106, determining the target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag and the picture tag.
Specifically, topic information is extracted from three aspects of an entity, a keyword and a semantic tag aiming at a voice recognition text and a subtitle text in a video to obtain a plurality of topics; extracting topic information from two aspects of a face label and a picture label aiming at a key frame contained in a video to obtain a plurality of topics; and using the obtained topics as target topics of the multimedia information.
The embodiment of the disclosure obtains a voice recognition text through audio data and obtains a subtitle text through a key frame of video data; topic information is extracted from the entity, the keyword and the semantic tag aiming at the voice recognition text and the subtitle text, so that the comprehensive and multi-granularity text topic extraction based on text data is realized; aiming at key frames in video data, topic information is extracted from two aspects of face labels and picture labels, and visual topic extraction based on image data is realized; the multi-modal characteristics of the video data are fully considered, the video topics are comprehensively analyzed, and the accuracy of topic analysis is improved; furthermore, through accurate topic analysis, audiences can quickly and effectively acquire main information of videos, the working efficiency is improved, and the method and the device can be widely applied to scenes such as video personalized recommendation and video content retrieval.
Fig. 5 is another multimedia information topic analysis method provided in an embodiment of the present disclosure, which includes the following steps:
S501, extracting audio data and video data in the multimedia information.
In this embodiment, the step is the same as S101, and is not described herein again.
S502, extracting first text information in the voice information from the audio data, and performing error correction processing on the first text information; and obtaining a voice recognition text based on the first text information after the error correction processing.
Specifically, in this embodiment, first, the voice information is extracted from the separated audio data, and the first text information is obtained. The extracting of the voice information from the separated audio data to obtain the first text information may be implemented in the implementation manner described in S101.
And secondly, performing error correction processing on the first text information, and taking the first text information after the error correction processing as a voice recognition text.
Specifically, the error correction processing refers to correcting errors appearing in the first text information, such as visually similar characters, homophones, idiom misuse, quantifier collocation errors, and grammatical errors. For example, the first text information may be matched against a word bank collected in advance, and words or sentences that do not match the word bank may be corrected; the corrected first text information is then used as the speech recognition text.
S503, extracting second text information from the key frame of the video data, and carrying out error correction processing on the second text information; and obtaining the subtitle text based on the second text information after the error correction processing.
Specifically, in this embodiment, first, the second text information is obtained from the key frame of the separated video data. The obtaining of the second text information from the separated key frame of the video data may be implemented by the implementation manner described in S103.
And secondly, performing error correction processing on the second text information, and taking the second text information after the error correction processing as a subtitle text. The error correction processing performed on the second text information may be implemented in the same manner as the error correction processing described in S202, and is not described herein again.
S504, extracting entities and key words from the voice recognition texts and the subtitle texts, and determining semantic labels of the voice recognition texts and the subtitle texts.
In this embodiment, the step is the same as S104, and is not described herein again.
And S505, extracting the face label in the key frame and determining the picture label of the key frame.
In this embodiment, the step is the same as S105, and is not described herein again.
S506, merging the entities, the keywords, and the semantic labels to obtain merged text topic information; merging the face labels and the picture labels to obtain merged visual topic information; and determining multiple topics appearing in the multimedia information according to the merged text topic information and the merged visual topic information. Optionally, the picture labels of the key frames are determined by using a picture classification model; and the face labels and the picture labels of the key frames are deduplicated respectively and then merged to obtain the merged visual topic information.
Specifically, in this embodiment, because the video key frames include many similar key frames, the face tag extraction result obtained in S505 contains repeated content, and likewise the picture tag extraction result of the key frames contains repeated content. Therefore, the face tag extraction result and the picture tag extraction result need to be deduplicated respectively, and the deduplicated face tags and picture tags are then merged. For the speech recognition text and the subtitle text, text topic information is extracted from three aspects (entities, keywords, and semantic tags), and the entities, keywords, and semantic tags are merged to obtain multiple pieces of text topic information. For the key frames contained in the video, visual topic information is extracted from two aspects (face tags and picture tags), and the face tags and picture tags are merged to obtain multiple pieces of visual topic information. Combining the text topic information and the visual topic information, multiple topics can be determined.
For example, from a video segment, the extracted entities include rose, festival, and the like; the keywords include plant, rose, red, and the like; the semantic tags include the primary tag plant, the secondary tag rose, and the like; and the entities, keywords, and semantic tags are merged to obtain multiple pieces of text topic information. The face labels include Zhang San; the picture labels include rose, lily, and the like; and the face labels and picture labels are merged to obtain multiple pieces of visual topic information. According to the text topic information and the visual topic information, the topics of the video can be determined to include plant, rose, and the like.
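The deduplicate-then-merge step described above can be sketched as follows (a minimal illustration; the function name is hypothetical, and the labels reuse the example values from this embodiment):

```python
def merge_visual_topics(face_labels, picture_labels):
    """Deduplicate the face labels and the picture labels separately
    (similar key frames repeat the same tags), then merge the two lists
    into the combined visual topic information, preserving first-seen order."""
    def dedupe(labels):
        seen, unique = set(), []
        for label in labels:
            if label not in seen:
                seen.add(label)
                unique.append(label)
        return unique
    return dedupe(face_labels) + dedupe(picture_labels)
```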
S507, determining the weight value of each topic according to the origin of each topic in the multiple topics; and determining a target topic of the multimedia information according to the weight value of each topic, wherein the target topic is a topic with the weight value meeting a preset condition.
Specifically, in this embodiment, the origin of each topic refers to whether the source of each topic is text data or image data, or both text data and image data.
In the present embodiment, the weight value of each topic is determined according to the source of each topic.
Optionally, if the topic appears in the text data, determining that the weight value of the topic is a first weight value, where the text data includes a voice recognition text and/or a subtitle text; illustratively, the first weight value is 0.4.
If the topic appears in the image data, the weight value of the topic is determined to be a second weight value, wherein the image data comprises the key frames; illustratively, the second weight value is 0.6.
And if the topic appears in the text data and the image data at the same time, determining that the weight value of the topic is a third weight value, wherein the third weight value is the sum of the first weight value and the second weight value. Illustratively, the third weight value is 1.
Then, based on the weight of each topic, when the weight of a topic meets a preset condition, the topic corresponding to the weight is the target topic.
For example, for the weight value of each topic, the preset condition may be that the topic with the topic weight value greater than or equal to 0.6 is taken as the target topic.
For example, the topics may be arranged in a descending order according to the weight values, and the preset condition may be that after the topic weights are arranged in a descending order, the top K topics are taken as target topics. The preset condition is not limited in this embodiment.
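The provenance-based weighting and threshold selection described in S507 can be sketched as follows (using the example weight values 0.4 and 0.6 and the example threshold 0.6 from above; the function name and the input encoding are hypothetical):

```python
def select_target_topics(topic_sources, w_text=0.4, w_image=0.6, threshold=0.6):
    """topic_sources maps each topic to an (appears_in_text, appears_in_image)
    pair. A topic found in both modalities gets the sum of the two weights;
    topics whose weight meets the threshold become target topics."""
    targets = []
    for topic, (in_text, in_image) in topic_sources.items():
        weight = (w_text if in_text else 0.0) + (w_image if in_image else 0.0)
        if weight >= threshold:
            targets.append(topic)
    return targets
```

With these example values, a topic found in both text and image data gets weight 1.0, image-only topics get 0.6, and text-only topics (weight 0.4) fall below the threshold.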
In the embodiment of the present disclosure, voice information is extracted from the audio data to obtain the first text information, the second text information is obtained from the key frames of the video data, and error correction processing is performed on the first text information and the second text information, which avoids the interference of grammatical errors, wrong characters, and the like on topic analysis and further improves its accuracy. Meanwhile, different weight values are determined according to the origin of each topic, so that multi-granularity topics are analyzed comprehensively while topics from different origins are treated differently, improving the accuracy of topic analysis.
Fig. 6 is a schematic structural diagram of a multimedia information topic analysis apparatus according to an embodiment of the present disclosure. The multimedia information topic analysis means may be the terminal device as described in the above embodiments, or the multimedia information topic analysis means may be the terminal device component or assembly as described above. The multimedia information topic analysis device provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiment of the multimedia information topic analysis method, and as shown in fig. 6, the multimedia information topic analysis device 60 includes:
the first extraction module 61 is configured to extract audio data and video data in the multimedia information.
And a second extracting module 62, configured to extract voice information from the audio data to obtain a voice recognition text.
And a third extraction module 63, configured to extract subtitle information from the key frame of the video data to obtain a subtitle text.
A fourth extraction module 64, configured to extract entities and keywords from the speech recognition text and the subtitle text.
A first determining module 65, configured to determine semantic tags of the speech recognition text and the subtitle text.
And a fifth extraction module 66, configured to extract the face label in the key frame.
A second determining module 67, configured to determine the picture label of the key frame.
A third determining module 68, configured to determine a target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag, and the picture tag.
Optionally, the third determining module includes a merging unit 681, a first determining unit 682, a second determining unit 683, and a third determining unit 684. The merging unit 681 is configured to merge the face label and the picture label to obtain merged visual topic information; a first determining unit 682 is configured to determine a plurality of topics appearing in the multimedia information according to the entities, the keywords, the semantic tags, and the merged visual topic information; the second determining unit 683 is used for determining the weight value of each topic according to the origin of each topic in the plurality of topics; the third determining unit 684 is configured to determine a target topic of the multimedia information according to a weight value of each topic, where the target topic is a topic whose weight value satisfies a preset condition.
Optionally, the second determining unit 683 is configured to, when determining the weight value of each topic according to the origin of each topic in the plurality of topics, specifically: if the topic appears in text data, determining that the weight value of the topic is a first weight value, wherein the text data comprises the voice recognition text and/or the caption text; if the topic appears in image data, determining that the weight value of the topic is a second weight value, wherein the image data comprises the key frame; if the topic appears in the text data and the image data at the same time, determining that the weight value of the topic is a third weight value, wherein the third weight value is the sum of the first weight value and the second weight value.
Optionally, the merging unit 681 is configured to merge the face label and the picture label to obtain merged visual topic information, and specifically configured to: and respectively de-duplicating the face label and the picture label of the key frame, and then combining the face label and the picture label of the key frame to obtain combined visual topic information.
Optionally, the second extracting module 62 is configured to extract voice information from the audio data, and when obtaining a voice recognition text, specifically configured to: extracting first text information in voice information from the audio data, and carrying out error correction processing on the first text information; and obtaining the voice recognition text based on the first text information after error correction processing.
Optionally, the third extracting module 63 is configured to extract subtitle information from the key frame of the video data, and when obtaining a subtitle text, specifically configured to: extracting second text information from the key frame of the video data, and carrying out error correction processing on the second text information; and obtaining the subtitle text based on the second text information after error correction processing.
Optionally, the fourth extraction module 64 is configured to, when extracting entities and keywords from the speech recognition text and the subtitle text, specifically: extracting entities in the voice recognition text and the subtitle text by utilizing a deep learning model; and extracting the keywords in the voice recognition text and the subtitle text by using a keyword extraction model.
Optionally, the first determining module 65 is configured to, when determining the semantic tags of the speech recognition text and the subtitle text, specifically: and determining semantic labels in the voice recognition text and the subtitle text by using a multi-classification model and a multi-label classification model.
The multimedia information topic analysis device in the embodiment shown in fig. 6 can be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device may be the electronic device described in the above embodiments. The electronic device provided in the embodiment of the present disclosure may execute the processing procedure provided in the embodiment of the multimedia information topic analysis method, as shown in fig. 7, the device 70 includes: memory 71, processor 72, computer programs and communication interface 73; wherein a computer program is stored in the memory 71 and is configured to be executed by the processor 72 for performing the method as described above.
In addition, the embodiment of the disclosure also provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the multimedia information topic analysis method described in the above embodiment.
Furthermore, the embodiment of the present disclosure also provides a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the method for analyzing the multimedia information topic is implemented.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A multimedia information topic analysis method is characterized by comprising the following steps:
extracting audio data and video data in the multimedia information;
extracting voice information from the audio data to obtain a voice recognition text;
extracting subtitle information from the key frame of the video data to obtain a subtitle text;
extracting entities and key words from the voice recognition text and the subtitle text, and determining semantic labels of the voice recognition text and the subtitle text;
extracting a face label in the key frame and determining a picture label of the key frame;
and determining the target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag and the picture tag.
2. The method of claim 1, wherein determining a target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag, and the picture tag comprises:
merging the entity, the keyword and the semantic label to obtain merged text topic information;
combining the face label and the picture label to obtain combined visual topic information; determining a plurality of topics appearing in the multimedia information according to the combined text topic information and the combined visual topic information;
determining a weight value of each topic according to the provenance of each topic in the plurality of topics;
determining a target topic of the multimedia information according to the weight value of each topic, wherein the target topic is a topic with a weight value meeting a preset condition.
3. The method of claim 2, wherein determining a weight value for each topic from the provenance of each topic in the plurality of topics comprises:
if the topic appears in text data, determining that the weight value of the topic is a first weight value, wherein the text data comprises the voice recognition text and/or the caption text;
if the topic appears in image data, determining that the weight value of the topic is a second weight value, wherein the image data comprises the key frame;
if the topic appears in the text data and the image data at the same time, determining that the weight value of the topic is a third weight value, wherein the third weight value is the sum of the first weight value and the second weight value.
4. The method of claim 2, wherein merging the face label and the picture label to obtain merged visual topic information comprises:
and respectively de-duplicating the face label and the picture label of the key frame, and then combining the face label and the picture label of the key frame to obtain combined visual topic information.
5. The method of claim 1, wherein extracting speech information from the audio data to obtain speech recognition text comprises:
extracting first text information in voice information from the audio data, and carrying out error correction processing on the first text information;
and obtaining the voice recognition text based on the first text information after error correction processing.
6. The method of claim 1, wherein extracting caption information from key frames of the video data to obtain caption text comprises:
extracting second text information from the key frame of the video data, and carrying out error correction processing on the second text information;
and obtaining the subtitle text based on the second text information after error correction processing.
7. The method of claim 1, wherein extracting entities and keywords from the speech recognition text and the subtitle text, and determining semantic tags of the speech recognition text and the subtitle text comprises:
extracting entities in the voice recognition text and the subtitle text by utilizing a deep learning model;
extracting keywords in the voice recognition text and the subtitle text by using a keyword extraction model;
and determining semantic labels in the voice recognition text and the subtitle text by using a multi-classification model and a multi-label classification model.
8. An apparatus for analyzing a topic of multimedia information, the apparatus comprising:
the first extraction module is used for extracting audio data and video data in the multimedia information;
the second extraction module is used for extracting voice information from the audio data to obtain a voice recognition text;
the third extraction module is used for extracting subtitle information from the key frames of the video data to obtain subtitle texts;
a fourth extraction module, configured to extract entities and keywords from the speech recognition text and the subtitle text;
the first determining module is used for determining semantic tags of the voice recognition text and the subtitle text;
the fifth extraction module is used for extracting the face labels in the key frames;
a second determining module, configured to determine a picture tag of the key frame;
and the third determining module is used for determining the target topic of the multimedia information according to the entity, the keyword, the semantic tag, the face tag and the picture tag.
9. An electronic device, comprising
A memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210471183.XA CN114880496A (en) | 2022-04-28 | 2022-04-28 | Multimedia information topic analysis method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114880496A true CN114880496A (en) | 2022-08-09 |
Family
ID=82674697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210471183.XA Pending CN114880496A (en) | 2022-04-28 | 2022-04-28 | Multimedia information topic analysis method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114880496A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115878849A (en) * | 2023-02-27 | 2023-03-31 | 北京奇树有鱼文化传媒有限公司 | Video tag association method and device and electronic equipment |
CN117573870A (en) * | 2023-11-20 | 2024-02-20 | 中国人民解放军国防科技大学 | Text label extraction method, device, equipment and medium for multi-mode data |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010262413A (en) * | 2009-04-30 | 2010-11-18 | Nippon Hoso Kyokai <Nhk> | Voice information extraction device |
CN110688526A (en) * | 2019-11-07 | 2020-01-14 | 山东舜网传媒股份有限公司 | Short video recommendation method and system based on key frame identification and audio textualization |
CN110837579A (en) * | 2019-11-05 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Video classification method, device, computer and readable storage medium |
CN111274436A (en) * | 2020-01-20 | 2020-06-12 | 深圳市酷开网络科技有限公司 | Label extraction method, server and readable storage medium |
WO2020258662A1 (en) * | 2019-06-25 | 2020-12-30 | 平安科技(深圳)有限公司 | Keyword determination method and apparatus, electronic device, and storage medium |
CN114282055A (en) * | 2021-08-12 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Video feature extraction method, device and equipment and computer storage medium |
CN114297439A (en) * | 2021-12-20 | 2022-04-08 | 天翼爱音乐文化科技有限公司 | Method, system, device and storage medium for determining short video label |
CN114385859A (en) * | 2021-12-29 | 2022-04-22 | 北京理工大学 | Multi-modal retrieval method for video content |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829893B (en) | Method and device for determining video label, storage medium and terminal equipment | |
CN110543574B (en) | Knowledge graph construction method, device, equipment and medium | |
CN109388795B (en) | Named entity recognition method, language recognition method and system | |
CN111814770B (en) | Content keyword extraction method of news video, terminal device and medium | |
CN106407180B (en) | Entity disambiguation method and device | |
CN112784696B (en) | Lip language identification method, device, equipment and storage medium based on image identification | |
CN110175246B (en) | Method for extracting concept words from video subtitles | |
CN108038099B (en) | Low-frequency keyword identification method based on word clustering | |
WO2009035863A2 (en) | Mining bilingual dictionaries from monolingual web pages | |
CN114880496A (en) | Multimedia information topic analysis method, device, equipment and storage medium | |
CN111078943A (en) | Video text abstract generation method and device | |
CN113766314B (en) | Video segmentation method, device, equipment, system and storage medium | |
US9262400B2 (en) | Non-transitory computer readable medium and information processing apparatus and method for classifying multilingual documents | |
CN114661872B (en) | Beginner-oriented API self-adaptive recommendation method and system | |
CN110196910B (en) | Corpus classification method and apparatus | |
CN112380848B (en) | Text generation method, device, equipment and storage medium | |
CN114817570A (en) | News field multi-scene text error correction method based on knowledge graph | |
CN112069312A (en) | Text classification method based on entity recognition and electronic device | |
CN115580758A (en) | Video content generation method and device, electronic equipment and storage medium | |
CN114547373A (en) | Method for intelligently identifying and searching programs based on audio | |
TW201039149A (en) | Robust algorithms for video text information extraction and question-answer retrieval | |
CN112231440A (en) | Voice search method based on artificial intelligence | |
AlMousa et al. | Nlp-enriched automatic video segmentation | |
CN112528653A (en) | Short text entity identification method and system | |
CN116910251A (en) | Text classification method, device, equipment and medium based on BERT model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |