CN114005079A - Multimedia stream processing method and device - Google Patents

Multimedia stream processing method and device

Info

Publication number
CN114005079A
CN114005079A (application CN202111666523.6A)
Authority
CN
China
Prior art keywords
information
segment
text information
sub
video stream
Prior art date
Legal status
Granted
Application number
CN202111666523.6A
Other languages
Chinese (zh)
Other versions
CN114005079B (en)
Inventor
赵悦汐
程红兵
鞠剑伟
昝晨辉
Current Assignee
Beijing Jinmao Education Technology Co ltd
Original Assignee
Beijing Jinmao Education Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jinmao Education Technology Co ltd filed Critical Beijing Jinmao Education Technology Co ltd
Priority to CN202111666523.6A
Publication of CN114005079A
Application granted
Publication of CN114005079B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a multimedia stream processing method and device. The method comprises the following steps: acquiring a multimedia stream segment; decoding it to obtain a video stream sub-segment and an audio stream sub-segment; analyzing the video stream sub-segment to generate scene information and first text information; analyzing the audio stream sub-segment to generate second text information; and processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream. By disassembling the multimedia stream file, several independent AI modules can be combined effectively to identify the content of the multimedia file in a complex scene, which effectively improves the recognition efficiency of existing independent AI technology in complex scenes.

Description

Multimedia stream processing method and device
Technical Field
The present application relates to the field of multimedia information identification technologies, and in particular, to a method and an apparatus for processing a multimedia stream.
Background
With the continuous development and popularization of AI technology, many mature AI modules — for example, Alibaba's multimedia AI — are available on the market to process the information streams in media, such as the video stream, the audio stream, or a combined video-and-audio stream of a multimedia file. When a multimedia stream is processed, the corresponding content of the acquired multimedia file can be identified by an AI module with the matching function.
In the course of implementing the prior art, the inventors found that:
a conventional AI module has a single recognition mode. Faced with a complex scene to be identified, a single AI module cannot complete the analysis, which lowers the recognition efficiency for the multimedia file.
Therefore, it is desirable to provide a multimedia stream processing method and apparatus that solve the technical problem of the low recognition efficiency of existing independent AI technology in complex scenes.
Disclosure of Invention
The embodiment of the application provides a multimedia stream processing method and device, which are used to solve the technical problem of the low recognition efficiency of existing independent AI technology in complex scenes.
Specifically, a multimedia stream processing method includes the following steps:
acquiring a multimedia stream segment;
decoding to obtain a video stream sub-segment and an audio stream sub-segment;
analyzing the video stream sub-segment to generate scene information and first text information;
analyzing the audio stream sub-segment to generate second text information;
and processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
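The five steps above can be sketched as a minimal pipeline. This is only an illustration of the data flow, not the patent's implementation: every function name below is a placeholder, and the stub analyzers stand in for the independent AI modules the method combines.

```python
from dataclasses import dataclass

@dataclass
class AnalysisSummary:
    scene_info: list   # object identities and action descriptions
    first_text: list   # text recognized from video frames (e.g. by OCR)
    second_text: list  # text transcribed from audio (e.g. by ASR)

def process_multimedia_segment(segment: bytes) -> AnalysisSummary:
    """Mirror of steps S100-S400: decode, analyze each sub-segment, merge."""
    video_sub, audio_sub = decode(segment)             # S200
    scene_info, first_text = analyze_video(video_sub)  # S310
    second_text = analyze_audio(audio_sub)             # S320
    return AnalysisSummary(scene_info, first_text, second_text)  # S400

# Trivial stand-ins for the independent AI modules (illustrative only).
def decode(segment):
    # split into (video sub-segment, audio sub-segment)
    return segment[:len(segment) // 2], segment[len(segment) // 2:]

def analyze_video(video_sub):
    return (["teacher standing"], ["PPT: Chapter 1"])

def analyze_audio(audio_sub):
    return ["Today we discuss chapter one."]
```

In a real system each stub would delegate to a separate single-function AI module, which is exactly the combination of independent modules the disclosure describes.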
Further, analyzing the video stream sub-segment to generate scene information specifically includes:
analyzing the video stream sub-segment to generate identity feature identification information of the objects and description information of the objects' action behaviors.
Further, analyzing the video stream sub-segment to generate first text information specifically includes:
analyzing the video stream sub-segment to generate first text information pointed to by the object action behavior.
Further, analyzing the video stream sub-segment to generate the first text information pointed to by the object action behavior specifically includes:
analyzing the video stream sub-segment to obtain images that persist for a preset duration;
and recognizing the images by OCR to generate the first text information.
Further, the first text information comprises at least one of teaching link information and knowledge point information.
Further, the second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information.
Further, processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream specifically includes:
and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The embodiment of the application also provides a multimedia stream processing device.
Specifically, a multimedia stream processing apparatus includes:
the acquisition module is used for acquiring the multimedia stream segment;
the decoding module is used for decoding and acquiring the video stream sub-segment and the audio stream sub-segment;
the video analysis module is used for analyzing the video stream sub-segments to generate scene information and first text information;
the audio analysis module is used for analyzing the audio stream sub-segment to generate second text information;
and the analysis abstract generating module is used for processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
Further, the video analysis module is configured to analyze the video stream sub-segment to generate scene information, and is specifically configured to:
analyze the video stream sub-segment to generate identity feature identification information of the objects and description information of the objects' action behaviors.
Further, the video analysis module is configured to analyze the video stream sub-segment to generate first text information, and is specifically configured to:
analyze the video stream sub-segment to generate the first text information pointed to by the object action behavior.
The technical solution provided by the embodiments of the application has at least the following beneficial effects:
by disassembling the multimedia stream file, several independent AI modules can be combined effectively to identify the content of the multimedia file in a complex scene, which effectively improves the recognition efficiency of existing independent AI technology in complex scenes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a multimedia stream processing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of a multimedia stream processing apparatus according to an embodiment of the present disclosure.
100 multimedia stream processing apparatus
11 acquisition module
12 decoding module
13 video analysis module
14 audio frequency analysis module
15 analysis summary generation module
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is understood that a multimedia stream file records both video stream information and audio stream information. The video stream information mainly corresponds to a number of consecutive frame images in the multimedia file; the audio stream information corresponds to the collection of speech information in the multimedia file. Accordingly, the video stream information records scene information and text information related to the environment, while the audio stream information records the voice information corresponding to that environment in the video stream. Scene information here can be understood as the object information, related to the presented objects, recorded in each frame image; text information here can be understood as the character-related symbol information recorded in each frame image.
With a single AI module, only the scene information or text information in the video stream, or the information in the audio stream, can be recognized — that is, object behaviors or existing text in a video file, or the voice content of an audio file. However, multimedia files of complex scenes usually contain video information and audio information at the same time. If recognition is still performed by a single AI module, the video stream information and audio stream information recorded in the multimedia file cannot be recognized comprehensively, so the identified target content deviates to some extent from the real content recorded in the multimedia file. Although several different single-purpose AI modules could be used simultaneously to recognize the recorded content of a complex scene, the computational load on each single module is large. This reduces the recognition speed for the multimedia file and is not conducive to structurally combining the related recognition results.
The embodiment of the application provides a multimedia stream processing method, which is mainly used to process multimedia files of complex scenes. In one embodiment provided by the present application, the multimedia stream processing method may be used to process a multimedia file that records the complex scene of a classroom teaching process. Specifically, referring to fig. 1, the multimedia stream processing method includes the following steps:
s100: a multimedia stream fragment is obtained.
The multimedia stream segment may be a file that records media information of a corresponding scene, such as text, graphics, video, animation, and audio. In a specific embodiment provided by the present application, the obtained multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene. The multimedia stream segment can be shot by a corresponding video capture device, so that the real-time scene of a classroom can be recorded and a multimedia file containing information such as sound, characters, pictures, and person objects in the classroom teaching process is obtained.
S200: and decoding to obtain the video stream sub-segment and the audio stream sub-segment.
A video stream sub-segment is here understood to be the image information in the multimedia segment; an audio stream sub-segment is here understood to be the sound information in the multimedia segment. Decoding the acquired multimedia stream segment means extracting the image information and the sound information from the multimedia file of a certain duration and converting them into a number of consecutive frame images and continuous audio in a preset file format, thereby obtaining the video stream sub-segment and the audio stream sub-segment corresponding to the multimedia stream segment.
When the obtained multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene, decoding correspondingly yields a video stream sub-segment that records information such as characters, pictures, and person objects in the classroom teaching process, and an audio stream sub-segment that records the sound information of the classroom teaching process.
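The patent does not name a particular decoder. Purely as an illustration, splitting one multimedia file into a video sub-segment (frame images) and an audio sub-segment (a sound track) could be delegated to a tool such as FFmpeg; the sketch below only builds the two command lines without executing them, and all file paths are hypothetical.

```python
def build_demux_commands(src: str, fps: int = 1):
    """Build FFmpeg commands that would extract consecutive frame images
    (the video stream sub-segment) and a WAV track (the audio stream
    sub-segment) from one multimedia segment."""
    frames_cmd = ["ffmpeg", "-i", src,
                  "-vf", f"fps={fps}",          # sample frames at a fixed rate
                  "frames/frame_%05d.png"]      # consecutive frame images
    audio_cmd = ["ffmpeg", "-i", src,
                 "-vn",                         # drop the video track
                 "-acodec", "pcm_s16le", "-ar", "16000",
                 "audio.wav"]                   # continuous audio
    return frames_cmd, audio_cmd

frames_cmd, audio_cmd = build_demux_commands("lesson_segment.mp4")
```

In practice the commands would be run with `subprocess.run`, and the chosen frame rate and audio sample rate would depend on the downstream AI modules.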
S310: and analyzing the video stream sub-segment to generate scene information and first text information.
It is understood that several consecutive frames of images in a multimedia stream segment may constitute a video stream sub-segment, and each frame image records corresponding scene information. In a classroom teaching scene, the video stream sub-segment consists of a number of consecutive frame images that record information such as characters, pictures, and person objects in the classroom teaching process. Through analysis by AI modules with the corresponding functions, the specific scene information corresponding to the person objects in the current video stream sub-segment and the specific text information corresponding to the characters and pictures in it can be obtained.
Specifically, by identifying the person objects in the current video stream sub-segment, the specific action category of the current person object can be determined, which makes it convenient to determine the specific classroom teaching scene corresponding to the current video stream sub-segment. By identifying the specific text information corresponding to the characters and pictures in the current video stream sub-segment, the specific text category or description content of those characters and pictures can be determined, and thus the first text information corresponding to the current video stream sub-segment can be generated. The first text information is here understood to be text information generated from the video stream sub-segment.
S320: and analyzing the audio stream sub-segment to generate second text information.
It will be appreciated that the audio stream sub-segment may be generated from the speech information in the multimedia stream segment. In a classroom teaching scene, the audio stream sub-segment is a file that records the relevant sound information of the classroom teaching process. Through analysis by an AI module with the corresponding function, the specific narration content of the audio stream sub-segment can be determined, and second text information corresponding to that narration content is obtained. The second text information is here understood to be text information generated from the audio stream sub-segment.
S400: and processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The analysis summary here can be understood as an overview of the specific real-time scene corresponding to the currently processed multimedia stream segment. Processing the scene information, the first text information, and the second text information mainly identifies the key data that is highly relevant to the real-time scene of the multimedia stream segment; the identified target data is then integrated to obtain the specific teaching process corresponding to the currently processed segment. Splitting the multimedia stream segment into a video stream sub-segment and an audio stream sub-segment and analyzing each with an AI module of the corresponding function effectively reduces the data processing load on any single-function AI module and allows the analysis module with the right analysis function to be selected precisely, thereby improving the recognition efficiency for the multimedia stream segment.
Further, in a preferred embodiment provided in the present application, analyzing the video stream sub-segment to generate scene information specifically includes: analyzing the video stream sub-segment to generate identity feature identification information of the objects and description information of the objects' action behaviors.
The identity feature identification information of an object can be understood as the facial feature information of the object. It can be understood that the specific identity information of a person object can be determined by acquiring an image containing the person's face and recognizing the facial features with a pre-trained recognition algorithm — for example, determining a student's name and student number, or a teacher's name and staff number.
The description information of an object's action behavior can be understood as the specific action category of the person object in the current video stream sub-segment. It can be understood that the specific action category of a person object can be determined by acquiring an image of the person's body movement and recognizing it with a pre-trained recognition algorithm — for example, determining that the current action behavior of the person object is a writing, standing, or hand-raising behavior.
By identifying the specific identity information and behavior information of the relevant objects in the video stream sub-segment, the behaviors of teachers and students in the teaching scene corresponding to the video stream sub-segment can be determined, which improves the accuracy of the multimedia stream analysis summary.
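The patent does not specify an identity-matching algorithm. As an illustration only, matching a face feature vector produced by a pre-trained recognition algorithm against a roster of registered identities can be sketched as a nearest-neighbor lookup; the roster entries, vector length, and distance threshold below are all assumptions.

```python
def identify_object(feature, roster, max_distance=0.6):
    """feature: face feature vector for one detected person.
    roster: mapping of registered identity -> feature vector.
    Returns the closest registered identity, or None when no
    registered vector is near enough."""
    def dist(a, b):  # Euclidean distance between two vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(roster, key=lambda name: dist(feature, roster[name]))
    return best if dist(feature, roster[best]) <= max_distance else None

# Hypothetical roster with toy 3-dimensional "embeddings".
roster = {"student_42": [0.1, 0.9, 0.3], "teacher_7": [0.8, 0.2, 0.5]}
```

A production system would use embeddings from an actual face-recognition model and tune the threshold on validation data.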
Further, in a preferred embodiment provided in the present application, analyzing the video stream sub-segment to generate the first text information specifically includes: and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
The first text information pointed to by the object action behavior can be understood as text information that has a certain degree of association with the specific behavior of the person object. It can be understood that the first text information generated from the video stream sub-segment includes the specific text category or description content corresponding to the character and picture information in the classroom teaching scene — for example, the PPT presentation page in the classroom, or the blackboard-poster information belonging to the classroom background. In a classroom teaching scene, the text in the classroom's PPT presentation page is text information related to the action behavior of the person object, whereas the blackboard-poster information is background information in the video stream sub-segment: it is unrelated to the person's action behavior in the current scene and therefore does not belong to the first text information.
By analyzing, in a targeted manner, only the first text information related to the actions of the person objects in the video stream sub-segment, the data processing load of the corresponding functional module can be reduced while its recognition accuracy is improved, so the analysis efficiency of the first text information is effectively raised.
Further, in a preferred embodiment provided by the present application, analyzing a video stream sub-segment to generate first text information to which an object action behavior points specifically includes: analyzing the video stream sub-segments to obtain images lasting for a preset time; and recognizing the image by using OCR to generate first text information.
It can be understood that the object action behavior points to the text information that has a certain degree of association with the object's behavior in the scene corresponding to the current video stream sub-segment. In a classroom teaching scene, such text can be understood as text related to the teaching content — for example, a teacher's board writing, or the text in a PPT presentation page. It should be noted that, in an actual teaching process, if the text information pointed to by the object action behavior is important, the person object will keep performing the relevant behavior toward that text for a certain duration; that is, an image carrying important text information stays on screen longer. Conversely, when the text information pointed to by the object action behavior has low importance or is worthless, the corresponding duration is short; that is, an image carrying unimportant text stays only briefly. Therefore, the importance of the text in an image can be judged from the image's duration, and the threshold duration can be preset according to actual conditions or empirical values. If the duration of a frame image meets the preset condition, the subsequent text recognition process is carried out; correspondingly, if it does not, the corresponding text is of low importance and need not be recognized. In this way, the recognition efficiency of the first text can be increased while its recognition accuracy is ensured.
Specifically, identifying the text information in images that persist for the preset duration can be implemented by OCR; that is, optical character recognition is used to identify the character information in those images of the video stream sub-segment that satisfy the preset condition. For example, the relevant content of a PPT page displayed in the classroom that meets the preset condition is recognized as first text information by OCR, or the specifically marked character content on the classroom blackboard is recognized as first text information by OCR.
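The dwell-time rule described above can be sketched as a simple pre-filter in front of OCR: only frames whose visible content persists for at least the preset duration are selected for recognition. The frame representation (timestamp plus a content hash) and the threshold value are illustrative assumptions, not part of the patent.

```python
def select_frames_for_ocr(frames, min_duration=5.0):
    """frames: list of (timestamp_seconds, content_hash), in time order.
    A run of identical content lasting >= min_duration is deemed
    important (e.g. a PPT page the class dwells on), and the first
    frame of that run is selected for OCR; short-lived content is
    skipped as low-importance."""
    selected = []
    run_start, run_hash = None, None
    for ts, h in frames:
        if h != run_hash:                      # content changed: new run
            run_start, run_hash = ts, h
        elif ts - run_start >= min_duration and (run_start, h) not in selected:
            selected.append((run_start, h))    # run long enough: keep once
    return selected
```

The selected frames would then be passed to an OCR engine (e.g. Tesseract); the filter itself is what keeps the OCR workload proportional to the important content only.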
Further, in a preferred embodiment provided by the present application, the first text information includes at least one of teaching link information and knowledge point information.
The teaching link information here can be understood as the teaching stage, recorded in an image of the video stream sub-segment that satisfies the preset duration, to which the current image corresponds — for example, the specific title to which the current PPT page corresponds. The knowledge point information here can be understood as the specific annotation content of the current image recorded in such an image.
It can be understood that the relevant written materials of a classroom teaching process inevitably contain some information unrelated to the teaching content, referred to here as worthless information. Identifying such worthless information increases the computation and raises the proportion of meaningless content in the first text. Therefore, by identifying the teaching link information or knowledge point information in the video stream sub-segment in a targeted manner, more accurate first text identification information can be obtained, the redundancy of the first text information is reduced, and the determination of the multimedia analysis summary is facilitated.
Further, in a preferred embodiment provided by the present application, the second text information specifically includes at least one of text correction information, keyword information, question information, and emotion description information.
It is to be understood that the second text information corresponds to text information generated from the audio stream sub-segment. In a classroom teaching scene, the audio stream sub-segment is a file that records the relevant sound information of the classroom teaching process. In practical applications, recognizing the audio stream sub-segment as the second text can be used to produce subtitle information synchronized with the video stream sub-segment. Through a natural language processing model, the character information in the subtitles can be error-corrected, and keyword information, question information, and emotion description information can be extracted; this information can be understood as the forming elements of the multimedia stream analysis summary. Therefore, the analyzed second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information, which helps ensure the accuracy of the multimedia stream analysis summary.
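As a rough illustration only — a real system would use a trained natural language processing model, as the paragraph above notes — extracting question information and keyword information from transcribed subtitle lines can be sketched with simple rules. The stopword list and the sample sentences are assumptions.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list for the toy example.
STOPWORDS = {"the", "a", "an", "is", "are", "we", "what", "of", "to", "in"}

def extract_second_text_info(subtitles):
    """subtitles: list of transcribed lines (the raw second text).
    Returns question lines (lines ending in '?') and the most
    frequent content words as rough keyword information."""
    questions = [s for s in subtitles if s.rstrip().endswith("?")]
    words = re.findall(r"[a-z']+", " ".join(subtitles).lower())
    keywords = [w for w, _ in Counter(
        w for w in words if w not in STOPWORDS).most_common(3)]
    return {"questions": questions, "keywords": keywords}
```

Emotion description information and text error correction, also named in this embodiment, would require actual NLP models and are omitted from the sketch.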
Further, in a preferred embodiment provided in the present application, the processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream specifically includes: and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The cross-validation described here can be understood as grouping the various types of data obtained by the analysis — the scene information, the first text information, and the second text information — into different course structures. It is understood that the analysis of the scene information and the first text information is performed on the basis of the video stream sub-segment, while the identification of the second text information is performed on the basis of the audio stream sub-segment, and both sub-segments are extracted from the same multimedia stream segment. If the analysis summary of the multimedia stream were formed from only one of the scene information, the first text information, and the second text information, a summary of high accuracy could not be obtained. Therefore, the scene information, the first text information, and the second text information must be considered together and classified and summarized according to the course structure, so that a multimedia stream analysis summary of higher accuracy is obtained. This process can also be understood as disassembling the video content data of the acquired multimedia video and reclassifying and associating the disassembled data with the corresponding classroom links.
In a specific implementation provided by the application, the course content of a classroom teaching scene can be decomposed into three major categories: teaching content, teacher and student behavior, and teacher and student language. The teaching content is mainly embodied in the teaching PPT and the teacher's voice content; teacher and student behavior is mainly embodied in the changes of body movements; teacher and student language is mainly embodied in verbal communication. Accordingly, for the course content, OCR is used to recognize the specifically marked text content on the PPT or blackboard; at this point, the first division of the overall course structure — the distinction of classroom links — is completed. By combining face recognition with action-behavior recognition, a series of actions such as the teacher lecturing and students writing, raising hands, standing up, and reading can be divided into behaviors. Finally, the classroom speech is translated into real-time subtitles; through natural language processing, the text information in the subtitles can be error-corrected, and context information such as keyword information, question information, and emotion description information can be extracted. In this way, the relevant scene information, first text information, and second text information are obtained. Grouping these various types of data into different course structures completes the disassembly of the classroom teaching video content, and the disassembled data are reclassified and associated with the corresponding classroom links.
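The regrouping step described in this implementation can be sketched as routing each recognized item into one of the three course-structure categories named above according to the module that produced it. The three categories come from this paragraph; the item layout and source tags are assumptions for illustration.

```python
def build_lesson_summary(items):
    """items: list of dicts like {"source": "ocr" | "action" | "speech",
    "text": ...}. Group them into the three course categories named
    in the description: teaching content, teacher/student behavior,
    and teacher/student language."""
    category_of = {"ocr": "teaching_content",   # PPT/blackboard text
                   "action": "behavior",        # recognized body movements
                   "speech": "language"}        # subtitle-derived text
    summary = {"teaching_content": [], "behavior": [], "language": []}
    for item in items:
        summary[category_of[item["source"]]].append(item["text"])
    return summary
```

A fuller version would also carry timestamps, so each grouped item could be associated back to its classroom link, as the description requires.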
The embodiment of the present application further provides a multimedia stream processing apparatus 100, which is mainly used to process multimedia files of complex scenes. In one embodiment provided by the present application, the multimedia stream processing apparatus 100 may be used to process a multimedia file that records the complex scene of a classroom teaching process. Specifically, referring to fig. 2, the multimedia stream processing apparatus includes:
an obtaining module 11, configured to obtain a multimedia stream segment;
a decoding module 12, configured to decode and obtain a video stream sub-segment and an audio stream sub-segment;
the video analysis module 13 is configured to analyze the video stream sub-segment to generate scene information and first text information;
an audio analysis module 14, configured to analyze the audio stream sub-segment to generate second text information;
and the analysis summary generation module 15 is configured to process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream.
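The module structure of the apparatus can be sketched as five cooperating components. The class and method names below are illustrative choices that mirror modules 11-15 of fig. 2, not an implementation disclosed by the patent; the wired-in lambdas are trivial stand-ins for the real AI modules.

```python
class MultimediaStreamProcessor:
    """Mirrors apparatus 100: modules 11-15 as injectable callables."""
    def __init__(self, acquire, decode, analyze_video, analyze_audio, summarize):
        self.acquire = acquire              # obtaining module 11
        self.decode = decode                # decoding module 12
        self.analyze_video = analyze_video  # video analysis module 13
        self.analyze_audio = analyze_audio  # audio analysis module 14
        self.summarize = summarize          # analysis summary module 15

    def run(self, source):
        segment = self.acquire(source)
        video_sub, audio_sub = self.decode(segment)
        scene, first_text = self.analyze_video(video_sub)
        second_text = self.analyze_audio(audio_sub)
        return self.summarize(scene, first_text, second_text)

# Wiring with trivial stand-ins for the five modules:
processor = MultimediaStreamProcessor(
    acquire=lambda src: src,
    decode=lambda seg: (seg + ":video", seg + ":audio"),
    analyze_video=lambda v: (["scene of " + v], ["text of " + v]),
    analyze_audio=lambda a: ["speech of " + a],
    summarize=lambda s, t1, t2: {"scene": s, "first": t1, "second": t2},
)
```

Making the modules injectable reflects the apparatus description: each single-function analyzer can be swapped independently without touching the overall flow.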
The obtaining module 11 is configured to obtain a multimedia stream segment. The multimedia stream segment may be a file that records media information of a corresponding scene, such as text, graphics, video, animation, and audio. In a specific embodiment provided by the present application, the obtained multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene. The multimedia stream segment can be shot by a corresponding video capture device, so that the real-time scene of a classroom can be recorded and a multimedia file containing information such as sound, characters, pictures, and person objects in the classroom teaching process is obtained.
And a decoding module 12, configured to decode and obtain the video stream sub-segment and the audio stream sub-segment. A video stream sub-segment is here understood to be the image information in a multimedia segment. An audio stream sub-segment is here understood to be the sound information in a multimedia segment. And decoding the acquired media stream segment, namely extracting image information and sound information in the multimedia file with a certain time length, and converting the image information and the sound information into a plurality of continuous frames of images and continuous audios in a preset file format, so as to obtain a video stream sub-segment and an audio stream sub-segment corresponding to the multimedia stream segment.
When the obtained multimedia stream segment is a multimedia file with a certain duration and recorded with a classroom teaching scene, a video stream sub-segment recorded with information such as characters, pictures, personnel objects and the like in the classroom teaching process and an audio stream sub-segment recorded with sound information in the classroom teaching process are correspondingly obtained through decoding.
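The frame and sample bookkeeping implied by decoding a fixed-duration segment can be sketched as follows; the frame rate and sample rate here are assumed parameters, and a real implementation would rely on a demuxer/decoder such as FFmpeg rather than this arithmetic:

```python
def split_segment(duration_s, fps=25, sample_rate=16000):
    """Return frame timestamps and the audio sample count for a segment.

    duration_s: segment length in seconds (the 'certain duration' above).
    fps / sample_rate: assumed decoding parameters, not from the source.
    """
    # one timestamp per decoded frame of the video stream sub-segment
    frame_times = [round(i / fps, 3) for i in range(int(duration_s * fps))]
    # total PCM samples in the audio stream sub-segment
    audio_samples = int(duration_s * sample_rate)
    return frame_times, audio_samples
```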
And the video analysis module 13 is configured to analyze the video stream sub-segment to generate scene information and first text information. It is understood that several consecutive frames of images in a multimedia stream segment may constitute a video stream sub-segment. And each frame image is recorded with corresponding scene information. In a classroom teaching scene, the video stream sub-segment is a plurality of continuous frames of images recorded with information such as characters, pictures, personnel objects and the like in the classroom teaching process. And specific scene information corresponding to the person object in the current video stream sub-segment and specific text information corresponding to the characters and pictures in the current video stream sub-segment can be obtained through analysis of the AI module with the corresponding function.
Specifically, the specific action category of the current person object can be determined by identifying the person object in the current video stream sub-segment, so that the specific classroom teaching scene corresponding to the current video stream sub-segment is conveniently determined. Through the identification of the specific text information corresponding to the characters and pictures in the current video stream sub-segment, the specific text type or description content corresponding to the characters and pictures corresponding to the current video stream sub-segment can be determined, and therefore the first text information corresponding to the current video stream sub-segment can be generated. The first text information is here understood to be text information generated from a sub-segment of the video stream.
And the audio analysis module 14 is configured to analyze the audio stream sub-segment to generate second text information. It will be appreciated that the audio stream sub-segment may be generated from speech information in the multimedia stream segment. In a classroom teaching scene, the audio stream sub-segment is a file recorded with related sound information in the classroom teaching process. And the AI module with the corresponding function analyzes to determine the specific narration content of the audio stream sub-segment and obtain second text information corresponding to the narration content of the audio stream sub-segment. The second text information is here understood to be text information generated from sub-segments of the audio stream.
And the analysis summary generation module 15 is configured to process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream. The analysis summary here can be understood as an overview of the specific real-time scene corresponding to the currently processed multimedia stream segment. Processing the scene information, the first text information, and the second text information mainly involves identifying the key data with high relevance to the real-time scene corresponding to the multimedia stream segment. The identified target data are then integrated to obtain the specific teaching process corresponding to the currently processed multimedia stream segment. By dividing the multimedia stream segment into the video stream sub-segment and the audio stream sub-segment and analyzing each with an AI module having the corresponding function, the data processing amount of any single-function AI module can be effectively reduced, and the analysis module with the corresponding analysis function can be accurately selected, so that the recognition efficiency for the multimedia stream segment is improved.
Further, in a preferred embodiment provided in the present application, the video analysis module 13 is configured to analyze the video stream sub-segment to generate scene information, and specifically configured to: and analyzing the video stream sub-segments to generate object-oriented identity characteristic identification information and description information of object action behaviors.
The identity feature identification information of an object can be understood as the facial feature information of the object. It can be understood that the specific identity information of a person object can be determined by acquiring an image containing the face of the person object and recognizing the facial features of the person object through a pre-trained recognition algorithm, for example, determining information such as the name and student ID number of a student, or the name and staff ID number of a teacher.
The description information of the action behavior of an object can be understood as the specific action category of the person object in the current video stream sub-segment. It can be understood that the specific action category of a person object can be determined by acquiring an image containing the body movements of the person object and recognizing the image through a pre-trained recognition algorithm, for example, determining that the current action behavior of the person object is a hand-raising behavior, a standing behavior, or a writing behavior.
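As an illustration of how a recognition algorithm might map body geometry to the action categories named above, the following rule-based stand-in uses assumed keypoint heuristics; a real system would use a trained model, and these thresholds are invented for the sketch:

```python
def classify_action(keypoints):
    """Map simple body-keypoint geometry to an assumed action category.

    keypoints: dict with 'head_y', 'wrist_y', 'hip_y', 'shoulder_y'
    in image coordinates (smaller y = higher in the frame).
    """
    # wrist above the head is taken as a hand raised (assumption)
    if keypoints["wrist_y"] < keypoints["head_y"]:
        return "raising hand"
    # a tall visible torso is taken as standing (assumed threshold)
    if keypoints["hip_y"] - keypoints["shoulder_y"] > 60:
        return "standing"
    # otherwise default to a seated writing posture
    return "writing"
```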
By identifying the specific identity information and behavior information of the related objects in the video stream sub-segments, teachers and students' behaviors in the teaching scene corresponding to the video stream sub-segments can be determined, and therefore the accuracy of the multimedia stream analysis abstract is improved.
Further, in a preferred embodiment provided in the present application, the video analysis module 13 is configured to analyze the video stream sub-segment to generate first text information, specifically: and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
The first text information to which the object action behavior points can be understood as text information having a certain degree of association with the specific behavior of the person object. It can be understood that the first text information generated according to the video stream sub-segment includes the specific text category or description content corresponding to the text and picture information in the classroom teaching scene, for example, a PPT display page in the classroom or blackboard bulletin information in the classroom background. In a classroom teaching scene, the text information on the PPT display page in the classroom is text information related to the action behaviors of the person objects, whereas the blackboard bulletin information is background information in the video stream sub-segment and is not related to the action behaviors of the persons in the current scene, and thus does not belong to the first text information.
By analyzing, in a targeted manner, the first text information related to the actions of the person objects in the video stream sub-segment, the data processing amount of the corresponding functional module can be reduced while its recognition accuracy is improved, so that the analysis efficiency of the first text information is effectively improved.
Further, in a preferred embodiment provided in the present application, the video analysis module 13 is configured to analyze a video stream sub-segment, and generate first text information to which an object action behavior points, and specifically configured to: analyzing the video stream sub-segments to obtain images lasting for a preset time; and recognizing the image by using OCR to generate first text information.
It can be understood that the text information to which the object action behavior points is text information having a certain degree of association with the object behavior in the scene corresponding to the current video stream sub-segment. In a classroom teaching scene, such text information can be understood as text information related to the teaching content, for example, board writing by the teacher, text information in a PPT display page, and the like. It should be noted that, in the actual teaching process, if the text information to which the object action behavior points is important, the person object will continue to perform the related behavior with respect to the corresponding text information for a certain duration. That is, an image carrying important text information stays on screen longer. Conversely, when the text information to which the object action behavior points is of low importance or is worthless text information, the corresponding duration is short; that is, an image carrying unimportant text information stays for a short time. Therefore, the importance degree of the text in an image can be judged from the duration of the image. The duration threshold can be preset according to actual conditions or empirical values. If the duration of a certain frame of image meets the preset condition, the subsequent text recognition process can be carried out. Correspondingly, if the duration of a certain frame of image does not meet the preset condition, the importance of the corresponding text is low and no recognition needs to be performed. In this way, the recognition efficiency of the first text can be improved while its recognition accuracy is ensured.
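The duration rule described above can be sketched directly: group consecutive frames whose content is unchanged, and keep for OCR only the runs that persist for at least the preset duration. The content identifiers (e.g. perceptual hashes) and the threshold value are assumptions for illustration:

```python
def select_stable_frames(frames, timestamps, min_duration):
    """Return (content, start, end) runs lasting at least min_duration seconds.

    frames: per-frame content identifiers (e.g. perceptual hashes).
    timestamps: matching frame times in seconds, sorted ascending.
    """
    selected, run_start = [], 0
    for i in range(1, len(frames) + 1):
        # a run ends at the last frame or when the content changes
        if i == len(frames) or frames[i] != frames[run_start]:
            start, end = timestamps[run_start], timestamps[i - 1]
            if end - start >= min_duration:   # preset duration condition
                selected.append((frames[run_start], start, end))
            run_start = i
    return selected
```

Only the frames returned here would then be passed to the OCR step, which is what keeps the recognition workload low.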
Specifically, the text information recorded in an image that lasts for the preset duration can be identified by means of OCR recognition. That is, an optical character recognition technique is used to identify the character information in an image of the video stream sub-segment that satisfies the preset condition. For example, the related content of a PPT page that meets the preset condition in the classroom PPT display is recognized as the first text information by means of OCR recognition, or the specifically marked character content on the classroom blackboard is recognized as the first text information by means of OCR recognition.
Further, in a preferred embodiment provided by the present application, the first text information includes at least one of teaching link information and knowledge point information.
The teaching link information here can be understood as the teaching link corresponding to the current image, recorded in an image that meets the preset duration in the video stream sub-segment, for example, the specific title to which the currently displayed PPT page corresponds. The knowledge point information here can be understood as the specific annotation content of the current image recorded in an image that meets the preset duration in the video stream sub-segment.
It can be understood that the related written materials in the classroom teaching process inevitably contain some information irrelevant to the teaching content, which is referred to herein as worthless information. Identifying the worthless information increases the calculation amount and the proportion of meaningless content in the first text. Therefore, by identifying the teaching link information or the knowledge point information in the video stream sub-segment in a targeted manner, more accurate first text recognition information can be obtained, the redundancy of the first text information is reduced, and the determination of the multimedia analysis summary is facilitated.
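One way to sketch this targeted identification is a marker-based filter that keeps OCR lines resembling teaching-link headings or knowledge-point annotations and drops the rest as worthless background text. The marker words are purely illustrative assumptions, not part of the described method:

```python
def filter_first_text(ocr_lines,
                      link_markers=("Chapter", "Section", "Step"),
                      point_markers=("Definition", "Theorem", "Note:")):
    """Split OCR output into teaching-link and knowledge-point information."""
    links, points = [], []
    for line in ocr_lines:
        if any(line.startswith(m) for m in link_markers):
            links.append(line)          # heading-like line -> teaching link
        elif any(m in line for m in point_markers):
            points.append(line)         # annotation-like line -> knowledge point
        # all other lines are treated as worthless background information
    return {"teaching_links": links, "knowledge_points": points}
```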
Further, in a preferred embodiment provided by the present application, the second text information specifically includes at least one of text correction information, keyword information, question information, and emotion description information.
It is to be understood that the second text information corresponds to text information generated from the audio stream sub-segment. In a classroom teaching scene, the audio stream sub-segment is a file recording the related sound information in the classroom teaching process. In practical applications, the audio stream sub-segment is recognized as the second text, which can be used to generate subtitle information synchronized with the video stream sub-segment. Through a natural language processing model, the character information in the subtitles can be corrected, and keyword information, question information, and emotion description information can be extracted. These items of information can be understood as the constituent elements of the multimedia stream analysis summary. Therefore, the analyzed second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information, so that the accuracy of the multimedia stream analysis summary can be ensured.
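A minimal rule-based sketch of pulling the listed element types out of subtitle text follows. A real system would use a natural language processing model; the cue words and the crude keyword heuristic here are assumptions:

```python
def analyze_subtitles(sentences):
    """Extract question, emotion and keyword elements from subtitle text."""
    emotion_cues = ("great", "well done", "excellent")   # assumed cue words
    result = {"questions": [], "emotion": [], "keywords": []}
    for s in sentences:
        if s.rstrip().endswith("?"):                     # question information
            result["questions"].append(s)
        if any(c in s.lower() for c in emotion_cues):    # emotion description
            result["emotion"].append(s)
        # crude keyword pick: capitalized mid-sentence words (assumption)
        result["keywords"] += [w for w in s.split()[1:] if w.istitle()]
    return result
```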
Further, in a preferred embodiment provided in the present application, the analysis summary generation module 15 is configured to process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream, and specifically configured to: and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The cross validation described herein can be understood as grouping the various types of data obtained by the analysis, such as the scene information, the first text information, and the second text information, into different course structures. It can be understood that the analysis of the scene information and the first text information is performed on the basis of the video stream sub-segment, while the recognition of the second text information is performed on the basis of the audio stream sub-segment; both sub-segments are extracted from the same multimedia stream segment. If the analysis summary of the multimedia stream were formed according to only one of the scene information, the first text information, and the second text information, an analysis summary with high accuracy could not be obtained. Therefore, the scene information, the first text information, and the second text information need to be considered comprehensively and classified and summarized according to the course structure, so that a multimedia stream analysis summary with higher accuracy is obtained. This process can also be understood as completing the decomposition of the video content data according to the acquired multimedia video content, and reclassifying and associating the decomposed data with the corresponding classroom links.
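The reclassification step above can be sketched as timestamp alignment: each timestamped result from the three analyses is assigned to the course-structure interval it falls in. The interval names and boundaries are assumptions for illustration:

```python
def group_by_course_structure(items, boundaries):
    """Assign (timestamp, payload) items to course-structure intervals.

    items: iterable of (time_s, payload) from the scene / first-text /
           second-text analyses.
    boundaries: list of (name, start_s) course-link boundaries, sorted
                by ascending start time.
    """
    sections = {name: [] for name, _ in boundaries}
    for t, payload in items:
        current = boundaries[0][0]
        for name, start in boundaries:    # last boundary not after t wins
            if t >= start:
                current = name
        sections[current].append(payload)
    return sections
```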
It is to be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for processing a multimedia stream, comprising the steps of:
acquiring a multimedia stream fragment;
decoding to obtain a video stream sub-segment and an audio stream sub-segment;
analyzing the video stream sub-segment to generate scene information and first text information;
analyzing the audio stream sub-segment to generate second text information;
and processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
2. The method of claim 1, wherein analyzing the video stream sub-segments to generate scene information comprises:
and analyzing the video stream sub-segments to generate object-oriented identity characteristic identification information and description information of object action behaviors.
3. The method of claim 1, wherein analyzing the video stream sub-segment to generate first text information comprises:
and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
4. The method for processing multimedia stream according to claim 3, wherein analyzing the video stream sub-segment to generate the first text information pointed by the object action behavior comprises:
analyzing the video stream sub-segments to obtain images lasting for a preset time;
and recognizing the image by using OCR to generate first text information.
5. The multimedia stream processing method as claimed in claim 4, wherein the first text information includes at least one of teaching link information and knowledge point information.
6. The method for processing the multimedia stream according to claim 1, wherein the second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information.
7. The method for processing the multimedia stream according to claim 1, wherein processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream comprises:
and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
8. A multimedia stream processing apparatus, comprising:
the acquisition module is used for acquiring the multimedia stream fragments;
the decoding module is used for decoding and acquiring the video stream sub-segment and the audio stream sub-segment;
the video analysis module is used for analyzing the video stream sub-segments to generate scene information and first text information;
the audio analysis module is used for analyzing the audio stream sub-segment to generate second text information;
and the analysis abstract generating module is used for processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
9. The multimedia stream processing apparatus according to claim 8, wherein the video analysis module is configured to analyze the video stream sub-segment to generate scene information, and is specifically configured to:
and analyzing the video stream sub-segments to generate object-oriented identity characteristic identification information and description information of object action behaviors.
10. The multimedia stream processing apparatus according to claim 8, wherein the video analysis module is configured to analyze the video stream sub-segment to generate a first text message, and is specifically configured to:
and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
CN202111666523.6A 2021-12-31 2021-12-31 Multimedia stream processing method and device Active CN114005079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111666523.6A CN114005079B (en) 2021-12-31 2021-12-31 Multimedia stream processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111666523.6A CN114005079B (en) 2021-12-31 2021-12-31 Multimedia stream processing method and device

Publications (2)

Publication Number Publication Date
CN114005079A true CN114005079A (en) 2022-02-01
CN114005079B CN114005079B (en) 2022-04-19

Family

ID=79932528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111666523.6A Active CN114005079B (en) 2021-12-31 2021-12-31 Multimedia stream processing method and device

Country Status (1)

Country Link
CN (1) CN114005079B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130004081A1 (en) * 2011-06-30 2013-01-03 Fujitsu Limited Image recognition device, image recognizing method, storage medium that stores computer program for image recognition
CN103646094A (en) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 System and method for automatic extraction and generation of audiovisual product content abstract
CN108124191A (en) * 2017-12-22 2018-06-05 北京百度网讯科技有限公司 A kind of video reviewing method, device and server
US20180336417A1 (en) * 2017-05-18 2018-11-22 Wipro Limited Method and a system for generating a contextual summary of multimedia content
CN108920513A (en) * 2018-05-31 2018-11-30 深圳市图灵机器人有限公司 A kind of multimedia data processing method, device and electronic equipment
US20200026729A1 (en) * 2017-03-02 2020-01-23 Ricoh Company, Ltd. Behavioral Measurements in a Video Stream Focalized on Keywords
CN110991246A (en) * 2019-10-31 2020-04-10 天津市国瑞数码安全系统股份有限公司 Video detection method and system
CN111260975A (en) * 2020-03-16 2020-06-09 安博思华智能科技有限责任公司 Method, device, medium and electronic equipment for multimedia blackboard teaching interaction
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20200242953A1 (en) * 2017-10-20 2020-07-30 Shenzhen Eaglesoul Technology Co., Ltd. Internet teaching platform-based following teaching system
US20200314460A1 (en) * 2018-04-24 2020-10-01 Tencent Technology (Shenzhen) Company Limited Video stream processing method, computer device, and storage medium
WO2020215966A1 (en) * 2019-04-26 2020-10-29 北京大米科技有限公司 Remote teaching interaction method, server, terminal and system
CN111898441A (en) * 2020-06-30 2020-11-06 华中师范大学 Online course video resource content identification and evaluation method and intelligent system
CN112468877A (en) * 2021-02-01 2021-03-09 北京中科大洋科技发展股份有限公司 Intelligent news cataloging method based on AI content analysis and OCR recognition
US10978077B1 (en) * 2019-10-31 2021-04-13 Wisdom Garden Hong Kong Limited Knowledge point mark generation system and method thereof
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN112995696A (en) * 2021-04-20 2021-06-18 共道网络科技有限公司 Live broadcast room violation detection method and device
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113761986A (en) * 2020-06-05 2021-12-07 阿里巴巴集团控股有限公司 Text acquisition method, text live broadcast equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LALITHA AGNIHOTRI et al.: ""Multimedia summary": video, audio, text information", Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval *
YE ZEXIONG: "Summary and Retrieval System Based on Video Content Analysis", China Master's Theses Full-text Database, Information Science and Technology *
JI XU: "Research on Content-based Static Summarization Technology for News Video", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN114005079B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN108648757B (en) Analysis method based on multi-dimensional classroom information
CN109359215B (en) Video intelligent pushing method and system
Ye et al. Recognizing american sign language gestures from within continuous videos
CN107920280A (en) The accurate matched method and system of video, teaching materials PPT and voice content
CN109275046A (en) A kind of teaching data mask method based on double video acquisitions
CN109697976B (en) Pronunciation recognition method and device
CN112115301B (en) Video annotation method and system based on classroom notes
CN111833672B (en) Teaching video display method, device and system
CN110569393B (en) Short video cutting method for air classroom
CN111145719B (en) Data labeling method and device for Chinese-English mixing and tone labeling
CN111833861A (en) Artificial intelligence based event evaluation report generation
CN110427977B (en) Detection method for classroom interaction behavior
KR20190080314A (en) Method and apparatus for providing segmented internet based lecture contents
CN110837793A (en) Intelligent recognition handwriting mathematical formula reading and amending system
CN112347997A (en) Test question detection and identification method and device, electronic equipment and medium
CN116050892A (en) Intelligent education evaluation supervision method based on artificial intelligence
CN113779345B (en) Teaching material generation method and device, computer equipment and storage medium
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
KR20190068841A (en) System for training and evaluation of english pronunciation using artificial intelligence speech recognition application programming interface
CN114005079B (en) Multimedia stream processing method and device
Krishnamoorthy et al. E-Learning Platform for Hearing Impaired Students
CN117252259A (en) Deep learning-based natural language understanding method and AI teaching aid system
CN116825288A (en) Autism rehabilitation course recording method and device, electronic equipment and storage medium
CN114173191B (en) Multi-language answering method and system based on artificial intelligence
CN114972716A (en) Lesson content recording method, related device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant