CN114005079A - Multimedia stream processing method and device - Google Patents

Multimedia stream processing method and device

Info

Publication number
CN114005079A
CN114005079A (application CN202111666523.6A)
Authority
CN
China
Prior art keywords
information
segment
text information
sub
video stream
Prior art date
Legal status
Granted
Application number
CN202111666523.6A
Other languages
Chinese (zh)
Other versions
CN114005079B (en)
Inventor
赵悦汐
程红兵
鞠剑伟
昝晨辉
Current Assignee
Beijing Jinmao Education Technology Co ltd
Original Assignee
Beijing Jinmao Education Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jinmao Education Technology Co ltd filed Critical Beijing Jinmao Education Technology Co ltd
Priority to CN202111666523.6A
Publication of CN114005079A
Application granted
Publication of CN114005079B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a multimedia stream processing method and device. The method comprises the following steps: acquiring a multimedia stream segment; decoding it to obtain a video stream sub-segment and an audio stream sub-segment; analyzing the video stream sub-segment to generate scene information and first text information; analyzing the audio stream sub-segment to generate second text information; and processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream. By disassembling the multimedia stream file, several independent AI modules can be combined effectively to identify the content of the multimedia file in a complex scene, which effectively improves the recognition efficiency of existing independent AI technology in complex scenes.

Description

Multimedia stream processing method and device
Technical Field
The present application relates to the field of multimedia information identification technologies, and in particular, to a method and an apparatus for processing a multimedia stream.
Background
With the continuous development and popularization of AI technology, many mature AI modules — for example, Alibaba's multimedia AI — are available on the market to process the information streams in media, such as the video stream, the audio stream, or a combined video-and-audio stream of a multimedia file. When a multimedia stream is processed, the corresponding content of the acquired multimedia file can be identified by an AI module with the matching function.
In the course of implementing the prior art, the inventors found that:
a conventional AI module has a single recognition mode. Faced with a complex scene to be identified, a single AI module cannot complete the analysis, which lowers the recognition efficiency for the multimedia file.
Therefore, it is desirable to provide a multimedia stream processing method and apparatus that solve the technical problem of the low recognition efficiency of existing independent AI technology in complex scenes.
Disclosure of Invention
The embodiment of the application provides a multimedia stream processing method and device, which are used to solve the technical problem of the low recognition efficiency of existing independent AI technology in complex scenes.
Specifically, a multimedia stream processing method includes the following steps:
acquiring a multimedia stream segment;
decoding to obtain a video stream sub-segment and an audio stream sub-segment;
analyzing the video stream sub-segment to generate scene information and first text information;
analyzing the audio stream sub-segment to generate second text information;
and processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
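The five steps above can be sketched as a minimal pipeline. This is only an illustration of the data flow, not the patent's implementation: every function name below is a placeholder, and the stub analyzers stand in for the independent AI modules the method combines.

```python
from dataclasses import dataclass

@dataclass
class AnalysisSummary:
    scene_info: list   # object identities and action descriptions
    first_text: list   # text recognized from video frames (e.g. by OCR)
    second_text: list  # text transcribed from audio (e.g. by ASR)

def process_multimedia_segment(segment: bytes) -> AnalysisSummary:
    """Mirror of steps S100-S400: decode, analyze each sub-segment, merge."""
    video_sub, audio_sub = decode(segment)             # S200
    scene_info, first_text = analyze_video(video_sub)  # S310
    second_text = analyze_audio(audio_sub)             # S320
    return AnalysisSummary(scene_info, first_text, second_text)  # S400

# Trivial stand-ins for the independent AI modules (illustrative only).
def decode(segment):
    # split into (video sub-segment, audio sub-segment)
    return segment[:len(segment) // 2], segment[len(segment) // 2:]

def analyze_video(video_sub):
    return (["teacher standing"], ["PPT: Chapter 1"])

def analyze_audio(audio_sub):
    return ["Today we discuss chapter one."]
```

In a real system each stub would delegate to a separate single-function AI module, which is exactly the combination of independent modules the disclosure describes.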
Further, analyzing the video stream sub-segment to generate scene information specifically includes:
analyzing the video stream sub-segment to generate identity feature identification information of the objects and description information of the objects' action behaviors.
Further, analyzing the video stream sub-segment to generate first text information specifically includes:
analyzing the video stream sub-segment to generate first text information pointed to by the object action behavior.
Further, analyzing the video stream sub-segment to generate the first text information pointed to by the object action behavior specifically includes:
analyzing the video stream sub-segment to obtain images that persist for a preset duration;
and recognizing the images by OCR to generate the first text information.
Further, the first text information comprises at least one of teaching link information and knowledge point information.
Further, the second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information.
Further, processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream specifically includes:
and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The embodiment of the application also provides a multimedia stream processing device.
Specifically, a multimedia stream processing apparatus includes:
the acquisition module is used for acquiring the multimedia stream segment;
the decoding module is used for decoding and acquiring the video stream sub-segment and the audio stream sub-segment;
the video analysis module is used for analyzing the video stream sub-segments to generate scene information and first text information;
the audio analysis module is used for analyzing the audio stream sub-segment to generate second text information;
and the analysis abstract generating module is used for processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
Further, the video analysis module is configured to analyze the video stream sub-segment to generate scene information, and is specifically configured to:
analyze the video stream sub-segment to generate identity feature identification information of the objects and description information of the objects' action behaviors.
Further, the video analysis module is configured to analyze the video stream sub-segment to generate first text information, and is specifically configured to:
analyze the video stream sub-segment to generate the first text information pointed to by the object action behavior.
The technical solution provided by the embodiments of the application has at least the following beneficial effects:
by disassembling the multimedia stream file, several independent AI modules can be combined effectively to identify the content of the multimedia file in a complex scene, which effectively improves the recognition efficiency of existing independent AI technology in complex scenes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a multimedia stream processing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of a multimedia stream processing apparatus according to an embodiment of the present disclosure.
100 multimedia stream processing apparatus
11 acquisition module
12 decoding module
13 video analysis module
14 audio frequency analysis module
15 analysis summary generation module
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is understood that a multimedia stream file records both video stream information and audio stream information. The video stream information mainly corresponds to a number of consecutive frame images in the multimedia file; the audio stream information corresponds to the collection of speech information in the multimedia file. Accordingly, the video stream information records scene information and text information related to the environment, while the audio stream information records the voice information corresponding to that environment in the video stream. Scene information here can be understood as the object information, related to the presented objects, recorded in each frame image; text information here can be understood as the character-related symbol information recorded in each frame image.
With a single AI module, only the scene information or text information in the video stream, or the information in the audio stream, can be recognized — that is, object behaviors or existing text in a video file, or the voice content of an audio file. However, multimedia files of complex scenes usually contain video information and audio information at the same time. If recognition is still performed by a single AI module, the video stream information and audio stream information recorded in the multimedia file cannot be recognized comprehensively, so the identified target content deviates to some extent from the real content recorded in the multimedia file. Although several different single-purpose AI modules could be used simultaneously to recognize the recorded content of a complex scene, the computational load on each single module is large. This reduces the recognition speed for the multimedia file and is not conducive to structurally combining the related recognition results.
The embodiment of the application provides a multimedia stream processing method, which is mainly used to process multimedia files of complex scenes. In one embodiment provided by the present application, the multimedia stream processing method may be used to process a multimedia file that records the complex scene of a classroom teaching process. Specifically, referring to fig. 1, the multimedia stream processing method includes the following steps:
s100: a multimedia stream fragment is obtained.
The multimedia stream segment may be a file that records media information of a corresponding scene, such as text, graphics, video, animation, and audio. In a specific embodiment provided by the present application, the obtained multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene. The multimedia stream segment can be shot by a corresponding video capture device, so that the real-time scene of a classroom can be recorded and a multimedia file containing information such as sound, characters, pictures, and person objects in the classroom teaching process is obtained.
S200: and decoding to obtain the video stream sub-segment and the audio stream sub-segment.
A video stream sub-segment is here understood to be the image information in the multimedia segment; an audio stream sub-segment is here understood to be the sound information in the multimedia segment. Decoding the acquired multimedia stream segment means extracting the image information and the sound information from the multimedia file of a certain duration and converting them into a number of consecutive frame images and continuous audio in a preset file format, thereby obtaining the video stream sub-segment and the audio stream sub-segment corresponding to the multimedia stream segment.
When the obtained multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene, decoding correspondingly yields a video stream sub-segment that records information such as characters, pictures, and person objects in the classroom teaching process, and an audio stream sub-segment that records the sound information of the classroom teaching process.
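The patent does not name a particular decoder. Purely as an illustration, splitting one multimedia file into a video sub-segment (frame images) and an audio sub-segment (a sound track) could be delegated to a tool such as FFmpeg; the sketch below only builds the two command lines without executing them, and all file paths are hypothetical.

```python
def build_demux_commands(src: str, fps: int = 1):
    """Build FFmpeg commands that would extract consecutive frame images
    (the video stream sub-segment) and a WAV track (the audio stream
    sub-segment) from one multimedia segment."""
    frames_cmd = ["ffmpeg", "-i", src,
                  "-vf", f"fps={fps}",          # sample frames at a fixed rate
                  "frames/frame_%05d.png"]      # consecutive frame images
    audio_cmd = ["ffmpeg", "-i", src,
                 "-vn",                         # drop the video track
                 "-acodec", "pcm_s16le", "-ar", "16000",
                 "audio.wav"]                   # continuous audio
    return frames_cmd, audio_cmd

frames_cmd, audio_cmd = build_demux_commands("lesson_segment.mp4")
```

In practice the commands would be run with `subprocess.run`, and the chosen frame rate and audio sample rate would depend on the downstream AI modules.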
S310: and analyzing the video stream sub-segment to generate scene information and first text information.
It is understood that several consecutive frames of images in a multimedia stream segment may constitute a video stream sub-segment, and each frame image records corresponding scene information. In a classroom teaching scene, the video stream sub-segment consists of a number of consecutive frame images that record information such as characters, pictures, and person objects in the classroom teaching process. Through analysis by AI modules with the corresponding functions, the specific scene information corresponding to the person objects in the current video stream sub-segment and the specific text information corresponding to the characters and pictures in it can be obtained.
Specifically, by identifying the person objects in the current video stream sub-segment, the specific action category of the current person object can be determined, which makes it convenient to determine the specific classroom teaching scene corresponding to the current video stream sub-segment. By identifying the specific text information corresponding to the characters and pictures in the current video stream sub-segment, the specific text category or description content of those characters and pictures can be determined, and thus the first text information corresponding to the current video stream sub-segment can be generated. The first text information is here understood to be text information generated from the video stream sub-segment.
S320: and analyzing the audio stream sub-segment to generate second text information.
It will be appreciated that the audio stream sub-segment may be generated from the speech information in the multimedia stream segment. In a classroom teaching scene, the audio stream sub-segment is a file that records the relevant sound information of the classroom teaching process. Through analysis by an AI module with the corresponding function, the specific narration content of the audio stream sub-segment can be determined, and second text information corresponding to that narration content is obtained. The second text information is here understood to be text information generated from the audio stream sub-segment.
S400: and processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The analysis summary here can be understood as an overview of the specific real-time scene corresponding to the currently processed multimedia stream segment. Processing the scene information, the first text information, and the second text information mainly identifies the key data that is highly relevant to the real-time scene of the multimedia stream segment; the identified target data is then integrated to obtain the specific teaching process corresponding to the currently processed segment. Splitting the multimedia stream segment into a video stream sub-segment and an audio stream sub-segment and analyzing each with an AI module of the corresponding function effectively reduces the data processing load on any single-function AI module and allows the analysis module with the right analysis function to be selected precisely, thereby improving the recognition efficiency for the multimedia stream segment.
Further, in a preferred embodiment provided in the present application, analyzing the video stream sub-segment to generate scene information specifically includes: analyzing the video stream sub-segment to generate identity feature identification information of the objects and description information of the objects' action behaviors.
The identity feature identification information of an object can be understood as the facial feature information of the object. It can be understood that the specific identity information of a person object can be determined by acquiring an image containing the person's face and recognizing the facial features with a pre-trained recognition algorithm — for example, determining a student's name and student number, or a teacher's name and staff number.
The description information of an object's action behavior can be understood as the specific action category of the person object in the current video stream sub-segment. It can be understood that the specific action category of a person object can be determined by acquiring an image of the person's body movement and recognizing it with a pre-trained recognition algorithm — for example, determining that the current action behavior of the person object is a writing, standing, or hand-raising behavior.
By identifying the specific identity information and behavior information of the relevant objects in the video stream sub-segment, the behaviors of teachers and students in the teaching scene corresponding to the video stream sub-segment can be determined, which improves the accuracy of the multimedia stream analysis summary.
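The patent does not specify an identity-matching algorithm. As an illustration only, matching a face feature vector produced by a pre-trained recognition algorithm against a roster of registered identities can be sketched as a nearest-neighbor lookup; the roster entries, vector length, and distance threshold below are all assumptions.

```python
def identify_object(feature, roster, max_distance=0.6):
    """feature: face feature vector for one detected person.
    roster: mapping of registered identity -> feature vector.
    Returns the closest registered identity, or None when no
    registered vector is near enough."""
    def dist(a, b):  # Euclidean distance between two vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(roster, key=lambda name: dist(feature, roster[name]))
    return best if dist(feature, roster[best]) <= max_distance else None

# Hypothetical roster with toy 3-dimensional "embeddings".
roster = {"student_42": [0.1, 0.9, 0.3], "teacher_7": [0.8, 0.2, 0.5]}
```

A production system would use embeddings from an actual face-recognition model and tune the threshold on validation data.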
Further, in a preferred embodiment provided in the present application, analyzing the video stream sub-segment to generate the first text information specifically includes: and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
The first text information pointed to by the object action behavior can be understood as text information that has a certain degree of association with the specific behavior of the person object. It can be understood that the first text information generated from the video stream sub-segment includes the specific text category or description content corresponding to the character and picture information in the classroom teaching scene — for example, the PPT presentation page in the classroom, or the blackboard-poster information belonging to the classroom background. In a classroom teaching scene, the text in the classroom's PPT presentation page is text information related to the action behavior of the person object, whereas the blackboard-poster information is background information in the video stream sub-segment: it is unrelated to the person's action behavior in the current scene and therefore does not belong to the first text information.
By analyzing, in a targeted manner, only the first text information related to the actions of the person objects in the video stream sub-segment, the data processing load of the corresponding functional module can be reduced while its recognition accuracy is improved, so the analysis efficiency of the first text information is effectively raised.
Further, in a preferred embodiment provided by the present application, analyzing a video stream sub-segment to generate first text information to which an object action behavior points specifically includes: analyzing the video stream sub-segments to obtain images lasting for a preset time; and recognizing the image by using OCR to generate first text information.
It can be understood that the object action behavior points to the text information that has a certain degree of association with the object's behavior in the scene corresponding to the current video stream sub-segment. In a classroom teaching scene, such text can be understood as text related to the teaching content — for example, a teacher's board writing, or the text in a PPT presentation page. It should be noted that, in an actual teaching process, if the text information pointed to by the object action behavior is important, the person object will keep performing the relevant behavior toward that text for a certain duration; that is, an image carrying important text information stays on screen longer. Conversely, when the text information pointed to by the object action behavior has low importance or is worthless, the corresponding duration is short; that is, an image carrying unimportant text stays only briefly. Therefore, the importance of the text in an image can be judged from the image's duration, and the threshold duration can be preset according to actual conditions or empirical values. If the duration of a frame image meets the preset condition, the subsequent text recognition process is carried out; correspondingly, if it does not, the corresponding text is of low importance and need not be recognized. In this way, the recognition efficiency of the first text can be increased while its recognition accuracy is ensured.
Specifically, identifying the text information in images that persist for the preset duration can be implemented by OCR; that is, optical character recognition is used to identify the character information in those images of the video stream sub-segment that satisfy the preset condition. For example, the relevant content of a PPT page displayed in the classroom that meets the preset condition is recognized as first text information by OCR, or the specifically marked character content on the classroom blackboard is recognized as first text information by OCR.
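The dwell-time rule described above can be sketched as a simple pre-filter in front of OCR: only frames whose visible content persists for at least the preset duration are selected for recognition. The frame representation (timestamp plus a content hash) and the threshold value are illustrative assumptions, not part of the patent.

```python
def select_frames_for_ocr(frames, min_duration=5.0):
    """frames: list of (timestamp_seconds, content_hash), in time order.
    A run of identical content lasting >= min_duration is deemed
    important (e.g. a PPT page the class dwells on), and the first
    frame of that run is selected for OCR; short-lived content is
    skipped as low-importance."""
    selected = []
    run_start, run_hash = None, None
    for ts, h in frames:
        if h != run_hash:                      # content changed: new run
            run_start, run_hash = ts, h
        elif ts - run_start >= min_duration and (run_start, h) not in selected:
            selected.append((run_start, h))    # run long enough: keep once
    return selected
```

The selected frames would then be passed to an OCR engine (e.g. Tesseract); the filter itself is what keeps the OCR workload proportional to the important content only.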
Further, in a preferred embodiment provided by the present application, the first text information includes at least one of teaching link information and knowledge point information.
The teaching link information here can be understood as the teaching stage, recorded in an image of the video stream sub-segment that satisfies the preset duration, to which the current image corresponds — for example, the specific title to which the current PPT page corresponds. The knowledge point information here can be understood as the specific annotation content of the current image recorded in such an image.
It can be understood that the relevant written materials of a classroom teaching process inevitably contain some information unrelated to the teaching content, referred to here as worthless information. Identifying such worthless information increases the computation and raises the proportion of meaningless content in the first text. Therefore, by identifying the teaching link information or knowledge point information in the video stream sub-segment in a targeted manner, more accurate first text identification information can be obtained, the redundancy of the first text information is reduced, and the determination of the multimedia analysis summary is facilitated.
Further, in a preferred embodiment provided by the present application, the second text information specifically includes at least one of text correction information, keyword information, question information, and emotion description information.
It is to be understood that the second text information corresponds to text information generated from the audio stream sub-segment. In a classroom teaching scene, the audio stream sub-segment is a file that records the relevant sound information of the classroom teaching process. In practical applications, recognizing the audio stream sub-segment as the second text can be used to produce subtitle information synchronized with the video stream sub-segment. Through a natural language processing model, the character information in the subtitles can be error-corrected, and keyword information, question information, and emotion description information can be extracted; this information can be understood as the forming elements of the multimedia stream analysis summary. Therefore, the analyzed second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information, which helps ensure the accuracy of the multimedia stream analysis summary.
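As a rough illustration only — a real system would use a trained natural language processing model, as the paragraph above notes — extracting question information and keyword information from transcribed subtitle lines can be sketched with simple rules. The stopword list and the sample sentences are assumptions.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list for the toy example.
STOPWORDS = {"the", "a", "an", "is", "are", "we", "what", "of", "to", "in"}

def extract_second_text_info(subtitles):
    """subtitles: list of transcribed lines (the raw second text).
    Returns question lines (lines ending in '?') and the most
    frequent content words as rough keyword information."""
    questions = [s for s in subtitles if s.rstrip().endswith("?")]
    words = re.findall(r"[a-z']+", " ".join(subtitles).lower())
    keywords = [w for w, _ in Counter(
        w for w in words if w not in STOPWORDS).most_common(3)]
    return {"questions": questions, "keywords": keywords}
```

Emotion description information and text error correction, also named in this embodiment, would require actual NLP models and are omitted from the sketch.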
Further, in a preferred embodiment provided in the present application, the processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream specifically includes: and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The cross-validation described here can be understood as grouping the various types of data obtained by the analysis — the scene information, the first text information, and the second text information — into different course structures. It is understood that the analysis of the scene information and the first text information is performed on the basis of the video stream sub-segment, while the identification of the second text information is performed on the basis of the audio stream sub-segment, and both sub-segments are extracted from the same multimedia stream segment. If the analysis summary of the multimedia stream were formed from only one of the scene information, the first text information, and the second text information, a summary of high accuracy could not be obtained. Therefore, the scene information, the first text information, and the second text information must be considered together and classified and summarized according to the course structure, so that a multimedia stream analysis summary of higher accuracy is obtained. This process can also be understood as disassembling the video content data of the acquired multimedia video and reclassifying and associating the disassembled data with the corresponding classroom links.
In a specific implementation provided by the application, the course content of a classroom teaching scene can be decomposed into three major categories: teaching content, teacher and student behavior, and teacher and student language. The teaching content is mainly embodied in the teaching PPT and the teacher's voice content; teacher and student behavior is mainly embodied in the changes of body movements; teacher and student language is mainly embodied in verbal communication. Accordingly, for the course content, OCR is used to recognize the specifically marked text content on the PPT or blackboard; at this point, the first division of the overall course structure — the distinction of classroom links — is completed. By combining face recognition with action-behavior recognition, a series of actions such as the teacher lecturing and students writing, raising hands, standing up, and reading can be divided into behaviors. Finally, the classroom speech is translated into real-time subtitles; through natural language processing, the text information in the subtitles can be error-corrected, and context information such as keyword information, question information, and emotion description information can be extracted. In this way, the relevant scene information, first text information, and second text information are obtained. Grouping these various types of data into different course structures completes the disassembly of the classroom teaching video content, and the disassembled data are reclassified and associated with the corresponding classroom links.
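The regrouping step described in this implementation can be sketched as routing each recognized item into one of the three course-structure categories named above according to the module that produced it. The three categories come from this paragraph; the item layout and source tags are assumptions for illustration.

```python
def build_lesson_summary(items):
    """items: list of dicts like {"source": "ocr" | "action" | "speech",
    "text": ...}. Group them into the three course categories named
    in the description: teaching content, teacher/student behavior,
    and teacher/student language."""
    category_of = {"ocr": "teaching_content",   # PPT/blackboard text
                   "action": "behavior",        # recognized body movements
                   "speech": "language"}        # subtitle-derived text
    summary = {"teaching_content": [], "behavior": [], "language": []}
    for item in items:
        summary[category_of[item["source"]]].append(item["text"])
    return summary
```

A fuller version would also carry timestamps, so each grouped item could be associated back to its classroom link, as the description requires.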
The embodiment of the present application further provides a multimedia stream processing apparatus 100, which is mainly used to process multimedia files of complex scenes. In one embodiment provided by the present application, the multimedia stream processing apparatus 100 may be used to process a multimedia file that records the complex scene of a classroom teaching process. Specifically, referring to fig. 2, the multimedia stream processing apparatus includes:
an obtaining module 11, configured to obtain a multimedia stream segment;
a decoding module 12, configured to decode and obtain a video stream sub-segment and an audio stream sub-segment;
the video analysis module 13 is configured to analyze the video stream sub-segment to generate scene information and first text information;
an audio analysis module 14, configured to analyze the audio stream sub-segment to generate second text information;
and the analysis summary generation module 15 is configured to process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream.
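The module structure of the apparatus can be sketched as five cooperating components. The class and method names below are illustrative choices that mirror modules 11-15 of fig. 2, not an implementation disclosed by the patent; the wired-in lambdas are trivial stand-ins for the real AI modules.

```python
class MultimediaStreamProcessor:
    """Mirrors apparatus 100: modules 11-15 as injectable callables."""
    def __init__(self, acquire, decode, analyze_video, analyze_audio, summarize):
        self.acquire = acquire              # obtaining module 11
        self.decode = decode                # decoding module 12
        self.analyze_video = analyze_video  # video analysis module 13
        self.analyze_audio = analyze_audio  # audio analysis module 14
        self.summarize = summarize          # analysis summary module 15

    def run(self, source):
        segment = self.acquire(source)
        video_sub, audio_sub = self.decode(segment)
        scene, first_text = self.analyze_video(video_sub)
        second_text = self.analyze_audio(audio_sub)
        return self.summarize(scene, first_text, second_text)

# Wiring with trivial stand-ins for the five modules:
processor = MultimediaStreamProcessor(
    acquire=lambda src: src,
    decode=lambda seg: (seg + ":video", seg + ":audio"),
    analyze_video=lambda v: (["scene of " + v], ["text of " + v]),
    analyze_audio=lambda a: ["speech of " + a],
    summarize=lambda s, t1, t2: {"scene": s, "first": t1, "second": t2},
)
```

Making the modules injectable reflects the apparatus description: each single-function analyzer can be swapped independently without touching the overall flow.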
The obtaining module 11 is configured to obtain a multimedia stream segment. The multimedia stream segment may be a file that records media information of a corresponding scene, such as text, graphics, video, animation, and audio. In a specific embodiment provided by the present application, the obtained multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene. The multimedia stream segment can be shot by a corresponding video capture device, so that the real-time scene of a classroom can be recorded and a multimedia file containing information such as sound, characters, pictures, and person objects in the classroom teaching process is obtained.
And a decoding module 12, configured to decode and obtain the video stream sub-segment and the audio stream sub-segment. A video stream sub-segment is here understood to be the image information in a multimedia segment. An audio stream sub-segment is here understood to be the sound information in a multimedia segment. And decoding the acquired media stream segment, namely extracting image information and sound information in the multimedia file with a certain time length, and converting the image information and the sound information into a plurality of continuous frames of images and continuous audios in a preset file format, so as to obtain a video stream sub-segment and an audio stream sub-segment corresponding to the multimedia stream segment.
When the obtained multimedia stream segment is a multimedia file with a certain duration and recorded with a classroom teaching scene, a video stream sub-segment recorded with information such as characters, pictures, personnel objects and the like in the classroom teaching process and an audio stream sub-segment recorded with sound information in the classroom teaching process are correspondingly obtained through decoding.
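The frame and sample bookkeeping implied by decoding a fixed-duration segment can be sketched as follows; the frame rate and sample rate here are assumed parameters, and a real implementation would rely on a demuxer/decoder such as FFmpeg rather than this arithmetic:

```python
def split_segment(duration_s, fps=25, sample_rate=16000):
    """Return frame timestamps and the audio sample count for a segment.

    duration_s: segment length in seconds (the 'certain duration' above).
    fps / sample_rate: assumed decoding parameters, not from the source.
    """
    # one timestamp per decoded frame of the video stream sub-segment
    frame_times = [round(i / fps, 3) for i in range(int(duration_s * fps))]
    # total PCM samples in the audio stream sub-segment
    audio_samples = int(duration_s * sample_rate)
    return frame_times, audio_samples
```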
And the video analysis module 13 is configured to analyze the video stream sub-segment to generate scene information and first text information. It is understood that several consecutive frames of images in a multimedia stream segment may constitute a video stream sub-segment. And each frame image is recorded with corresponding scene information. In a classroom teaching scene, the video stream sub-segment is a plurality of continuous frames of images recorded with information such as characters, pictures, personnel objects and the like in the classroom teaching process. And specific scene information corresponding to the person object in the current video stream sub-segment and specific text information corresponding to the characters and pictures in the current video stream sub-segment can be obtained through analysis of the AI module with the corresponding function.
Specifically, the specific action category of the current person object can be determined by identifying the person object in the current video stream sub-segment, so that the specific classroom teaching scene corresponding to the current video stream sub-segment is conveniently determined. Through the identification of the specific text information corresponding to the characters and pictures in the current video stream sub-segment, the specific text type or description content corresponding to the characters and pictures corresponding to the current video stream sub-segment can be determined, and therefore the first text information corresponding to the current video stream sub-segment can be generated. The first text information is here understood to be text information generated from a sub-segment of the video stream.
And the audio analysis module 14 is configured to analyze the audio stream sub-segment to generate second text information. It will be appreciated that the audio stream sub-segment may be generated from speech information in the multimedia stream segment. In a classroom teaching scene, the audio stream sub-segment is a file recorded with related sound information in the classroom teaching process. And the AI module with the corresponding function analyzes to determine the specific narration content of the audio stream sub-segment and obtain second text information corresponding to the narration content of the audio stream sub-segment. The second text information is here understood to be text information generated from sub-segments of the audio stream.
And the analysis summary generation module 15 is configured to process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream. The analysis summary here can be understood as an overview of the specific real-time scene corresponding to the currently processed multimedia stream segment. Processing the scene information, the first text information, and the second text information mainly involves identifying the key data with high relevance to the real-time scene corresponding to the multimedia stream segment. The identified target data are then integrated to obtain the specific teaching process corresponding to the currently processed multimedia stream segment. By dividing the multimedia stream segment into the video stream sub-segment and the audio stream sub-segment and analyzing each with an AI module having the corresponding function, the data processing amount of any single-function AI module can be effectively reduced, and the analysis module with the corresponding analysis function can be accurately selected, so that the recognition efficiency for the multimedia stream segment is improved.
Further, in a preferred embodiment provided in the present application, the video analysis module 13 is configured to analyze the video stream sub-segment to generate scene information, and specifically configured to: and analyzing the video stream sub-segments to generate object-oriented identity characteristic identification information and description information of object action behaviors.
The identity feature identification information of an object can be understood as the facial feature information of the object. It can be understood that the specific identity information of a person object can be determined by acquiring an image containing the face of the person object and recognizing the facial features of the person object through a pre-trained recognition algorithm, for example, determining information such as the name and student ID number of a student, or the name and staff ID number of a teacher.
The description information of the action behavior of an object can be understood as the specific action category of the person object in the current video stream sub-segment. It can be understood that the specific action category of a person object can be determined by acquiring an image containing the body movements of the person object and recognizing the image through a pre-trained recognition algorithm, for example, determining that the current action behavior of the person object is a hand-raising behavior, a standing behavior, or a writing behavior.
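As an illustration of how a recognition algorithm might map body geometry to the action categories named above, the following rule-based stand-in uses assumed keypoint heuristics; a real system would use a trained model, and these thresholds are invented for the sketch:

```python
def classify_action(keypoints):
    """Map simple body-keypoint geometry to an assumed action category.

    keypoints: dict with 'head_y', 'wrist_y', 'hip_y', 'shoulder_y'
    in image coordinates (smaller y = higher in the frame).
    """
    # wrist above the head is taken as a hand raised (assumption)
    if keypoints["wrist_y"] < keypoints["head_y"]:
        return "raising hand"
    # a tall visible torso is taken as standing (assumed threshold)
    if keypoints["hip_y"] - keypoints["shoulder_y"] > 60:
        return "standing"
    # otherwise default to a seated writing posture
    return "writing"
```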
By identifying the specific identity information and behavior information of the related objects in the video stream sub-segments, teachers and students' behaviors in the teaching scene corresponding to the video stream sub-segments can be determined, and therefore the accuracy of the multimedia stream analysis abstract is improved.
Further, in a preferred embodiment provided in the present application, the video analysis module 13 is configured to analyze the video stream sub-segment to generate first text information, specifically: and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
The first text information to which the object action behavior points can be understood as text information having a certain degree of association with the specific behavior of the person object. It can be understood that the first text information generated according to the video stream sub-segment includes the specific text category or description content corresponding to the text and picture information in the classroom teaching scene, for example, a PPT display page in the classroom or blackboard bulletin information in the classroom background. In a classroom teaching scene, the text information on the PPT display page in the classroom is text information related to the action behaviors of the person objects, whereas the blackboard bulletin information is background information in the video stream sub-segment and is not related to the action behaviors of the persons in the current scene, and thus does not belong to the first text information.
By analyzing, in a targeted manner, the first text information related to the actions of the person objects in the video stream sub-segment, the data processing amount of the corresponding functional module can be reduced while its recognition accuracy is improved, so that the analysis efficiency of the first text information is effectively improved.
Further, in a preferred embodiment provided in the present application, the video analysis module 13 is configured to analyze a video stream sub-segment, and generate first text information to which an object action behavior points, and specifically configured to: analyzing the video stream sub-segments to obtain images lasting for a preset time; and recognizing the image by using OCR to generate first text information.
It can be understood that the text information to which the object action behavior points is text information having a certain degree of association with the object behavior in the scene corresponding to the current video stream sub-segment. In a classroom teaching scene, such text information can be understood as text information related to the teaching content, for example, board writing by the teacher, text information in a PPT display page, and the like. It should be noted that, in the actual teaching process, if the text information to which the object action behavior points is important, the person object will continue to perform the related behavior with respect to the corresponding text information for a certain duration. That is, an image carrying important text information stays on screen longer. Conversely, when the text information to which the object action behavior points is of low importance or is worthless text information, the corresponding duration is short; that is, an image carrying unimportant text information stays for a short time. Therefore, the importance degree of the text in an image can be judged from the duration of the image. The duration threshold can be preset according to actual conditions or empirical values. If the duration of a certain frame of image meets the preset condition, the subsequent text recognition process can be carried out. Correspondingly, if the duration of a certain frame of image does not meet the preset condition, the importance of the corresponding text is low and no recognition needs to be performed. In this way, the recognition efficiency of the first text can be improved while its recognition accuracy is ensured.
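The duration rule described above can be sketched directly: group consecutive frames whose content is unchanged, and keep for OCR only the runs that persist for at least the preset duration. The content identifiers (e.g. perceptual hashes) and the threshold value are assumptions for illustration:

```python
def select_stable_frames(frames, timestamps, min_duration):
    """Return (content, start, end) runs lasting at least min_duration seconds.

    frames: per-frame content identifiers (e.g. perceptual hashes).
    timestamps: matching frame times in seconds, sorted ascending.
    """
    selected, run_start = [], 0
    for i in range(1, len(frames) + 1):
        # a run ends at the last frame or when the content changes
        if i == len(frames) or frames[i] != frames[run_start]:
            start, end = timestamps[run_start], timestamps[i - 1]
            if end - start >= min_duration:   # preset duration condition
                selected.append((frames[run_start], start, end))
            run_start = i
    return selected
```

Only the frames returned here would then be passed to the OCR step, which is what keeps the recognition workload low.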
Specifically, the text information recorded in an image that lasts for the preset duration can be identified by means of OCR recognition. That is, an optical character recognition technique is used to identify the character information in an image of the video stream sub-segment that satisfies the preset condition. For example, the related content of a PPT page that meets the preset condition in the classroom PPT display is recognized as the first text information by means of OCR recognition, or the specifically marked character content on the classroom blackboard is recognized as the first text information by means of OCR recognition.
Further, in a preferred embodiment provided by the present application, the first text information includes at least one of teaching link information and knowledge point information.
The teaching link information here can be understood as the teaching link corresponding to the current image, recorded in an image that meets the preset duration in the video stream sub-segment, for example, the specific title to which the currently displayed PPT page corresponds. The knowledge point information here can be understood as the specific annotation content of the current image recorded in an image that meets the preset duration in the video stream sub-segment.
It can be understood that the related written materials in the classroom teaching process inevitably contain some information irrelevant to the teaching content, which is referred to herein as worthless information. Identifying the worthless information increases the calculation amount and the proportion of meaningless content in the first text. Therefore, by identifying the teaching link information or the knowledge point information in the video stream sub-segment in a targeted manner, more accurate first text recognition information can be obtained, the redundancy of the first text information is reduced, and the determination of the multimedia analysis summary is facilitated.
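One way to sketch this targeted identification is a marker-based filter that keeps OCR lines resembling teaching-link headings or knowledge-point annotations and drops the rest as worthless background text. The marker words are purely illustrative assumptions, not part of the described method:

```python
def filter_first_text(ocr_lines,
                      link_markers=("Chapter", "Section", "Step"),
                      point_markers=("Definition", "Theorem", "Note:")):
    """Split OCR output into teaching-link and knowledge-point information."""
    links, points = [], []
    for line in ocr_lines:
        if any(line.startswith(m) for m in link_markers):
            links.append(line)          # heading-like line -> teaching link
        elif any(m in line for m in point_markers):
            points.append(line)         # annotation-like line -> knowledge point
        # all other lines are treated as worthless background information
    return {"teaching_links": links, "knowledge_points": points}
```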
Further, in a preferred embodiment provided by the present application, the second text information specifically includes at least one of text correction information, keyword information, question information, and emotion description information.
It is to be understood that the second text information corresponds to text information generated from the audio stream sub-segment. In a classroom teaching scene, the audio stream sub-segment is a file recording the related sound information in the classroom teaching process. In practical applications, the audio stream sub-segment is recognized as the second text, which can be used to generate subtitle information synchronized with the video stream sub-segment. Through a natural language processing model, the character information in the subtitles can be corrected, and keyword information, question information, and emotion description information can be extracted. These items of information can be understood as the constituent elements of the multimedia stream analysis summary. Therefore, the analyzed second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information, so that the accuracy of the multimedia stream analysis summary can be ensured.
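A minimal rule-based sketch of pulling the listed element types out of subtitle text follows. A real system would use a natural language processing model; the cue words and the crude keyword heuristic here are assumptions:

```python
def analyze_subtitles(sentences):
    """Extract question, emotion and keyword elements from subtitle text."""
    emotion_cues = ("great", "well done", "excellent")   # assumed cue words
    result = {"questions": [], "emotion": [], "keywords": []}
    for s in sentences:
        if s.rstrip().endswith("?"):                     # question information
            result["questions"].append(s)
        if any(c in s.lower() for c in emotion_cues):    # emotion description
            result["emotion"].append(s)
        # crude keyword pick: capitalized mid-sentence words (assumption)
        result["keywords"] += [w for w in s.split()[1:] if w.istitle()]
    return result
```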
Further, in a preferred embodiment provided in the present application, the analysis summary generation module 15 is configured to process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream, and specifically configured to: and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
The cross validation described herein can be understood as grouping the various types of data obtained by the analysis, such as the scene information, the first text information, and the second text information, into different course structures. It can be understood that the analysis of the scene information and the first text information is performed on the basis of the video stream sub-segment, while the recognition of the second text information is performed on the basis of the audio stream sub-segment; both sub-segments are extracted from the same multimedia stream segment. If the analysis summary of the multimedia stream were formed according to only one of the scene information, the first text information, and the second text information, an analysis summary with high accuracy could not be obtained. Therefore, the scene information, the first text information, and the second text information need to be considered comprehensively and classified and summarized according to the course structure, so that a multimedia stream analysis summary with higher accuracy is obtained. This process can also be understood as completing the decomposition of the video content data according to the acquired multimedia video content, and reclassifying and associating the decomposed data with the corresponding classroom links.
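The reclassification step above can be sketched as timestamp alignment: each timestamped result from the three analyses is assigned to the course-structure interval it falls in. The interval names and boundaries are assumptions for illustration:

```python
def group_by_course_structure(items, boundaries):
    """Assign (timestamp, payload) items to course-structure intervals.

    items: iterable of (time_s, payload) from the scene / first-text /
           second-text analyses.
    boundaries: list of (name, start_s) course-link boundaries, sorted
                by ascending start time.
    """
    sections = {name: [] for name, _ in boundaries}
    for t, payload in items:
        current = boundaries[0][0]
        for name, start in boundaries:    # last boundary not after t wins
            if t >= start:
                current = name
        sections[current].append(payload)
    return sections
```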
It is to be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for processing a multimedia stream, comprising the steps of:
acquiring a multimedia stream fragment;
decoding to obtain a video stream sub-segment and an audio stream sub-segment;
analyzing the video stream sub-segment to generate scene information and first text information;
analyzing the audio stream sub-segment to generate second text information;
and processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
2. The method of claim 1, wherein analyzing the video stream sub-segments to generate scene information comprises:
and analyzing the video stream sub-segments to generate object-oriented identity characteristic identification information and description information of object action behaviors.
3. The method of claim 1, wherein analyzing the video stream sub-segment to generate first text information comprises:
and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
4. The method for processing multimedia stream according to claim 3, wherein analyzing the video stream sub-segment to generate the first text information pointed by the object action behavior comprises:
analyzing the video stream sub-segments to obtain images lasting for a preset time;
and recognizing the image by using OCR to generate first text information.
5. The multimedia stream processing method as claimed in claim 4, wherein the first text information includes at least one of teaching link information and knowledge point information.
6. The method for processing the multimedia stream according to claim 1, wherein the second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information.
7. The method for processing the multimedia stream according to claim 1, wherein processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream comprises:
and carrying out cross validation on the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
8. A multimedia stream processing apparatus, comprising:
the acquisition module is used for acquiring the multimedia stream fragments;
the decoding module is used for decoding and acquiring the video stream sub-segment and the audio stream sub-segment;
the video analysis module is used for analyzing the video stream sub-segments to generate scene information and first text information;
the audio analysis module is used for analyzing the audio stream sub-segment to generate second text information;
and the analysis abstract generating module is used for processing the scene information, the first text information and the second text information to form an analysis abstract of the multimedia stream.
9. The multimedia stream processing apparatus according to claim 8, wherein the video analysis module is configured to analyze the video stream sub-segment to generate scene information, and is specifically configured to:
and analyzing the video stream sub-segments to generate object-oriented identity characteristic identification information and description information of object action behaviors.
10. The multimedia stream processing apparatus according to claim 8, wherein the video analysis module is configured to analyze the video stream sub-segment to generate a first text message, and is specifically configured to:
and analyzing the video stream sub-segment to generate first text information pointed by the object action behavior.
CN202111666523.6A 2021-12-31 2021-12-31 Multimedia stream processing method and device Active CN114005079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111666523.6A CN114005079B (en) 2021-12-31 2021-12-31 Multimedia stream processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111666523.6A CN114005079B (en) 2021-12-31 2021-12-31 Multimedia stream processing method and device

Publications (2)

Publication Number Publication Date
CN114005079A true CN114005079A (en) 2022-02-01
CN114005079B CN114005079B (en) 2022-04-19

Family

ID=79932528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111666523.6A Active CN114005079B (en) 2021-12-31 2021-12-31 Multimedia stream processing method and device

Country Status (1)

Country Link
CN (1) CN114005079B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130004081A1 (en) * 2011-06-30 2013-01-03 Fujitsu Limited Image recognition device, image recognizing method, storage medium that stores computer program for image recognition
CN103646094A (en) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 System and method for automatic extraction and generation of audiovisual product content abstract
CN108124191A (en) * 2017-12-22 2018-06-05 北京百度网讯科技有限公司 A kind of video reviewing method, device and server
US20180336417A1 (en) * 2017-05-18 2018-11-22 Wipro Limited Method and a system for generating a contextual summary of multimedia content
CN108920513A (en) * 2018-05-31 2018-11-30 深圳市图灵机器人有限公司 A kind of multimedia data processing method, device and electronic equipment
US20200026729A1 (en) * 2017-03-02 2020-01-23 Ricoh Company, Ltd. Behavioral Measurements in a Video Stream Focalized on Keywords
CN110991246A (en) * 2019-10-31 2020-04-10 天津市国瑞数码安全系统股份有限公司 Video detection method and system
CN111260975A (en) * 2020-03-16 2020-06-09 安博思华智能科技有限责任公司 Method, device, medium and electronic equipment for multimedia blackboard teaching interaction
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20200242953A1 (en) * 2017-10-20 2020-07-30 Shenzhen Eaglesoul Technology Co., Ltd. Internet teaching platform-based following teaching system
US20200314460A1 (en) * 2018-04-24 2020-10-01 Tencent Technology (Shenzhen) Company Limited Video stream processing method, computer device, and storage medium
WO2020215966A1 (en) * 2019-04-26 2020-10-29 北京大米科技有限公司 Remote teaching interaction method, server, terminal and system
CN111898441A (en) * 2020-06-30 2020-11-06 华中师范大学 Online course video resource content identification and evaluation method and intelligent system
CN112468877A (en) * 2021-02-01 2021-03-09 北京中科大洋科技发展股份有限公司 Intelligent news cataloging method based on AI content analysis and OCR recognition
US10978077B1 (en) * 2019-10-31 2021-04-13 Wisdom Garden Hong Kong Limited Knowledge point mark generation system and method thereof
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN112995696A (en) * 2021-04-20 2021-06-18 共道网络科技有限公司 Live broadcast room violation detection method and device
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113761986A (en) * 2020-06-05 2021-12-07 阿里巴巴集团控股有限公司 Text acquisition method, text live broadcast equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LALITHA AGNIHOTRI et al.: ""Multimedia summary": video, audio, text information", Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval *
YE ZEXIONG: "Summary and Retrieval System Based on Video Content Analysis", China Master's Theses Full-text Database, Information Science and Technology *
JI XU: "Research on Content-based Static Summarization Technology for News Video", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN114005079B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN108648757B (en) Analysis method based on multi-dimensional classroom information
CN109359215B (en) Video intelligent pushing method and system
Ye et al. Recognizing american sign language gestures from within continuous videos
CN107920280A (en) The accurate matched method and system of video, teaching materials PPT and voice content
CN109275046A (en) A kind of teaching data mask method based on double video acquisitions
CN109697976B (en) Pronunciation recognition method and device
CN112115301B (en) Video annotation method and system based on classroom notes
CN111833672B (en) Teaching video display method, device and system
CN110569393B (en) Short video cutting method for air classroom
CN111145719B (en) Data labeling method and device for Chinese-English mixing and tone labeling
CN111833861A (en) Artificial intelligence based event evaluation report generation
CN110427977B (en) Detection method for classroom interaction behavior
KR20190080314A (en) Method and apparatus for providing segmented internet based lecture contents
CN110837793A (en) Intelligent recognition handwriting mathematical formula reading and amending system
CN112347997A (en) Test question detection and identification method and device, electronic equipment and medium
CN116050892A (en) Intelligent education evaluation supervision method based on artificial intelligence
CN113779345B (en) Teaching material generation method and device, computer equipment and storage medium
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
KR20190068841A (en) System for training and evaluation of english pronunciation using artificial intelligence speech recognition application programming interface
CN114005079B (en) Multimedia stream processing method and device
Krishnamoorthy et al. E-Learning Platform for Hearing Impaired Students
CN117252259A (en) Deep learning-based natural language understanding method and AI teaching aid system
CN116825288A (en) Autism rehabilitation course recording method and device, electronic equipment and storage medium
CN114173191B (en) Multi-language answering method and system based on artificial intelligence
CN114972716A (en) Lesson content recording method, related device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant