CN115552904A - Information processing method, encoder, decoder, and storage medium device - Google Patents

Information processing method, encoder, decoder, and storage medium device

Info

Publication number
CN115552904A
CN115552904A (application number CN202180035459.3A)
Authority
CN
China
Prior art keywords
voice
information
over
narration
narrative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180035459.3A
Other languages
Chinese (zh)
Inventor
于浩平 (Haoping Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN115552904A

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46 - Embedding additional information in the video signal during the compression process
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235 - Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 - Details of television systems
    • H04N5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 - Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

An information processing method, an encoder, a decoder, and a storage medium device are provided. The information processing method includes the following steps: parsing a code stream to obtain at least one piece of narration voice-over information of visual media content and a corresponding presentation time (201); and presenting the at least one piece of narration voice-over information according to the presentation time while playing the visual media content (202).

Description

Information processing method, encoder, decoder, and storage medium device
Cross Reference to Related Applications
The present application claims priority to a prior U.S. provisional patent application entitled "A technology for narrating digital video", application number 63/025,742, filed on May 15, 2020 in the name of Haoping Yu, the entire contents of which are incorporated herein by reference; and
this application claims priority to a prior U.S. provisional patent application entitled "A technology for narrating digital visual media", application number 63/034,295, filed on June 3, 2020 in the name of Haoping Yu, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to multimedia technology, and in particular, but not exclusively, to an information processing method, an encoder, a decoder, and a storage medium device.
Background
Because of their availability and affordability, smartphones have become the most popular electronic devices, and owning a smartphone is not only necessary but also the norm today. Smartphones have therefore had a significant impact on society and culture as a whole. One change in people's lifestyles is that consumers use smartphones to take pictures or record videos as a way of documenting daily activities, which has become a common trend worldwide.

Today, people feel the need to capture every moment in life. Consumers not only take pictures or record videos of famous landmarks with their smartphones, but also photograph themselves by taking selfies. As social media applications become more popular, people have begun to communicate through photos and videos in addition to face-to-face or phone conversations. They send what they have just captured to friends immediately, letting them see what they are doing. Visual media content such as images and videos has become a way to convey information and emotions.

However, relying on visual media content such as images, groups of images, or videos alone is not enough to express what people feel at that moment.
Disclosure of Invention
In view of this, the information processing method, the encoder, the decoder, and the storage medium device provided in the embodiments of the present application allow a user to embed an emotional expression about visual media content (namely, narration voice-over information) into a media file or a bitstream of the visual media content, so that when the user plays back the visual media content on an electronic device, the user can view the narration voice-over information of the visual media content. The information processing method, encoder, decoder, and storage medium device provided in the embodiments of the present application are implemented as follows:
in a first aspect, an embodiment of the present application provides an information processing method, where the method includes: analyzing the code stream to obtain at least one narration voice-over information of the visual media content and corresponding presentation time; presenting the at least one narration voice-over information in accordance with the presentation time while playing the visual media content.
In a second aspect, an embodiment of the present application provides an information processing method, where the method includes: determining at least one narration voice-over information to be added and corresponding presentation time; under the condition that the visual media content corresponding to the at least one narration voice-over information is not changed, the at least one narration voice-over information and the corresponding presentation time are embedded into a media file or a bit stream of the visual media content in a preset mode, and a new media file or a new bit stream is obtained; and coding the new media file or the new bit stream to obtain a code stream.
In a third aspect, an embodiment of the present application provides a decoder, where the decoder includes a decoding module and a playing module; the decoding module is used for analyzing the code stream to obtain at least one narration voice-over information of the visual media content and corresponding presentation time; and the playing module is used for presenting the at least one narration voice-over information according to the presentation time when the visual media content is played.
In a fourth aspect, an embodiment of the present application provides a decoder, which includes a memory and a processor; wherein the memory is to store a computer program operable on the processor; the processor is configured to execute the information processing method of the decoding end according to the embodiment of the present application when the computer program is run.
In a fifth aspect, an embodiment of the present application provides an encoder, which includes a determining module, an embedding module, and an encoding module; the determining module is used for determining at least one narration voice-over information to be added and corresponding presenting time; the embedding module is used for embedding the at least one narration voice-over information and the corresponding presentation time into a media file or a bit stream of the visual media content in a preset mode under the condition that the visual media content corresponding to the at least one narration voice-over information is not changed, so as to obtain a new media file or a new bit stream; and the coding module is used for coding the new media file or the new bit stream to obtain a code stream.
In a sixth aspect, embodiments of the present application provide an encoder, which includes a memory and a processor; wherein the memory is to store a computer program operable on the processor; the processor is configured to execute the information processing method of the encoding end according to the embodiment of the application when the computer program is run.
In a seventh aspect, an embodiment of the present application provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program, when executed by a processor, implements the method according to the embodiment of the present application.
In an eighth aspect, an embodiment of the present application provides an electronic device, where the electronic device includes at least an encoder described in the embodiments of the present application and/or a decoder described in the embodiments of the present application.
In the information processing method provided by the embodiments of the present application, a user is allowed to embed an emotional expression about visual media content (namely, narration voice-over information) into a media file or a bitstream of the visual media content, so that when the user plays back the visual media content on an electronic device, the associated narration voice-over information can be viewed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram illustrating an implementation flow of an information processing method at an encoding end according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an implementation flow of an information processing method at a decoding end according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a general data structure and the structure of a file based on the International Organization for Standardization Base Media File Format (ISO-BMFF) according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the structure of an ISO-BMFF file according to an embodiment of the present application;
FIG. 5A is a diagram illustrating the structure of an ISO-BMFF file according to an embodiment of the present application;
FIG. 5B is a schematic view showing the structure of a meta box 502 according to the embodiment of the present application;
FIG. 6 is a schematic structural diagram of a decoder according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an encoder according to an embodiment of the present application;
FIG. 8 is a block diagram of a hardware implementation of an encoder according to an embodiment of the present application;
fig. 9 is a hardware block diagram of a decoder according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the following detailed descriptions of specific technical solutions of the present application are made with reference to the accompanying drawings in the embodiments of the present application. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" are used herein only to distinguish similar or different objects and do not denote a particular ordering of the objects. It should be understood that "first/second/third" may be interchanged in a particular order or sequence where appropriate, so that the embodiments of the present application described herein can be practiced in an order other than that shown or described herein.
The embodiment of the application provides an information processing method, which can be applied to an encoding end and an electronic device corresponding to the encoding end. The electronic device may be any electronic device with encoding capability, such as a mobile phone, a personal computer, a notebook computer, a television, or a server. The functions implemented by the information processing method may be implemented by a processor in the electronic device calling program code, which may of course be stored in a computer storage medium. It can be seen that the electronic device includes at least a processor and a storage medium.
Fig. 1 is a schematic view of an implementation flow of an information processing method according to an embodiment of the present application, and as shown in fig. 1, the method may include the following steps 101 to 103:
step 101, determining at least one narration voice-over information to be added and corresponding presentation time;
step 102, under the condition that the visual media content corresponding to the at least one narration voice-over information is not changed, embedding the at least one narration voice-over information and the corresponding presentation time into a media file or a bit stream of the visual media content in a preset mode to obtain a new media file or a new bit stream;
and 103, encoding the new media file or the new bit stream to obtain a code stream.
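For illustration only, the following Python sketch mirrors steps 101 to 103: it appends a narration voice-over data set to an existing media file without touching the original bytes. The field layout, the start code value, and the helper names are assumptions made for this sketch and are not the normative syntax (the normative syntax is given in Table 2 below).

```python
import struct
from datetime import datetime

def embed_narration(media_bytes: bytes, narration_text: str, start_frame: int,
                    duration: int, author: str) -> bytes:
    """Append a narration voice-over data set after the original media content.

    The original media bytes are left untouched (step 102); only new data is
    appended, so the original visual media content is not changed. The field
    layout below is a simplified assumption, not the normative syntax.
    """
    text = narration_text.encode("utf-8")            # text-type narration payload
    name = author.encode("utf-8")                    # narrator name (registration info)
    payload = struct.pack(">I", start_frame)         # presentation start frame
    payload += struct.pack(">I", duration)           # number of frames it lasts
    payload += struct.pack(">B", len(name)) + name   # narrator name length + name
    payload += datetime.now().strftime("%Y%m%d").encode("ascii")  # creation date
    payload += struct.pack(">B", 0)                  # 0 = text-type narration
    payload += struct.pack(">I", len(text)) + text   # narration text length + text
    start_code = b"\x00\x00\x01\xB8"                 # illustrative start code only
    return media_bytes + start_code + payload        # step 103: the new bitstream

# Usage (hypothetical file name):
# new_stream = embed_narration(open("clip.mp4", "rb").read(),
#                              "Our first day at the beach", 120, 90, "Alice")
```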
In some embodiments, the visual media content is a video or a set of images; accordingly, when the narrative voice-over information is original text, text converted from audio, or combined audio and text, the presentation time of the text is represented in the form of a marked start frame and at least one continuation frame of the visual media content.
In some embodiments, the visual media content is a video clip or a set of images; accordingly, where the narrative voice-over information is original audio, text-converted audio, or combined audio and text, the presentation time of the audio is represented in the form of a marked start frame and duration of the visual media content.
In some embodiments, the number of persistent frames in the duration of the text-converted audio is less than the number of persistent frames for the corresponding text.
In some embodiments, the method further comprises: and embedding the registration information of the narration voice-over information into a media file or a bit stream of the visual media content in a preset mode.
In some embodiments, the registration information for narrative voice-over information includes at least one of: narrator name, creation date and time, ownership information of the visual media content.
In some embodiments, embedding the at least one narrative voice-over information in a preset manner into a media file or bitstream of the visual media content comprises: storing the at least one narration voice-over information in a preset manner at a starting position of the visual media content.
In some embodiments, the determining at least one narration voice-over information to be added comprises: creating narration voice-over information for at least one user of the visual media content to obtain the at least one narration voice-over information.
In some embodiments, the type of narration voice-over information comprises at least one of: a text type and an audio type; the type of visual media content includes at least one of: a video, an image, and an image group, where the image group comprises at least two images.
In some embodiments, when the type of the current narrative bystander information is a text type, the method further comprises: creating a text data segment; accordingly, embedding the at least one narrative voice-over information into a media file or bitstream of the visual media content in a predetermined manner, comprises: and embedding the current narrative voice-over information into a media file or a bit stream of the visual media content in a text data segment mode.
In some embodiments, when the type of the current narrative voice-over information is an audio type, the method further comprises: creating an audio clip; correspondingly, embedding the at least one narrative voice-over information into a media file or bit stream of the visual media content in a preset manner comprises: embedding the current narrative voice-over information in a media file or bitstream of the visual media content in an audio clip.
In some embodiments, when the type of the current narrative bystander information is a text type, the method further comprises: converting the current narration voice-over information into narration voice-over information corresponding to the audio type, and creating an audio clip; accordingly, embedding the at least one narrative voice-over information into a media file or bitstream of the visual media content in a predetermined manner, comprises: embedding the current narrative voice-over information into a media file or bitstream of the visual media content in an audio clip.
In some embodiments, when the type of the current narrative voice-over information is an audio type, the method further comprises: converting the current narration voice-over information into narration voice-over information corresponding to the text type, and creating a text data segment; correspondingly, embedding the at least one narrative voice-over information into a media file or bit stream of the visual media content in a preset manner comprises: embedding the current narrative voice-over information into a media file or a bit stream of the visual media content in a text data segment mode.
In some embodiments, the method further comprises: when the type of the visual media content is an image or an image group, determining the type of the at least one narration voice-over information as a text type and/or an audio type; when the type of the visual media content is a video, determining that the type of the at least one narrative voice-over information is a text type.
In some embodiments, the method further comprises: if the type of the narration voice-over information comprises a text type and an audio type, the narration voice-over information corresponding to the audio type is stored behind the narration voice-over information corresponding to the text type.
In some embodiments, the method further comprises: determining new narration voice-over information to be added; and storing the new narration voice-over information after the existing narration voice-over information.
In some embodiments, the media file or bitstream conforms to a preset data structure; wherein the preset data structure at least comprises one of the following items: a generic data structure and an ISO-BMFF data structure; accordingly, said embedding said at least one narrative voice-over information and corresponding presentation time in a media file or bitstream of said visual media content in a preset manner comprises: embedding the at least one narrative voice-over information and corresponding presentation time in a media file or bitstream of the visual media content in the form of the preset data structure.
In some embodiments, the ISO-BMFF data structure includes at least a narration voice-over metadata box, the narration voice-over metadata box including a metadata processing box and a narration voice-over application box; wherein the metadata processing box comprises metadata of the current narration voice-over information; the narration voice-over application box includes at least one of the following narration information: the start position of the current narration voice-over information, the length of the current narration voice-over information, and the total amount of narration voice-over information.
In some embodiments, the narration voice-over application box comprises a narration voice-over description box, and the method further comprises: decoding, by the narration voice-over description box, at least one of the following narration information: a text encoding standard, a narrator name, a creation date, a creation time, an ownership flag of the associated visual content, the type of the narration voice-over information, the encoding standard of the narration voice-over information, and the text length of the narration voice-over information.
In some embodiments, the method further comprises: if the visual media content does not have a narration voice-over metadata box at the file level, acquiring the narration voice-over metadata box, and decoding the narration voice-over metadata box to obtain the at least one narration voice-over information; and if the visual media content has a narration voice-over metadata box at the file level, acquiring the narration voice-over metadata box from the meco container box, and decoding the narration voice-over metadata box to obtain the at least one narration voice-over information.
In some embodiments, the text data segment is encoded using a predetermined text encoding standard, the predetermined text encoding standard including at least one of: UTF-8, UTF-16, GB2312-80, GBK and Big 5. Of course, the predetermined text encoding standard may be any other predefined standard.
In some embodiments, the audio segment is encoded using a predetermined audio coding standard, the predetermined audio coding standard including at least one of: AVS audio, MP3, AAC and WAV. Of course, the preset audio coding standard may also be any other predefined standard.
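For the text case, the choice of coding standard only determines how the bytes of the text data segment are produced. The Python sketch below illustrates this; the id-to-standard mapping shown here is an assumed example rather than the normative code values (example code values are listed later in Table 3).

```python
# Assumed id-to-standard mapping for illustration; the normative code values are
# the ones listed later in Table 3.
TEXT_ENCODINGS = {0: "utf-8", 1: "utf-16", 2: "gb2312", 3: "gbk", 4: "big5"}

def encode_text_segment(text: str, text_encoding_standard_id: int) -> bytes:
    """Produce the bytes of a text data segment using the selected standard."""
    return text.encode(TEXT_ENCODINGS[text_encoding_standard_id])

segment = encode_text_segment("What a great day!", text_encoding_standard_id=0)  # UTF-8
```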
The embodiment of the present application further provides an information processing method, where the method may be applied to a decoding end and an electronic device corresponding to the decoding end, where the electronic device may be any electronic device with decoding capability and playing capability, and the electronic device may be a mobile phone, a personal computer, a notebook computer, a television, a server, and the like. The functions implemented by the information processing method may be implemented by a processor in the electronic device calling program code, which may of course be stored in a computer storage medium. It can be seen that the electronic device includes at least a processor and a storage medium.
Fig. 2 is a schematic flow chart of an implementation of an information processing method according to an embodiment of the present application, and as shown in fig. 2, the method may include the following steps 201 to 202:
step 201, analyzing the code stream to obtain at least one narration voice-over information of the visual media content and the corresponding presentation time.
It should be noted that, at the encoding end, when narration voice-over information is added to the media file or bitstream of the visual media content, the corresponding designated presentation time is also added. In this way, when the decoding end plays the narration voice-over information, the narration voice-over information is presented only within the corresponding presentation time, and is not presented at times other than the presentation time. The presentation times of different narration voice-over information may be the same or different, which is not limited in this application. Two or more pieces of narration voice-over information can be presented at the same time, and different narration voice-over information can also be presented in sequence.
The type of visual media content may vary; for example, the visual media content may be one image, a group of images (i.e., including two or more images), or a video. For different types of visual media content, the presentation formats of the presentation times corresponding to different narration voice-over formats may be the same or different. For example, a summary of narration voice-over formats and possible presentation formats is given in Table 1 below.
Table 1: Narration voice-over format vs. presentation format
[Table 1 is reproduced only as an image (Figure PCTCN2021075622-APPB-000001) in the source text and is not shown here.]
Note 1: For a text narration voice-over or a narration voice-over converted to text, the entire narration voice-over should be displayed for each frame or image within the window marked by the "start" and "duration" of a video or a group of images.
Note 2: For an audio narration voice-over, the narration voice-over should begin playing from the "start" frame marked for the video or group of pictures and last for the entire time period marked by the "duration" of the video, which is equal to the length of the audio signal being played. However, an audio narration voice-over whose playback time exceeds the playback duration of the video is allowed. If this happens, the playback device can freeze the video playback at the end of the video playback time or continue the video playback in a looping mode.
Note 3: A narration voice-over of synthesized audio should finish within the duration. If it does not, it should be treated as if there were more than one narration voice-over at a particular time.
Note 4: As with a regular text narration voice-over, converted text applies to all frames within the duration.
Note 5: For a set of images, a complete audio (original or synthesized) narration voice-over is associated with each frame within the duration. The audio narration voice-over is played independently of the presentation of the images. The player may decide whether to play the same synthesized audio for each frame or only for the beginning frame. For example, if the player shows the images as still images, the player may repeat the audio narration voice-over for each frame. On the other hand, if the images are played back at a certain frame rate, the synthesized audio can be played back in a non-synchronized manner. If the playback exceeds the duration, the player may freeze the video playback or continue the playback (e.g., in a loop mode).
Step 202, when the visual media content is played, presenting the at least one narration voice-over information according to the presentation time.
In some embodiments, parsing the codestream to obtain at least one narrative voice-over information of the visual media content and a corresponding presentation time includes: analyzing the code stream to obtain a media file or a bit stream sent by an encoder; visual media content, at least one narrative voice-over information and a corresponding presentation time are obtained from the media file or the bitstream.
In some embodiments, the visual media content is a video or a set of images; accordingly, when the narrative voice-over information is original text, text converted from audio, or combined audio and text, the presentation time of the text is represented in the form of a marked start frame and at least one continuous frame of the visual media content; presenting the at least one narration voice-over information according to the presentation time while playing the visual media content, comprising: and continuously presenting the corresponding text from the beginning of playing the starting frame until the playing of the at least one continuous frame is finished.
In some embodiments, the visual media content is a video clip or a set of images; accordingly, when the narrative voice-over information is original audio, text-converted audio, or combined audio and text, the presentation time of the audio is represented in the form of a marked start frame and duration of the visual media content; presenting the at least one narration voice-over information according to the presentation time while playing the visual media content, comprising: and starting to play the audio from the beginning frame until the images or video frames in the duration are played.
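As a minimal illustration of the timing rule in the two embodiments above, the following Python sketch selects which narration voice-over entries are active at a given frame: text is presented from the marked start frame through the continuation frames, and audio starts at the start frame and lasts for the marked duration. The tuple layout and names are illustrative assumptions, not a normative interface.

```python
def narrations_for_frame(frame_number, narration_entries):
    """Return the narration voice-over payloads to present at a given frame.

    Each entry is assumed to be (start_frame, duration_in_frames, payload),
    mirroring the presentation-time rule described above; not a normative API.
    """
    return [payload for start, duration, payload in narration_entries
            if start <= frame_number < start + duration]

entries = [(120, 90, "Our first day at the beach"),   # text shown for frames 120..209
           (300, 48, "narration_clip_0.aac")]          # audio started at frame 300
print(narrations_for_frame(150, entries))              # -> ['Our first day at the beach']
```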
In some embodiments, the number of persistent frames in the duration of the text-converted audio is less than the number of persistent frames for the corresponding text.
In some embodiments, the visual media content is an image, a group of images, or a video, and the narrative voice-over information is original audio, text-converted audio, or combined audio and text; correspondingly, when the visual media content is played, the presenting the at least one narrative voice-over information according to the presentation time comprises: repeatedly playing the audio while playing the image or statically displaying the group of images; and when the group of images or the video is played at a certain frame rate, playing the audio in a non-synchronous mode.
In some embodiments, said presenting said at least one narrative voice-over information while playing said visual media content, in accordance with said presentation time, comprises: and when the narration voice-over switch is in an open state, playing the visual media content, and presenting the at least one narration voice-over information according to the presentation time.
In some embodiments, the method further comprises: playing the visual media content without presenting the at least one narration voice-over information while the narration voice-over switch is in an off state.
In other embodiments, the method further comprises: when the narration voice-over switch is in an off state, playing the visual media content, disabling narration voice-over information corresponding to some attributes among the at least one narration voice-over information, and presenting the remaining narration voice-over information.
For example, among pieces of narration voice-over information that express the same or similar meanings, it is sufficient to present only one of them. As another example, only the narration voice-over information whose narrator has ownership of the visual media content is presented.
In some embodiments, said presenting said narration voice-over information comprises: when the narration voice-over information is original text or text converted from audio, superimposing the text on the playback picture of the visual media content for display, or displaying the text in another window independent of the playback window of the visual media content, or converting the original text into audio for playing.
In some embodiments, the converting the original text into audio for playing includes: in the case that the visual media content has audio, playing the audio belonging to the narration voice-over mixed with the audio belonging to the visual media content, or stopping playing the audio belonging to the visual media content and playing the audio belonging to the narration voice-over alone.
In some embodiments, said presenting said narration voice-over information comprises: in the case that the narration voice-over information is original audio or audio converted from original text and the visual media content has audio, playing the audio belonging to the narration voice-over mixed with the audio belonging to the visual media content, or stopping playing the audio belonging to the visual media content and playing the audio belonging to the narration voice-over alone, or converting the original audio into text and then displaying it.
In some embodiments, said presenting said narrative voice-over information comprises: and in the case that the narrative voice-over information is combined text and audio, presenting the text and the audio simultaneously or separately.
In some embodiments, said presenting said narrative voice-over information comprises: when the visual media content is not played and the presentation time of the next narration voice-over information is reached, providing a first option unit for a user to select the playing state of the narration voice-over information; when the visual media content is played and the narration voice-over information is not played, providing a second option unit for a user to select the playing state of the narration voice-over information; and presenting the narration voice-over information according to the selected option.
In some embodiments, said presenting narration voice-over information in accordance with the selected option comprises: when a first option of the first option unit is selected, freezing the playback of the visual media content until the narration voice-over information has been played, and then continuing to play the next narration voice-over information and the visual media content; when a second option of the first option unit is selected, ending playback of the narration voice-over information and starting to play the next narration voice-over information; and when a third option of the second option unit is selected, playing the visual media content in a loop.
Note that, the freezing of the playing of the visual media content means stopping at the current frame of the visual media content, rather than disappearing from the display interface.
In some embodiments, the looping playing the visual media content includes: and circularly playing the whole content of the visual media content, or circularly playing the marked frame image in the visual media content.
In some embodiments, the method further comprises: acquiring the registration information of the at least one narration voice-over information from the media file or the bit stream; and when the narration voice-over information is presented, presenting corresponding registration information.
In some embodiments, said presenting corresponding registration information when presenting said narrative voice-over information comprises: when the narration voice-over information is presented, displaying a trigger key of a pull-down menu; when the trigger key receives a trigger operation, displaying an option of whether to play registration information; and when the option for indicating the playing of the registration information receives an operation, presenting the registration information.
In some embodiments, the registration information of narrative voice-over information includes at least one of: narrator name, creation date and time, ownership information of the visual media content.
In some embodiments, said presenting said at least one narration voice-over information as said visual media content is played, in accordance with said presentation time, comprises: playing the visual media content in the background, and presenting the at least one narration voice-over information in the foreground according to the presentation time.
In some embodiments, the method further comprises: receiving a new code stream, and obtaining new narration voice-over information of the visual media content from the new code stream; and presenting the new narration voice-over information.
In some embodiments, said presenting said new narrative voice-over information comprises: displaying an option of whether to play the new narrative voice-over information; and when the option for indicating the playing of the new narration voice-over information receives operation, presenting the new narration voice-over information.
In some embodiments, the parsing the codestream to obtain a media file or a bitstream sent by an encoder includes: analyzing the code stream to obtain a media file or a bit stream which accords with a preset data structure; wherein the preset data structure at least comprises one of the following items: a generic data structure and an ISO-BMFF data structure.
In some embodiments, the ISO-BMFF data structure includes at least a narrative voice-over metadata box, the narrative voice-over metadata box including a narrative voice-over metadata processing box and a narrative voice-over application box;
correspondingly, the method further comprises: processing metadata of the current narration voice-over information by the narration voice-over metadata processing box; and describing, by the narration voice-over application box, at least one of the following narration information: the start position of the current narration voice-over information, the data length of the current narration voice-over information, and the total number of pieces of narration voice-over information.
In some embodiments, the narration voice-over application box comprises a narration voice-over description box, and the method further comprises: describing, by the narration voice-over description box, at least one of the following narration information: a text encoding standard, a narrator name, a creation date, a creation time, an ownership flag of the associated visual content, the type of the narration voice-over information, the encoding standard of the narration voice-over information, and the text length of the narration voice-over information.
In some embodiments, the method further comprises: if the narration voice-over metadata box does not exist in the visual media content at the file level, creating the narration voice-over metadata box, and describing the at least one narration voice-over information through the narration voice-over metadata box; if the narration voice-over metadata box exists in the visual media content at the file level, creating the narration voice-over metadata box in the meco container box, and describing the at least one narration voice-over information through the narration voice-over metadata box.
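For illustration only, the sketch below walks the top-level boxes of an ISO-BMFF byte buffer and returns a candidate narration metadata box, checking the file level first and then the 'meco' container box. It is a simplified stand-in for the rule described above (64-bit box sizes and 'moov'-level placement are not handled), not a normative parser.

```python
import struct

def top_level_boxes(data: bytes):
    """Yield (type, payload) for each top-level ISO-BMFF box in a byte buffer."""
    pos = 0
    while pos + 8 <= len(data):
        size, = struct.unpack(">I", data[pos:pos + 4])
        box_type = data[pos + 4:pos + 8]
        if size < 8:                 # 64-bit sizes / to-end-of-file boxes not handled
            break
        yield box_type, data[pos + 8:pos + size]
        pos += size

def find_narration_meta(data: bytes):
    """Return the payload of a candidate narration metadata box, if any.

    Checks for a file-level 'meta' box first, then looks inside the 'meco'
    container box; a simplified stand-in for the rule described above.
    """
    boxes = dict(top_level_boxes(data))
    if b"meta" in boxes:
        return boxes[b"meta"]
    if b"meco" in boxes:
        return dict(top_level_boxes(boxes[b"meco"])).get(b"meta")
    return None
```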
In some embodiments, obtaining narrative bystander information from said media file or said bitstream comprises: under the condition that the narration voice-over information is of a text type, decoding the narration voice-over information from the media file or the bit stream according to a preset text decoding standard; wherein the preset text decoding standard is one of the following: UTF-8, UTF-16, GB2312-80, GBK and Big 5. Of course, the predetermined text decoding standard may be any other predefined standard.
In some embodiments, obtaining narration bystander information from the media file or the bitstream comprises: under the condition that the narration voice-over information is of an audio type, decoding from the media file or the bit stream according to a preset audio decoding standard to obtain the narration voice-over information; wherein the preset audio decoding standard is one of the following: AVS audio, MP3, AAC, and WAV. Of course, the preset audio decoding standard may also be any other predefined standard.
Today, images, videos, and short messages have become a way to convey information and emotions. Images or videos can capture visual information, but relying on them alone may not tell a complete story. People provide background information by sending complementary words along with visual media content, or use such words to express and reflect their own feelings about the subject of the visual media content. Technically, on today's computing and communication platforms, digital media and these commentary words are handled as separate entities. However, once the image or video clip is saved in the electronic device in a particular media format, all text about the emotion and background of the visual media content is lost. As a result, these images and videos quickly become dull and lose their vitality within a relatively short time.
Based on this, an exemplary application of the embodiment of the present application in a practical application scenario will be described below.
To enhance the viewing experience of digital media, a technique is described in the embodiments of the present application for adding narration voice-over information to a digital video, an image, or a group of images. Narration voice-over information can be text, audio, or both, and is written into a digital media file or bitstream along with the original visual media content (including audio data) and can be displayed or played back with the digital media. With this technique, users such as photographers or viewers can record their emotions while capturing or watching a video or image. The technique supports narration voice-overs from multiple users and registers each narration voice-over entry by user name (i.e., narrator), creation date, and creation time. The narration voice-over information is stored in the digital media file or bitstream, together with the associated registration information, as a data set without changing the specific data structure of the original visual and audio content.
In the embodiments of the present application, a technique is described that can add narration voice-over information to visual media content such as digital images or videos and save them together in a format that is convenient for communication and storage. The narration voice-over information may be in text or audio format, or both, and may be displayed or played back with the visual media content; alternatively, the narration voice-over may be switched off and not displayed, so that only the original visual media content is presented. This technology allows users to express, share, and exchange their emotions about a visual subject by embedding narration voice-over information in digital media files or bitstreams, thereby enhancing the user's viewing experience of digital media and encouraging engagement from generations of viewers.
To date, no technique has been available for recording, sharing, and exchanging narration voice-over information conveniently within digital media files. Today, when people add narration voice-over information to a video clip, they have to use traditional video-editing methods, either embedding text in the original pixel layer of the video or mixing sound into the audio track. This process usually requires special video-editing software, which is inconvenient for the user. Because adding new narration voice-over content changes the original video content each time, it is impossible for multiple users to share and exchange narration voice-over content in this way. For digital images, metadata such as EXIF and IPTC are used only to describe and provide technical and administrative information about the image, such as technical characteristics of the capture process, or the ownership and rights of the image. In contrast, the technique described in the embodiments of the present application is specifically directed to recording, sharing, and exchanging emotional comments between creators and viewers of visual media content. In particular, the technique allows a user to record narration voice-over information without altering the original visual media content. With this technique, a user may choose to view or listen to the narration voice-over information while the visual media content is played back.
The embodiment of the application describes a narration voice-over system consisting of an encoder and a decoder. The entire system may be implemented by an application or software running on an electronic device that can capture and display digital images, or record and play digital videos. For example, the device may be a smartphone, a tablet computer, a computer or notebook computer, or a television. At the encoder side, the device obtains narration voice-over information from a narrator and embeds the narration voice-over information as a data segment of a specific format into the media file or bitstream of the image or video. The encoding process does not alter the original visual media content and its associated data. At the decoder side, when the image is displayed or the video is played back, the device extracts the narration voice-over information from the media file or bitstream and then presents the narration voice-over information to the audience.
In this system, narration voice-over information may be in text form, in audio form, or both. Text narration voice-overs are represented by text data, while spoken narration voice-overs are saved as audio segments. The system supports text and audio coding of various standards, such as UTF-8, UTF-16, GB2312-80, GBK, Big5, AAC, MP3, AVS-audio, and WAV. For video, in addition to the narration voice-over information itself, the time at which the narration voice-over was added and should be presented by the player, expressed in terms of the video playback time, is also written into the narration voice-over information set. This time information may be represented by the video frame number in presentation order relative to the start of the video. For a group of images, narration voice-overs may be provided for a plurality of images; in this case, the frame numbers of the corresponding pictures are also recorded in the narration voice-over information set. In addition, the narration voice-over information recorded in the data field also includes the name of the narration voice-over creator, the creation date and time, and/or ownership of the visual media content, etc. The system supports multiple narrators, who may be the visual media content creator (e.g., a photographer), a viewer, or an organization that owns the content and wants to add comments. Thus, users can generate and add narration voice-overs during visual media content collection, editing, and presentation. Each narration voice-over entry is recorded in a specific data structure together with the creator's name. The ownership flag is used to indicate whether the user owns the video. The system also allows a user to add a new narration voice-over after an existing narration voice-over associated with the same narration voice-over time; if this happens, the new narration voice-over is appended after the existing, earlier narration voice-over. The technique also supports a narration voice-over having both a text portion and an audio portion; in this case, the audio data is saved after the text portion.
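As an illustrative data model for one narration voice-over entry and its registration information, the following Python sketch may help; the field names are assumptions chosen to mirror the element semantics described later (narration_author_name, narration_creation_date, visual_content_ownership_flag, and so on) and are not part of the normative syntax.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NarrationEntry:
    """One narration voice-over entry with its registration information.

    Field names are illustrative; the corresponding element names appear in the
    semantics section below (narration_author_name, narration_creation_date,
    visual_content_ownership_flag, narration_data_type, ...).
    """
    author_name: str                    # narrator: a person or a group organization
    creation_date: str                  # e.g. "20200515"
    creation_time: str                  # e.g. "14:05:30"
    owns_visual_content: bool = False   # ownership flag for the visual media content
    data_type: int = 0                  # 0 = text, 1 = audio, 2 = text + audio
    text_data: str = ""                 # text narration (original or converted speech)
    audio_data: bytes = b""             # audio narration, stored after the text part

# A new narration for the same narration time is appended after existing entries.
entries = [NarrationEntry("Alice", datetime.now().strftime("%Y%m%d"),
                          datetime.now().strftime("%H:%M:%S"),
                          owns_visual_content=False, data_type=0,
                          text_data="What a great trip!")]
```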
The basic function of the decoder/player in this system is to first parse and decode the narration voice-over information set from the visual media file or bitstream, and then present the narration voice-over information at the time specified by the narration voice-over information set. The player may turn the narration voice-over on or off with a switch. When the narration voice-over is turned off, the original visual media content is played without any modification. When the narration voice-over is turned on, the exact narration voice-over presentation format is a decoder/player software choice. For example, a text narration voice-over may be displayed as a subtitle superimposed on the visual media content, or in a separate text area, or even played as an audio signal after being synthesized by the software decoder/player. If the media content is video, this audio signal may be mixed with the original audio track, or it may be played alone by turning off the original audio track. A speech narration voice-over, on the other hand, can be played as an audio signal separate from the image or video and can be mixed with the original audio track of the video. A speech narration voice-over can also be displayed as text after conversion by the software player. When a narration voice-over has both a text portion and an audio portion in the data set, the two portions should be presented together at the same time or separately. When a narration voice-over is too long to be presented before a new narration voice-over appears at the next narration voice-over time or before the end of the video, there are a number of options available to the decoder implementation to give the user maximum flexibility and an enhanced viewing experience. For example, the decoder may freeze the video playback at the next narration voice-over time while finishing all playback of the narration voice-over corresponding to the current narration voice-over time, and then continue the video playback and the presentation of the next narration voice-over. If a narration voice-over is too long, the decoder may choose to skip the rest of the narration voice-over to keep the video playback smooth without freezing. The decoder may also play the original video in a loop mode while playing the narration voice-over.
A narration voice-over function may be built into the player that will display a narration voice-over message when there is a narration voice-over associated with the visual media content being played, rather than displaying the actual narration voice-over. The viewer may then decide whether to turn on the narration voice-over function and view or listen to the actual narration voice-over. The player may be built with a drop-down menu with options for displaying additional information (e.g., narrator name and narration voice-over creation date) as desired by the viewer. In narrative voice playback mode, the player can play narrative voice in the foreground while playing the original media (i.e., visual media content) in the background. If the original media is a video, the viewer may freeze the video when there is more than one narrative bystander at a particular time during the video playback, or simply repeat a segment of the video marked with a "duration," or cycle through the entire video as background while reviewing the narrative bystander. When a new narration voice-over appears during the playing of an existing narration voice-over, the player may have several options. For example, the player may display a new narrative voice-over message and have the viewer control the presentation of the new narrative voice-over. In this case, the viewer may again choose to switch to narrative voice-over mode and have the media play as background or freeze the media.
General data structures for narrating digital media:
the described techniques may be applied to different digital media formats. The exact implementation depends on the syntactic structure of the visual media content. Table 2 below shows a typical data set syntax and key data components that can implement the above-described functionality.
Table 2: General syntax of the narration voice-over information set
[Table 2 is reproduced only as images (Figure PCTCN2021075622-APPB-000002 and Figure PCTCN2021075622-APPB-000003) in the source text and is not shown here.]
* Note: ANL is equal to narration_author_name_length in bytes.
The conventions used herein are as follows:
f(n): a bit string representing a fixed pattern, using n bits written from left to right, with the left bit first;
b(8): one byte, which may have any bit string pattern (8 bits);
u(n): an unsigned integer using n bits.
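As a small illustration of what these descriptors mean when reading the data set from a byte buffer, the following Python sketch implements an MSB-first bit reader; the start code value used in the usage lines is only an example, not a normative value.

```python
class BitReader:
    """Minimal MSB-first bit reader matching the f(n)/b(8)/u(n) descriptors."""

    def __init__(self, data: bytes):
        self.data, self.pos = data, 0            # pos counts bits from the start

    def u(self, n: int) -> int:                  # u(n): unsigned integer, n bits
        value = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1   # left (most significant) bit first
            value = (value << 1) | bit
            self.pos += 1
        return value

    def f(self, n: int) -> int:                  # f(n): fixed-pattern bit string
        return self.u(n)                         # read the same way, then compared

    def b8(self) -> int:                         # b(8): one byte, any bit pattern
        return self.u(8)

r = BitReader(b"\x00\x00\x01\xB8\x02")
assert r.f(32) == 0x000001B8                     # e.g. a start code (example value)
assert r.b8() == 2                               # e.g. number_of_narration_point
```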
The semantics of the syntax elements are as follows:
the robust _ data _ start _ code is used to specify a four-byte bit string having a fixed pattern that describes the start position of the voice-over information in the bitstream. It usually consists of a start code prefix of three bytes with a unique sequence followed by a byte dedicated to specifying narrative voice-over information;
number _ of _ periodic _ point is used to specify the total number of locations or frames in a video or a group of pictures, which are designated as periodic _ entry _ time for video or periodic _ starting _ frame _ number for a group of pictures. This value should be increased by 1 each time a new narrow _ entry _ time or a new narrow _ starting _ frame _ number is added. If the original media file has only one picture, number _ of _ robust _ point is set to 1.
The narrow _ entry _ time is used to specify the time at which the narrative voice-over is added and should be presented. This value is represented in presentation order by the associated frame number in the video. The first frame in the video is set to have its frame number equal to zero. This syntax element is present only when the original media is video. If the duration of the narrative side-word is greater than 1, then the frame number should be considered the starting frame number.
The narrow _ duration is used to specify the number of frames that the narrative voice-over will last when the original media is a video or a group of pictures. If the narrative voice-over is an audio clip, the robust _ duration is equal to the playback length of the audio signal. When the text narrative voice-over is synthesized and played as an audio signal, the playing of the audio signal should be completed within the narrow _ duration. When an audio narrative voice-over is converted to a text narrative voice-over, the narrative voice-over should be rendered for each frame in its entirety for the entire duration of the audio play time.
The tracking _ starting _ frame _ number is used to specify the frame number of the picture that has been described in the group of pictures. This syntax element is present only when the original media is a set of pictures. If the duration of the narrative side-word is greater than 1, then the frame number should be considered the starting frame number.
number _ of _ entries is used to specify the total number of entries of the narrative side information. When adding a new narration side information, the new narration side information should be added to the bottom of the list, following all previous narration side information, and the number _ of _ entries value is increased by one.
Text encoding standard id is used to specify the Text encoding standards (Text encoding standards) for the name of a narrator and the Text narrative information in the narrative side header data section. Table 2 shows an example of code values applicable to the universal text coding standard provided in the embodiment of the present application. Here, the first column is a general Text coding standard (Text coding standards), and the second column is a code value example of the Text coding standard (Text _ encoding _ standard _ id value). In other words, in the embodiment of the present application, the text data segment is encoded by using the following preset text encoding standard, where the preset text encoding standard at least includes one of the following: UTF-8, UTF-16, GB2312-80, GBK and Big 5. Of course, the predetermined text encoding standard may be any other predefined standard. Any standard encoding of the text encoding standard may be used herein. .
Table 3: Example code values for text encoding standards
[Table 3 is reproduced only as an image (Figure PCTCN2021075622-APPB-000004) in the source text and is not shown here.]
The negative _ author _ name _ length specifies the length of the negative _ author _ name in bytes.
The name _ author _ name is used to specify the name of a narrator, where the narrator can be an individual or a group organization.
The conditional _ creation _ date is used to specify the date on which the narrative side-information is added. Any standard expression for dates may be used herein. For example, a date may be represented in a numeric format using 4 digits to represent a year, then 2 digits to represent a month, then 2 digits to represent a day. For example, 2019, 21 and 2019, 20190921 and 2019, 10 and 30, 20191030, respectively. In this expression, one byte is used for every two digits.
The narrow _ creation _ time is used to specify the time to add the narration side-information. Any standard expression of time may be used herein. For example, time may be expressed as: mm is ss.TZ, where each digit of hh (hour), mm (minute), ss (second) uses one byte, and TZ (time zone) uses eight bits of encoding.
visual_content_ownership_flag equal to 1 indicates that the narrator owns the visual media content. visual_content_ownership_flag equal to 0 indicates that the narrator of the narration voice-over entry does not own the visual media content.
The narration_data_type is used to specify the type (i.e., data format) of the narration voice-over information, where 0 indicates that the narration voice-over is in text format, 1 indicates audio format, and 2 indicates combined text and audio format.
The text_narration_data_length, in bytes, is used to specify the length of the text narration voice-over information.
The narration_audio_codec_id is used to specify the audio codec used for encoding the audio narration voice-over. Table 4 shows examples of code values for common audio codecs provided in the embodiments of the present application, where the first column is the audio codec and the second column is the corresponding code value (narration_audio_codec_id value). In other words, in the embodiments of the present application, the audio segment may be encoded using a common audio standard that is one of the following: AVS audio, MP3, AAC and WAV.
TABLE 4 examples of codes for audio coding standards
Audio codec    narration_audio_codec_id value
MP3 0
AAC 1
AVS-audio 2
WAV 3
(Reserved for any other audio codec…)
The audio_narration_data_length is used to specify the length of the audio narration voice-over information in bytes, and has a default value of 0.
The text_narration_data is used to carry the actual narration voice-over information in text format.
The audio_narration_data is used to carry the actual audio narration voice-over information.
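Taken together, the syntax elements above describe one narration voice-over entry. The following Python sketch models such an entry as a plain data structure and derives the presentation window (first and last frame) from narration_entry_time and narration_duration. The field names mirror the reconstructed syntax element names used above and are illustrative only, not a normative layout.

    from dataclasses import dataclass

    @dataclass
    class NarrationEntry:                       # illustrative only
        narration_entry_time: int               # starting frame number (video)
        narration_duration: int                 # frames, or audio play length
        narration_data_type: int                # 0 text, 1 audio, 2 text + audio
        text_narration_data: bytes = b""
        audio_narration_data: bytes = b""

        def presentation_window(self) -> tuple[int, int]:
            # First and last frame (inclusive) over which the entry is shown.
            first = self.narration_entry_time
            last = first + max(self.narration_duration - 1, 0)
            return first, last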
Narration voice-over information structure for ISO-BMFF based media files:
ISO-BMFF is widely used in the industry as a container format for visual media content (e.g., video, still images, and groups of images). For example, the most popular video streaming and storage format today is the MP4 format, which is fully compliant with ISO-BMFF. This embodiment describes a narration voice-over data structure suitable for original visual media content packaged in the ISO-BMFF file format. The data structure is fully compliant with the metadata format in ISO-BMFF and can be embedded in a "meta" box at the file level or, at the movie level, inside the "moov" box of the ISO-BMFF file. To facilitate understanding of the narration voice-over features described in this application, the narration voice-over information is organized in three hierarchical layers, which makes it easy for software implementations to create, edit, and play narration voice-overs for existing media files.
FIG. 3 shows the overall structure of an ISO-BMFF file with the proposed narration voice-over metadata segment. A standard ISO-BMFF file has an "ftyp" box (i.e., the file type box 401 shown in fig. 4), a "moov" box, a "trak" box and an "mdat" box (i.e., the media data box 403 shown in fig. 4), where the "ftyp" box carries general information about the media file, the "moov" box contains the "trak" box holding all the meta information about the original visual media content, and the "mdat" box contains all the original visual media content. When a new narration voice-over is inserted, the narration voice-over information proposed in Table 2 is contained in the narration voice-over metadata box "meta" and its narration box (i.e., one of the narration voice-over metadata boxes 402 shown in fig. 4). The narration voice-over information here does not include the actual narration voice-over content: the actual text content (text_narration_data) or audio content (audio_narration_data) representing the narration voice-over content is saved in the narration voice-over data segment in the "mdat" box. This data segment immediately follows the original visual data segment, as shown in fig. 4. As mentioned above, this "meta" box can alternatively be placed inside the "moov" box.
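For readers less familiar with the container, the short Python sketch below walks the top-level boxes of an ISO-BMFF file (each box starts with a 4-byte big-endian size followed by a 4-character type), which is how an implementation would locate the "moov", "meta" and "mdat" boxes mentioned above. It is a simplification: only the basic 32-bit box size is handled, and the largesize and size-zero cases of the standard are omitted.

    import struct

    def iter_top_level_boxes(path: str):
        # Yields (box_type, payload_offset, payload_size) for each top-level box.
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break
                size, box_type = struct.unpack(">I4s", header)
                yield box_type.decode("ascii"), f.tell(), size - 8
                f.seek(size - 8, 1)    # skip the payload to reach the next box

    # Example use on an MP4 file (file name is illustrative):
    # for btype, offset, length in iter_top_level_boxes("clip.mp4"):
    #     print(btype, offset, length)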
An embodiment of the present application provides a structure of an ISO-BMFF file. Fig. 4 is a schematic diagram of the structure of the ISO-BMFF file according to the embodiment of the present application. As shown in fig. 4, the structure 400 includes: a file type box (i.e., "ftyp" box) 401, a narration voice-over metadata box (denoted meta) 402, and a media data box (i.e., "mdat" box) 403; wherein:
a file type box 401 for containing information indicating the ISO-BMFF file type;
a narration voice-over metadata box 402 for holding metadata of the media data (i.e., the visual media content) and metadata of the narration voice-over information of the visual media content; wherein the narration voice-over information is the user's emotional expression about the subject content of the visual media content.
It should be noted that the visual media content may be a video, a group of images, or a single image. The type of the narration voice-over information is not limited: it can be text, audio, or data combining text and audio. That is, the structure supports the user expressing emotion about the visual media content in the form of text, voice, or a mixture of text and voice.
A media data box 403 for containing the visual media content and the narrative voice-over information.
The order of arrangement of the visual media content and the narration voice-over information is not limited; for example, as shown in fig. 4, the narration voice-over information may be located after the visual media content.
In the embodiment of the present application, the media data box 403 contains not only the visual media content but also the narration voice-over information of the visual media content. The user's emotional expression (i.e., the narration voice-over information) always stays in one file with the visual media content, namely the ISO-BMFF file, so the emotion about the visual media content can be recorded as soon as the user obtains the visual media content. This structure makes adding a narration voice-over simpler for the user, with no need for an additional dedicated application, and the user can obtain the narration voice-over of the visual media content simply by downloading the ISO-BMFF file.
An embodiment of the present application further provides a structure of an ISO-BMFF file. Fig. 5A is a schematic diagram of the structure of the ISO-BMFF file according to the embodiment of the present application. As shown in fig. 5A, the structure 500 includes: a file type box 401, a narration voice-over metadata box 402, and a media data box 403; wherein the narration voice-over metadata box 402 includes:
moov box 501, for holding metadata of the visual media content; and
a meta box 502 for holding metadata of narrative voice-over information of the visual media content.
In some embodiments, the meta box 502 exists at the file level or, in movie-level form, inside the moov box 501.
In some embodiments, the structure of the meta box 502 is shown in fig. 5B. The syntax of the meta box 502 is shown in Table 5 below, and the meta box 502 is used to contain at least one of the items of information shown in Table 5:
TABLE 5 syntax of meta box 502
[Table 5 is provided as an image (PCTCN2021075622-APPB-000005) in the original document.]
The contents in table 5 are explained as follows:
(1) The metadata structure can be added at the file level, or in the moov box at the movie level;
(2) box_size is the total size of the box 502, in bytes;
(3) box_type is set to "meta" (i.e., 4 lower-case characters), indicating that this is a narration voice-over metadata box;
(4) narration_metadata_handler_box(): the box structure contained in this box is defined by the handler_type described below, as shown in Table 6. ISO-BMFF requires that this metadata handler box be included;
(5) narration_application_box(): the main box of the narration voice-over application format, contained in the narration voice-over metadata box. Its detailed description is shown in Table 7 below.
TABLE 6 syntax of metadata processing box
[Table 6 is provided as an image (PCTCN2021075622-APPB-000006) in the original document.]
The contents in table 6 are explained as follows:
(1) box_size is the total size of the box, in bytes;
(2) box_type is designated as "hdlr" (i.e., 4 lower-case characters), indicating that this is the handler box for narration voice-over data processing;
(3) handler_type is designated "napp" (i.e., 4 lower-case characters), indicating that the metadata handler "napp" will be used to define the media narration voice-over application format;
(4) The version, flags, predefined, reserved, and name fields may be set according to ISO-BMFF requirements.
TABLE 7 Syntax of the narration voice-over application box
[Table 7 is provided as an image (PCTCN2021075622-APPB-000007) in the original document.]
It should be noted that the narration voice-over application box, defined as a "full box", may be updated in the future. The contents in Table 7 are explained as follows:
(1) box _ size is the total size of the box, in bytes;
(2) box _ type is designated as "napp" (i.e., 4 characters in lower case) to indicate the metadata box format defined here for narrative voice applications;
(3) version, flags and reserved fields for future updates;
(4) The media_type indicates the format of the visual media content. Example definitions are shown below (note: the media types defined here for still images and groups of images may also be used with ISO-BMFF);
i. video: "video" (i.e., 4 characters in lower case);
a still image: "imag" (i.e., 4 characters in lower case);
image set: "picg" (i.e., 4 characters in lower case).
(5) The narration_data_starting_location indicates the starting position, in bytes, of the current narration voice-over information in the "mdat" box associated with the original visual media content file;
(6) The narration_data_total_length indicates the total amount of narration voice-over information in the "mdat" box. This value should be updated each time a new narration voice-over is added to the ISO-BMFF file. Adding this value to narration_data_starting_location gives the starting position of the next narration voice-over information, which simplifies the software implementation of narration voice-over processing (see the sketch following this list);
(7) number_of_narration_points specifies the total number of locations or frames in a video or a group of images that have been designated as narration points, i.e., the total number of image frames in the visual media content to which narration voice-overs have been assigned. This value should be updated (e.g., increased by 1) each time a new narration voice-over is added to an image frame that does not yet have any narration voice-over. If the visual media content has only one image (e.g., in the case of a still image), the value is set to 1;
(8) The narration_point is defined as the frame number of a frame already narrated in a video or a group of images, i.e., the frame number of an image frame in the visual media content to which a narration voice-over has been assigned. If the duration of the narration voice-over information is greater than 1, this frame number should be treated as the starting frame number. If number_of_narration_points is greater than 1, the narration_point values should be arranged in ascending order. Note: these syntax elements are similar to narration_entry_time and narration_starting_frame_number in Table 2;
(9) narration_point_description(): this is the box containing information about the narration point, i.e., the narration point description box, which can contain at least one of the items of information shown in Table 8 below.
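A minimal sketch of how a writer could use narration_data_starting_location and narration_data_total_length, as described in items (5) and (6) above, to place a new narration voice-over record in the "mdat" box. The structure and field names are reconstructions for illustration, not a normative implementation.

    def next_narration_offset(starting_location: int, total_length: int) -> int:
        # Start of the next narration voice-over record inside "mdat":
        # the base location plus everything already stored.
        return starting_location + total_length

    def append_narration(app_box: dict, new_record: bytes) -> int:
        # app_box is an illustrative dict holding the application box fields.
        offset = next_narration_offset(app_box["narration_data_starting_location"],
                                       app_box["narration_data_total_length"])
        app_box["narration_data_total_length"] += len(new_record)  # keep the total up to date
        return offset   # the caller writes new_record at this offset in "mdat"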
TABLE 8 Syntax of the narration point description box
[Table 8 is provided as an image (PCTCN2021075622-APPB-000008) in the original document.]
Note that the above narration_point_description() is defined as a "full box" and may be updated in the future. The contents in Table 8 are explained as follows: (1) box_size is the total size of the box, in bytes;
(2) box_type is designated "nptd" (i.e., 4 lower-case characters), indicating that this is the narration point description box "nptd";
(3) version, flags and reserved fields for future updates;
(4) number_of_entries specifies the total number of entries of narration voice-over information. When a new narration voice-over entry is added, it should be appended to the bottom of the list, after all previous narration voice-over entries, and the number_of_entries value is increased by one;
(5) The narration_duration specifies the number of frames for which the current narration voice-over information will last when the original media is a video or a group of pictures. If the medium is a still image and media_type is equal to "imag", narration_duration should be set to 1. If the narration voice-over information is an audio clip, narration_duration is equal to the playback length of the audio signal. When text-type narration voice-over information is synthesized and played as an audio signal, the playback of the audio signal should be completed within narration_duration. When audio-type narration voice-over information is converted to text, the narration voice-over information should be presented on every frame for the entire duration of the audio playback time;
(6) The narration_data_location indicates the starting position of the narration voice-over information in the "mdat" box relative to the narration_data_starting_location specified in the narration voice-over application box, i.e., the position of the current narration voice-over information relative to that starting position;
(7) The narration_data_length indicates the length of the current narration data, in bytes;
(8) narration_description(): this is the box containing information about the narration point, i.e., the description box of the current narration voice-over information. This box may contain at least one of the items of information shown in Table 9 below.
TABLE 9 Syntax of the description box of the current narration voice-over information
[Table 9 is provided as an image (PCTCN2021075622-APPB-000009) in the original document.]
It should be noted that the description box of the current narration voice-over information is defined as a "full box" and may be updated in the future. The contents shown in Table 9 above are explained as follows:
(1) box _ size is the total size of the box, in bytes;
(2) box _ type is designated "nrtd" (i.e., 4 characters in lower case), indicating that this is the description box "nrtd" currently reciting the side-information;
(3) version, flags and reserved fields for future updates;
(4) text_encoding_standard_id is used to describe the text encoding standard of the narrator name. Its definition is the same as the text_encoding_standard_id described earlier. If the narration voice-over is in text format, the text encoding standard specified here also applies to the encoding of the narration voice-over content;
(5) The narration_author_name_length is used to specify the length of narration_author_name, in bytes;
(6) The narration_author_name is used to specify the name of the person or entity that created the current narration voice-over information. Note that n in the table is equal to narration_author_name_length;
(7) The narration_creation_date is used to specify the date on which the narration voice-over information is added. Its definition is the same as in Table 2;
(8) The narration_creation_time is used to specify the time at which the narration voice-over information is added. Its definition is the same as in Table 2;
(9) media_ownership_flag equal to 1 indicates that the narrator owns the visual media content; media_ownership_flag equal to 0 indicates that the narrator of the narration voice-over entry does not own the visual media content;
(10) The narration_data_type is used to specify the data format of the narration voice-over information, where 0 indicates that the narration voice-over is in text format, 1 indicates audio format, and 2 indicates combined text and audio format. If the original media type of the visual media content is video, i.e., media_type = "video", then narration_data_type can only be 0;
(11) The audio_encoding_type is used to specify the encoding standard of the narration voice-over information. Any coding standard may be used here. For example, audio_encoding_type may be defined as follows:
i. for a text narration voice-over: follow the coding standards in Table 3;
ii. for an audio narration voice-over: follow the coding standards in Table 4.
(12) text_narration_data_length is the length of the text portion of a narration voice-over that has both a text portion and an audio portion. When a narration voice-over has a text portion and an audio portion, the text portion is saved first in "mdat", followed by the audio data. The length of the audio data is equal to narration_data_length minus the text_narration_data_length in the description box of the image frame to which the narration voice-over information belongs (see the sketch below).
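Item (12) implies a simple layout for a combined entry: the text portion is stored first in "mdat", immediately followed by the audio portion, and the audio length is whatever remains once the text length is subtracted. A short Python sketch of that split, using the reconstructed field names as illustrative assumptions:

    def split_text_and_audio(record: bytes, text_narration_data_length: int):
        # record holds one narration voice-over entry as stored in "mdat":
        # text bytes first, audio bytes immediately after.
        text_part = record[:text_narration_data_length]
        audio_part = record[text_narration_data_length:]   # narration_data_length minus text length
        return text_part, audio_part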
Based on the foregoing embodiments, an embodiment of the present application provides a decoder. Fig. 6 is a schematic structural diagram of the decoder in the embodiments of the present application. As shown in fig. 6, the decoder 600 includes a decoding module 601 and a playing module 602; wherein:
a decoding module 601, configured to parse the code stream to obtain at least one narration voice-over information of the visual media content and a corresponding presentation time;
a playing module 602, configured to, when the visual media content is played, present the at least one narration voice-over information according to the presentation time.
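As a rough sketch of how the playing module could consult the parsed narration voice-over entries while frames are rendered, the helper below returns, for a given frame number, every entry whose presentation window covers that frame. It reuses the illustrative NarrationEntry structure sketched earlier and is an assumption about one possible implementation, not the decoder's actual logic.

    def narration_for_frame(entries, frame_number):
        # entries: list of NarrationEntry; returns those active on this frame.
        active = []
        for entry in entries:
            first, last = entry.presentation_window()
            if first <= frame_number <= last:
                active.append(entry)
        return active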
In some embodiments, the decoding module 601 is configured to: analyzing the code stream to obtain a media file or a bit stream sent by an encoder; visual media content, at least one narrative voice-over information and a corresponding presentation time are obtained from the media file or the bitstream.
In some embodiments, the visual media content is a video or a set of images; accordingly, when the narrative voice-over information is original text, text converted from audio, or combined audio and text, the presentation time of the text is represented in the form of a marked start frame and at least one continuous frame of the visual media content; a play module 602, configured to: and continuously presenting the corresponding text from the beginning of playing the starting frame until the playing of the at least one continuous frame is finished.
In some embodiments, the visual media content is a video clip or a set of images; accordingly, when the narrative voice-over information is original audio, text-converted audio, or combined audio and text, the presentation time of the audio is represented in the form of a marked start frame and duration of the visual media content; a play module 602, configured to: and starting to play the audio from the beginning frame until the image or video frame in the duration is played.
In some embodiments, the number of sustained frames within the duration of the text-converted audio is less than the number of sustained frames of the corresponding text.
In some embodiments, the visual media content is an image, a group of images, or a video, and the narrative bystander information is original audio, text-converted audio, or combined audio and text; accordingly, a play module 602 is configured to: when the image is played or the group of images are statically displayed, the audio is played repeatedly; and when the group of images or the video is played according to a certain frame rate, playing the audio in an asynchronous mode.
In some embodiments, the playing module 602 is configured to: and when the narration voice-over switch is in an open state, playing the visual media content, and presenting the at least one narration voice-over information according to the presentation time.
In some embodiments, the playing module 602 is configured to: playing the visual media content without presenting the at least one narration voice-over information while the narration voice-over switch is in an off state.
In some embodiments, the playing module 602 is configured to: and when the narration voice-over information is an original text or a text converted from audio, the text is superposed on a playing picture of the visual media content for displaying, or the text is displayed on other windows independent of a playing window of the visual media content, or the original text is converted into audio for playing.
In some embodiments, the playing module 602 is configured to: in the case where the visual media content has audio, mix the audio belonging to the narration voice-over with the audio belonging to the visual media content for playback, or stop playing the audio belonging to the visual media content and play the audio belonging to the narration voice-over alone.
In some embodiments, the playing module 602 is configured to: in the case where the narration voice-over information is original audio or audio converted from original text and the visual media content has audio, mix the audio belonging to the narration voice-over with the audio belonging to the visual media content for playback, or stop playing the audio belonging to the visual media content and play the narration voice-over audio alone, or convert the original audio into text and then present it.
In some embodiments, the playing module 602 is configured to: in the case where the narrative voice-over information is a combined text and audio, the text and the audio are presented simultaneously or separately.
In some embodiments, a play module 602 to: when the visual media content is not played and the presentation time of the next narration voice-over information is reached, providing a first option unit for a user to select the playing state of the narration voice-over information; when the visual media content is played and the narration voice-over information is not played, providing a second option unit for a user to select the playing state of the narration voice-over information; and presenting the narration voice-over information according to the selected option.
In some embodiments, the playing module 602 is configured to: when the first option of the first option unit is selected, the playing of the visual media content is frozen until the narration voice-over information is played, and the next narration voice-over information and the visual media content are played continuously; when the second option of the first option unit is selected, ending playing the narration voice-over information and starting playing the next narration voice-over information; and when a third option of the second option unit is selected, circularly playing the visual media content.
In some embodiments, a play module 602 to: and circularly playing the whole content of the visual media content, or circularly playing the marked frame images in the visual media content.
In some embodiments, a play module 602 to: acquiring the registration information of the at least one narration voice-over information from the media file or the bit stream; and when the narration voice-over information is presented, presenting corresponding registration information.
In some embodiments, the playing module 602 is configured to: when the narration voice-over information is presented, displaying a trigger key of a pull-down menu; when the trigger key receives a trigger operation, displaying an option of whether to play registration information; when an option indicating to play the registration information receives an operation, the registration information is presented.
In some embodiments, the registration information for narrative voice-over information includes at least one of: narrator name, creation date and time, ownership information of the visual media content.
In some embodiments, the playing module 602 is configured to: and playing the video media content in the background, and presenting the at least one narration voice-over information in the foreground according to the presentation time.
In some embodiments, the decoding module 601 is further configured to receive a new code stream, and obtain new narrative voice-over information of the visual media content from the new code stream; the playing module 602 is further configured to present the new narration voice-over information.
In some embodiments, a play module 602 to: displaying an option of whether to play the new narration voice message; and when the option for indicating the playing of the new narration voice-over information receives operation, presenting the new narration voice-over information.
In some embodiments, the decoding module 601 is configured to: parse the code stream to obtain a media file or a bit stream that conforms to a preset data structure; wherein the preset data structure includes at least one of the following: a general data structure and the ISO base media file format (ISO-BMFF) data structure of the International Organization for Standardization.
In some embodiments, the ISO-BMFF data structure includes at least a narration voice-over metadata box, and the narration voice-over metadata box includes a metadata processing box and a narration voice-over application box; the decoding module 601 is configured to: obtain metadata of the current narration voice-over information from the metadata processing box of the media file or the bit stream; and obtain, from the narration voice-over application box of the media file or the bit stream, at least one of: the starting position of the current narration voice-over information, the length of the current narration voice-over information, and the total amount of narration voice-over information.
In some embodiments, the narration voice-over application box includes a narration voice-over description box, and the method further includes: decoding, from the narration voice-over description box, at least one of the following items of narration voice-over information: the text encoding standard, the narrator name, the creation date, the creation time, the ownership flag of the associated visual content, the type of the narration voice-over information, the encoding standard of the narration voice-over information, and the text length of the narration voice-over information.
In some embodiments, the decoding module 601 is further configured to: if the visual media content does not have a narration voice-over metadata box at the file level, acquiring the narration voice-over metadata box, and decoding the narration voice-over metadata box to acquire the at least one narration voice-over information; and if the narration voice-over metadata box exists in the visual media content at the file level, acquiring the narration voice-over metadata box from the meco container box, and decoding the narration voice-over metadata box to obtain the at least one narration voice-over information.
In some embodiments, the decoding module 601 is configured to: under the condition that the narration voice-over information is of a text type, decoding the narration voice-over information from the media file or the bit stream according to a preset text decoding standard; wherein the preset text decoding standard is one of the following: UTF-8, UTF-16, GB2312-80, GBK and Big 5. Of course, the predetermined text decoding standard may be any other predefined standard.
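A sketch of how a decoder might map the preset text encoding standards listed above onto concrete decoders in Python; the numeric id values are placeholders (the actual code values are given in Table 3 of the original document, which is provided as an image).

    # Hypothetical id-to-codec mapping; the real id values come from Table 3.
    TEXT_CODECS = {0: "utf-8", 1: "utf-16", 2: "gb2312", 3: "gbk", 4: "big5"}

    def decode_text_narration(data: bytes, text_encoding_standard_id: int) -> str:
        codec = TEXT_CODECS.get(text_encoding_standard_id, "utf-8")
        return data.decode(codec)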
In some embodiments, the decoding module 601 is configured to: under the condition that the narration voice-over information is of an audio type, decoding the narration voice-over information from the media file or the bit stream according to a preset audio decoding standard; wherein the preset audio decoding standard is one of the following: AVS audio, MP3, AAC, and WAV. Of course, the preset audio decoding standard may also be any other predefined standard.
Based on the foregoing embodiments, an embodiment of the present application provides an encoder. Fig. 7 is a schematic structural diagram of the encoder in the embodiments of the present application. As shown in fig. 7, the encoder 700 includes a determining module 701, an embedding module 702 and an encoding module 703; wherein:
a determining module 701, configured to determine at least one narration voice-over information to be added and a corresponding presentation time;
an embedding module 702, configured to embed, in a preset manner, the at least one narration voice-over information and corresponding presentation time into a media file or a bitstream of the visual media content without changing the visual media content corresponding to the at least one narration voice-over information, so as to obtain a new media file or a new bitstream;
the encoding module 703 is configured to encode the new media file or the new bitstream to obtain a code stream.
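As a rough illustration of the embedding step performed by the embedding module 702, the sketch below simply appends the narration voice-over records after the original media bytes, mirroring the layout described earlier in which the narration data follows the visual data. It is a deliberately simplified assumption: a complete implementation would also create or update the metadata boxes described above rather than only appending bytes.

    def embed_narration(media_file: bytes, narration_records: list) -> bytes:
        # Sketch only: append the narration voice-over records after the
        # original media data, leaving the visual content itself untouched.
        out = bytearray(media_file)
        for record in narration_records:
            out += record
        return bytes(out)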
In some embodiments, the visual media content is a video or a set of images; accordingly, when the narrative voice-over information is original text, text converted from audio, or combined audio and text, the presentation time of the text is represented in the form of a marked-up start frame and at least one continuation frame of the visual media content.
In some embodiments, the visual media content is a video clip or a set of images; accordingly, where the narrative voice-over information is original audio, text-converted audio, or combined audio and text, the presentation time of the audio is represented in the form of a marked start frame and duration of the visual media content.
In some embodiments, the number of sustained frames within the duration of the text-converted audio is less than the number of sustained frames of the corresponding text.
In some embodiments, the embedding module 702 is further configured to embed the registration information of the narrative voice-over information into a media file or a bitstream of the visual media content in a preset manner.
In some embodiments, the registration information for narrative voice-over information includes at least one of: narrator name, creation date and time, ownership information of the visual media content.
In some embodiments, the embedding module 702 is configured to: storing the at least one narration voice-over information in a preset manner at a starting position of the visual media content.
In some embodiments, the determining module 701 is configured to: narrative voice-over information is created for at least one user of the visual media content, the at least one narrative voice-over information is obtained.
In some embodiments, the type of narrative bystander information comprises at least one of: a text type and an audio type; the type of visual media content includes at least one of: the system comprises a video, an image and an image group, wherein the image group comprises at least two images.
In some embodiments, when the type of the current narrative bystander information is a text type, the embedding module 702 is configured to: creating a text data segment; embedding the current narrative voice-over information into a media file or a bit stream of the visual media content in a text data segment mode.
In some embodiments, when the type of the current narrative voice-over information is an audio type, the embedding module 702 is configured to: creating an audio clip; embedding the current narrative voice-over information into a media file or bitstream of the visual media content in an audio clip.
In some embodiments, when the type of the current narrative bystander information is a text type, the embedding module 702 is configured to: converting the current narration voice-over information into narration voice-over information corresponding to the audio type, and creating an audio clip; embedding the current narrative voice-over information in a media file or bitstream of the visual media content in an audio clip.
In some embodiments, when the type of the current narrative voice-over information is an audio type, the embedding module 702 is configured to: converting the current narration voice-over information into narration voice-over information corresponding to the text type, and creating a text data segment; embedding the current narrative voice-over information into a media file or a bit stream of the visual media content in a text data segment mode.
In some embodiments, the determining module 701 is configured to determine that the type of the at least one narrative voice-over information is a text type and/or an audio type when the type of the visual media content is an image or a group of images; when the type of the visual media content is a video, determining that the type of the at least one narration voice-over information is a text type.
In some embodiments, the embedding module 702 is configured to, if the types of narrative voice-over information include a text type and an audio type, store the narrative voice-over information corresponding to the audio type after the narrative voice-over information corresponding to the text type.
In some embodiments, the determining module 701 is configured to determine new narration voice-over information to be added; the embedding module 702 is configured to store the new narration voice-over information after the existing narration voice-over information.
In some embodiments, the media file or bitstream conforms to a preset data structure; wherein the preset data structure includes at least one of the following: a general data structure and the ISO base media file format (ISO-BMFF) data structure of the International Organization for Standardization; the embedding module 702 is configured to: embed the at least one narration voice-over information and the corresponding presentation time in the media file or bitstream of the visual media content in the form of the preset data structure.
In some embodiments, the ISO-BMFF data structure includes at least a narration voice-over metadata box, and the narration voice-over metadata box includes a narration voice-over metadata processing box and a narration voice-over application box; accordingly, the embedding module 702 is further configured to: process the metadata of the current narration voice-over information through the narration voice-over metadata processing box; and describe, through the narration voice-over application box, at least one of the following: the starting position of the current narration voice-over information, the data length of the current narration voice-over information, and the total number of narration voice-over information.
In some embodiments, the narration voice-over application box includes a narration voice-over description box, and the embedding module 702 is further configured to: describe, through the narration voice-over description box, at least one of the following: the text encoding standard, the narrator name, the creation date, the creation time, the ownership flag of the associated visual content, the type of the narration voice-over information, the encoding standard of the narration voice-over information, and the text length of the narration voice-over information.
In some embodiments, the embedding module 702 is further configured to: if the visual media content does not have a narration voice-over metadata box at the file level, creating the narration voice-over metadata box, and describing the at least one narration voice-over information through the narration voice-over metadata box; if a narration voice-over metadata box exists in the visual media content at the file level, the narration voice-over metadata box is created in a meco container box, and at least one narration voice-over information is described through the narration voice-over metadata box.
In some embodiments, the text data segment is encoded using a predetermined text encoding standard, the predetermined text encoding standard including at least one of: UTF-8, UTF-16, GB2312-80, GBK and Big 5. Of course, the predetermined text encoding standard may be any other predefined standard.
In some embodiments, the audio segment is encoded using a predetermined audio coding standard, the predetermined audio coding standard including at least one of: AVS audio, MP3, AAC and WAV. Of course, the predetermined audio coding standard may also be any other predefined standard.
The above description of the encoder and decoder embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the encoder and decoder embodiments of the present application, reference is made to the description of the method embodiments of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the method described above is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
The present application provides a computer storage medium applied to an encoder 700, and the computer storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements the method of any one of the foregoing embodiments.
Based on the above-mentioned composition of the encoder 700 and the computer storage medium, refer to fig. 8, which shows a specific hardware structure diagram of the encoder 700 provided in the embodiment of the present application. As shown in fig. 8, the encoder may include: a first communication interface 801, a memory 802, and a processor 803; the various components are coupled together by a first bus system 804. It is understood that the first bus system 804 is used to enable connection and communication between these components. The first bus system 804 includes a power bus, a control bus, and a status signal bus in addition to a data bus; however, for clarity of illustration, the various buses are all labeled as the first bus system 804 in fig. 8. Wherein:
a first communication interface 801, configured to receive and transmit signals during information transmission and reception with other external network elements;
a memory 802 for storing a computer program capable of running on the processor 803;
a processor 803 for executing, when running the computer program, the following:
determining at least one narration voice-over information to be added and corresponding presenting time;
under the condition that the visual media content corresponding to the at least one narration voice-over information is not changed, the at least one narration voice-over information and the corresponding presentation time are embedded into a media file or a bit stream of the visual media content in a preset mode, and a new media file or a new bit stream is obtained;
and coding the new media file or the new bit stream to obtain a code stream.
It will be appreciated that the memory 802 in the embodiments of the present application can be either volatile memory or non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), SyncLink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 802 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The processor 803 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 803. The processor 803 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 802, and the processor 803 reads the information in the memory 802 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof. For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Optionally, as another embodiment, the processor 803 is further configured to perform the method of any one of the previous embodiments when running the computer program.
Based on the above-mentioned composition of the decoder 600 and the computer storage medium, refer to fig. 9, which shows a specific hardware structure diagram of the decoder 900 provided in the embodiment of the present application. As shown in fig. 9, the decoder may include: a second communication interface 901, a memory 902, and a processor 903; the various components are coupled together by a second bus system 904. It will be appreciated that the second bus system 904 is used to enable communications among the components. The second bus system 904 includes a power bus, a control bus, and a status signal bus in addition to the data bus; however, for clarity of illustration, the various buses are all labeled as the second bus system 904 in fig. 9. Wherein:
a second communication interface 901, configured to receive and send signals in a process of receiving and sending information with other external network elements;
a memory 902 for storing a computer program capable of running on the processor 903;
a processor 903 for executing, when running the computer program, the following:
analyzing the code stream to obtain at least one narration voice-over information of the visual media content and corresponding presentation time;
presenting the at least one narration voice-over information in accordance with the presentation time while playing the visual media content.
It is understood that the memory 902 is similar in hardware functionality to the memory 802 and the processor 903 is similar in hardware functionality to the processor 803; and will not be described in detail herein.
Accordingly, an embodiment of the present application provides a computer storage medium, wherein the computer storage medium stores a computer program, and the computer program, when executed by a processor, implements a method of information processing as described in an encoding end or a method as described in a decoding end of an embodiment of the present application.
An embodiment of the present application provides an electronic device, where the electronic device at least includes an encoder and/or a decoder according to the embodiments of the present application.
Here, it should be noted that: the above description of the decoder, encoder, storage medium and device embodiments, similar to the description of the method embodiments described above, has similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the decoder, encoder, storage medium and apparatus of the present application, reference should be made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" or "some embodiments" means that a particular feature, structure or characteristic described in connection with the embodiments is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply any order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice, such as: multiple modules or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be electrical, mechanical or in other forms.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules; can be located in one place or distributed on a plurality of network units; some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may be separately regarded as one unit, or two or more modules may be integrated into one unit; the integrated module can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several product embodiments presented in this application can be combined arbitrarily, without conflict, to arrive at new product embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (55)

  1. An information processing method, the method comprising:
    analyzing the code stream to obtain at least one narration voice-over information of the visual media content and corresponding presentation time;
    and presenting the at least one narrative voice-over information according to the presentation time when the visual media content is played.
  2. The method of claim 1, wherein parsing the codestream to obtain at least one narrative voice-over information and corresponding presentation time of the visual media content comprises:
    analyzing the code stream to obtain a media file or a bit stream sent by an encoder;
    visual media content, at least one narration voice-over information and a corresponding presentation time are obtained from the media file or the bitstream.
  3. The method of claim 1, wherein the visual media content is a video or a set of images; when the narrative voice-over information is original text, text converted from audio, or combined audio and text, the presentation time of the text is represented in the form of a marked start frame and at least one continuous frame of the visual media content;
    when the visual media content is played, presenting the at least one narration voice-over information according to the presentation time, wherein the presenting comprises:
    and continuously presenting the corresponding text from the beginning of playing the starting frame until the playing of the at least one continuous frame is finished.
  4. A method according to claim 1 or 3, wherein the visual media content is a video clip or a set of images; accordingly, when the narrative voice-over information is original audio, text-converted audio, or combined audio and text, the presentation time of the audio is represented in the form of a marked start frame and duration of the visual media content;
    when the visual media content is played, presenting the at least one narration voice-over information according to the presentation time, wherein the presenting comprises:
    and starting to play the audio from the beginning frame until the image or video frame in the duration is played.
  5. The method of claim 4, wherein the number of persistent frames within the duration of the text-converted audio is less than the number of persistent frames for the corresponding text.
  6. The method of claim 1, wherein the visual media content is an image, a group of images, or a video, and the narration voice-over information is original audio, text-converted audio, or combined audio and text; correspondingly, when the visual media content is played, the presenting the at least one narration voice-over information according to the presentation time comprises:
    when the image is played or the group of images are statically displayed, the audio is played repeatedly;
    and when the group of images or the video is played according to a certain frame rate, playing the audio in an asynchronous mode.
  7. The method of claim 1, wherein said presenting said at least one narrative voice-over information at said presentation time while playing said visual media content comprises:
    and when the narration voice-over switch is in an open state, playing the visual media content, and presenting the at least one narration voice-over information according to the presentation time.
  8. The method of claim 7, wherein the method further comprises: playing the visual media content without presenting the at least one narration voice-over information while the narration voice-over switch is in an off state.
  9. The method of claim 1 or 7, wherein said presenting said narrative voice-over information comprises:
    and when the narration voice-over information is an original text or a text converted from audio, the text is superposed on a playing picture of the visual media content for displaying, or the text is displayed on other windows independent of a playing window of the visual media content, or the original text is converted into audio for playing.
  10. The method of claim 9, wherein said converting the original text to audio for playback comprises:
    in the case where the visual media content has audio, the audio belonging to the narration voice-over is mixed with the audio belonging to the visual media content for playback, or the audio belonging to the visual media content is stopped and the audio belonging to the narration voice-over is played alone.
  11. The method of claim 1, wherein said presenting said narrative voice-over information comprises:
    in the case where the narration voice-over information is original audio or audio converted from original text and the visual media content has audio, mixing the audio belonging to the narration voice-over with the audio belonging to the visual media content for playback, or stopping playback of the audio belonging to the visual media content and playing the audio belonging to the narration voice-over alone, or converting the original audio into text for presentation.
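  By way of illustration only, the "mix" and "play alone" behaviours of claims 10 and 11 can be sketched as follows for 16-bit PCM samples held in plain Python lists; the gain value and function names are assumptions and not part of this application.

    def mix(content: list[int], narration: list[int], narration_gain: float = 0.7) -> list[int]:
        """Mix the narration audio into the content audio sample by sample, with clipping."""
        out = []
        for i in range(max(len(content), len(narration))):
            c = content[i] if i < len(content) else 0
            n = narration[i] if i < len(narration) else 0
            s = int(c + narration_gain * n)
            out.append(max(-32768, min(32767, s)))   # clip to the 16-bit range
        return out

    def play_alone(content: list[int], narration: list[int]) -> list[int]:
        """Stop the content audio and play the narration audio by itself."""
        return list(narration)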
  12. The method of claim 1, wherein said presenting said narrative voice-over information comprises:
    in the case where the narrative voice-over information is a combined text and audio, the text and the audio are presented simultaneously or separately.
  13. The method of claim 1, wherein said presenting said narrative voice-over information comprises:
    when the visual media content has not finished playing and the presentation time of the next narration voice-over information is reached, providing a first option unit for a user to select a playing state of the narration voice-over information;
    when the visual media content has finished playing and the narration voice-over information has not finished playing, providing a second option unit for the user to select the playing state of the narration voice-over information; and
    presenting the narration voice-over information according to the selected option.
  14. The method of claim 13, wherein said presenting narrative voice-over information in accordance with the selected option comprises:
    when a first option of the first option unit is selected, freezing playback of the visual media content until the narration voice-over information has finished playing, and then continuing to play the next narration voice-over information and the visual media content;
    when a second option of the first option unit is selected, ending playback of the current narration voice-over information and starting playback of the next narration voice-over information; and
    when a third option of the second option unit is selected, playing the visual media content in a loop.
  15. The method of claim 14, wherein the playing the visual media content in a loop comprises:
    playing the entire visual media content in a loop, or playing the marked frame images in the visual media content in a loop.
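  By way of illustration only, the conditions under which the two option units of claims 13 to 15 would be offered can be sketched as follows; the function and option descriptions are assumptions rather than the claimed implementation.

    from typing import Optional

    def choose_option_unit(content_finished: bool,
                           narration_finished: bool,
                           next_narration_due: bool) -> Optional[str]:
        # First option unit: the next narration is due while the current one still plays.
        if next_narration_due and not narration_finished:
            return "first option unit: freeze the video, or skip to the next narration"
        # Second option unit: the content has ended but the narration is still playing.
        if content_finished and not narration_finished:
            return "second option unit: loop the whole content or the marked frames"
        return None   # no conflict, keep playing normally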
  16. The method of claim 1, wherein the method further comprises:
    acquiring the registration information of the at least one narration voice-over information from the media file or the bit stream;
    and when the narration voice-over information is presented, presenting corresponding registration information.
  17. The method of claim 16, wherein presenting the corresponding registration information when the narration voice-over information is presented comprises:
    displaying a trigger key of a pull-down menu when the narration voice-over information is presented;
    displaying, when the trigger key receives a trigger operation, an option of whether to play the registration information; and
    presenting the registration information when the option indicating that the registration information is to be played receives an operation.
  18. The method of claim 16, wherein the registration information for narrative voice-over information includes at least one of: narrator name, creation date and time, ownership information of the visual media content.
  19. The method of claim 1, wherein said presenting said at least one narrative voice-over information at said presentation time while playing said visual media content comprises:
    playing the visual media content in the background, and presenting the at least one narration voice-over information in the foreground according to the presentation time.
  20. The method of claim 1, wherein the method further comprises:
    receiving a new code stream, and acquiring new narration voice-over information of the visual media content from the new code stream;
    and presenting the new narration voice-over information.
  21. The method of claim 20, wherein said presenting said new narration voice-over information comprises:
    displaying an option of whether to play the new narration voice-over information; and
    presenting the new narration voice-over information when the option indicating that the new narration voice-over information is to be played receives an operation.
  22. The method of claim 2, wherein the parsing the codestream to obtain the media file or the bitstream transmitted by the encoder comprises:
    parsing the code stream to obtain a media file or a bitstream conforming to a preset data structure, wherein the preset data structure comprises at least one of the following: a general data structure and an ISO base media file format (ISO-BMFF) data structure.
  23. The method of claim 22, wherein the ISO-BMFF data structure includes at least a narrative voice-over metadata box, the narrative voice-over metadata box including a metadata processing box and a narrative voice-over application box;
    accordingly, the method further comprises:
    obtaining metadata of the current narration voice-over information from a metadata processing box of the media file or the bit stream;
    obtaining, from a narrative voice-over application box of the media file or the bitstream, at least one of the following: a start position of the current narrative voice-over information, a length of the current narrative voice-over information, and a total number of pieces of narrative voice-over information.
  24. The method of claim 23, wherein the narrative voice-over application box comprises a narrative voice-over description box, the method further comprising:
    decoding the narrative voice-over description box to obtain at least one of the following items of narration information: a text encoding standard, a narrator name, a creation date, a creation time, an ownership flag of the attached visual content, a type of the narrative voice-over information, an encoding standard of the narrative voice-over information, and a text length of the narrative voice-over information.
  25. The method of claim 23, wherein the method further comprises:
    if the visual media content does not have a narration voice-over metadata box at the file level, acquiring the narration voice-over metadata box and decoding it to obtain the at least one narration voice-over information; and
    if the visual media content has a narration voice-over metadata box at the file level, acquiring the narration voice-over metadata box from the meco container box and decoding it to obtain the at least one narration voice-over information.
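  By way of illustration only, a reader of claims 22 to 25 might walk ISO-BMFF boxes as in the sketch below; the four-character code 'nvom' is a hypothetical placeholder (the actual box types are defined by the preset data structure referred to in the claims), while 'meco' denotes the standard additional metadata container box.

    import struct
    from typing import Iterator, Optional, Tuple

    def iter_boxes(data: bytes, offset: int = 0) -> Iterator[Tuple[str, bytes]]:
        """Walk ISO-BMFF boxes: 4-byte size, 4-byte type, then the payload."""
        while offset + 8 <= len(data):
            size, = struct.unpack_from(">I", data, offset)
            box_type = data[offset + 4:offset + 8].decode("ascii", "replace")
            if size < 8:       # size values 0 and 1 (to-end / 64-bit size) are not handled here
                break
            yield box_type, data[offset + 8:offset + size]
            offset += size

    def find_narration_metadata(data: bytes, fourcc: str = "nvom") -> Optional[bytes]:
        """Return the payload of the (hypothetical) narration voice-over metadata box."""
        for box_type, payload in iter_boxes(data):
            if box_type == fourcc:
                return payload
            if box_type == "meco":            # also look inside the additional metadata container box
                for child_type, child_payload in iter_boxes(payload):
                    if child_type == fourcc:
                        return child_payload
        return None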
  26. The method of claim 2, wherein obtaining narration voice-over information from the media file or the bitstream comprises:
    decoding the narration voice-over information from the media file or the bit stream according to a preset text decoding standard under the condition that the narration voice-over information is of a text type;
    wherein the preset text decoding standard is one of the following: UTF-8, UTF-16, GB2312-80, GBK, and Big5.
  27. The method of claim 1, wherein obtaining narration voice-over information from the media file or the bitstream comprises:
    under the condition that the narration voice-over information is of an audio type, decoding the narration voice-over information from the media file or the bit stream according to a preset audio decoding standard;
    wherein the preset audio decoding standard is one of the following: AVS audio, MP3, AAC, and WAV.
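  By way of illustration only, dispatching on the declared text encoding standard of claims 26 and 48 could look as follows; the mapping to Python codec names is an assumption. For example, decode_narration_text(b"\xc4\xe3\xba\xc3", "GBK") yields "你好".

    TEXT_CODECS = {"UTF-8": "utf-8", "UTF-16": "utf-16", "GB2312-80": "gb2312",
                   "GBK": "gbk", "Big5": "big5"}

    def decode_narration_text(payload: bytes, declared_standard: str) -> str:
        """Decode a text-type narration voice-over payload using the declared standard."""
        codec = TEXT_CODECS.get(declared_standard)
        if codec is None:
            raise ValueError(f"unsupported text encoding standard: {declared_standard}")
        return payload.decode(codec)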
  28. An information processing method, the method comprising:
    determining at least one narration voice-over information to be added and corresponding presentation time;
    embedding, without changing the visual media content corresponding to the at least one narration voice-over information, the at least one narration voice-over information and the corresponding presentation time into a media file or a bitstream of the visual media content in a preset manner to obtain a new media file or a new bitstream; and
    encoding the new media file or the new bitstream to obtain a code stream.
  29. The method of claim 28, wherein the visual media content is a video or a set of images; accordingly, when the narrative voice-over information is original text, text converted from audio, or combined audio and text, the presentation time of the text is represented in the form of a marked start frame and at least one continuation frame of the visual media content.
  30. The method of claim 28, wherein the visual media content is a video clip or a set of images; accordingly, where the narrative voice-over information is original audio, text-converted audio, or combined audio and text, the presentation time of the audio is represented in the form of a marked start frame and duration of the visual media content.
  31. The method of claim 30, wherein, for text-converted audio, the number of continuation frames within the duration of the audio is less than the number of continuation frames of the corresponding text.
  32. The method of claim 28, wherein the method further comprises: and embedding the registration information of the narration voice-over information into a media file or a bit stream of the visual media content in a preset mode.
  33. The method of claim 32, wherein the registration information for narrative voice-over information includes at least one of: narrator name, creation date and time, ownership information of the visual media content.
  34. The method of claim 28, wherein said embedding said at least one narration voice-over information into a media file or bitstream of said visual media content in a predetermined manner comprises:
    storing the at least one narration voice-over information in a preset manner at a starting position of the visual media content.
  35. The method of claim 28, wherein said determining at least one narrative voice-over information to be added comprises:
    obtaining narration voice-over information created by at least one user for the visual media content, to obtain the at least one narration voice-over information.
  36. The method of claim 28, wherein,
    the type of the narration voice-over information comprises at least one of the following: a text type and an audio type; and
    the type of the visual media content comprises at least one of the following: a video, an image, and a group of images, wherein the group of images comprises at least two images.
  37. The method of claim 36, wherein when the type of the current narrative voice-over information is a text type, the method further comprises:
    creating a text data segment;
    accordingly, embedding the at least one narrative voice-over information into a media file or bitstream of the visual media content in a predetermined manner, comprises:
    embedding the current narrative voice-over information into a media file or a bitstream of the visual media content in the form of a text data segment.
  38. The method of claim 36, wherein when the type of the current narrative voice-over information is an audio type, the method further comprises:
    creating an audio clip;
    accordingly, embedding the at least one narrative voice-over information into a media file or bitstream of the visual media content in a predetermined manner, comprises:
    embedding the current narrative voice-over information into a media file or bitstream of the visual media content in an audio clip.
  39. The method of claim 36, wherein when the type of the current narrative voice-over information is a text type, the method further comprises:
    converting the current narration voice-over information into narration voice-over information corresponding to the audio type, and creating an audio clip;
    accordingly, embedding the at least one narrative voice-over information into a media file or bitstream of the visual media content in a predetermined manner, comprises:
    embedding the current narrative voice-over information into a media file or bitstream of the visual media content in an audio clip.
  40. The method of claim 36, wherein when the type of the current narrative voice-over information is an audio type, the method further comprises:
    converting the current narration voice-over information into narration voice-over information corresponding to the text type, and creating a text data segment;
    accordingly, embedding the at least one narrative voice-over information into a media file or bitstream of the visual media content in a predetermined manner, comprises:
    embedding the current narrative voice-over information into a media file or a bitstream of the visual media content in the form of a text data segment.
  41. The method of claim 36, wherein the method further comprises:
    when the type of the visual media content is an image or an image group, determining the type of the at least one narrative voice-over information to be a text type and/or an audio type;
    when the type of the visual media content is a video, determining that the type of the at least one narration voice-over information is a text type.
  42. The method of claim 28, wherein the method further comprises:
    if the type of the narration voice-over information comprises both a text type and an audio type, the narration voice-over information corresponding to the audio type is stored after the narration voice-over information corresponding to the text type.
  43. The method of claim 28, wherein the method further comprises:
    determining new narration voice-over information to be added; and
    storing the new narration voice-over information after the existing narration voice-over information.
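  By way of illustration only, the ordering rules of claims 42 and 43 (audio stored after the corresponding text, and new narrations appended after the existing ones) can be sketched as follows with hypothetical container types.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class NarrationEntry:
        text_segment: Optional[bytes] = None   # encoded with one of the preset text standards
        audio_clip: Optional[bytes] = None     # encoded with one of the preset audio standards

    @dataclass
    class NarrationTrack:
        entries: List[NarrationEntry] = field(default_factory=list)

        def add(self, text: Optional[bytes], audio: Optional[bytes]) -> None:
            # New narrations are appended after the existing ones (claim 43).
            self.entries.append(NarrationEntry(text_segment=text, audio_clip=audio))

        def serialize(self) -> bytes:
            out = bytearray()
            for entry in self.entries:
                if entry.text_segment is not None:
                    out += entry.text_segment
                if entry.audio_clip is not None:      # audio stored after the text (claim 42)
                    out += entry.audio_clip
            return bytes(out)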
  44. A method according to any one of claims 28 to 43, wherein the media file or bitstream conforms to a preset data structure, wherein the preset data structure comprises at least one of the following: a general data structure and an ISO base media file format (ISO-BMFF) data structure; correspondingly, the embedding the at least one narration voice-over information and the corresponding presentation time into the media file or the bitstream of the visual media content in a preset manner comprises:
    embedding the at least one narrative voice-over information and corresponding presentation time in a media file or bitstream of the visual media content in the form of the preset data structure.
  45. The method of claim 43, wherein the ISO-BMFF data structure comprises at least a narrative voice-over metadata box comprising a narrative voice-over metadata processing box and a narrative voice-over application box;
    accordingly, the method further comprises:
    processing metadata of the current narration voice-over information by the narration voice-over metadata processing box;
    describing, by the narration voice-over application box, at least one of the following: a start position of the current narration voice-over information, a data length of the current narration voice-over information, and a total number of pieces of narration voice-over information.
  46. The method of claim 45, wherein the narrative voice-over application box comprises a narrative voice-over description box, the method further comprising:
    describing, by the narrative voice-over description box, at least one of the following items of narration information: a text encoding standard, a narrator name, a creation date, a creation time, an ownership flag of the attached visual content, a type of the narration voice-over information, an encoding standard of the narration voice-over information, and a text length of the narration voice-over information.
  47. The method of claim 45, wherein the method further comprises:
    if the narration voice-over metadata box does not exist in the visual media content at the file level, creating the narration voice-over metadata box, and describing the at least one narration voice-over information through the narration voice-over metadata box; and
    if the narration voice-over metadata box exists in the visual media content at the file level, creating the narration voice-over metadata box in the meco container box, and describing the at least one narration voice-over information through the narration voice-over metadata box.
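  By way of illustration only, a writing-side counterpart to the parsing sketch above could assemble the boxes referred to in claims 44 to 47 as follows; 'nvom', 'nvap', 'nvds' and the field layout are hypothetical placeholders rather than codes defined by this application.

    import struct

    def make_box(fourcc: bytes, payload: bytes) -> bytes:
        """Assemble a plain ISO-BMFF box: 32-bit size, 4-character type, payload."""
        return struct.pack(">I4s", 8 + len(payload), fourcc) + payload

    def make_narration_metadata_box(narrator: str, text: bytes, start_offset: int) -> bytes:
        # Hypothetical description box: narrator-name length, name, text length, text.
        name = narrator.encode("utf-8")
        description = struct.pack(">H", len(name)) + name + struct.pack(">I", len(text)) + text
        # Hypothetical application box: start position of the narration, then the description box.
        application = struct.pack(">I", start_offset) + make_box(b"nvds", description)
        return make_box(b"nvom", make_box(b"nvap", application))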
  48. The method of claim 37 or 40, wherein the text data segments are encoded using a predetermined text encoding standard, the predetermined text encoding standard comprising at least one of the following: UTF-8, UTF-16, GB2312-80, GBK, and Big5.
  49. The method of claim 38 or 39, wherein the audio segment is encoded using a preset audio coding standard, the preset audio coding standard comprising at least one of: AVS audio, MP3, AAC and WAV.
  50. A decoder, comprising a decoding module and a playing module; wherein:
    the decoding module is used for analyzing the code stream to obtain at least one narration voice-over information of the visual media content and corresponding presentation time;
    and the playing module is used for presenting the at least one narration voice-over information according to the presentation time when the visual media content is played.
  51. A decoder, comprising a memory and a processor; wherein:
    the memory for storing a computer program operable on the processor;
    the processor, when executing the computer program, is configured to perform the method of any of claims 1 to 27.
  52. An encoder, comprising a determining module, an embedding module, and an encoding module; wherein:
    the determining module is used for determining at least one narration voice-over information to be added and corresponding presenting time;
    the embedding module is used for embedding the at least one narration voice-over information and the corresponding presentation time into a media file or a bit stream of the visual media content in a preset mode under the condition that the visual media content corresponding to the at least one narration voice-over information is not changed, so as to obtain a new media file or a new bit stream;
    and the coding module is used for coding the new media file or the new bit stream to obtain a code stream.
  53. An encoder, comprising a memory and a processor; wherein:
    the memory for storing a computer program operable on the processor;
    the processor, when running the computer program, is configured to perform the method of any of claims 28 to 49.
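  By way of illustration only, the module split of claims 50 and 52 can be sketched as the following skeleton with assumed method names; the parsing, embedding, and encoding logic is deliberately elided.

    class NarrationDecoder:
        """Decoding module plus playing module, as split in claim 50."""

        def decode(self, codestream: bytes):
            """Parse the code stream into (narration voice-over info, presentation time) pairs."""
            raise NotImplementedError

        def play(self, visual_media, narrations) -> None:
            """Present each narration at its presentation time while the content plays."""
            raise NotImplementedError

    class NarrationEncoder:
        """Determining, embedding, and encoding modules, as split in claim 52."""

        def determine(self):
            """Determine the narration voice-over info to add and its presentation time."""
            raise NotImplementedError

        def embed(self, media_file: bytes, narrations) -> bytes:
            """Embed the narrations into the media file or bitstream without altering the content."""
            raise NotImplementedError

        def encode(self, new_media_file: bytes) -> bytes:
            """Encode the new media file or bitstream into the output code stream."""
            raise NotImplementedError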
  54. A computer storage medium, wherein the computer storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 27, or the method of any of claims 28 to 49.
  55. An electronic device, wherein the electronic device comprises at least an encoder according to claim 52 or 53 and/or a decoder according to claim 50 or 51.
CN202180035459.3A 2020-05-15 2021-02-05 Information processing method, encoder, decoder, and storage medium device Pending CN115552904A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US202063025742P 2020-05-15 2020-05-15
US63/025,742 2020-05-15
US202063034295P 2020-06-03 2020-06-03
US63/034,295 2020-06-03
PCT/CN2021/075622 WO2021227580A1 (en) 2020-05-15 2021-02-05 Information processing method, encoder, decoder, and storage medium device

Publications (1)

Publication Number Publication Date
CN115552904A true CN115552904A (en) 2022-12-30

Family

ID=78526399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180035459.3A Pending CN115552904A (en) 2020-05-15 2021-02-05 Information processing method, encoder, decoder, and storage medium device

Country Status (2)

Country Link
CN (1) CN115552904A (en)
WO (1) WO2021227580A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9582506B2 (en) * 2008-12-31 2017-02-28 Microsoft Technology Licensing, Llc Conversion of declarative statements into a rich interactive narrative
US20160212487A1 (en) * 2015-01-19 2016-07-21 Srinivas Rao Method and system for creating seamless narrated videos using real time streaming media
EP3316247B1 (en) * 2015-08-05 2022-03-30 Sony Group Corporation Information processing device, information processing method, and program
CN109300177B (en) * 2017-07-24 2024-01-23 中兴通讯股份有限公司 Picture processing method and device
CN110475159A (en) * 2018-05-10 2019-11-19 中兴通讯股份有限公司 The transmission method and device of multimedia messages, terminal
CN111046199B (en) * 2019-11-29 2024-03-19 鹏城实验室 Method for adding white-out to image and electronic equipment

Also Published As

Publication number Publication date
WO2021227580A1 (en) 2021-11-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination