CN112752130A - Data display method and media processing device - Google Patents

Data display method and media processing device

Info

Publication number
CN112752130A
CN112752130A
Authority
CN
China
Prior art keywords
video frame
user
data
area
displayed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911040334.0A
Other languages
Chinese (zh)
Inventor
李波
李斌斌
姚亚群
由佳礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Haisi Technology Co ltd
Original Assignee
Shanghai Haisi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Haisi Technology Co ltd filed Critical Shanghai Haisi Technology Co ltd
Priority to CN201911040334.0A
Priority to PCT/CN2020/113826 (published as WO2021082742A1)
Publication of CN112752130A

Classifications

    • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles
    • H04N21/431 Generation of visual interfaces for content selection or interaction; content or additional data rendering
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/485 End-user interface for client configuration
    • H04N21/488 Data services, e.g. news ticker
    • H04N5/278 Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A data display method and a media processing device are provided to solve the prior-art problem that the display position of displayed data is inconsistent with the viewer's focus of attention, thereby improving user experience. The method comprises the following steps: determining a user region of interest in a first video frame; determining a display area in the first video frame according to the user region of interest, wherein the display area is used for displaying data to be displayed corresponding to the first video frame; and superimposing an image corresponding to the data to be displayed on the display area. Because the display area is determined according to the user region of interest when the image corresponding to the data to be displayed is shown in the first video frame, the image can be displayed near the user region of interest, so that the display position is consistent with the user's focus of attention and user experience is improved.

Description

Data display method and media processing device
Technical Field
The present application relates to the field of media technologies, and in particular, to a data display method and a media processing apparatus.
Background
When playing a media file, a terminal device sometimes needs to display other data, such as subtitles or pictures, in addition to the video picture. Taking subtitle data as an example, displaying subtitles in a video picture is an auxiliary means of helping a viewer understand the video content. In general, subtitles are displayed at a fixed position on the screen (for example, below the screen), and the font and color of the text are also generally fixed.
With the development of media playing and display technologies, the resolution of video pictures keeps improving, and the screen size of terminal devices keeps increasing. Data show that the comfortable visual range of the human eye is 60 degrees, and the focused visual range is 20 degrees; that is, the attention range of the human eye is limited. On a large screen, a viewer who watches the subtitles below the screen while watching a movie may miss an exciting picture. This is especially true for viewers with disabilities who rely heavily on subtitles: when they watch on a larger screen, the mismatch between the subtitle position and the viewer's focus of attention causes inconvenience and degrades user experience.
In summary, the data display method in the prior art suffers from a display position that is inconsistent with the viewer's focus of attention, resulting in poor user experience.
Disclosure of Invention
The embodiments of the application provide a data display method and a media processing device, to solve the prior-art problem that the subtitle display position is inconsistent with the viewer's focus of attention, and thereby improve user experience.
In a first aspect, an embodiment of the present application provides a data display method, including the following steps: determining a user region of interest in a first video frame; determining a display area in the first video frame according to the user region of interest, wherein the display area is used for displaying data to be displayed corresponding to the first video frame; and then superimposing an image corresponding to the data to be displayed on the display area.
Wherein the data to be displayed includes at least one of subtitle data or picture data.
With the data display method provided in the first aspect, the display area is determined according to the user region of interest in the first video frame, so the image corresponding to the data to be displayed can be shown near the user region of interest. The display position is thus consistent with the user's focus of attention, which makes the data easy to notice and improves user experience.
In the data display method provided in the first aspect, the user region of interest in the first video frame may be determined in any of the following four modes.
Mode one
Determining the user region of interest in the first video frame may specifically be implemented by: analyzing the first video frame and the second video frame, and determining a region in the first video frame in which a person moves compared with the second video frame; and taking the region where the person moves as the user region of interest. The first video frame and the second video frame are obtained by decoding the same media file, and the playing time of the second video frame is earlier than that of the first video frame.
In mode one, since the focus of the human eye is usually on the moving part of the picture, data displayed near the determined region is easy for the user to notice.
Mode two
Determining the user region of interest in the first video frame may specifically be implemented by: analyzing the first video frame and the second video frame, and determining a plurality of regions in the first video frame in which people move compared with the second video frame; and taking, among the plurality of regions in which people move, the region with the largest area or the region with the largest movement amplitude as the user region of interest. The first video frame and the second video frame are obtained by decoding the same media file, and the playing time of the second video frame is earlier than that of the first video frame.
If a plurality of people move in the picture, the region in which a person's motion amplitude and area are largest is the one most easily noticed by the human eye, so the user region of interest determined by this scheme is the region the user actually attends to in the picture.
Mode three
Determining the user region of interest in the first video frame may specifically be implemented by: analyzing the first video frame and the second video frame, and determining a plurality of regions in the first video frame in which faces move compared with the second video frame; and taking the region with the largest face motion amplitude among these regions as the user region of interest. The first video frame and the second video frame are obtained by decoding the same media file, and the playing time of the second video frame is earlier than that of the first video frame.
When the user region of interest is determined in mode three, the facial motion of the persons in the picture is detected, and the region in which a face moves most between the second video frame and the first video frame is determined. Displaying the data to be displayed near this region associates the data with the corresponding person. Therefore, with mode three, the determined user region of interest is more accurate, and the displayed data can accurately follow the persons in the picture.
Mode four
Determining the user region of interest in the first video frame may specifically be implemented by: receiving coordinate information input by a camera, wherein the coordinate information is used for indicating the region the user is looking at when watching the first video frame; and determining the user region of interest according to the coordinate information. The first video frame and the second video frame are obtained by decoding the same media file, and the playing time of the second video frame is earlier than that of the first video frame.
With mode four, the user region of interest can be captured even when the picture of the first video frame is substantially unchanged (relatively still) compared with the second video frame. Of course, the scheme provided by mode four can also be applied to other scenarios, which are not described here again.
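As a minimal illustrative sketch of mode four (the coordinate format reported by the camera and the region size are assumptions, since the patent leaves both open):

```python
def region_from_gaze(coord, frame_w, frame_h, region_w=400, region_h=300):
    """Build a user region of interest (x, y, w, h) centered on the gaze
    coordinate reported by the camera, clamped to the video frame."""
    gx, gy = coord
    x = min(max(gx - region_w // 2, 0), frame_w - region_w)
    y = min(max(gy - region_h // 2, 0), frame_h - region_h)
    return (x, y, region_w, region_h)
```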
In one possible design, after determining the user interest region, the method further includes: carrying out face recognition and scene recognition on the region of interest of the user, and determining the emotion of people in the region of interest of the user and the scene of the region of interest of the user; and superposing the emotion of the person in the area of interest of the user and the emoticon corresponding to the scene of the area of interest of the user on the display area.
By adopting the scheme, the emotion of the character can be more intuitively expressed, and the user experience is further improved. Of course, the emoticons may also be displayed in the area of interest of the user or in the vicinity of the corresponding character or scene, which is not specifically limited in this embodiment of the application.
In one possible design, determining a display area in the first video frame based on the user interest area includes: determining the area of an image corresponding to the data to be displayed according to the size of the data to be displayed; selecting a plurality of candidate display areas around a user interested area, wherein the area of each candidate display area in the candidate display areas is larger than or equal to the area of an image corresponding to data to be displayed; and determining one candidate display area in the plurality of candidate display areas as the display area according to the distance between the central point of each candidate display area and the central point of the area of interest of the user and the difference arithmetic sum of the pixels in each candidate display area.
Wherein, due to different types (characters or pictures) of the data to be displayed, the size of the data to be displayed can be understood differently. For example, when the data to be displayed is text data such as subtitle data, the size of the data to be displayed may be determined according to the number of characters and the size of fonts included in the data to be displayed; when the data to be displayed is picture data, the size of the data to be displayed can be understood as the picture size.
The closer the display area is to the user region of interest (that is, the user's attention area), the more convenient it is for the user to view the subtitles; and the simpler the background color of a candidate display area and the smaller its color variation, the more convenient it is for the user to view the data to be displayed. Therefore, with this scheme, the display area is selected by jointly considering the distance from the user region of interest and the pixel difference values of the candidate display areas.
In addition, for the case that the data to be displayed is text data such as subtitle data, after determining the display area in the first video frame according to the user interest area, the method further includes: determining an average value of pixels in the display area; the reverse color of the average pixel value is taken as the display color of the data to be displayed.
By adopting the scheme, the reverse color of the pixel average value in the display area is taken as the display color of the data to be displayed, so that the color of the data to be displayed can be prevented from being mixed with the color of the display area, and the problems of unclear subtitle display and image detail shielding can be avoided.
In one possible design, in a case where the data to be displayed is text data such as subtitle data, before superimposing an image corresponding to the data to be displayed on the display area, the method further includes: analyzing the semantics of the data to be displayed, and determining key words in the data to be displayed; and determining the display mode of the keywords in the image corresponding to the data to be displayed according to a preset configuration strategy.
With this scheme, the keywords produce a visual impact on the user and remind the user to pay attention.
Specifically, determining the display mode of the keywords in the image corresponding to the data to be displayed according to a preset configuration policy may be implemented as follows: displaying the keywords in the image corresponding to the data to be displayed in an enlarged manner or with an animation effect.
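As an illustrative sketch of the enlarged-keyword display (assuming Pillow 8 or later; the font path, sizes and function name are assumptions, and the animation-effect alternative is not shown):

```python
from PIL import Image, ImageDraw, ImageFont

def render_caption(text, keyword, base_size=32, scale=1.5,
                   font_path="DejaVuSans.ttf"):
    """Render caption text with the keyword drawn in a larger font."""
    base = ImageFont.truetype(font_path, base_size)
    big = ImageFont.truetype(font_path, int(base_size * scale))
    before, found, after = text.partition(keyword)
    height = int(base_size * scale) + 8
    img = Image.new("RGBA", (1024, height), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    x = 0.0
    # Draw the text around the keyword at the base size and the keyword
    # itself (if present) enlarged, roughly bottom-aligned on one line.
    for part, font in ((before, base), (found, big), (after, base)):
        if not part:
            continue
        draw.text((x, height - font.size - 4), part, font=font,
                  fill=(255, 255, 255, 255))
        x += draw.textlength(part, font=font)
    return img
```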
In a second aspect, an embodiment of the present application provides a media processing apparatus, which includes a processor and a transmission interface; the processor is configured to call, through the transmission interface, program code stored in a memory to perform the following steps: determining a user region of interest in a first video frame; determining a display area in the first video frame according to the user region of interest, wherein the display area is used for displaying data to be displayed corresponding to the first video frame; and superimposing an image corresponding to the data to be displayed on the display area.
Wherein the data to be displayed includes at least one of subtitle data or picture data.
In one possible design, the processor is specifically configured to: analyze a first video frame and a second video frame, and determine a region in the first video frame where a person moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame; and take the region where the person moves as the user region of interest.
In another possible design, the processor is specifically configured to: analyze a first video frame and a second video frame, and determine a plurality of regions in the first video frame where people move compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame; and take, among the plurality of regions where people move, the region with the largest area or the region with the largest movement amplitude as the user region of interest.
In yet another possible design, the processor is specifically configured to: analyze a first video frame and a second video frame, and determine a plurality of regions in the first video frame where faces move compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame; and take the region with the largest face motion amplitude among the plurality of regions where faces move as the user region of interest.
In another possible design, the processor is specifically configured to: receive coordinate information input by a camera, wherein the coordinate information is used for indicating a region of interest when a user watches a first video frame; and determine the user region of interest according to the coordinate information.
In one possible design, the processor is further to: after the user interested area is determined, carrying out face recognition and scene recognition on the user interested area, and determining the emotion of people in the user interested area and the scene of the user interested area; and superposing the emotion of the person in the area of interest of the user and the emoticon corresponding to the scene of the area of interest of the user on the display area.
In one possible design, the processor is specifically configured to: determining the area of an image corresponding to the data to be displayed according to the size of the data to be displayed; selecting a plurality of candidate display areas around a user interested area, wherein the area of each candidate display area in the candidate display areas is larger than or equal to the area of an image corresponding to data to be displayed; and determining one candidate display area in the plurality of candidate display areas as the display area according to the distance between the central point of each candidate display area and the central point of the area of interest of the user and the difference arithmetic sum of the pixels in each candidate display area.
In one possible design, the processor is further to: after determining a display area in the first video frame according to the user interest area, determining an average value of pixels in the display area; the reverse color of the average pixel value is taken as the display color of the data to be displayed.
In one possible design, the processor is further to: before the image corresponding to the data to be displayed is superposed in the display area, analyzing the semantics of the data to be displayed, and determining keywords in the data to be displayed; and determining the display mode of the keywords in the image corresponding to the data to be displayed according to a preset configuration strategy.
In one possible design, the processor is specifically configured to: and displaying the keywords in an image corresponding to the data to be displayed in an enlarged manner or through an animation effect.
The media processing device provided in the second aspect may be configured to execute the data display method provided in the first aspect. For implementations and technical effects of this media processing device that are not described in detail, reference may be made to the relevant descriptions of the data display method provided in the first aspect; details are not repeated here.
In a third aspect, an embodiment of the present application further provides a media processing apparatus, where the media processing apparatus includes a determining module and a superimposing module; the determining module is used for determining a user region of interest in the first video frame; and determining a display area in the first video frame according to the user interested area, wherein the display area is used for displaying the data to be displayed corresponding to the first video frame. The superposition module is used for superposing the image corresponding to the data to be displayed in the display area.
Wherein the data to be displayed includes at least one of subtitle data or picture data.
In one possible design, the determining module is specifically configured to: analyze a first video frame and a second video frame, and determine a region in the first video frame where a person moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame; and take the region where the person moves as the user region of interest.
In another possible design, the determining module is specifically configured to: analyze a first video frame and a second video frame, and determine a plurality of regions in the first video frame where people move compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame; and take, among the plurality of regions where people move, the region with the largest area or the region with the largest movement amplitude as the user region of interest.
In yet another possible design, the determining module is specifically configured to: analyze a first video frame and a second video frame, and determine a plurality of regions in the first video frame where faces move compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame; and take the region with the largest face motion amplitude among the plurality of regions where faces move as the user region of interest.
In another possible design, the determining module is specifically configured to: receive coordinate information input by a camera, wherein the coordinate information is used for indicating a region of interest when a user watches a first video frame; and determine the user region of interest according to the coordinate information.
In one possible design, the determining module is further configured to: after the user interested area is determined, carrying out face recognition and scene recognition on the user interested area, and determining the emotion of people in the user interested area and the scene of the user interested area; the overlay module is further configured to: and superposing the emotion of the person in the area of interest of the user and the emoticon corresponding to the scene of the area of interest of the user on the display area.
In one possible design, the determining module is specifically configured to: determining the area of an image corresponding to the data to be displayed according to the size of the data to be displayed; selecting a plurality of candidate display areas around a user interested area, wherein the area of each candidate display area in the candidate display areas is larger than or equal to the area of an image corresponding to data to be displayed; and determining one candidate display area in the plurality of candidate display areas as the display area according to the distance between the central point of each candidate display area and the central point of the area of interest of the user and the difference arithmetic sum of the pixels in each candidate display area.
In one possible design, the determining module is further configured to: after determining a display area in the first video frame according to the user interest area, determining an average value of pixels in the display area; the reverse color of the average pixel value is taken as the display color of the data to be displayed.
In one possible design, the determining module is further configured to: before the superposition module superposes the image corresponding to the data to be displayed in the display area, analyzing the semantics of the data to be displayed and determining keywords in the data to be displayed; and determining the display mode of the keywords in the image corresponding to the data to be displayed according to a preset configuration strategy.
Specifically, the determining module is specifically configured to: and displaying the keywords in an image corresponding to the data to be displayed in an enlarged manner or through an animation effect.
The media processing device provided in the third aspect may be configured to execute the data display method provided in the first aspect. For implementations and technical effects of this media processing device that are not described in detail, reference may be made to the relevant descriptions of the data display method provided in the first aspect; details are not repeated here.
In a fourth aspect, the present application provides a computer-readable storage medium storing program instructions that, when executed on a computer or processor, cause the computer or processor to perform the method of the first aspect or any one of the implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed on a computer or processor, causes the computer or processor to perform the method of the first aspect or any one of the implementations of the first aspect.
Drawings
Fig. 1 is a schematic structural diagram of a media processing device provided in the prior art;
fig. 2 is a schematic flowchart of a data display method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a region of interest of a user according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a first display effect provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a second video frame and a user interested region provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a first media processing device according to an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating a second display effect provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of a second media processing device according to an embodiment of the present application;
fig. 9 is a schematic diagram illustrating a third display effect provided by the embodiment of the present application;
fig. 10 is a schematic structural diagram of a third media processing device according to an embodiment of the present application;
fig. 11 is a schematic diagram illustrating a fourth display effect provided by the embodiment of the present application;
fig. 12 is a schematic diagram illustrating a fifth display effect provided by an embodiment of the present application;
fig. 13 is a schematic structural diagram of a fifth media processing device according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a media processing device according to an embodiment of the present disclosure.
Detailed Description
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. A method, system, article, or apparatus is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, system, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
Data show that the comfortable visual range of the human eye is 60 degrees, and the focused visual range is 20 degrees. On a large screen, if subtitle data is displayed at a fixed position on the screen, the display position easily falls outside the focused visual range of the human eye, or even outside the comfortable visual range. Such inconsistency between the display position and the viewer's focus of attention gives the user a poor viewing experience.
In the following, the subtitle display technology in the prior art will be described in detail by taking data to be displayed as subtitle data as an example.
Fig. 1 is a schematic structural diagram of a media processing device provided in the prior art. Optionally, the media processing device may perform playback-related processing on a multimedia file, and may also be used to play a media file. The media processing apparatus shown in fig. 1 includes a parser, an audio decoder, a synchronization control module, a video decoder, a video post-processing module, an image composition module, and a subtitle rendering module. Each of these modules may be implemented by hardware, software, or a combination of hardware and software. For example, the video decoder, the subtitle rendering module and the video post-processing module may be implemented by hardware logic; modules such as motion region analysis and display policy processing (introduced below) may be implemented by software code running on a hardware processor; and other modules, such as the audio decoder, may be implemented by software.
Illustratively, a media file in a format of mp4, etc. is parsed by a parser to obtain three parts, namely an audio coding file, a video coding file and subtitle data. The audio coding file may be audio Elementary Stream (ES) data, and the video coding file may be video ES data. The audio coding file is decoded by an audio decoder to obtain audio data; performing subtitle rendering processing on the subtitle data to obtain a subtitle image; the video coding file is processed by a video decoder to obtain a video frame, and then the video frame is processed by a video post-processing module and then is synthesized with the subtitle image. In addition, the synchronous control module is also used for synchronizing the image obtained by video post-processing with the audio data, so that the output of the audio output interface is synchronized with the output of the video output interface, namely, the audio output by the audio output interface is synchronized with the video picture output by the video output interface.
The media processing device may be, for example, a set top box, a smart television, a smart large screen, a mobile phone, a tablet computer, or another device with a display function, or a processor chip in a set top box, a display screen, a smart large screen, a television, a mobile phone, or another device with a display function; the processor chip may be, for example, a system on chip (SoC) or a baseband chip.
When a media file is played by the media processing device shown in fig. 1, the subtitle data is usually rendered at a fixed position on the screen (for example, below the screen), and the fonts and colors are also usually fixed, so the display form of subtitles is monotonous. It is therefore difficult for a viewer to attend to both the focus of attention in the video picture and the subtitles. Moreover, if the background color at the fixed position (for example, below the screen) is close to the color of the subtitles, the viewer may be unable to see the subtitles clearly. Both problems give the user a poor viewing experience.
In the embodiment of the present application, the processing of the video coding file and the data to be displayed (for example, subtitle data) is mainly taken as an example for description, and the processing of the audio coding file can be analogized, and is not explained in detail.
In order to improve user experience and solve the problem that a display position of data to be displayed is inconsistent with a focus of attention of a viewer in the prior art, embodiments of the present application provide an exemplary data display method and a media processing device. In an alternative case, the device may be an integrated chip.
Hereinafter, embodiments of the present application will be described in detail with reference to the accompanying drawings.
An embodiment of the present application provides a data display method, as shown in fig. 2, the data display method includes the following steps.
S201: a user region of interest in a first video frame is determined.
In a plurality of video frames obtained after decoding a video coding file, subtitles may need to be added to all the video frames, or subtitles may need to be added to only a part of the video frames. The first video frame is a video frame which needs to be added with subtitles in a plurality of video frames.
The user interest region in the first video frame is a region of interest when the user views the first video frame. In a specific implementation, due to the difference of the pictures of the first video frame, the user interested region can have different understandings.
For example, if only one person moves in the first video frame compared with the video frame before the first video frame (hereinafter referred to as the second video frame), the user interested region may be a region where the moving person is located, as shown in fig. 3, a dashed frame is a position where the person is located in the second video frame, and a solid frame is a position where the person is located in the first video frame, and then the position of the solid frame in the first video frame may be regarded as the user interested region.
For example, if two people are present in the first video frame and the second video frame, the user interested region may be a region where a person with a larger action amplitude is present in the first video frame compared with the second video frame.
For another example, if the faces of two people are shown in close-up in the first video frame and the second video frame, the user region of interest may be the region of the person whose face moves with the larger amplitude.
For another example, if the first video frame and the second video frame have only slight or no change, the user interest area changes according to the aesthetic, habit, personal preference, etc. of the user.
S202: a display area in the first video frame is determined based on the user region of interest.
The display area is used for displaying the data to be displayed corresponding to the first video frame. Specifically, the data to be displayed may be text data such as subtitle data, or may be picture data. For example, when a video picture is played, subtitle display may be performed, and at this time, data to be displayed is subtitle data; for another example, when a video picture is played, text advertisements can be displayed in the picture, and the data to be displayed is advertisement data at this time; for another example, when a video frame is played, another picture (for example, a picture advertisement or a picture related to video content) may be displayed in the frame, and at this time, the data to be displayed is picture data.
After determining the user region of interest, a display area in the first video frame may be determined based on the user region of interest. The display area is typically near the user area of interest so that the user can conveniently see the data to be displayed in the display area while focusing on the picture of the user area of interest.
Specifically, in S202, determining a display area in the first video frame according to the user interest area may be implemented as follows: determining the area of an image corresponding to the data to be displayed according to the size of the data to be displayed; selecting a plurality of candidate display areas around a user interested area, wherein the area of each candidate display area in the candidate display areas is larger than or equal to the area of an image corresponding to data to be displayed; and determining one candidate display area in the plurality of candidate display areas as the display area according to the distance between the central point of each candidate display area and the central point of the area of interest of the user and the difference arithmetic sum of the pixels in each candidate display area.
Wherein, the arithmetic sum of the difference values of the pixels in each candidate display area can be understood as follows: the candidate display area includes a plurality of pixel points, and each pixel point can be represented by a set of three primary colors, i.e., red, green, blue (RGB). For a certain candidate display area, the difference between the RGB of each pixel point and that of the previous pixel point can be calculated, and adding these differences gives the difference arithmetic sum of the pixels in the candidate display area. For example, if a candidate area includes 1024 × 1024 pixels, the difference between the RGB of the second pixel and that of the first pixel is calculated, then the difference between the RGB of the third pixel and that of the second pixel, and so on, up to the difference between the RGB of the (1024 × 1024)-th pixel and that of the (1024 × 1024 - 1)-th pixel. Adding up the calculated differences yields the difference arithmetic sum of the pixels in the candidate display area.
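As an illustrative aid, this difference arithmetic sum could be computed as in the following sketch (assuming 8-bit RGB frames held as NumPy arrays; the helper name is an assumption):

```python
import numpy as np

def pixel_difference_sum(region: np.ndarray) -> int:
    """Arithmetic sum of the RGB differences between each pixel and the
    previous pixel, scanning the candidate display area in raster order.

    region: H x W x 3 array of uint8 pixel values.
    """
    # Flatten to a sequence of pixels in raster (row-major) order.
    flat = region.reshape(-1, 3).astype(np.int32)
    # Sum |pixel[i] - pixel[i-1]| over all positions and channels.
    return int(np.abs(np.diff(flat, axis=0)).sum())
```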
The area of each candidate display region is greater than or equal to the area of the image corresponding to the data to be displayed, so that each candidate display region has enough space to display the image corresponding to the data to be displayed.
Due to the different types (text or pictures) of the data to be displayed, the size of the data to be displayed can be understood differently. For example, when the data to be displayed is text data such as subtitle data, the size of the data to be displayed may be determined according to the number of characters and the size of fonts included in the data to be displayed; when the data to be displayed is picture data, the size of the data to be displayed can be understood as the picture size.
In the embodiment of the present application, the reason for selecting the display area according to the distance between the center point of the candidate display area and the center point of the user region of interest, together with the difference arithmetic sum of the pixels in the candidate display area, is mainly the following: first, the closer the display area is to the user region of interest (that is, the user's attention area), the more convenient it is for the user to view the data to be displayed; second, the simpler the background of a candidate display area and the smaller its color variation, the more convenient it is for the user to view the data to be displayed. Thus, the display area may be selected by taking into account both the distance from the user region of interest and the pixel difference values within each area.
In particular, in the case where the data to be displayed is text data such as subtitle data, the manner of determining the display area may be understood as follows: first, the area of the image corresponding to the data to be displayed (i.e., the size of the display area required for displaying the data to be displayed) may be determined according to the data to be displayed and the preset font size. Then, several candidate display regions are selected around the user interested region, for example, four regions respectively located at the upper left corner, the lower left corner, the upper right corner and the lower right corner of the user interested region can be selected as the candidate display regions. The area of each candidate display area is larger than or equal to the area of the image corresponding to the data to be displayed obtained through calculation. And then, comprehensively considering the distance between each candidate display area and the region of interest of the user and the pixel difference value of each candidate display area, and selecting one candidate display area as the display area.
Illustratively, for the user region of interest shown in fig. 3, the selected display area may be as shown in example (b) of fig. 4, where the subtitle data "master, go" is displayed at the upper right corner of the user region of interest. In addition, example (a) of fig. 4 shows the subtitle display mode in the prior art. As can be seen from the comparison between examples (a) and (b), with the data display method provided in this embodiment of the application, the display position of the subtitle data is closer to the user region of interest, so that the user can attend to both the subtitles and the video picture while watching, improving user experience.
In addition, for the case that the data to be displayed is text data such as subtitle data, after determining the display area, the method shown in fig. 2 may further include: determining an average value of pixels in the display area; the reverse color of the average pixel value is taken as the display color of the data to be displayed.
By taking the reverse color of the pixel average value in the display area as the display color of the data to be displayed, the color of the data to be displayed is kept distinct from the color of the display area, which avoids unclear text display and occlusion of image details. Illustratively, black has pixel value 0 and white has pixel value 255; if the average value of the pixels in the display area is determined to be 50, the pixel value of the display color of the data to be displayed may be 255 - 50 = 205.
The reverse color of the average value of the pixels within the display area can be understood as follows: as mentioned above, each pixel point can be represented by RGB. For the plurality of pixel points included in the display area, the average R, G and B values can be obtained respectively. The average RGB value is then subtracted from the maximum pixel value to obtain the reverse color of the pixel average. The maximum pixel value is determined by the bit width; for example, if the display system adopts an 8-bit width, the maximum pixel value is 2^8 - 1 = 255.
Illustratively, if the average RGB value of the display area is R = 10, G = 20, B = 30 and the display system is 8-bit wide, the reverse color of the pixel average is the color corresponding to R = 245, G = 235, B = 225.
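A minimal sketch of this reverse-color computation (assuming an 8-bit display system and a NumPy region; the function name is illustrative):

```python
import numpy as np

def reverse_of_average(region: np.ndarray, bit_width: int = 8):
    """Return the reverse color of the region's average pixel value.

    region: H x W x 3 array of uint8 RGB values.
    """
    max_value = (1 << bit_width) - 1          # 2^8 - 1 = 255 for 8-bit
    avg = region.reshape(-1, 3).mean(axis=0)  # per-channel average R, G, B
    return tuple(int(round(max_value - c)) for c in avg)

# Example from the text: averages R = 10, G = 20, B = 30
# give the reverse color (245, 235, 225).
```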
One specific example of determining the display area is given below by way of example. Taking the user interested area as a rectangle as an example, correspondingly, the user interested area can be represented by four parameters, namely x, y, w and h, wherein x represents an abscissa of a vertex of the user interested area in the first video frame, y represents an ordinate of the vertex in the first video frame, w represents a width of the user interested area, and h represents a height of the user interested area. It should be noted that the origin of coordinates of the coordinate system of the vertex is a vertex of the first video frame, and for example, the vertex at the upper left corner of the first video frame may be taken as the origin of coordinates, and then the meaning of x, y, w, and h may be as shown in fig. 5.
Specifically, the step of determining the display area may be as follows.
(1) After the region of interest of the user, the width and the height of the first video frame, the picture of the first video frame and the data to be displayed are obtained, the area S required for displaying the data to be displayed can be determined according to the preset font size and the data to be displayed.
(2) Four regions S1, S2, S3 and S4 of area S are selected around the user's region of interest (up, down, left and right).
(3) The distances between the center T of the user region of interest and the centers of S1, S2, S3 and S4 are calculated, resulting in L1, L2, L3 and L4, respectively.
(4) The texture complexity of the pictures in areas S1, S2, S3 and S4 is calculated, that is, the difference arithmetic sum of the pixels in each of S1, S2, S3 and S4, giving W1, W2, W3 and W4 respectively.
(5) A display area is selected. Specifically, a weighting factor Yn is calculated for each region using the formula Yn = Ln × A + Wn × B, and the region corresponding to the smallest weighting factor Yn among S1, S2, S3 and S4 is taken as the display area Sx. A and B are preset coefficients, n is an index with values 1 to 4, Ln is the corresponding one of L1, L2, L3 and L4 calculated in step (3), and Wn is the corresponding one of W1, W2, W3 and W4 calculated in step (4).
(6) The pixel average value of the Sx area is calculated, and the reverse color of this average is taken as the display color of the data to be displayed.
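The six steps above could be sketched as follows, reusing the pixel_difference_sum and reverse_of_average helpers sketched earlier (the coefficient values A and B are illustrative assumptions, since the patent only states that they are preset):

```python
import numpy as np

def choose_display_area(frame, roi, text_w, text_h, A=1.0, B=0.001):
    """Steps (1)-(6): pick a display area Sx around the region of interest.

    frame:  H x W x 3 uint8 video frame.
    roi:    (x, y, w, h) of the user region of interest.
    text_w, text_h: size of the rendered image of the data to be
        displayed (derived from the text and a preset font size).
    Returns ((sx, sy, sw, sh), display_color) or None if nothing fits.
    """
    H, W = frame.shape[:2]
    x, y, w, h = roi
    # Step (2): candidate regions above, below, left and right of the ROI.
    candidates = [
        (x, y - text_h, text_w, text_h),   # above
        (x, y + h, text_w, text_h),        # below
        (x - text_w, y, text_w, text_h),   # left
        (x + w, y, text_w, text_h),        # right
    ]
    cx, cy = x + w / 2.0, y + h / 2.0      # center T of the ROI
    best, best_score = None, None
    for sx, sy, sw, sh in candidates:
        if sx < 0 or sy < 0 or sx + sw > W or sy + sh > H:
            continue                       # candidate leaves the frame
        region = frame[sy:sy + sh, sx:sx + sw]
        Ln = np.hypot(sx + sw / 2.0 - cx, sy + sh / 2.0 - cy)  # step (3)
        Wn = pixel_difference_sum(region)                      # step (4)
        Yn = Ln * A + Wn * B               # step (5): weighting factor
        if best_score is None or Yn < best_score:
            best, best_score = (sx, sy, sw, sh), Yn
    if best is None:
        return None
    sx, sy, sw, sh = best
    color = reverse_of_average(frame[sy:sy + sh, sx:sx + sw])  # step (6)
    return best, color
```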
S203: and superposing the image corresponding to the data to be displayed in the display area.
As described in the media processing apparatus shown in fig. 1, the subtitle data is rendered to obtain a subtitle image, and then is output after being image-synthesized with a video frame. Similarly, for the data to be displayed, no matter whether the data to be displayed is text or pictures, the data to be displayed can be rendered to form an image and superimposed on the display area in the first video frame (i.e., the rendered image and the first video frame are subjected to image synthesis).
Specifically, the step of rendering the data to be displayed is similar to the step of rendering the subtitles in the prior art, except that in the method shown in fig. 2, the data to be displayed needs to be rendered in the display area determined in S202. It should be noted that if the data to be displayed is text data such as subtitle data, the font color and the like can be determined according to the above-mentioned inverse color manner. In addition, the font type of the characters in the image corresponding to the data to be displayed can be set according to the requirement.
The above is a description of the flow of the entire data display method. As described above, there are various ways to determine the region of interest of the user in S201. In the following, taking the data to be displayed as subtitle data as an example, several specific ways of determining the region of interest of the user are given.
Mode one
In mode one, determining the user region of interest in the first video frame may be implemented as follows: analyzing the first video frame and the second video frame, and determining a region in the first video frame in which a person moves compared with the second video frame; and taking the region where the person moves as the user region of interest.
The first video frame and the second video frame are obtained by decoding the media file, and the playing time of the second video frame is earlier than that of the first video frame.
In practical applications, the operation of decoding the video coding file can be implemented by the video decoder in fig. 1. Specifically, a plurality of video frames are obtained after decoding the video coding file; the first video frame and the second video frame are two of these frames, and the playing time of the second video frame is earlier than that of the first video frame. Illustratively, the second video frame and the first video frame may be two frames adjacent in playing time.
Mode one is applicable to a scene in which there is only one person in each of the first video frame and the second video frame. If there is only one person in the video picture, the user's focus of attention follows that person's movement, so the region where the person moves may be used as the user region of interest. The user region of interest determined in mode one may be as shown in fig. 3: the dashed frame is the position of the person in the second video frame, and the solid frame is the position of the person in the first video frame, so the position of the solid frame in the first video frame can be regarded as the user region of interest. In a specific implementation, the user region of interest can be represented by the four parameters x, y, w and h.
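A minimal sketch of this motion analysis (an assumption-laden stand-in using OpenCV 4 frame differencing; the threshold and minimum contour area are illustrative):

```python
import cv2

def motion_region(first_frame, second_frame, min_area=500):
    """Return (x, y, w, h) of the region where a person moved in
    first_frame relative to the earlier second_frame, or None."""
    # Difference of grayscale frames highlights the pixels that moved.
    g1 = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(second_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g1, g2)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)  # merge nearby motion pixels
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    moving = [c for c in contours if cv2.contourArea(c) >= min_area]
    if not moving:
        return None
    # With a single moving person this is the user region of interest;
    # with several moving regions, taking the largest matches mode two.
    return cv2.boundingRect(max(moving, key=cv2.contourArea))
```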
In a specific implementation, a motion region analysis module may be added to the media processing device shown in fig. 1 to implement the method for determining a region of interest of a user, and a display policy processing module may be added to the media processing device shown in fig. 1 to determine a display region. Then, a schematic structural diagram of a media processing device provided in an embodiment of the present application may be as shown in fig. 6. In the media processing device shown in fig. 6, the motion area analysis module determines parameters x, y, w, h (i.e., motion coordinates) of a region of interest of a user, the display policy processing module determines a display area according to the output x, y, w, h of the motion area analysis module and other information (e.g., the video width and height of the first video frame), and the subtitle rendering module can perform subtitle rendering in the display area.
The role of the video width and height of the first video frame in determining the display area can be understood as follows: the video width and height are used when determining the candidate display areas, because each candidate display area needs a certain area to display the image corresponding to the data to be displayed and must not extend beyond the frame. For example, if the user region of interest is at the upper right corner of the picture of the first video frame and there is not enough space on its right or upper side, it can be determined from the video width and height that no candidate display area should be selected on the right or upper side of the user region of interest, so as to prevent candidate display areas from exceeding the video picture of the first video frame.
It should be noted that only the processing of the video coding file and the subtitle data is shown in the media processing apparatus shown in fig. 6, and the modules related to the audio processing in the media processing apparatus are not shown in fig. 6.
In mode one, the user region of interest is determined from motion. Since the focus of the human eye is usually on the moving part of the picture, subtitles displayed near this region are easy for the user to notice.
Mode two
In the second mode, determining the user interest region in the first video frame may be implemented as follows: analyzing the first video frame and the second video frame, and determining a plurality of areas in the first video frame, in which people move compared with the second video frame; and taking the area with the largest area or the area with the largest movement amplitude of the person in the plurality of areas in which the person moves as the user interested area. The first video frame and the second video frame are obtained by decoding the media file, and the playing time of the second video frame is earlier than that of the first video frame.
Mode two is similar to mode one, except that several people move. Because human eyes are most easily drawn to the person whose motion has the largest amplitude or covers the largest area, that area can be selected as the user region of interest.
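As a sketch, once per-person motion boxes are available (for example, the unmerged contour boxes from the frame-differencing example above), the selection by area is a one-liner; selecting by motion amplitude would require tracking between frames and is omitted here as an assumption.

```python
# Mode two sketch: keep the largest of several motion boxes as the user
# region of interest. Each box is an (x, y, w, h) tuple.
def largest_motion_region(boxes):
    return max(boxes, key=lambda b: b[2] * b[3])

# e.g. largest_motion_region([(10, 10, 50, 80), (200, 40, 120, 160)])
# returns (200, 40, 120, 160), the box covering the biggest area.
```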
Mode three
In mode three, determining the user region of interest in the first video frame may be implemented as follows: analyze the first video frame and the second video frame, and determine a plurality of areas in the first video frame in which a human face moves compared with the second video frame; then take, among these areas, the area with the largest face motion amplitude as the user region of interest. The first video frame and the second video frame are obtained by decoding the media file, and the playing time of the second video frame is earlier than that of the first video frame.
Mode three is suitable for a scene in which several characters hold a conversation. In such a scene, artificial intelligence (AI) analysis may be performed on the characters' faces; for example, a multi-layer neural-network face recognition model detects face movement in the picture by comparing the faces in the first video frame with those in the second video frame, determines the area where the face movement is largest, and displays the subtitle near that area. The subtitle is thereby bound to the speaking character, and that character's words are displayed as a subtitle around them. With mode three, the determined user region of interest is more accurate, and the displayed subtitles follow the on-screen characters closely.
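A minimal sketch of mode three follows. A stock Haar-cascade face detector stands in for the multi-layer neural-network face recognition model named above (an assumption for illustration); faces are paired with their nearest counterpart in the earlier frame, and the face that moved farthest wins.

```python
import cv2

# Haar cascade as an assumed stand-in for the neural-network face model.
_face_det = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def _face_centers(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_det.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [(x + w / 2.0, y + h / 2.0, (x, y, w, h)) for x, y, w, h in faces]

def most_active_face(first_frame, second_frame):
    """Return the (x, y, w, h) box of the face with the largest movement
    between the two frames, or None if no face is found."""
    now = _face_centers(first_frame)
    before = _face_centers(second_frame)
    if not now or not before:
        return None
    best_box, best_move = None, -1.0
    for cx, cy, box in now:
        # displacement to the nearest face in the earlier frame
        move = min(((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
                   for px, py, _ in before)
        if move > best_move:
            best_box, best_move = box, move
    return best_box
```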
When subtitles are displayed after the user region of interest has been determined in mode three, the effect may be as shown in example b of fig. 7, where what each character says is displayed around that character as a subtitle. Example a of fig. 7 shows the subtitle display of the related art. Comparing examples a and b shows that, with the mode three method of determining the user region of interest, the displayed subtitles follow the on-screen characters accurately, making them easier to understand and improving the user experience.
It should be noted that, for a video picture in which several characters hold a dialog, what each character says can be treated as one set of subtitle data, and each set can be displayed according to the data display method provided in the embodiments of the present application. In practice, a set of subtitle data is not displayed in only one video frame; it is configured with a certain display time. That is, after the display area is determined by the method shown in fig. 2, the set of subtitle data is displayed in the same display area in a number of video frames following the first video frame. For a video picture with several characters in dialog, one video frame may therefore contain several sets of subtitles (for example, example b of fig. 7).
For example, in example b of fig. 7, assume that video frame 1, video frame 2, video frame 3, ..., video frame 64 are 64 video frames played consecutively in time. For the set of subtitle data "the master gets back and the foot station gets soft", the display area can be determined by comparing video frame 2 with video frame 1. If the display time of this set of subtitle data lasts 63 frames, it is displayed in the corresponding display area in video frames 2 to 64.
When another set of subtitle data, "boss, i will immediately file up, you will wait a little", needs to be added at video frame 55, its display area can be determined by comparing video frame 55 with video frame 54. If the display time of this set lasts 10 frames, it is displayed in the corresponding display area in each of video frames 55 to 64. For video frames 55 to 64, the two sets of subtitle data are then displayed simultaneously, as shown in example b of fig. 7.
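This timing can be sketched with a simple schedule, assuming each group's display area is fixed once at its start frame; the dataclass layout and names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class SubtitleGroup:
    text: str
    start_frame: int      # frame at which the display area was determined
    duration: int         # configured display time, in frames
    area: tuple           # (x, y, w, h), fixed for the whole duration

def groups_to_draw(groups, frame_no):
    """All subtitle groups visible in a given frame."""
    return [g for g in groups
            if g.start_frame <= frame_no < g.start_frame + g.duration]

# With group A starting at frame 2 for 63 frames and group B starting at
# frame 55 for 10 frames, frames 55-64 draw both groups, as in fig. 7 (b).
groups = [SubtitleGroup("caption A", 2, 63, (40, 400, 320, 64)),
          SubtitleGroup("caption B", 55, 10, (360, 120, 320, 64))]
assert len(groups_to_draw(groups, 60)) == 2
```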
In a specific implementation, an AI character recognition module may be added to the media processing device shown in fig. 1 to determine the user region of interest as in mode three, and a display policy processing module may be added to determine the display area. A schematic structural diagram of the resulting media processing device is shown in fig. 8: the AI character recognition module determines the parameters x, y, w, h of the user region of interest (i.e., the character coordinates); the display policy processing module determines the display area from these parameters; and the subtitle rendering module renders subtitles in the display area.
It should be noted that fig. 8 shows only the processing of the video coding file and the subtitle data; the modules related to audio processing are not shown.
Mode four
In mode four, determining the user region of interest in the first video frame may be implemented as follows: receive coordinate information input by a camera, the coordinate information indicating the region the user is watching in the first video frame; and determine the user region of interest according to the coordinate information. Illustratively, the camera may be an external camera.
Mode four is applicable to any scene, and is especially suitable when the picture of the first video frame is basically unchanged (relatively still) compared with the picture of the second video frame. In that case the user region of interest varies with the user's taste, habits, personal preferences, and similar factors, so it can be captured by the camera, and the region the user is actually watching is taken as the user region of interest.
In mode four, the camera has an eye-tracking function and can capture the area the eyeballs are attending to. It should be understood that the camera may be an external camera or a camera integrated in the display device; in a possible implementation, the eyeball attention area may also be captured by another eye-tracking device.
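A sketch of mode four under a purely hypothetical eye-tracker interface: the device reports a gaze point in frame coordinates, and a fixed-size box around it (the size is chosen arbitrarily here) becomes the user region of interest.

```python
# Turn a gaze sample from the (hypothetical) eye-tracking camera into a
# user region of interest clamped inside the video picture.
def roi_from_gaze(gaze_xy, frame_w, frame_h, box=200):
    gx, gy = gaze_xy
    x = min(max(int(gx) - box // 2, 0), frame_w - box)
    y = min(max(int(gy) - box // 2, 0), frame_h - box)
    return x, y, box, box
```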
When subtitles are displayed after the user region of interest has been determined in mode four, the effect can be as shown in fig. 9: the caption data (i.e., "fifteen minutes later") is displayed in the eyeball attention area, where it is convenient for the user to view.
In a specific implementation, an eyeball tracking module may be added to the media processing device shown in fig. 1 to determine the user region of interest as in mode four, and a display policy processing module may be added to determine the display area. A schematic structural diagram of the resulting media processing device is shown in fig. 10: the eyeball tracking module determines the eyeball attention area (parameters x, y, w, h) as the user region of interest; the display policy processing module determines the display area from the parameters output by the eyeball tracking module; and the subtitle rendering module renders subtitles in the display area.
It should be noted that fig. 10 shows only the processing of the video coding file and the subtitle data; the modules related to audio processing are not shown.
Of course, in practical applications the manner of determining the user region of interest is not limited to the four listed above. For example, if no person appears, or no person moves, in the first video frame and the second video frame, the area in which the picture of the first video frame changes compared with the second video frame can be found by comparing the two frames and taken as the user region of interest. The embodiments of the present application do not limit the specific manner of determining the user region of interest.
In addition, in the data display method shown in fig. 2, after the user region of interest is determined in S201, face recognition and scene recognition may be performed on the user region of interest to determine the emotion of the person in it and the scene it depicts; an emoticon corresponding to that emotion and scene is then superimposed on the display area.
Specifically, an AI neural network model may be used to analyze the current scene (e.g., rain, snow, cloudy, sunny, city, countryside) and the character's emotion (e.g., happiness, anger, sadness, joy), match a corresponding emoticon (emoji), and select a font and color that express the current character's emotion, displayed following the character in motion; a minimal label-to-emoticon mapping is sketched after the examples below.
For example, when the AI analyzes the character's face as smiling and outputs "happy", a happy emoticon can pop up near the character; when it outputs "angry", an angry emoticon can pop up near the character; and when it outputs "sad", a sad emoticon can pop up near the character. Likewise, when the AI analyzes the scene as "raining", a rain emoticon can pop up near the scene; when the scene is "sunny", a sun emoticon; and when the scene is "night", a night emoticon. (In the published text, each emoticon appears as an inline image.)
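The mapping from recogniser labels to emoticons might look as follows; the label vocabulary, glyph choices, and the emotion-over-scene precedence are assumptions for illustration only.

```python
# Emotion takes precedence over scene when both are recognised; this is
# an assumed policy, not one stated in the application.
EMOTION_EMOJI = {"happy": "😀", "angry": "😠", "sad": "😢"}
SCENE_EMOJI = {"raining": "🌧", "snowing": "❄", "sunny": "☀", "night": "🌙"}

def pick_emoticon(emotion=None, scene=None):
    """Return the emoticon to pop up near the person or scene, if any."""
    if emotion in EMOTION_EMOJI:
        return EMOTION_EMOJI[emotion]
    return SCENE_EMOJI.get(scene)
```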
Illustratively, after the above face recognition and scene recognition are adopted, the display effect may be as shown in fig. 11: a crying icon is displayed in the display area of the data to be displayed, "door locked, pound not on", to express the character's emotion. As can be seen from fig. 11, this expresses the character's emotion more intuitively and further improves the user experience.
The above example describes superimposing the emoticon on the display area. Of course, the emoticon may also be superimposed in the user region of interest, or displayed near the corresponding character or scene; this is not specifically limited in the embodiments of the present application.
In a specific implementation, expression recognition and scene recognition functions may be added to the AI character recognition module of the media processing device shown in fig. 8 to implement the above scheme.
In addition, when the data to be displayed is text, the semantics of the data to be displayed can be analyzed and the keywords in it determined; the display mode of the keywords in the image corresponding to the data to be displayed is then determined according to a preset configuration policy. Specifically, the keywords may be displayed in bold or with an animation effect in the image corresponding to the data to be displayed.
Specifically, a neural network module can be used to analyze and detect the semantics and keywords of the data to be displayed, and the display policy that best reflects the subtitle semantics is used for rendering, giving the display visual impact and improving the user experience.
For example, the preset configuration policy may be: verb and onomatopoeia keywords (such as "lifesaving", "rolling", "bang" ...) may be marked in red and bold, given an animation effect, and so on; noun keywords may be replaced by small pictures, e.g., "telephone" by a telephone icon, "football" by a football icon, "umbrella" by an umbrella icon, "rose" by a rose icon, and so on. (Each icon appears as an inline image in the published text.) A sketch of such a policy follows.
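In the sketch below, the keyword lists, style fields, and icon glyphs are illustrative assumptions rather than the application's actual configuration policy.

```python
# Verb/onomatopoeia keywords get red bold plus an animation tag; noun
# keywords are swapped for small pictures (icon glyphs stand in for the
# inline images of the published text).
ACTION_WORDS = {"help", "bang", "scram"}
NOUN_ICONS = {"telephone": "☎", "football": "⚽",
              "umbrella": "☂", "rose": "🌹"}

def style_tokens(subtitle_text):
    styled = []
    for token in subtitle_text.split():
        key = token.strip(",.!?").lower()
        if key in ACTION_WORDS:
            styled.append({"text": token, "bold": True,
                           "color": "red", "animate": "burst"})
        elif key in NOUN_ICONS:
            styled.append({"text": NOUN_ICONS[key]})
        else:
            styled.append({"text": token})
    return styled
```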
Illustratively, after the above keyword analysis is adopted, the display effect may be as shown in fig. 12: the two keywords "pop" and "life" are displayed in bold, with an explosion animation effect added. As can be seen from fig. 12, keywords treated this way have visual impact and remind the user to pay attention.
In a specific implementation, a keyword analysis module may be added to the media processing apparatus shown in fig. 1 to implement the above scheme, as shown in fig. 13: an AI comprehensive identification module determines the user region of interest, the keyword analysis module analyzes the data to be displayed for keywords, the display policy processing module determines the display area, and the subtitle rendering module renders the keyword effects, font size, font color, and so on. Likewise, fig. 13 shows only the processing of the video encoded file and the data to be displayed; the modules related to audio processing are not shown.
Fig. 6, fig. 8, fig. 10, and fig. 13 are schematic structural diagrams of media processing devices according to embodiments of the present application. Each module may be implemented by software, hardware, or a combination of the two. In particular, the motion region analysis module, the AI character recognition module, the eyeball tracking module, and the AI comprehensive identification module may all be integrated in the media processing device (for example, by software), or only some of them may be integrated as needed. The embodiments of the present application do not limit this.
In summary, with the data display method provided by the embodiments of the present application, the display area of the data to be displayed is determined according to the user region of interest in the first video frame, so the data to be displayed can be shown near the user region of interest. The display position is then consistent with the user's focus of attention, which improves the user experience.
Based on the same inventive concept, the present application provides another exemplary media processing device, which can be used to execute the data display method shown in fig. 2. Illustratively, the media processing device may be a processor chip in a set-top box, a display screen, a smart large screen, a TV, a mobile phone, or another device with a display function; for example, it may be an SoC or a baseband chip.
As shown in fig. 14, the media processing device 1400 includes a processor 1401 and a transmission interface 1402. The transmission interface 1402 may be a one-way or two-way communication interface and may be used, for example, to send and receive messages to establish a connection, to acknowledge and exchange any other information related to the communication link, and/or to transmit data such as image-processed picture data. The transmission interface may include a transmission interface and a reception interface and may be any type of interface according to any proprietary or standardized interface protocol, such as the High Definition Multimedia Interface (HDMI), the Mobile Industry Processor Interface (MIPI), the MIPI-standardized Display Serial Interface (DSI), the Video Electronics Standards Association (VESA)-standardized Embedded Display Port (eDP), the Display Port (DP), or the V-By-One interface (a digital interface standard developed for image transmission), as well as various wired or wireless interfaces and optical interfaces.
Specifically, the processor 1401 is configured to call program code stored in the memory through the transmission interface 1402 to execute the data display method shown in fig. 2.
In one possible implementation, the media processing device 1400 may further include a memory, and the memory stores the program codes.
It should be noted that the media processing device 1400 may be configured to execute the data display method shown in fig. 2; for implementation details not described here, reference may be made to the relevant description of the data display method shown in fig. 2, which is not repeated.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (21)

1. A method of displaying data, comprising:
determining a user region of interest in a first video frame;
determining a display area in the first video frame according to the user region of interest, wherein the display area is used for displaying data to be displayed corresponding to the first video frame;
and superposing the image corresponding to the data to be displayed in the display area.
2. The method of claim 1, wherein the determining a user region of interest in a first video frame comprises:
analyzing the first video frame and a second video frame, and determining an area in the first video frame in which a person moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame;
and taking the area in which the person moves as the user region of interest.
3. The method of claim 1, wherein the determining a user region of interest in a first video frame comprises:
analyzing the first video frame and a second video frame, and determining a plurality of areas in the first video frame in which people move compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame;
and taking, among the plurality of areas in which people move, the area with the largest size or the area with the largest movement amplitude as the user region of interest.
4. The method of claim 1, wherein the determining a user region of interest in a first video frame comprises:
analyzing the first video frame and a second video frame, and determining a plurality of areas in the first video frame in which a human face moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame;
and taking, among the plurality of areas in which a human face moves, the area with the largest face motion amplitude as the user region of interest.
5. The method of any of claims 1 to 4, further comprising, after determining the user region of interest:
carrying out face recognition and scene recognition on the user region of interest, and determining the emotion of people in the user region of interest and the scene of the user region of interest;
and superimposing, on the display area, an emoticon corresponding to the emotion of the person in the user region of interest and to the scene of the user region of interest.
6. The method of any of claims 1 to 5, wherein determining a display area in the first video frame based on the user region of interest comprises:
determining the area of an image corresponding to the data to be displayed according to the size of the data to be displayed;
selecting a plurality of candidate display areas around the user region of interest, wherein the area of each of the candidate display areas is larger than or equal to the area of the image corresponding to the data to be displayed;
and determining one of the plurality of candidate display areas as the display area according to the distance between the central point of each candidate display area and the central point of the user region of interest and the arithmetic sum of pixel differences within each candidate display area.
7. The method of any of claims 1 to 6, further comprising, after determining a display area in the first video frame based on the user region of interest:
determining an average value of pixels within the display area;
and taking the inverse color of the pixel average value as the display color of the data to be displayed.
8. The method according to any one of claims 1 to 7, further comprising, before superimposing the image corresponding to the data to be displayed on the display area:
analyzing the semantics of the data to be displayed, and determining keywords in the data to be displayed;
and determining the display mode of the keywords in the image corresponding to the data to be displayed according to a preset configuration policy.
9. The method of claim 8, wherein determining a display mode of the keyword in the image corresponding to the data to be displayed according to a preset configuration policy comprises:
and displaying the keywords in an image corresponding to the data to be displayed in an enlarged manner or through an animation effect.
10. The method of any one of claims 1 to 9, wherein the determining a user region of interest in a first video frame comprises:
receiving coordinate information input by a camera, wherein the coordinate information is used for indicating a region of interest when a user watches the first video frame;
and determining the region of interest of the user according to the coordinate information.
11. The method according to any one of claims 1 to 10, wherein the data to be displayed comprises: at least one of subtitle data or picture data.
12. A media processing apparatus, comprising: a processor and a transmission interface;
the processor is configured to call the program code stored in the memory through the transmission interface to perform the following steps:
determining a user region of interest in a first video frame;
determining a display area in the first video frame according to the user region of interest, wherein the display area is used for displaying data to be displayed corresponding to the first video frame;
and superposing the image corresponding to the data to be displayed in the display area.
13. The apparatus of claim 12, wherein the processor is specifically configured to:
analyzing the first video frame and a second video frame, and determining an area in the first video frame in which a person moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame;
and taking the area in which the person moves as the user region of interest.
14. The apparatus of claim 12, wherein the processor is specifically configured to:
analyzing the first video frame and a second video frame, and determining a plurality of areas in the first video frame in which a human face moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame;
and taking, among the plurality of areas in which a human face moves, the area with the largest face motion amplitude as the user region of interest.
15. The apparatus of any of claims 12 to 14, wherein the processor is further configured to:
after the user region of interest is determined, performing face recognition and scene recognition on the user region of interest, and determining the emotion of the person in the user region of interest and the scene of the user region of interest;
and superimposing, on the display area, an emoticon corresponding to the emotion of the person in the user region of interest and to the scene of the user region of interest.
16. The apparatus of any one of claims 12 to 15, wherein the processor is specifically configured to:
determining the area of an image corresponding to the data to be displayed according to the size of the data to be displayed;
selecting a plurality of candidate display areas around the user region of interest, wherein the area of each of the candidate display areas is larger than or equal to the area of the image corresponding to the data to be displayed;
and determining one of the plurality of candidate display areas as the display area according to the distance between the central point of each candidate display area and the central point of the user region of interest and the arithmetic sum of pixel differences within each candidate display area.
17. The apparatus of any one of claims 12 to 16, wherein the data to be displayed comprises: at least one of subtitle data or picture data.
18. A media processing apparatus, comprising:
a determining module, configured to determine a user region of interest in a first video frame, and determine a display area in the first video frame according to the user region of interest, wherein the display area is used for displaying data to be displayed corresponding to the first video frame;
and a superposition module, configured to superimpose the image corresponding to the data to be displayed on the display area.
19. The apparatus of claim 18, wherein the determination module is specifically configured to:
analyzing the first video frame and a second video frame, and determining an area in the first video frame in which a person moves compared with the second video frame, wherein the first video frame and the second video frame are obtained by decoding a media file, and the playing time of the second video frame is earlier than that of the first video frame;
and taking the area in which the person moves as the user region of interest.
20. The apparatus of claim 18 or 19, wherein the data to be displayed comprises: at least one of subtitle data or picture data.
21. A computer-readable storage medium, characterized in that it stores program instructions which, when run on a computer or processor, cause the computer or processor to carry out the method of any one of claims 1 to 11.
CN201911040334.0A 2019-10-29 2019-10-29 Data display method and media processing device Pending CN112752130A (en)
