CN111723758A - Video information processing method and device, electronic equipment and storage medium

Info

Publication number: CN111723758A
Authority: CN (China)
Prior art keywords: target object, behavior, video, content, target
Legal status: Granted
Application number: CN202010598266.6A
Other languages: Chinese (zh)
Other versions: CN111723758B (en)
Inventors: 黄其亮, 黄杰怡
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202010598266.6A
Publication of CN111723758A
Application granted
Publication of CN111723758B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/193 - Preprocessing; Feature extraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 - Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442 - Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213 - Monitoring of end-user related data
    • H04N21/44218 - Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program

Abstract

The application provides an artificial-intelligence-based video information processing method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: presenting a video in a human-computer interaction interface; detecting a behavior of a target object during presentation of the video; determining, according to the behavior of the target object, a behavior characterization for content appearing in the video; and, when the behavior characterization of the target object indicates that the content appearing in the video exceeds the tolerance of the target object, presenting a masking effect for the content. With the method and apparatus, content in a video can be displayed in a personalized manner.

Description

Video information processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a method and an apparatus for processing video information, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. Artificial intelligence is developing rapidly and is widely used across industries.
Computer vision processing based on artificial intelligence is widely applied. Taking online videos as an example, video resources are increasingly abundant; however, it is found in the embodiments of the present application that viewers differ in how much of a video they can accept, owing to differences in psychological qualities, regional culture, religious beliefs, and the like.
The related art offers no effective solution to the conflict between the diversity of video content and these differences in viewer tolerance.
Summary of the application
The embodiments of the present application provide a video information processing method and apparatus, an electronic device, and a computer-readable storage medium, which can display content in a video in a personalized manner.
The technical solutions of the embodiments of the present application are implemented as follows:
the embodiment of the application provides a method for processing video information, which comprises the following steps:
presenting a video in a human-computer interaction interface;
detecting a behavior of a target object during presentation of the video;
determining a behavior characterization for content appearing in the video according to the behavior of the target object;
when the behavior characterization of the target object indicates that the content appearing in the video exceeds the tolerance of the target object, presenting a masking effect for the content.
An embodiment of the present application provides a video information processing apparatus, including:
a video playing module, configured to present a video in a human-computer interaction interface;
a detection module for detecting a behavior of a target object during presentation of the video;
a determining module, configured to determine, according to a behavior of the target object, a behavior characterization for content appearing in the video;
the video playing module is further configured to present a masking effect for the content when the behavior characterization of the target object indicates that the content appearing in the video exceeds the tolerance of the target object.
In the above solution, the detection module is further configured to acquire a behavior image of the target object; the determining module is further configured to identify the behavior characterization of the target object according to the behavior image.
In the above solution, the determining module is further configured to identify a behavior type of the target object in the behavior image, and to query the correspondence between behavior types and behavior characterizations according to the identified behavior type, so as to obtain the behavior characterization corresponding to the identified behavior type.
In the above solution, the determining module is further configured to invoke a neural network model to perform the following processing: extracting a feature vector of the behavior image; mapping the extracted feature vector into probabilities corresponding to a plurality of behavior characterizations, and determining the behavior characterization with the maximum probability as the behavior characterization of the target object; the neural network model is trained using sample behavior images of the target object and labeled behavior characterizations for the sample behavior images.
In the above solution, the determining module is further configured to determine that the content appearing in the video exceeds the tolerance of the target object when the behavior characterization of the target object indicates that the emotion type of the target object is fear or aversion; the video playing module is further configured to perform at least one of the following operations: overlaying material on the entire picture area of the content; overlaying material on a partial picture area of the content; skipping playback of the content; reducing the playback volume of the content.
In the above solution, the video playing module is further configured to determine a current frame of the video being played, and to overlay material on the area where the current frame differs from the previous frame, so that the differing area presents at least one of the following masking effects: mosaic; blur; erosion; frosted glass; grid; occlusion.
In the above solution, the video playing module is further configured to determine a focus area of the line of sight of the target object in the content, perform target recognition on the focus area to determine a target in the focus area, and overlay material on the target so that the target presents at least one of the following masking effects: mosaic; blur; erosion; frosted glass; grid; occlusion.
In the above solution, the video playing module is further configured to collect the positions of the pupil of the target object and of the reflective bright spot on the outer surface of the cornea of the eyeball, and to determine, according to these positions, the focus area corresponding to the line of sight of the target object in the content.
In the above solution, the video playing module is further configured to determine a corneal reflection vector of the target object according to the positions of the pupil and of the reflective bright spot on the outer surface of the cornea of the eyeball; determine, according to the corneal reflection vector, the gaze direction of the target object while watching the video; and determine the focus area in the content according to that gaze direction.
In the above solution, the video playing module is further configured to divide the focus area into a plurality of candidate boxes; predict, according to the feature vector of each candidate box, which candidate boxes include a target and the type of the target; and determine the targets of a set type included in the candidate boxes; wherein the set type of target comprises at least one of: a horror type; a pornography type.
In the above solution, the video playing module is further configured to mark the target, so that when the marked target appears again in the video, material is overlaid on the target in the video.
The embodiment of the application provides a method for processing video information, which comprises the following steps:
presenting a video in a human-computer interaction interface;
and when the content appearing in the video exceeds the tolerance of the target object, presenting a masking effect for the content.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
a processor, configured to implement the video information processing method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the video information processing method provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the emotion aiming at the content in the video is predicted through the behavior collected during the video presentation period, and shielding is performed when the situation type that the content cannot be received occurs, so that shielding is performed according to the difference of bearing capacity of different objects, the video content is presented in a personalized mode according to the individual difference of the objects, the link of video preprocessing is saved, and the timeliness of video release is improved.
Drawings
Fig. 1 is a schematic structural diagram of a video information processing system 100 provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for processing video information according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a method for processing video information according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a method for processing video information according to an embodiment of the present application;
fig. 6A and 6B are schematic diagrams of application scenarios provided by the related art;
fig. 7A, fig. 7B, fig. 7C, and fig. 7D are schematic diagrams of application scenarios provided by an embodiment of the present application;
fig. 8 is a flowchart illustrating a method for processing video information according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in the embodiments are explained as follows.
1) "In response to": indicates the condition or state on which a performed operation depends; when the condition or state on which it depends is satisfied, the one or more operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are executed.
2) Client: any application program (App) that can run in a terminal, which may be a native App, a web application (Web App), or a hybrid App, and may serve various purposes, such as a social network client, a browser, a video client, or a live-streaming client.
3) Material: a graphic element that can be superimposed on an image to give the image a new display effect, including mosaic, blur, erosion, frosted glass, grid, occlusion, and the like.
It is found in the embodiments of the present application that watching videos (such as horror films) satisfies people's psychological need to pursue stimulation, but overly stimulating content causes psychological and physiological discomfort. Referring to fig. 6A and 6B, which are schematic diagrams of application scenarios provided by the related art: in fig. 6A, for a video containing images that may cause discomfort, prompt information 601 is presented before the user watches the video; in fig. 6B, content 602 that by common knowledge may cause discomfort is pixelated in advance to present a mosaic display effect.
The related art has the following technical problems: for users who want stimulating content but worry about being over-stimulated or frightened, a prior warning does not alleviate the discomfort; and uniform pixelation applied in advance cannot meet the viewing needs of users with different tolerances, and can delay the release of videos with high timeliness requirements.
In view of the foregoing technical problems, embodiments of the present application provide a method, an apparatus, a device, and a computer-readable storage medium for processing video information, which can mask, in a personalized manner, video content beyond the tolerance of a target object (e.g., a user or an artificial intelligence robot). An exemplary application of the video information processing method provided by the embodiments of the present application is described below; the method may be implemented by various electronic devices, for example, by a client running in a terminal alone, or cooperatively by a server and a client running in a terminal.
Next, the embodiments of the present application are described by taking as an example that the target object is a user and that the method is cooperatively implemented by a server and a client running in a terminal. It should be understood that the target object may also be a program capable of simulating human behavior or of outputting simulated human behavior data, for example, a test program of the client.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a video information processing system 100 provided in an embodiment of the present application. The system 100 includes a server 200, a network 300, and a terminal 400, which are described separately below.
The server 200 is a background server of the client 410, and is configured to send a corresponding video to the client 410 in response to a video acquisition request sent by the client 410.
The network 300, which is used as a medium for communication between the server 200 and the terminal 400, may be a wide area network or a local area network, or a combination of both.
The terminal 400 is used for running a client 410, where the client 410 has a video playing function. The client 410 is configured to present a video in a human-computer interaction interface 411, to detect the behavior of the user during presentation of the video, and to determine the user's behavior characterization for content appearing in the video; when the behavior characterization indicates an emotion type showing that the content appearing in the video exceeds the user's tolerance, the masked video is presented in the human-computer interaction interface 411.
Next, a structure of the terminal 400 in fig. 1 is explained, referring to fig. 2, fig. 2 is a schematic structural diagram of the terminal 400 provided in the embodiment of the present application, and the terminal 400 shown in fig. 2 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in fig. 2.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 may be volatile memory or nonvolatile memory, and may include both. The nonvolatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452, configured to reach other computing devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the video information processing apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows the video information processing apparatus 455 stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: a video playing module 4551, a detecting module 4552 and a determining module 4553, which are logical and thus can be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be explained below.
The following describes an embodiment of the present application by taking an example of a processing method for cooperatively implementing video information provided by the embodiment of the present application by a server 200 and a client running in a terminal 400 in fig. 1. Referring to fig. 3, fig. 3 is a schematic flowchart of a method for processing video information according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
In step S101, the server transmits a video to the client.
In some embodiments, the server sends a corresponding video to the client in response to a video acquisition request sent by the client.
Here, the server is a background server of the client, and the client is an Application program (APP) having a video playing function, such as a social network client, a live broadcast client, or a short video client. The type of video sent by the server to the client may be any type of video, such as horror or comedy, for example.
In step S102, the client presents a video in the human-computer interaction interface.
In some embodiments, the client receives the video sent by the server and presents the content of the video in the human-computer interaction interface; it can also present the video's control functions (such as volume control, stopping playback, and turning the health mode on or off) and subtitle content.
In step S103, the client detects the behavior of the target object during the presentation of the video.
Here, detecting the behavior of the target object may mean detecting the behavior only after the target object has turned on the health mode (for example, collecting a behavior image of the target object), or detecting the behavior regardless of whether the health mode is on. The health mode refers to a function that can mask all or part of the content in the video according to the tolerance of the target object. The behavior of the target object may be an eye behavior, a limb behavior, a head behavior, a voice behavior, or a gesture behavior.
In some embodiments, the client may turn on the health mode by default when the target object triggers video playback; the health mode may also be turned on through a custom setting of the target object.
As an example, after the client acquires the video sent by the server, it determines the type of the video; when the video is of a type that includes content exceeding the tolerance of the target object, the client turns on the health mode by default, or presents prompt information in the human-computer interaction interface, where the prompt information prompts the target object to turn on the health mode.
In the following, a specific implementation of the client determining the type of the video will be described in detail.
In some embodiments, a client obtains viewing data of a video through a server; the type of video is determined from the viewing data of the video.
Here, the viewing data of the video includes at least one of: bullet-screen data; video comments; the number of views; viewing operations (e.g., fast-forward operations, rewind operations, or operations adjusting the playing progress); video classification labels. The viewing data may be historical viewing data of all users across the network, historical viewing data of the target object's social contacts, or historical viewing data of the target object itself.
For example, when the comments or bullet screens of a video include many fields such as "horror" or "scary", and/or many fast-forward operations are recorded, it can be determined that the video includes horror content; when the comments or bullet screens include many fields such as "not suitable for children" or "under eighteen prohibited", and/or many operations adjusting the playing progress are recorded, it can be determined that the video includes pornographic content.
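The following is a minimal sketch of this keyword-and-operation heuristic. The keyword lists, thresholds, and operation names are illustrative assumptions, not values specified by the patent.

```python
# Hypothetical sketch: infer likely video types from comment/bullet-screen text
# and recorded viewing operations. All keywords/thresholds are assumptions.
from collections import Counter

HORROR_KEYWORDS = {"horror", "scary", "scare"}
ADULT_KEYWORDS = {"not suitable for children", "under eighteen prohibited"}

def infer_video_type(comments, operations, keyword_threshold=20, op_threshold=50):
    """comments: list of comment/bullet-screen strings;
    operations: list of operation names, e.g. 'fast_forward', 'seek'."""
    text = " ".join(c.lower() for c in comments)
    op_counts = Counter(operations)

    horror_hits = sum(text.count(k) for k in HORROR_KEYWORDS)
    adult_hits = sum(text.count(k) for k in ADULT_KEYWORDS)

    types = set()
    if horror_hits >= keyword_threshold or op_counts["fast_forward"] >= op_threshold:
        types.add("horror")
    if adult_hits >= keyword_threshold or op_counts["seek"] >= op_threshold:
        types.add("pornographic")
    return types or {"general"}
```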
After the client determines the type of the video, the method further includes: the client determines whether the video is of a type that includes content exceeding the tolerance of the target object. A specific implementation is as follows:
the client obtains the type of the video and invokes a neural network model to perform the following processing: extracting a feature vector of the video type; mapping the extracted feature vector into a probability that the video is of a type including content exceeding the tolerance of the target object and a probability that it is not, and taking the type corresponding to the larger probability as the prediction result. In this way, the client can determine whether the video is of a type that includes content exceeding the tolerance of the target object.
Here, the neural network model is trained using historical viewing data and image data of the target object (including the target object's age, preferences, and the like) as samples.
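A minimal sketch of such a two-class prediction follows. The feature dimensions, layer widths, and the idea of concatenating a video-type feature vector with user profile features are illustrative assumptions; the patent only requires mapping features to the two probabilities and taking the larger one.

```python
import torch
import torch.nn as nn

class ToleranceClassifier(nn.Module):
    """Maps (video-type features, user features) to two probabilities:
    index 0 = 'includes content exceeding the tolerance', index 1 = 'does not'."""
    def __init__(self, video_feat_dim=32, user_feat_dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(video_feat_dim + user_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, video_feat, user_feat):
        x = torch.cat([video_feat, user_feat], dim=-1)
        return torch.softmax(self.net(x), dim=-1)

# Usage: take the class with the larger probability as the prediction result.
model = ToleranceClassifier()
video_feat = torch.randn(1, 32)   # feature vector of the video type
user_feat = torch.randn(1, 16)    # age, preferences, viewing history, etc.
probs = model(video_feat, user_feat)
exceeds_tolerance = bool(probs.argmax(dim=-1).item() == 0)
```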
For example, in fig. 7A, when the type of the video acquired by the client is horror, a prompt window 701 is presented before the human-computer interaction interface presents the video, prompting the user to turn on the health mode. When the user clicks the "turn on health mode" button, the client is triggered to enter the health mode, that is, to mask all or part of the content in the video according to the user's tolerance (specific implementations are described in detail below). When the user clicks the "watch original video" button, the client does not turn on the health mode; the original video is presented and no content is masked according to the user's tolerance. This offers the user different choices and can meet personalized viewing needs.
In some embodiments, the client may determine the behavior of the target object (e.g., eye behavior, limb behavior, or head behavior) by collecting a behavior image of the target object (the implementation is described in detail below). The behavior may also be acquired by setting behavior option buttons (such as a fear button and an aversion button) in the human-computer interaction interface and responding to operations on those buttons: for example, when the user sees a horror picture that causes discomfort while watching a video, the user can click the fear button, so the client obtains a behavior whose characterization indicates the fear emotion type; when the user sees a pornographic picture that causes discomfort, the user can click the aversion button, so the client obtains a behavior whose characterization indicates the aversion emotion type. The behavior may also be acquired by setting gesture operations (for example, waving a hand in front of the camera, tapping or multi-tapping the screen, or sliding on the screen) and responding to the gesture operation of the target object: for example, if waving a hand in front of the camera is set to indicate the fear emotion type, and the user waves a hand in front of the camera when seeing a horror picture that causes discomfort, the client obtains a gesture behavior whose characterization indicates the fear emotion type. Voice operations may also be set: in response to a voice operation of the target object, voice information is acquired and voice recognition is performed on it to determine the behavior of the target object; for example, when the user sees a horror picture that causes discomfort and the microphone picks up voice information such as "terrifying" or "so scary", voice recognition of that information allows the client to obtain a voice behavior whose characterization indicates the fear emotion type. In this way, the client can quickly and accurately acquire behaviors that characterize the emotion type of the target object, improving the efficiency of subsequently identifying the emotion type.
In step S104, the client determines, according to the behavior of the target object, a behavior characterization of the target object for the content appearing in the video.
By way of example, the emotion types indicated by behavior characterizations may include happiness, anger, fear, aversion, sadness, and the like.
In some embodiments, the client captures behavior of the target object in real-time and determines a behavioral characterization of the target object for content appearing in the video. That is, the client may invoke a corresponding service (e.g., a behavior recognition service) of the terminal, and the recognition process of the behavior characterization is completed through the terminal.
As an example, the client obtains a plurality of reference behavior characterizations by calling the behavior recognition service of the terminal; it matches the behavior of the target object against each reference characterization, computes the similarity between the behavior and each reference characterization, and takes the reference characterization with the highest similarity as the behavior characterization of the target object for the content appearing in the video.
In other embodiments, the client collects the behavior of the target object in real time and sends data characterizing that behavior to the server for identification of the behavior characterization. That is, the client may invoke a corresponding service of the server (e.g., a behavior recognition service), and the identification is completed by the server.
As an example, the client sends data characterizing the behavior of the target object to the server; the server obtains a plurality of reference behavior characterizations by calling a behavior recognition service, matches the behavior of the target object against each reference characterization, computes the corresponding similarities, takes the reference characterization with the highest similarity as the behavior characterization of the target object for the content appearing in the video, and sends that behavior characterization back to the client.
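A minimal sketch of the matching step is given below. Representing behaviors as fixed-length vectors and using cosine similarity are assumptions for illustration; the patent only requires picking the most similar reference characterization.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_behavior(behavior_vec, reference_characterizations):
    """reference_characterizations: dict mapping a label such as 'fear' or
    'aversion' to a reference behavior vector of the same length."""
    scores = {label: cosine_similarity(behavior_vec, ref)
              for label, ref in reference_characterizations.items()}
    best = max(scores, key=scores.get)   # highest-similarity reference wins
    return best, scores
```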
For example, when the behavior of the target object is an eye behavior such as closing the eyes or shifting the focus of the line of sight, a limb behavior such as blocking the view with the hands or an object, a voice behavior whose recognized voice information (e.g., "scared", "terrifying") indicates the fear emotion type, or a head behavior such as shaking the head to avert the gaze, it can be determined that the emotion type indicated by the corresponding behavior characterization is fear. When the behavior is an eye behavior such as squinting, a limb behavior such as blocking the view with the hands or an object, a voice behavior whose recognized voice information (e.g., "disgusting", "nauseating") indicates the aversion emotion type, or a behavior such as frowning and pulling down the corners of the mouth, it can be determined that the emotion type indicated by the corresponding behavior characterization is aversion.
According to the embodiments of the present application, the corresponding emotion type can be accurately judged from the behavior of the target object while watching the video, so that whether the content appearing in the video exceeds the tolerance of the target object can subsequently be determined accurately from that emotion type, improving the accuracy of masking the video.
In step S105, when the behavior characterization of the target object indicates that the content appearing in the video exceeds the tolerance of the target object, the client presents a masking effect for the content.
In some embodiments, step S105 may be preceded by: when the behavior characterization of the target object indicates that the emotion type of the target object is fear or aversion, the client determines that the content appearing in the video exceeds the tolerance of the target object.
Here, when the behavior characterization indicates that the emotion type is fear or aversion, the content appearing in the video is causing the target object discomfort; that is, the target object is unwilling to view that content (for example, horror content that frightens the target object, or pornographic content that disgusts it), and the client therefore needs to mask the content.
In some embodiments, the client presents the masking effect of the content in at least one of the following ways: overlaying material on the entire picture area of the content; overlaying material on a partial picture area of the content; skipping playback of the content; reducing the playback volume of the content.
Each implementation of the masking effect presented by the client is described below.
(1) The client overlays material on the entire picture area of the content.
Here, the masking effect presented after overlaying material on the entire picture area includes at least one of: mosaic; blur; erosion; frosted glass; grid; occlusion.
As an example, the client occludes the entire picture area, replaces it with a blank, black, or preset background, or pixelates it to present a mosaic effect across the whole picture.
This prevents the target object from seeing the discomforting content to the greatest extent, but the target object then cannot follow the complete plot of the video. To avoid this problem, the client may overlay material on only a partial picture area of the content.
In some embodiments, when the behavior characterization of the target object indicates an emotion type showing that the content appearing in the video exceeds the target object's second-level tolerance, the client overlays material on the entire picture area of the content; when it indicates that the content exceeds the target object's first-level tolerance, the client overlays material on a partial picture area of the content (a specific implementation of overlaying material on a partial picture area is described in detail below).
Here, taking movie ratings as an example, the rating of a movie corresponding to the second-level tolerance is higher than that of a movie corresponding to the first-level tolerance; for example, a movie corresponding to the second-level tolerance is suitable for viewers of any age, while a movie corresponding to the first-level tolerance is suitable only for viewers aged 18 (inclusive) and above. That is, taking a horror film as an example, content exceeding the second-level tolerance is more frightening than content exceeding the first-level tolerance.
In this way, the video can be masked to different degrees according to the user's different levels of tolerance, so that the user does not see overly stimulating pictures but can still see mildly stimulating ones, meeting the user's personalized viewing needs.
(2) The client overlays material on a partial picture area of the content.
Here, the masking effect presented after overlaying material on the partial picture area includes at least one of: mosaic; blur; erosion; frosted glass; grid; occlusion.
As an example, the client pixelates the partial picture area to present a mosaic display effect, or overlays a sticker on the partial picture area to present an occlusion display effect, where the sticker may be an amusing picture, which can relieve the target object's discomfort to a greater extent.
As another example, the client may overlay, on the partial picture area of the content, material corresponding to the emotion type indicated by the behavior characterization of the target object. For example, when the emotion type of the target object is fear, an amusing sticker is overlaid on the partial picture area to cheer the target object up and relieve the fear; when the emotion type is aversion, the content appearing in the video may be a pornographic picture, so material can be overlaid on the partial picture area so that the target object cannot see the pornographic picture, relieving the aversion. In this way, the user's discomfort can be relieved, and watching the video becomes more engaging.
In some embodiments, the client determines the current frame of the video being played and overlays material on the area where the current frame differs from the previous frame.
Taking a horror film as an example, what usually frightens the user is a horror element (e.g., a monster or a zombie) that suddenly appears in the next frame. For this reason, the client may determine the area where the current frame differs from the previous frame as the horror area of the playback picture, and then overlay material on that area. In this way, material can be automatically overlaid on areas that suddenly appear in the video (such as horror or aversion areas), improving the efficiency of identifying such content and avoiding delays in masking it.
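A minimal sketch of this frame-differencing approach is shown below, assuming frames are OpenCV BGR images; the difference threshold and mosaic block size are illustrative values, not from the patent.

```python
import cv2
import numpy as np

def mask_changed_region(prev_frame, cur_frame, diff_thresh=40, block=16):
    """Overlay a mosaic on the region where the current frame differs from the
    previous frame (e.g. a suddenly appearing horror element)."""
    diff = cv2.absdiff(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    ys, xs = np.where(mask > 0)
    if len(xs) == 0:
        return cur_frame                       # no sudden change: show frame as-is
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    roi = cur_frame[y0:y1 + 1, x0:x1 + 1]
    # Pixelate the region: downscale, then upscale with nearest-neighbour.
    small = cv2.resize(roi, (max(1, roi.shape[1] // block),
                             max(1, roi.shape[0] // block)),
                       interpolation=cv2.INTER_LINEAR)
    cur_frame[y0:y1 + 1, x0:x1 + 1] = cv2.resize(
        small, (roi.shape[1], roi.shape[0]), interpolation=cv2.INTER_NEAREST)
    return cur_frame
```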
In other embodiments, the client determines the focus area of the target object's line of sight in the content via an eye tracking system, performs target recognition on the focus area to determine a target in it, and overlays material on the target.
Here, the target may be a horror element (e.g., monster or zombie, etc.) as described above; or a pornographic element (e.g., nude, etc.). The client may invoke a corresponding service (e.g., a target recognition service) of the terminal, and the process of target recognition is completed through the terminal. The client can also call a corresponding service (for example, a target identification service) of the server, and the target identification process is completed through the server. Of course, the process of achieving target identification may also be cooperatively implemented by the client and the server. The following description will take an example of a process in which a client invokes a corresponding service (e.g., a target recognition service) of a terminal and completes target recognition by the terminal. The process of the client invoking the corresponding service (e.g., the target identification service) of the server and completing the target identification through the server is similar to the following process, and will not be described again.
As an example, the client determines the focus area of the target object's line of sight in the content as follows: the client invokes the camera device of the terminal (such as a camera) to collect the positions of the target object's pupil and of the reflective bright spot on the outer surface of the cornea of the eyeball; a focus area corresponding to the line of sight of the target object is then determined in the content based on these positions.
Here, the reflective bright spot on the outer surface of the cornea refers to the Purkinje image, a bright spot on the cornea generated by corneal reflection (CR) of light entering the pupil.
The principle of determining the focus area from the positions of the pupil and the Purkinje spot is as follows: because the position of the terminal camera is fixed, the position of the terminal screen light source is also fixed, and the center of the eyeball does not move, the absolute position of the Purkinje spot does not change as the eyeball rotates; its position relative to the pupil and the eyeball, however, changes constantly. For example, when the target object stares at the camera, the Purkinje spot lies between the pupils; when the target object looks up, the Purkinje spot lies just below the pupil. Therefore, by locating the positions of the pupil and the Purkinje spot in the eye image in real time and calculating the corneal reflection vector, the gaze direction of the target object can be estimated with a geometric model. Then, based on the relationship between the eye features of the target object and the content presented on the terminal screen, established in a prior calibration process (i.e., having the target object fixate on specific points on the screen), the focus area corresponding to the target object's line of sight can be determined in the content presented in the video.
For example, the client determines the corneal reflection vector of the target object from the positions of the pupil and the reflective bright spot on the outer surface of the cornea; determines, from the corneal reflection vector, the gaze direction of the target object while watching the video; and determines the focus area in the content from that gaze direction.
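A minimal sketch of mapping the pupil-to-glint (Purkinje spot) vector to a screen point via a prior calibration follows. The second-order polynomial calibration and the fixed focus-area size are assumptions for illustration; the patent only describes using the corneal reflection vector together with a prior calibration.

```python
import numpy as np

def reflection_vector(pupil_xy, glint_xy):
    """Corneal reflection vector: pupil centre minus Purkinje-spot position."""
    return np.asarray(pupil_xy, dtype=float) - np.asarray(glint_xy, dtype=float)

def fit_calibration(vectors, screen_points):
    """vectors: Nx2 reflection vectors recorded while the user fixates the known
    screen_points (Nx2). Fits x and y polynomial mappings by least squares."""
    vx, vy = vectors[:, 0], vectors[:, 1]
    A = np.column_stack([np.ones_like(vx), vx, vy, vx * vy, vx ** 2, vy ** 2])
    coeff_x, *_ = np.linalg.lstsq(A, screen_points[:, 0], rcond=None)
    coeff_y, *_ = np.linalg.lstsq(A, screen_points[:, 1], rcond=None)
    return coeff_x, coeff_y

def focus_area(vec, coeff_x, coeff_y, half_size=120):
    """Map one reflection vector to a screen point and return a square focus
    area (x0, y0, x1, y1) around it."""
    vx, vy = vec
    a = np.array([1.0, vx, vy, vx * vy, vx ** 2, vy ** 2])
    cx, cy = float(a @ coeff_x), float(a @ coeff_y)
    return cx - half_size, cy - half_size, cx + half_size, cy + half_size
```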
As an example, the client performs target recognition on the focus area to determine the target in it as follows: the client divides the focus area into a plurality of candidate boxes; predicts, from the feature vector of each candidate box, which candidate boxes include a target and the type of that target; and determines the targets of a set type included in the candidate boxes.
Here, the set type of target includes at least one of: a horror type; a pornography type. Target recognition involves two processes: target localization and target classification.
For example, because the focus area can be large, and in order to mask more precisely and avoid masking so much content that the user cannot watch the rest of the video, the client divides the focus area into a plurality of candidate boxes through an intelligent image recognition system, where the candidate boxes may contain targets of the set type; features are then extracted from the image in each candidate box through a neural network (such as a convolutional neural network) to obtain feature vectors; the feature vector of each candidate box is classified through a Support Vector Machine (SVM) to determine which candidate boxes contain a target and the type of the target; finally, the candidate boxes containing targets of the set type are selected, and the precise region of each such target within its candidate box is determined through bounding box regression.
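A minimal sketch of this candidate-box pipeline is given below. Assumptions: the candidate boxes come from a regular grid over the focus area (rather than a region-proposal method), features come from a pretrained ResNet-18, and `clf` is an already-trained scikit-learn SVM over those features with labels such as 'horror', 'pornographic', and 'background'; bounding-box regression is omitted for brevity.

```python
import torch
import torchvision
from torchvision import transforms

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled features
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def grid_candidate_boxes(x0, y0, x1, y1, rows=3, cols=3):
    """Split the focus area into rows*cols candidate boxes."""
    w, h = (x1 - x0) / cols, (y1 - y0) / rows
    return [(int(x0 + c * w), int(y0 + r * h),
             int(x0 + (c + 1) * w), int(y0 + (r + 1) * h))
            for r in range(rows) for c in range(cols)]

@torch.no_grad()
def detect_targets(frame, focus_rect, clf, set_types=("horror", "pornographic")):
    """frame: HxWx3 uint8 image; returns candidate boxes classified as a set type."""
    hits = []
    for (bx0, by0, bx1, by1) in grid_candidate_boxes(*focus_rect):
        crop = frame[by0:by1, bx0:bx1]
        if crop.size == 0:
            continue
        feat = backbone(preprocess(crop).unsqueeze(0)).numpy()  # 1x512 feature vector
        label = clf.predict(feat)[0]                            # SVM classification
        if label in set_types:
            hits.append(((bx0, by0, bx1, by1), label))
    return hits
```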
For example, in fig. 7B, when content causing user discomfort (e.g., a horror element such as a monster) appears in the video, the hotspot area 702 (i.e., the focus area described above) causing discomfort is first automatically identified in the video content by the eye tracking system; the hotspot area 702 is then recognized and marked by the intelligent image recognition system to obtain marked content 703 (i.e., the target described above).
The embodiments of the present application can mask the content appearing in the video precisely, avoiding the problem that imprecise masking hides content that would not cause discomfort and makes the user miss key plot points, thereby improving the viewing experience.
The following description will take an example of a process of performing object recognition by cooperation of a client and a server.
In some embodiments, the client first determines the focus area of the target object's line of sight in the content through the eye tracking system and sends it to the server; the server then performs target recognition on the focus area through the intelligent image recognition system to determine the target, marks it, and sends the marked target to the client; finally, the client overlays material on the marked target through a video image processing system.
In some embodiments, after performing target recognition on the focus area to determine the target in it, the method further includes: marking the target, so that when the marked target appears again in the video, material is overlaid on it.
Here, the client may invoke a corresponding service (e.g., a tagging service or an overlay service) of the terminal, and the processes of tagging and overlay are completed through the terminal. The client can also call a corresponding service (e.g., a marking service or an overlaying service) of the server, and the marking and overlaying processes are completed through the server. Of course, the process of completing the marking and the overlaying may also be cooperatively implemented by the client and the server.
The following description will take an example of a process in which a client invokes a corresponding service (e.g., a tagging service or an overlay service) of a terminal and tagging and overlay are performed by the terminal. The process of the client invoking the corresponding service (e.g., the tagging service or the overlay service) of the server to complete tagging and overlay through the server is similar to the following process, and will not be described again.
As an example, the client marks the target through an intelligent image recognition system to superimpose material on the target in the video when the marked target appears again in the video.
Taking the cooperative server-client embodiment as an example, the server marks the target through the intelligent image recognition system and sends the marked target to the client. When the marked target appears in the video again, the client automatically overlays material on it through the video image processing system.
According to the embodiments of the present application, discomforting content that appears again can be automatically masked in real time, before the user sees it, reducing how often the user sees such content and thus reducing discomfort.
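The following is a minimal sketch of re-detecting a marked target in later frames so material can be overlaid before the user sees it again. Template matching stands in for the intelligent image recognition system here, and the match threshold is an assumption.

```python
import cv2

class MarkedTargetMasker:
    def __init__(self, match_thresh=0.8):
        self.templates = []              # image patches of previously marked targets
        self.match_thresh = match_thresh

    def mark(self, frame, box):
        x0, y0, x1, y1 = box
        self.templates.append(frame[y0:y1, x0:x1].copy())

    def mask_if_reappears(self, frame):
        for tpl in self.templates:
            res = cv2.matchTemplate(frame, tpl, cv2.TM_CCOEFF_NORMED)
            _, max_val, _, max_loc = cv2.minMaxLoc(res)
            if max_val >= self.match_thresh:            # marked target appears again
                x, y = max_loc
                h, w = tpl.shape[:2]
                frame[y:y + h, x:x + w] = cv2.GaussianBlur(
                    frame[y:y + h, x:x + w], (31, 31), 0)   # overlay blur material
        return frame
```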
(3) The client skips the playing of the content.
As one example, when the user is about to see discomforting content, the video is fast-forwarded (e.g., at double, quadruple, or eightfold speed), reducing the time during which the user sees that content and relieving the user's mood.
As another example, when the user is about to see discomforting content, only key frames of the content being played are shown (i.e., non-key frames are discarded), so that the user can be prevented from seeing the discomforting content.
Here, the key frames do not contain discomforting content; a key frame may be a video frame containing a key part of the plot.
(4) The client reduces the volume of the played content.
Taking a horror film as an example, sound effects are an important means of building a frightening atmosphere. When the user sees discomforting content, the client can automatically reduce the playback volume (for example, mute it), relieving the user's fear and avoiding excessive discomfort while watching.
In some embodiments, the client may mask the content causing discomfort continuously throughout the user's viewing of the video, or may mask it intermittently; for example, when the target object exhibits a behavior indicating its tolerance is exceeded, the content is masked, and when the behavior disappears, it is no longer masked.
According to the embodiments of the present application, video content that exceeds a user's tolerance can be masked in a personalized manner according to the user's behavior, satisfying the psychological need to pursue stimulation while avoiding excessive discomfort and ensuring a comfortable, healthy viewing experience for different users.
Referring to fig. 4, fig. 4 is a schematic flowchart of a processing method of video information according to an embodiment of the present application, and based on fig. 3, step S103 may be replaced by step S106, and step S104 may be replaced by step S107.
In step S106, the client acquires a behavior image of the target object during the video presentation.
In some embodiments, the client invokes a camera device (e.g., a camera) of the terminal to capture the behavior image of the target object.
Here, the behavior image captures the movement directions of key points of the target object's body parts (for example, hands, eyes, face, and head).
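For illustration, assuming a pose or face detector has already provided the key-point coordinates of two consecutive behavior images, the movement directions mentioned here could be derived as unit vectors between the two sets of coordinates; the helper below is a hypothetical sketch, not part of the disclosure.

```python
# Illustrative sketch (assumed): movement directions of body key points,
# computed as unit vectors between key-point coordinates detected in two
# consecutive behavior images (coordinates come from any pose/face detector).
import numpy as np

def keypoint_motion(prev_points: np.ndarray, cur_points: np.ndarray) -> np.ndarray:
    """Return an N x 2 array of unit movement-direction vectors."""
    delta = cur_points.astype(float) - prev_points.astype(float)
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    # Key points that did not move get a zero direction vector.
    return np.where(norms > 1e-6, delta / np.maximum(norms, 1e-6), 0.0)
```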
In step S107, the client identifies the behavior representation of the target object according to the behavior image.
Here, the client may invoke a corresponding service (e.g., a behavior recognition service) of the terminal, and the recognition process of the behavior characterization is completed through the terminal. The client can also call a corresponding service (e.g., a behavior recognition service) of the server, and the recognition process of the behavior characterization is completed through the server.
The following description takes as an example the client identifying the behavior representation of the target object from the behavior image; the process in which the client calls the server to complete the identification of the behavior representation is similar and is not repeated here.
In some embodiments, the client identifies the behavior type of the target object in the behavior image, and queries the correspondence between different behavior types and behavior representations according to the identified behavior type, so as to obtain the behavior representation corresponding to the identified behavior type.
As an example, the client determines, through the gesture behavior recognition system, the movement directions of key points of the target object's body parts in the behavior image; matches these movement directions against a plurality of behaviors stored in the gesture behavior recognition system and determines the similarity between the movement directions and each stored behavior; takes the behavior with the highest similarity as the behavior of the target object; and determines the behavior representation of the target object from that behavior.
For example, when the behavior of the target object is tightly closing the eyes, blocking the view with hands or an object, averting the gaze from the focus, and the like, the behavior representation corresponding to that behavior may be determined to be fear. When the behavior of the target object is squinting, frowning, pulling down the corners of the mouth, and the like, the behavior representation corresponding to that behavior may be determined to be aversion.
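A minimal sketch of this rule-based path is given below, assuming NumPy. The behavior templates, their values, and the behavior-to-representation table are illustrative stand-ins for the data the gesture behavior recognition system would actually store; cosine similarity is used as one possible similarity measure.

```python
import numpy as np

# Illustrative behavior templates: reference motion-direction vectors for the
# tracked key points (values are stand-ins, not real calibration data).
BEHAVIOR_TEMPLATES = {
    "cover_eyes_with_hands": np.array([0.0, -1.0, 0.0, -1.0, 0.0, 0.0]),
    "avert_gaze":            np.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0]),
    "frown_and_squint":      np.array([0.0, 0.2, 0.0, -0.2, 0.0, 0.1]),
}

# Correspondence between behaviors and behavior representations.
BEHAVIOR_TO_REPRESENTATION = {
    "cover_eyes_with_hands": "fear",
    "avert_gaze":            "fear",
    "frown_and_squint":      "aversion",
}

def characterize(keypoint_motion: np.ndarray) -> str:
    """Pick the stored behavior most similar to the observed motion and map it."""
    observed = keypoint_motion.ravel()

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_behavior = max(BEHAVIOR_TEMPLATES,
                        key=lambda name: cosine(observed, BEHAVIOR_TEMPLATES[name]))
    return BEHAVIOR_TO_REPRESENTATION[best_behavior]
```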
In other embodiments, the client invokes the neural network model to perform the following: extracting a feature vector of the behavior image; and mapping the extracted feature vectors into the probabilities corresponding to the plurality of behavior representations, and determining the behavior representation corresponding to the maximum probability as the behavior representation of the target object.
Here, the neural network model is obtained by training with sample behavior images of the target object and the labeled behavior representations for those sample images. The neural network model may be stored locally or in the cloud (e.g., on a server), and the client performs the behavior characterization by invoking the local or cloud model.
As an example, the training process of the neural network model is as follows: first, a training sample is obtained, which includes a sample behavior image of the target object and the labeled behavior representation corresponding to that image; then a feature vector of the sample behavior image is extracted and mapped to probabilities corresponding to a plurality of behavior representations, and the behavior representation with the maximum probability is taken as the predicted behavior representation of the target object; next, the difference between the predicted behavior representation and the labeled behavior representation corresponding to the sample behavior image is calculated; and finally, the parameters of the neural network model are updated according to the difference.
Here, the sample behavior image of the target object may be acquired before the target object views the video, for example, before the user views the video, the sample behavior image of the user is captured by a camera, and the user labels the corresponding behavior representation for the captured sample behavior image.
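For illustration, the described training and prediction steps could look as follows, assuming PyTorch; the small convolutional network, the number of behavior representations, and the use of cross-entropy as the "difference" are assumptions made only to keep the sketch concrete.

```python
# Illustrative sketch (assumed): a small classifier over behavior images and the
# training/prediction steps described above, using PyTorch and cross-entropy.
import torch
import torch.nn as nn

NUM_REPRESENTATIONS = 3          # e.g. fear, aversion, neutral (assumed labels)

model = nn.Sequential(                              # feature extractor + classifier
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, NUM_REPRESENTATIONS),             # logits over behavior representations
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                     # "difference" between prediction and label

def train_step(sample_images: torch.Tensor, labels: torch.Tensor) -> float:
    """One parameter update on a batch of sample behavior images and labels."""
    logits = model(sample_images)                   # extract features, map to scores
    loss = loss_fn(logits, labels)                  # difference to the labeled representation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # update the model parameters
    return loss.item()

def predict(image: torch.Tensor) -> int:
    """Index of the behavior representation with the maximum probability."""
    with torch.no_grad():
        return int(model(image.unsqueeze(0)).softmax(dim=-1).argmax())
```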
The behavior representation of the target object can be determined either by the rule-based method or by the neural network model. On one hand, determining the behavior representation by rules is simple and fast at recognition, which speeds up the shielding of video content and reduces the time the user spends watching discomforting content; on the other hand, determining the behavior representation with the neural network model is more accurate, so whether content appearing in the video should be shielded can be determined precisely according to the user's behavior, avoiding misjudgments that would shield content not exceeding the user's bearing capacity and impair the viewing experience.
Referring to fig. 5, fig. 5 is a flowchart illustrating a method for processing video information according to an embodiment of the present application, and based on fig. 3, step S108 may be included after step S105.
In step S108, the client cancels the masking effect in response to an operation of turning off the health mode, so as to display the original video.
Here, the operation may be any of various operations preset by the operating system that do not conflict with registered operations, for example a button click, a slide operation, or a voice operation.
In some embodiments, the client may misjudge the behavior of the target object and shield content that should not be shielded, so that the target object cannot view the video content it wants to see. Alternatively, taking a horror film as an example, the bearing capacity of the target object is not constant throughout viewing but gradually increases as the plot develops, whereas the health mode, once turned on, shields from beginning to end the video content that the target object could not accept earlier, so that the target object cannot view subsequent video content that does not exceed its bearing capacity. To address this, a human-computer interaction interface in the client may present a health mode off button; when the user triggers this button, the client cancels the masking effect to display the original video.
As an example, in fig. 7D, when the user clicks the health mode off button 706, the client plays the original video without the material superimposed.
According to the embodiment of the application, after content that the user did not want shielded has been shielded, the health mode can be turned off through the user's operation to cancel the shielding of the video, ensuring that the user can watch content that meets their needs.
Next, a method for processing video information according to an embodiment of the present application will be described, taking a case where a video is a horror film as an example.
Watching videos (e.g., horror films) satisfies people's psychological need to pursue stimulation, but overly stimulating content causes psychological and physiological discomfort.
Referring to fig. 6A and 6B, fig. 6A and 6B are schematic diagrams of an application scenario provided by the related art. In fig. 6A, for a video containing images that may cause discomfort to the user, prompt information 601 is presented on the human-computer interaction interface before the user views the video. In fig. 6B, content 602 that is generally considered likely to cause user discomfort is mosaic-coded in advance, without considering the differences in bearing capacity among users.
The related art has the following problems: for users who want to pursue stimulating content but worry about being excessively frightened, a warning before viewing does not alleviate their discomfort; and uniform advance coding cannot meet the viewing needs of users with different bearing capacities.
In view of the above problems, embodiments of the present application provide a method for processing video information, so that users with different degrees of stimulus acceptance can satisfy the psychological need to pursue stimulation while watching a video, while avoiding discomfort caused by excessively discomforting content, thereby ensuring a healthy and appropriate viewing experience for different users.
The implementation of the embodiment of the present application is described below with reference to fig. 7A, 7B, 7C, and 7D, where fig. 7A, 7B, 7C, and 7D are schematic diagrams of application scenarios provided by the embodiment of the present application.
In step S701, when the user watches a video with stimulating content (e.g., a horror video), the user is prompted whether to turn on the health mode. In response to a trigger operation for the health mode, the health mode is entered, and intelligent mosaic coding is applied when the user experiences excessive discomfort.
As an example, in fig. 7A, when the user views a video with stimulating content, the human-computer interaction interface presents a prompt window 701. When the user clicks the button for turning on the health mode, the health mode is entered, and intelligent mosaic coding is carried out when the user feels excessive discomfort.
In step S702, after the health mode is turned on, the content causing discomfort to the user is discriminated in real time by the discrimination system.
In some embodiments, the discrimination system includes a gesture behavior recognition system and an eye tracking system. The user's state of discomfort is discriminated through the gesture behavior recognition system, for example eyes tightly closed or the line of sight blocked; at the same time, the picture hot zone (i.e., the focus area described above) that causes the user discomfort is identified through the eye tracking system.
As an example, in fig. 7B, when content causing user discomfort (e.g., a horror element such as a monster) appears in the video, a picture hot zone 702 causing user discomfort is automatically identified in the video content by the eye-tracking system.
In step S703, the picture hot zone is identified and marked by the intelligent image recognition system.
As an example, in fig. 7B, the marked content 703 (i.e., the above-mentioned target) is obtained by identifying and marking the picture hot zone 702 by the intelligent image recognition system.
In step S704, the marked content is coded in real time by the video image processing system.
Here, in addition to mosaic coding, the marked content may also be replaced with other innocuous content, for example by pasting a sticker over it or re-rendering it.
As an example, in fig. 7B, mosaic coding is performed on the marked content 703 to present the coded marked content 704 in the human-computer interaction interface.
In step S705, when the marked content reappears in the video content, the reappearing marked content is coded in real time by the video image processing system.
As an example, in fig. 7C, when the reappearing marked content 705 is presented in the human-computer interaction interface, it is automatically coded.
In step S706, in response to the operation of turning off the health mode, the original video is played.
By way of example, in FIG. 7D, when the user clicks the health mode off button 706, the uncoded original video is played.
Referring to fig. 8, fig. 8 is a schematic flowchart of a method for processing video information according to an embodiment of the present application, which is described in detail below with reference to fig. 8.
In step S801, after the health mode is turned on, the client discriminates, while playing the video, pictures that cause an excessive discomfort reaction in the user by calling the discrimination system of the terminal.
In some embodiments, the discrimination system includes a gesture behavior recognition system and an eye tracking system. The client judges the behavioral reaction when the user feels uncomfortable by calling the gesture behavior recognition system of the terminal, for example tightly closing the eyes, blocking the line of sight with hands or objects, or averting the gaze from the focus; meanwhile, the client tracks and discriminates in real time the picture hot zone the user is watching by calling the eye tracking system of the terminal.
In step S802, the background receives the identified picture hot zone, and the image content of that area is recognized and marked by the intelligent image recognition system.
In step S803, according to the mark information returned by the background, the client performs mosaic coding on the area of the picture in which the content appears by calling the video image processing system of the terminal, and presents in the human-computer interaction interface video content that changes in real time according to the user's reaction.
In step S804, when the marked content appears again in the video picture, the client recognizes it by calling the video image processing system of the terminal and performs the mosaic coding.
In summary, according to the embodiment of the application, when the user watches a video, the content causing the user excessive discomfort is discriminated through the gesture behavior recognition system and the eye tracking system, the content is marked by intelligent image recognition, the picture of the content area is automatically mosaic-coded to relieve the user's discomfort, and when the picture of that content appears again in the video it is automatically coded as well. This meets the needs of users with different degrees of stimulus acceptance to pursue stimulation while keeping the experience healthy and moderate.
Continuing with the exemplary structure of the video information processing apparatus 455 provided by the embodiment of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the video information processing apparatus 455 of the memory 450 may include:
the video playing module 4551 is used for presenting videos in the human-computer interaction interface;
a detecting module 4552 configured to detect a behavior of a target object during presentation of the video;
a determining module 4553, configured to determine, according to the behavior of the target object, a behavior characterization for content appearing in the video;
the video playing module 4551 is further configured to, when the behavior representation of the target object indicates that content appearing in the video exceeds the bearing capacity of the target object, present a shielding effect of the content.
In some embodiments, the detecting module 4552 is further configured to acquire a behavior image of the target object; the determining module is further configured to identify the behavior representation of the target object according to the behavior image.
In some embodiments, the determining module 4553 is further configured to identify a behavior type of the target object in the behavior image; and inquiring the corresponding relation between different behavior types and behavior representations according to the identified behavior types to obtain the behavior representations corresponding to the identified behavior types.
In some embodiments, the determining module 4553 is further configured to invoke the neural network model to perform the following processing: extracting a feature vector of the behavior image; mapping the extracted feature vectors into probabilities corresponding to a plurality of behavior representations, and determining the behavior representation corresponding to the maximum probability as the behavior representation of the target object; the neural network model is obtained by taking a sample behavior image of the target object and labeled behavior characterization aiming at the sample behavior image as sample training.
In some embodiments, the determining module 4553 is further configured to determine that the content appearing in the video exceeds the bearing capacity of the target object when the behavioral representation of the target object indicates that the emotion type of the target object belongs to fear or aversion; the video playing module is further configured to perform at least one of the following operations: superimposing material in all picture areas of the content; superimposing material in a partial picture area of the content; skipping playback of the content; reducing the volume at which the content is played.
In some embodiments, the video playing module 4551 is further configured to determine the current frame of the video being played, and to superimpose material on the area where the current frame differs from the previous frame, so that the differing area presents at least one of the following shielding effects: mosaic; blur; erosion; frosting; grid; occlusion.
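A minimal sketch of this frame-difference masking is shown below, assuming OpenCV/NumPy frames in BGR format; the difference threshold and mosaic block size are illustrative values.

```python
# Illustrative sketch (assumed): mask only the picture region where the current
# frame differs from the previous one, using a mosaic effect. Assumes OpenCV.
import cv2
import numpy as np

def mask_changed_region(prev_frame: np.ndarray, cur_frame: np.ndarray,
                        diff_threshold: int = 30, block: int = 16) -> np.ndarray:
    """Pixelate the bounding box of the area where the two frames differ."""
    diff = cv2.absdiff(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY))
    changed = np.argwhere(diff > diff_threshold)
    if changed.size == 0:                      # frames are identical, nothing to mask
        return cur_frame
    (y0, x0), (y1, x1) = changed.min(0), changed.max(0) + 1
    region = cur_frame[y0:y1, x0:x1]
    h, w = region.shape[:2]
    small = cv2.resize(region, (max(1, w // block), max(1, h // block)))
    cur_frame[y0:y1, x0:x1] = cv2.resize(small, (w, h),
                                         interpolation=cv2.INTER_NEAREST)
    return cur_frame
```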
In some embodiments, the video playing module 4551 is further configured to determine the focus area of the line of sight of the target object in the content, perform target recognition on the focus area to determine a target in the focus area, and superimpose material on the target so that the target presents at least one of the following shielding effects: mosaic; blur; erosion; frosting; grid; occlusion.
In some embodiments, the video playing module 4551 is further configured to acquire the positions of the pupil of the target object and of the reflective bright spot on the outer surface of the cornea of the eyeball, and to determine, according to these positions, the focus area corresponding to the line of sight of the target object in the content.
In some embodiments, the video playing module 4551 is further configured to determine the corneal reflection vector of the target object according to the positions of the pupil of the target object and the reflective bright spot on the outer surface of the cornea of the eyeball; determine, according to the corneal reflection vector, the direction of the target object's line of sight when watching the video; and determine the focus area in the content according to that direction.
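For illustration, the pupil-center/corneal-reflection idea could be sketched as follows: the vector from the corneal glint to the pupil center is mapped to a screen position through a calibration obtained beforehand. The affine calibration matrix and the square focus-area size below are illustrative assumptions, not values from this disclosure.

```python
# Illustrative sketch (assumed): map the pupil-to-glint vector to a gaze point
# on the screen through a pre-computed affine calibration, then take a square
# region around that point as the focus area. The calibration values are fake.
import numpy as np

CALIBRATION = np.array([[800.0, 0.0, 960.0],      # assumed 2x3 affine mapping
                        [0.0, 600.0, 540.0]])     # from calibration, in pixels

def gaze_point(pupil_xy: np.ndarray, glint_xy: np.ndarray) -> np.ndarray:
    """Screen coordinates the viewer is looking at."""
    reflection_vector = pupil_xy - glint_xy        # corneal reflection vector
    v = np.append(reflection_vector, 1.0)          # homogeneous form
    return CALIBRATION @ v

def focus_region(pupil_xy, glint_xy, half_size: int = 120):
    """A square focus area (x0, y0, x1, y1) around the estimated gaze point."""
    x, y = gaze_point(np.asarray(pupil_xy, float), np.asarray(glint_xy, float))
    return (int(x - half_size), int(y - half_size),
            int(x + half_size), int(y + half_size))
```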
In some embodiments, the video playing module 4551 is further configured to divide the focus area into a plurality of candidate boxes; predict, according to the feature vector of each candidate box, the candidate box that includes the target and the type of the target; and determine the targets included in the candidate boxes that belong to a set type; wherein the set type of the target comprises at least one of: a horror type; a pornography type.
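A minimal sketch of this candidate-box step is given below: the focus area is scanned with fixed-size candidate boxes and a classifier keeps only the boxes whose predicted type belongs to the set types. `classify_box`, the box size, stride, and score threshold are hypothetical stand-ins for the recognition model actually used.

```python
# Illustrative sketch (assumed): scan the focus area with fixed-size candidate
# boxes and keep only the boxes whose predicted type belongs to the set types.
# `classify_box` is a hypothetical stand-in for the actual recognition model.
from typing import Callable, List, Tuple
import numpy as np

SET_TYPES = {"horror", "pornography"}
Box = Tuple[int, int, int, int]                    # x0, y0, x1, y1 in the focus area

def find_targets(focus_area: np.ndarray,
                 classify_box: Callable[[np.ndarray], Tuple[str, float]],
                 box_size: int = 64, stride: int = 32,
                 min_score: float = 0.5) -> List[Tuple[Box, str]]:
    """Return candidate boxes that contain a target of a set type."""
    targets = []
    h, w = focus_area.shape[:2]
    for y in range(0, max(1, h - box_size + 1), stride):
        for x in range(0, max(1, w - box_size + 1), stride):
            patch = focus_area[y:y + box_size, x:x + box_size]
            label, score = classify_box(patch)     # predicted type and confidence
            if label in SET_TYPES and score >= min_score:
                targets.append(((x, y, x + box_size, y + box_size), label))
    return targets
```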
In some embodiments, the video playing module 4551 is further configured to mark the target, so as to superimpose the material on the target in the video when the marked target appears again in the video.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to execute a method for processing video information provided by embodiments of the present application, for example, a method for processing video information as shown in fig. 3, fig. 4, fig. 5 or fig. 8, where the computer includes various computing devices including a smart terminal and a server.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; or may be any device including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions can correspond, but do not necessarily correspond, to files in a file system, and can be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts stored in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the present application has the following beneficial effects:
(1) By detecting the user's behavior while watching the video, the user's emotion during viewing is discriminated and the content appearing in the video is shielded in a differentiated manner. Video content exceeding the bearing capacity of a user can thus be shielded in a personalized way according to the bearing capacities of different users, so that the psychological need to pursue stimulation is met while discomfort caused by excessively discomforting content is avoided, ensuring a healthy and appropriate viewing experience for different users.
(2) The corresponding behavior representation can be accurately judged according to the behavior of the target object when the video is watched, so that whether the content appearing in the video exceeds the bearing capacity of the target object can be accurately determined subsequently according to the behavior representation of the target object, and the accuracy of video shielding is improved.
(3) The content appearing in the video can be shielded accurately, avoiding the situation where inaccurate shielding causes the user to miss key plot points or to be unable to watch content in the video that would not cause discomfort, thereby improving the user's viewing experience.
(4) Content that causes the user discomfort and reappears can be shielded automatically, before the user sees it, which reduces how often the user sees discomforting content and therefore reduces the user's discomfort.
(5) The behavior representation of the target object can be determined either by the rule-based method or by the neural network model. On one hand, determining the behavior representation by rules is simple and fast at recognition, which speeds up the shielding of video content and reduces the time the user spends watching content that causes discomfort; on the other hand, determining the behavior representation with the neural network model is more complex but more accurate, so whether content appearing in the video should be shielded can be determined precisely according to the user's behavior, avoiding misjudgments that would shield content not exceeding the user's bearing capacity and impair the viewing experience.
(6) After content that the user did not want shielded has been shielded, the health mode can be turned off through the user's operation to cancel the shielding of the video, ensuring that the user can watch content that meets their needs.
(7) The video can be shielded to different degrees according to the user's level of bearing capacity, so that the user does not see overly stimulating pictures but can still watch some mildly stimulating pictures, meeting the user's personalized viewing needs.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for processing video information, the method comprising:
presenting a video in a human-computer interaction interface;
detecting a behavior of a target object during presentation of the video;
determining a behavior characterization for content appearing in the video according to the behavior of the target object;
when the behavior characterization indicates that the content appearing in the video exceeds the bearing capacity of the target object, presenting a shielding effect of the content.
2. The method of claim 1, wherein the detecting the behavior of the target object comprises:
acquiring a behavior image of the target object;
the determining, according to the behavior of the target object, a behavior characterization for content appearing in the video includes:
and identifying the behavior representation of the target object according to the behavior image.
3. The method of claim 2, wherein the identifying the behavioral representation of the target object from the behavioral image comprises:
identifying the behavior type of the target object in the behavior image;
and inquiring the corresponding relation between different behavior types and behavior representations according to the identified behavior types to obtain the behavior representations corresponding to the identified behavior types.
4. The method of claim 2, wherein the identifying the behavioral representation of the target object from the behavioral image comprises:
calling the neural network model to execute the following processing:
extracting a feature vector of the behavior image;
mapping the extracted feature vectors into probabilities corresponding to a plurality of behavior representations, and determining the behavior representation corresponding to the maximum probability as the behavior representation of the target object;
the neural network model is obtained by taking a sample behavior image of the target object and labeled behavior characterization aiming at the sample behavior image as sample training.
5. The method of claim 1, further comprising:
determining that content appearing in the video exceeds the bearing capacity of the target object when the behavioral representation of the target object indicates that the emotion type of the target object belongs to fear or aversion;
the presenting of the masking effect of the content comprises:
performing at least one of the following operations:
superposing materials in all the picture areas of the content;
superposing materials in a partial picture area of the content;
skipping playback of the content;
reducing the volume of playing the content.
6. The method of claim 5, wherein said overlaying material in the partial-screen region of the content comprises:
determining a current frame of the video playing;
superimposing the material on the region where there is a difference between the current frame and the previous frame, so that the region of difference presents at least one of the following shielding effects: mosaic; blur; erosion; frosting; grid; occlusion.
7. The method of claim 5, wherein said overlaying material in the partial-screen region of the content comprises:
determining a focus area of a line of sight of the target object in the content;
performing target recognition on the focus area to determine a target in the focus area, and superimposing material on the target so that the target presents at least one of the following shielding effects: mosaic; blur; erosion; frosting; grid; occlusion.
8. The method of claim 7, wherein determining a focal region of a line of sight of the target object in the content comprises:
collecting the positions of the pupil of the target object and of the reflective bright spot on the outer surface of the cornea of the eyeball;
and determining a focal area corresponding to the sight line of the target object in the content according to the positions of the pupil of the target object and the reflection bright spot on the outer surface of the cornea of the eyeball.
9. The method of claim 8, wherein determining a focal region in the content corresponding to the target object's line of sight according to the positions of the pupil of the target object and the reflective bright spot on the outer surface of the cornea of the eyeball comprises:
determining a cornea reflection vector of the target object according to the positions of the pupil of the target object and the reflection bright spot on the outer surface of the cornea of the eyeball;
determining the sight line direction of the target object when the target object watches the video according to the corneal reflection vector of the target object;
and determining the focus area in the content according to the sight line direction of the target object when watching the video.
10. The method of claim 7, wherein the performing target recognition on the focus area to determine a target in the focus area comprises:
dividing the focus area into a plurality of candidate boxes;
predicting a candidate box comprising the target and the type of the target according to the feature vector of each candidate box;
determining targets belonging to a set type and included in the candidate boxes;
wherein the set type of the target comprises at least one of: a horror type; a pornography type.
11. The method according to any one of claims 7 to 10, further comprising:
marking the target, so that when the marked target appears again in the video, the material is superimposed on the target in the video.
12. A method for processing video information, the method comprising:
presenting a video in a human-computer interaction interface;
and when the content appearing in the video exceeds the bearing capacity of the target object, presenting the shielding effect of the content.
13. An apparatus for processing video information, comprising:
the video playing module is used for presenting videos in the human-computer interaction interface;
a detection module for detecting a behavior of a target object during presentation of the video;
a determining module, configured to determine, according to a behavior of the target object, a behavior characterization for content appearing in the video;
the video playing module is further configured to present a shielding effect of the content when the behavior representation of the target object indicates that the content appearing in the video exceeds the bearing capacity of the target object.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of processing video information according to any one of claims 1 to 11 or the method of processing video information according to claim 12 when executing the executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon executable instructions for causing a processor to perform a method of processing video information according to any one of claims 1 to 11 or a method of processing video information according to claim 12 when executed.
CN202010598266.6A 2020-06-28 2020-06-28 Video information processing method and device, electronic equipment and storage medium Active CN111723758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010598266.6A CN111723758B (en) 2020-06-28 2020-06-28 Video information processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111723758A 2020-09-29
CN111723758B (en) 2023-10-31

Family

ID=72569530


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012174186A (en) * 2011-02-24 2012-09-10 Mitsubishi Electric Corp Image processor for monitoring
CN106454490A (en) * 2016-09-21 2017-02-22 天脉聚源(北京)传媒科技有限公司 Method and device for smartly playing video
CN106454155A (en) * 2016-09-26 2017-02-22 新奥特(北京)视频技术有限公司 Video shade trick processing method and device
CN107493501A (en) * 2017-08-10 2017-12-19 上海斐讯数据通信技术有限公司 A kind of audio-video frequency content filtration system and method
CN108495191A (en) * 2018-02-11 2018-09-04 广东欧珀移动通信有限公司 Video playing control method and related product
CN108900908A (en) * 2018-07-04 2018-11-27 三星电子(中国)研发中心 Video broadcasting method and device
CN111050105A (en) * 2019-12-14 2020-04-21 中国科学院深圳先进技术研究院 Video playing method and device, toy robot and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
H.-I. Kim et al., "Gaze estimation using a webcam for region of interest detection", Signal, Image and Video Processing, vol. 10 *
Wang Ningzhi et al., "User feedback collection technology based on expression analysis and gaze tracking", Intelligent Computer and Applications, vol. 9, no. 3 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112827862A (en) * 2020-12-30 2021-05-25 重庆金康动力新能源有限公司 Grade sorting method and test equipment
CN115942054A (en) * 2022-11-18 2023-04-07 优酷网络技术(北京)有限公司 Video playing method and device, electronic equipment and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant