CN111723758B - Video information processing method and device, electronic equipment and storage medium

Info

Publication number: CN111723758B
Authority: CN (China)
Prior art keywords: target object, video, behavior, content, target
Legal status: Active (granted)
Application number: CN202010598266.6A
Other languages: Chinese (zh)
Other versions: CN111723758A
Inventors: 黄其亮, 黄杰怡
Current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202010598266.6A
Publication of CN111723758A (application)
Application granted; publication of CN111723758B

Classifications

    • G06V 20/41 — Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 40/193 — Recognition of human eye characteristics, e.g. of the iris; preprocessing; feature extraction
    • G06V 40/20 — Recognition of human or animal movements or behaviour, e.g. gesture recognition
    • H04N 21/431 — Client devices for selective content distribution; generation of visual interfaces for content selection or interaction; content or additional data rendering
    • H04N 21/44218 — Monitoring of end-user related data; detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program

Abstract

The application provides an artificial-intelligence-based video information processing method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: presenting a video in a human-computer interaction interface; detecting a behavior of a target object during presentation of the video; determining, according to the behavior of the target object, a behavior characterization for content appearing in the video; and presenting a shielding effect for the content when the behavior characterization of the target object indicates that the content appearing in the video exceeds the bearing capacity of the target object. According to the application, the content in the video can be displayed in a personalized manner.

Description

Video information processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method and apparatus for processing video information, an electronic device, and a computer readable storage medium.
Background
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. Artificial intelligence is now evolving rapidly and is widely used in a variety of industries.
Computer vision processing based on artificial intelligence is widely applied. Taking online video as an example, video resources are increasingly abundant; however, it has been found in the embodiments of the present application that viewers differ in how well they can accept a given video owing to differences in psychological resilience, regional culture, religious belief, and the like.
The related art offers no effective solution to the contradiction between the diversity of video content and the variability of viewers' acceptance.
Summary of the application
The embodiments of the present application provide a video information processing method and apparatus, an electronic device, and a computer-readable storage medium, which can display the content in a video in a personalized manner.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a processing method of video information, which comprises the following steps:
presenting a video in a human-computer interaction interface;
detecting a behavior of a target object during presentation of the video;
determining a behavior characterization for content present in the video based on the behavior of the target object;
and when the behavior representation of the target object indicates that the content appearing in the video exceeds the bearing capacity of the target object, the shielding effect of the content is presented.
An embodiment of the present application provides a processing apparatus for video information, including:
the video playing module is used for presenting videos in the human-computer interaction interface;
a detection module for detecting a behavior of a target object during presentation of the video;
a determining module for determining a behavior characterization for content appearing in the video according to the behavior of the target object;
the video playing module is further used for presenting the shielding effect of the content when the behavior representation of the target object indicates that the content appearing in the video exceeds the bearing capacity of the target object.
In the above scheme, the detection module is further configured to collect a behavior image of the target object; the determining module is further used for identifying behavior characterization of the target object according to the behavior image.
In the above solution, the determining module is further configured to identify, in the behavior image, a behavior type of the target object; and inquiring the corresponding relation between different behavior types and behavior characterization according to the identified behavior types to obtain the behavior characterization corresponding to the identified behavior types.
In the above scheme, the determining module is further configured to invoke the neural network model to perform the following processing: extracting feature vectors of the behavior image; mapping the extracted feature vector into probabilities corresponding to a plurality of behavior characterizations, and determining the behavior characterization corresponding to the maximum probability as the behavior characterization of the target object; the neural network model is obtained by training a sample behavior image of the target object and a behavior characterization of a label of the sample behavior image.
In the above scheme, the determining module is further configured to determine that, when the behavioral representation of the target object indicates that the emotion type of the target object belongs to fear or dislike, content appearing in the video exceeds the bearing capability of the target object; the video playing module is further used for executing at least one of the following operations: superposing materials in all picture areas of the content; superimposing a material in a partial picture area of the content; skipping playing of the content; and reducing the volume of playing the content.
In the above scheme, the video playing module is further configured to determine a current frame of the video being played, and superimpose material in a region where there is a difference between the current frame and the previous frame, so that the region where the difference exists presents at least one of the following shielding effects: mosaic, blur, erosion, frosting, grid, or occlusion.
In the above solution, the video playing module is further configured to determine a focal area of the line of sight of the target object in the content, perform target recognition on the focal area to determine a target in the focal area, and superimpose material on the target so that the target presents at least one of the following shielding effects: mosaic, blur, erosion, frosting, grid, or occlusion.
In the above scheme, the video playing module is further configured to collect positions of the pupil of the target object and the reflective bright spots on the outer surface of the cornea of the eyeball; and determining a focus area corresponding to the sight line of the target object in the content according to the positions of the pupil of the target object and the reflecting bright spots on the outer surface of the cornea of the eyeball.
In the above scheme, the video playing module is further configured to determine a corneal reflection vector of the target object according to positions of a pupil of the target object and a reflection bright spot on an outer surface of a cornea of an eyeball; determining a sight line direction of the target object when watching the video according to the cornea reflection vector of the target object; and determining the focus area in the content according to the sight line direction of the target object when watching the video.
In the above scheme, the video playing module is further configured to divide the focal area into a plurality of candidate boxes; predict, according to the feature vector of each candidate box, the candidate boxes that include a target and the type of the target; and determine the targets included in the candidate boxes that belong to a set type; wherein the set type of the target comprises at least one of the following: a horror type; a pornographic type.
In the above scheme, the video playing module is further configured to mark the target, so that when the marked target appears again in the video, the material is superimposed on the target in the video.
The embodiment of the application provides a processing method of video information, which comprises the following steps:
presenting a video in a human-computer interaction interface;
and when the content appearing in the video exceeds the bearing capacity of the target object, the shielding effect of the content is presented.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the processing method of the video information provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for causing a processor to execute, thereby realizing the processing method of video information provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the emotion of the target object toward the content in the video is predicted from the behavior collected during video presentation, and shielding is carried out when content appears that the target object cannot accept. Shielding is thus performed according to the differing bearing capacities of different objects, personalized presentation of video content according to individual differences is realized, the video pre-processing step is saved, and the timeliness of video release is improved.
Drawings
Fig. 1 is a schematic diagram of a video information processing system 100 according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application;
fig. 3 is a flowchart of a method for processing video information according to an embodiment of the present application;
fig. 4 is a flowchart of a method for processing video information according to an embodiment of the present application;
fig. 5 is a flowchart of a method for processing video information according to an embodiment of the present application;
fig. 6A and 6B are schematic views of application scenarios provided by the related art;
fig. 7A, fig. 7B, fig. 7C, and fig. 7D are schematic diagrams of application scenarios provided by embodiments of the present application;
fig. 8 is a flowchart of a method for processing video information according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments of the present application are explained; the following explanations apply to these terms wherever they are used herein.
1) "In response to": used to indicate the condition or state on which a performed operation depends. When the condition or state on which it depends is satisfied, the one or more operations performed may be carried out in real time or with a set delay; unless otherwise specified, there is no limitation on the order in which the operations are performed.
2) Client: any application (App) that can run in the terminal, which may be a native App, a web application (Web App), or a hybrid App, and may serve various purposes, such as a social network client, a browser, a video client, a live-streaming client, and the like.
3) Material: a graphic element that can be superimposed on an image so that the image presents a new display effect, including mosaic, blur, erosion, frosting, grid, occlusion, and the like.
In the embodiments of the present application, it has been found that viewing videos (e.g., horror films) meets people's psychological need to pursue stimulation, but overstimulating content can cause psychological and physiological discomfort. Referring to fig. 6A and 6B, which are schematic views of application scenes provided by the related art: in fig. 6A, for a video containing images that may cause discomfort to a user, a prompt 601 is presented before the user views the video. In fig. 6B, content 602 that is generally regarded as likely to cause discomfort is pixelated in advance to present a mosaic display effect.
The related art has the following technical problems: for users who want to pursue stimulating content but worry about excessive stimulation or fear, a reminder before viewing does not alleviate their discomfort; and uniform pixelation in advance cannot meet the viewing demands of users with different bearing capacities and can affect the real-time online availability of videos with high timeliness requirements.
In view of the above technical problems, embodiments of the present application provide a method, apparatus, device, and computer-readable storage medium for processing video information, which can achieve the purpose of personalized shielding of video content beyond the bearing capability of a target object (e.g., a user or an artificial intelligent robot). The following describes an exemplary application of the video information processing method provided by the embodiment of the present application, where the video information processing method provided by the embodiment of the present application may be implemented by various electronic devices, for example, may be implemented by a client running in a terminal alone or may be implemented by a server in conjunction with a client running in the terminal.
In the following, the embodiments of the present application are described by taking as an example a method implemented cooperatively by a server and a client running in a terminal, where the target object is a user; it will be understood that the target object may also be a program that can simulate human behavior or output simulated human-behavior data, for example, a test program of the client.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a video information processing system 100 according to an embodiment of the present application. The video information processing system 100 includes: the server 200, the network 300, and the terminal 400 will be described separately.
The server 200 is a background server of the client 410, and is configured to send a corresponding video to the client 410 in response to a video acquisition request sent by the client 410.
The network 300 may be a wide area network or a local area network, or a combination of both, for mediating communication between the server 200 and the terminal 400.
The terminal 400 is configured to run the client 410, where the client 410 is a client with a video playing function. The client 410 is configured to present a video in the human-machine interaction interface 411, detect a behavior of the user during presentation of the video, and determine a behavior characterization of the user for content appearing in the video; when the behavior characterization of the user indicates that the content appearing in the video exceeds the bearing capacity of the user, the masked video is presented in the human-machine interaction interface 411.
Next, referring to fig. 2, fig. 2 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application, and the terminal 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capability, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 450 described in the embodiments of the present application is intended to include any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, e.g., a framework layer, a core library layer, and a driver layer, used to implement various basic services and process hardware-based tasks;
network communication module 452 for reaching other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the video information processing apparatus provided in the embodiments of the present application may be implemented in software. Fig. 2 shows the video information processing apparatus 455 stored in the memory 450, which may be software in the form of a program or a plug-in, and includes the following software modules: a video playing module 4551, a detection module 4552, and a determining module 4553. These modules are logical, so any combination or further splitting may be performed according to the functions implemented. The functions of the respective modules are described below.
In the following, the embodiment of the present application will be described by taking a processing method of video information provided by the embodiment of the present application implemented cooperatively by the server 200 and the client running in the terminal 400 in fig. 1 as an example. Referring to fig. 3, fig. 3 is a flowchart of a method for processing video information according to an embodiment of the present application, and the steps shown in fig. 3 will be described.
In step S101, the server transmits a video to the client.
In some embodiments, the server sends the corresponding video to the client in response to the video acquisition request sent by the client.
Here, the server is a background server of a client, and the client is an Application (APP) having a video playing function, for example, a social network client, a live client, or a short video client. The type of video that the server sends to the client may be any type of video, such as horror or comedy, etc.
In step S102, the client presents the video in the human-computer interaction interface.
In some embodiments, the client receives the video sent by the server, and presents the content of the video in the man-machine interaction interface, and may also present control functions of the video (e.g., video volume control, stop playing, and on or off of a healthy mode), and subtitle content.
In step S103, the client detects the behavior of the target object during the presentation of the video.
Here, the client may detect the behavior of the target object only after the target object has turned on the health mode (e.g., by collecting a behavior image of the target object), or the client may detect the behavior of the target object regardless of whether the target object has turned on the health mode. The health mode refers to a function of shielding all or part of the content in the video according to the bearing capacity of the target object. The behavior of the target object may be eye behavior, limb behavior, head behavior, voice behavior, or gesture behavior.
In some embodiments, the client may default to the on-health mode when the target object triggers video playback; the health mode can also be started through the custom setting of the target object.
As one example, after the client acquires the video transmitted by the server, the type of the video is determined; when the video is of a type comprising content exceeding the bearing capacity of the target object, the client defaults to open a health mode, or presents prompt information in a human-computer interaction interface; the prompt information is used for prompting the target object to start the health mode.
Next, a specific implementation of the client to determine the type of video will be described in detail.
In some embodiments, a client obtains viewing data of a video through a server; the type of video is determined from the viewing data of the video.
Here, the viewing data of the video includes at least one of: bullet-screen (danmaku) data; video comments; number of views; viewing operations (e.g., a fast-forward operation, a rewind operation, or an operation to adjust the play progress); and video classification labels. The viewing data of the video may be historical viewing data of all users across the whole network, historical viewing data of the social contacts of the target object, or historical viewing data of the target object itself.
For example, when the comments or bullet screens of a video include many fields such as "horror" or "scary", and/or many fast-forward operations, it may be determined that the video contains horror content; when the comments or bullet screens include many fields such as "not suitable for children" or "adults only", and/or many operations that adjust the play progress, it may be determined that the video contains pornographic content.
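As a minimal illustration of this rule-based inference, the sketch below counts keyword hits and viewing operations; the keyword lists, parameter names, and thresholds are assumptions made for illustration and are not specified by the embodiment.

```python
from typing import Iterable, Set

HORROR_KEYWORDS = ("horror", "scary", "scare")
ADULT_KEYWORDS = ("not suitable for children", "adults only", "18+")

def infer_video_type(comments: Iterable[str],
                     fast_forward_count: int,
                     seek_count: int,
                     keyword_threshold: int = 5,
                     operation_threshold: int = 20) -> Set[str]:
    """Return coarse labels such as {'horror'} or {'pornographic'} from viewing data."""
    text = " ".join(comments).lower()
    labels = set()
    horror_hits = sum(text.count(k) for k in HORROR_KEYWORDS)
    adult_hits = sum(text.count(k) for k in ADULT_KEYWORDS)
    # Many "horror"/"scary" mentions or many fast-forward operations suggest horror content.
    if horror_hits >= keyword_threshold or fast_forward_count >= operation_threshold:
        labels.add("horror")
    # Many "adults only" mentions or many progress-adjust operations suggest pornographic content.
    if adult_hits >= keyword_threshold or seek_count >= operation_threshold:
        labels.add("pornographic")
    return labels
```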
After the client determines the type of the video, the method further comprises: the client judges whether the video is of a type comprising content exceeding the bearing capacity of the target object, and the specific implementation process is as follows:
the client acquires the type of the video and invokes a neural network model to perform the following processing: extracting a feature vector of the type of the video; mapping the extracted feature vector to a probability that the video is of a type that includes content exceeding the bearing capacity of the target object and a probability that it is not; and determining the type corresponding to the maximum probability as the prediction result. In this way, the client can determine whether the video is of a type that includes content exceeding the bearing capacity of the target object.
Here, the neural network model is trained using the historical viewing data and portrait data of the target object (including the age or preferences of the target object, etc.) as samples.
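This classifier can be sketched in PyTorch as follows; the feature dimension, network architecture, and training details are assumptions for illustration, since the embodiment only specifies that a feature vector of the video type is mapped to two probabilities and the maximum is taken as the prediction.

```python
import torch
import torch.nn as nn

class BearingCapacityClassifier(nn.Module):
    """Maps a feature vector of the video type to two probabilities:
    [does not include, includes] content exceeding the target object's bearing capacity."""
    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(features), dim=-1)

model = BearingCapacityClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-ins for encoded historical viewing data + portrait data and their labels.
features = torch.randn(8, 64)
labels = torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = loss_fn(model.net(features), labels)   # CrossEntropyLoss expects raw logits
loss.backward()
optimizer.step()

prediction = model(features).argmax(dim=-1)   # the type with the maximum probability
```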
For example, in fig. 7A, when the type of the video acquired by the client is horror, a prompt window 701 is presented before the video is shown in the human-computer interaction interface, prompting the user to turn on the health mode. When the user clicks the "turn on health mode" button, the client enters the health mode, i.e., all or part of the content in the video is masked according to the user's bearing capacity (a specific implementation of masking all or part of the content in the video according to the user's bearing capacity is described in detail below). When the user clicks the "watch original video" button, the client does not turn on the health mode and presents the original video, i.e., it does not mask any content according to the user's bearing capacity. In this way, the user is given different choices, and personalized viewing requirements can be met.
In some embodiments, the client may determine the behavior of the target object (e.g., eye behavior, limb behavior, or head behavior) by capturing a behavior image of the target object (a detailed implementation is described below). The behavior of the target object may also be obtained by setting behavior option buttons (including a fear button and an aversion button) in the human-computer interaction interface and responding to operations on those buttons; for example, when the user sees a horror picture causing discomfort while watching a video, the user can click the fear button, so that the client obtains a behavior whose behavior characterization indicates that the emotion type is fear, and when the user sees a pornographic picture causing discomfort, the user can click the aversion button, so that the client obtains a behavior whose behavior characterization indicates that the emotion type is aversion. Gesture operations may also be set (such as waving a hand in front of the camera, single- or multi-tapping the screen, or sliding the screen), and the client responds to the corresponding gesture operation of the target object; for example, the client may define that waving a hand in front of the camera indicates that the emotion type is fear, so that when the user waves a hand in front of the camera upon seeing a frightening picture, the client obtains a gesture behavior whose behavior characterization indicates that the emotion type is fear. Voice operations may also be set: voice information is obtained in response to a voice operation of the target object, voice recognition is performed on the voice information, and the behavior of the target object is determined; for example, when the user sees a horror picture causing discomfort while watching a video, the microphone picks up voice information such as "so scary" or "terrifying", and through voice recognition the client obtains a voice behavior whose behavior characterization indicates that the emotion type is fear. In this way, the client can quickly and accurately acquire the behavior representing the emotion type of the target object, thereby improving the efficiency of subsequently identifying the emotion type of the target object.
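The following sketch shows one way such explicit signals (button taps, a camera-detected gesture, recognized speech) could be mapped to an emotion type; the event names and keyword lists are illustrative assumptions rather than values taken from the embodiment.

```python
from typing import Optional

FEAR_KEYWORDS = ("scary", "terrifying", "horrible")
AVERSION_KEYWORDS = ("nausea", "disgusting", "gross")

def emotion_from_signal(signal_type: str, payload: str = "") -> Optional[str]:
    """Map an explicit user signal to 'fear', 'aversion', or None if inconclusive."""
    if signal_type == "button":        # the user tapped a behavior option button
        return payload if payload in ("fear", "aversion") else None
    if signal_type == "gesture":       # e.g. a hand wave detected in front of the camera
        return "fear" if payload == "hand_wave" else None
    if signal_type == "speech":        # keyword spotting on the recognized speech text
        text = payload.lower()
        if any(k in text for k in FEAR_KEYWORDS):
            return "fear"
        if any(k in text for k in AVERSION_KEYWORDS):
            return "aversion"
    return None
```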
In step S104, the client determines, according to the behavior of the target object, a behavior characterization of the target object with respect to the content appearing in the video.
By way of example, emotion types indicated by a behavior characterization may include happiness, vigilance, fear, aversion, and sadness, among others.
In some embodiments, a client gathers behavior of a target object in real-time and determines behavior characterizations of the target object for content that appears in a video. That is, the client may invoke a corresponding service (e.g., behavior recognition service) of the terminal through which the recognition process of behavior characterization is completed.
As an example, the client obtains a plurality of reference behavior characterizations by invoking a behavior recognition service of the terminal, matches the behavior of the target object against each reference behavior characterization, calculates the similarity between the behavior of the target object and each reference behavior characterization, and takes the reference behavior characterization with the highest similarity to the behavior of the target object as the behavior characterization of the target object for the content appearing in the video.
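A minimal sketch of this similarity matching is shown below, assuming the behavior and the reference characterizations are represented as feature vectors and cosine similarity is the metric (the embodiment does not fix a representation or a metric).

```python
from typing import Dict
import numpy as np

def match_behavior(behavior_vec: np.ndarray,
                   references: Dict[str, np.ndarray]) -> str:
    """Return the reference characterization most similar to the detected behavior."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(references, key=lambda name: cosine(behavior_vec, references[name]))

# Example usage: the reference vectors would come from the behavior recognition service.
refs = {"fear": np.array([1.0, 0.1, 0.0]),
        "aversion": np.array([0.0, 1.0, 0.2]),
        "neutral": np.array([0.1, 0.1, 1.0])}
print(match_behavior(np.array([0.9, 0.2, 0.1]), refs))   # -> "fear"
```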
In other embodiments, the client collects the behavior of the target object in real time and sends data characterizing the behavior of the target object to the server for identification of the behavior characterization. That is, the client may invoke a corresponding service (e.g., behavior recognition service) of the server through which the recognition process of behavior characterization is completed.
As an example, a client sends data characterizing the behavior of a target object to a server; the server acquires a plurality of reference behavior characterizations by calling a behavior recognition service; matching the behaviors of the target object with a plurality of reference behavior representations respectively, calculating the similarity of the behaviors of the target object corresponding to each reference behavior representation respectively, and taking the reference behavior representation with the highest similarity with the behaviors of the target object as the behavior representation of the target object for the content appearing in the video; and sending the behavior representation of the target object aiming at the content appearing in the video to the client.
For example, when the behavior of the target object is an eye behavior such as shutting the eyes tightly or shifting the focus of the eyes, a limb behavior such as blocking the eyes with a hand or an object, a voice behavior whose voice information indicates fear after voice recognition (e.g., "scary", "horrifying"), or a head behavior such as turning the head to avert the gaze, it may be determined that the emotion type indicated by the behavior characterization corresponding to the behavior of the target object is fear. When the behavior of the target object is an eye behavior such as squinting, a limb behavior such as blocking the line of sight with a hand or an object, a voice behavior whose voice information indicates aversion after voice recognition (e.g., "nausea", "disgusting"), or a head behavior such as lowering the head with the corners of the mouth pulled down, it may be determined that the emotion type indicated by the behavior characterization is aversion.
The embodiment of the application can accurately judge the corresponding emotion type according to the behavior of the target object when watching the video, so that whether the content appearing in the video exceeds the bearing capacity of the target object can be accurately determined according to the emotion type of the target object, and the accuracy of shielding the video is improved.
In step S105, when the behavior characterization of the target object indicates that the content appearing in the video exceeds the affordability of the target object, the client presents a masking effect of the content.
In some embodiments, prior to step S105 may include: when the behavior characterization of the target object indicates that the emotion type of the target object belongs to fear or dislike, the client determines that the content appearing in the video exceeds the bearing capacity of the target object.
Here, when the behavior characterization of the target object indicates that the emotion type of the target object belongs to fear or dislike, it indicates that the content appearing in the video causes discomfort to the target object, that is, the target object is unwilling to watch the content appearing in the video, for example, horror content that frightens the target object or pornographic content that disgusts the target object; therefore, the client needs to mask the content appearing in the video.
In some embodiments, implementations of the masking effect of the client presentation content include at least one of: superposing the material in the whole picture area of the content; superimposing the material in a partial picture area of the content; skipping playback of the content; the volume of the played content is reduced.
The implementation manner of the shielding effect of the client presentation content will be specifically described below.
(1) The client superimposes the material in the full picture area of the content.
Here, the shielding effect presented by the client after superimposing material in the entire picture area of the content includes at least one of: mosaic, blur, erosion, frosting, grid, or occlusion.
As an example, the client blurs the entire picture area as a whole, replaces the entire picture area with a blank, black, or preset background, or pixelates the entire picture area so that a mosaic effect is presented across the entire picture area.
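For illustration, the full-frame effects mentioned above could be realized with OpenCV roughly as follows; the mosaic block size and blur kernel are arbitrary choices.

```python
import cv2
import numpy as np

def mask_full_frame(frame: np.ndarray, effect: str = "mosaic") -> np.ndarray:
    """Apply a full-frame shielding effect to a BGR frame."""
    h, w = frame.shape[:2]
    if effect == "mosaic":
        # Downscale then upscale with nearest-neighbour interpolation to pixelate.
        small = cv2.resize(frame, (max(1, w // 20), max(1, h // 20)),
                           interpolation=cv2.INTER_LINEAR)
        return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
    if effect == "blur":
        return cv2.GaussianBlur(frame, (51, 51), 0)
    if effect == "black":
        return np.zeros_like(frame)       # replace with a black background
    return frame
```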
Masking the whole picture in this way prevents the target object from seeing the uncomfortable content to the greatest extent, but the target object may then be unable to follow the complete plot of the video. To avoid this problem, the client may instead superimpose the material only in a partial picture area of the content.
In some embodiments, when the behavior characterization of the target object indicates that the content appearing in the video exceeds the second-level bearing capacity of the target object, the client superimposes material in the entire picture region of the content; when the behavior characterization of the target object indicates that the content appearing in the video exceeds the first-level bearing capacity of the target object, the client superimposes material in a partial picture region of the content (a specific implementation of superimposing material in a partial picture region of the content is described in detail below).
Here, taking movie classification as an example, the rating level corresponding to the second-level bearing capacity is higher than that corresponding to the first-level bearing capacity; for example, a movie corresponding to the second-level bearing capacity is suitable for viewers of any age, while a movie corresponding to the first-level bearing capacity is suitable only for viewers aged 18 or above. That is, taking a horror film as an example, the horror content in a video corresponding to the second-level bearing capacity is more frightening than the horror content in a video corresponding to the first-level bearing capacity.
Therefore, the video can be shielded to different degrees according to the bearing capacity of different levels of the user, so that the user cannot watch the overstimulated picture, but can watch a part of slightly stimulated picture, and the personalized viewing requirement of the user is met.
(2) The client superimposes the material in a partial picture area of the content.
Here, the shielding effect presented by the client after superimposing material in a partial picture area of the content includes at least one of: mosaic, blur, erosion, frosting, grid, or occlusion.
As an example, the client pixelates the partial picture area so that a mosaic effect is displayed in the partial picture area, or covers the partial picture area so that an occlusion effect is displayed in the partial picture area, where the covering item may be an amusing picture, which can relieve the uncomfortable emotion of the target object to a greater extent.
As another example, the client may superimpose, in a partial picture area of the content, material corresponding to the emotion type indicated by the behavior characterization of the target object. For example, when the emotion type of the target object is fear, a funny picture is overlaid in the partial picture area of the content to amuse the target object and relieve its fear; when the emotion type of the target object is aversion, the content appearing in the video may be a pornographic picture, so material is superimposed in the partial picture area of the content so that the target object cannot see the pornographic picture, relieving its aversion. In this way, not only can the user's discomfort be relieved, but the enjoyment of watching the video can also be improved.
In some embodiments, the client determines a current frame of the video playback; the material is superimposed in the region where there is a difference between the current frame and the previous frame.
Taking a horror film as an example, what frightens the user is usually a horror element (such as a monster or a zombie) that suddenly appears in the next frame. For this reason, the client may determine the region where a difference exists between the current frame and the previous frame as the horror region of the playing picture, and then superimpose material on that region. In this way, suddenly appearing regions in the video (such as horror regions or aversion regions) can be automatically overlaid with material, improving the efficiency of content identification and avoiding delays in masking the content.
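A rough OpenCV sketch of this frame-differencing approach is shown below; the difference threshold, dilation size, and choice of blur as the superimposed material are assumptions made for illustration.

```python
import cv2
import numpy as np

def mask_sudden_region(prev_frame: np.ndarray, cur_frame: np.ndarray,
                       diff_threshold: int = 40) -> np.ndarray:
    """Superimpose material (here a blur) on the region that differs from the previous frame."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(cur_gray, prev_gray)
    _, mask = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))   # merge nearby changed pixels
    out = cur_frame.copy()
    blurred = cv2.GaussianBlur(cur_frame, (51, 51), 0)     # the "material" in this sketch
    out[mask > 0] = blurred[mask > 0]                      # overlay only where a difference exists
    return out
```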
In other embodiments, the client determines a focal region of the line of sight of the target object in the content by an eye tracking system; and carrying out target identification on the focus area to determine a target in the focus area, and superposing the material on the target.
Here, the target may be a horror element as described above (e.g., a monster or a zombie); it may also be a pornographic element (e.g., nudity). The client may invoke a corresponding service (e.g., a target recognition service) of the terminal, through which the target recognition process is completed. The client may also invoke a corresponding service (e.g., a target recognition service) of the server, through which the target recognition process is completed. Of course, the target recognition process may also be implemented cooperatively by the client and the server. The following description takes as an example the client invoking a corresponding service (e.g., a target recognition service) of the terminal so that the terminal completes target recognition; the process in which the server completes target recognition is similar and is not described in detail.
As an example, the process of determining the focal region of the line of sight of the target object in the content by the client is: the client calls a camera device (such as a camera) of the terminal to collect the positions of the pupil of the target object and the reflection bright spots on the outer surface of the cornea of the eyeball; a focal region corresponding to a line of sight of a target object is determined in the content based on positions of the pupil of the target object and the reflected bright spots of the outer surface of the cornea of the eyeball.
Here, the reflective bright spot on the outer surface of the cornea of the eyeball refers to the Purkinje image, a bright spot on the cornea of the eyeball produced by corneal reflection (CR) of the light entering the pupil on the outer surface of the cornea.
The principle of determining the focal region corresponding to the line of sight of the target object in the content based on the positions of the pupil of the target object and the reflective bright spot on the outer surface of the cornea is as follows: because the position of the terminal camera is fixed, the position of the terminal screen light source is fixed, and the center position of the eyeball does not change, the absolute position of the Purkinje spot does not change as the eyeball rotates; however, its position relative to the pupil and the eyeball changes constantly. For example, when the target object stares at the camera, the Purkinje spot lies in the middle of the pupil, and when the target object raises its head, the Purkinje spot lies just below the pupil. Therefore, by locating the pupil and the Purkinje spot in the eye image in real time and calculating the corneal reflection vector, the gaze direction of the target object can be estimated with a geometric model. The focal region corresponding to the line of sight of the target object is then determined in the video content based on the relationship between the eye features of the target object and the content presented on the terminal screen, which is established in a pre-calibration process (i.e., by having the target object gaze at specific points on the terminal screen).
For example, the client determines the cornea reflection vector of the target object according to the pupil of the target object and the position of the reflection bright spot on the outer surface of the cornea of the eyeball; determining the sight line direction of the target object when watching the video according to the cornea reflection vector of the target object; the focus area is determined in the content according to the line of sight direction when the target object views the video.
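A minimal sketch of this gaze estimation is given below: the corneal reflection vector (pupil center minus glint center) is mapped to screen coordinates with a linear model fitted during the calibration step. The linear calibration model is an assumption; the embodiment only states that a geometric model and a pre-calibration step are used.

```python
import numpy as np

def corneal_reflection_vector(pupil_xy: np.ndarray, glint_xy: np.ndarray) -> np.ndarray:
    """Vector from the Purkinje glint to the pupil center in the eye image."""
    return pupil_xy - glint_xy

def fit_calibration(vectors: np.ndarray, screen_points: np.ndarray) -> np.ndarray:
    """Fit a linear map [vx, vy, 1] -> (sx, sy) from calibration samples
    gathered while the target object gazes at known points on the screen."""
    design = np.hstack([vectors, np.ones((len(vectors), 1))])
    coeffs, *_ = np.linalg.lstsq(design, screen_points, rcond=None)
    return coeffs                                   # shape (3, 2)

def gaze_point(vector: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Estimated focal point on the screen for one corneal reflection vector."""
    return np.append(vector, 1.0) @ coeffs
```

The focal region would then be a window of the video picture centered on the estimated gaze point.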
As an example, the specific process of the client performing object recognition on the focus area to determine the object in the focus area is: the client divides the focus area into a plurality of candidate frames; predicting a candidate frame comprising the target and the type of the target according to the feature vector of each candidate frame; the targets belonging to the set type included in the candidate box are determined.
Here, the setting type of the target includes at least one of: terrorist type; pornography type. Target recognition involves two processes, target location and target recognition, respectively.
For example, in order to achieve more accurate masking, the client first divides the focal region into a plurality of candidate boxes through the intelligent image recognition system, where the candidate boxes may contain targets of a set type; feature extraction is then performed on the image in each candidate box through a neural network (such as a convolutional neural network) to obtain feature vectors; the feature vector corresponding to each candidate box is classified by a support vector machine (SVM) to determine the candidate boxes that include a target and the type of the target; finally, the candidate boxes including targets of the set type are selected, and the precise region of the target of the set type included in each candidate box is determined through bounding box regression.
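This pipeline resembles an R-CNN-style detector. The sketch below uses a sliding window to generate candidate boxes and a linear SVM over externally supplied features as stand-ins; the window size, the feature extractor, and the trained classifier are assumptions, and the bounding-box regression step is only indicated in a comment.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sliding_candidates(region_w: int, region_h: int, box: int = 64, stride: int = 32):
    """Divide the focal region into overlapping candidate boxes (x, y, w, h)."""
    for y in range(0, max(1, region_h - box + 1), stride):
        for x in range(0, max(1, region_w - box + 1), stride):
            yield (x, y, box, box)

def detect_set_type_targets(focal_region: np.ndarray,
                            extract_features,           # e.g. a CNN backbone returning a 1-D vector
                            classifier: LinearSVC,
                            set_types=("horror", "pornographic")):
    """Return candidate boxes whose predicted type belongs to a set type."""
    h, w = focal_region.shape[:2]
    hits = []
    for (x, y, bw, bh) in sliding_candidates(w, h):
        patch = focal_region[y:y + bh, x:x + bw]
        label = classifier.predict([extract_features(patch)])[0]
        if label in set_types:
            hits.append(((x, y, bw, bh), label))
    return hits    # a bounding-box regressor could refine these boxes into precise target regions
```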
For example, in fig. 7B, when content causing discomfort to the user (e.g., a horror element such as a monster) appears in the video, the picture hot zone 702 (i.e., the focal area described above) causing discomfort to the user is first automatically identified in the video content by the eye-tracking system; the picture hot zone 702 is then recognized and marked by the intelligent image recognition system to obtain the marked content 703 (i.e., the target described above).
The embodiment of the present application can accurately mask content in the video, avoiding the situation in which inaccurate masking causes the user to miss key plot points of the video or prevents the user from watching content that would not cause discomfort, thereby improving the user's viewing experience.
The following description will take as an example a process of performing target recognition by the client and the server cooperatively.
In some embodiments, first the client determines a focal region of a line of sight of a target object in the content by an eye tracking system and sends the focal region to the server; then the server carries out target identification on the focus area through the intelligent image identification system so as to determine a target in the focus area, marks the target, and sends the marked target to the client; and finally, the client superimposes the material on the marked target through a video image processing system.
In some embodiments, after performing object recognition on the focal region to determine the object in the focal region, further comprising: the objects are marked so that when the marked objects reappear in the video, material is superimposed on the objects in the video.
Here, the client may call a corresponding service (e.g., a tagging service or an overlay service) of the terminal, through which the tagging and overlay processes are completed. The client may also invoke a corresponding service (e.g., a tagging service or an overlay service) of the server through which the tagging and overlay process is completed. Of course, the process of completing the marking and overlaying may also be performed cooperatively by the client and the server.
The procedure of completing the tagging and superimposition by the terminal will be described below by taking as an example a procedure in which a corresponding service (e.g., a tagging service or superimposition service) of the terminal is invoked by the client. The process of completing the marking and stacking by the server is similar to the following process, and will not be described in detail.
As an example, the client tags the target through the intelligent image recognition system to superimpose material on the target in the video when the tagged target reappears in the video.
Taking the cooperative embodiment of the server and the client as an example, the server marks the target through the intelligent image recognition system and sends the marked target to the client. When the marked target appears again in the video, the client automatically superimposes the material on the marked target in the video through the video image processing system.
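As one possible illustration, re-identifying a marked target in later frames could be done with simple template matching, as sketched below; the matching method, the score threshold, and the blur material are assumptions standing in for the intelligent image recognition and video image processing systems.

```python
import cv2
import numpy as np

def remask_marked_target(frame: np.ndarray, target_template: np.ndarray,
                         score_threshold: float = 0.8) -> np.ndarray:
    """If the previously marked target reappears in the frame, superimpose material on it."""
    result = cv2.matchTemplate(frame, target_template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < score_threshold:
        return frame                        # the marked target is not present in this frame
    x, y = max_loc
    th, tw = target_template.shape[:2]
    out = frame.copy()
    out[y:y + th, x:x + tw] = cv2.GaussianBlur(out[y:y + th, x:x + tw], (31, 31), 0)
    return out
```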
In this way, the embodiment of the present application can automatically mask, in real time, discomfort-causing content when it is presented again, masking it before the user sees it and reducing how often the user sees such content, thereby reducing the user's discomfort.
(3) The client skips the playing of the content.
As one example, when the user sees content causing discomfort, the content being played is fast-forwarded (e.g., at 2x, 4x, or 8x speed), thereby reducing the time during which the user watches the discomfort-causing content and relieving the user's mood.
As another example, when the user sees content causing discomfort, only the key frames of the content being played are selected for playback (i.e., non-key frames are discarded), so that the user may not see the discomfort-causing content at all.
Here, a key frame does not include discomfort-causing content; a key frame may be a video frame containing a key plot point of the video.
(4) The client reduces the volume of the played content.
Taking a horror film as an example, sound effect rendering is an important means for setting up horror atmosphere, and when a user views uncomfortable content, the client can automatically reduce the volume (such as silence) of playing the content, thereby relieving the fear of the user and avoiding excessive discomfort in the viewing process.
In some embodiments, the client may continuously mask the content that causes discomfort to the target object during the time the user views the entire video, or may intermittently mask the content that causes discomfort to the target object, e.g., mask the content that causes discomfort to the target object when the target object exhibits behavior that exceeds its affordability, and not mask when the behavior disappears.
The embodiment of the present application can individually mask video content that exceeds the bearing capacity of the user according to the user's behavior, thereby both meeting the user's psychological need to pursue stimulation and avoiding discomfort caused by excessive stimulation, ensuring a healthy and moderate viewing experience for different users.
Referring to fig. 4, fig. 4 is a flowchart of a method for processing video information according to an embodiment of the present application, and based on fig. 3, step S103 may be replaced with step S106, and step S104 may be replaced with step S107.
In step S106, the client acquires a behavior image of the target object during the presentation of the video.
In some embodiments, the client invokes a camera device (e.g., a camera) of the terminal to capture a behavioral image of the target object.
Here, the motion direction of key points (e.g., hands, eyes, face, head, etc.) of the body part of the target object is included in the behavior image.
In step S107, the client identifies a behavior characterization of the target object from the behavior image.
Here, the client may call a corresponding service (e.g., a behavior recognition service) of the terminal, through which recognition process of behavior characterization is completed. The client may also invoke a corresponding service (e.g., behavior recognition service) of the server through which the recognition process of behavior characterization is accomplished.
The following describes an example in which the client identifies the behavior representation of the target object according to the behavior image, where a process that the client invokes the server to complete the identification of the behavior representation is similar to a process that the client completes the identification of the behavior representation, and will not be described in detail.
In some embodiments, the client identifies the behavior type of the target object in the behavior image, and then queries the correspondence between different behavior types and behavior characterizations according to the identified behavior type to obtain the behavior characterization corresponding to that type.
As an example, the client determines, by means of a gesture behavior recognition system, the movement directions of the key points of the target object's body in the behavior image; matches these movement directions against a plurality of behaviors stored in the gesture behavior recognition system and computes the similarity to each stored behavior; takes the behavior with the highest similarity as the behavior of the target object; and determines the behavior characterization of the target object from that behavior.
For example, when the behavior of the target object is tightly closing the eyes, blocking the line of sight with a hand or an object, or averting the focus of the line of sight, the corresponding behavior characterization can be determined to be fear. When the behavior of the target object is squinting, frowning, or pulling down the corners of the mouth, the corresponding behavior characterization can be determined to be aversion.
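A minimal sketch of this rule-based lookup, assuming hand-written behavior templates and a correspondence table (the template vectors, names, and thresholds are illustrative assumptions, not values from the patent):

```python
import numpy as np

# Hypothetical templates: typical key-point motion-direction vectors per behavior.
BEHAVIOR_TEMPLATES = {
    "eyes_tightly_closed": np.array([0.0, -1.0, 0.0, -1.0]),
    "hand_blocks_sight":   np.array([0.0,  1.0, 0.0,  0.0]),
    "gaze_averted":        np.array([1.0,  0.0, 1.0,  0.0]),
    "squint_frown":        np.array([0.5, -0.5, -0.5, -0.5]),
}

# Correspondence between behavior types and behavior characterizations.
BEHAVIOR_TO_CHARACTERIZATION = {
    "eyes_tightly_closed": "fear",
    "hand_blocks_sight":   "fear",
    "gaze_averted":        "fear",
    "squint_frown":        "aversion",
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def characterize(keypoint_motion: np.ndarray) -> str:
    """Match observed key-point motion directions against the stored behaviors
    and look up the characterization of the most similar one."""
    best_type = max(
        BEHAVIOR_TEMPLATES,
        key=lambda t: cosine_similarity(keypoint_motion, BEHAVIOR_TEMPLATES[t]),
    )
    return BEHAVIOR_TO_CHARACTERIZATION[best_type]

print(characterize(np.array([0.1, -0.9, 0.0, -0.8])))  # -> "fear"
```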
In other embodiments, the client invokes a neural network model to perform the following processing: extracting a feature vector of the behavior image; mapping the extracted feature vector to probabilities corresponding to a plurality of behavior characterizations; and determining the behavior characterization with the maximum probability as the behavior characterization of the target object.
Here, the neural network model is trained on samples consisting of sample behavior images of the target object and the behavior characterizations labeled for those images. The model may be stored locally or in the cloud (e.g., on a server), and the client completes the identification of the behavior characterization by invoking the local or cloud model.
By way of example, the training process of the neural network model is as follows: first, acquire training samples, each comprising a sample behavior image of the target object and the behavior characterization labeled for that image; then extract the feature vector of the sample behavior image, map it to probabilities over the behavior characterizations, and take the characterization with the maximum probability as the predicted behavior characterization of the target object; compute the difference between the predicted characterization and the labeled characterization; and finally update the parameters of the neural network model according to that difference.
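A minimal training sketch in PyTorch. The network architecture, label set, optimizer, and hyperparameters are illustrative assumptions; the patent only specifies the feature-extraction, probability-mapping, and parameter-update steps.

```python
import torch
import torch.nn as nn

CHARACTERIZATIONS = ["fear", "aversion", "neutral"]  # assumed label set

class BehaviorNet(nn.Module):
    """Small CNN: extracts a feature vector from the behavior image and maps it
    to scores over the behavior characterizations."""
    def __init__(self, num_classes=len(CHARACTERIZATIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # difference between prediction and label
    for _ in range(epochs):
        for images, labels in loader:   # sample behavior images + labeled characterizations
            loss = loss_fn(model(images), labels)
            opt.zero_grad()
            loss.backward()             # update parameters according to the difference
            opt.step()

def predict(model, image):
    probs = torch.softmax(model(image.unsqueeze(0)), dim=1)
    return CHARACTERIZATIONS[int(probs.argmax())]
```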
Here, the sample behavior images of the target object may be collected before the target object watches the video; for example, before the user watches the video, the camera collects sample behavior images of the user, and the user labels each collected image with the corresponding behavior characterization.
In this way, the behavior characterization of the target object can be determined either by rules or by a neural network model. On the one hand, the rule-based approach is simple and fast, which speeds up the masking of video content and reduces the time the user spends watching uncomfortable content; on the other hand, the neural network approach is more complex but more accurate, so whether to mask content appearing in the video can be determined precisely from the user's behavior, avoiding the situation where a misjudgment masks content that does not exceed the user's bearing capacity and harms the viewing experience.
Referring to fig. 5, fig. 5 is a flowchart of a method for processing video information according to an embodiment of the present application, and based on fig. 3, step S108 may be included after step S105.
In step S108, the client cancels the shielding effect and displays the original video in response to an operation of turning off the health mode.
Here, the operation may take any form preset by the operating system that does not conflict with already registered operations, for example a button click, a slide operation, or a voice command.
In some embodiments, the client may misjudge the behavior of the target object and mask content that should not be masked, so that the target object cannot watch the video content it wants to see. Alternatively, taking a horror film as an example, the bearing capacity of the target object is not fixed throughout viewing; it may gradually rise as the plot develops, and content the target object could now tolerate may remain masked because the health mode is on, preventing the target object from watching subsequent content that no longer exceeds its bearing capacity. For this, the human-machine interaction interface of the client may present a health mode off button; when the user triggers it, the client cancels the shielding effect and displays the original video.
As an example, in fig. 7D, when the user clicks the health mode off button 706, the client plays the original video without superimposed material.
In the embodiment of the application, if content the user did not want masked has been masked, the health mode can be turned off through a user operation to cancel the masking of the video, ensuring that the user can watch content that meets their needs.
The method for processing video information provided by the embodiment of the application is described below by taking a horror film as an example.
Viewing video (e.g., horror film) meets the psychological needs of people pursuing stimulation, but overstimulating content can cause psychological and physiological discomfort to people.
Referring to fig. 6A and 6B, fig. 6A and 6B are schematic views of application scenarios provided by the related art. In fig. 6A, for a video containing images that may cause user discomfort, prompt information 601 is presented on the human-computer interaction interface before the user watches the video. In fig. 6B, content 602 that is generally considered likely to cause discomfort is coded (pixelated) in advance, without taking into account the differing bearing capacities of different users.
The related art therefore has the following problems: for users who want to pursue stimulating content but worry about excessive discomfort or fear, a reminder before viewing does not relieve the discomfort, and uniform advance coding cannot meet the viewing needs of users with different bearing capacities.
In view of these problems, the embodiment of the application provides a method for processing video information that allows users with different tolerances for stimulation to satisfy their psychological need for stimulation while watching video, avoids excessive discomfort caused by uncomfortable content, and ensures a suitable viewing experience for different users.
The implementation manner of the embodiment of the present application is described below with reference to fig. 7A, 7B, 7C and 7D, and fig. 7A, 7B, 7C and 7D are schematic application scenarios provided by the embodiment of the present application.
In step S701, when the user watches a video with stimulating content (e.g., a horror video), the user is prompted whether to turn on the health mode. In response to a triggering operation for the health mode, the health mode is entered, and intelligent coding is performed when the user shows excessive discomfort.
By way of example, in fig. 7A, when the user watches a video with stimulating content, the human-machine interaction interface presents a prompt window 701. When the user clicks the button for turning on the health mode, the health mode is entered, and intelligent coding is performed when the user shows excessive discomfort.
In step S702, after the health mode is turned on, content causing the user discomfort is recognized in real time by the discrimination system.
In some embodiments, the discrimination system includes a gesture behavior recognition system and an eye tracking system. The gesture behavior recognition system recognizes uncomfortable states of the user, such as closing the eyes or blocking the line of sight; at the same time, the eye tracking system identifies the picture hot zone (i.e., the focus area described above) causing the user discomfort.
As an example, in fig. 7B, when content causing user discomfort (e.g., a horror element such as a monster) appears in the video, the picture hot zone 702 causing the discomfort is automatically identified in the video content by the eye tracking system.
In step S703, the screen hotspots are identified and marked by the intelligent image identification system.
As an example, in fig. 7B, the screen hotspots 702 are identified and marked by the intelligent image recognition system, and the mark content 703 (i.e., the above-described object) is obtained.
In step S704, the marking content is coded in real time by the video image processing system.
Here, in addition to coding the marked content, the marked content may instead be replaced with other innocuous content, for example by applying a texture map or re-rendering.
As an example, in fig. 7B, the tag content 703 is subjected to a coding process to present the coded tag content 704 in the man-machine interaction interface.
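A minimal sketch of the coding (mosaic/pixelation) step applied to a marked rectangle; the box coordinates are assumed to come from the image recognition step, and the block size is illustrative.

```python
import cv2

def pixelate_region(frame, box, block=16):
    """Mosaic the marked content: shrink the region and scale it back up with
    nearest-neighbour interpolation so the individual blocks become visible."""
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    small = cv2.resize(roi, (max(1, w // block), max(1, h // block)),
                       interpolation=cv2.INTER_LINEAR)
    frame[y:y + h, x:x + w] = cv2.resize(small, (w, h),
                                         interpolation=cv2.INTER_NEAREST)
    return frame
```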
In step S705, when the marked content appears again in the video content, the reproduced marked content is coded in real time by the video image processing system.
As an example, in fig. 7C, when the reoccurring mark content 705 is presented in the human-computer interaction interface, the reoccurring mark content 705 is automatically coded.
In step S706, in response to the operation of turning off the health mode, the original video is played.
As an example, in fig. 7D, when the user clicks the health mode off button 706, the original, uncoded video is played.
Referring to fig. 8, fig. 8 is a flowchart of a method for processing video information according to an embodiment of the present application, and the detailed description will be given below with reference to fig. 8.
In step S801, after the health mode is turned on, the client identifies pictures causing an excessive discomfort reaction of the user during video playback by calling the discrimination system of the terminal.
In some embodiments, the discrimination system includes a gesture behavior recognition system and an eye tracking system. The client judges the user's behavioral response to discomfort by calling the gesture behavior recognition system of the terminal, e.g., closing the eyes, blocking the line of sight with a hand or an object, or averting the focus of the line of sight; meanwhile, the client tracks and identifies, in real time, the hot zone of the picture the user is currently viewing by calling the eye tracking system of the terminal.
In step S802, the backend receives the identified picture hot zone, and identifies and marks the image content of that region through the intelligent image recognition system.
In step S803, according to the marking information returned by the backend, the client codes the region of the picture in which the content appears by calling the video image processing system of the terminal, and presents in the human-machine interaction interface picture content that changes in real time according to the user's reaction.
In step S804, when the marked content appears again on the video screen, the client recognizes and performs coding processing by calling the video image processing system of the terminal.
In summary, when the user watches a video, the embodiment of the application can identify content causing the user excessive discomfort through the gesture behavior recognition system and the eye tracking system, mark that content by intelligent image recognition, automatically code the image region of the content to relieve the user's discomfort, and automatically code the image again whenever it reappears in the video. This satisfies the need of users with different tolerances for stimulation to pursue stimulation while keeping viewing healthy and moderate.
Continuing with the description below of an exemplary architecture of the video information processing apparatus 455 implemented as a software module provided by embodiments of the present application, in some embodiments, as shown in fig. 2, the software module stored in the video information processing apparatus 455 of the memory 450 may include:
the video playing module 4551 is configured to present a video in the human-computer interaction interface;
a detection module 4552 for detecting a behaviour of a target object during presentation of the video;
a determining module 4553 configured to determine a behavior characterization for content appearing in the video according to the behavior of the target object;
The video playing module 4551 is further configured to present a masking effect on content appearing in the video when the behavior characterization of the target object indicates that the content exceeds the bearing capacity of the target object.
In some embodiments, the detection module 4552 is further configured to acquire a behavioral image of the target object; the determining module is further used for identifying behavior characterization of the target object according to the behavior image.
In some embodiments, the determining module 4553 is further configured to identify, in the behavioral image, a behavioral type of the target object; and inquiring the corresponding relation between different behavior types and behavior characterization according to the identified behavior types to obtain the behavior characterization corresponding to the identified behavior types.
In some embodiments, the determining module 4553 is further configured to invoke the neural network model to perform the following: extracting feature vectors of the behavior image; mapping the extracted feature vector into probabilities corresponding to a plurality of behavior characterizations, and determining the behavior characterization corresponding to the maximum probability as the behavior characterization of the target object; the neural network model is obtained by training a sample behavior image of the target object and a behavior characterization of a label of the sample behavior image.
In some embodiments, the determining module 4553 is further configured to determine that content appearing in the video exceeds the bearing capability of the target object when the behavior characterization of the target object indicates that the emotion type of the target object is fear or aversion; the video playing module is further used for executing at least one of the following operations: superposing materials in all picture areas of the content; superimposing a material in a partial picture area of the content; skipping playing of the content; and reducing the volume of playing the content.
In some embodiments, the video playing module 4551 is further configured to determine the current frame of the video being played, and to superimpose material in the region where the current frame differs from the previous frame, so that the differing region exhibits at least one of the following masking effects: mosaic; blurring; etching; frosting; grid; occlusion.
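A sketch of masking only the regions that changed between the previous and current frame, using OpenCV frame differencing; the threshold, dilation kernel, and block size are illustrative assumptions.

```python
import cv2
import numpy as np

def mask_changed_regions(prev_frame, cur_frame, block=16, thresh=25):
    """Pixelate every region of the current frame that differs from the
    previous frame, leaving the static background untouched."""
    diff = cv2.absdiff(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, np.ones((9, 9), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    out = cur_frame.copy()
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        roi = out[y:y + h, x:x + w]
        small = cv2.resize(roi, (max(1, w // block), max(1, h // block)))
        out[y:y + h, x:x + w] = cv2.resize(small, (w, h),
                                           interpolation=cv2.INTER_NEAREST)
    return out
```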
In some embodiments, the video playing module 4551 is further configured to determine the focus area of the target object's line of sight in the content, perform target recognition on the focus area to determine the target in the focus area, and superimpose material on the target so that the target exhibits at least one of the following masking effects: mosaic; blurring; etching; frosting; grid; occlusion.
In some embodiments, the video playing module 4551 is further configured to collect positions of the pupil and the reflective bright spots on the outer surface of the cornea of the eyeball of the target object; and determining a focus area corresponding to the sight line of the target object in the content according to the positions of the pupil of the target object and the reflecting bright spots on the outer surface of the cornea of the eyeball.
In some embodiments, the video playing module 4551 is further configured to determine a corneal reflection vector of the target object according to the positions of the pupil and the reflective bright spots on the outer surface of the cornea of the eyeball of the target object; determining a sight line direction of the target object when watching the video according to the cornea reflection vector of the target object; and determining the focus area in the content according to the sight line direction of the target object when watching the video.
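A sketch of the pupil–corneal-reflection approach: the vector from the corneal glint to the pupil centre is mapped to screen coordinates by a polynomial fitted during a prior calibration phase. The polynomial order, calibration procedure, and focus-area size are assumptions; the patent only states that the gaze direction and focus area are derived from the corneal reflection vector.

```python
import numpy as np

def fit_gaze_mapping(pupil_glint_vectors, screen_points):
    """Least-squares fit of a 2nd-order polynomial mapping the corneal
    reflection vector (pupil centre minus glint position) to screen x/y.
    Both inputs are N x 2 arrays collected during calibration."""
    vx, vy = pupil_glint_vectors[:, 0], pupil_glint_vectors[:, 1]
    A = np.column_stack([np.ones_like(vx), vx, vy, vx * vy, vx ** 2, vy ** 2])
    coeffs, *_ = np.linalg.lstsq(A, screen_points, rcond=None)
    return coeffs  # 6 x 2 matrix

def gaze_point(pupil, glint, coeffs):
    """Estimate where on the screen the viewer is looking for one sample."""
    vx, vy = pupil[0] - glint[0], pupil[1] - glint[1]
    a = np.array([1.0, vx, vy, vx * vy, vx ** 2, vy ** 2])
    return a @ coeffs  # (x, y) in screen coordinates

def focus_region(point, size=200):
    """A square focus area centred on the estimated gaze point."""
    x, y = point
    return int(x - size / 2), int(y - size / 2), size, size
```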
In some embodiments, the video playing module 4551 is further configured to divide the focal region into a plurality of candidate boxes; predicting a candidate frame comprising the target and the type of the target according to the feature vector of each candidate frame; determining targets belonging to a set type included in the candidate frame; wherein the setting type of the target comprises at least one of the following: terrorist type; pornography type.
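A sketch of the candidate-box step: slide fixed-size boxes over the focus area, classify each, and keep only boxes whose predicted type belongs to the set types. The `extract_feature` and `classify` callables are hypothetical stand-ins for the patent's feature extractor and type predictor, and the window/step sizes and threshold are assumptions.

```python
SET_TYPES = {"terror", "pornography"}  # the configured sensitive types

def candidate_boxes(region, step=64, size=96):
    """Slide fixed-size candidate boxes over the focus area."""
    x0, y0, w, h = region
    for y in range(y0, y0 + h - size + 1, step):
        for x in range(x0, x0 + w - size + 1, step):
            yield (x, y, size, size)

def find_targets(frame, region, extract_feature, classify):
    """Return candidate boxes whose predicted type is one of the set types."""
    targets = []
    for box in candidate_boxes(region):
        x, y, w, h = box
        feat = extract_feature(frame[y:y + h, x:x + w])
        label, score = classify(feat)        # e.g. ("terror", 0.93)
        if label in SET_TYPES and score > 0.5:
            targets.append((box, label))
    return targets
```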
In some embodiments, the video playing module 4551 is further configured to mark the target, so as to superimpose the material on the target in the video when the marked target appears again in the video.
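One simple way to realize the marking and re-appearance step is template matching: store an image patch of the marked target and search each later frame for it, re-applying the material when it is found. This is a sketch under that assumption; the matching method and threshold are not specified by the patent.

```python
import cv2

def mark_target(frame, box):
    """Store an image patch of the identified target for later re-detection."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w].copy()

def mask_if_reappears(frame, template, block=16, score_thresh=0.8):
    """Search a new frame for the marked target; if it reappears,
    superimpose the mosaic material on it again."""
    res = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(res)
    if max_val >= score_thresh:
        x, y = max_loc
        h, w = template.shape[:2]
        roi = frame[y:y + h, x:x + w]
        small = cv2.resize(roi, (max(1, w // block), max(1, h // block)))
        frame[y:y + h, x:x + w] = cv2.resize(small, (w, h),
                                             interpolation=cv2.INTER_NEAREST)
    return frame
```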
Embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method for processing video information provided by embodiments of the present application, for example, a method for processing video information as shown in fig. 3, 4, 5, or 8, where the computer includes various computing devices including an intelligent terminal and a server.
In some embodiments, the computer-readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM, or various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, e.g., in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects:
(1) The user's emotion while watching the video is determined by detecting the user's behavior during viewing, and content appearing in the video is masked in a differentiated way, so that video content exceeding a given user's bearing capacity can be masked individually for users with different bearing capacities. This both satisfies the user's psychological need to pursue stimulation and avoids excessive discomfort caused by uncomfortable content, ensuring a healthy and moderate viewing experience for different users.
(2) The corresponding behavior characterization can be judged accurately from the behavior of the target object while watching the video, so whether content appearing in the video exceeds the bearing capacity of the target object can be determined accurately from that characterization, improving the accuracy of masking.
(3) Content appearing in the video can be masked precisely, avoiding the problem of missing key plot points because the masked area is inaccurate, and avoiding the situation where the user cannot watch content that would not cause discomfort, thereby improving the viewing experience.
(4) Content that caused the user discomfort can be masked automatically when it reappears; it can be masked before the user sees it, and the frequency with which the user sees discomfort-causing content is reduced, so the user's discomfort is reduced.
(5) On the one hand, the rule-based process of determining the behavior characterization of the target object is simple and fast, which speeds up the masking of video content and reduces the time the user spends watching uncomfortable content; on the other hand, determining the behavior characterization with a neural network model is more complex but more accurate, so whether to mask content appearing in the video can be determined precisely from the user's behavior, avoiding the masking, through misjudgment, of content that does not exceed the user's bearing capacity and the resulting harm to the viewing experience.
(6) After content the user did not want masked has been masked, the health mode can be turned off through a user operation to cancel the masking of the video, ensuring that the user can watch content that meets their needs.
(7) The video can be masked to different degrees according to the user's different levels of bearing capacity, so that the user does not see over-stimulating pictures but can still watch mildly stimulating ones, meeting the user's personalized viewing needs.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (13)

1. A method for processing video information, the method comprising:
presenting a video in a human-computer interaction interface;
detecting a behavior of a target object during presentation of the video;
determining a behavior characterization for content present in the video based on the behavior of the target object;
when the behavior characterization indicates that the emotion type of the target object belongs to fear or aversion, determining that content appearing in the video exceeds the bearing capacity of the target object;
Determining a focal region of a line of sight of the target object in the content;
performing object recognition on the focus area to determine an object in the focus area, and superimposing material on the object so that the object exhibits at least one of the following shielding effects: mosaic; blurring; etching; grid.
2. The method according to claim 1, wherein
the detecting the behavior of the target object includes:
collecting a behavior image of the target object;
the determining behavior characterization for content appearing in the video according to the behavior of the target object comprises the following steps:
and identifying the behavior representation of the target object according to the behavior image.
3. The method of claim 2, wherein the identifying a behavioral representation of the target object from the behavioral image comprises:
identifying a behavior type of the target object in the behavior image;
and inquiring the corresponding relation between different behavior types and behavior characterization according to the identified behavior types to obtain the behavior characterization corresponding to the identified behavior types.
4. The method of claim 2, wherein the identifying a behavioral representation of the target object from the behavioral image comprises:
Invoking the neural network model to perform the following:
extracting feature vectors of the behavior image;
mapping the extracted feature vector into probabilities corresponding to a plurality of behavior characterizations, and determining the behavior characterization corresponding to the maximum probability as the behavior characterization of the target object;
the neural network model is obtained by training a sample behavior image of the target object and a behavior characterization of a label of the sample behavior image.
5. The method of claim 1, wherein when presenting the masking effect of the content, the method further comprises:
determining a current frame of the video playing;
superimposing material in a region where there is a difference between the current frame and the previous frame, so that the region of difference exhibits at least one of the following shielding effects: mosaic; blurring; etching; grid.
6. The method of claim 1, wherein the determining a focal region of the line of sight of the target object in the content comprises:
collecting the positions of the pupil of the target object and the reflection bright spots on the outer surface of the cornea of the eyeball;
and determining a focus area corresponding to the sight line of the target object in the content according to the positions of the pupil of the target object and the reflecting bright spots on the outer surface of the cornea of the eyeball.
7. The method according to claim 6, wherein the determining a focal region corresponding to the line of sight of the target object in the content according to the positions of the pupil and the reflected bright spots of the outer surface of the cornea of the eyeball of the target object includes:
determining a cornea reflection vector of the target object according to the positions of the pupil of the target object and the reflection bright spots on the outer surface of the cornea of the eyeball;
determining a sight line direction of the target object when watching the video according to the cornea reflection vector of the target object;
and determining the focus area in the content according to the sight line direction of the target object when watching the video.
8. The method of claim 1, wherein the performing object recognition on the focus area to determine a target in the focus area comprises:
dividing the focus area into a plurality of candidate boxes;
predicting a candidate frame comprising the target and the type of the target according to the feature vector of each candidate frame;
determining targets belonging to a set type included in the candidate frame;
wherein the setting type of the target comprises at least one of the following: terrorist type; pornography type.
9. The method according to any one of claims 1 to 8, further comprising:
marking the target, so that when the marked target appears again in the video, the material is superimposed on the target in the video.
10. A method for processing video information, the method comprising:
presenting a video in a human-computer interaction interface;
when the content appearing in the video exceeds the bearing capacity of the target object, the shielding effect of the content is presented;
superimposing material on a target in the focus area of the line of sight of the target object in the content, so that the target exhibits at least one of the following shielding effects: mosaic; blurring; etching; grid.
11. A processing apparatus for video information, comprising:
the video playing module is used for presenting videos in the human-computer interaction interface;
a detection module for detecting a behavior of a target object during presentation of the video;
a determining module for determining a behavior characterization for content appearing in the video according to the behavior of the target object;
the determining module is further configured to determine that content appearing in the video exceeds a bearing capability of the target object when the behavioral representation of the target object indicates that the emotion type of the target object belongs to fear or aversion;
The video playing module is further configured to determine a focus area of the line of sight of the target object in the content when content appearing in the video exceeds the bearing capacity of the target object, perform target recognition on the focus area to determine a target in the focus area, and superimpose material on the target so that the target exhibits at least one of the following shielding effects: mosaic; blurring; etching; grid.
12. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of processing video information according to any one of claims 1 to 9 or the method of processing video information according to claim 10 when executing executable instructions stored in said memory.
13. A computer readable storage medium storing executable instructions for causing a processor to perform the method of processing video information according to any one of claims 1 to 9 or the method of processing video information according to claim 10.
CN202010598266.6A 2020-06-28 2020-06-28 Video information processing method and device, electronic equipment and storage medium Active CN111723758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010598266.6A CN111723758B (en) 2020-06-28 2020-06-28 Video information processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010598266.6A CN111723758B (en) 2020-06-28 2020-06-28 Video information processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111723758A CN111723758A (en) 2020-09-29
CN111723758B true CN111723758B (en) 2023-10-31

Family

ID=72569530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010598266.6A Active CN111723758B (en) 2020-06-28 2020-06-28 Video information processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111723758B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112827862B (en) * 2020-12-30 2022-08-23 重庆金康动力新能源有限公司 Grade sorting method and test equipment
CN115942054A (en) * 2022-11-18 2023-04-07 优酷网络技术(北京)有限公司 Video playing method and device, electronic equipment and readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012174186A (en) * 2011-02-24 2012-09-10 Mitsubishi Electric Corp Image processor for monitoring
CN106454490A (en) * 2016-09-21 2017-02-22 天脉聚源(北京)传媒科技有限公司 Method and device for smartly playing video
CN106454155A (en) * 2016-09-26 2017-02-22 新奥特(北京)视频技术有限公司 Video shade trick processing method and device
CN107493501A (en) * 2017-08-10 2017-12-19 上海斐讯数据通信技术有限公司 A kind of audio-video frequency content filtration system and method
CN108495191A (en) * 2018-02-11 2018-09-04 广东欧珀移动通信有限公司 Video playing control method and related product
CN108900908A (en) * 2018-07-04 2018-11-27 三星电子(中国)研发中心 Video broadcasting method and device
CN111050105A (en) * 2019-12-14 2020-04-21 中国科学院深圳先进技术研究院 Video playing method and device, toy robot and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Gaze estimation using a webcam for region of interest detection";H.-I. Kim等;《Signal, Image and Video Processing》;第10卷;全文 *
"基于表情分析和视线追踪的用户反馈采集技术";王宁致等;《智能计算机与应用》;第9卷(第3期);全文 *

Also Published As

Publication number Publication date
CN111723758A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN112507799B (en) Image recognition method based on eye movement fixation point guidance, MR glasses and medium
US11937929B2 (en) Systems and methods for using mobile and wearable video capture and feedback plat-forms for therapy of mental disorders
RU2714096C1 (en) Method, equipment and electronic device for detecting a face vitality
US20200175262A1 (en) Robot navigation for personal assistance
US20170011258A1 (en) Image analysis in support of robotic manipulation
Yang et al. Benchmarking commercial emotion detection systems using realistic distortions of facial image datasets
US11917288B2 (en) Image processing method and apparatus
CN110447232B (en) Electronic device for determining user emotion and control method thereof
CN112034977B (en) Method for MR intelligent glasses content interaction, information input and recommendation technology application
KR102092931B1 (en) Method for eye-tracking and user terminal for executing the same
US20160191995A1 (en) Image analysis for attendance query evaluation
CN110326300B (en) Information processing apparatus, information processing method, and computer-readable storage medium
JP7151959B2 (en) Image alignment method and apparatus
CN111723758B (en) Video information processing method and device, electronic equipment and storage medium
EP2449514A1 (en) Method and apparatus for image display control according to viewer factors and responses
US20220383389A1 (en) System and method for generating a product recommendation in a virtual try-on session
Saif et al. Robust drowsiness detection for vehicle driver using deep convolutional neural network
CN113556603B (en) Method and device for adjusting video playing effect and electronic equipment
CN111654752A (en) Multimedia information playing method, device and related equipment
KR20190048630A (en) Electric terminal and method for controlling the same
Zhou et al. Long-term person tracking for unmanned aerial vehicle based on human-machine collaboration
CN114281236B (en) Text processing method, apparatus, device, medium, and program product
CN111967436B (en) Image processing method and device
US20230135254A1 (en) A system and a method for personalized content presentation
CN108334821A (en) A kind of image processing method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant