CN112911384A - Video playing method and video playing device - Google Patents

Video playing method and video playing device

Info

Publication number
CN112911384A
CN112911384A (application number CN202110074458.1A)
Authority
CN
China
Prior art keywords
video
user
interest
playing
image frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110074458.1A
Other languages
Chinese (zh)
Inventor
马聪
赵瑞
孟宪宇
刘坤
方会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics China R&D Center and Samsung Electronics Co Ltd
Priority to CN202110074458.1A
Publication of CN112911384A
Legal status: Pending


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4398Processing of audio elementary streams involving reformatting operations of audio signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44218Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/4508Management of client data or end-user data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/4728End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region

Abstract

A video playing method and a video playing apparatus are disclosed. The video playing method comprises the following steps: acquiring a plurality of image frames of a video; identifying an object of interest from the plurality of image frames; determining a voice corresponding to the object of interest; and playing the voice corresponding to the interested object while playing the video to the user. The video playing method can enable the user to listen to the voice corresponding to the interested object in the video while watching the picture of the video. Thus, the user can better understand and/or learn the content of the video.

Description

Video playing method and video playing device
Technical Field
The present invention relates to the multimedia field, and more particularly, to a video playing method and a video playing apparatus.
Background
With the widespread adoption of intelligent terminals, video has become an increasingly popular form of multimedia among users.
Generally, videos can be divided into videos with voice and videos without voice. Video with voice generally helps users better understand and/or learn its content. However, the voice in such videos usually needs to be recorded in advance or added in post-production, which is costly; video without voice requires considerable effort from the user to understand and/or learn its content.
Therefore, there is a need for a method that helps the user better understand and/or learn a video.
Disclosure of Invention
The invention aims to provide a video playing method and a video playing device.
One aspect of the present invention provides a video playing method, including: acquiring a plurality of image frames of a video; identifying an object of interest from the plurality of image frames; determining a voice corresponding to the object of interest; and playing the voice corresponding to the interested object while playing the video to the user.
Optionally, the step of identifying the object of interest from the plurality of image frames comprises: inputting the plurality of image frames to an image recognition model, wherein the image recognition model is trained in advance to output text indicating correspondence to an object of interest in the input plurality of training image frames in response to the input plurality of training image frames; based on the image recognition model, text corresponding to the object of interest is output.
Optionally, the step of determining the speech corresponding to the object of interest comprises: the speech corresponding to the output text is synthesized by artificial speech synthesis to obtain a speech corresponding to the object of interest.
Optionally, the step of playing the video to the user comprises: determining an initial playing rate of the video according to a user data model, wherein the user data model is a model established according to at least one of speed, content, age and gender learned by a user; and playing the video to the user according to the initial playing speed of the video.
Optionally, the video playing method further includes: shooting a user video including a user's motion in real time; identifying a user's action from a user video, wherein playing the video to the user comprises: determining similarity between the identified user's actions and actions in the playing video; and controlling the playing speed of the video based on the similarity.
Optionally, the step of controlling the playing rate of the video based on the similarity includes: determining the level of the user based on the similarity, wherein the higher the similarity is, the higher the level of the user is; the playback rate of the video is controlled based on the user's level, wherein different levels correspond to different playback rates.
Optionally, the step of identifying the object of interest from the plurality of image frames comprises: requesting a user to select a plurality of candidate objects in response to an object of interest corresponding to the plurality of candidate objects; in response to a user selecting one of the plurality of candidate objects, an action of the one candidate object is identified as an object of interest.
Optionally, the object of interest comprises at least one of a motion of a person and an indicator.
Another aspect of the present invention provides a video playback device, including: an image frame acquisition unit configured to acquire a plurality of image frames of a video; an object of interest recognition unit configured to recognize an object of interest from the plurality of image frames; a voice determination unit configured to determine a voice corresponding to the object of interest; a playing unit configured to play a voice corresponding to the object of interest while playing the video to the user.
Another aspect of the present invention provides a computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements any of the video playback methods described above.
Another aspect of the invention provides a computing device comprising: a processor; a memory storing a computer program which, when executed by the processor, implements any of the video playback methods described above.
The video playing method can enable the user to listen to the voice corresponding to the interested object in the video while watching the picture of the video. Thus, the user can better understand and/or learn the content of the video.
In addition, the video playing method of the invention can synthesize the image frame of the video and the voice corresponding to the object of interest together to form a new video. Thus, the user can automatically obtain a new video with intelligent dubbing without other manual operations.
In addition, the video playing method can automatically adjust the playing speed of the video according to the action of the user when the video is played, so that the user can understand or keep up with the playing speed of the video without manually adjusting the playing speed of the video.
In addition, the video playing method can provide the function of eliminating one or more candidate objects which are not interesting and/or unnecessary for the user when the interesting object corresponds to the plurality of candidate objects, so that the user experience can be greatly improved. Further, since only one candidate object is retained as the object of interest, the voice corresponding to the object of interest can be accurately determined, thereby enabling the voice of the object of interest in which the user is interested to be accurately played.
Drawings
The above and other objects and features of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings, in which:
fig. 1 illustrates a video playing method according to an embodiment of the present invention;
FIG. 2 illustrates an image recognition model according to an embodiment of the present invention;
fig. 3 illustrates a method of controlling a play rate of a video according to an embodiment of the present invention;
fig. 4 illustrates a method of determining an object of interest when the object of interest corresponds to a plurality of candidate objects according to the present invention;
fig. 5 illustrates a flowchart of a video playing method when an object of interest is a motion of a person according to an embodiment of the present invention;
FIG. 6 shows a schematic diagram of a user motion scenario according to an embodiment of the invention;
FIG. 7 shows a schematic diagram in a game class scenario, according to an embodiment of the invention;
FIG. 8 shows a schematic diagram under a video narration scene according to an embodiment of the invention;
fig. 9 illustrates a video playback apparatus according to an embodiment of the present invention;
FIG. 10 shows a block diagram of a computing device, according to an embodiment of the invention.
Detailed Description
The following detailed description is provided to assist the reader in obtaining a thorough understanding of the methods, devices, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatus, and/or systems described herein will be apparent to those skilled in the art upon review of the disclosure of this application. For example, the order of operations described herein is merely an example, and is not limited to those set forth herein, but may be changed as will become apparent after understanding the disclosure of the present application, except to the extent that operations must occur in a particular order. Moreover, descriptions of features known in the art may be omitted for clarity and conciseness.
The features described herein may be embodied in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein have been provided to illustrate only some of the many possible ways to implement the methods, devices, and/or systems described herein, which will be apparent after understanding the disclosure of the present application.
As used herein, the term "and/or" includes any one of the associated listed items and any combination of any two or more.
Although terms such as "first", "second", and "third" may be used herein to describe various elements, components, regions, layers or sections, these elements, components, regions, layers or sections should not be limited by these terms. Rather, these terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section referred to in the examples described herein could also be referred to as a second element, component, region, layer or section without departing from the teachings of the examples.
In the specification, when an element (such as a layer, region or substrate) is described as being "on," "connected to" or "coupled to" another element, it can be directly on, connected to or coupled to the other element or one or more other elements may be present therebetween. In contrast, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there may be no intervening elements present.
The terminology used herein is for the purpose of describing various examples only and is not intended to be limiting of the disclosure. The singular is also intended to include the plural unless the context clearly indicates otherwise. The terms "comprises," "comprising," and "having" specify the presence of stated features, quantities, operations, elements, components, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, quantities, operations, components, elements, and/or combinations thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs after understanding the present disclosure. Unless explicitly defined as such herein, terms (such as those defined in general dictionaries) should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and should not be interpreted in an idealized or overly formal sense.
Further, in the description of the examples, when it is considered that detailed description of well-known related structures or functions will cause a vague explanation of the present disclosure, such detailed description will be omitted.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. However, embodiments may be implemented in various forms and are not limited to the embodiments described herein.
Fig. 1 illustrates a video playing method according to an embodiment of the present invention.
Referring to fig. 1, in step S110, a plurality of image frames of a video may be acquired.
Here, the video may be a video including various contents. In one non-limiting example, the video may be a video that includes dance content (e.g., jazz, latin, etc.). In another non-limiting example, the video may be a video that includes athletic content (e.g., yoga, fitness, etc.). In yet another non-limiting example, the video may be a video that includes game content (e.g., a racing game, an action game, etc.). However, the above examples are merely exemplary, and the present invention does not limit the content included in the video.
In addition, the plurality of image frames of the video may be obtained in various ways. For example, multiple image frames of a video may be acquired sequentially. As another example, multiple image frames of a video may be acquired at a fixed interval, or at varying intervals. However, the above examples are merely exemplary, and the present invention is not limited to a specific manner of acquiring the plurality of image frames of a video. Further, the number of acquired image frames may be any number; the present invention is not limited in this respect.
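As a non-limiting illustration of this step, the following sketch samples image frames at a fixed time interval; it assumes OpenCV is available for decoding, and the interval and frame-count parameters are hypothetical values not specified by the present invention.

```python
import cv2

def sample_frames(video_path, interval_s=0.5, max_frames=64):
    """Grab image frames from a video at a fixed time interval (illustrative only)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back to 25 fps if unknown
    step = max(1, int(round(fps * interval_s)))      # frames to skip between samples
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```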
In step S120, an object of interest may be identified from a plurality of image frames.
Here, the object of interest may be an object, present in at least one of the plurality of image frames, that may be of interest to a user. In one non-limiting example, the object of interest includes at least one of a motion of a person and an indicator. For example, when a person is present in at least one of the plurality of image frames, the object of interest may be the motion of the person. As another example, when an indicator (e.g., a turn indicator, an acceleration indicator, etc.) is present in at least one of the plurality of image frames, the object of interest may be the indicator. As yet another example, the object of interest may be both a motion of a person and an indicator. However, the above examples are merely exemplary, and the present invention does not limit the object of interest. In other words, the object of interest may be any of a wide variety of objects that are of interest to the user.
In one embodiment, the plurality of image frames may be input to an image recognition model, and text corresponding to the object of interest may be output based on the image recognition model. Here, the image recognition model may be trained in advance to output, in response to an input plurality of training image frames, text corresponding to an object of interest in those training image frames. For example, the image recognition model may be trained by various existing training methods (e.g., supervised or unsupervised training methods).
In the present invention, the image recognition model may be implemented by an Artificial Intelligence (AI) technique. For example, the image recognition model may be implemented by a neural network. An image recognition model according to an embodiment of the present invention will be described in detail later with reference to fig. 2.
In step S130, a voice corresponding to the object of interest is determined.
Here, the voice corresponding to the object of interest may be synthesized by artificial voice synthesis. For example, a voice corresponding to the text output in the embodiment described with reference to step S120 is synthesized by artificial voice synthesis to obtain a voice corresponding to the object of interest.
In one illustrative example, when the object of interest is the motion of a person, if the recognized motion is a hand-raising motion, a voice corresponding to that motion (e.g., the voice "please raise your hand") may be determined according to a motion-voice library previously obtained through AI training. The motion-voice library may include a motion and at least one voice corresponding to the motion. In another illustrative example, when the object of interest is an indicator, if the recognized indicator is a right-turn indicator, a voice corresponding to that indicator (e.g., the voice "please turn right") may be determined. However, the above examples are merely exemplary, and the present invention does not limit the voice corresponding to the object of interest.
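The following is a minimal sketch of such a library lookup followed by speech synthesis; the motion-voice library is shown as a simple dictionary with hypothetical entries, and pyttsx3 is only one possible off-the-shelf speech engine, not one mandated by the present invention.

```python
import pyttsx3  # one possible offline text-to-speech engine; an assumption, not required by the patent

# Illustrative motion-voice library; real entries would come from prior AI training.
MOTION_VOICE_LIBRARY = {
    "hand_up": "Please raise your hand.",
    "right_turn_indicator": "Please turn right.",
}

def speak_for_object(object_label):
    """Look up the text for a recognized object of interest and synthesize it as speech."""
    text = MOTION_VOICE_LIBRARY.get(object_label)
    if text is None:
        return
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```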
In step S140, while the video is played to the user, the voice corresponding to the object of interest is played.
That is, according to the video playing method of the present invention, the user can hear the voice corresponding to the object of interest in the video while watching the picture of the video. Thus, the user can better understand and/or learn the content of the video.
Alternatively, the initial playback rate of the video may be determined from a user data model. The user data model may be a model established according to at least one of the user's learning speed, learned content, age, and gender. For example, the user data model may indicate the degree of the user's knowledge of or familiarity with the video. The video may then be played to the user at the initial play rate.
In addition, in a preferred embodiment, the playing rate of the video can also be controlled while the video is being played to the user. A method of controlling the playback rate of the video will be described later with reference to fig. 3.
Further, optionally, speech corresponding to the object of interest may be synthesized with a plurality of image frames to form a new video. Thus, the user can automatically obtain a new video with intelligent dubbing without other manual operations.
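As one possible way of combining the image frames and the synthesized speech into a new video, the following sketch muxes a speech track into the original video file; it assumes the ffmpeg command-line tool is installed, and the file paths are illustrative.

```python
import subprocess

def mux_speech_into_video(video_path, speech_wav, out_path):
    """Combine the original video stream with the synthesized speech track
    into a new video file (requires the ffmpeg command-line tool)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,        # source of the image frames
            "-i", speech_wav,        # synthesized speech for the object of interest
            "-map", "0:v", "-map", "1:a",
            "-c:v", "copy",          # keep the original video stream untouched
            "-shortest",
            out_path,
        ],
        check=True,
    )
```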
FIG. 2 illustrates an image recognition model according to an embodiment of the present invention.
Here, the image recognition model may include, for example, a Deep Neural Network (DNN) (e.g., a fully connected network, a deep convolutional network, a recurrent neural network, etc.). In one embodiment, the image recognition model may include a 3D deep convolutional neural network to automatically extract spatial and/or temporal features from a plurality of image frames of a video and to classify and/or identify based on the extracted features. However, the present invention is not limited thereto, and the image recognition model may be implemented as a neural network of any other structure.
Referring to fig. 2, as an illustrative example, an image recognition model may include an input layer, convolutional layers, pooling layers, a vectorization layer, fully connected layers, a classification layer, and a text collection unit.
An input layer of the image recognition model may receive a plurality of image frames and transfer the received plurality of image frames to a next layer.
The convolution layer of the image recognition model may extract feature data (e.g., a feature map) by receiving image data output from an upper layer and performing a corresponding convolution operation on the image data. For example, the convolutional layers of the image recognition model may be three-dimensional convolutional layers, so that three-dimensional feature data may be extracted in consideration of both spatial and temporal dimensions. In fig. 2, as an illustrative example only, the image recognition model may include two convolutional layers, and the number of feature maps of the two convolutional layers may be 32 and 128, respectively. However, the present invention is not limited to two convolutional layers with the number of feature maps being 32 and 128, respectively. An image recognition model according to the present invention may include any number of convolutional layers, and each convolutional layer may have any number of feature maps.
Each convolutional layer of the image recognition model may be followed by a pooling layer. The pooling layer may perform pooling operations by using pooling (e.g., maximum pooling, average pooling, etc.) techniques. Pooling may enable translational invariance to the extracted features.
The last pooling layer of the image recognition model may be connected to the vectorizing layer. The vectorization layer may vectorize the received feature data and output the vectorized feature data to a subsequent fully-connected layer.
In fig. 2, the number of fully connected layers of the image recognition model is shown as 2. The two fully connected layers may have 2056 and 512 neurons, respectively, and may adopt a conventional feedforward neural network connection mode. However, the present invention is not limited to two fully connected layers with 2056 and 512 neurons, nor to the feedforward connection mode. The image recognition model according to the invention may include any number of fully connected layers, each fully connected layer may have any number of neurons, and the fully connected layers may be connected in any neural network connection mode.
In fig. 2, the fully connected layer of the image recognition model may be connected to the classification layer. For example, the classification layer may be implemented using a softmax classifier. In one example, the classification layer may include two parts. One part of the classification layer may be used to classify the type of the video. For example, by way of non-limiting example only, a video may be classified as one of a dance class, a sports class, and a game class by this part of the classification layer. The other part of the classification layer may be used to perform a more specific secondary classification of the video. For example, through this part, the video may be further classified into a sub-category such as jazz or Latin in the dance class, yoga or fitness in the sports class, or racing or action in the game class. The above classification is merely exemplary; the present invention is not limited to the above specific classification, and videos may be classified into any category as necessary. The classification layer of the image recognition model may then output a feature value (e.g., a feature vector) corresponding to at least one object of interest in the received plurality of image frames. In addition, the classification layers of the present invention are not limited to two, and may be one or more.
The text collection unit of the image recognition model may be previously trained through AI to obtain a data set including an object of interest and a subject text. Here, the subject text may be text for describing the identified object of interest in a plurality of frames. For example, a text collection unit of the image recognition model may receive feature values (e.g., feature vectors) corresponding to at least one object of interest in the plurality of image frames and determine text corresponding to the at least one object of interest based on the feature values and a dataset including the object of interest and the subject text.
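A minimal sketch of such a model is given below in PyTorch, following the layer sizes described above (two 3D convolutional layers with 32 and 128 feature maps, two fully connected layers with 2056 and 512 neurons, and a two-part classification head); the kernel sizes, input clip size, and class counts are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class InterestRecognition3DCNN(nn.Module):
    """Sketch of the 3D convolutional image recognition model described above.
    Kernel sizes, input resolution and class counts are illustrative assumptions."""
    def __init__(self, num_video_types=3, num_subtypes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # 1st conv layer: 32 feature maps
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),                   # pooling layer after each conv layer
            nn.Conv3d(32, 128, kernel_size=3, padding=1),  # 2nd conv layer: 128 feature maps
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.vectorize = nn.Flatten()                      # "vectorization layer"
        # Assuming clips of 16 frames at 112x112: 128 * 4 * 28 * 28 flattened features.
        self.fc = nn.Sequential(
            nn.Linear(128 * 4 * 28 * 28, 2056),
            nn.ReLU(inplace=True),
            nn.Linear(2056, 512),
            nn.ReLU(inplace=True),
        )
        self.type_head = nn.Linear(512, num_video_types)   # dance / sports / game
        self.subtype_head = nn.Linear(512, num_subtypes)   # jazz, Latin, yoga, fitness, ...

    def forward(self, clip):                               # clip: (N, 3, 16, 112, 112)
        x = self.fc(self.vectorize(self.features(clip)))
        # Softmax is applied at inference or inside the loss function.
        return self.type_head(x), self.subtype_head(x)
```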
Fig. 3 illustrates a method of controlling a play rate of a video according to an embodiment of the present invention.
Referring to fig. 3, in step S310, a user video including a user' S motion may be photographed in real time.
For example, a user video including a user's motion may be captured by a camera of an electronic device playing the video.
In step S320, the user' S motion may be identified from the user video.
In one embodiment, the user's actions may be identified from the user video through a user action identification model. For example, the user action model may have a similar structure to the image recognition model described with reference to fig. 2. The structure of the image recognition model according to the embodiment of the present invention has been described above with reference to fig. 2, and is not repeated in detail here. Here, the user motion model may recognize the motion of the user from the user video by outputting a feature value (e.g., a feature vector) corresponding to the motion of the user.
In step S330, a similarity between the recognized user' S motion and the motion in the video being played may be determined.
Here, various existing techniques may be used to determine the similarity between the recognized user's motion and the motion in the playing video. For example, the similarity may be determined based on a distance between a feature value corresponding to the recognized user's motion and a feature value corresponding to a motion in the playing video. The greater the distance between the feature value corresponding to the identified user's motion and the feature value corresponding to the motion in the playing video, the lower the similarity between the identified user's motion and the motion in the playing video.
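A minimal sketch of such a distance-based similarity is shown below; the specific Euclidean-distance-to-similarity mapping is an assumption, since the description only requires that a larger distance yields a lower similarity.

```python
import numpy as np

def action_similarity(user_feat, video_feat):
    """Map the distance between two action feature vectors to a similarity in [0, 1].
    Larger distance -> lower similarity, as required above."""
    user_feat = np.asarray(user_feat, dtype=np.float32)
    video_feat = np.asarray(video_feat, dtype=np.float32)
    distance = np.linalg.norm(user_feat - video_feat)  # Euclidean distance
    return 1.0 / (1.0 + distance)                      # monotonically decreasing in distance
```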
In step S340, the playing rate of the video may be controlled based on the similarity.
In one embodiment, the similarity may be related to a level of the user. For example, the higher the similarity, the higher the rank of the user. In other words, the rank of the user may be determined based on the similarity. Here, different levels may correspond to different playback rates, and the levels may indicate how well the user is aware of or familiar with the video. Accordingly, the playback rate of the video can be controlled based on the level of the user.
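The sketch below maps a similarity value to a user level and then to a playback rate; the similarity thresholds are assumptions, while the rate multipliers follow the dance-class example given later in Table 1 of this description.

```python
# Illustrative mapping from action similarity to user level to playback rate.
LEVEL_THRESHOLDS = [(0.85, 4), (0.70, 3), (0.50, 2)]  # (minimum similarity, level) -- assumed values
RATE_BY_LEVEL = {1: 0.5, 2: 0.8, 3: 1.0, 4: 1.2}      # multiples of the initial rate V (dance-class example)

def user_level(similarity):
    """Higher similarity -> higher user level."""
    for threshold, level in LEVEL_THRESHOLDS:
        if similarity >= threshold:
            return level
    return 1

def playback_rate(similarity, initial_rate_v):
    """Different levels correspond to different playback rates."""
    return RATE_BY_LEVEL[user_level(similarity)] * initial_rate_v
```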
Because the playing speed of the video can be automatically adjusted according to the action of the user when the video is played, the user can understand or follow the playing speed of the video without manually adjusting the playing speed of the video.
Fig. 4 illustrates a method of determining an object of interest when the object of interest corresponds to a plurality of candidate objects according to the present invention.
Referring to fig. 4, in step S410, in response to the object of interest corresponding to a plurality of candidate objects, the user is requested to select one of the plurality of candidate objects.
Here, the user may be requested to make the selection by means of voice playback and/or image display.
In one example, when the object of interest includes the motion of a person, the object of interest corresponding to a plurality of candidate objects may indicate the presence of a plurality of persons in the video. At this point, the user may be requested to select one of the plurality of persons in the video.
In another example, when the object of interest includes an indicator, the object of interest corresponding to a plurality of candidate objects may indicate the presence of a plurality of indicators in the video. At this point, the user may be requested to select one of the plurality of indicators in the video.
However, the above examples are merely exemplary, and the case where the object of interest of the present invention corresponds to a plurality of candidate objects is not limited to the above examples.
In step S420, one candidate object may be identified as the object of interest in response to a user selecting one candidate object of the plurality of candidate objects.
Here, the user may select one candidate object from the plurality of candidate objects by means of voice, mouse, remote controller, touch, and/or the like.
Since the function of excluding one or more candidate objects that are not interesting and/or unnecessary for the user can be provided to the user when the object of interest corresponds to a plurality of candidate objects, the user experience can be greatly improved. Further, since only one candidate object is retained as the object of interest, the voice corresponding to the object of interest can be accurately determined, thereby enabling the voice of the object of interest in which the user is interested to be accurately played.
Fig. 5 illustrates a flowchart of a video playing method when an object of interest is a motion of a person according to an embodiment of the present invention.
Referring to fig. 5, a video may be played using a smart device having a camera. Multiple-person detection may be performed on the video to determine whether there is a single person in the video. Here, various existing techniques may be used to detect persons in the video. For example, a Histogram of Oriented Gradients (HOG) feature combined with a Support Vector Machine (SVM) may be used for pedestrian detection. The main idea of the HOG feature is that the appearance and shape of local objects in an image frame can be well described by the distribution of gradient directions or edge densities. A histogram of gradient or edge directions may be collected around each pixel of the image, the image may be described according to the information of these histograms, and the collected HOG feature vectors may be classified by an SVM, so that the number of people in the video is finally obtained and each person is numbered. Although the HOG-SVM method is described above for detecting persons in a video, the present invention is not limited thereto, and other person detection methods are also possible.
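A minimal sketch of such person detection and numbering, using OpenCV's built-in HOG descriptor with its default people detector, is shown below; the window stride is an illustrative parameter.

```python
import cv2

def count_people(frame):
    """Detect and number people in one image frame using OpenCV's default
    HOG + linear-SVM pedestrian detector."""
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    boxes, _weights = hog.detectMultiScale(frame, winStride=(8, 8))
    # Number the detected persons 1, 2, ... so the user can later pick one of them.
    return {i + 1: tuple(box) for i, box in enumerate(boxes)}
```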
When it is detected that the number of people in the video is not one, i.e., there are a plurality of people, the people appearing in the video may be tagged. Here, for example, one of the persons appearing in the video may be marked as 1 and another as 2 on the display, so that the user can choose between the persons in the video.
The smart device may receive the user's voice selecting one of the people in the video. For example, when the user utters the voice "select 1", the region of the person corresponding to mark 1 may be determined as the user's region of interest. At this time, the object of interest in the user's region of interest (in fig. 5, the motion of the selected person) may be recognized through the image recognition model described with reference to fig. 2, and text corresponding to the object of interest (e.g., body parallel to the ground, left leg lifted, both hands at 180 degrees) may be output.
Next, a voice corresponding to the output text may be synthesized by artificial voice synthesis. The user can then hear the explanatory and/or introductory voice (e.g., "body parallel to the ground, left leg lifted, both hands at 180 degrees") while watching the video, so that the user can quickly grasp the content of the video.
In addition, when playing the video, the playing rate of the video may be controlled as described with reference to fig. 3.
Specifically, an initial playing rate of the video may be determined based on the user data model, and the video may be played to the user according to this initial playing rate. The user data model is a model for determining the user level, established according to at least one of the user's learning speed, learned content, age, and gender. Here, the user level may indicate the degree of the user's understanding of and/or familiarity with the dance in the video. When the user level is high, the initial playing rate is faster; when the user level is low, the initial playing rate is slower. Table 1 below shows, by way of example only, the correspondence between user levels and playback rates expressed as multiples of the initial rate V.
TABLE 1
User level   Dance class   Sports class   Game class
1            0.5×V         X              0.7×V
2            0.8×V         X              0.8×V
3            1.0×V         X              0.9×V
4            1.2×V         X              1.0×V
In table 1, scenes or content within a video may be generally classified into a dance class, a sports class, and a game class. However, the above classifications are merely examples, and other classifications are possible.
Here, the dance class (e.g., jazz, Latin) may require play-rate adjustment based on the user's exercise state. For example, when the user level is determined to be 1, 2, 3, or 4, the playback rate of the video may correspond to 0.5×V, 0.8×V, 1.0×V, or 1.2×V, respectively. The sports class (e.g., yoga, fitness, etc.) is relatively slow and does not require rate adjustment, which is indicated by an X in Table 1. The game class (e.g., racing, action, etc.) needs its rate adjusted according to the user's level of play. For example, when the user level is determined to be 1, 2, 3, or 4, the playback rate of the video may correspond to 0.7×V, 0.8×V, 0.9×V, or 1.0×V, respectively.
Here, the data shown in table 1 is only one example for showing the correspondence existing between the user level and the playback rate, and the correspondence existing between the user level and the playback rate is not limited by the present invention.
In addition, a user video including the user's motion may be photographed while the video is played. The user rating may be determined according to a similarity between the user's motion and the motion of the person in the video. At this time, the video playing rate can be adjusted according to the judged user level while the video is played.
Fig. 6 shows a schematic view of a user motion scene according to an embodiment of the invention.
In the user motion scenario of fig. 6, the object of interest to the user in the image frames may be the motion of a person in the video. In this case, the motion of the person may be recognized from the image frames of the video through AI techniques, and the voice corresponding to the motion of the person may be determined, so that the voice corresponding to the motion of the person in the video can be played while the video is played to the user. Therefore, when exercising along with the video, the user can better learn the motion of the person in the video from the corresponding voice being played.
FIG. 7 shows a schematic diagram in a game class scenario, according to an embodiment of the invention.
In the game class (e.g., "Grid Racer Game") scene of fig. 7, the object of interest to the user in the image frames may be an indicator in the video. Here, gameplay videos published by other players (i.e., big data) are analyzed in advance by AI learning (for example, by a pre-trained image recognition model) to obtain an indicator-to-prompt-text library, and the prompt text is synthesized into speech. When an indicator is recognized from an image frame of the game video being played, the prompt voice corresponding to the indicator is output (for example, the voice "turn right immediately").
Further, the level of the user may be determined from the user data model. Here, the level of the user may indicate the user's skill at playing the game. If the determined level indicates that the user does not play well, the playback speed of the video may be reduced so that the user can keep up with it. If the determined level indicates that the user plays well, the playback speed of the video may be left unchanged.
Fig. 8 shows a schematic diagram under a video commentary scene according to an embodiment of the invention.
In the video commentary scene of fig. 8, when the user is watching a video of a diving competition, the object of interest to the user in the image frames may be the motion of a person in the video. In this case, the motion of the person (e.g., a forward one-and-a-half somersault) may be recognized from the image frames of the video through AI techniques, and the voice corresponding to the recognized motion may be determined, so that the voice corresponding to the motion of the person in the video can be played while the video is played to the user. Therefore, the user can better understand the motion of the person in the video from the corresponding voice being played while watching.
Fig. 9 illustrates a video playback apparatus according to an embodiment of the present invention.
Referring to fig. 9, the video playback apparatus 900 may include an image frame acquiring unit 910, an object of interest recognizing unit 920, a voice determining unit 930, and a playback unit 940. The video playback device 900 may be configured to perform any of the methods described with reference to fig. 1-8.
Here, the image frame acquisition unit 910 may be configured to acquire a plurality of image frames of a video. The object of interest recognition unit 920 may be configured to recognize an object of interest from a plurality of image frames. The speech determination unit 930 is configured to determine a speech corresponding to the object of interest. The playing unit 940 may be configured to play a voice corresponding to the object of interest while playing the video to the user.
The method of acquiring a plurality of image frames of a video performed by the image frame acquiring unit 910, the method of recognizing an object of interest from a plurality of image frames performed by the object of interest recognizing unit 920, the method of determining a voice corresponding to an object of interest performed by the voice determining unit 930, and the method of playing a voice corresponding to an object of interest while playing a video to a user performed by the playing unit 940 have been described above in connection with at least one of fig. 1 to 8. Therefore, for the sake of brevity and to avoid redundant description, detailed descriptions of the methods performed by the image frame acquiring unit 910, the object of interest recognizing unit 920, the voice determining unit 930, and the playing unit 940 will be omitted.
FIG. 10 shows a block diagram of a computing device, according to an embodiment of the invention.
Referring to fig. 10, a computing device 1000 according to an embodiment of the invention may include a processor 1010 and a memory 1020. Here, the memory 1020 stores a computer program, wherein the computer program realizes any of the methods described with reference to fig. 1 to 8 when executed by the processor 1010. For the sake of brevity, any of the methods described with reference to fig. 1-8 as performed by processor 1010 will not be described again here.
Further, the method according to the exemplary embodiment of the present invention may be implemented as a computer program in a computer-readable recording medium. The computer program may be implemented by a person skilled in the art from the description of the method described above. The computer program, when executed in a computer, implements any of the video playback methods of the present invention.
According to an exemplary embodiment of the invention, a computer-readable storage medium may be provided, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out any of the methods disclosed in the present application. For example, the computer program, when executed by a processor, causes the processor to perform the steps of: acquiring a plurality of image frames of a video; identifying an object of interest from a plurality of image frames; determining a voice corresponding to the object of interest; and playing the voice corresponding to the interested object while playing the video to the user.
Furthermore, it should be understood that the respective units in the device according to the exemplary embodiment of the present invention may be implemented as hardware components and/or software components. The individual units may be implemented, for example, using Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs), depending on the processing performed by the individual units as defined by the skilled person.
The video playing method can enable the user to listen to the voice corresponding to the interested object in the video while watching the picture of the video. Thus, the user can better understand and/or learn the content of the video.
In addition, the video playing method of the invention can synthesize the image frame of the video and the voice corresponding to the object of interest together to form a new video. Thus, the user can automatically obtain a new video with intelligent dubbing without other manual operations.
In addition, the video playing method can automatically adjust the playing speed of the video according to the action of the user when the video is played, so that the user can understand or keep up with the playing speed of the video without manually adjusting the playing speed of the video.
In addition, the video playing method can provide the function of eliminating one or more candidate objects which are not interesting and/or unnecessary for the user when the interesting object corresponds to the plurality of candidate objects, so that the user experience can be greatly improved. Further, since only one candidate object is retained as the object of interest, the voice corresponding to the object of interest can be accurately determined, thereby enabling the voice of the object of interest in which the user is interested to be accurately played.
While the present disclosure includes particular examples, it will be apparent to those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered merely as illustrative and not restrictive. The description of features or aspects in each example should be considered applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order and/or if components in the described systems, architectures, devices, or circuits are combined in a different manner and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description but by the claims and their equivalents, and all changes within the scope of the claims and their equivalents are to be construed as being included in the disclosure.

Claims (11)

1. A video playing method comprises the following steps:
acquiring a plurality of image frames of a video;
identifying an object of interest from the plurality of image frames;
determining a voice corresponding to the object of interest;
and playing the voice corresponding to the interested object while playing the video to the user.
2. The video playback method of claim 1, wherein identifying the object of interest from the plurality of image frames comprises:
inputting the plurality of image frames to an image recognition model, wherein the image recognition model is trained in advance to output text indicating correspondence to an object of interest in the input plurality of training image frames in response to the input plurality of training image frames;
based on the image recognition model, text corresponding to the object of interest is output.
3. The video playback method of claim 2, wherein the step of determining the speech corresponding to the object of interest comprises:
the speech corresponding to the output text is synthesized by artificial speech synthesis to obtain a speech corresponding to the object of interest.
4. The video playback method of claim 1, wherein the step of playing back the video to the user comprises:
determining an initial playing rate of the video according to a user data model, wherein the user data model is a model established according to at least one of speed, content, age and gender learned by a user;
and playing the video to the user according to the initial playing speed of the video.
5. The video playback method of claim 1, wherein the video playback method further comprises:
shooting a user video including a user's motion in real time;
the user's actions are identified from the user video,
wherein the step of playing the video to the user comprises:
determining similarity between the identified user's actions and actions in the playing video;
and controlling the playing speed of the video based on the similarity.
6. The video playback method of claim 5, wherein the step of controlling the playback rate of the video based on the similarity comprises:
determining the level of the user based on the similarity, wherein the higher the similarity is, the higher the level of the user is;
the playback rate of the video is controlled based on the user's level, wherein different levels correspond to different playback rates.
7. The video playback method of claim 1, wherein identifying the object of interest from the plurality of image frames comprises:
requesting a user to select a plurality of candidate objects in response to an object of interest corresponding to the plurality of candidate objects;
in response to a user selecting one of the plurality of candidate objects, an action of the one candidate object is identified as an object of interest.
8. The video playback method of claim 1, wherein the object of interest includes at least one of a motion of a person and an indicator.
9. A video playback device, the video playback device comprising:
an image frame acquisition unit configured to acquire a plurality of image frames of a video;
an object of interest recognition unit configured to recognize an object of interest from the plurality of image frames;
a voice determination unit configured to determine a voice corresponding to the object of interest;
a playing unit configured to play a voice corresponding to the object of interest while playing the video to the user.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video playback method according to any one of claims 1 to 8.
11. A computing device, comprising:
a processor;
memory storing a computer program which, when executed by a processor, implements a video playback method as claimed in any one of claims 1 to 8.
CN202110074458.1A 2021-01-20 2021-01-20 Video playing method and video playing device Pending CN112911384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074458.1A CN112911384A (en) 2021-01-20 2021-01-20 Video playing method and video playing device


Publications (1)

Publication Number Publication Date
CN112911384A true CN112911384A (en) 2021-06-04

Family

ID=76116502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074458.1A Pending CN112911384A (en) 2021-01-20 2021-01-20 Video playing method and video playing device

Country Status (1)

Country Link
CN (1) CN112911384A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104081760A (en) * 2012-12-25 2014-10-01 华为技术有限公司 Video play method, terminal and system
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN110519636A (en) * 2019-09-04 2019-11-29 腾讯科技(深圳)有限公司 Voice messaging playback method, device, computer equipment and storage medium
CN110996149A (en) * 2019-12-23 2020-04-10 联想(北京)有限公司 Information processing method, device and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210604