Background
Online shopping over the internet, and in particular the mobile internet, is now commonplace. Goods are generally advertised on dedicated shopping websites through a combination of pictures and text, and a user must search the website with keywords or the like to find goods of interest. If a user wants to purchase a particular item seen while watching video content such as a movie, a TV show, a live broadcast, or a feature program, the user must first determine the item's name, model, and so on by searching on his or her own, which costs additional time and may lead the user to abandon the purchase. At present, most videos can only direct users to a purchase link through simple pre-roll or overlaid inserted advertisements; such advertisements do not interact with the video content, so the purchase conversion rate is unsatisfactory and the user experience suffers.
In particular, many merchandising activities are now conducted through live broadcasting. In a live broadcast there is clearly not enough time to manually annotate the video content and accurately define an interaction area for each commodity. Moreover, a live streamer often broadcasts from a simple user terminal such as a mobile phone, on which complex video editing operations are impractical, and selecting interaction areas in the video with a traditional image recognition algorithm is infeasible within the available processing time and hardware resources. These constraints are a major obstacle to direct interaction between viewers and the video.
Disclosure of Invention
One objective of the present invention is to provide a video encoding and decoding method that can efficiently and quickly establish an interactive area for an article displayed in a video during live broadcasting, thereby giving the user a better experience when shopping online while watching the live broadcast.
An embodiment of the present invention relates to a video encoding and decoding method, including: defining a decision region in a first reference frame of a plurality of reference frames of a video, the decision region being defined by saving the respective coordinates of a first vertex and a second vertex thereof; extracting at least one portion of the decision region in the first reference frame, and determining the coordinates of the first and second vertices of the decision region in the reference frames other than the first reference frame by the motion vectors of the extracted portion in those reference frames; determining the coordinates of the first and second vertices of the decision region in each frame between two adjacent reference frames by linear interpolation from the decision regions of the two adjacent reference frames; displaying the decision region according to the coordinates of the first and second vertices when playing the video; and, in response to a user performing a selected action within the decision region, displaying a still picture matching the decision region and uniform resource locator information associated with the still picture. In some embodiments, feedback information is sent in response to the user performing the selected action outside the decision region, the feedback information including the position at which the selected action occurred, the playing time of the video at that moment, and the ratio of the number of times the user performed the selected action outside the decision region to the number of times inside it.
In some embodiments, the first vertex and the second vertex are two diagonally opposite vertices of a rectangle.
In some embodiments, the still picture matching the decision region is determined from its correlation with a feature vector of a pixel matrix formed from at least a portion of the decision region.
In some embodiments, the still picture matching the decision region is predefined in the first reference frame when the decision region is defined.
In some embodiments, when more than one still picture matches the decision region, the still pictures are displayed in the form of a list.
In some embodiments, displaying the still picture that matches the decision region includes displaying within the decision region.
In some embodiments, displaying the still picture that matches the decision region includes displaying the still picture within the decision region in the reference frame.
In some embodiments, the step of defining the first and second vertex coordinates in the first reference frame and the step of determining the first and second vertex coordinates in reference frames other than the first reference frame are performed by a first user terminal, and the step of determining the first and second vertex coordinates in frames between two adjacent reference frames is performed by a second user terminal in remote communication with the first user terminal.
In some embodiments, the step of defining the decision region in the first reference frame is performed by a first user terminal, and the steps of determining the first and second vertex coordinates in reference frames other than the first reference frame and determining the first and second vertex coordinates in a frame between two adjacent reference frames are performed by a second user terminal in remote communication with the first user terminal.
In some embodiments, the first reference frame is an intra-coded frame.
In some embodiments, the intervals between two adjacent reference frames in the plurality of reference frames are equal.
Embodiments of the present disclosure provide an easy-to-operate method for setting an interactive area in a video, reducing computing-resource consumption and human intervention by defining the decision region manually in only some of the reference frames and tracking it automatically, in different ways, in the remaining reference frames and in the frames between reference frames. By interacting with the decision region, the user can shop more conveniently while watching the video.
Detailed Description
Those skilled in the art will appreciate that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
A schematic diagram of decision region prediction according to an embodiment of the present disclosure is shown in Fig. 1. For convenient transmission over a network, video compression formats such as H.264 are generally used to raise the compression ratio as much as possible. In such formats, image frames are divided into types such as intra-coded frames, predictive-coded frames, and bidirectionally predictive-coded frames; frames other than intra-coded frames achieve a higher compression ratio through prediction offsets such as macroblock motion vectors. Because the scene of a live broadcast usually changes little, this compression approach is well suited; however, since not every frame contains a complete picture, searching every frame for the portion related to the commodity would require considerable computing resources to re-analyze the decoded images. As shown, the video includes, for example, three image frames 100, 110, and 120, each containing a garment worn by the presenter. A decision region is created in the played video over the part of the image corresponding to the garment, and user operations such as mouse clicks or finger touches performed outside or inside the decision region are monitored. When the user operates inside the decision region, a corresponding introduction picture, text, purchase link, and the like can be popped up, allowing the user to interact with the garment while watching the video.
Fig. 2 is a flowchart of a video encoding and decoding method according to an embodiment of the present disclosure. In step S201, a decision region is first defined in a first reference frame of a plurality of reference frames of the video, the decision region being defined by saving the respective coordinates of its first and second vertices. The plurality of reference frames may be chosen as the intra-coded frames of the video; when the number of intra-coded frames is small, a series of reference frames may instead be selected at equal intervals. For example, the first reference frame may be an intra-coded frame, with subsequent reference frames selected at equal intervals of time or frame count. In the embodiment of Fig. 1, frames 100 and 110 are reference frames, and the rectangular box drawn with a dashed line is the decision region containing the item. Each decision region can be defined by just two diagonally opposite vertices; for example, the decision region in frame 100 can be defined by a first vertex 101 and a second vertex 102, so that with only the coordinates of these two vertices saved and transmitted over the network, the region can be reconstructed at the same location during playback. The coordinates of the first vertex 101 and the second vertex 102 need only be defined once, automatically or manually, in frame 100, i.e., the first reference frame. For example, frame 100 may be extracted and presented to a video producer or an administrator, who defines the position of the decision region containing the item directly in frame 100 by a selection operation with a mouse or a finger.
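The two-vertex representation of step S201 can be sketched as follows. This is a minimal illustration, not part of the disclosed method; the class name, field names, and example coordinates are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DecisionRegion:
    """Axis-aligned rectangle defined by two diagonally opposite vertices."""
    x0: float  # first vertex
    y0: float
    x1: float  # second vertex (diagonally opposite)
    y1: float

    def contains(self, x: float, y: float) -> bool:
        """True if a click or touch at (x, y) falls inside the region."""
        return (min(self.x0, self.x1) <= x <= max(self.x0, self.x1)
                and min(self.y0, self.y1) <= y <= max(self.y0, self.y1))

# Hypothetical region defined once in the first reference frame;
# only these four coordinates need to be saved and transmitted.
region = DecisionRegion(x0=120, y0=80, x1=360, y1=400)
print(region.contains(200, 150))  # inside  -> True
print(region.contains(50, 50))    # outside -> False
```

Storing only two vertices keeps the per-region payload constant regardless of how many frames the region spans, which is what makes the later interpolation step cheap.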
In step S202, at least one portion of the decision region in the first reference frame 100 is extracted. The portion is, for example, a 16×16, 8×8, or 4×4 pixel matrix together with the luminance and chrominance information it contains. The coordinates of the first and second vertices of the decision regions in reference frames other than the first reference frame 100 are then determined from the motion vectors of the extracted portion in those frames. For each other reference frame, the regions whose color range is similar to the extracted pixel matrix may first be searched; the region closest to the extracted matrix may then be selected by comparing pixel-matrix feature vector values or the like; and the decision region in that frame may finally be obtained from the relative position of the pixel matrix within the decision region of frame 100 and the position of the selected closest region. Screening can be performed with a correlation coefficient, a feature-vector distance, or another measure, or with a different image recognition method such as deep learning. Because the reference frames serve, in a later step, as the basis for interpolating the position of the decision region in the non-reference frames, the accuracy of the decision region's position should be made as high as possible while keeping the user's operation simple.
In step S203, the coordinates of the first and second vertices of the decision region in each frame between two adjacent reference frames are determined by linear interpolation from the decision regions of those two reference frames. Consider a frame other than a reference frame, for example frame 120 in Fig. 1, whose two adjacent reference frames are frames 100 and 110. Let the coordinates of the first vertex 101 and the second vertex 102 of the decision region in frame 100 be (x0, y0) and (x1, y1), and let the coordinates of the first vertex 111 and the second vertex 112 of the decision region in frame 110 be (X0, Y0) and (X1, Y1). The vertex coordinates in frame 120 can then be estimated by linear interpolation: with frame 120 at position t between the reference frames at positions t0 and t1, the interpolation weight is a = (t − t0)/(t1 − t0), the first vertex is estimated as ((1 − a)·x0 + a·X0, (1 − a)·y0 + a·Y0), and the second vertex is computed likewise. When frame 120 lies midway between the two reference frames, the interpolated coordinates of vertices 121 and 122 are [(x0 + X0)/2, (y0 + Y0)/2] and [(x1 + X1)/2, (y1 + Y1)/2], respectively. Interpolation is among the fastest ways to obtain the vertex coordinates of the decision region in the non-reference frames, which may account for most of the frames; compared with the recognition method of step S202 it saves computing resources, which matters greatly for power consumption on mobile platforms. Combining the two decision-region estimation methods of steps S202 and S203 for the different frame types is particularly helpful for striking a good balance between accuracy and efficiency on mobile platforms.
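The interpolation of step S203 can be sketched directly. The function and the frame indices below are illustrative only; they assume the weight formula a = (t − t0)/(t1 − t0) described above.

```python
def interpolate_vertex(t, t0, t1, v0, v1):
    """Linearly interpolate a decision-region vertex for frame t lying
    between adjacent reference frames t0 and t1, whose corresponding
    vertex coordinates are v0 = (x, y) and v1 = (X, Y)."""
    a = (t - t0) / (t1 - t0)  # interpolation weight in [0, 1]
    return ((1 - a) * v0[0] + a * v1[0],
            (1 - a) * v0[1] + a * v1[1])

# A frame midway between two reference frames (hypothetical indices 0 and 10)
# gets the midpoint of the two reference-frame vertices:
print(interpolate_vertex(5, 0, 10, (40.0, 60.0), (80.0, 100.0)))  # -> (60.0, 80.0)
```

Each interpolated vertex costs a handful of multiplications, versus a full block search per frame, which is the resource saving the paragraph above relies on.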
In step S204, the decision region is displayed in each frame according to the coordinates of its first and second vertices, and, in response to the user performing a selected action within the decision region, a still picture matching the decision region and uniform resource locator information associated with that picture are displayed. A matching still picture can be found by the color-range screening and feature-vector comparison described above: the match is determined from the correlation between candidate pictures and the feature vector of a pixel matrix formed from part of the decision region, preferably the part at the location of the user's selected action. The correlation measure is, for example, a correlation coefficient or a vector distance; machine learning approaches in common use today, such as convolutional neural networks, may also be applied. Alternatively, the still picture matching the decision region may be defined manually, once and in advance, by the video producer or an operator when the decision region is defined in the first reference frame. Each still picture can correspond to a commodity, with its Uniform Resource Locator (URL) information and auxiliary information such as price, type, discount, and delivery details stored on the user terminal or on a remote server. This information is preferably stored on a remote server, where the goods provider can modify it before it is sent to the user terminal over the network. After the most similar still picture and its URL information are found, the information may be displayed within the clicked decision region. The URL may also be associated with the still picture as a hyperlink, so that clicking the displayed still picture follows the URL.
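The correlation-based matching of step S204 can be sketched with cosine similarity over feature vectors. The catalog structure, URLs, and three-dimensional feature vectors below are entirely hypothetical stand-ins for whatever features the screening stage produces.

```python
import numpy as np

def match_still_picture(patch_vec: np.ndarray, catalog: dict) -> str:
    """Return the catalog key (here, a hypothetical URL) whose feature
    vector is most correlated with the clicked patch's feature vector,
    using cosine similarity as the correlation measure."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(catalog, key=lambda k: cos(patch_vec, catalog[k]))

# Hypothetical commodity catalog: URL -> feature vector of the still picture.
catalog = {
    "https://shop.example.com/item/jacket": np.array([0.9, 0.1, 0.0]),
    "https://shop.example.com/item/shoes":  np.array([0.1, 0.8, 0.3]),
}
clicked = np.array([0.85, 0.15, 0.05])  # feature vector at the click location
print(match_still_picture(clicked, catalog))
# -> https://shop.example.com/item/jacket
```

A vector-distance measure (e.g. Euclidean) could be substituted for cosine similarity by replacing `max` with `min` over the distance, matching the alternatives the paragraph lists.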
For efficiency, the still picture may be displayed within the decision region of the reference frame, with the display position fixed relative to the video picture so that user operation is not hindered when the decision region moves too quickly. If more than one still picture matches the decision region, the pictures may be displayed as a list from which the user makes a further selection.
If the user performs the selected action outside the decision region, feedback information is sent from the user terminal. The feedback information first includes the position at which the selected action occurred and the playing time of the video at that moment, so that another user or an administrator of the remote server can use them to extract the image of the area the user clicked and provide it to the goods provider, revealing which items users are interested in. If the commodities attached to the defined interaction areas do not include content the user cares about, commodities of interest can be added according to this feedback, which is very useful for the goods provider. The feedback information can also include the user's name, contact details such as a mobile phone number, and operation history, enabling personalized interactive content and targeted recommendation of goods to each user. The feedback information may further include the ratio, up to the current moment, of the number of selected actions the user performed outside the decision region to the number performed inside it; a ratio significantly greater than 1 may indicate that the user is not interested in the current push, and the pushed goods should be adjusted accordingly.
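The feedback record and the outside/inside ratio can be sketched as follows; the class, field layout, and example numbers are hypothetical, illustrating only the quantities the paragraph names (click position, playing time, and the ratio).

```python
from dataclasses import dataclass, field

@dataclass
class ClickFeedback:
    """Feedback accumulated on the viewer's terminal.
    Each event records (x, y, play_time_s, inside_decision_region)."""
    events: list = field(default_factory=list)

    def record(self, x: float, y: float, play_time_s: float, inside: bool):
        self.events.append((x, y, play_time_s, inside))

    def outside_inside_ratio(self) -> float:
        """Ratio of clicks outside the decision region to clicks inside it;
        a value well above 1 suggests the pushed goods miss the user's interest."""
        inside = sum(1 for e in self.events if e[3])
        outside = len(self.events) - inside
        return outside / inside if inside else float("inf")

fb = ClickFeedback()
fb.record(50, 50, 12.3, inside=False)   # click on an un-annotated item
fb.record(55, 48, 13.1, inside=False)
fb.record(200, 150, 20.0, inside=True)  # click inside the decision region
print(fb.outside_inside_ratio())  # -> 2.0
```

In practice such a record would be serialized and sent to the remote server, where position plus playing time suffice to re-extract the clicked image area from the stored video.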
The step of defining the first and second vertex coordinates in the first reference frame and the step of determining those coordinates in the other reference frames may be performed by a first user terminal held by the video producer, an operator, or the like, while the step of determining the vertex coordinates in the frames between two adjacent reference frames may be performed, during playback, by a second user terminal held by a video viewer in remote communication with the first user terminal. For efficiency it is preferable that only the step of defining the decision region in the first reference frame be performed by the first user terminal, with the determination of the vertex coordinates both in the other reference frames and in the frames between adjacent reference frames performed during playback by the second user terminal. This arrangement also avoids inaccuracies in the decision region caused by differences in the hardware of the user terminals during playback.
It will be appreciated by those skilled in the art from a consideration of the drawings and specification that various other devices and/or methods which incorporate embodiments in accordance with the concepts and principles of the invention are included within the scope of the present disclosure and are not limited to those explicitly described.