CN115134633B

CN115134633B - Remote video method and related device

Info

Publication number: CN115134633B
Application number: CN202110327092.4A
Authority: CN
Inventors: 刘尚; 胡翔宇; 徐卫国; 许旺灿; 杨小海
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2024-04-26
Anticipated expiration: 2041-03-26
Also published as: WO2022199594A1; CN115134633A

Abstract

An embodiment of the application discloses a remote video method applied to a terminal. The method comprises the following steps: a first terminal obtains a plurality of image frames including a first image frame and a second image frame, where the zoom magnification of the first image frame is larger than that of the second image frame, and the first image frame is determined according to region of interest information; the first terminal transmits the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, and the indication information comprises the region of interest information of the first image frame. By adding an image frame with a lower zoom magnification while sending the image frames related to the region of interest, the receiving end can promptly crop and display the picture corresponding to the changed region of interest from the image frame with the lower zoom magnification, so that the response time when adjusting the region of interest of the video is shortened and stuttering of the video picture is avoided.

Description

Remote video method and related device

Technical Field

The present application relates to the field of communications technologies, and in particular, to a remote video method and related devices.

Background

With the development of communication technology, remote video is widely applied in scenarios such as video calls, telemedicine and remote education. Subject to power consumption and latency constraints, the picture resolution of remote video typically does not exceed 1080P. However, in many situations, if the video is captured at 1080P, the sharpness of the captured video may not meet actual requirements. For example, in a remote homework tutoring scenario, small and medium-sized fonts may remain unrecognizable even if the video picture is enlarged.

Based on this, in the related art, the transmitting end captures video at a higher resolution, and the captured video is downscaled to 1080P before being transmitted to the receiving end. When the receiving end needs to enlarge a certain area (generally referred to as a region of interest), the receiving end feeds back the area to be enlarged to the transmitting end. After obtaining that area, the transmitting end crops it from the captured video picture, converts the cropped picture to 1080P and transmits it to the receiving end. Because the cropped picture was captured at high resolution, its definition can meet the requirement.

However, when the receiving end remotely controls the transmitting end, the response time is long, which easily causes noticeable stuttering of the video picture. For example, after performing a picture-enlarging operation, the receiving end has to wait a long time (typically more than 300 ms) before it receives the data transmitted by the transmitting end and can display the enlarged video picture.

Disclosure of Invention

The embodiment of the application provides a remote video method in which, while sending the image frames related to a region of interest to a receiving end, a sending end additionally sends an image frame with a lower zoom magnification, namely an image frame whose picture contains more content. Therefore, when the region of interest at the receiving end changes, the receiving end can promptly crop and display the picture corresponding to the changed region of interest from the image frame with the lower zoom magnification. That is, the receiving end can display the adjusted video picture without waiting for a long time, which shortens the response time when the region of interest of the video is adjusted and avoids stuttering of the video picture.

The first aspect of the present application provides a remote video method, which is applied to a first terminal serving as a video acquisition end and a video sending end in a remote video process. The method comprises the following steps: the first terminal acquires a plurality of image frames including a first image frame and a second image frame. The zoom magnification of the first image frame is larger than that of the second image frame, wherein the zoom magnification refers to the magnification of an image output by an image sensor of the camera. Under the condition that the camera does not move, the larger the zoom multiplying power of the camera is, the larger the shot object in the image output by the camera is, and the smaller the shot range in the image is; the smaller the zoom magnification of the camera, the smaller the subject in the image output by the camera, and the larger the range captured in the image.

The first image frame is determined from region of interest information, which indicates the location of a region of interest. The region of interest information may be stored in the first terminal, and may be fed back to the first terminal by the second terminal. The region of interest refers to a region obtained when a user zooms in, zooms out or pans the video picture displayed on the screen of the terminal through an interaction such as touching the screen with a finger. In brief, the region of interest is the region to be displayed on the screen of the terminal during the remote video process. Based on the region of interest information in the first terminal, the first terminal may determine the location of the region of interest, thereby acquiring a first image frame related to the region of interest. The picture content in the first image frame is the content of the region of interest.

The first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, the image frame to be displayed is used for generating a video, and the indication information comprises the region-of-interest information of the first image frame. The indication information may include coordinate information for indicating the location of the region of interest. For example, in the case where the region of interest is a rectangular region, the indication information may include the coordinate information of the four vertices of the rectangular region. For another example, in the case where the region of interest is a rectangular region, the indication information may include the coordinate information of one vertex of the rectangular region (e.g., the upper left corner vertex) together with the width and the height of the rectangular region. In this way, the coordinate information of the four vertices of the rectangular region can be calculated from the coordinate information of the one vertex and the width and height of the rectangular region.
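The second representation described above (one vertex plus width and height) can be sketched as follows. This is a minimal illustration only: the class name `RoiInfo`, its field names, the pixel units and the vertex ordering are assumptions and are not specified by the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoiInfo:
    """Hypothetical region of interest: upper-left vertex plus width and height."""
    x: int
    y: int
    width: int
    height: int

    def vertices(self):
        """Return the four vertices: upper-left, upper-right, lower-right, lower-left."""
        left, top = self.x, self.y
        right, bottom = self.x + self.width, self.y + self.height
        return [(left, top), (right, top), (right, bottom), (left, bottom)]


roi = RoiInfo(x=100, y=50, width=640, height=360)
corners = roi.vertices()
print(corners)
```

Carrying one vertex and the size needs four numbers instead of eight, and the receiver can still recover all four vertices.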

In this scheme, while transmitting the image frames related to the region of interest to the second terminal serving as the receiving end, the first terminal serving as the transmitting end additionally transmits an image frame with a lower zoom magnification, namely an image frame whose picture contains more content. Therefore, when the region of interest in the second terminal changes, the second terminal can promptly crop and display the picture corresponding to the changed region of interest from the image frame with the lower zoom magnification. That is, the second terminal can display the adjusted video picture without waiting for a long time, which shortens the response time when the region of interest of the video is adjusted and avoids stuttering of the video picture.

Optionally, in one possible implementation manner, the first image frame includes only the region of interest indicated in the region of interest information, and the second image frame includes other regions in addition to the region of interest indicated in the region of interest information. That is, the picture in the first image frame includes only the picture content at the position of the region of interest, and the picture in the second image frame includes other picture content in addition to the picture content at the position of the region of interest.

The second image frame may be an image frame acquired by the first terminal at a preset zoom magnification, and the first image frame may be an image frame acquired by the first terminal at a zoom magnification adjusted according to the region of interest information. In the case that the first terminal does not move, the second image frame may be considered as a global image acquired by the first terminal, that is, the second image frame includes all areas in the field of view of the camera of the first terminal; the first image frame may be considered as a partial image acquired by the first terminal, that is, the first image frame includes only a partial area within the field of view of the camera of the first terminal.

Optionally, in a possible implementation manner, the acquiring, by the first terminal, the plurality of image frames includes: the first terminal sequentially collects a third image frame and the second image frame, and the zoom magnification of the third image frame is the same as that of the second image frame. That is, the first terminal continuously acquires a plurality of image frames including the third image frame and the second image frame described above at a specific zoom magnification. And then, the first terminal cuts the third image frame according to the region of interest information to obtain the first image frame. Specifically, after the first terminal obtains a third image frame, the first terminal cuts the third image frame according to the region of interest information in the first terminal, that is, cuts the region of interest in the third image frame based on the position indicated by the region of interest information, so as to obtain the first image frame. The first terminal acquires the first image frame in a digital zoom mode.
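The cropping step of this digital zoom mode can be sketched as follows, assuming a frame is modeled as a list of pixel rows and the region of interest is given as an upper-left vertex plus width and height in the pixel coordinates of the third image frame. The function name `crop_frame` is hypothetical, and rescaling the cropped picture to the transmission resolution is omitted.

```python
def crop_frame(frame, x, y, width, height):
    """Crop a frame (list of pixel rows) to the rectangle at (x, y) with the given size."""
    frame_h = len(frame)
    frame_w = len(frame[0]) if frame_h else 0
    if x < 0 or y < 0 or x + width > frame_w or y + height > frame_h:
        raise ValueError("region of interest lies outside the frame")
    return [row[x:x + width] for row in frame[y:y + height]]


# A 4x6 stand-in for the third image frame; each pixel value encodes its row and column.
third_frame = [[10 * r + c for c in range(6)] for r in range(4)]

# The first image frame is the region of interest cut out of the third image frame.
first_frame = crop_frame(third_frame, x=2, y=1, width=3, height=2)
print(first_frame)
```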

Optionally, in a possible implementation manner, the acquiring, by the first terminal, the plurality of image frames includes: the first terminal acquires one or more third image frames, and the zoom magnification of the one or more third image frames is the same as that of the second image frame; the first terminal cuts the one or more third image frames according to the region of interest information to obtain one or more first image frames; the first terminal determines one of the one or more third image frames as the second image frame. That is, the first terminal may continuously acquire first image frames (i.e., partial image frames) by means of digital zooming based on the region of interest information, and insert second image frames (i.e., global image frames) among the acquired plurality of first image frames.

That is, the first terminal inserts one global image frame with a low zoom magnification each time after acquiring a specific number of image frames based on the region of interest information. Thus, the number of image frames transmitted by the first terminal to the second terminal is greater than the number of image frames actually acquired by the first terminal. By inserting the global image frames into a plurality of continuous local image frames, the frequency of displaying the local image frames by the second terminal can be ensured, and the fluency of the video picture can be improved.
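The interleaving described above can be sketched as follows. The sketch assumes that every captured third image frame yields one local frame and that, after every `local_per_global` local frames, the most recently captured frame is also sent uncropped as the global frame. Which captured frame is reused as the global frame is an assumption, as are the names `build_send_sequence` and `local_per_global`.

```python
def build_send_sequence(captured_frames, crop, local_per_global):
    """Return (frame_type, frame) tuples in sending order.

    Each captured frame is cropped into a local frame; after every
    `local_per_global` local frames the latest captured frame is also
    sent uncropped as a global frame.
    """
    if local_per_global < 1:
        raise ValueError("local_per_global must be at least 1")
    sequence = []
    for index, frame in enumerate(captured_frames, start=1):
        sequence.append(("local", crop(frame)))
        if index % local_per_global == 0:
            sequence.append(("global", frame))
    return sequence


captured = ["f1", "f2", "f3", "f4"]
sequence = build_send_sequence(captured, crop=lambda f: f + "-roi", local_per_global=2)
print(sequence)
```

With four captured frames and one global frame per two local frames, six frames are sent, which is consistent with the statement that more frames are transmitted than are actually acquired.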

Optionally, in a possible implementation manner, the acquiring, by the first terminal, the plurality of image frames includes: the first terminal acquires the first image frame and the second image frame in an optical zooming mode. Optical zooming refers to zooming by means of the optical lens structure of the first terminal, that is, the first terminal enlarges and reduces the scene to be photographed through lens movement. Although both digital zooming and optical zooming help magnify a distant object during telephoto shooting, only optical zooming adds more pixels to the image of the subject, so that the subject is not only enlarged but also relatively clear. That is, when an object in an image is enlarged by means of optical zooming, the object in the image is relatively clearer.

Since optical zooming enlarges or reduces the image frame captured by the lens by changing the focal length, with the center of the field of view of the lens as the center point, acquiring the first image frame by way of optical zooming actually amounts to calculating the longest focal length at which the field of view still covers the region of interest, and then acquiring the first image frame at that longest focal length.
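One way to picture this calculation is the sketch below. It assumes a simplified model that the application does not spell out: zoom magnification is proportional to focal length, the field of view at magnification z is a centered rectangle of size (width / z) by (height / z) in 1x coordinates, and the region of interest is given in those same coordinates. The function name and the `max_zoom` limit are hypothetical.

```python
def max_centered_zoom(full_w, full_h, roi, max_zoom=10.0):
    """Largest zoom in [1, max_zoom] whose centered field of view still covers roi.

    roi is (x, y, width, height) in the coordinates of the 1x field of view.
    """
    x, y, w, h = roi
    cx, cy = full_w / 2, full_h / 2
    # Farthest horizontal and vertical extent of the ROI from the view center.
    half_w = max(abs(x - cx), abs(x + w - cx))
    half_h = max(abs(y - cy), abs(y + h - cy))
    zoom = max_zoom
    if half_w > 0:
        zoom = min(zoom, cx / half_w)
    if half_h > 0:
        zoom = min(zoom, cy / half_h)
    return max(1.0, zoom)


centered = max_centered_zoom(3840, 2160, (960, 540, 1920, 1080))
off_center = max_centered_zoom(3840, 2160, (0, 0, 1920, 1080))
print(centered, off_center)
```

In this model a centered region of interest permits a higher magnification than an equally sized region in a corner, because the optical field of view always shrinks toward the center.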

Optionally, in a possible implementation manner, a fourth image frame may be further included in the plurality of image frames acquired by the first terminal, where the zoom magnification of the fourth image frame is the same as the zoom magnification of the second image frame. The process of acquiring a plurality of image frames by the first terminal specifically includes: the first terminal sequentially collects the fourth image frame, one or more first image frames and the second image frame according to a preset rule; the preset rule is that after the first terminal collects a preset number of image frames based on the region of interest information, a target zoom magnification is adopted to collect one image frame, the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

Therefore, after the first terminal collects a specific number of image frames based on the region of interest information, a global image frame is collected by adopting a lower zoom magnification, so that the frequency of the second terminal receiving the global image frame is ensured, and the phenomenon of overlarge picture fluctuation can not occur when the second terminal subsequently cuts the region of interest based on the global image frame.
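The preset rule can be pictured as a repeating capture plan. The generator below is only a sketch: the labels "global" and "local" and the name `capture_plan` are assumptions, and starting the plan with a global frame mirrors the order "fourth image frame, first image frames, second image frame" given above.

```python
from itertools import islice


def capture_plan(preset_number):
    """Yield frame types forever: one global frame, then `preset_number` local frames."""
    if preset_number < 1:
        raise ValueError("preset_number must be at least 1")
    while True:
        yield "global"  # captured at the target (lower) zoom magnification
        for _ in range(preset_number):
            yield "local"  # captured at the zoom derived from the region of interest


plan = list(islice(capture_plan(3), 9))
print(plan)
```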

Optionally, in a possible implementation manner, in the process of encoding the plurality of image frames by the first terminal, the first terminal may select a corresponding reference frame according to a frame type of the image frame.

Generally, in the video encoding process, a transmitting end selects the previous image frame of a current image frame as a reference frame, calculates difference information between the current image frame and the reference frame, and transmits that difference information to a receiving end instead of directly transmitting the current image frame, thereby reducing the amount of transmitted data. However, in this scheme, the first image frame and the second image frame are acquired based on different zoom magnifications, so the difference between a first image frame and a second image frame is relatively large compared with the difference between two adjacent first image frames. When the second image frame is encoded, if the previous image frame of the second image frame (i.e., a first image frame) is taken as the reference frame, the data amount of the obtained difference information is larger and the data transmission amount increases. Based on this, the first terminal may select a corresponding reference frame according to the frame type of the image frame, to ensure that the image frame currently to be encoded and its reference frame have the same frame type.

Specifically, the first terminal transmits the plurality of image frames to a second terminal, including: the first terminal acquires reference frames of the plurality of image frames; the first terminal encodes the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result; the first terminal sends a coding result to the second terminal; the reference frame of the first image frame is obtained according to the region of interest information, and the zoom magnification of the reference frame of the second image frame is the same as that of the second image frame.

Briefly, in the process of acquiring an image frame by the first terminal, the frame type of a local image frame (i.e., the first image frame) acquired by the first terminal based on the region of interest information is a first type, and the frame type of a global image frame (i.e., the second image frame) acquired based on a specific zoom magnification is a second type. For any one of the image frames of the first type, the first terminal may determine that the reference frame of the image frame is the image frame of the first type preceding the image frame, i.e., the image frame of the first type temporally closest to the image frame. Similarly, for any one of the image frames of the second type, the first terminal may determine that the reference frame of the image frame is the image frame of the second type preceding the image frame, i.e., the image frame of the second type temporally closest to the image frame. That is, for any one image frame, the reference frame of that image frame is the one of the same type of image frame that is temporally closest to that image frame.
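The same-type reference rule can be sketched as follows. Frames are represented only by their type labels, and the function name is hypothetical. A result of `None` means no earlier frame of the same type exists; how such a frame is encoded (for example, without a reference) is not covered by this sketch.

```python
def assign_reference_frames(frame_types):
    """For each frame, return the index of the nearest earlier frame of the same type."""
    last_seen = {}  # frame type -> index of the most recent frame of that type
    references = []
    for index, frame_type in enumerate(frame_types):
        references.append(last_seen.get(frame_type))
        last_seen[frame_type] = index
    return references


types = ["global", "local", "local", "global", "local"]
refs = assign_reference_frames(types)
print(refs)
```

Here the second global frame (index 3) references the first global frame (index 0) rather than the local frame immediately before it.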

Optionally, in one possible implementation manner, the first terminal encodes the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result, including: the first terminal encodes the first image frame through a first encoder according to the reference frame of the first image frame to obtain a first encoding result; the first terminal encodes the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result; wherein the encoding result includes the first encoding result and the second encoding result.

Wherein, when each first image frame is encoded by the first encoder, the previous image frame of each first image frame is used as a reference frame for encoding. When the second encoder encodes the second image frame, the reference frame of the second image frame is one image frame before the second image frame in the second encoder, and the one image frame before the second image frame is the same as the second image frame in type.

In the scheme, two paths of encoders are adopted to encode the image frames with different frame types, so that smooth encoding of the image frames can be ensured.
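The effect of using two encoder paths can be illustrated with a toy difference encoder. This is not a real video codec: `DeltaEncoder` is a hypothetical stand-in that keeps its own previous frame, so giving each frame type its own instance makes every frame reference the previous frame of the same type.

```python
class DeltaEncoder:
    """Toy encoder: first frame is emitted whole, later frames as differences."""

    def __init__(self):
        self._previous = None  # reference frame held by this encoder only

    def encode(self, frame):
        if self._previous is None:
            result = ("key", list(frame))
        else:
            result = ("delta", [cur - ref for cur, ref in zip(frame, self._previous)])
        self._previous = list(frame)
        return result


# One encoder per frame type (the two "paths").
encoders = {"local": DeltaEncoder(), "global": DeltaEncoder()}

stream = [
    ("global", [10, 10, 10]),
    ("local", [50, 52, 54]),
    ("local", [51, 52, 55]),
    ("global", [10, 11, 10]),
]
encoded = [(frame_type, encoders[frame_type].encode(frame)) for frame_type, frame in stream]
print(encoded)
```

The last global frame is encoded as the small difference `[0, 1, 0]` against the earlier global frame; had it referenced the preceding local frame, the differences would have been around -40.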

Optionally, in one possible implementation manner, in order to facilitate the second terminal to determine a first image frame and a second image frame of the plurality of image frames, the indication information further includes a frame type of the plurality of image frames, and the frame type of the first image frame is different from the frame type of the second image frame. For example, the indication information indicates that the frame type of the first image frame is a first type and indicates that the frame type of the second image frame is a second type.

Illustratively, when a certain bit for indicating the frame type in the indication information is set to 1, the frame type of the current image frame is indicated to be a second type, i.e., the second image frame described above; when a certain bit in the indication information for indicating the frame type is not set (i.e., the bit has a value of 0), the frame type of the current image frame is indicated as the first type, i.e., the first image frame described above.
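A possible encoding of such a bit is sketched below. The byte layout is purely hypothetical (one flags byte followed by four unsigned 16-bit fields for the region of interest); the application only states that a bit value of 1 indicates the second type and 0 the first type.

```python
import struct

FRAME_TYPE_BIT = 0x01              # hypothetical position of the frame-type bit
HEADER = struct.Struct("!B4H")     # flags byte, then x, y, width, height (big-endian)


def pack_indication(is_second_type, roi):
    """Pack the frame-type bit and the ROI rectangle into bytes."""
    flags = FRAME_TYPE_BIT if is_second_type else 0
    return HEADER.pack(flags, *roi)


def unpack_indication(data):
    """Return (frame_type, roi) where frame_type is 'first' or 'second'."""
    flags, x, y, width, height = HEADER.unpack(data)
    frame_type = "second" if flags & FRAME_TYPE_BIT else "first"
    return frame_type, (x, y, width, height)


payload = pack_indication(True, (100, 50, 640, 360))
decoded = unpack_indication(payload)
print(len(payload), decoded)
```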

A second aspect of the present application provides a remote video method applied to a first terminal serving as a video receiving end. The method comprises the following steps: the first terminal receives a plurality of image frames and indication information sent by a second terminal, wherein the plurality of image frames comprise a first image frame and a second image frame, the zoom magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to region-of-interest information, the region-of-interest information is used for indicating the position of a region of interest, and the indication information comprises the region-of-interest information of the first image frame; if the region of interest information of the first image frame is different from the region of interest information in the first terminal, a third image frame is obtained by cropping the second image frame according to the region of interest information in the first terminal, and the third image frame is displayed; and if the region of interest information of the first image frame is the same as the region of interest information in the first terminal, the first image frame is displayed.

Specifically, for the second terminal as the receiving end, the second terminal can receive the interaction instruction from the user, and update the region of interest information according to the interaction instruction. However, the second terminal needs to feed back the updated region of interest information to the first terminal, and the first terminal acquires a new image frame based on the updated region of interest information and then sends the new image frame to the second terminal. Therefore, there is a certain hysteresis in the image frame matched with the updated region of interest information, that is, the region of interest information corresponding to the image frame received by the second terminal is different from the updated region of interest information within a period of time after the second terminal updates the local region of interest information.

Therefore, before displaying the first image frame, the second terminal determines whether the region of interest information of the first image frame is identical to the region of interest information stored in the second terminal. If the region of interest information of the first image frame is different from the region of interest information stored in the second terminal, the second terminal crops a third image frame out of the second image frame according to the region of interest information in the second terminal, and displays the third image frame. In short, since the zoom magnification of the second image frame is smaller than that of the first image frame, the second image frame is actually a global image frame. When the region of interest changes, the position of the new region of interest can be determined in the second image frame, and the picture at that position is cropped out to obtain the third image frame. The content in the third image frame is the content corresponding to the position indicated by the region of interest information of the second terminal.
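The receiving-end decision can be sketched as follows. Frames are modeled as lists of pixel rows and the region of interest as an (x, y, width, height) tuple in the pixel coordinates of the second (global) image frame; both modeling choices and the function name are assumptions. Scaling the cropped picture to the display size is omitted.

```python
def select_display_frame(first_frame, first_frame_roi, second_frame, local_roi):
    """Return the frame to display at the receiving end."""
    if first_frame_roi == local_roi:
        # The received local frame already matches the current region of interest.
        return first_frame
    # The region of interest changed locally: crop it from the global frame instead.
    x, y, w, h = local_roi
    return [row[x:x + w] for row in second_frame[y:y + h]]


second_frame = [[10 * r + c for c in range(6)] for r in range(4)]
sent_roi = (2, 1, 3, 2)
first_frame = [row[2:5] for row in second_frame[1:3]]

unchanged = select_display_frame(first_frame, sent_roi, second_frame, (2, 1, 3, 2))
changed = select_display_frame(first_frame, sent_roi, second_frame, (0, 0, 2, 2))
print(unchanged, changed)
```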

Optionally, in a possible implementation, the second image frame includes a region of interest indicated in the region of interest information.

Optionally, in a possible implementation manner, the first image frame is obtained by clipping one or more third image frames according to the region of interest information in the second terminal after the second terminal collects the one or more third image frames, and the zoom magnification of the one or more third image frames is the same as the zoom magnification of the second image frame.

Optionally, in a possible implementation, the second image frame is one image frame of the one or more third image frames.

Optionally, in a possible implementation manner, the plurality of image frames are acquired by the second terminal through an optical zooming manner.

Optionally, in a possible implementation manner, the plurality of image frames further includes a fourth image frame, and the zoom magnification of the fourth image frame is the same as the zoom magnification of the second image frame; the fourth image frame, one or more first image frames and the second image frame are sequentially acquired by the second terminal according to a preset rule; the preset rule is that after the second terminal collects a preset number of image frames based on the region of interest information, a target zoom magnification is adopted to collect one image frame, the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

Optionally, in a possible implementation manner, the indication information further includes the frame types of the plurality of image frames, and the frame type of the first image frame is different from the frame type of the second image frame; the method further comprises: the first terminal sequentially sends the first image frames to a first buffer area and sends the second image frames to a second buffer area according to the frame types of the plurality of image frames. The first buffer area is used for storing first image frames, whose frame type is the first type, and the second buffer area is used for storing second image frames, whose frame type is the second type. Each time the second terminal receives a new image frame, the second terminal judges the frame type of the image frame; if the frame type of the image frame is the first type, the image frame is sent to the first buffer area; if the frame type of the image frame is the second type, the image frame is sent to the second buffer area. In both the first buffer area and the second buffer area, old image frames are overwritten by new image frames.

If the region of interest information of the first image frame is different from the region of interest information in the first terminal, a third image frame is obtained by clipping in the second image frame according to the region of interest information in the first terminal, and the third image frame is displayed, including: if the region of interest information of the first image frame in the first buffer area is different from the region of interest information in the first terminal, a third image frame is obtained by cutting out the second image frame in the second buffer area according to the region of interest information in the first terminal, and the third image frame is displayed; and if the region of interest information of the first image frame is the same as the region of interest information in the first terminal, displaying the first image frame, including: and if the region of interest information of the first image frame in the first buffer area is the same as the region of interest information in the first terminal, displaying the first image frame in the first buffer area.
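The two buffer areas and the choice between them can be sketched as follows. Each buffer is modeled as a single slot that a newer frame of the same type overwrites; the single-slot model, the class name `FrameBuffers` and the string labels are assumptions made for illustration.

```python
class FrameBuffers:
    """Two single-slot buffers; a new frame overwrites the old frame of the same type."""

    def __init__(self):
        self.first_buffer = None   # latest (frame, roi) pair of the first type
        self.second_buffer = None  # latest frame of the second type

    def push(self, frame_type, frame, roi=None):
        if frame_type == "first":
            self.first_buffer = (frame, roi)
        elif frame_type == "second":
            self.second_buffer = frame
        else:
            raise ValueError(f"unknown frame type: {frame_type!r}")

    def source_for_display(self, local_roi):
        """Name the buffer whose frame should be used for the current region of interest."""
        if self.first_buffer is not None and self.first_buffer[1] == local_roi:
            return "first_buffer"
        return "second_buffer"  # crop the region of interest from the global frame


buffers = FrameBuffers()
buffers.push("second", "global-0")
buffers.push("first", "local-0", roi=(0, 0, 100, 100))
buffers.push("first", "local-1", roi=(0, 0, 100, 100))  # overwrites local-0
before_change = buffers.source_for_display((0, 0, 100, 100))
after_change = buffers.source_for_display((50, 50, 100, 100))
print(before_change, after_change)
```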

Optionally, in one possible implementation manner, the method further includes: when a user initiates an interaction instruction on a second terminal, the second terminal acquires the interaction instruction, and the interaction instruction is used for indicating a change of the region of interest. The interaction instruction may be, for example, a zoom-out operation instruction, a zoom-in operation instruction, or a pan operation instruction initiated by the user by touching the screen of the second terminal. The zoom-out operation instruction is used for instructing the picture displayed on the screen of the second terminal to be zoomed out by taking the target area as a starting point. The zoom-in operation instruction is used for instructing to zoom in on a target area displayed on the screen of the second terminal. The pan operation instruction is used for instructing to pan the picture displayed on the screen of the second terminal in a specific direction. After the second terminal executes the interaction instruction, the obtained region to be displayed is the new region of interest, so that the second terminal can update the region of interest information according to the interaction instruction to obtain updated region of interest information. In this way, after updating the locally stored region of interest information, the second terminal sends the updated region of interest information to the first terminal.
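How an interaction instruction might update the region of interest information is sketched below. The region is an (x, y, width, height) tuple in global-frame pixels, zooming is assumed to happen about the center of the current region, and the result is clamped to the global frame. These choices and all function names are assumptions; the application does not fix the exact geometry.

```python
def clamp_roi(x, y, w, h, frame_w, frame_h):
    """Keep the rectangle inside a frame of size frame_w x frame_h."""
    w, h = min(w, frame_w), min(h, frame_h)
    x = min(max(x, 0), frame_w - w)
    y = min(max(y, 0), frame_h - h)
    return (x, y, w, h)


def zoom_roi(roi, factor, frame_w, frame_h):
    """factor > 1 zooms in (smaller region), factor < 1 zooms out, about the region center."""
    x, y, w, h = roi
    new_w, new_h = round(w / factor), round(h / factor)
    center_x, center_y = x + w / 2, y + h / 2
    return clamp_roi(round(center_x - new_w / 2), round(center_y - new_h / 2),
                     new_w, new_h, frame_w, frame_h)


def pan_roi(roi, dx, dy, frame_w, frame_h):
    """Shift the region by (dx, dy) pixels."""
    x, y, w, h = roi
    return clamp_roi(x + dx, y + dy, w, h, frame_w, frame_h)


full = (0, 0, 3840, 2160)
zoomed = zoom_roi(full, 2, 3840, 2160)
panned = pan_roi(zoomed, -2000, 100, 3840, 2160)
print(zoomed, panned)
```

The updated tuple is what the receiving terminal would store locally and then feed back to the sending terminal.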

A third aspect of the present application provides a terminal, comprising an acquisition unit, a processing unit and a receiving and transmitting unit; the acquisition unit is used for acquiring a plurality of image frames, wherein the plurality of image frames comprise a first image frame and a second image frame, the zoom magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to the region of interest information, and the region of interest information is used for indicating the position of the region of interest; the receiving and transmitting unit is used for sending the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, the image frame to be displayed is used for generating a video, and the indication information comprises the region-of-interest information of the first image frame.

Optionally, in a possible implementation manner, the acquiring unit is further configured to acquire a third image frame and the second image frame, where a zoom magnification of the third image frame is the same as a zoom magnification of the second image frame; and the processing unit is further used for clipping the third image frame according to the region of interest information to obtain the first image frame.

Optionally, in a possible implementation manner, the acquiring unit is further configured to acquire one or more third image frames, where a zoom magnification of the one or more third image frames is the same as a zoom magnification of the second image frame; the processing unit is further configured to crop the one or more third image frames according to the region of interest information to obtain one or more first image frames; the processing unit is further configured to determine one of the one or more third image frames as the second image frame.

Optionally, in a possible implementation manner, the acquiring unit acquires the first image frame and the second image frame by means of optical zooming.

Optionally, in a possible implementation manner, the plurality of image frames further includes a fourth image frame, and the zoom magnification of the fourth image frame is the same as the zoom magnification of the second image frame; the acquisition unit is further used for sequentially acquiring the fourth image frame, one or more first image frames and the second image frame according to a preset rule; the preset rule is that after the first terminal collects a preset number of image frames based on the region of interest information, a target zoom magnification is adopted to collect one image frame, the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

Optionally, in a possible implementation manner, the acquiring unit is further configured to acquire reference frames of the plurality of image frames; the processing unit is further used for encoding the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result; the receiving and transmitting unit is further used for transmitting a coding result to the second terminal; the reference frame of the first image frame is obtained according to the region of interest information, and the zoom magnification of the reference frame of the second image frame is the same as that of the second image frame.

Optionally, in a possible implementation manner, the processing unit is further configured to encode, by using a first encoder, the first image frame according to a reference frame of the first image frame, so as to obtain a first encoding result; the processing unit is further configured to encode the second image frame by using a second encoder according to the reference frame of the second image frame, so as to obtain a second encoding result; wherein the encoding result includes the first encoding result and the second encoding result.

Optionally, in a possible implementation manner, the indication information further includes a frame type of the plurality of image frames, and the frame type of the first image frame is different from the frame type of the second image frame.

A fourth aspect of the present application provides a terminal, comprising: the device comprises an acquisition unit, a receiving and transmitting unit and a processing unit; the receiving and transmitting unit is used for receiving a plurality of image frames and indication information sent by a second terminal, the plurality of image frames comprise a first image frame and a second image frame, the zoom magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to the region of interest information, the region of interest information is used for indicating the position of a region of interest, and the indication information comprises the region of interest information of the first image frame; the processing unit is used for: if the region of interest information of the first image frame is different from the region of interest information in the first terminal, a third image frame is obtained by cutting in the second image frame according to the region of interest information in the first terminal, and the third image frame is displayed; and if the region of interest information of the first image frame is the same as the region of interest information in the first terminal, displaying the first image frame.

Optionally, in a possible implementation manner, the indication information further includes a frame type of the plurality of image frames, and the frame type of the first image frame is different from the frame type of the second image frame; the processing unit is further configured to sequentially send the first image frame to a first buffer area and the second image frame to a second buffer area according to frame types of the plurality of image frames; the processing unit is further configured to: if the region of interest information of the first image frame in the first buffer area is different from the region of interest information in the first terminal, a third image frame is obtained by cutting out the second image frame in the second buffer area according to the region of interest information in the first terminal, and the third image frame is displayed; and if the region of interest information of the first image frame in the first buffer area is the same as the region of interest information in the first terminal, displaying the first image frame in the first buffer area.

Optionally, in a possible implementation manner, the acquiring unit is configured to obtain an interaction instruction, where the interaction instruction is used to instruct to change the region of interest; the processing unit is further configured to update the region of interest information according to the interaction instruction to obtain updated region of interest information; the transceiver unit is further configured to send the updated region of interest information to the second terminal.

A fifth aspect of the present application provides a terminal comprising: a processor, a non-volatile memory, and a volatile memory; wherein the nonvolatile memory or the volatile memory has stored therein computer readable instructions; the processor reads the computer readable instructions to cause the terminal to implement a method as in any one of the implementations of the first or second aspect.

A sixth aspect of the application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform a method as in any of the implementations of the first or second aspects.

A seventh aspect of the application provides a computer program product which, when run on a computer, causes the computer to perform a method as in any of the implementations of the first or second aspects.

An eighth aspect of the application provides a chip comprising one or more processors. Some or all of the processors are configured to read and execute a computer program stored in a memory to perform the method in any of the possible implementations of any of the aspects described above.

Optionally, the chip includes a memory, and the processor is connected to the memory through a circuit or a wire. Optionally, the chip further comprises a communication interface, and the processor is connected to the communication interface. The communication interface is used for receiving data and/or information to be processed, and the processor acquires the data and/or information from the communication interface, processes the data and/or information and outputs a processing result through the communication interface. The communication interface may be an input-output interface. The method provided by the application can be realized by one chip or by a plurality of chips in a cooperative manner.

Drawings

FIG. 1 is a schematic diagram of a video frame according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a remote video flow according to the first related art;

FIG. 3 is a schematic diagram of a remote video flow according to the second related art;

FIG. 4 is a schematic structural diagram of a terminal 101 according to an embodiment of the present application;

FIG. 5 is a schematic flow chart of a remote video method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of image contrast with different zoom magnifications according to an embodiment of the present application;

FIG. 7 is a schematic diagram of comparing region of interest information in different terminals at different moments according to an embodiment of the present application;

FIG. 8 is a schematic diagram of obtaining a plurality of image frames by digital zooming according to an embodiment of the present application;

FIG. 9 is another schematic diagram of obtaining a plurality of image frames according to an embodiment of the present application;

FIG. 10 is a schematic diagram showing a comparison of image frames obtained based on different modes according to an embodiment of the present application;

FIG. 11 is a schematic diagram of determining a reference frame according to an embodiment of the present application;

FIG. 12 is a schematic diagram of encoding based on two encoders according to an embodiment of the present application;

FIG. 13 is a flowchart of a remote video method according to an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a terminal 1400 according to an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a computer program product 1500 according to an embodiment of the present application.

Detailed Description

Embodiments of the present application will now be described with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the present application. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.

The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules that are expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps in the present application does not mean that the steps in the method flow must be executed according to the time/logic sequence indicated by the naming or numbering, and the execution sequence of the steps in the flow that are named or numbered may be changed according to the technical purpose to be achieved, so long as the same or similar technical effects can be achieved.

With the development of communication technology, remote video is widely applied in video call, remote medical treatment, remote education and other scenes. The picture resolution of remote video is typically no more than 1080P (i.e., 1920 x 1080 pixels) due to power consumption and time delay.

However, in many cases, if the acquisition resolution of the video is 1080P, the sharpness of the acquired video may not meet the actual requirements. For example, referring to fig. 1, fig. 1 is a schematic diagram of a video frame according to an embodiment of the present application. As shown in fig. 1, in a scene of remote job guidance, since the acquisition resolution of video is 1080P, even if the video picture is enlarged, the small-sized fonts in the video picture may not be recognized clearly.

Based on this, two solutions are proposed in the related art. In the first related art, the resolution of the video acquired by the transmitting end is increased, for example to 3840x2160 (4k resolution for short), so that a video with higher resolution is transmitted to the receiving end, thereby solving the problem of unclear video.

For example, referring to fig. 2, fig. 2 is a schematic diagram of a remote video flow according to the first related art. As shown in fig. 2, the transmitting end collects video with 4k resolution through the camera, and after encoding the video with 4k resolution, the transmitting end transmits the encoded video to the receiving end. After receiving the coded video, the receiving terminal decodes the video to obtain a video with 4k resolution, and displays the video with 4k resolution.

In the related art one, by increasing the resolution of the transmitted video, the sharpness of the remote video can be effectively improved. However, compared with the video with 1080P resolution, the power consumption and the time delay for encoding the video with 4k resolution are increased by 4 times, and the real-time performance of the remote video is affected. In addition, due to large coding power consumption, remote video cannot be performed for a long time on some terminals, and normal remote video is affected. Furthermore, the coding rate for coding the video with the resolution of 4k is increased by 4 times, which brings about larger bandwidth cost, so that the scheme is difficult to apply in actual service scenes.

In the second related art, the resolution of the video acquired by the transmitting end is likewise increased, for example to 3840x2160, but the acquired video is reduced to 1080P before being transmitted to the receiving end.

Thus, when the receiving end needs to amplify a certain area (the area may be generally referred to as an area of interest), the receiving end feeds back the area that needs to be amplified to the transmitting end. After the transmitting end acquires the region needing to be amplified, the transmitting end intercepts the region needing to be amplified from the acquired video picture, converts the intercepted video picture into 1080P and transmits the 1080P to the receiving end. Because the intercepted video picture is collected in high resolution, the definition of the video picture can be ensured to meet the requirement.

Referring to fig. 3, fig. 3 is a schematic diagram of a remote video flow according to a second related art. As shown in fig. 3, the transmitting end collects the video with the resolution of 4k through the camera, and intercepts the original video picture according to the information of the region of interest, so as to obtain the video after intercepting part of the picture. Then, the transmitting end encodes the video after the partial picture is intercepted with 1080P resolution, and transmits the encoded video to the receiving end. After receiving the coded video, the receiving terminal decodes the video to obtain a video with 1080P resolution, and displays the video with 1080P resolution.

For example, it is assumed that a transmitting end acquires video at 1920x1080 resolution, encodes the acquired video at 1920x1080 resolution, and transmits the encoded video to a receiving end. The receiving end then receives video at 1920x1080 resolution. If the receiving end needs to enlarge and display a region of interest with a size of 960x540, the receiving end needs to intercept the video picture corresponding to the region of interest from the received original video, upsample the intercepted video picture into a picture with a size of 1920x1080, and finally display the upsampled video picture. Since the original video picture is actually acquired at 1920x1080 resolution, the video picture of the region of interest displayed by the receiving end is a picture upsampled at the receiving end, and therefore the definition of the video picture is not high.

With the second related art, it is assumed that the transmitting end acquires video at 3840x2160 resolution, downsamples the original video to 1920x1080, encodes the downsampled video, and transmits the encoded video to the receiving end. The receiving end then receives video at 1920x1080 resolution. If the receiving end needs to enlarge and display a region of interest with a size of 960x540, the receiving end needs to feed back the position of the region of interest to the transmitting end.

After receiving the position of the region of interest, the transmitting end can determine that the size corresponding to the region of interest in the acquired original video is 1920x1080. Therefore, the transmitting end can intercept the region of interest with the size of 1920x1080 from the acquired original video, encode the intercepted video at 1920x1080 resolution, and transmit the encoded video to the receiving end. In this way, the receiving end receives video intercepted from the original video with 4k resolution and does not need to upsample the received video picture, so that the definition of the video picture displayed by the receiving end is higher.
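The size conversion in this example can be checked with a short sketch. The `Rect` type, the function name, and the top-left position (480, 270) of the region of interest are assumptions made for illustration; the example above only specifies the sizes 960x540, 1920x1080 and 3840x2160.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def map_roi_to_capture(roi: Rect,
                       display_size: Tuple[int, int],
                       capture_size: Tuple[int, int]) -> Rect:
    """Scale a region given in displayed-picture coordinates into the
    coordinates of the picture originally acquired by the transmitting end."""
    scale_x = capture_size[0] / display_size[0]
    scale_y = capture_size[1] / display_size[1]
    return Rect(round(roi.x * scale_x), round(roi.y * scale_y),
                round(roi.width * scale_x), round(roi.height * scale_y))


# A 960x540 region in the 1920x1080 picture (position assumed) covers
# 1920x1080 pixels of the 3840x2160 original, so no upsampling is needed.
roi_in_capture = map_roi_to_capture(Rect(480, 270, 960, 540),
                                    (1920, 1080), (3840, 2160))
print(roi_in_capture)  # Rect(x=960, y=540, width=1920, height=1080)
```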

However, in the second related art, the receiving end needs to feed back the position of the region of interest to the transmitting end, and the transmitting end performs the corresponding processing before transmitting the processed video to the receiving end. Because the receiving end remotely controls the transmitting end in this way, the response time is long, and noticeable stuttering of the video picture is likely to occur. For example, after performing a picture-enlarging operation, the receiving end needs to wait for a long time (typically more than 300 ms) before it can receive the data transmitted by the transmitting end and display the enlarged video picture.

The remote video method provided by the embodiment of the application can be applied to a terminal with a video acquisition function. The terminal, also called user equipment (UE), a mobile station (MS), a mobile terminal (MT), etc., is a device that is equipped with an image acquisition apparatus capable of capturing video and that can communicate remotely with other devices to transmit the captured video to them. Examples include a handheld device with a photographing function and a monitoring camera.

Currently, some examples of terminals are: a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a surveillance camera, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, etc.

The image acquisition device in the terminal is used for converting the optical signal into an electric signal to generate an image signal. The image acquisition device may be, for example, an image sensor, which may be, for example, a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) sensor.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a terminal 101 according to an embodiment of the present application. As shown in fig. 4, the terminal 101 includes a processor 103, and the processor 103 is coupled to a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. A display adapter 107, which may drive a display 109, is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus via a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (e.g., a touch screen), a media tray 121 (e.g., a compact disc read-only memory (CD-ROM) or a multimedia interface), a transceiver 123 (which may transmit and/or receive radio communication signals), a camera 155 (which may capture still and moving digital video images), and an external USB port 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.

The processor 103 may be any conventional processor, including a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, or a combination thereof. Alternatively, the processor may be a dedicated device such as an ASIC.

Terminal 101 may communicate with software deploying server 149 through network interface 129. The network interface 129 is illustratively a hardware network interface, such as a network card. The network 127 may be an external network, such as the Internet, or an internal network, such as an Ethernet or virtual private network (virtual private network, VPN). Optionally, the network 127 may also be a wireless network, such as a WiFi network, a cellular network, or the like.

The hard disk drive interface 131 is coupled to the system bus 105 and is connected to the hard disk drive 133. System memory 135 is coupled to the system bus 105. The data running in the system memory 135 may include an operating system (OS) 137, applications 143, and a schedule for terminal 101.

The operating system includes a shell 139 and a kernel 141. The shell 139 is an interface between the user and the kernel of the operating system, and is the outermost layer of the operating system. The shell manages interactions between the user and the operating system: waiting for user input, interpreting the user input for the operating system, and processing the various output results of the operating system.

The kernel 141 is made up of those parts of the operating system that manage memory, files, peripherals, and system resources. The kernel 141 interacts directly with the hardware. The operating system kernel typically runs processes and provides inter-process communication, CPU time slice management, interrupt handling, memory management, I/O management, and the like.

By way of example, where terminal 101 is a smart phone, application 143 includes a remote video-related program. The terminal 101 can realize remote video with another terminal by executing the application 143. That is, the terminal 101 can collect video through the camera 155, process and encode the video through the processor 103, and then transmit the video to another terminal. In one embodiment, terminal 101 may download application 143 from software deploying server 149 when execution of application 143 is desired.

The application scenario of the remote video method provided by the embodiment of the present application is described above, and the execution process of the remote video method will be described in detail below.

Referring to fig. 5, fig. 5 is a schematic flow chart of a remote video method according to an embodiment of the application. As shown in fig. 5, the remote video method includes the following steps.

In step 501, a first terminal acquires a plurality of image frames, where the plurality of image frames includes a first image frame and a second image frame, a zoom magnification of the first image frame is greater than a zoom magnification of the second image frame, and the first image frame is determined according to region of interest information, where the region of interest information is used to indicate a position of a region of interest.

In this embodiment, the first terminal is a transmitting terminal that collects video and transmits the video to the receiving terminal, and the second terminal is a receiving terminal that receives the video transmitted by the first terminal and displays the video.

During the video acquisition process, the first terminal may continuously acquire image frames at a fixed frame rate, thereby obtaining a plurality of continuous image frames. Among the plurality of image frames, a first image frame and a second image frame are included. The zoom magnification of the first image frame is larger than that of the second image frame.

The zoom magnification is the magnification of an image output by an image sensor of the camera. Under the condition that the camera does not move, the larger the zoom multiplying power of the camera is, the larger the shot object in the image output by the camera is, and the smaller the shot range in the image is; the smaller the zoom magnification of the camera, the smaller the subject in the image output by the camera, and the larger the range captured in the image.

For example, referring to fig. 6, fig. 6 is a schematic diagram illustrating image contrast with different zoom magnifications according to an embodiment of the present application. As shown in fig. 6, for two images with the same size output by the camera under the same scene, the camera obtains an image one by shooting with a smaller zoom magnification, and obtains an image two by shooting with a larger zoom magnification. For the first image with a small zoom magnification, the shooting range of the first image is the whole body of the skier. For the second image with larger zoom ratio, the shooting range of the second image is the head of the skier, namely, the shooting range of the second image is smaller than the shooting range of the first image. In addition, the head of the skier photographed in image two is larger than the head of the skier photographed in image one. Briefly, the second image may be considered to be obtained by enlarging the area of the first image where the head of the skier is located.

The first terminal may store the region of interest information, and the region of interest information may be fed back to the first terminal by the second terminal. The region of interest information is used to indicate the location of the region of interest. The region of interest refers to the region obtained when a user zooms in, zooms out, or pans the video picture displayed on the screen of the terminal through an interaction such as touching the screen with a finger.

In brief, the region of interest is the region to be displayed on the screen of the terminal during the remote video process. Taking fig. 6 as an example, an image one is displayed on the terminal screen, a user touches the terminal screen with a finger to zoom in on the head region of the skier, and the terminal can obtain a corresponding region of interest (namely, the head region of the skier shown in the image two) based on the interactive instruction of the user.

Based on the region of interest information in the first terminal, the first terminal may determine a location of the region of interest, thereby acquiring a first image frame related to the region of interest. The picture content in the first image frame is the content of the region of interest.

Optionally, the first image frame includes only the region of interest indicated in the region of interest information, and the second image frame includes other regions in addition to the region of interest indicated in the region of interest information. That is, the picture in the first image frame includes only the picture content at the position of the region of interest, and the picture in the second image frame includes other picture content in addition to the picture content at the position of the region of interest.

The second image frame may be an image frame acquired by the first terminal at a preset zoom magnification, and the first image frame may be an image frame acquired by the first terminal at a zoom magnification adjusted according to the region of interest information. In the case that the first terminal does not move, the second image frame may be considered as a global image acquired by the first terminal, that is, the second image frame includes all areas within the field of view of the camera of the first terminal; the first image frame may be considered as a partial image acquired by the first terminal, i.e. the first image frame includes only a partial area within the field of view of the camera of the first terminal.

Step 502, the first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, the image frame to be displayed is used for generating a video, and the indication information includes information of a region of interest of the first image frame.

In the process of carrying out remote video on the first terminal and the second terminal, the first terminal sequentially collects the plurality of image frames and sends the plurality of image frames to the second terminal one by one.

Optionally, the first terminal may collect the plurality of image frames with a higher resolution, then reduce the plurality of image frames to a specific resolution, then encode the reduced plurality of image frames, and send the encoded plurality of image frames. For example, the first terminal acquires the plurality of image frames at a resolution of 4k, reduces the plurality of image frames to a resolution of 1080P, and encodes and transmits the reduced plurality of image frames.

In addition, in the process that the first terminal transmits the image frames to the second terminal, the first terminal also transmits indication information to the second terminal, the indication information being used for indicating the region of interest information of each image frame.

In short, the first terminal may carry, in the process of sending each image frame to the second terminal, indication information corresponding to each image frame, where the indication information indicates the region of interest information of the image frame transmitted by the first terminal. Wherein the indication information may include coordinate information for indicating a location of the region of interest.

For example, in the case where the region of interest is a rectangular region, coordinate information of four vertices of the rectangular region may be included in the indication information.

For another example, in the case where the region of interest is a rectangular region, the indication information may include coordinate information of one vertex of the rectangular region (e.g., an upper left corner vertex of the rectangular region) and a width and a height of the rectangular region. In this way, the coordinate information of the four vertices of the rectangular region can be calculated based on the coordinate information of one vertex of the rectangular region and the width and height of the rectangular region.
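The calculation of the four vertices can be sketched as follows. The function name is hypothetical, and the sketch assumes a pixel coordinate system whose origin is the upper left corner of the picture, with x increasing to the right and y increasing downward; the application does not specify the coordinate convention.

```python
from typing import List, Tuple


def rect_vertices(left: int, top: int, width: int, height: int) -> List[Tuple[int, int]]:
    """Return the four vertices of a rectangular region of interest,
    given its upper left vertex and its width and height.

    Order: upper left, upper right, lower right, lower left.
    """
    right = left + width
    bottom = top + height
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


print(rect_vertices(100, 50, 960, 540))
# [(100, 50), (1060, 50), (1060, 590), (100, 590)]
```

Sending one vertex plus the width and height carries the same information as the four vertices with fewer values.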

Optionally, when the first terminal performs video encoding according to the H.264/H.265 video compression standard, the first terminal may write the indication information into supplemental enhancement information (SEI), so that the indication information corresponding to each image frame is carried in the transmission data.

In a possible embodiment, because the second image frame includes other regions in addition to the region of interest, the first terminal may omit the indication information, that is, the position of the region of interest in the second image frame, when transmitting the second image frame. In another possible embodiment, the first terminal still carries the indication information when transmitting the second image frame, to indicate the position of the region of interest in the second image frame.

Optionally, in the process that the first terminal sends the plurality of image frames to the second terminal, the first terminal may encode the image frames one by one to obtain encoded image information, and then send the encoded image information and indication information corresponding to the encoded image frames to the second terminal.

Optionally, in order to facilitate the second terminal to determine the first image frame and the second image frame of the plurality of image frames, the indication information may further include a frame type of the plurality of image frames. Wherein the frame type of the first image frame is different from the frame type of the second image frame.

Illustratively, the first terminal may indicate the frame type of the image frame by a certain bit in the indication information. For example, when the bit used to indicate the frame type in the indication information is set to 1, the frame type of the current image frame is the second type, that is, the current image frame is the second image frame described above; when this bit is not set (i.e., the bit has a value of 0), the frame type of the current image frame is the first type, that is, the current image frame is the first image frame described above. The frame type of the image frame may also be indicated in other manners, which is not particularly limited in this embodiment.
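As a minimal sketch of how such indication information might be serialized, the following packs a frame type bit together with a rectangular region of interest. The byte layout (one flag byte followed by four unsigned 16-bit little-endian values) and all names are assumptions made for illustration; the application does not define a layout, and the sketch produces only payload bytes, not a complete SEI message.

```python
import struct
from typing import Tuple

# Hypothetical layout: flags (1 byte), then x, y, width, height (uint16 each).
_LAYOUT = struct.Struct("<BHHHH")
FRAME_TYPE_BIT = 0x01  # assumed position: bit 0 set means "second image frame"


def pack_indication(is_second_frame: bool, x: int, y: int,
                    width: int, height: int) -> bytes:
    """Serialize the frame type bit and the region of interest rectangle."""
    flags = FRAME_TYPE_BIT if is_second_frame else 0
    return _LAYOUT.pack(flags, x, y, width, height)


def unpack_indication(payload: bytes) -> Tuple[bool, Tuple[int, int, int, int]]:
    """Recover the frame type and the region of interest rectangle."""
    flags, x, y, width, height = _LAYOUT.unpack(payload)
    return bool(flags & FRAME_TYPE_BIT), (x, y, width, height)


payload = pack_indication(False, 480, 270, 960, 540)  # a first image frame
print(len(payload))                # 9
print(unpack_indication(payload))  # (False, (480, 270, 960, 540))
```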

In step 503, the second terminal receives a plurality of image frames and indication information sent by the first terminal.

After the second terminal receives the data sent by the first terminal, the second terminal decodes the data sent by the first terminal to obtain a plurality of image frames sent by the first terminal. Illustratively, in the process that the first terminal continuously transmits the encoded data to the second terminal, the second terminal continuously decodes the received encoded data, so as to sequentially obtain the plurality of image frames and the indication information corresponding to each image frame.

It is noted that each time the first terminal acquires an image frame, the first terminal encodes that image frame and transmits the encoded image frame and the corresponding indication information to the second terminal. Thus, the second terminal receives the image frames transmitted from the first terminal one by one, rather than receiving a plurality of image frames at a time.

In step 504, if the region of interest information of the first image frame is different from the region of interest information in the second terminal, the second terminal crops a fifth image frame from the second image frame according to the region of interest information in the second terminal, and displays the fifth image frame.

In this embodiment, after the second terminal receives the image frames sent by the first terminal, the second terminal may select the image frames to be displayed according to the indication information corresponding to each image frame.

Specifically, the second terminal, as the receiving end, can receive an interaction instruction from the user and update the region of interest information according to the interaction instruction. However, the second terminal needs to feed the updated region of interest information back to the first terminal, and the first terminal acquires a new image frame based on the updated region of interest information and then sends the new image frame to the second terminal. Therefore, the image frames that match the updated region of interest information arrive with a certain lag; that is, for a period of time after the second terminal updates its local region of interest information, the region of interest information corresponding to the image frames received by the second terminal is different from the updated region of interest information.

Therefore, the second terminal determines whether the region of interest information of the first image frame is identical to the region of interest information stored in the second terminal before the first image frame is displayed. If the region of interest information of the first image frame is different from the region of interest information stored in the second terminal, the second terminal cuts out a fifth image frame in the second image frame according to the region of interest information in the second terminal, and displays the fifth image frame.

In short, because the zoom magnification of the second image frame is smaller than that of the first image frame, the second image frame is in effect a global image frame. When the region of interest changes, the position of the new region of interest can be located in the second image frame, and the picture at that position can be cropped out to obtain the fifth image frame. The content of the fifth image frame is the content at the position indicated by the region of interest information of the second terminal.

In step 505, if the region of interest information of the first image frame is the same as the region of interest information in the second terminal, the second terminal displays the first image frame.

The second terminal may display the first image frame if the region of interest information of the first image frame is identical to the region of interest information stored in the second terminal.
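
The selection made in steps 504 and 505 can be sketched as follows. This is a simplified illustration, assuming that the region of interest information is a (left, top, width, height) rectangle in the pixel coordinates of the second image frame and that an image is a list of pixel rows; the names Roi, crop and select_frame_to_display are hypothetical, and any scaling of the cropped picture to the display size is omitted.

```python
from typing import List, Tuple

Roi = Tuple[int, int, int, int]  # assumed layout: (left, top, width, height)
Image = List[List[int]]          # rows of pixel values


def crop(image: Image, roi: Roi) -> Image:
    """Cut the rectangle described by roi out of image."""
    left, top, width, height = roi
    return [row[left:left + width] for row in image[top:top + height]]


def select_frame_to_display(first_frame: Image, first_frame_roi: Roi,
                            second_frame: Image, local_roi: Roi) -> Image:
    """Pick the picture the second terminal shows for the current local ROI."""
    if first_frame_roi == local_roi:
        # Step 505: the received local frame already matches the local ROI.
        return first_frame
    # Step 504: the local ROI has changed, so crop it from the global frame.
    return crop(second_frame, local_roi)
```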

For ease of understanding, the process by which the second terminal selects the displayed image frames will be described below in connection with specific examples.

Referring to fig. 7, fig. 7 is a schematic diagram illustrating comparison of information of regions of interest in different terminals at different moments in time according to an embodiment of the present application. In fig. 7, a solid line box in the image represents a position indicated by the region of interest information in the current terminal.

At time t1, the region of interest information in the first terminal is the same as the region of interest information in the second terminal, i.e. the regions of interest indicated by the region of interest information in the first terminal and the second terminal are both located at the position of the skier's head. At this time, the region of interest information corresponding to the first image frame transmitted by the first terminal is the same as the region of interest information in the second terminal, and the second terminal selects and displays the first image frame, and the content of the first image frame is the head of the skier marked in the solid line frame.

At time t2, the second terminal receives an interaction instruction of the user, where the interaction instruction is specifically a pan operation instruction, and the second terminal updates the region of interest information according to the interaction instruction. In the second terminal, the region of interest indicated by the region of interest information before updating is located at the head of the skier, and the region of interest indicated by the region of interest information after updating is located at the skier's board. Because the region of interest information in the second terminal has changed, the region of interest information of the first image frame sent by the first terminal is different from the region of interest information in the second terminal. The second terminal therefore crops the updated region of interest from the second image frame to obtain a fifth image frame, and displays the fifth image frame. The content of the fifth image frame is the board marked in the solid line frame.

At time t3, the first terminal receives the updated region of interest information fed back by the second terminal, so that the first terminal acquires a new first image frame according to the updated region of interest information and sends the new first image frame to the second terminal. Because of the transmission delay between the first terminal and the second terminal, at time t3 the first image frame received by the second terminal was actually still acquired by the first terminal based on the region of interest information before updating. That is, the region of interest information of the first image frame received by the second terminal is not the same as the region of interest information in the second terminal, and the second terminal still displays the fifth image frame cropped from the second image frame.

At time t4, the region of interest information in the first terminal is the same as the region of interest information in the second terminal, and the second terminal receives the first image frame acquired by the first terminal based on the updated region of interest information. At this time, the region of interest information corresponding to the first image frame sent by the first terminal is the same as the region of interest information in the second terminal, and the second terminal selects and displays the first image frame, where the content of the first image frame is the board marked in the solid line frame.

In this embodiment, in the process of transmitting image frames related to the region of interest to the second terminal as the receiving end, the first terminal as the transmitting end additionally sends image frames with a lower zoom magnification, that is, image frames whose picture includes more content. Therefore, when the region of interest in the second terminal changes, the second terminal can promptly crop the picture corresponding to the changed region of interest from an image frame with the lower zoom magnification and display it. In other words, the second terminal can display the adjusted video picture without waiting for a long time, which shortens the response time when the region of interest of the video is adjusted and avoids stuttering of the video picture.

In one possible embodiment, to facilitate the selection of the displayed image frames by the second terminal, different buffers may be created in the second terminal. After receiving image frames of different frame types, the second terminal sends the image frames to the corresponding buffers. Then, the second terminal selects one of the buffers according to the region of interest information and displays the image frame in that buffer.

In an exemplary case where the indication information further includes frame types of the plurality of image frames, the second terminal sequentially transfers the first image frames to a first buffer and the second image frames to a second buffer according to the frame types of the plurality of image frames. The first buffer area is used for storing first image frames with a first frame type, and the second buffer area is used for storing second image frames with a second frame type.

Each time the second terminal receives a new image frame, the second terminal judges the frame type of the image frame. If the frame type of the image frame is the first type, the image frame is sent to the first buffer; if the frame type of the image frame is the second type, the image frame is sent to the second buffer. In both the first buffer and the second buffer, a new image frame overwrites the old image frame.
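
The buffer handling described above can be illustrated with the following sketch. It assumes that each buffer holds exactly one image frame, which is one way of realizing the overwrite behavior; the class name FrameBuffers and the type labels "first" and "second" are hypothetical.

```python
from typing import Any, Dict, Optional


class FrameBuffers:
    """One single-slot buffer per frame type; a new frame overwrites the old one."""

    def __init__(self) -> None:
        self._slots: Dict[str, Optional[Any]] = {"first": None, "second": None}

    def put(self, frame_type: str, frame: Any) -> None:
        """Route a received frame to the buffer that matches its frame type."""
        if frame_type not in self._slots:
            raise ValueError(f"unknown frame type: {frame_type}")
        self._slots[frame_type] = frame

    def get(self, frame_type: str) -> Optional[Any]:
        """Return the newest frame of the given type, or None if none has arrived."""
        return self._slots[frame_type]
```

With this layout, the newest local image frame and the newest global image frame are always available at the same time, so the second terminal can switch between them as soon as its region of interest information changes.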

After the second terminal sends the image frames to the corresponding buffers based on the frame types of the image frames, the second terminal determines which buffer to select for displaying the image frames according to the region of interest information of the image frames in the buffers.

Specifically, if the region of interest information of the first image frame in the first buffer is different from the region of interest information in the second terminal, the second terminal crops a fifth image frame from the second image frame in the second buffer according to the region of interest information in the second terminal, and displays the fifth image frame. If the region of interest information of the first image frame in the first buffer is the same as the region of interest information in the second terminal, the second terminal displays the first image frame in the first buffer.

In one possible embodiment, during the process of displaying the image frame by the second terminal, the second terminal may acquire the interaction instruction of the user in real time, and update the region of interest in real time according to the interaction instruction.

Illustratively, when a user initiates an interaction instruction on the second terminal, the second terminal obtains the interaction instruction, where the interaction instruction is used to instruct a change of the region of interest. The interaction instruction may be, for example, a zoom-out operation instruction, a zoom-in operation instruction, or a pan operation instruction initiated by the user by touching the screen of the second terminal. The zoom-out operation instruction is used for instructing that the picture displayed on the screen of the second terminal be zoomed out by taking the target area as a starting point. The zoom-in operation instruction is used for instructing that a target area displayed on the screen of the second terminal be zoomed in. The pan operation instruction is used for instructing that the picture displayed on the screen of the second terminal be panned in a specific direction.

After the second terminal executes the interaction instruction, the obtained region to be displayed is a new region of interest, so that the second terminal can update the region of interest information according to the interaction instruction, and updated region of interest information is obtained. In this way, after updating the locally stored region of interest information, the second terminal sends the updated region of interest information to the first terminal.
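
As a rough sketch of how an interaction instruction might be turned into updated region of interest information, the example below represents the region of interest as a rectangle in coordinates normalized to the global picture (0.0 to 1.0). The representation, the choice to zoom about the center of the current region, the clamping to the picture bounds, and the names Roi, pan and zoom are all assumptions for illustration; the embodiment does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roi:
    left: float
    top: float
    width: float
    height: float


def _clamp(roi: Roi) -> Roi:
    """Keep the region inside the normalized global picture [0, 1] x [0, 1]."""
    width = min(roi.width, 1.0)
    height = min(roi.height, 1.0)
    left = min(max(roi.left, 0.0), 1.0 - width)
    top = min(max(roi.top, 0.0), 1.0 - height)
    return Roi(left, top, width, height)


def pan(roi: Roi, dx: float, dy: float) -> Roi:
    """Pan operation: shift the region by (dx, dy)."""
    return _clamp(Roi(roi.left + dx, roi.top + dy, roi.width, roi.height))


def zoom(roi: Roi, factor: float) -> Roi:
    """Zoom about the region center; factor > 1 zooms in, factor < 1 zooms out."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    cx = roi.left + roi.width / 2
    cy = roi.top + roi.height / 2
    width, height = roi.width / factor, roi.height / factor
    return _clamp(Roi(cx - width / 2, cy - height / 2, width, height))
```

The value returned by pan or zoom would then be stored locally and sent to the first terminal as the updated region of interest information.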

The above has described the procedure in which the first terminal transmits image frames having different zoom magnifications to the second terminal, and the second terminal selects an image frame to be displayed according to the region-of-interest information. For ease of understanding, the process of the first terminal acquiring a plurality of image frames and transmitting the plurality of image frames to the second terminal will be described in detail below.

Optionally, the first terminal may acquire the plurality of image frames in a plurality of modes.

In the first mode, the first terminal may acquire the plurality of image frames by digital zooming.

Digital zooming means that a processor in the first terminal enlarges the area occupied by each pixel of a partial area of the acquired image, thereby magnifying that partial area. In practice, digital zooming does not change the focal length of the lens.

Illustratively, the first terminal sequentially acquires one or more third image frames and the second image frame, wherein the zoom magnification of the one or more third image frames is the same as the zoom magnification of the second image frame. That is, the first terminal continuously acquires, at a specific zoom magnification, a plurality of image frames including the one or more third image frames and the second image frame described above. Then, the first terminal crops the one or more third image frames according to the region of interest information to obtain the one or more first image frames. Specifically, after the first terminal acquires a third image frame, the first terminal crops the third image frame according to the region of interest information in the first terminal, that is, cuts out the region of interest in the third image frame based on the position indicated by the region of interest information, so as to obtain a first image frame. In this way, the first terminal obtains the first image frame by digital zooming.

Referring to fig. 8, fig. 8 is a schematic diagram of obtaining a plurality of image frames by digital zooming according to an embodiment of the present application. As shown in fig. 8, the first terminal sequentially acquires an image 1, an image 2, an image 3, an image 4, an image 5, and an image 6 at a fixed zoom magnification, wherein the images 1 to 5 correspond to the third image frame described above, and the image 6 corresponds to the second image frame described above. In the process of acquiring the images 1-5 by the first terminal, after the first terminal acquires the image 1, cutting the image 1 based on the region of interest information to obtain an image A1; similarly, after the first terminal acquires the image 2, the image 2 is cropped based on the region of interest information, resulting in the image A2. Similarly, the first terminal obtains images A1-A5 corresponding to the images 1-5 by digital zooming, and the images A1-A5 correspond to the first image frame. For image 6, the first terminal no longer clips image 6 based on the region of interest information, i.e. image 6 in fig. 8 is identical to image B.

In the first mode, the first terminal may set an interval number, which indicates the number of image frames between two adjacent global image frames. Here, the first image frame may be referred to as a local image frame and the second image frame may be referred to as a global image frame. That is, each time the first terminal has acquired a specific number of local image frames, the first terminal acquires one global image frame. For example, in the case of an interval number of 4, the first terminal acquires one global image frame after every 4 local image frames. Thus, when the first terminal performs image acquisition at a frame rate of 30, the first terminal acquires 30 image frames per second, and the 30 image frames include 24 local image frames and 6 global image frames. In practical applications, the interval number may be fixed, for example 4 or 5. The interval number may also be non-fixed; for example, the first terminal acquires one global image frame after 4 local image frames, and then acquires the next global image frame after 5 local image frames. This embodiment does not limit the interval number set in the first terminal.
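
The interval-based acquisition described above, including the 24 + 6 split at a frame rate of 30, can be checked with the following sketch. The function name capture_schedule and the "local"/"global" labels are hypothetical; a list with one entry models a fixed interval number, and a longer list models a non-fixed one that repeats cyclically, which is only one possible interpretation of a non-fixed interval.

```python
from itertools import cycle
from typing import Iterable, List


def capture_schedule(intervals: Iterable[int], total_frames: int) -> List[str]:
    """Label each acquired frame as 'local' or 'global'.

    Each interval is the number of local frames acquired before the next
    global frame; the intervals are reused cyclically.
    """
    intervals = list(intervals)
    if not intervals or any(n < 0 for n in intervals):
        raise ValueError("intervals must be a non-empty list of non-negative integers")
    schedule: List[str] = []
    for interval in cycle(intervals):
        for label in ["local"] * interval + ["global"]:
            if len(schedule) == total_frames:
                return schedule
            schedule.append(label)
    return schedule  # not reached; cycle() never ends on a non-empty list


one_second = capture_schedule([4], total_frames=30)
local_count = one_second.count("local")    # 24 at an interval number of 4
global_count = one_second.count("global")  # 6 at an interval number of 4
```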

For example, a fourth image frame may be further included in the plurality of image frames acquired by the first terminal, and the zoom magnification of the fourth image frame is the same as the zoom magnification of the second image frame. The process of acquiring the plurality of image frames by the first terminal specifically includes: the first terminal sequentially acquires the fourth image frame, the one or more first image frames and the second image frame according to a preset rule. The preset rule is that, after the first terminal acquires a preset number of image frames based on the region of interest information, the first terminal acquires one image frame at a target zoom magnification, where the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

In the second mode, the first terminal acquires a plurality of local image frames based on the region of interest information and inserts the global image frames.

Illustratively, the first terminal acquires one or more third image frames, the zoom magnification of which is the same as the zoom magnification of the second image frame; the first terminal crops the one or more third image frames according to the region of interest information to obtain one or more first image frames; and the first terminal determines one of the one or more third image frames as the second image frame. That is, the first terminal continuously obtains first image frames (i.e., local image frames) by digital zooming based on the region of interest information, and inserts second image frames (i.e., global image frames) among the obtained first image frames.

That is, the first terminal inserts one global image frame with a low zoom magnification each time after acquiring a specific number of image frames based on the region of interest information. Thus, the number of image frames transmitted by the first terminal to the second terminal is greater than the number of image frames actually acquired by the first terminal.

Referring to fig. 9, fig. 9 is a schematic diagram of obtaining a plurality of image frames according to another embodiment of the present application. As shown in fig. 9, the first terminal acquires image 1, image 2, image 3, image 4, and image 5 in this order at a fixed zoom magnification, wherein image 1-image 5 correspond to the third image frame described above. In the process of acquiring the images 1-5 by the first terminal, after acquiring the image 1 by the first terminal, cutting the image 1 based on the region of interest information to obtain an image A1; similarly, after the first terminal acquires the image 2, the image 2 is cropped based on the region of interest information, resulting in the image A2. Similarly, the first terminal obtains images A1-A5 corresponding to the images 1-5 by digital zooming, and the images A1-A5 correspond to the first image frame. After obtaining the images A1-A5, the first terminal determines the image 5 as a second image frame, i.e. inserts an image B after the image A5 based on the image 5, the image B being identical to the image 5.

It can be understood that, when the image frames are obtained in the first mode and the region of interest information of the second terminal does not change, the second terminal does not display the global image frames acquired by the first terminal, so the frequency at which the second terminal displays image frames is lower than the frequency at which the first terminal actually acquires image frames.

For example, in the case where the first terminal acquires one global image frame after every 5 local image frames, the first terminal acquires 30 image frames per second, and the 30 image frames include 25 local image frames and 5 global image frames. The second terminal therefore displays 25 local image frames per second.

When the image frames are obtained in the second mode, in the case where the first terminal inserts one global image frame after every 5 local image frames, the first terminal acquires 30 local image frames per second and inserts 6 global image frames, that is, 36 image frames in total. The second terminal therefore displays 30 local image frames per second. In this way, obtaining the image frames in the second mode preserves the frequency at which the second terminal displays image frames and improves the fluency of the video picture.

For example, referring to fig. 10, fig. 10 is a schematic diagram illustrating a comparison of image frames obtained based on different modes according to an embodiment of the present application. In fig. 10, the first terminal acquires images 1 to 10 based on a specific zoom magnification. When the first terminal obtains an image frame to be sent to the second terminal in the first mode, the first terminal cuts the images 1-4 and 6-9 based on the region of interest information to obtain images A1-A8 respectively; the first terminal obtains an image B1 and an image B2 based on the image 5 and the image 10. That is, the first terminal transmits a total of 10 image frames to the second terminal.

When the first terminal obtains the image frames to be sent to the second terminal in the second mode, the first terminal crops the images 1-10 based on the region of interest information to obtain images A1-A10 respectively; further, the first terminal obtains an image B1 and an image B2 based on the image 5 and the image 10, inserts the image B1 after the image A5, and inserts the image B2 after the image A10. That is, the first terminal transmits a total of 12 image frames to the second terminal. When the first terminal obtains the image frames to be sent to the second terminal in the second mode, the second terminal receives more image frames and more of them can be used for display, so that the frequency at which the second terminal displays image frames is preserved and the fluency of the video picture is improved.
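
The comparison of fig. 10 can be reproduced with a small sketch that only tracks frame labels. The function names mode_one and mode_two and the parameter period (every period-th acquired image yields a global image frame) are hypothetical; the A and B labels follow the numbering used in the description above.

```python
from typing import List


def mode_one(captured: int, period: int) -> List[str]:
    """First mode: every period-th acquired image is sent as a global frame (B);
    all other acquired images are cropped into local frames (A)."""
    frames: List[str] = []
    local_index = global_index = 0
    for index in range(1, captured + 1):
        if index % period == 0:
            global_index += 1
            frames.append(f"B{global_index}")
        else:
            local_index += 1
            frames.append(f"A{local_index}")
    return frames


def mode_two(captured: int, period: int) -> List[str]:
    """Second mode: every acquired image is cropped into a local frame (A), and a
    global frame (B) based on the same image is inserted after every period-th one."""
    frames: List[str] = []
    global_index = 0
    for index in range(1, captured + 1):
        frames.append(f"A{index}")
        if index % period == 0:
            global_index += 1
            frames.append(f"B{global_index}")
    return frames
```

For 10 acquired images and a period of 5, mode_one yields 10 frames of which 8 are local, while mode_two yields 12 frames of which 10 are local, matching the counts given for fig. 10.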

In the third mode, the first terminal may acquire the first image frame and the second image frame by optical zooming.

Optical zooming refers to zooming by means of the optical lens structure of the first terminal, that is, the first terminal enlarges or reduces the scene to be photographed by moving the lens. Although both digital zooming and optical zooming help to magnify a distant object during telephoto shooting, only optical zooming adds more pixels to the imaging of the subject, so that the subject is not only enlarged but also relatively clear. That is, when an object in an image is enlarged by optical zooming, the object in the image remains relatively clear.

Since optical zooming enlarges or reduces the image frame captured by the lens by changing the focal length, with the center of the field of view of the lens as the center point, capturing the first image frame by optical zooming in effect means calculating the longest focal length at which the field of view still covers the region of interest, and then acquiring an image frame that includes the region of interest at that focal length. If that image frame contains exactly the region of interest, it may be determined as the first image frame; if it contains a non-region of interest in addition to the region of interest, the region of interest may be cropped from it to obtain the first image frame.
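
The calculation of the longest covering focal length can be sketched under a simplified geometric model. The model assumes that the region of interest is given in coordinates normalized to the widest (1x) field of view, that the field of view at magnification z is a window of side 1/z centered at (0.5, 0.5), and that magnification stands in for focal length; the function names, the max_zoom limit and the neglect of aspect ratio are illustrative assumptions, not details from the embodiment.

```python
def max_covering_zoom(left: float, top: float, width: float, height: float,
                      max_zoom: float = 10.0) -> float:
    """Largest magnification whose centered field of view still contains the ROI."""
    right, bottom = left + width, top + height
    # Farthest distance from the optical center to any edge of the ROI.
    reach = max(abs(left - 0.5), abs(right - 0.5), abs(top - 0.5), abs(bottom - 0.5))
    if reach == 0:
        return max_zoom
    return min(max(0.5 / reach, 1.0), max_zoom)


def needs_crop(width: float, height: float, zoom: float) -> bool:
    """True if the field of view at this zoom is larger than the ROI, so that the
    ROI still has to be cropped out to obtain the first image frame."""
    side = 1.0 / zoom
    return width < side or height < side
```

In this model, a region of interest centered on the optical axis can be captured exactly, whereas an off-center region forces a wider field of view and a subsequent crop, which corresponds to the two cases described above.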

In this embodiment, the first terminal may adjust the zoom magnification in an optical zoom manner based on the region of interest information, and acquire the first image frame described above. Then, the first terminal adjusts the zoom magnification again to acquire a second image frame.

In one possible embodiment, during the process of encoding the plurality of image frames by the first terminal, the first terminal may select the corresponding reference frame according to the frame type of the image frame.

Generally, in the video encoding process, a transmitting end selects a previous image frame of a current image frame as a reference frame, calculates difference information between the current image frame and the reference frame, and transmits the difference information between the current image frame and the reference frame to a receiving end instead of directly transmitting the current image frame to the receiving end, thereby playing a role in reducing transmitted data.

In the present embodiment, the first image frame and the second image frame are acquired based on different zoom magnifications. Therefore, the difference between a first image frame and a second image frame is relatively large compared with the difference between two adjacent first image frames. When the second image frame is encoded, if the image frame preceding the second image frame (i.e., a first image frame) is taken as the reference frame, the data amount of the obtained difference information is larger, and the data transmission amount is increased. Based on this, the first terminal may select the corresponding reference frame according to the frame type of the image frame, to ensure that the image frame currently to be encoded and its reference frame have the same frame type.

Illustratively, in the process of encoding the plurality of image frames, the first terminal acquires reference frames of the plurality of image frames, wherein the reference frames of the first image frame are obtained according to the region of interest information, and the zoom magnification of the reference frames of the second image frame is the same as the zoom magnification of the second image frame.

Briefly, in the process of acquiring an image frame by the first terminal, the frame type of a local image frame (i.e., the first image frame) acquired by the first terminal based on the region of interest information is a first type, and the frame type of a global image frame (i.e., the second image frame) acquired based on a specific zoom magnification is a second type. For any one of the image frames of the first type, the first terminal may determine that the reference frame of the image frame is the image frame of the first type preceding the image frame, i.e., the image frame of the first type temporally closest to the image frame. Similarly, for any one of the image frames of the second type, the first terminal may determine that the reference frame of the image frame is the image frame of the second type preceding the image frame, i.e., the image frame of the second type temporally closest to the image frame.

That is, for any one image frame, the reference frame of that image frame is the one of the same type of image frame that is temporally closest to that image frame.

After obtaining the reference frame, the first terminal encodes the plurality of image frames according to the reference frame of the plurality of image frames to obtain an encoding result; and the first terminal sends the coding result to the second terminal.

For example, referring to fig. 11, fig. 11 is a schematic diagram of determining a reference frame according to an embodiment of the present application. As shown in fig. 11, the images A1 to A10 are images acquired by the first terminal based on the region of interest information, and the frame types of the images A1 to A10 are the first type. The images B1 and B2 are images acquired by the first terminal based on a specific zoom magnification, and the frame types of the images B1 and B2 are the second type.

The reference frame of the image A2 is the image A1, the reference frame of the image A3 is the image A2, and so on; for example, the reference frame of the image A6 is the image A5. For any one of the images A1-A10, the reference frame of that image is the previous image of the same first type. Similarly, the reference frame of the image B2 is the image B1, not the image A10. For any one of the images B1-B2, the reference frame of that image is the previous image of the same second type.
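
The rule that each image frame refers to the temporally closest earlier image frame of the same type can be sketched as follows. The function name assign_reference_frames and the (name, frame type) tuple layout are hypothetical, and the frame order used in the usage example (B1 after A5, B2 after A10) is an assumption taken from the second mode of fig. 10.

```python
from typing import Dict, List, Optional, Tuple


def assign_reference_frames(frames: List[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map each frame name to the nearest earlier frame of the same type.

    frames lists (name, frame_type) pairs in temporal order. The first frame of
    each type has no reference and maps to None.
    """
    last_of_type: Dict[str, str] = {}
    references: Dict[str, Optional[str]] = {}
    for name, frame_type in frames:
        references[name] = last_of_type.get(frame_type)
        last_of_type[frame_type] = name
    return references


sequence = ([(f"A{i}", "first") for i in range(1, 6)] + [("B1", "second")]
            + [(f"A{i}", "first") for i in range(6, 11)] + [("B2", "second")])
references = assign_reference_frames(sequence)
```

In the resulting mapping, A6 refers to A5 even though B1 was sent in between, and B2 refers to B1 rather than to A10.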

In this embodiment, since the reference frame of the image frame to be encoded is not necessarily the image frame preceding the image frame to be encoded in the encoding process, for an encoder that uses the image frame preceding the image frame to be encoded as the reference frame, the encoder may not encode the image frame well.

In view of this, it is proposed in the present embodiment to use two encoders to encode image frames of different frame types, so as to ensure smooth encoding of the image frames.

The first terminal encodes the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result, which may specifically include: the first terminal encodes the first image frame through a first encoder according to the reference frame of the first image frame to obtain a first encoding result; the first terminal encodes the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result; wherein the encoding result includes the first encoding result and the second encoding result.

In brief, in practical application, for image frames acquired based on the region of interest information, the first terminal inputs the image frames into the first encoder, and the first encoder encodes these image frames. For image frames acquired at a specific zoom magnification, the first terminal inputs the image frames into the second encoder, and the second encoder encodes these image frames.

For example, referring to fig. 12, fig. 12 is a schematic diagram of encoding based on two encoders according to an embodiment of the present application. As shown in fig. 12, the images A1 to A10 are images acquired by the first terminal based on the region of interest information, and the frame types of the images A1 to A10 are the first type. The images B1 and B2 are images acquired by the first terminal based on a specific zoom magnification, and the frame types of the images B1 and B2 are the second type.

The input of the first encoder is the images A1-A10. In the process of encoding by the first encoder, the reference frame of the image A2 is the image A1, the reference frame of the image A3 is the image A2, and so on; for example, the reference frame of the image A6 is the image A5. For any one of the images A1-A10 (except the first image input to the first encoder), the reference frame of that image is the previous image.

The inputs of the second encoder are image B1 and image B2. The reference frame of the image B2 is the image B1. I.e. for any one of the pictures input to the second encoder (except for the first picture input to the second encoder), the reference frame of that picture is the previous picture.
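
The effect of routing each frame type to its own encoder can be illustrated with a toy model. The class ToyEncoder below is not a real video encoder: it merely emits the per-pixel difference from the previous frame that it encoded, which is an assumption used to show that each encoder naturally takes the previous frame of the same type as its reference. The names ToyEncoder, encoders and encode_frame are hypothetical, and frames are assumed to be equal-length lists of pixel values.

```python
from typing import Dict, List, Optional


class ToyEncoder:
    """Toy stand-in for an encoder that always references its own previous input."""

    def __init__(self) -> None:
        self._reference: Optional[List[int]] = None

    def encode(self, frame: List[int]) -> List[int]:
        if self._reference is None:
            residual = list(frame)  # first input: no reference, sent as-is
        else:
            residual = [cur - ref for cur, ref in zip(frame, self._reference)]
        self._reference = list(frame)
        return residual


# One encoder per frame type, as in the two-encoder arrangement.
encoders: Dict[str, ToyEncoder] = {"first": ToyEncoder(), "second": ToyEncoder()}


def encode_frame(frame_type: str, frame: List[int]) -> List[int]:
    """Send the frame to the encoder that handles its frame type."""
    return encoders[frame_type].encode(frame)
```

Because the two encoders keep separate state, a global image frame inserted between two local image frames does not disturb the reference chain of the local image frames, and neither encoder has to support a reference other than its previous input.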

For easy understanding, the following describes in detail the flow of the remote video method provided in the embodiment of the present application with reference to specific examples. Referring to fig. 13, fig. 13 is a flowchart of a remote video method according to an embodiment of the application. As shown in fig. 13, the remote video method includes steps 1301-1315, where steps 1301-1306 are performed by a first terminal and steps 1307-1315 are performed by a second terminal.

In step 1301, during the process of capturing an image frame, the first terminal determines whether the image frame currently required to be captured is a multiple of N.

In the process of remote video, the first terminal continuously acquires image frames through the camera. In the process of acquiring image frames, the first terminal can acquire one global image frame if N-1 local image frames are acquired every interval, so that the first terminal can judge whether the global image frame or the local image frame is acquired currently by judging whether the image frame which needs to be acquired currently is a multiple of N.

For example, assuming N is 5, the first terminal acquires one global image frame after every 4 local image frames. Thus, the 1st to 4th image frames acquired by the first terminal are local image frames, and the 5th image frame is a global image frame; the 6th to 9th image frames acquired by the first terminal are local image frames, and the 10th image frame is a global image frame. Similarly, the sequence numbers of all global image frames acquired by the first terminal are multiples of N.
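
The acquisition schedule described above can be sketched as follows. This is a minimal illustration, assuming 1-based frame sequence numbers as in the N = 5 example; the names `frame_kind` and `schedule` are hypothetical and do not appear in the application.

```python
def frame_kind(index: int, n: int) -> str:
    """Return "global" when the 1-based sequence number is a multiple of n, else "local"."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if index < 1:
        raise ValueError("frame sequence numbers start at 1")
    return "global" if index % n == 0 else "local"


def schedule(count: int, n: int) -> list:
    """Frame kinds for sequence numbers 1..count (hypothetical helper)."""
    return [frame_kind(i, n) for i in range(1, count + 1)]
```

With N = 5, `schedule(10, 5)` marks frames 5 and 10 as global and all other frames as local, matching the example above.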

In step 1302, if the sequence number of the image frame currently to be acquired is not a multiple of N, the first terminal acquires a local image frame based on the region of interest information.

The local image frames are acquired by the first terminal by means of digital zoom or optical zoom based on the region of interest information. The local image frame may be, for example, the first image frame described above. The region of interest information may be region of interest information stored locally by the first terminal. During the remote video session, the first terminal can receive the region of interest information sent by the second terminal and store the received region of interest information locally.

The first terminal may acquire the local image frames at 4K resolution or 2K resolution.

In step 1303, if the sequence number of the image frame currently to be acquired is a multiple of N, the first terminal acquires a global image frame.

The global image frame is, for example, the second image frame, and the global image frame is obtained by the first terminal through a preset zoom ratio. The global image frame has a zoom magnification smaller than that of the local image frame, and the picture content in the global image frame includes the picture content of the region of interest and the picture content of the non-region of interest.

In another possible implementation, when the sequence number of the image frame currently to be acquired is a multiple of N, the first terminal may instead continue to acquire a local image frame and insert a global image frame after acquiring that local image frame.

The first terminal may acquire the global image frames at 4K resolution or 2K resolution.

In step 1304, the first terminal converts the acquired image frames to 1080P resolution.

After acquiring the local image frame or the global image frame, the first terminal converts the acquired frame to 1080P resolution, i.e., into an image frame composed of 1920×1080 pixels. For a local image frame, if the size of the local image frame is smaller than 1920×1080, the local image frame is upsampled to 1920×1080; if the size of the local image frame is greater than 1920×1080, the local image frame is downsampled to 1920×1080. A global image frame is downsampled to 1920×1080.
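
The conversion in step 1304 can be sketched as follows. This is only an illustration under stated assumptions: the application does not name a resampling filter, so nearest-neighbour sampling is swapped in for simplicity, and "smaller" or "greater" than 1920×1080 is interpreted here by comparing pixel counts. The names `resample_direction` and `resize_nearest` are hypothetical.

```python
TARGET_W, TARGET_H = 1920, 1080  # 1080P, as stated in step 1304


def resample_direction(width: int, height: int) -> str:
    """Classify the conversion by comparing pixel counts (an assumption)."""
    src, dst = width * height, TARGET_W * TARGET_H
    if src < dst:
        return "upsample"
    if src > dst:
        return "downsample"
    return "none"


def resize_nearest(pixels, out_w: int, out_h: int):
    """Nearest-neighbour resize of a row-major 2D list to out_w x out_h.

    Works for both upsampling and downsampling; a real implementation
    would normally use a higher-quality filter.
    """
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
```

For instance, a 3840×2160 global image frame is classified as "downsample", and a 960×540 local image frame as "upsample".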

Step 1305, the first terminal selects a reference frame of the image frame according to the frame type, and encodes the image frame based on the reference frame; or the first terminal encodes the image frames of different frame types through two paths of encoders.

After the image frames are acquired, the first terminal encodes the acquired image frames. In the encoding process, the first terminal may select a reference frame of the current image frame to be encoded according to a frame type of the current image frame to be encoded. Wherein the frame type of the local image frame may be defined as a first type and the frame type of the global image frame may be defined as a second type.

In one possible implementation, for any one of the image frames of the first type, the first terminal may determine that the reference frame of the image frame is the image frame of the first type preceding the image frame, i.e. the one temporally closest to the image frame. Similarly, for any one of the image frames of the second type, the first terminal may determine that the reference frame of the image frame is the image frame of the second type preceding the image frame, i.e., the image frame of the second type temporally closest to the image frame. That is, for any one image frame, the reference frame of that image frame is the one of the same type of image frame that is temporally closest to that image frame.
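
The same-type reference rule can be sketched as follows. This is a minimal illustration, assuming frames are listed in temporal order and labeled with the strings "first" and "second"; the function name `reference_indices` is hypothetical.

```python
from typing import List, Optional


def reference_indices(frame_types: List[str]) -> List[Optional[int]]:
    """For each frame, return the index of the temporally closest earlier
    frame of the same type, or None when no such frame exists."""
    last_seen = {}  # frame type -> index of the most recent frame of that type
    refs = []
    for index, frame_type in enumerate(frame_types):
        refs.append(last_seen.get(frame_type))
        last_seen[frame_type] = index
    return refs
```

For a sequence of four first-type frames followed by one second-type frame, repeated twice, the second second-type frame (index 9) references the first one (index 4), and the first-type frame at index 5 references the first-type frame at index 3, skipping the second-type frame between them.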

In another possible implementation, the first terminal may input image frames of different frame types into different encoders, so that the image frames are encoded by two encoders. Specifically, the first terminal inputs image frames of the first type into a first encoder, and the first encoder encodes these image frames. In addition, the first terminal inputs image frames of the second type into a second encoder, and the second encoder encodes these image frames.

When the first encoder encodes each image frame of the first type, the first encoder uses the previous image frame input to the first encoder as the reference frame. When the second encoder encodes an image frame of the second type, the reference frame is the previous image frame input to the second encoder.
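
The two-encoder arrangement can be sketched as follows. `ToyEncoder` is a hypothetical stand-in that only records which reference each frame would use; it performs no real video encoding. The frame labels follow fig. 12, and the interleaving order of the A and B images passed in by the caller is an assumption.

```python
class ToyEncoder:
    """Hypothetical encoder stand-in: records (frame, reference) pairs."""

    def __init__(self, name):
        self.name = name
        self.previous = None  # last frame input to this encoder
        self.log = []

    def encode(self, frame):
        # The reference is the previous frame seen by this encoder, if any.
        self.log.append((frame, self.previous))
        self.previous = frame


def route(frames):
    """Send first-type frames to one encoder and second-type frames to another."""
    first, second = ToyEncoder("first"), ToyEncoder("second")
    for label, frame_type in frames:
        target = first if frame_type == "first" else second
        target.encode(label)
    return first, second
```

Because each encoder only ever sees frames of one type, the "previous frame" rule inside each encoder yields the same references as selecting the closest same-type frame in a single stream.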

In step 1306, the first terminal carries the indication information of the current image frame in the code stream, and sends the code stream to the second terminal.

In the remote video process, the first terminal continuously codes the acquired image frames to obtain a code stream. The first terminal may carry the indication information of the current image frame in the SEI of the code stream and send the code stream to the second terminal. The indication information is used for indicating the frame type of the current image frame and the region of interest information corresponding to the current image frame.
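
One way the indication information might be serialized is sketched below. The application does not specify a byte layout, so the 9-byte layout, the numeric frame type values, and the representation of the region of interest as (x, y, width, height) are all assumptions made for illustration. The sketch covers only the payload bytes and does not reproduce the SEI syntax of any video coding standard.

```python
import struct

# Hypothetical layout: 1-byte frame type, then ROI x, y, width, height
# as unsigned 16-bit big-endian integers.
_LAYOUT = ">BHHHH"
FRAME_TYPE_LOCAL = 1   # "first type" (assumed numeric value)
FRAME_TYPE_GLOBAL = 2  # "second type" (assumed numeric value)


def pack_indication(frame_type: int, roi) -> bytes:
    """Serialize the frame type and the region of interest into bytes."""
    x, y, w, h = roi
    return struct.pack(_LAYOUT, frame_type, x, y, w, h)


def unpack_indication(payload: bytes):
    """Inverse of pack_indication: returns (frame_type, (x, y, w, h))."""
    frame_type, x, y, w, h = struct.unpack(_LAYOUT, payload)
    return frame_type, (x, y, w, h)
```

A receiver that applies `unpack_indication` to the payload recovers both the frame type used in step 1310 and the region of interest information compared in step 1313.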

In step 1307, the second terminal receives the code stream sent by the first terminal, and decodes the code stream to obtain an image frame and corresponding indication information.

And after receiving the code stream sent by the first terminal, the second terminal decodes the code stream to obtain an image frame and indication information corresponding to the image frame.

Step 1308, the second terminal obtains the interaction instruction, and obtains the region of interest indicated by the interaction instruction by analyzing the interaction instruction.

Optionally, at any time during the remote video session, the second terminal may acquire an interaction instruction triggered by the user, and obtain the region of interest indicated by the interaction instruction by parsing the interaction instruction. The interaction instruction may be, for example, a zoom-out operation instruction, a zoom-in operation instruction, or a pan operation instruction initiated by the user by touching the screen of the second terminal.

In step 1309, the second terminal updates the local region of interest information according to the region of interest indicated by the interaction instruction.

After the region of interest indicated by the interaction instruction is obtained through analysis, the second terminal updates local region of interest information according to the region of interest indicated by the interaction instruction.
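
How a pan or zoom instruction might update the local region of interest is sketched below. The application does not define the instruction format or the coordinate system, so the sketch assumes a region of interest expressed as (x, y, width, height) in pixel coordinates of the global image frame; `clamp_roi`, `pan`, and `zoom` are hypothetical helpers.

```python
def clamp_roi(roi, frame_w, frame_h):
    """Keep the region of interest inside a frame of frame_w x frame_h pixels."""
    x, y, w, h = roi
    w = max(1, min(w, frame_w))
    h = max(1, min(h, frame_h))
    x = max(0, min(x, frame_w - w))
    y = max(0, min(y, frame_h - h))
    return (x, y, w, h)


def pan(roi, dx, dy, frame_w, frame_h):
    """Shift the region by (dx, dy) pixels, then clamp it to the frame."""
    x, y, w, h = roi
    return clamp_roi((x + dx, y + dy, w, h), frame_w, frame_h)


def zoom(roi, factor, frame_w, frame_h):
    """Scale the region about its centre.

    factor > 1 shrinks the region (zoom in); factor < 1 enlarges it (zoom out).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    x, y, w, h = roi
    new_w, new_h = round(w / factor), round(h / factor)
    cx, cy = x + w // 2, y + h // 2
    return clamp_roi((cx - new_w // 2, cy - new_h // 2, new_w, new_h), frame_w, frame_h)
```

The clamping step is a design choice of this sketch: it keeps the updated region inside the global image frame so that a later crop from that frame is always valid.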

In step 1310, the second terminal determines whether the current image frame is a global image frame according to the indication information carried by the image frame.

Since the indication information is used for indicating the frame type of the image frame, the second terminal can determine whether the current image frame is a global image frame according to the indication information corresponding to the image frame in the code stream, that is, determine whether the frame type of the current image frame is the second type.

At step 1311, if the current image frame is not a global image frame, the current image frame is directed to a first buffer.

When the first buffer already stores another local image frame, the second terminal overwrites that frame with the new local image frame, so that the first buffer always holds only one image frame.

At step 1312, if the current image frame is a global image frame, the current image frame is directed to a second buffer.

When the second buffer already stores another global image frame, the second terminal overwrites that frame with the new global image frame, so that the second buffer always holds only one image frame.
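
The buffering in steps 1310 to 1312 can be sketched as follows. This is a minimal illustration in which each buffer is a single slot that is overwritten on every write; the class name `FrameBuffers`, the string frame types, and storing the region of interest next to the local frame are assumptions.

```python
class FrameBuffers:
    """Two single-slot buffers: one for local frames, one for global frames."""

    def __init__(self):
        self.first_buffer = None   # (local frame, its region of interest info)
        self.second_buffer = None  # latest global frame

    def store(self, frame, frame_type, roi=None):
        """Route a decoded frame by frame type, overwriting any older frame."""
        if frame_type == "second":
            self.second_buffer = frame
        else:
            self.first_buffer = (frame, roi)
```

Keeping only the newest frame of each type means the display step always works with the most recent local frame and the most recent global frame.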

In step 1313, the second terminal determines whether the region of interest information corresponding to the image frame in the first buffer area is consistent with the region of interest information local to the second terminal.

Based on the indication information corresponding to the image frames in the first buffer, the second terminal can determine whether the region of interest information corresponding to the image frames in the first buffer is consistent with the region of interest information local to the second terminal.

In step 1314, if the region of interest information corresponding to the image frame in the first buffer area is consistent with the region of interest information local to the second terminal, the second terminal sends the local image frame in the first buffer area to the display screen for display.

In step 1315, if the region of interest information corresponding to the image frame in the first buffer area is inconsistent with the region of interest information local to the second terminal, the second terminal intercepts a partial region in the global image frame in the second buffer area, and sends the intercepted image frame to the display screen for display.

When the region of interest changes, the second terminal can determine the position of the new region of interest in the global image frame according to its local region of interest information, and crop the global image frame at that position to obtain a new image frame. The content of the new image frame is the content corresponding to the position indicated by the region of interest information of the second terminal.
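
The selection in steps 1313 to 1315 can be sketched as follows. The sketch assumes frames are row-major 2D lists and that the region of interest is (x, y, width, height) in pixel coordinates of the global image frame; `crop` and `select_display` are hypothetical names, and any scaling of the cropped region to the display size is left out.

```python
def crop(frame, roi):
    """Cut the rectangle (x, y, w, h) out of a row-major 2D list."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]


def select_display(local_frame, local_roi, global_frame, current_roi):
    """Choose the frame to display.

    If the region of interest carried with the local frame matches the
    terminal's current one, show the local frame; otherwise crop the
    current region out of the global frame.
    """
    if local_roi == current_roi:
        return local_frame
    return crop(global_frame, current_roi)
```

The fallback branch is what lets the receiving end respond to a changed region of interest before the sending end has delivered a local frame for the new region.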

In order to better implement the above-described scheme of the embodiment of the present application on the basis of the embodiments corresponding to fig. 1 to 13, a related apparatus for implementing the above-described scheme is further provided below.

Referring specifically to fig. 14, fig. 14 is a schematic structural diagram of a terminal 1400 according to an embodiment of the present application, where the terminal 1400 includes: an acquiring unit 1401, a processing unit 1402, and a transceiver unit 1403. The acquiring unit 1401 is configured to acquire a plurality of image frames, where the plurality of image frames includes a first image frame and a second image frame, the zoom magnification of the first image frame is larger than the zoom magnification of the second image frame, the first image frame is determined according to region-of-interest information, and the region-of-interest information is used to indicate a position of a region of interest. The transceiver unit 1403 is configured to send the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, where the image frame to be displayed is used for generating a video, and the indication information includes the region-of-interest information of the first image frame.

Optionally, in a possible implementation manner, the acquiring unit 1401 is further configured to acquire a third image frame and the second image frame, where a zoom magnification of the third image frame is the same as a zoom magnification of the second image frame; the processing unit 1402 is further configured to crop the third image frame according to the region of interest information to obtain the first image frame.

Optionally, in a possible implementation manner, the acquiring unit 1401 is further configured to acquire one or more third image frames, where a zoom magnification of the one or more third image frames is the same as a zoom magnification of the second image frame; the processing unit 1402 is further configured to crop the one or more third image frames according to the region of interest information to obtain one or more first image frames; the processing unit 1402 is further configured to determine a third image frame of the one or more third image frames as the second image frame.

Alternatively, in one possible implementation, the acquiring unit 1401 acquires the first image frame and the second image frame by means of optical zooming.

Optionally, in a possible implementation manner, the plurality of image frames further includes a fourth image frame, and a zoom magnification of the fourth image frame is the same as a zoom magnification of the second image frame; the acquiring unit 1401 is further configured to sequentially acquire the fourth image frame, one or more of the first image frames, and the second image frame according to a preset rule; the preset rule is that after the first terminal acquires a preset number of image frames based on the region-of-interest information, a target zoom magnification is used to acquire one image frame, the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

Optionally, in a possible implementation manner, the acquiring unit 1401 is further configured to acquire a reference frame of the plurality of image frames; the processing unit 1402 is further configured to encode the plurality of image frames according to reference frames of the plurality of image frames to obtain an encoding result; the transceiver 1403 is further configured to send a coding result to the second terminal; the reference frame of the first image frame is obtained according to the region of interest information, and the zoom magnification of the reference frame of the second image frame is the same as that of the second image frame.

Optionally, in a possible implementation manner, the processing unit 1402 is further configured to encode, by a first encoder, the first image frame according to a reference frame of the first image frame, to obtain a first encoding result; the processing unit 1402 is further configured to encode the second image frame by a second encoder according to a reference frame of the second image frame, to obtain a second encoding result; wherein the encoding result includes the first encoding result and the second encoding result.

In another possible embodiment, the transceiver 1403 is configured to receive a plurality of image frames and indication information sent by the second terminal, where the plurality of image frames includes a first image frame and a second image frame, the zoom magnification of the first image frame is greater than that of the second image frame, the first image frame is determined according to region of interest information, the region of interest information is used to indicate a location of a region of interest, and the indication information includes region of interest information of the first image frame; the processing unit 1402 is configured to: if the region of interest information of the first image frame is different from the region of interest information in the first terminal, a third image frame is obtained by cutting in the second image frame according to the region of interest information in the first terminal, and the third image frame is displayed; and if the region of interest information of the first image frame is the same as the region of interest information in the first terminal, displaying the first image frame.

Optionally, in a possible implementation manner, the indication information further includes a frame type of the plurality of image frames, and the frame type of the first image frame is different from the frame type of the second image frame; the processing unit 1402 is further configured to sequentially send the first image frame to a first buffer and the second image frame to a second buffer according to frame types of the plurality of image frames; the processing unit 1402 is further configured to: if the region of interest information of the first image frame in the first buffer area is different from the region of interest information in the first terminal, a third image frame is obtained by cutting out the second image frame in the second buffer area according to the region of interest information in the first terminal, and the third image frame is displayed; and if the region of interest information of the first image frame in the first buffer area is the same as the region of interest information in the first terminal, displaying the first image frame in the first buffer area.

Optionally, in a possible implementation manner, the acquiring unit 1401 is configured to acquire an interaction instruction, where the interaction instruction is used to instruct to change the area of interest; the processing unit 1402 is further configured to update the region of interest information according to the interaction instruction, and obtain updated region of interest information; the transceiver unit 1403 is further configured to send updated region of interest information to the second terminal.

The remote video method provided by the embodiment of the application can be specifically executed by a chip in the terminal, where the chip includes a processing unit 1402 and a communication unit. The processing unit 1402 may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins, or circuitry. The processing unit 1402 may execute the computer-executable instructions stored in the storage unit to cause the chip in the terminal to perform the remote video method described in the embodiments shown in fig. 1 to 13. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the terminal, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).

Referring to fig. 15, the present application also provides a computer program product. In some embodiments, the method disclosed in fig. 5 above may be embodied as computer program instructions encoded in a machine-readable format on a computer-readable storage medium, or encoded on other non-transitory media or articles of manufacture.

Fig. 15 schematically illustrates a conceptual partial view of an example computer program product comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein.

In one embodiment, computer program product 1500 is provided using signal bearing medium 1501. The signal bearing medium 1501 may include one or more program instructions 1502 that when executed by one or more processors may provide the functionality or portions of the functionality described above with respect to fig. 2. Thus, for example, referring to the embodiment shown in fig. 3, one or more features of steps 301-306 may be carried by one or more instructions associated with signal bearing medium 1501. Further, the program instructions 1502 in fig. 15 also describe example instructions.

In some examples, signal bearing medium 1501 may comprise a computer readable medium 1503 such as, but not limited to, a hard disk drive, compact disc (CD), digital video disc (DVD), digital tape, memory, ROM or RAM, and the like.

In some implementations, the signal bearing medium 1501 may comprise a computer recordable medium 1504 such as, but not limited to, memory, read/write (R/W) CD, R/W DVD, and the like. In some implementations, the signal bearing medium 1501 may include a communication medium 1505 such as, but not limited to, a digital and/or analog communication medium (e.g., fiber optic cable, a waveguide, a wired communications link, a wireless communications link, etc.). Thus, for example, the signal bearing medium 1501 may be conveyed by a communication medium 1505 in wireless form (e.g., a wireless communication medium conforming to the IEEE 802.15 standard or other transmission protocol).

The one or more program instructions 1502 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device may be configured to provide various operations, functions, or actions in response to program instructions 1502 conveyed to the computing device through one or more of computer readable medium 1503, computer recordable medium 1504, and/or communication medium 1505.

It should be understood that the arrangement described herein is for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and some elements may be omitted altogether depending on the desired results. In addition, many of the elements described are functional entities that may be implemented as discrete or distributed components, or in any suitable combination and location in conjunction with other components.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical function division, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.

Claims

1. A remote video method, comprising:

A first terminal acquires a plurality of image frames, wherein the plurality of image frames comprise a first image frame and a second image frame, the zoom magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to region-of-interest information, and the region-of-interest information is used for indicating the position of the region-of-interest;

The first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, the image frame to be displayed is used for generating a video, and the indication information comprises the region-of-interest information of the first image frame;

The first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, and the method comprises the following steps: if the region of interest information of the first image frame is different from the region of interest information in the second terminal, the second terminal cuts out a third image frame in the second image frame according to the region of interest information in the second terminal, and displays the third image frame; and if the region of interest information of the first image frame is the same as the region of interest information in the second terminal, the second terminal displays the first image frame.

2. The method of claim 1, wherein the second image frame includes a region of interest indicated in the region of interest information of the first image frame.

3. The method according to claim 1 or 2, wherein the first terminal acquiring the plurality of image frames comprises:

the first terminal acquires a third image frame and the second image frame, wherein the zoom magnification of the third image frame is the same as that of the second image frame;

And the first terminal cuts the third image frame according to the region of interest information to obtain the first image frame.

4. The method according to claim 1 or 2, wherein the first terminal acquiring the plurality of image frames comprises:

The first terminal acquires one or more third image frames, and the zoom magnification of the one or more third image frames is the same as that of the second image frame;

the first terminal cuts the one or more third image frames according to the region of interest information to obtain one or more first image frames;

The first terminal determines one of the one or more third image frames as the second image frame.

5. The method according to claim 1 or 2, wherein the first terminal acquiring the plurality of image frames comprises:

The first terminal acquires the first image frame and the second image frame in an optical zooming mode.

6. The method of claim 1 or 2, wherein the plurality of image frames further comprises a fourth image frame having a same zoom magnification as the second image frame;

The first terminal acquiring a plurality of image frames, including:

The first terminal sequentially collects the fourth image frame, one or more first image frames and the second image frames according to a preset rule;

The preset rule is that after the first terminal collects a preset number of image frames based on the region-of-interest information, a target zoom magnification is adopted to collect one image frame, the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

7. The method according to claim 1 or 2, wherein the first terminal transmitting the plurality of image frames to a second terminal comprises:

the first terminal acquires reference frames of the plurality of image frames;

The first terminal encodes the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result;

The first terminal sends a coding result to the second terminal;

The reference frame of the first image frame is obtained according to the region of interest information, and the zoom magnification of the reference frame of the second image frame is the same as that of the second image frame.

8. The method of claim 7, wherein the first terminal encodes the plurality of image frames based on reference frames of the plurality of image frames to obtain an encoding result, comprising:

The first terminal encodes the first image frame through a first encoder according to the reference frame of the first image frame to obtain a first encoding result;

The first terminal encodes the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result;

Wherein the encoding results include the first encoding result and the second encoding result.

9. The method according to claim 1 or 2, wherein the indication information further comprises a frame type of the plurality of image frames, the frame type of the first image frame being different from the frame type of the second image frame.

10. A remote video method, comprising:

The method comprises the steps that a first terminal receives a plurality of image frames and indication information sent by a second terminal, wherein the plurality of image frames comprise a first image frame and a second image frame, the zoom magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to region-of-interest information, the region-of-interest information is used for indicating the position of a region of interest, and the indication information comprises the region-of-interest information of the first image frame;

If the region of interest information of the first image frame is different from the region of interest information in the first terminal, a third image frame is obtained by cutting in the second image frame according to the region of interest information in the first terminal, and the third image frame is displayed;

and if the region of interest information of the first image frame is the same as the region of interest information in the first terminal, displaying the first image frame.

11. The method of claim 10, wherein the second image frame includes a region of interest indicated in the region of interest information of the first image frame.

12. The method according to claim 10 or 11, wherein the first image frame is obtained by clipping one or more third image frames of the second terminal according to the region of interest information after the second terminal acquires the one or more third image frames, and a zoom magnification of the one or more third image frames is the same as a zoom magnification of the second image frame.

13. The method of claim 12, wherein the second image frame is one of the one or more third image frames.

14. The method according to claim 10 or 11, wherein the plurality of image frames are acquired by means of an optical zoom by the second terminal.

15. The method of claim 10 or 11, wherein the plurality of image frames further comprises a fourth image frame having a same zoom magnification as the second image frame;

The fourth image frame, one or more first image frames and the second image frame are sequentially acquired by the second terminal according to a preset rule;

The preset rule is that after the second terminal collects a preset number of image frames based on the region-of-interest information, a target zoom magnification is adopted to collect one image frame, the preset number is the same as the number of the one or more first image frames, and the target zoom magnification is the zoom magnification of the second image frame.

16. The method according to claim 10 or 11, wherein the indication information further comprises a frame type of the plurality of image frames, the frame type of the first image frame being different from the frame type of the second image frame;

the method further comprises the steps of:

the first terminal sequentially sends the first image frames to a first buffer area and sends the second image frames to a second buffer area according to the frame types of the plurality of image frames;

if the region of interest information of the first image frame is different from the region of interest information in the first terminal, a third image frame is obtained by cropping the second image frame according to the region of interest information in the first terminal, and the third image frame is displayed, including:

if the region of interest information of the first image frame in the first buffer area is different from the region of interest information in the first terminal, a third image frame is obtained by cropping the second image frame in the second buffer area according to the region of interest information in the first terminal, and the third image frame is displayed;

and if the region of interest information of the first image frame is the same as the region of interest information in the first terminal, displaying the first image frame, including:

and if the region of interest information of the first image frame in the first buffer area is the same as the region of interest information in the first terminal, displaying the first image frame in the first buffer area.
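
As a non-limiting illustration, the routing of frames into two buffer areas by frame type, and the display decision made from those buffers, might be sketched as follows. The `Receiver` class, the frame type labels `"roi"` and `"wide"`, and the choice to retain only the most recent lower-zoom frame in the second buffer area are assumptions made for this sketch.

```python
from collections import deque


def crop(pixels, roi):
    """Cut the (x, y, width, height) rectangle out of a frame."""
    x, y, w, h = roi
    return [row[x:x + w] for row in pixels[y:y + h]]


class Receiver:
    def __init__(self, local_roi):
        self.local_roi = local_roi
        self.first_buffer = deque()            # first image frames with their ROI info
        self.second_buffer = deque(maxlen=1)   # assumption: keep only the latest second image frame

    def on_frame(self, frame_type, pixels, roi=None):
        """Route a received frame to a buffer area according to its frame type."""
        if frame_type == "roi":
            self.first_buffer.append((pixels, roi))
        elif frame_type == "wide":
            self.second_buffer.append(pixels)
        else:
            raise ValueError(f"unknown frame type: {frame_type!r}")

    def next_display(self):
        """Return the picture to display, or None if nothing can be shown yet."""
        if not self.first_buffer:
            return None
        pixels, roi = self.first_buffer.popleft()
        if roi == self.local_roi:
            return pixels                      # ROI unchanged: show the first image frame
        if not self.second_buffer:
            return None
        # ROI changed locally: crop the buffered lower-zoom frame instead.
        return crop(self.second_buffer[-1], self.local_roi)
```

Keeping the two frame types in separate buffer areas lets the receiver fall back to the lower-zoom frame immediately after a local change of the region of interest, without waiting for newly cropped frames to arrive.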

17. The method according to claim 10 or 11, characterized in that the method further comprises:

the first terminal acquires an interaction instruction, wherein the interaction instruction is used for indicating a change of the region of interest;

the first terminal updates the region of interest information according to the interaction instruction to obtain updated region of interest information;

and the first terminal sends the updated region of interest information to the second terminal.
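
As a non-limiting illustration, updating the region of interest information from an interaction instruction and sending the result to the second terminal might be sketched as follows. The pan-style instruction with `dx` and `dy` offsets, the message layout, and the `send` callback are assumptions made for this sketch; the claims do not prescribe any particular instruction or message format.

```python
def update_roi(roi, instruction, frame_size):
    """Shift the ROI by the offsets in the instruction, clamped to the frame.

    roi is (x, y, width, height); frame_size is (frame_width, frame_height)
    of the lower-zoom frame; instruction is a dict with optional "dx"/"dy".
    """
    x, y, w, h = roi
    frame_w, frame_h = frame_size
    x = min(max(x + instruction.get("dx", 0), 0), frame_w - w)
    y = min(max(y + instruction.get("dy", 0), 0), frame_h - h)
    return (x, y, w, h)


def handle_interaction(roi, instruction, frame_size, send):
    """Update the local ROI and send the updated ROI to the other terminal."""
    updated = update_roi(roi, instruction, frame_size)
    send({"type": "roi_update", "roi": updated})  # hypothetical message layout
    return updated
```

Here `send` stands in for whatever transport the terminals use; in a test it can simply be the `append` method of a list.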

18. A terminal comprising a memory and a processor, wherein the memory stores code and the processor is configured to execute the code, and when the code is executed, the terminal performs the method of any one of claims 1 to 17.

19. A computer readable storage medium comprising computer readable instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 17.

20. A computer program product comprising computer readable instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 17.