CN115134633A

CN115134633A - Remote video method and related device

Info

Publication number: CN115134633A
Application number: CN202110327092.4A
Authority: CN
Inventors: 刘尚; 胡翔宇; 徐卫国; 许旺灿; 杨小海
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2022-09-30
Anticipated expiration: 2041-03-26
Also published as: WO2022199594A1; CN115134633B

Abstract

The embodiment of the application discloses a remote video method which is applied to a terminal. The method comprises the following steps: the method comprises the steps that a first terminal obtains a first image frame and a second image frame, the zooming magnification of the first image frame is larger than that of the second image frame, and the first image frame is determined according to region-of-interest information; the first terminal transmits a plurality of image frames and indication information to the second terminal so that the second terminal selects an image frame to be displayed according to the indication information, wherein the indication information comprises region-of-interest information of the first image frame. By adding the image frame with lower zoom magnification in the process of sending the image frame related to the interesting area, the receiving end can timely intercept and display the image corresponding to the changed interesting area from the image frame with lower zoom magnification, the response time of adjusting the interesting area of the video is improved, and the phenomenon of blocking and pause of the video image is avoided.

Description

Remote video method and related device

Technical Field

The present application relates to the field of communications technologies, and in particular, to a remote video method and a related apparatus.

Background

With the development of communication technology, remote videos are widely applied in scenes such as video calls, telemedicine, remote education and the like. Limited by power consumption and latency, the picture resolution of remote video does not typically exceed 1080P. However, in many cases, if the capture resolution of the video is 1080P, the sharpness of the captured video may not meet the actual requirement. For example, in a scene of remote job tutoring, even if a video screen is enlarged, a font of a small size in the video screen may not be clearly recognized.

Based on this, in the related art, the resolution of the video acquired by the transmitting end is improved, and the acquired video is reduced to 1080P and then is transmitted to the receiving end. Thus, when the receiving end needs to enlarge a certain area (the area is usually called the region of interest), the receiving end feeds back the area needing to be enlarged to the transmitting end. After the sending end acquires the area needing to be amplified, the sending end intercepts the area needing to be amplified from the acquired video picture, converts the intercepted video picture into 1080P and sends the 1080P converted video picture to the receiving end. The video pictures obtained by interception are acquired with high resolution, so that the definition of the video pictures can meet the requirement.

However, the receiving end remotely controls the transmitting end, which has a high response time, and is easy to cause an obvious pause phenomenon in a video picture. For example, after the receiving side performs the screen enlargement operation, it takes a long time (generally, it takes more than 300 milliseconds to wait), and the receiving side can receive the data transmitted from the transmitting side and display the enlarged video screen.

Disclosure of Invention

The embodiment of the application provides a remote video method, wherein in the process of sending an image frame related to an interested area to a receiving end, a sending end adds an image frame with a lower zooming magnification, namely an image frame with more contents in a picture. Therefore, when the region of interest in the receiving end changes, the receiving end can intercept the image corresponding to the changed region of interest from the image frame with lower zoom magnification in time and display the image, namely, the receiving end can display the adjusted video image without waiting for a long time, so that the response time of adjusting the region of interest of the video is improved, and the phenomenon of blocking of the video image is avoided.

The first aspect of the present application provides a remote video method, which is applied to a first terminal serving as a video acquisition end and a video sending end in a remote video process. The method comprises the following steps: the first terminal acquires a plurality of image frames including a first image frame and a second image frame. The zoom ratio of the first image frame is greater than that of the second image frame, wherein the zoom ratio refers to the magnification ratio of an image output by an image sensor of the camera. When the camera does not move, the larger the zoom magnification of the camera is, the larger the shot object in the image output by the camera is, and the smaller the shot range in the image is; the smaller the zoom magnification of the camera is, the smaller the subject in the image output by the camera is, and the larger the range captured in the image is.

The first image frame is determined from region of interest information indicating a location of a region of interest. The first terminal may store region-of-interest information, where the region-of-interest information may be fed back to the first terminal by the second terminal. The region-of-interest information is used to indicate where the region of interest is located. The region of interest refers to a region obtained by a user zooming in, zooming out, or translating a video picture displayed on a screen of the terminal in an interactive manner such as touching the screen of the terminal with a finger. In short, the region of interest is a region to be displayed on the screen of the terminal in the remote video process. Based on the region-of-interest information in the first terminal, the first terminal may determine a location where the region-of-interest is located, thereby acquiring a first image frame associated with the region-of-interest. The picture content in the first image frame is the content of the region of interest.

The first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects image frames to be displayed according to the indication information, the image frames to be displayed are used for generating videos, and the indication information comprises region-of-interest information of the first image frames. Wherein the indication information may include coordinate information for indicating a position of the region of interest. For example, in the case where the region of interest is a rectangular region, the coordinate information of four vertices of the rectangular region may be included in the indication information. For another example, in the case that the region of interest is a rectangular region, the indication information may include coordinate information of a vertex of the rectangular region (e.g., a vertex at the upper left corner of the rectangular region) and a width and a height of the rectangular region. In this way, based on the coordinate information of one vertex of the rectangular region and the width and height of the rectangular region, the coordinate information of four vertices of the rectangular region can also be calculated.

In the scheme, in the process of sending the image frame related to the region of interest to the second terminal serving as the receiving terminal, the first terminal serving as the sending terminal adds the image frame with lower zoom magnification, namely, the image frame with more contents in the image. Therefore, when the region of interest in the second terminal changes, the second terminal can intercept the picture corresponding to the changed region of interest from the image frame with lower zoom magnification in time and display the picture, namely, the second terminal can display the adjusted video picture without waiting for a long time, so that the response time of adjusting the region of interest of the video is prolonged, and the phenomenon of blocking of the video picture is avoided.

Optionally, in a possible implementation manner, the first image frame only includes a region of interest indicated in the region of interest, and the second image frame includes other regions besides the region of interest indicated in the region of interest information. That is, only the picture content of the position of the region of interest is included in the picture in the first image frame, and the picture in the second image frame includes other picture contents in addition to the picture content of the position of the region of interest.

For example, the second image frame may be an image frame acquired by the first terminal at a preset zoom magnification, and the first image frame may be an image frame acquired by the first terminal at an adjusted zoom magnification according to the region of interest information. Under the condition that the first terminal does not move, the second image frame can be regarded as a global image acquired by the first terminal, namely the second image frame comprises all areas in the visual field range of the camera of the first terminal; the first image frame may be regarded as a local image acquired by the first terminal, that is, the first image frame only includes a partial region within the visual field of the camera of the first terminal.

Optionally, in a possible implementation manner, the acquiring, by the first terminal, the plurality of image frames includes: the first terminal collects a third image frame and the second image frame in sequence, and the zooming magnification of the third image frame is the same as that of the second image frame. That is, the first terminal continuously acquires a plurality of image frames including the third image frame and the second image frame at a specific zoom magnification. And then, the first terminal cuts the third image frame according to the region-of-interest information to obtain the first image frame. Specifically, after the first terminal acquires each third image frame, the first terminal cuts the third image frame according to the region-of-interest information in the first terminal, that is, cuts the region-of-interest in the third image frame based on the position indicated by the region-of-interest information, so as to obtain the first image frame. The mode of the first terminal acquiring the first image frame is digital zooming.

Optionally, in a possible implementation manner, the acquiring, by the first terminal, the plurality of image frames includes: the first terminal acquires one or more third image frames, and the zooming multiplying power of the one or more third image frames is the same as that of the second image frame; the first terminal cuts the one or more third image frames according to the region-of-interest information to obtain one or more first image frames; the first terminal determines one of the one or more third image frames as the second image frame. That is, the first terminal may continuously acquire first image frames (i.e., local image frames) by means of digital zooming based on the region-of-interest information, and insert second image frames (i.e., global image frames) into the acquired plurality of first image frames.

That is, the first terminal inserts a global image frame with a lower zoom magnification each time a specific number of image frames are acquired based on the region-of-interest information. In this way, the number of image frames sent by the first terminal to the second terminal is greater than the number of image frames actually captured by the first terminal. By inserting the global image frame into the plurality of continuous local image frames, the frequency of displaying the local image frames by the second terminal can be ensured, and the fluency of video pictures is improved.

Optionally, in a possible implementation manner, the acquiring, by the first terminal, the plurality of image frames includes: the first terminal collects the first image frame and the second image frame in an optical zooming mode. The optical zooming refers to that the first terminal zooms by means of an optical lens structure, namely the first terminal zooms in and zooms out a scene to be shot by moving a lens. Although both digital zooming and optical zooming are helpful for magnifying a distant object in telescopic shooting, only optical zooming can support imaging of an image main body, and more pixels are added, so that the main body is enlarged and relatively clearer. That is, when the object in the image is enlarged by means of optical zooming, the object in the image can be made relatively clearer.

Since the optical zoom is to zoom in or out the image frames captured by the lens by changing the focal length with the center of the field of view of the lens as a center point, the process of the first terminal acquiring the first image frame by means of the optical zoom is actually to calculate the longest focal length of the field of view covering the region of interest and then acquire the first image frame based on the longest focal length.

Optionally, in a possible implementation manner, a fourth image frame may be further included in the plurality of image frames acquired by the first terminal, and a zoom magnification of the fourth image frame is the same as a zoom magnification of the second image frame. The process of the first terminal acquiring the plurality of image frames specifically includes: the first terminal sequentially collects the fourth image frame, one or more first image frames and the second image frame according to a preset rule; the preset rule is that after the first terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

In this way, after the first terminal collects a specific number of image frames based on the region-of-interest information each time, a global image frame is collected by using a lower zoom ratio, so that the frequency of receiving the global image frame by the second terminal is ensured, and the phenomenon of overlarge image change is avoided when the region-of-interest is subsequently cut by the second terminal based on the global image frame.

Optionally, in a possible implementation manner, in the process that the first terminal encodes the plurality of image frames, the first terminal may select a corresponding reference frame according to a frame type of the image frame.

Generally, in the encoding process of a video, a sending end selects a previous image frame of a current image frame as a reference frame, then calculates difference information between the current image frame and the reference frame, and sends the difference information between the current image frame and the reference frame to a receiving end instead of directly sending the current image frame to the receiving end, thereby playing a role in reducing transmitted data. However, in the present solution, the first image frame and the second image frame are acquired based on different zoom magnifications. Therefore, the difference between the first image frame and the second image frame is relatively large compared to two adjacent first image frames. When the second image frame is encoded, if the previous image frame (i.e., the first image frame) of the second image frame is used as a reference frame, the data amount of the obtained difference information is large, and the data transmission amount is increased. Based on this, the first terminal may select a corresponding reference frame according to the frame type of the image frame, so as to ensure that the frame type between the image frame currently to be encoded and the reference frame of the image frame is the same.

Specifically, the first terminal transmitting the plurality of image frames to a second terminal includes: the first terminal acquires reference frames of the plurality of image frames; the first terminal encodes the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result; the first terminal sends a coding result to the second terminal; the reference frame of the first image frame is obtained according to the region-of-interest information, and the zoom magnification of the reference frame of the second image frame is the same as the zoom magnification of the second image frame.

Briefly, in the process of acquiring the image frames by the first terminal, the frame type of the local image frame (i.e., the first image frame) acquired by the first terminal based on the region-of-interest information is a first type, and the frame type of the global image frame (i.e., the second image frame) acquired based on a specific zoom magnification is a second type. For any image frame with the first type of frame type, the first terminal may determine that a reference frame of the image frame is a previous image frame of the first type of the image frame, that is, an image frame of the first type temporally closest to the image frame. Similarly, for any image frame of which the frame type is the second type, the first terminal may determine that the reference frame of the image frame is the previous image frame of the second type of the image frame, that is, the image frame of the second type temporally closest to the image frame. That is, for any one image frame, the reference frame of the image frame is an image frame of the same type temporally closest to the image frame.

Optionally, in a possible implementation manner, the encoding, by the first terminal, the multiple image frames according to the reference frames of the multiple image frames to obtain an encoding result includes: the first terminal encodes the image frame through a first encoder according to the reference frame of the image frame to obtain a first encoding result; the first terminal encodes the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result; wherein the encoding result comprises the first encoding result and the second encoding result.

When the first encoder encodes each first image frame, the first encoder encodes the image frame which is one image frame before each first image frame and serves as a reference frame. When the second encoder encodes the second image frame, the reference frame of the second image frame is an image frame before the second image frame in the second encoder, and the image frame before the second image frame is the same as the type of the second image frame.

In the scheme, two paths of encoders are adopted to encode the image frames of different frame types, so that the smooth encoding of the image frames can be ensured.

Optionally, in a possible implementation manner, in order to facilitate the second terminal to determine a first image frame and a second image frame in the plurality of image frames, the indication information further includes a frame type of the plurality of image frames, and the frame type of the first image frame is different from the frame type of the second image frame. For example, the indication information indicates that the frame type of the first image frame is a first type and indicates that the frame type of the second image frame is a second type.

Illustratively, when a certain bit for indicating the frame type in the indication information is set to 1, the frame type of the current image frame is indicated to be a second type, i.e. the second image frame mentioned above; when a certain bit for indicating the frame type in the indication information is not set (i.e. the bit has a value of 0), it indicates that the frame type of the current image frame is the first type, i.e. the first image frame mentioned above.

A second aspect of the present application provides a remote video method, which is applied to a first terminal as a video receiving end. The method comprises the following steps: the method comprises the steps that a first terminal receives a plurality of image frames and indicating information, wherein the image frames are sent by a second terminal, the image frames comprise a first image frame and a second image frame, the zooming magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to region-of-interest information, the region-of-interest information is used for indicating the position of a region of interest, and the indicating information comprises the region-of-interest information of the first image frame; if the region-of-interest information of the first image frame is different from the region-of-interest information in the first terminal, cutting the second image frame according to the region-of-interest information in the first terminal to obtain a third image frame, and displaying the third image frame; and if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame.

Specifically, for the second terminal as a receiving end, the second terminal can receive an interactive instruction from a user and update the region-of-interest information according to the interactive instruction. However, the second terminal needs to feed back the updated region-of-interest information to the first terminal, and the first terminal obtains a new image frame based on the updated region-of-interest information and then sends the new image frame to the second terminal. Therefore, there is a certain hysteresis in the image frame matched with the updated region of interest information, that is, within a period of time after the second terminal finishes updating the local region of interest information, the region of interest information corresponding to the image frame received by the second terminal is not the same as the updated region of interest information.

Therefore, before the second terminal displays the first image frame, the second terminal determines whether the region-of-interest information of the first image frame is the same as the region-of-interest information stored in the second terminal. And if the region-of-interest information of the first image frame is different from the region-of-interest information stored in the second terminal, the second terminal cuts the second image frame according to the region-of-interest information in the second terminal to obtain a third image frame and displays the third image frame. In short, because the zoom magnification of the second image frame is greater than that of the first image frame, and the second image frame is actually a global image frame, when the region of interest changes, the position of a new region of interest can be determined from the second image frame, and the position of the new region of interest is cut off to obtain a third image frame. And the content in the third image frame is the content corresponding to the position indicated by the region-of-interest information of the second terminal.

Optionally, in a possible implementation, the region of interest indicated in the region of interest information is included in the second image frame.

Optionally, in a possible implementation manner, the first image frame is obtained by cropping, by the second terminal, one or more third image frames according to the information of the region of interest in the second terminal after the second terminal acquires the one or more third image frames, and a zoom magnification of the one or more third image frames is the same as a zoom magnification of the second image frame.

Optionally, in a possible implementation, the second image frame is one image frame of the one or more third image frames.

Optionally, in a possible implementation manner, the plurality of image frames are acquired by the second terminal through an optical zoom.

Optionally, in a possible implementation manner, the plurality of image frames further include a fourth image frame, and a zoom magnification of the fourth image frame is the same as that of the second image frame; the fourth image frame, one or more first image frames and the second image frame are sequentially acquired by the second terminal according to a preset rule; the preset rule is that after the second terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

Optionally, in a possible implementation manner, the indication information further includes frame types of the plurality of image frames, and a frame type of the first image frame is different from a frame type of the second image frame; the method further comprises the following steps: and the first terminal sends the first image frame to a first buffer area and sends the second image frame to a second buffer area in sequence according to the frame types of the plurality of image frames. The first buffer area is used for storing a first image frame with a first frame type, and the second buffer area is used for storing a second image frame with a second frame type. When the second terminal receives a new image frame, the second terminal judges the frame type of the image frame, and if the frame type of the image frame is the first type, the image is sent to a first buffer area; if the frame type of the image frame is the second type, the image frame is sent to a second buffer. The old image frame may be overwritten by the new image frame on the first buffer and the second buffer.

If the region-of-interest information of the first image frame is different from the region-of-interest information of the first terminal, cutting the second image frame according to the region-of-interest information of the first terminal to obtain a third image frame, and displaying the third image frame, including: if the region-of-interest information of the first image frame in the first buffer area is different from the region-of-interest information of the first terminal, cutting a second image frame in the second buffer area according to the region-of-interest information of the first terminal to obtain a third image frame, and displaying the third image frame; if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame, including: and if the region-of-interest information of the first image frame in the first buffer area is the same as the region-of-interest information in the first terminal, displaying the first image frame in the first buffer area.

Optionally, in a possible implementation manner, the method further includes: when a user initiates an interactive instruction on a second terminal, the second terminal acquires the interactive instruction, and the interactive instruction is used for indicating to change the region of interest. The interactive instruction may be, for example, a zoom-out operation instruction, a zoom-in operation instruction, or a pan operation instruction initiated by a user by touching a screen of the second terminal. The zooming-out operation instruction is used for indicating that the picture displayed on the screen of the second terminal is zoomed out by taking the target area as a starting point. The enlargement operation instruction is used for instructing enlargement of the target area displayed on the screen of the second terminal. The translation operation instruction is used for instructing to translate the picture displayed on the screen of the second terminal to a specific direction. After the second terminal executes the interactive instruction, the obtained region to be displayed is a new region of interest, so that the second terminal can update the region of interest information according to the interactive instruction to obtain the updated region of interest information. In this way, after the second terminal updates the local region of interest information, the second terminal sends the updated region of interest information to the first terminal.

A third aspect of the present application provides a terminal, comprising: the device comprises an acquisition unit, a processing unit and a transceiving unit; the acquiring unit is used for acquiring a plurality of image frames, the plurality of image frames comprise a first image frame and a second image frame, the zoom magnification of the first image frame is greater than that of the second image frame, the first image frame is determined according to region-of-interest information, and the region-of-interest information is used for indicating the position of a region of interest; the first terminal sends the image frames and indication information to a second terminal, so that the second terminal selects image frames to be displayed according to the indication information, the image frames to be displayed are used for generating videos, and the indication information comprises region-of-interest information of the first image frame.

Optionally, in a possible implementation manner, the acquiring unit is further configured to acquire a third image frame and the second image frame, where a zoom magnification of the third image frame is the same as a zoom magnification of the second image frame; the processing unit is further configured to crop the third image frame according to the region of interest information to obtain the first image frame.

Optionally, in a possible implementation manner, the acquiring unit is further configured to acquire one or more third image frames, where a zoom magnification of the one or more third image frames is the same as a zoom magnification of the second image frame; the processing unit is further configured to crop the one or more third image frames according to the region of interest information to obtain one or more first image frames; the processing unit is further configured to determine one of the one or more third image frames as the second image frame.

Optionally, in a possible implementation manner, the acquiring unit acquires the first image frame and the second image frame by means of optical zooming.

Optionally, in a possible implementation manner, the plurality of image frames further include a fourth image frame, and a zoom magnification of the fourth image frame is the same as that of the second image frame; the acquisition unit is further used for sequentially acquiring the fourth image frame, one or more first image frames and the second image frame according to a preset rule; the preset rule is that after the first terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

Optionally, in a possible implementation manner, the obtaining unit is further configured to obtain a reference frame of the plurality of image frames; the processing unit is further used for coding the plurality of image frames according to the reference frames of the plurality of image frames to obtain a coding result; the receiving and sending unit is further configured to send the encoding result to the second terminal; the reference frame of the first image frame is obtained according to the region-of-interest information, and the zoom magnification of the reference frame of the second image frame is the same as the zoom magnification of the second image frame.

Optionally, in a possible implementation manner, the processing unit is further configured to encode the image frame by using a first encoder according to a reference frame of the image frame to obtain a first encoding result; the processing unit is further configured to encode the second image frame through a second encoder according to a reference frame of the second image frame to obtain a second encoding result; wherein the encoding result comprises the first encoding result and the second encoding result.

Optionally, in a possible implementation manner, the indication information further includes frame types of the plurality of image frames, and a frame type of the first image frame is different from a frame type of the second image frame.

A fourth aspect of the present application provides a terminal, comprising: the device comprises an acquisition unit, a transmitting and receiving unit and a processing unit; the receiving and sending unit is configured to receive a plurality of image frames and indication information sent by a second terminal, where the plurality of image frames include a first image frame and a second image frame, a zoom magnification of the first image frame is greater than that of the second image frame, the first image frame is determined according to region-of-interest information, the region-of-interest information is used to indicate a position of a region of interest, and the indication information includes the region-of-interest information of the first image frame; the processing unit is configured to: if the region-of-interest information of the first image frame is different from the region-of-interest information in the first terminal, cutting the second image frame according to the region-of-interest information in the first terminal to obtain a third image frame, and displaying the third image frame; and if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame.

Optionally, in a possible implementation manner, the plurality of image frames further include a fourth image frame, and a zoom magnification of the fourth image frame is the same as that of the second image frame; the fourth image frame, one or more first image frames and the second image frame are sequentially acquired by the second terminal according to a preset rule; the preset rule is that after the second terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

Optionally, in a possible implementation manner, the indication information further includes frame types of the plurality of image frames, and a frame type of the first image frame is different from a frame type of the second image frame; the processing unit is further used for sequentially sending the first image frame to a first buffer area and sending the second image frame to a second buffer area according to the frame types of the plurality of image frames; the processing unit is further to: if the information of the region of interest of the first image frame in the first buffer area is different from the information of the region of interest in the first terminal, cutting the second image frame in the second buffer area according to the information of the region of interest in the first terminal to obtain a third image frame, and displaying the third image frame; and if the region-of-interest information of the first image frame in the first buffer area is the same as the region-of-interest information in the first terminal, displaying the first image frame in the first buffer area.

Optionally, in a possible implementation manner, the obtaining unit is configured to obtain an interactive instruction, where the interactive instruction is used to instruct to change the region of interest; the processing unit is further used for updating the region-of-interest information according to the interactive instruction to obtain updated region-of-interest information; the transceiver unit is further configured to send the updated region of interest information to the second terminal.

A fifth aspect of the present application provides a terminal, including: a processor, a non-volatile memory, and a volatile memory; wherein the non-volatile memory or the volatile memory has stored therein computer readable instructions; the processor reads the computer readable instructions to cause the terminal to implement the method as implemented in any one of the first or second aspects.

A sixth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform a method as any one of the implementation manners of the first aspect or the second aspect.

A seventh aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform a method as implemented in any one of the first or second aspects.

An eighth aspect of the present application provides a chip comprising one or more processors. A part or all of the processor is used for reading and executing the computer program stored in the memory so as to execute the method in any possible implementation mode of any one aspect.

Optionally, the chip may include a memory, and the memory and the processor may be connected to the memory through a circuit or a wire. Optionally, the chip further comprises a communication interface, and the processor is connected to the communication interface. The communication interface is used for receiving data and/or information needing to be processed, the processor acquires the data and/or information from the communication interface, processes the data and/or information, and outputs a processing result through the communication interface. The communication interface may be an input output interface. The method provided by the application can be realized by one chip or by cooperation of a plurality of chips.

Drawings

Fig. 1 is a schematic view of a video frame according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a remote video flow of the first related art;

FIG. 3 is a schematic diagram illustrating a remote video flow of the second related art;

fig. 4 is a schematic structural diagram of a terminal 101 according to an embodiment of the present disclosure;

fig. 5 is a schematic flowchart of a remote video method according to an embodiment of the present application;

fig. 6 is a schematic diagram illustrating image contrast at different zoom magnifications according to an embodiment of the present disclosure;

fig. 7 is a schematic diagram illustrating comparison of information of regions of interest in different terminals at different times according to an embodiment of the present application;

fig. 8 is a schematic diagram of obtaining a plurality of image frames by means of digital zooming according to an embodiment of the present application;

fig. 9 is a schematic diagram of another method for obtaining a plurality of image frames according to an embodiment of the present application;

fig. 10 is a schematic diagram illustrating a comparison between image frames obtained according to different manners according to an embodiment of the present application;

fig. 11 is a schematic diagram of determining a reference frame according to an embodiment of the present application;

fig. 12 is a schematic diagram of encoding based on two encoders according to an embodiment of the present application;

fig. 13 is a flowchart illustrating a remote video method according to an embodiment of the present application;

fig. 14 is a schematic structural diagram of a terminal 1400 according to an embodiment of the present disclosure;

fig. 15 is a schematic structural diagram of a computer program product 1500 according to an embodiment of the present application.

Detailed Description

Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.

The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved.

With the development of communication technology, remote videos are widely applied in scenes such as video calls, telemedicine, remote education and the like. Subject to power consumption and latency, the picture resolution of remote video typically does not exceed 1080P (i.e., pixels are 1920 x 1080).

However, in many cases, if the capture resolution of the video is 1080P, the sharpness of the captured video may not meet the actual demand. For example, referring to fig. 1, fig. 1 is a schematic view of a video frame according to an embodiment of the present disclosure. As shown in fig. 1, in the remote job tutoring scene, since the capture resolution of the video is 1080P, for the font of the smaller size in the video picture, even if the video picture is enlarged, the font of the smaller size in the video picture may not be clearly recognized.

Based on this, two solutions are proposed in the related art. In a solution of the related art, a video with a higher resolution is transmitted to a receiving end by improving the resolution of a video acquired by a transmitting end, for example, by improving the resolution of the video acquired by the transmitting end to 3840 × 2160 (4 k resolution for short), so as to solve the problem of unclear video.

For example, referring to fig. 2, fig. 2 is a schematic diagram of a remote video flow in the first related art. As shown in fig. 2, a sending end acquires a video with a resolution of 4k through a camera, and after the video is encoded with the resolution of 4k, the encoded video is sent to a receiving end. And after receiving the coded video, the receiving end decodes the video to obtain a video with a resolution of 4k and displays the video with the resolution of 4 k.

In the first related art, by increasing the resolution of the transmitted video, the sharpness of the remote video can be effectively increased. However, compared with encoding a 1080P resolution video, the power consumption and the time delay of encoding a 4k resolution video are increased by 4 times, which affects the real-time performance of the remote video. In addition, because the coding power consumption is large, long-distance video cannot be performed on some terminals often, and normal long-distance video is affected. Furthermore, the coding rate for coding a 4k resolution video is also increased by 4 times, which brings a large bandwidth cost, and causes that the scheme is difficult to be applied in an actual service scene.

In the second related art, on the basis of improving the resolution of the video captured by the transmitting end, for example, the resolution of the video captured by the transmitting end is improved to 3840 × 2160, and the resolution of the captured video is reduced to 1080P and then is transmitted to the receiving end.

Thus, when the receiving end needs to enlarge a certain area (the area can be generally called as an area of interest), the receiving end feeds back the area needing to be enlarged to the transmitting end. After the sending end acquires the area needing to be amplified, the sending end intercepts the area needing to be amplified from the acquired video picture, converts the intercepted video picture into 1080P and sends the 1080P converted video picture to the receiving end. The video picture obtained by interception is collected with high resolution, so that the definition of the video picture can meet the requirement.

Referring to fig. 3, fig. 3 is a schematic diagram of a remote video flow of the second related art. As shown in fig. 3, the sending end acquires a video with a resolution of 4k through a camera, and captures an original video picture according to the information of the region of interest to obtain a video with a captured picture. Then, the transmitting end encodes the video with the cut-out part of the picture at 1080P resolution and transmits the encoded video to the receiving end. And after receiving the coded video, the receiving end decodes the video to obtain a 1080P resolution video and displays the 1080P resolution video.

For example, it is assumed that a transmitting end captures a video at a 1920 × 1080 resolution, and encodes the captured video at the 1920 × 1080 resolution and transmits the encoded video to a receiving end. Then the receiving end can receive 1920 × 1080 resolution video. If the receiving end needs to enlarge and display an interested area with the size of 960x540, the receiving end needs to intercept a video picture corresponding to the interested area from the received original video, upsample the intercepted video picture into a picture with the size of 1920 x 1080, and finally display the upsampled video picture. Because the original video picture is actually acquired at a resolution of 1920 × 1080, and the video picture corresponding to the region of interest actually displayed by the receiving end is the video picture up-sampled by the receiving end, the definition of the video picture is not high.

After the second correlation technique is adopted, it is assumed that a sending end acquires a video at a resolution of 3840 × 2160, encodes the acquired video at a resolution of 1920 × 1080, and sends the encoded video to a receiving end, that is, after an original video is downsampled to 1920 × 1080, the downsampled video is encoded. Then the receiving end can receive 1920 × 1080 resolution video. If the receiving end needs to enlarge and display a certain region of interest with the size of 960x540, the receiving end needs to feed back the position of the region of interest to the transmitting end.

After receiving the position of the region of interest, the transmitting end may determine that the size of the region of interest in the captured original video is 1920 × 1080. Therefore, the transmitting end can intercept an interested region with a size of 1920 × 1080 in the acquired original video, encode the intercepted video with a resolution of 1920 × 1080, and transmit the encoded video to the receiving end. Therefore, the receiving end can receive the video intercepted from the original video with the resolution of 4k, and the receiving end does not need to up-sample the received video picture, so that the definition of the video picture displayed by the receiving end is higher.

However, in the second related art, the receiving end needs to feed back the position of the region of interest to the transmitting end, and the transmitting end performs corresponding processing and then transmits the processed video to the receiving end. Therefore, the receiving end remotely controls the sending end, and has higher response time, which easily causes the obvious pause phenomenon of the video picture. For example, after the receiving end performs the screen enlargement operation, it needs to wait for a long time (generally, it needs to wait for more than 300 milliseconds), and the receiving end can receive the data transmitted by the transmitting end and display the enlarged video screen.

The remote video method provided by the embodiment of the application can be applied to a terminal with a video acquisition function. The terminal is also called User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), etc., and is a device equipped with an image capturing apparatus capable of capturing video, and capable of performing remote communication with other devices to transmit captured video to other devices. Such as a handheld device with a camera function, a surveillance camera, etc.

Currently, some examples of terminals are: a mobile phone (mobile phone), a tablet computer, a notebook computer, a palm top computer, a monitoring camera, a Mobile Internet Device (MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote surgery (remote medical supply), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), and the like.

The image acquisition device in the terminal is used for converting the optical signal into an electric signal to generate an image signal. The image capturing Device may be, for example, an image sensor, and the image sensor may be, for example, a Charge Coupled Device (CCD) or a Complementary Metal Oxide Semiconductor (CMOS).

Referring to fig. 4, fig. 4 is a schematic structural diagram of a terminal 101 according to an embodiment of the present disclosure. As shown in fig. 4, terminal 101 includes a processor 103, processor 103 coupled to a system bus 105. Processor 103 may be one or more processors, each of which may include one or more processor cores. A display adapter (video adapter)107, which may drive a display 109, the display 109 coupled with system bus 105. System bus 105 is coupled to an input/output (I/O) bus through a bus bridge 111. The I/O interface 115 is coupled to an I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (e.g., a touch screen, etc.), a multimedia disk (media track) 121 (e.g., a compact disk read-only memory (CD-ROM), a multimedia interface, etc.). A transceiver 123 (which can send and/or receive radio communication signals), a camera 155 (which can capture still and motion digital video images), and an external USB port 125. Wherein, optionally, the interface connected to the I/O interface 115 may be a USB interface.

The processor 103 may be any conventional processor, including a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, or a combination thereof. Alternatively, the processor may be a dedicated device such as an ASIC.

Terminal 101 may communicate with software deploying server 149 via network interface 129. Illustratively, the network interface 129 is a hardware network interface, such as a network card. The network 127 may be an external network, such as the internet, or an internal network, such as an ethernet or Virtual Private Network (VPN). Optionally, the network 127 may also be a wireless network, such as a WiFi network, a cellular network, and the like.

Hard drive interface 131 is coupled to system bus 105. The hardware drive interface is connected to the hard disk drive 133. System memory 135 is coupled to system bus 105. The data running in the system memory 135 may include an Operating System (OS)137, applications 143, and schedules of the terminal 101.

The operating system includes a Shell 139 and a kernel 141. Shell 139 is an interface between the user and the kernel of the operating system. The shell is the outermost layer of the operating system. The shell manages the interaction between the user and the operating system: await user input, interpret user input to the operating system, and process the output results of a wide variety of operating systems.

Kernel 141 is comprised of those portions of the operating system that are used to manage memory, files, peripherals, and system resources. The kernel 141 interacts directly with the hardware, and the operating system kernel typically runs processes and provides inter-process communication, CPU slot management, interrupts, memory management, and IO management, among other things.

Illustratively, in the case where the terminal 101 is a smartphone, the application program 143 includes a remote video-related program. The terminal 101 can implement remote video with another terminal by executing the application program 143. That is, the terminal 101 can capture a video through the camera 155, process and encode the captured video in the processing unit 103, and then transmit the processed video to another terminal. In one embodiment, the terminal 101 may download the application 143 from the software deploying server 149 when the application 143 needs to be executed.

The application scenarios of the remote video method provided in the embodiments of the present application are introduced above, and the execution process of the remote video method will be described in detail below.

Referring to fig. 5, fig. 5 is a schematic flowchart of a remote video method according to an embodiment of the present disclosure. As shown in fig. 5, the remote video method includes the following steps.

Step 501, a first terminal acquires a plurality of image frames, wherein the image frames include a first image frame and a second image frame, the zoom magnification of the first image frame is greater than that of the second image frame, the first image frame is determined according to region-of-interest information, and the region-of-interest information is used for indicating the position of a region of interest.

In this embodiment, the first terminal is a sending terminal that collects a video and sends the video to a receiving terminal, and the second terminal is a receiving terminal that receives the video sent by the first terminal and displays the video.

In the video acquisition process, the first terminal may continuously acquire image frames at a fixed frame rate, so as to obtain a plurality of continuous image frames. Among the plurality of image frames, a first image frame and a second image frame are included. The zooming magnifications of the first image frames are all larger than that of the second image frames.

The zoom factor refers to a magnification factor of an image output by an image sensor of the camera. When the camera does not move, the larger the zoom magnification of the camera is, the larger the shot object in the image output by the camera is, and the smaller the shot range in the image is; the smaller the zoom magnification of the camera is, the smaller the subject in the image output by the camera is, and the larger the range captured in the image is.

For example, referring to fig. 6, fig. 6 is a schematic diagram illustrating image contrast at different zoom magnifications according to an embodiment of the present application. As shown in fig. 6, for two images of the same size output by the camera in the same scene, the camera takes a first image by a smaller zoom magnification and a second image by a larger zoom magnification. For the image I with the smaller zoom magnification, the shooting range of the image I is the whole body of the skier. For the second image with the larger zoom magnification, the shooting range of the second image is the head of the skier, namely the shooting range of the second image is smaller than that of the first image. In addition, the head of the skier captured in image two is larger than the head of the skier captured in image one. In short, the second image can be regarded as an enlarged image of the area of the first image where the head of the skier is located.

In the first terminal, the region-of-interest information may be stored, and the region-of-interest information may be fed back to the first terminal by the second terminal. The region-of-interest information is used to indicate where the region of interest is located. The region of interest refers to a region obtained by a user zooming in, zooming out, or translating a video picture displayed on a screen of the terminal in an interactive manner such as touching the screen of the terminal with a finger.

In short, the region of interest is a region to be displayed on the screen of the terminal in the remote video process. Taking fig. 6 as an example, the first image is displayed on the terminal screen, and the user touches the terminal screen with a finger to perform an enlarging operation on the head area of the skier, so that the terminal can obtain the corresponding region of interest (i.e., the head area of the skier shown in the second image) based on the interaction instruction of the user.

Based on the region-of-interest information in the first terminal, the first terminal may determine a location where the region-of-interest is located, thereby acquiring a first image frame associated with the region-of-interest. The picture content in the first image frame is the content of the region of interest.

Optionally, only the region of interest indicated in the region of interest is included in the first image frame, and the second image frame includes other regions in addition to the region of interest indicated in the region of interest information. That is, only the picture content of the position of the region of interest is included in the picture in the first image frame, and the picture in the second image frame includes other picture contents in addition to the picture content of the position of the region of interest.

For example, the second image frame may be an image frame acquired by the first terminal at a preset zoom magnification, and the first image frame may be an image frame acquired by the first terminal at an adjusted zoom magnification according to the region of interest information. Under the condition that the first terminal does not move, the second image frame can be regarded as a global image acquired by the first terminal, namely the second image frame comprises all areas in the visual field range of the camera of the first terminal; the first image frame may be considered as a local image acquired by the first terminal, that is, the first image frame includes only a partial region within the field of view of the camera of the first terminal.

Step 502, the first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects an image frame to be displayed according to the indication information, the image frame to be displayed is used for generating a video, and the indication information includes region-of-interest information of the first image frame.

In the process of carrying out remote video between the first terminal and the second terminal, the first terminal sequentially collects the image frames and sends the image frames to the second terminal one by one.

Optionally, the first terminal may acquire the plurality of image frames at a higher resolution, then reduce the plurality of image frames to a specific resolution, then encode the plurality of reduced image frames, and send the plurality of encoded image frames. For example, the first terminal acquires the plurality of image frames at a resolution of 4k, reduces the plurality of image frames to a resolution of 1080P, and encodes and transmits the plurality of reduced image frames.

In addition, in the process of transmitting the image frames to the second terminal by the first terminal, the first terminal also transmits indication information to the second terminal, wherein the indication information is used for indicating the region-of-interest information of each image frame.

In short, the first terminal may carry indication information corresponding to each image frame in the process of sending each image frame to the second terminal, where the indication information indicates the region-of-interest information of the image frame transmitted by the first terminal. Wherein the indication information may comprise coordinate information for indicating the position of the region of interest.

For example, in the case where the region of interest is a rectangular region, the coordinate information of four vertices of the rectangular region may be included in the indication information.

For another example, in the case that the region of interest is a rectangular region, the indication information may include coordinate information of a vertex of the rectangular region (e.g., a vertex at the upper left corner of the rectangular region) and a width and a height of the rectangular region. In this way, based on the coordinate information of one vertex of the rectangular region and the width and height of the rectangular region, the coordinate information of four vertices of the rectangular region can also be calculated.

Optionally, when the first terminal performs video coding according to the h.264/h.265 video compression standard, the first terminal may write the indication Information into Supplemental Enhancement Information (SEI), so as to implement that the indication Information corresponding to each image frame is carried in the transmission data.

In a possible embodiment, for the second image frame, since the second image frame includes other regions besides the region of interest, the first terminal may not carry indication information when transmitting the second image frame, i.e. does not indicate the position of the region of interest in the second image frame. In another possible embodiment, the first terminal still carries indication information when transmitting the second image frame to indicate the position of the region of interest in the second image frame.

Optionally, in the process that the first terminal sends the plurality of image frames to the second terminal, the first terminal may encode the image frames one by one to obtain encoded image information, and then send the encoded image information and the indication information corresponding to the encoded image frames to the second terminal.

Optionally, in order to facilitate the second terminal to determine the first image frame and the second image frame in the plurality of image frames, the frame type of the plurality of image frames may be further included in the indication information. Wherein a frame type of the first image frame is different from a frame type of the second image frame.

For example, the first terminal may indicate the frame type of the image frame by indicating a certain bit in the information. For example, when a certain bit in the indication information for indicating the frame type is set to 1, the frame type of the current image frame is indicated to be a second type, that is, the above-mentioned second image frame; when a certain bit for indicating the frame type in the indication information is not set (i.e. the bit has a value of 0), it indicates that the frame type of the current image frame is the first type, i.e. the first image frame mentioned above. In addition to indicating the frame type in the above manner, the frame type of the image frame may also be indicated in other manners, which is not specifically limited in this embodiment.

In step 503, the second terminal receives the plurality of image frames and the indication information sent by the first terminal.

After the second terminal receives the data sent by the first terminal, the second terminal decodes the data sent by the first terminal to obtain a plurality of image frames sent by the first terminal. Illustratively, in the process that the first terminal continuously transmits the encoded data to the second terminal, the second terminal continuously decodes the received encoded data, so as to sequentially obtain the plurality of image frames and the indication information corresponding to each image frame.

It should be noted that, the first terminal encodes an image frame every time it acquires the image frame, and sends the encoded image frame and corresponding indication information to the second terminal. Therefore, the second terminal receives the image frames transmitted by the first terminal one by one, rather than receiving the image frames at a time.

Step 504, if the region of interest information of the first image frame is different from the region of interest information in the second terminal, a fifth image frame is obtained by cutting the second image frame according to the region of interest information in the second terminal, and the fifth image frame is displayed.

In this embodiment, after the second terminal receives the image frames sent by the first terminal, the second terminal may select the image frames to be displayed according to the indication information corresponding to each image frame.

Specifically, for the second terminal as a receiving end, the second terminal can receive an interactive instruction from a user and update the region-of-interest information according to the interactive instruction. However, the second terminal needs to feed back the updated region-of-interest information to the first terminal, and the first terminal acquires a new image frame based on the updated region-of-interest information and then sends the new image frame to the second terminal. Therefore, there is a certain hysteresis in the image frame matched with the updated region of interest information, that is, within a period of time after the second terminal finishes updating the local region of interest information, the region of interest information corresponding to the image frame received by the second terminal is not the same as the updated region of interest information.

Therefore, before the second terminal displays the first image frame, the second terminal determines whether the region-of-interest information of the first image frame is the same as the region-of-interest information stored in the second terminal. And if the region-of-interest information of the first image frame is different from the region-of-interest information stored in the second terminal, the second terminal cuts the second image frame according to the region-of-interest information in the second terminal to obtain a fifth image frame and displays the fifth image frame.

In short, because the zoom magnification of the second image frame is smaller than that of the first image frame, and the second image frame is actually a global image frame, when the region of interest changes, the position of a new region of interest can be determined from the second image frame, and the position of the new region of interest is cut off to obtain a fifth image frame. And the content in the fifth image frame is the content corresponding to the position indicated by the region-of-interest information of the second terminal.

And 505, if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame.

The second terminal may display the first image frame if the region-of-interest information of the first image frame is the same as the region-of-interest information saved in the second terminal.

For ease of understanding, the process of the second terminal selecting the displayed image frame will be described below with reference to a specific example.

Referring to fig. 7, fig. 7 is a schematic diagram illustrating comparison of region-of-interest information in different terminals at different times according to an embodiment of the present application. In fig. 7, a solid box in the image represents a position indicated by the region-of-interest information in the current terminal.

At time t1, the region-of-interest information in the first terminal is the same as the region-of-interest information in the second terminal, i.e., the region-of-interest indicated by the region-of-interest information in the first terminal and the second terminal are both located at the position of the head of the skier. At this time, the region-of-interest information corresponding to the first image frame sent by the first terminal is the same as the region-of-interest information in the second terminal, and the second terminal selectively displays the first image frame, wherein the content of the first image frame is the head of the skier marked in the solid frame.

At a time t2, the second terminal receives an interactive instruction of the user, where the interactive instruction is specifically a panning operation instruction, and the second terminal updates the region-of-interest information according to the interactive instruction. In the second terminal, the region of interest indicated by the region of interest information before updating is located on the head of the skier, and the region of interest indicated by the region of interest information after updating is located on the skier. Because the information of the region of interest in the second terminal is changed, the information of the region of interest of the first image frame sent by the first terminal is different from the information of the region of interest in the second terminal, so that the second terminal selects to cut the updated region of interest in the second image frame to obtain a third image frame, and displays the third image frame. The content of the third image frame is the slide plate marked in the solid frame.

At time t3, the first terminal receives the updated region of interest information fed back by the second terminal, so the first terminal acquires a new first image frame according to the updated region of interest information and sends the new first image frame to the second terminal. Because of the transmission delay between the first terminal and the second terminal, at time t3, the first image frame received by the second terminal is actually acquired by the first terminal based on the region-of-interest information before updating. That is, the region-of-interest information of the first image frame received in the second terminal is different from the region-of-interest information in the second terminal, and the second terminal still selects to display the third image frame.

At time t4, the region-of-interest information in the first terminal is the same as the region-of-interest information in the second terminal, and the second terminal receives the first image frame acquired by the first terminal based on the updated region-of-interest information. At this time, the region-of-interest information corresponding to the first image frame sent by the first terminal is the same as the region-of-interest information in the second terminal, and the second terminal selects and displays the first image frame, wherein the content of the first image frame is the slide plate marked in the solid frame.

In this embodiment, in the process of sending the image frame related to the region of interest to the second terminal as the receiving end, the first terminal as the sending end adds the image frame with lower zoom magnification, that is, the image frame including more contents in the image. Therefore, when the region of interest in the second terminal changes, the second terminal can intercept the picture corresponding to the changed region of interest from the image frame with lower zoom magnification in time and display the picture, namely, the second terminal can display the adjusted video picture without waiting for a long time, so that the response time of adjusting the region of interest of the video is prolonged, and the phenomenon of blocking of the video picture is avoided.

In one possible embodiment, to facilitate the selection of the displayed image frames by the second terminal, a different buffer (buffer) may be created in the second terminal. And after receiving the image frames of different frame types, the second terminal respectively sends the image frames to the corresponding buffer areas. Then, the second terminal selects one of the buffers according to the region-of-interest information and displays the image frame in the buffer.

Illustratively, in a case where the indication information further includes frame types of the plurality of image frames, the second terminal sequentially sends the first image frame to a first buffer and the second image frame to a second buffer according to the frame types of the plurality of image frames. The first buffer area is used for storing a first image frame with a first frame type, and the second buffer area is used for storing a second image frame with a second frame type.

When the second terminal receives a new image frame, the second terminal judges the frame type of the image frame, and if the frame type of the image frame is the first type, the image is sent to a first buffer area; if the frame type of the image frame is the second type, the image frame is sent to a second buffer. The old image frame may be overwritten by the new image frame on the first buffer and the second buffer.

After the second terminal sends the image frames to the corresponding buffer areas based on the frame types of the image frames, the second terminal determines which buffer area to select to display the image frames according to the region-of-interest information of the image frames in the buffer areas.

Specifically, if the region-of-interest information of the first image frame in the first buffer is different from the region-of-interest information in the second terminal, a fifth image frame is obtained by cutting the second image frame in the second buffer according to the region-of-interest information in the second terminal, and the fifth image frame is displayed. And if the region-of-interest information of the first image frame in the first buffer area is the same as the region-of-interest information in the first terminal, displaying the first image frame in the first buffer area.

In one possible embodiment, during the process of displaying the image frame by the second terminal, the second terminal may acquire an interactive instruction of the user in real time, and update the region of interest in real time according to the interactive instruction.

Illustratively, when a user initiates an interactive instruction on a second terminal, the second terminal acquires the interactive instruction, and the interactive instruction is used for indicating to change the region of interest. The interactive instruction may be, for example, a zoom-out operation instruction, a zoom-in operation instruction, or a pan operation instruction initiated by a user by touching a screen of the second terminal. The zoom-out operation instruction is used for indicating that the picture displayed on the screen of the second terminal is zoomed out by taking the target area as a starting point. The enlargement operation instruction is used for instructing enlargement of the target area displayed on the screen of the second terminal. The translation operation instruction is used for instructing to translate the picture displayed on the screen of the second terminal to a specific direction.

After the second terminal executes the interactive instruction, the obtained region to be displayed is a new region of interest, so that the second terminal can update the region of interest information according to the interactive instruction to obtain the updated region of interest information. In this way, after the second terminal updates the local region of interest information, the second terminal sends the updated region of interest information to the first terminal.

The above describes a process in which the first terminal sends image frames with different zoom magnifications to the second terminal, and the second terminal selects an image frame to be displayed according to the region-of-interest information. For ease of understanding, the process of the first terminal acquiring the plurality of image frames and transmitting the plurality of image frames to the second terminal will be described in detail below.

Alternatively, the first terminal may acquire the plurality of image frames in a plurality of ways.

In a first mode, the first terminal may acquire the plurality of image frames by means of digital zooming.

The digital zooming is to enlarge each pixel area of a partial area in the acquired image through a processor in the first terminal, so as to achieve the purpose of enlarging the partial area. In practice, digital zooming does not change the focal length of the lens.

Illustratively, the first terminal acquires one or more third image frames and the second image frame in sequence, and the zoom magnification of the one or more third image frames is the same as that of the second image frame. That is, the first terminal continuously acquires a plurality of image frames including the one or more third image frames and the second image frame at a specific zoom magnification. Then, the first terminal cuts the one or more third image frames according to the region-of-interest information to obtain the one or more first image frames. Specifically, after the first terminal collects each third image frame, the first terminal cuts the third image frame according to the region-of-interest information in the first terminal, that is, cuts the region-of-interest in the third image frame based on the position indicated by the region-of-interest information, so as to obtain the first image frame. The method for the first terminal to acquire the first image frame is digital zooming.

Referring to fig. 8, fig. 8 is a schematic diagram illustrating a plurality of image frames obtained by a digital zoom method according to an embodiment of the present application. As shown in fig. 8, the first terminal sequentially captures an image 1, an image 2, an image 3, an image 4, an image 5, and an image 6 at a fixed zoom magnification, where the images 1 to 5 correspond to the third image frame described above, and the image 6 corresponds to the second image frame described above. In the process of acquiring the images 1-5 by the first terminal, after the first terminal acquires the image 1, cutting the image 1 based on the region-of-interest information to obtain an image A1; similarly, after the first terminal acquires the image 2, the image 2 is cropped based on the region-of-interest information, resulting in an image a 2. By analogy, the first terminal obtains images a 1-a 5 corresponding to images 1-5 by means of digital zooming, images a 1-a 5 corresponding to the first image frame described above. For image 6, the first terminal no longer crops image 6 based on the region of interest information, i.e. image 6 in fig. 8 is the same as image B.

In the first mode, the first terminal may set a fixed interval number, where the interval number is used to indicate the number of image frames separated between two adjacent global image frames. The first image frame may be referred to as a local image frame, and the second image frame may be referred to as a global image frame. That is, the first terminal acquires one global image frame after every certain number of local image frames are acquired. For example, in the case that the number of intervals is 4, the first terminal acquires one global image frame every 4 local image frames. Thus, when the first terminal performs image acquisition at a frame rate of 30, the first terminal acquires 30 image frames per second, where the 30 image frames include 24 local image frames and 6 global image frames. In practical applications, the number of the above-mentioned intervals may be fixed, for example, the number of the intervals is 4 or 5. The number of the intervals may also be non-fixed, for example, the first terminal acquires one global image frame after 4 local image frames are separated, and then the first terminal acquires the next global image frame after 5 local image frames are separated. The present embodiment does not limit the number of intervals set in the first terminal.

Illustratively, a fourth image frame may be further included in the plurality of image frames captured by the first terminal, and a zoom magnification of the fourth image frame is the same as that of the second image frame. The process of the first terminal acquiring the plurality of image frames specifically includes: the first terminal sequentially collects the fourth image frame, one or more first image frames and the second image frame according to a preset rule; the preset rule is that after the first terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

In this way, after the first terminal collects a specific number of image frames based on the region-of-interest information each time, a global image frame is collected by using a lower zoom magnification, so that the frequency of receiving the global image frame by the second terminal is ensured, and the phenomenon of overlarge image change is avoided when the region-of-interest is subsequently cut by the second terminal based on the global image frame.

In the second mode, the first terminal inserts the global image frame after acquiring the plurality of local image frames based on the region-of-interest information.

Illustratively, the first terminal acquires one or more third image frames, the zoom magnification of which is the same as that of the second image frame; the first terminal cuts the one or more third image frames according to the region-of-interest information to obtain one or more first image frames; the first terminal determines one of the one or more third image frames as the second image frame. That is, the first terminal continuously acquires first image frames (i.e., local image frames) by means of digital zooming based on the region-of-interest information, and inserts second image frames (i.e., global image frames) into the acquired plurality of first image frames.

That is, each time the first terminal acquires a specific number of image frames based on the region-of-interest information, a global image frame with a lower zoom magnification is inserted. In this way, the number of image frames sent by the first terminal to the second terminal is greater than the number of image frames actually captured by the first terminal.

Referring to fig. 9, fig. 9 is a schematic diagram of obtaining a plurality of image frames according to an embodiment of the present application. As shown in fig. 9, the first terminal acquires image 1, image 2, image 3, image 4, and image 5 in order at a fixed zoom magnification, where images 1-5 correspond to the third image frame described above. In the process of acquiring the images 1-5 by the first terminal, after the first terminal acquires the image 1, cutting the image 1 based on the region-of-interest information to obtain an image A1; similarly, after the first terminal acquires the image 2, the image 2 is cropped based on the region-of-interest information, resulting in an image a 2. By analogy, the first terminal obtains images a 1-a 5 corresponding to images 1-5 by means of digital zooming, images a 1-a 5 corresponding to the first image frame described above. After obtaining image A1-image A5, the first terminal determines image 5 as the second image frame, i.e., inserts image B after image A5 based on image 5, which is the same as image 5.

It can be understood that, when the image frames are obtained in the first mode without change of the information of the region of interest of the second terminal, because the second terminal does not select to display the global image frames acquired by the first terminal, the frequency of displaying the image frames on the second terminal is lower than the frequency of actually acquiring the image frames by the first terminal.

For example, in the case where the first terminal acquires one global image frame every 5 local image frames, the first terminal acquires 30 image frames per second, and the 30 image frames include 25 local image frames and 5 global image frames. For the second terminal, the second terminal displays 25 local image frames per second.

When the image frames are obtained by the second method, under the condition that the first terminal inserts one global image frame every 5 local image frames, the first terminal acquires 30 local image frames every second and inserts 6 global image frames, namely, 36 image frames in total. For the second terminal, the second terminal displays 30 local image frames per second. Therefore, the image frames are obtained through the second mode, the frequency of the image frames displayed by the second terminal can be ensured, and the fluency of the video images is improved.

For example, referring to fig. 10, fig. 10 is a schematic diagram illustrating a comparison between image frames obtained in different manners according to an embodiment of the present application. In fig. 10, the first terminal acquires image 1-image 10 based on a particular zoom magnification. When the first terminal obtains an image frame which needs to be sent to the second terminal in the first mode, the first terminal cuts the images 1-4 and the images 6-9 based on the information of the region of interest to respectively obtain an image A1-an image A8; the first terminal obtains an image B1 and an image B2 based on the image 5 and the image 10. That is, the number of image frames transmitted from the first terminal to the second terminal is 10 in total.

When the first terminal obtains an image frame which needs to be sent to the second terminal in the second mode, the first terminal cuts the images 1-10 based on the information of the region of interest to respectively obtain an image A1-an image A10; further, the first terminal obtains an image B1 and an image B2 based on the image 5 and the image 10, and inserts the image B1 after the image a5, and inserts the image B2 after the image a 10. That is, the total number of image frames transmitted from the first terminal to the second terminal is 12. Obviously, when the first terminal obtains the image frames to be sent to the second terminal in the second mode, the second terminal can receive more image frames and can be used for displaying more image frames, so that the frequency of the second terminal for displaying the image frames can be ensured, and the fluency of video images can be improved.

In a third mode, the first terminal may acquire the first image frame and the second image frame by means of optical zooming.

The optical zooming refers to that the first terminal realizes zooming by means of an optical lens structure, namely the first terminal zooms in and zooms out a scene needing to be shot by moving a lens. Although both digital zooming and optical zooming are helpful for magnifying a distant object in telescopic shooting, only optical zooming can support imaging of an image main body, and more pixels are added, so that the main body is enlarged and relatively clearer. That is, when an object in an image is enlarged by means of optical zooming, the object in the image can be made relatively sharper.

Since the optical zoom is to zoom in or out an image frame captured by the lens by changing a focal length with the center of the field of view of the lens as a center point, the process of the first terminal acquiring the first image frame by means of the optical zoom is actually to calculate a longest focal length at which the field of view covers the region of interest and then acquire an image frame including the region of interest based on the longest focal length. If the image frame including the region of interest includes exactly only the region of interest, the image frame including the region of interest may be determined as a first image frame; if the image frame including the region of interest includes a region of non-interest in addition to the region of interest, the region of interest may be cut out from the image frame including the region of interest to obtain a first image frame.

In this embodiment, the first terminal may adjust a zoom magnification in an optical zoom manner based on the region of interest information, and acquire the first image frame. Then, the first terminal adjusts the zoom ratio again to acquire a second image frame.

In a possible embodiment, during the process of encoding the plurality of image frames by the first terminal, the first terminal may select a corresponding reference frame according to a frame type of the image frame.

Generally, in the encoding process of a video, a sending end selects a previous image frame of a current image frame as a reference frame, then calculates difference information between the current image frame and the reference frame, and sends the difference information between the current image frame and the reference frame to a receiving end instead of directly sending the current image frame to the receiving end, thereby playing a role in reducing transmitted data.

In the present embodiment, since the first image frame and the second image frame are acquired based on different zoom magnifications. Therefore, the difference between the first image frame and the second image frame is relatively large compared to two adjacent first image frames. When the second image frame is encoded, if the previous image frame (i.e., the first image frame) of the second image frame is used as a reference frame, the data amount of the obtained difference information is large, and the data transmission amount is increased. Based on this, the first terminal may select a corresponding reference frame according to the frame type of the image frame, so as to ensure that the frame type between the image frame currently to be encoded and the reference frame of the image frame is the same.

Illustratively, in the process of encoding the plurality of image frames, the first terminal acquires a reference frame of the plurality of image frames, wherein the reference frame of the first image frame is obtained according to the region-of-interest information, and the zoom magnification of the reference frame of the second image frame is the same as the zoom magnification of the second image frame.

Briefly, in the process of acquiring the image frames by the first terminal, the frame type of the local image frame (i.e., the first image frame) acquired by the first terminal based on the region-of-interest information is a first type, and the frame type of the global image frame (i.e., the second image frame) acquired based on a specific zoom magnification is a second type. For any image frame with the first type of frame type, the first terminal may determine that a reference frame of the image frame is a previous image frame of the first type of the image frame, that is, an image frame of the first type temporally closest to the image frame. Similarly, for any image frame of which the frame type is the second type, the first terminal may determine that the reference frame of the image frame is the previous image frame of the second type of the image frame, that is, the image frame of the second type temporally closest to the image frame.

That is, for any one image frame, the reference frame of the image frame is an image frame of the same type temporally closest to the image frame.

After obtaining the reference frame, the first terminal encodes the plurality of image frames according to the reference frame of the plurality of image frames to obtain an encoding result; and the first terminal sends the coding result to the second terminal.

For example, referring to fig. 11, fig. 11 is a schematic diagram illustrating a determination of a reference frame according to an embodiment of the present application. As shown in fig. 11, the image a 1-image a10 are images acquired by the first terminal based on the region-of-interest information, and the frame type of the image a 1-image a10 is the first type. The image B1 and the image B2 are images acquired by the first terminal based on a specific zoom magnification, and the frame type of the image B1 and the image B2 is the second type.

The reference frame of the image a2 is the image a1, and the reference frame of the image A3 is the image a2 …, the image a6 is the image a 5. For any of image A1-image A10, the reference frame for that image is the previous image that is of the first type. Similarly, the reference frame for picture B2 is picture B1, not picture a 10. For any of the pictures B1-B2, the reference frame of the picture is the previous picture that belongs to the second type.

In this embodiment, since the reference frame of the image frame to be encoded is not necessarily the previous image frame of the image frame to be encoded in the encoding process, for an encoder that uses the previous image frame of the image frame to be encoded as the reference frame, the encoder may not be able to encode the image frame well.

In view of this, the present embodiment proposes to use two encoders to encode image frames of different frame types, so as to ensure smooth encoding of the image frames.

For example, the encoding, by the first terminal, the multiple image frames according to the reference frame of the multiple image frames to obtain an encoding result may specifically include: the first terminal encodes the image frame through a first encoder according to a reference frame of the image frame to obtain a first encoding result; the first terminal encodes the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result; wherein the encoding result comprises the first encoding result and the second encoding result.

Briefly, in practical applications, for image frames acquired based on the region-of-interest information, the first terminal inputs the image frames into the first encoder, and the first encoder encodes the image frames. For image frames acquired at a specific zoom magnification, the first terminal inputs the image frames into the second encoder, and the second encoder encodes the image frames.

For example, referring to fig. 12, fig. 12 is a schematic diagram of encoding based on two encoders according to an embodiment of the present application. As shown in fig. 12, the image a 1-image a10 are images acquired by the first terminal based on the region-of-interest information, and the frame types of the image a 1-image a10 are the first type. The image B1 and the image B2 are images acquired by the first terminal based on a specific zoom magnification, and the frame type of the image B1 and the image B2 is the second type.

The input of the first encoder is image a 1-image a10, and during the encoding process of the first encoder, the reference frame of image a2 is image a1, and the reference frame of image A3 is image a5, which is image a2 … and image a 6. For any one of picture a 1-picture a10 (except the first picture input to the first encoder), the reference frame for that picture is the previous picture.

The inputs to the second encoder are picture B1 and picture B2. The reference frame of picture B2 is picture B1. I.e. for any picture input to the second encoder (except the first picture input to the second encoder), the reference frame for that picture is the previous picture.

For ease of understanding, the flow of the remote video method provided in the embodiments of the present application will be described in detail below with reference to specific examples. Referring to fig. 13, fig. 13 is a flowchart illustrating a remote video method according to an embodiment of the present disclosure. As shown in fig. 13, the remote video method includes steps 1301-1315, wherein steps 1301-1306 are performed by a first terminal, and steps 1307-1315 are performed by a second terminal.

Step 1301, in the process of collecting the image frame, the first terminal judges whether the image frame needing to be collected currently is a multiple of N.

In the process of remote video, the first terminal continuously collects image frames through the camera. In the process of collecting the image frames, the first terminal can collect a global image frame every N-1 local image frames, so that the first terminal can judge whether the current image frame to be collected is the global image frame or the local image frame by judging whether the current image frame to be collected is a multiple of N.

For example, assuming that N is 5, the first terminal is 4 local image frames at intervals, and one global image frame is acquired. In this way, the 1 st to 4 th image frames acquired by the first terminal are local image frames, and the 5 th image frame is a global image frame; the 6 th to 9 th image frames acquired by the first terminal are local image frames, and the 10 th image frame is a global image frame. By analogy, the global image frames collected by the first terminal are all multiples of N.

In step 1302, if the current image frame to be acquired is not a multiple of N, the first terminal acquires a local image frame based on the region-of-interest information.

The local image frame is acquired by the first terminal through digital zooming or optical zooming based on the region-of-interest information. The local image frame may be, for example, the first image frame described above. The region-of-interest information may be locally stored region-of-interest information of the first terminal. In the process of remote video, the first terminal can receive the region-of-interest information sent by the second terminal and store the received region-of-interest information locally at the first terminal.

The first terminal may acquire the local image frames based on 4k resolution or 2k resolution.

And step 1303, if the current image frame needing to be acquired is a multiple of N, acquiring a global image frame by the first terminal.

The global image frame is, for example, the second image frame described above, and the global image frame is acquired by the first terminal through a preset zoom magnification. The zoom magnification of the global image frame is smaller than that of the local image frame, and the image content in the global image frame comprises the image content of an interested area and the image content of a non-interested area.

In another possible implementation manner, the image frames currently required to be acquired are multiples of N, the first terminal may also continue to acquire the local image frames, and after acquiring the local image frames, insert a global image frame.

The first terminal may acquire the global image frame based on 4k resolution or 2k resolution.

In step 1304, the first terminal converts the acquired image frames to 1080P resolution.

After acquiring the local image frame or the global image frame, the first terminal converts the acquired local image frame or global image frame into 1080P resolution, that is, converts the acquired local image frame or global image frame into an image frame composed of 1920 × 1080 pixels. For the local image frame, if the size of the local image frame itself is smaller than 1920 × 1080, the size of the local image frame is converted into 1920 × 1080 by up-sampling the local image frame; if the size of the local image frame itself is larger than 1920 × 1080, the size of the local image frame is converted into 1920 × 1080 by down-sampling the local image frame. For the global image frame, the global image frame is down-sampled to convert the size of the global image frame to 1920 × 1080.

Step 1305, the first terminal selects a reference frame of the image frame according to the frame type and encodes the image frame based on the reference frame; or, the first terminal encodes the image frames of different frame types through two paths of encoders.

After the image frame is acquired, the first terminal encodes the acquired image frame. In the encoding process, the first terminal may select a reference frame of the current image frame to be encoded according to a frame type of the current image frame to be encoded. Wherein, the frame type of the local image frame can be defined as a first type, and the frame type of the global image frame can be defined as a second type.

In a possible implementation manner, for any one image frame with the frame type of the first type, the first terminal may determine that a reference frame of the image frame is an image frame of the first type previous to the image frame, that is, an image frame of the first type temporally closest to the image frame. Similarly, for any image frame of which the frame type is the second type, the first terminal may determine that the reference frame of the image frame is the previous image frame of the second type of the image frame, that is, the image frame of the second type temporally closest to the image frame. That is, for any one image frame, the reference frame of the image frame is an image frame of the same type temporally closest to the image frame.

In another possible implementation manner, the first terminal may input image frames of different frame types into different encoders, so as to encode the image frames through two paths of encoders. Specifically, the first terminal inputs an image frame of which the frame type is the first type into a first encoder, and the first encoder encodes the part of the image frame. In addition, the first terminal inputs the image frame of which the frame type is the second type into the second encoder, and the first encoder encodes the portion of the image frame.

When the first encoder encodes each image frame of the first type, the previous image frame of each image frame of the first type is used as a reference frame for encoding. When the second encoder encodes the second type of image frame, the reference frame of the second type of image frame is an image frame before the image frame in the second encoder.

Step 1306, the first terminal carries the indication information of the current image frame in the code stream, and sends the code stream to the second terminal.

In the remote video process, the first terminal continuously encodes the acquired image frame to obtain a code stream. The first terminal may carry indication information of a current image frame in an SEI of the code stream, and send the code stream to the second terminal. The indication information is used for indicating the frame type of the current image frame and the region-of-interest information corresponding to the current image frame.

And step 1307, the second terminal receives the code stream sent by the first terminal and decodes the code stream to obtain the image frame and the corresponding indication information.

And after receiving the code stream sent by the first terminal, the second terminal decodes the code stream to obtain the image frame and the indication information corresponding to the image frame.

Step 1308, the second terminal obtains the interactive instruction, and obtains the region of interest indicated by the interactive instruction by analyzing the interactive instruction.

Optionally, at any time of the remote video, the second terminal may obtain an interactive instruction triggered by the user, and obtain the region of interest indicated by the interactive instruction by analyzing the interactive instruction. The interactive instruction may be, for example, a zoom-out operation instruction, a zoom-in operation instruction, or a pan operation instruction initiated by a user by touching a screen of the second terminal.

And step 1309, the second terminal updates the local region-of-interest information according to the region-of-interest indicated by the interactive instruction.

After the interesting area indicated by the interactive instruction is obtained through analysis, the second terminal updates local interesting area information according to the interesting area indicated by the interactive instruction.

In step 1310, the second terminal determines whether the current image frame is a global image frame according to the indication information carried by the image frame.

Because the indication information is used for indicating the frame type of the image frame, the second terminal can judge whether the current image frame is the global image frame according to the indication information corresponding to the image frame in the code stream, that is, judge whether the frame type of the current image frame is the second type.

Step 1311, if the current image frame is not a global image frame, the current image frame is sent to a first buffer.

When the first buffer stores other local image frames, the second terminal overlays the new local image frame on the other local image frames, so that only one image frame is always kept in the first buffer.

If the current image frame is a global image frame, the current image frame is sent to a second buffer, step 1312.

When the first buffer stores other global image frames, the second terminal overlays the new global image frame on the other global image frames, so that only one image frame is always kept in the second buffer.

Step 1313, the second terminal determines whether the region of interest information corresponding to the image frame in the first buffer is consistent with the local region of interest information of the second terminal.

Based on the indication information corresponding to the image frame in the first buffer, the second terminal may determine whether the region-of-interest information corresponding to the image frame in the first buffer is consistent with the local region-of-interest information of the second terminal.

Step 1314, if the region-of-interest information corresponding to the image frame in the first buffer area is consistent with the local region-of-interest information of the second terminal, the second terminal sends the local image frame in the first buffer area to the display screen for displaying.

And 1315, if the region-of-interest information corresponding to the image frame in the first buffer is inconsistent with the local region-of-interest information of the second terminal, the second terminal intercepts a partial region in the global image frame in the second buffer, and sends the intercepted image frame to a display screen for display.

When the region of interest changes, the position of a new region of interest can be determined from the global image frame according to the local region of interest information of the second terminal, and the position of the new region of interest is cut off to obtain a new image frame. And the content in the new image frame is the content corresponding to the position indicated by the region-of-interest information of the second terminal.

On the basis of the embodiments corresponding to fig. 1 to fig. 13, in order to better implement the above-mentioned scheme of the embodiments of the present application, the following also provides related equipment for implementing the above-mentioned scheme.

Specifically, referring to fig. 14, fig. 14 is a schematic structural diagram of a terminal 1400 according to an embodiment of the present application, where the terminal 1400 includes: an acquisition unit 1401, a processing unit 1402, and a transceiver unit 1403; the acquiring unit 1401 is configured to acquire a plurality of image frames, where the plurality of image frames include a first image frame and a second image frame, a zoom magnification of the first image frame is greater than a zoom magnification of the second image frame, the first image frame is determined according to region-of-interest information, and the region-of-interest information is used to indicate a position of a region of interest; the first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects image frames to be displayed according to the indication information, the image frames to be displayed are used for generating videos, and the indication information comprises region-of-interest information of the first image frames.

Optionally, in a possible implementation manner, the obtaining unit 1401 is further configured to acquire a third image frame and the second image frame, where a zoom magnification of the third image frame is the same as a zoom magnification of the second image frame; the processing unit 1402 is further configured to crop the third image frame according to the region of interest information, so as to obtain the first image frame.

Optionally, in a possible implementation manner, the obtaining unit 1401 is further configured to acquire one or more third image frames, where a zoom magnification of the one or more third image frames is the same as a zoom magnification of the second image frame; the processing unit 1402 is further configured to crop the one or more third image frames according to the region of interest information, so as to obtain one or more first image frames; the processing unit 1402 is further configured to determine one of the one or more third image frames as the second image frame.

Optionally, in a possible implementation manner, the obtaining unit 1401 obtains the first image frame and the second image frame by means of optical zooming.

Optionally, in a possible implementation manner, the image frames further include a fourth image frame, and a zoom magnification of the fourth image frame is the same as that of the second image frame; the obtaining unit 1401 is further configured to sequentially obtain the fourth image frame, one or more of the first image frame and the second image frame according to a preset rule; the preset rule is that after the first terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

Optionally, in a possible implementation manner, the obtaining unit 1401 is further configured to obtain a reference frame of the multiple image frames; the processing unit 1402 is further configured to encode the multiple image frames according to reference frames of the multiple image frames to obtain an encoding result; the transceiving unit 1403 is further configured to send an encoding result to the second terminal; the reference frame of the first image frame is obtained according to the region-of-interest information, and the zoom magnification of the reference frame of the second image frame is the same as the zoom magnification of the second image frame.

Optionally, in a possible implementation manner, the processing unit 1402 is further configured to encode the image frame by a first encoder according to a reference frame of the image frame to obtain a first encoding result; the processing unit 1402 is further configured to encode the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result; wherein the encoding result comprises the first encoding result and the second encoding result.

In another possible embodiment, the transceiver 1403 is configured to receive a plurality of image frames and indication information sent by a second terminal, where the plurality of image frames includes a first image frame and a second image frame, a zoom magnification of the first image frame is greater than that of the second image frame, the first image frame is determined according to region-of-interest information, the region-of-interest information is used to indicate a position of a region of interest, and the indication information includes the region-of-interest information of the first image frame; the processing unit 1402 is configured to: if the region-of-interest information of the first image frame is different from the region-of-interest information in the first terminal, cutting the second image frame according to the region-of-interest information in the first terminal to obtain a third image frame, and displaying the third image frame; and if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame.

Optionally, in a possible implementation manner, the region of interest indicated in the region of interest information is included in the second image frame.

Optionally, in a possible implementation manner, the image frames further include a fourth image frame, and a zoom magnification of the fourth image frame is the same as that of the second image frame; the fourth image frame, one or more first image frames and the second image frame are sequentially acquired by the second terminal according to a preset rule; the preset rule is that after the second terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

Optionally, in a possible implementation manner, the indication information further includes frame types of the plurality of image frames, and a frame type of the first image frame is different from a frame type of the second image frame; the processing unit 1402 is further configured to sequentially send the first image frame to a first buffer and send the second image frame to a second buffer according to the frame types of the plurality of image frames; the processing unit 1402 is further configured to: if the information of the region of interest of the first image frame in the first buffer area is different from the information of the region of interest in the first terminal, cutting the second image frame in the second buffer area according to the information of the region of interest in the first terminal to obtain a third image frame, and displaying the third image frame; and if the region-of-interest information of the first image frame in the first buffer area is the same as the region-of-interest information in the first terminal, displaying the first image frame in the first buffer area.

Optionally, in a possible implementation manner, the obtaining unit 1401 is configured to obtain an interactive instruction, where the interactive instruction is used to instruct to change the region of interest; the processing unit 1402 is further configured to update the region-of-interest information according to the interaction instruction, so as to obtain updated region-of-interest information; the transceiving unit 1403 is further configured to send the updated region of interest information to the second terminal.

The remote video method provided by the embodiment of the application can be specifically executed by a chip in the terminal, wherein the chip comprises: a processing unit 1402, which processing unit 1402 may be, for example, a processor, and a communication unit, which communication unit may be, for example, an input/output interface, a pin or a circuit, etc. The processing unit 1402 may execute the computer executable instructions stored by the storage unit to cause the chip in the server to perform the remote video method described in the embodiments of fig. 1-13 above. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.

With reference to fig. 15, the present application further provides a computer program product, which in some embodiments, the method disclosed in fig. 5 above may be embodied as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or on other non-transitory media or articles of manufacture.

Fig. 15 schematically illustrates a conceptual partial view of an example computer program product comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein.

In one embodiment, computer program product 1500 is provided using signal bearing medium 1501. The signal bearing medium 1501 may include one or more program instructions 1502, which when executed by one or more processors may provide the functions or portions of the functions described above with respect to fig. 2. Thus, for example, referring to the embodiment illustrated in FIG. 3, one or more of the features of steps 301-306 may be undertaken by one or more instructions associated with the signal bearing medium 1501. Further, program instructions 1502 in FIG. 15 also describe example instructions.

In some examples, the signal bearing medium 1501 may include a computer readable medium 1503 such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), a digital tape, a memory, a ROM or RAM, and so forth.

In some implementations, the signal bearing medium 1501 may include a computer recordable medium 1504 such as, but not limited to, memory, read/write (R/W) CD, R/W DVD, and the like. In some implementations, signal bearing medium 1501 may include a communication medium 1505 such as, but not limited to, digital and/or analog communication media (e.g., fiber optic cables, waveguides, wired communications links, wireless communications links, etc.). Thus, for example, signal bearing medium 1501 may be conveyed by a wireless form of communication medium 1505 (e.g., a wireless communication medium conforming to the IEEE 802.15 standard or other transmission protocol).

The one or more program instructions 1502 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device of the computing device may be configured to provide various operations, functions, or actions in response to program instructions 1502 communicated to the computing device by one or more of computer-readable media 1503, computer-recordable media 1504, and/or communication media 1505.

It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and that some elements may be omitted altogether depending upon the desired results. In addition, many of the described elements are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, in any suitable combination and location.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: u disk, removable hard disk, read only memory, random access memory, magnetic disk or optical disk, etc. for storing program codes.

Claims

1. A remote video method, comprising:

the method comprises the steps that a first terminal acquires a plurality of image frames, wherein the image frames comprise a first image frame and a second image frame, the zooming magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to region-of-interest information, and the region-of-interest information is used for indicating the position of a region of interest;

the first terminal sends the plurality of image frames and indication information to a second terminal, so that the second terminal selects image frames to be displayed according to the indication information, the image frames to be displayed are used for generating videos, and the indication information comprises region-of-interest information of the first image frames.

2. The method of claim 1, wherein a region of interest indicated in the region of interest information is included in the second image frame.

3. The method of claim 1 or 2, wherein the first terminal acquiring the plurality of image frames comprises:

the first terminal acquires a third image frame and the second image frame, wherein the zooming magnification of the third image frame is the same as that of the second image frame;

and the first terminal cuts the third image frame according to the region-of-interest information to obtain the first image frame.

4. The method of claim 1 or 2, wherein the first terminal acquiring the plurality of image frames comprises:

the first terminal acquires one or more third image frames, and the zooming multiplying power of the one or more third image frames is the same as that of the second image frame;

the first terminal cuts the one or more third image frames according to the region-of-interest information to obtain one or more first image frames;

the first terminal determines one of the one or more third image frames as the second image frame.

5. The method of claim 1 or 2, wherein the first terminal acquiring the plurality of image frames comprises:

the first terminal collects the first image frame and the second image frame in an optical zooming mode.

6. The method of any of claims 1-5, wherein the plurality of image frames further comprises a fourth image frame, the fourth image frame having a zoom magnification that is the same as the zoom magnification of the second image frame;

the first terminal acquires a plurality of image frames, including:

the first terminal sequentially collects the fourth image frame, one or more first image frames and the second image frame according to a preset rule;

the preset rule is that after the first terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

7. The method of any of claims 1-6, wherein the first terminal transmitting the plurality of image frames to a second terminal comprises:

the first terminal acquires a reference frame of the plurality of image frames;

the first terminal encodes the plurality of image frames according to the reference frames of the plurality of image frames to obtain an encoding result;

the first terminal sends a coding result to the second terminal;

the reference frame of the first image frame is obtained according to the region-of-interest information, and the zoom magnification of the reference frame of the second image frame is the same as the zoom magnification of the second image frame.

8. The method according to claim 7, wherein the first terminal encodes the plurality of image frames according to a reference frame of the plurality of image frames to obtain an encoding result, and comprises:

the first terminal encodes the first image frame through a first encoder according to the reference frame of the first image frame to obtain a first encoding result;

the first terminal encodes the second image frame through a second encoder according to the reference frame of the second image frame to obtain a second encoding result;

wherein the encoding result comprises the first encoding result and the second encoding result.

9. The method of any of claims 1-8, wherein the indication information further comprises a frame type of the plurality of image frames, wherein the frame type of the first image frame is different from the frame type of the second image frame.

10. A remote video method, comprising:

the method comprises the steps that a first terminal receives a plurality of image frames and indicating information, wherein the image frames are sent by a second terminal, the image frames comprise a first image frame and a second image frame, the zooming magnification of the first image frame is larger than that of the second image frame, the first image frame is determined according to region-of-interest information, the region-of-interest information is used for indicating the position of a region of interest, and the indicating information comprises the region-of-interest information of the first image frame;

if the region-of-interest information of the first image frame is different from the region-of-interest information in the first terminal, cutting a third image frame in the second image frame according to the region-of-interest information in the first terminal, and displaying the third image frame;

and if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame.

11. The method of claim 10, wherein the second image frame includes a region of interest indicated in the region of interest information.

12. The method according to claim 10 or 11, wherein the first image frame is obtained by the second terminal clipping one or more third image frames according to the region-of-interest information in the second terminal after acquiring the one or more third image frames, and the zoom magnification of the one or more third image frames is the same as that of the second image frame.

13. The method of claim 10 or 11, wherein the second image frame is one image frame of the one or more third image frames.

14. The method according to claim 10 or 11, wherein the plurality of image frames are acquired by the second terminal by means of optical zooming.

15. The method of any of claims 10-14, wherein the plurality of image frames further comprises a fourth image frame, the fourth image frame having a zoom magnification that is the same as the zoom magnification of the second image frame;

the fourth image frame, one or more first image frames and the second image frame are sequentially acquired by the second terminal according to a preset rule;

the preset rule is that after the second terminal collects a preset number of image frames based on the region-of-interest information, one image frame is collected by adopting a target zooming magnification, wherein the preset number is the same as the number of one or more first image frames, and the target zooming magnification is the zooming magnification of the second image frame.

16. The method according to any of claims 10-15, wherein the indication information further comprises a frame type of the plurality of image frames, the frame type of the first image frame being different from the frame type of the second image frame;

the method further comprises the following steps:

the first terminal sends the first image frame to a first buffer area and sends the second image frame to a second buffer area in sequence according to the frame types of the image frames;

if the region-of-interest information of the first image frame is different from the region-of-interest information of the first terminal, cutting the second image frame according to the region-of-interest information of the first terminal to obtain a third image frame, and displaying the third image frame, including:

if the information of the region of interest of the first image frame in the first buffer area is different from the information of the region of interest in the first terminal, cutting the second image frame in the second buffer area according to the information of the region of interest in the first terminal to obtain a third image frame, and displaying the third image frame;

if the region-of-interest information of the first image frame is the same as the region-of-interest information in the first terminal, displaying the first image frame, including:

and if the region-of-interest information of the first image frame in the first buffer area is the same as the region-of-interest information in the first terminal, displaying the first image frame in the first buffer area.

17. The method according to any one of claims 10-16, further comprising:

the first terminal acquires an interactive instruction, wherein the interactive instruction is used for indicating to change the region of interest;

the first terminal updates the region-of-interest information according to the interactive instruction to obtain updated region-of-interest information;

and the first terminal sends the updated region-of-interest information to the second terminal.

18. A terminal comprising a memory and a processor; the memory has stored thereon code, the processor being configured to execute the code, the terminal performing the method of any of claims 1 to 17 when the code is executed.

19. A computer readable storage medium comprising computer readable instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 17.

20. A computer program product comprising computer readable instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 17.