CN113726815A - Method and device for dynamically adjusting video - Google Patents

Method and device for dynamically adjusting video

Info

Publication number: CN113726815A (granted as CN113726815B)
Application number: CN202111082259.1A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 郭本浩, 席迎军
Applicant and current assignee: Honor Device Co Ltd
Legal status: Granted; active
Prior art keywords: video data, video, uplink, electronic device, downlink

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/80: Responding to QoS

Abstract

The embodiment of the application provides a method and a device for dynamically adjusting video, relates to the field of terminals, and can ensure video fluency during a video session. The method comprises the following steps: starting a video session in response to a first operation of a user; acquiring uplink video data and detecting the transmission rate of the uplink video data; if the transmission rate of the uplink video data is greater than a first preset threshold, video-encoding the uplink video data with a first GOP length; if the transmission rate of the uplink video data is less than or equal to the first preset threshold, video-encoding the uplink video data with a second GOP length, the second GOP length being greater than the first GOP length; receiving downlink video data and detecting the transmission rate of the downlink video data; if the transmission rate of the downlink video data is greater than a second preset threshold, receiving all encoded frames in the downlink video data; and if the transmission rate of the downlink video data is less than or equal to the second preset threshold, discarding part of the I frames in the downlink video data.

Description

Method and device for dynamically adjusting video
Technical Field
The present application relates to the field of terminals, and in particular, to a method and an apparatus for dynamically adjusting a video.
Background
A video session uses terminal devices and a network to let users in different places talk face to face (each user can see their own video picture and the other party's). In a video session, a user can hear the voices of the other participants through the terminal device, see their images, actions, and expressions, and share electronic presentation content. Through video sessions, users can communicate with people anywhere in the world without leaving home.
During a video session, video fluency is an important factor affecting the user experience. If the video stutters, the quality of the user's communication degrades considerably and time is wasted.
Disclosure of Invention
The embodiment of the application provides a method and a device for dynamically adjusting a video, which can ensure the fluency of the video in a video session process and improve user experience.
To achieve the above purpose, the following technical solutions are adopted:
in a first aspect, a method for dynamically adjusting video is provided, applied to an electronic device, and the method includes: starting a video session in response to a first operation of a user; acquiring uplink video data and detecting the transmission rate of the uplink video data; if the transmission rate of the uplink video data is greater than a first preset threshold, video-encoding the uplink video data with a first group of pictures (GOP) length; if the transmission rate of the uplink video data is less than or equal to the first preset threshold, video-encoding the uplink video data with a second GOP length, the second GOP length being greater than the first GOP length; receiving downlink video data and detecting the transmission rate of the downlink video data; if the transmission rate of the downlink video data is greater than a second preset threshold, receiving all encoded frames in the downlink video data; and if the transmission rate of the downlink video data is less than or equal to the second preset threshold, discarding part of the I frames in the downlink video data.
Based on the method provided by the embodiment of the application, when the transmission rate of the uplink video data (the network uplink speed) is less than or equal to the first preset threshold (i.e., the uplink network environment is poor), the encoder's GOP can be increased for the uplink video data to reduce the amount of uplink video data, thereby ensuring the uplink video data rate. For downlink video data, the electronic device can reduce the amount of downlink video data by discarding I frames in it, thereby ensuring the downlink video data rate. In this way, video fluency during the video session is ensured and user experience is improved.
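For illustration, the first-aspect decision logic can be sketched as follows. This is a minimal, non-authoritative sketch: the class and method names, and the representation of the two thresholds and two GOP lengths as plain parameters, are assumptions for readability, not taken from the patent.

```java
// Minimal sketch of the adaptation policy described above. All names and
// parameter choices are illustrative assumptions, not from the patent.
public final class VideoAdaptationPolicy {
    private final double uplinkThresholdKbps;   // first preset threshold
    private final double downlinkThresholdKbps; // second preset threshold

    public VideoAdaptationPolicy(double uplinkThresholdKbps, double downlinkThresholdKbps) {
        this.uplinkThresholdKbps = uplinkThresholdKbps;
        this.downlinkThresholdKbps = downlinkThresholdKbps;
    }

    /** Choose the encoder GOP length from the measured uplink rate. */
    public int chooseGopLength(double uplinkRateKbps, int firstGop, int secondGop) {
        // secondGop > firstGop: a longer GOP means fewer (large) I frames,
        // so less data is uploaded per unit time over a poor uplink.
        return uplinkRateKbps > uplinkThresholdKbps ? firstGop : secondGop;
    }

    /** Decide whether some received I frames may be discarded. */
    public boolean shouldDropSomeIFrames(double downlinkRateKbps) {
        return downlinkRateKbps <= downlinkThresholdKbps;
    }
}
```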
In one possible design, the method further includes: detecting whether the user is speaking; when the user is not speaking, video-encoding the uplink video data at a first code rate; and when the user is speaking, video-encoding the uplink video data at a second code rate, the second code rate being higher than the first code rate. That is, the video may be enhanced based on the user's speech (e.g., the user's face region is enhanced). When the user speaks, the face region is encoded at the higher code rate so that the user's face is clearer and other users can clearly see the user's facial expression. When the user is not speaking, both the face region and the rest of the picture are encoded at the lower code rate, reducing the amount of uplink video data and thereby ensuring the uplink video data rate.
In one possible design, video encoding the uplink video data at the second code rate includes: performing video coding on a region of interest (ROI) of human eyes and regions except the ROI in the uplink video data at a second code rate; or carrying out video coding on a human eye region of interest (ROI) in the uplink video data at a second code rate, and carrying out video coding on regions except the ROI at a first code rate. Therefore, the face of the user can be clearer, other users can see the facial expression of the user clearly, and user experience is improved.
In one possible design, the ROI includes at least one of a face region, a body region, or an object region. The human body region may include a human face region and a body region or only includes a body region, which is not limited in the present application.
In one possible design, the video session includes at least one of a video call, a video conference, or a live connection (live connect-mic). Based on the method provided by the embodiment of the application, video fluency can be ensured while the user is in a video call, video conference, or live connect-mic session, improving user experience.
In one possible design, discarding part of the I frames in the downlink video data includes: discarding an I frame every X frames, the value of X being determined from the GOP of the downlink video data. For example, the value of X may be an integer multiple of the GOP of the downlink video data; when the GOP is 30, X may be 30, 60, 90, and so on.
In one possible design, the electronic device includes a video session application, a camera driving module, and a camera, and acquiring the uplink video data includes: the video session application calls a camera to acquire video data through a camera driving module; the camera collects video data at a first frame rate. That is to say, the uplink video data can be collected through the camera driving module and the camera.
In one possible design, the electronic device further includes an encoding module, and the method further includes: the encoding module sends the uplink video data subjected to video encoding to a video session application; and the video session application sends the uplink video data subjected to video coding to the cloud server.
In one possible design, the electronic device includes a network module, and receiving downstream video data includes: video data of other participating members in the video session is received over the network. The electronic equipment can play the video pictures of the other participants based on the video data of the other participants.
In one possible design, the first preset threshold is determined according to the resolution and frame rate of the uplink video data, and the second preset threshold is determined according to the resolution and frame rate of the downlink video data. When the network uplink speed (the transmission rate of the uplink video data) is less than or equal to the first preset threshold, the uplink video data can be processed accordingly so that the user does not perceive the video session as stuttering, ensuring the user's viewing experience at different network uplink speeds in the video conference. When the network downlink speed (the transmission rate of the downlink video data) is less than or equal to the second preset threshold, the downlink video data can be processed accordingly so that the user does not perceive the video session as stuttering, ensuring the user's viewing experience at different network downlink speeds in the video conference.
In one possible design, video encoding the upstream video data at the first group of pictures GOP length includes: performing video coding on uplink video data according to the length of a first group of pictures (GOP) and a first frame rate; video encoding the upstream video data with the second GOP length includes: and carrying out video coding on the uplink video data at the second GOP length and the first frame rate. That is, when the network uplink speed (the transmission rate of the uplink video data) is less than or equal to the first preset threshold, the GOP can be increased on the premise that the frame rate is not changed. Therefore, the data uploading amount in unit time can be reduced, and the aim of improving the fluency of the video session is fulfilled.
In a second aspect, an electronic device is provided, which has the function of implementing the method of the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
In a third aspect, an electronic device is provided, including: a processor and a memory; the memory is configured to store computer executable instructions, and when the electronic device is running, the processor executes the computer executable instructions stored in the memory to cause the electronic device to perform the method according to any one of the first aspect.
In a fourth aspect, an electronic device is provided, comprising: a processor; the processor is configured to be coupled to the memory and to perform the method according to any one of the above first aspects after reading the instructions in the memory.
In a fifth aspect, there is provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the method of any of the above first aspects.
A sixth aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first aspects above.
In a seventh aspect, an apparatus (e.g., the apparatus may be a system-on-a-chip) is provided, which includes a processor for enabling an electronic device to implement the functions recited in the first aspect. In one possible design, the apparatus further includes a memory for storing program instructions and data necessary for the electronic device. When the device is a chip system, the device may be composed of a chip, or may include a chip and other discrete devices.
For technical effects brought by any one of the design manners in the second aspect to the seventh aspect, reference may be made to technical effects brought by different design manners in the first aspect, and details are not repeated here.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a method for dynamically adjusting video according to an embodiment of the present disclosure;
fig. 3 is a schematic view of a processing flow of uplink video data according to an embodiment of the present application;
fig. 4 is a schematic diagram of GOPs at different network uplink speeds according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a human voice-based video enhancement provided in an embodiment of the present application;
FIG. 6 is a schematic illustration of a display provided by an embodiment of the present application;
fig. 7 is a schematic view of a processing flow of downlink video data according to an embodiment of the present application;
fig. 8 is a schematic diagram of a GOP at different network downlink speeds according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a chip system according to an embodiment of the present disclosure.
Detailed Description
For clarity and conciseness of the following description of the various embodiments, a brief introduction to related concepts or technologies is first presented:
and (3) encoding the frame: in a video coding sequence, three types of coded frames are mainly included: i-frames, P-frames, and B-frames. The I frame, i.e., Intra-coded picture, is encoded using only information of the frame without referring to other image frames. A P frame, i.e., a Predictive coded picture (forward Predictive frame), is inter-frame Predictive coded using a motion prediction method using a previous I frame or P frame. The B frame, i.e. bidirectional predicted picture, provides the highest compression ratio, and requires both a previous image frame (I frame or P frame) and a subsequent image frame (P frame), and performs inter-frame bidirectional predictive coding by using motion prediction. The byte number occupied by one I frame is larger than that of one P frame, and the byte number occupied by one P frame is larger than that of one B frame. In short, an I-frame records a complete picture, while P-frames and B-frames record changes relative to the I-frame. Without an I-frame, P-frames and B-frames cannot be decoded.
Group of pictures (GOP): a GOP is a group of consecutive pictures (i.e., frames). In a video coding sequence, the GOP is the distance between two I frames, i.e., one I frame occurs every GOP frames. For example, a GOP of 120 means an I frame occurs once every 120 frames. If the video is 720p60 (720p denotes the resolution and 60 the frame rate, i.e., the number of pictures refreshed per second), an I frame then occurs every 2 s.
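As a quick check of the relationship just described, the spacing between I frames is simply the GOP length divided by the frame rate; the tiny helper below (illustrative only) makes the 720p60 example concrete.

```java
// I-frame spacing implied by a GOP length and a frame rate.
// Matches the example above: a GOP of 120 at 60 fps gives an I frame every 2 s.
static double iFrameIntervalSeconds(int gopLength, double framesPerSecond) {
    return gopLength / framesPerSecond;
}
// iFrameIntervalSeconds(120, 60.0) == 2.0
```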
Code rate (bitrate): the number of data bits transmitted per unit time, in kbps (kilobits per second). The higher the code rate, the higher the precision and the closer the encoded file is to the original. For video, a higher code rate gives a clearer picture, while a lower code rate gives a coarse, blocky (mosaic) picture.
With the code rate unchanged, the larger the GOP value, the more P frames and B frames there are, the more picture detail is retained, and the better the picture quality.
Frame rate (fps): the frequency (rate) at which images, in units of frames, appear on the display per unit time, i.e., the number of picture frames shown within 1 second; it can also be understood as the number of times the graphics processor refreshes per second. Frame rate affects picture fluency and is proportional to it: the larger the frame rate, the smoother the picture; the smaller the frame rate, the more jerky the picture. Owing to the physiology of the human eye, a picture viewed at a frame rate above 16 is perceived as continuous; this phenomenon is called persistence of vision.
Image Signal Processor (ISP): post-processes the signal output by the front-end image sensor. The main functions of an ISP are linear correction, noise removal, dead-pixel removal, interpolation, white balance, automatic exposure control, etc. Under different optical conditions, the ISP helps restore scene detail, improving the camera's imaging quality. ISPs come in both standalone and integrated forms.
YUV: a color space (color space) in which "Y" represents luminance (Luma) and "U" and "V" represent chrominance (Chroma). YUV-based color coding is a common coding scheme for streaming media.
SurfaceFlinger: an independent system service (Service) that receives graphical display data from multiple sources, composites it, and sends it to the display device. For example, an application typically has three display layers: a status bar at the top, a navigation bar at the bottom or side, and the application's own interface. Each layer is updated and rendered separately; SurfaceFlinger finally composites the layers into one, and the result is displayed through hardware.
Region of interest (ROI) of human eye: human eyes place different degrees of importance on different parts of a video scene. The human eye is usually more sensitive to information such as moving objects, textures, colors, shapes, etc. in the video, and these regions to which the human eye is more sensitive are the ROI. For example, in a video call, a person often focuses on information such as a user's facial expression in the video. The face region in a video call is a type of ROI. The image quality of the ROI directly affects the user experience. The better the image quality of the ROI, the higher the user experience.
The embodiment of the application provides a method for dynamically adjusting video that can ensure the fluency of a video session. In a weak-network environment (when network conditions are poor), for uplink video data, the electronic device can appropriately increase the GOP according to the network load to reduce the amount of uplink video data, thereby ensuring the uplink video data rate. Moreover, the video can be enhanced based on the user's speech (the user's face region is enhanced): when the user speaks, the face region is encoded at a higher code rate so that the user's face is clearer and other users can clearly see the user's facial expression; when the user is not speaking, both the face region and the rest of the picture are encoded at a lower code rate, reducing the amount of uplink video data and ensuring the uplink video data rate. For downlink video data, the electronic device can reduce the amount of downlink video data by discarding I frames in it, thereby ensuring the downlink video data rate. The video session may include a video call, a video conference, and a live connect-mic session.
In the embodiment of the present application, the electronic device may include a mobile phone, a Personal Computer (PC), a tablet computer, a desktop computer (desktop computer), a handheld computer, a notebook computer (laptop), an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), a router, a television, and other devices. Or, the electronic device may include a sound box, a camera, an air conditioner, a refrigerator, an intelligent curtain, a desk lamp, a ceiling lamp, an electric cooker, a security device (such as an intelligent electronic lock), a robot, a sweeper, an intelligent scale, and other devices that can access the home wireless lan. Or, the electronic device may include wearable devices such as an intelligent headset, intelligent glasses, an intelligent watch, an intelligent bracelet, an Augmented Reality (AR) \ Virtual Reality (VR) device, a wireless locator, a Tracker (Tracker), and an electronic collar, and the electronic device in this embodiment may also be a device such as a car audio and a car air conditioner. The embodiment of the present application does not particularly limit the specific form of the electronic device.
As shown in fig. 1, in the embodiment of the present application, an electronic device 200 (such as a mobile phone) is taken as an example, and a structure of the electronic device provided in the embodiment of the present application is illustrated. The electronic device 200 (e.g., a cell phone) may include: the mobile communication device includes a processor 210, an external memory interface 220, an internal memory 221, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, an earphone interface 270D, a sensor module 280, a button 290, a motor 291, an indicator 292, a camera 293, a display 294, and a Subscriber Identity Module (SIM) card interface 295.
The sensor module 280 may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
Processor 210 may include one or more processing units, such as: the processor 210 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device 200. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 210 for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache memory. The memory may hold instructions or data that have just been used or recycled by processor 210. If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 210, thereby increasing the efficiency of the system.
In some embodiments, processor 210 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the connection relationship between the modules illustrated in the present embodiment is only an exemplary illustration, and does not limit the structure of the electronic device 200. In other embodiments, the electronic device 200 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charge management module 240 is configured to receive a charging input from a charger. The charger may be a wireless charger or a wired charger. The charging management module 240 may also supply power to the electronic device through the power management module 241 while charging the battery 242.
The power management module 241 is used to connect the battery 242, the charging management module 240 and the processor 210. The power management module 241 receives input from the battery 242 and/or the charging management module 240, and provides power to the processor 210, the internal memory 221, the external memory, the display 294, the camera 293, and the wireless communication module 260. In some embodiments, the power management module 241 and the charging management module 240 may also be disposed in the same device.
The wireless communication function of the electronic device 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, the baseband processor, and the like. In some embodiments, antenna 1 of electronic device 200 is coupled to mobile communication module 250 and antenna 2 is coupled to wireless communication module 260, such that electronic device 200 may communicate with networks and other devices via wireless communication techniques.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 200 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 250 may provide a solution including 2G/3G/4G/5G wireless communication applied on the electronic device 200. The mobile communication module 250 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 250 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation.
The mobile communication module 250 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 250 may be disposed in the processor 210. In some embodiments, at least some of the functional modules of the mobile communication module 250 may be disposed in the same device as at least some of the modules of the processor 210.
The wireless communication module 260 may provide a solution for wireless communication applied to the electronic device 200, including WLAN (e.g., wireless fidelity, Wi-Fi) network, Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like.
The wireless communication module 260 may be one or more devices integrating at least one communication processing module. The wireless communication module 260 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 210. The wireless communication module 260 may also receive a signal to be transmitted from the processor 210, frequency-modulate and amplify the signal, and convert the signal into electromagnetic waves via the antenna 2 to radiate the electromagnetic waves.
The electronic device 200 implements display functions via the GPU, the display screen 294, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 294 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 294 is used to display images, video, and the like. The display screen 294 includes a display panel.
The electronic device 200 may implement a shooting function through the ISP, the camera 293, the video codec, the GPU, the display screen 294, and the application processor. The ISP is used to process the data fed back by the camera 293. The camera 293 is used to capture still images or video. In some embodiments, electronic device 200 may include 1 or N cameras 293, N being a positive integer greater than 1.
The external memory interface 220 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 200. The external memory card communicates with the processor 210 through the external memory interface 220 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
Internal memory 221 may be used to store computer-executable program code, including instructions. The processor 210 executes various functional applications of the electronic device 200 and data processing by executing instructions stored in the internal memory 221. For example, in the present embodiment, the processor 210 may execute instructions stored in the internal memory 221, and the internal memory 221 may include a program storage area and a data storage area.
The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (e.g., audio data, a phone book, etc.) created during use of the electronic device 200, and the like. In addition, the internal memory 221 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
Electronic device 200 may implement audio functions via audio module 270, speaker 270A, receiver 270B, microphone 270C, headset interface 270D, and an application processor, among other things. Such as music playing, recording, etc.
The keys 290 include a power-on key, a volume key, etc. The keys 290 may be mechanical keys. Or may be touch keys. The motor 291 may generate a vibration cue. The motor 291 can be used for both incoming call vibration prompting and touch vibration feedback. Indicator 292 may be an indicator light that may be used to indicate a state of charge, a change in charge, or may be used to indicate a message, missed call, notification, etc. The SIM card interface 295 is used to connect a SIM card. The SIM card can be attached to and detached from the electronic apparatus 200 by being inserted into the SIM card interface 295 or being pulled out from the SIM card interface 295. The electronic device 200 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 295 may support a Nano SIM card, a Micro SIM card, a SIM card, etc.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic apparatus 200. In other embodiments, electronic device 200 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. Where in the description of the present application, "/" indicates a relationship where the objects associated before and after are an "or", unless otherwise stated, for example, a/B may indicate a or B; in the present application, "and/or" is only an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. Also, in the description of the present application, "a plurality" means two or more than two unless otherwise specified. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple. In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance. Also, in the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as examples, illustrations or illustrations. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.
As shown in fig. 2, an embodiment of the present application provides a method for dynamically adjusting a video, which is applied to an electronic device, and includes:
001. a video session is started.
The video session may include a video call, a video conference, or a live connect-mic session.
A video session application on the electronic device (e.g., a PC) may receive a user operation to open a video session (e.g., clicking a button to start a video conference, video call, or live connect-mic); in response to this operation, the electronic device may open the video session.
The video session application may include a video conferencing application, an instant messaging application, a live streaming/video application, and the like. (The original text identifies specific applications by logo images.)
002. And detecting the uplink speed of the network.
In some embodiments, the network uplink speed (i.e., the transmission rate of the uplink video data) may be monitored in the background in real time during the video session. When the network uplink speed is less than or equal to the first preset threshold, the uplink video data may be processed accordingly (see steps 004 and 005) so that the user does not perceive the video session as stuttering, ensuring the user's viewing experience at different network uplink speeds in the video conference.
For example, on Windows or Android, the current device's network state can be queried through system APIs. Taking Android as an example, information such as the current network connection type and the network uplink speed can be acquired through the ConnectivityManager class. The electronic device may be connected to a 4G or 5G network, to a WiFi network, or to both a 4G/5G network and a WiFi network simultaneously (i.e., in dual-connection mode), which is not limited in this application.
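A sketch of such a query on Android is shown below. It uses only documented ConnectivityManager/NetworkCapabilities calls; note that getLinkUpstreamBandwidthKbps() reports the link's estimated capability rather than a measured transfer rate, so a real implementation might instead measure its own send throughput.

```java
import android.content.Context;
import android.net.ConnectivityManager;
import android.net.Network;
import android.net.NetworkCapabilities;

final class NetworkSpeedProbe {
    /** Returns the estimated uplink bandwidth in kbps, or -1 if unknown. */
    static int estimatedUplinkKbps(Context context) {
        ConnectivityManager cm =
                (ConnectivityManager) context.getSystemService(Context.CONNECTIVITY_SERVICE);
        Network active = cm.getActiveNetwork();                // API 23+
        if (active == null) return -1;
        NetworkCapabilities caps = cm.getNetworkCapabilities(active);
        if (caps == null) return -1;
        // The connection type (Wi-Fi / cellular / both) can also inform policy.
        boolean onWifi = caps.hasTransport(NetworkCapabilities.TRANSPORT_WIFI);
        boolean onCellular = caps.hasTransport(NetworkCapabilities.TRANSPORT_CELLULAR);
        return caps.getLinkUpstreamBandwidthKbps();
    }
}
```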
003. And judging whether the uplink speed of the network is less than or equal to a first preset threshold value.
During a video session, when the electronic device captures video through a camera, the minimum uplink network threshold corresponding to that video (i.e., the first preset threshold) can be calculated from the video's resolution and frame rate. That is, the first preset threshold is determined according to the resolution and frame rate of the uplink video data of the current video session.
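The patent does not give the formula relating the threshold to resolution and frame rate. One plausible sketch, assuming the threshold tracks an estimated minimum stream bitrate via a bits-per-pixel factor (the factor 0.1 is purely an illustrative assumption):

```java
// Estimate a minimum uplink threshold (kbps) from resolution and frame rate.
// BITS_PER_PIXEL is an assumed compression-efficiency factor, not from the patent.
static double minUplinkThresholdKbps(int width, int height, double fps) {
    final double BITS_PER_PIXEL = 0.1;
    double bitsPerSecond = width * height * fps * BITS_PER_PIXEL;
    return bitsPerSecond / 1000.0;
}
// e.g. minUplinkThresholdKbps(1280, 720, 30) is about 2765 kbps
```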
004. And if the network uplink speed is less than or equal to the first preset threshold value, adjusting the GOP of the encoder.
Fig. 3 is a schematic flow chart of the processing of uplink video data (which may also be referred to as the uplink video stream). The video session application can call the camera to collect video data through modules such as the camera driver. It can be understood that video data consists of multiple frames of images: the camera collecting video data means the camera captures images/pictures at a certain frame rate. For example, the camera may capture images of user A at 60 frames per second. After the camera collects an image, it can send the image to the ISP module for processing. The ISP module sends the processed image to the encoding module for video encoding. The encoding module sends the encoded video data (the uplink video data) to the video session application (e.g., a video conferencing or instant messaging application). The video session application then sends the uplink video data to the server, and the server forwards the video data to the other participants (e.g., user B) so that they can see the picture of the user (e.g., user A).
In the process of video coding by the coding module, the value of the GOP is not fixed, and the value of the GOP may be determined according to the network uplink speed. If the transmission rate (network uplink speed) of the uplink video data is greater than a first preset threshold, performing video coding on the uplink video data by using a first GOP length; if the transmission rate of the uplink video data is less than or equal to the first preset threshold, the uplink video data can be subjected to video coding by a second GOP length, and the second GOP length is greater than the first GOP length. Wherein the frame rate (e.g., the first frame rate) of the upstream video data is unchanged.
For example, as shown in fig. 4, when the network uplink speed is greater than the first preset threshold, the GOP may be 5, and when the network uplink speed is less than or equal to the first preset threshold, the GOP may be 8. Namely, when the network uplink speed is less than or equal to the first preset threshold, the GOP can be increased. Therefore, the data uploading amount in unit time can be reduced, and the aim of improving the fluency of the video session is fulfilled.
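On Android, the chosen GOP could be applied when configuring the encoder, for example with MediaCodec as sketched below. MediaFormat.KEY_I_FRAME_INTERVAL is expressed in seconds, so a GOP length in frames is converted using the (unchanged) frame rate; recreating the encoder to change the GOP mid-session is an assumption, since the patent does not specify the mechanism.

```java
import android.media.MediaCodec;
import android.media.MediaCodecInfo;
import android.media.MediaFormat;

import java.io.IOException;

final class EncoderFactory {
    static MediaCodec createH264Encoder(int width, int height, int bitrateBps,
                                        int frameRate, int gopLengthFrames) throws IOException {
        MediaFormat format =
                MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height);
        format.setInteger(MediaFormat.KEY_BIT_RATE, bitrateBps);
        format.setInteger(MediaFormat.KEY_FRAME_RATE, frameRate);
        // A GOP of e.g. 8 frames at 30 fps means one I frame roughly every 8/30 s.
        // KEY_I_FRAME_INTERVAL takes whole seconds here, so this is an approximation.
        format.setInteger(MediaFormat.KEY_I_FRAME_INTERVAL,
                Math.max(1, Math.round((float) gopLengthFrames / frameRate)));
        format.setInteger(MediaFormat.KEY_COLOR_FORMAT,
                MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface);
        MediaCodec codec = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC);
        codec.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE);
        return codec;
    }
}
```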
005. Video enhancement is performed based on human voice.
As shown in fig. 5, step 005 may include step 0051, step 0052 and step 0053. Wherein:
0051. Whether the user is speaking can be detected during the video session.
The voice signal (voice) of the user can be collected through the microphone during the video session.
0052. When the user does not speak, that is, the voice signal of the user is not acquired through the microphone, the video coding can be performed with a lower code rate (for example, the first code rate) to reduce the uplink video data amount (reduce the uploading load of the network), thereby ensuring the uplink video data rate and improving the picture fluency of the video session.
0053. If the voice is collected by the microphone, the uplink video data can be subjected to video coding at the second code rate in the coding process. The second code rate is higher than the first code rate. Therefore, other participants can be ensured to observe the user changes (facial expression changes, gesture changes and the like) more clearly, and the user experience is ensured.
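On Android, the switch between the two code rates could be done at runtime without reconfiguring the encoder, as sketched below using MediaCodec's documented runtime bitrate parameter; the two rate values are assumptions.

```java
import android.media.MediaCodec;
import android.os.Bundle;

final class BitrateController {
    static final int FIRST_CODE_RATE_BPS  = 500_000;   // not speaking: lower rate (assumed)
    static final int SECOND_CODE_RATE_BPS = 1_500_000; // speaking: higher rate (assumed)

    /** Adjust the encoder's target bitrate on the fly. */
    static void applyCodeRate(MediaCodec encoder, boolean userIsSpeaking) {
        Bundle params = new Bundle();
        params.putInt(MediaCodec.PARAMETER_KEY_VIDEO_BITRATE,
                userIsSpeaking ? SECOND_CODE_RATE_BPS : FIRST_CODE_RATE_BPS);
        encoder.setParameters(params);
    }
}
```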
Whether the microphone has picked up a human voice can be determined from the voice frequency range. The spectrum of human speech lies mainly in the range of about 500 Hz to 1 kHz; after the microphone collects sound, a human voice is considered detected when this band is found in the audio spectrum.
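One simple way to test for energy in the stated 500 Hz to 1 kHz band is the Goertzel algorithm applied to a PCM frame, as sketched below. This is an illustrative approach, not the patent's stated method; the probe spacing and threshold are assumptions, and production systems typically use a dedicated voice activity detector.

```java
final class VoiceBandDetector {
    /** Power of one frequency bin computed with the Goertzel recurrence. */
    private static double goertzelPower(short[] pcm, int sampleRate, double freqHz) {
        double k = 2.0 * Math.cos(2.0 * Math.PI * freqHz / sampleRate);
        double s0 = 0, s1 = 0, s2 = 0;
        for (short sample : pcm) {
            s0 = sample + k * s1 - s2;
            s2 = s1;
            s1 = s0;
        }
        return s1 * s1 + s2 * s2 - k * s1 * s2;
    }

    /** True if average energy in the 500 Hz-1 kHz band exceeds the threshold. */
    static boolean voiceBandActive(short[] pcm, int sampleRate, double threshold) {
        double band = 0;
        for (double f = 500; f <= 1000; f += 125) {  // probe frequencies (assumed spacing)
            band += goertzelPower(pcm, sampleRate, f);
        }
        return band / pcm.length > threshold;        // threshold is an assumption
    }
}
```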
In one possible design, the ROI (e.g., the face region) in the uplink video data may be video-encoded at the second code rate and the regions outside the ROI at the first code rate. This improves the image quality of the face region in the uplink video data. Moreover, the code rate need not be raised for all of the uplink video data, which avoids a large increase in the amount of uplink video data (and the resulting drop in the uplink video data rate), so the picture fluency of the video session can be ensured.
In another possible design, the ROI and regions other than the ROI in the upstream video data may be video encoded at a second code rate. That is, the code rate of all the uplink video data can be increased, the ROI does not need to be detected, and time can be saved.
In this embodiment of the application, the ROI of the video stream may include a face region, a body region (including the face region and the body region or only including the body region), an object in the background, and the like, which is not limited in this application.
For example, assuming that the ROI of the video stream is a human body region, when video coding is performed at the first coding rate, the video picture may be as shown in (a) of fig. 6. When the ROI of the video stream is video-encoded at the second code rate, the video picture may be as shown in (b) of fig. 6.
It should be understood that other participants may be more concerned with information such as the user's facial expressions when the user speaks. Therefore, the code rate of the face area in the uplink video data is improved, and the user experience can be guaranteed.
In another possible design, when the body motions (e.g., gestures) of the user change significantly, the bitrate of the human body region in the video stream can be increased, so that other participants can observe the gesture changes of the user more clearly, and the user experience is guaranteed.
Specifically, the user's hand can be identified by an algorithm, and whether the hand has moved is judged by comparing consecutive frames; if the hand's displacement is greater than a threshold, the user's gesture can be considered to have changed.
In yet another possible design, when the facial expression of the user changes significantly (for example, the mouth is closed or the mouth corners are raised), the bitrate of the face region in the video stream can be increased, so that other participants can observe the facial expression change of the user more clearly, and the user experience is guaranteed.
Specifically, the user's facial key points (marking facial features such as the nose, eyes, and mouth) can be identified, and whether the key points have moved is judged by comparing consecutive frames; if the displacement of the facial key points is greater than a threshold, the user's facial expression can be considered to have changed.
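Both displacement tests above reduce to the same check: compare tracked points between consecutive frames and flag a change when the displacement exceeds a threshold. A sketch follows; the packed point layout and the use of an average displacement are assumptions, and the key-point detector itself is assumed to exist elsewhere.

```java
final class MotionChangeDetector {
    /** Points are packed as {x0, y0, x1, y1, ...} in pixels. */
    static boolean significantChange(float[] prevPoints, float[] currPoints, float thresholdPx) {
        int n = Math.min(prevPoints.length, currPoints.length) / 2;
        if (n == 0) return false;
        double total = 0;
        for (int i = 0; i < n; i++) {
            double dx = currPoints[2 * i] - prevPoints[2 * i];
            double dy = currPoints[2 * i + 1] - prevPoints[2 * i + 1];
            total += Math.hypot(dx, dy);  // per-point displacement between frames
        }
        return total / n > thresholdPx;   // average displacement vs. threshold
    }
}
```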
It should be noted that there is no necessary execution sequence between step 004 and step 005, and step 004 may be executed first, and then step 005 is executed; step 005 may be performed first, and then step 004 may be performed; steps 004 and 005 may be performed simultaneously, which is not particularly limited in this embodiment.
006. And detecting the downlink speed of the network.
In some embodiments, the network downlink speed (i.e., the transmission rate of the downlink video data) may be monitored in the background in real time during the video session. When the network downlink speed is less than or equal to the second preset threshold, the downlink video data may be processed accordingly (see step 008) so that the user does not perceive the video session as stuttering, ensuring the user's viewing experience at different network downlink speeds in the video conference.
For example, on Windows or Android, the current device's network state can be queried through system APIs. Taking Android as an example, information such as the current network connection type and the network downlink speed can be acquired through the ConnectivityManager class. The electronic device may be connected to a 4G or 5G network, to a WiFi network, or to both a 4G/5G network and a WiFi network simultaneously (i.e., in dual-connection mode), which is not limited in this application.
007. And judging whether the network downlink speed is less than or equal to a second preset threshold value.
In the video session, when the electronic device acquires a video sent by the opposite end, the minimum downlink network threshold (i.e., the second preset threshold) corresponding to the video can be calculated according to the resolution and the frame rate of the video. That is, the second preset threshold is determined according to the resolution and the frame rate of the downstream video stream of the current video session.
If the transmission rate (network downlink speed) of the downlink video data is greater than a second preset threshold, receiving all the coding frames in the downlink video data; if the transmission rate of the downlink video data is less than or equal to the second preset threshold, discarding a part of I frames in the downlink video data, i.e., performing step 008.
008. And if the network downlink speed is less than or equal to a second preset threshold value, discarding part of I frames in the downlink video data.
Fig. 7 is a schematic diagram of the processing flow for downlink video data (which may also be referred to as the downlink video stream). The electronic device may first obtain the video data of the other participating members (i.e., the downlink video data) through the network packet-receiving module. The received downlink video data is decoded by a decoder, and the decoded data is color-converted by the YUV module to obtain YUV data. The YUV data is rendered to obtain layer 1 (used to display a user's video picture). Layer 1, layer 2, and layer 3 are then composited by a layer composition module (e.g., SurfaceFlinger), and the composited data is finally sent to the display screen for display. Layer 2 and layer 3 may be the layers corresponding to the status bar and the navigation bar, respectively.
When the network downlink load is high or the network downlink speed is low (for example, lower than a second preset threshold), the electronic device may discard a part of the I frames when performing network packet reception, so as to preferentially ensure the fluency of the video. The electronic device receiving the network packet means that the electronic device receives a data packet sent by other devices, and the data packet includes a video data packet. The video data packet includes data corresponding to I frames, B frames, and P frames. Optionally, the data packets sent by other devices may also include an audio data packet, which is not limited in this application. It should be understood that I-frames are reference frames of video, and the amount of data for I-frames is greatest relative to B-frames and P-frames. And the data volume of the downlink video can be reduced by discarding part of I frames, so that the fluency of the video can be ensured.
In one possible design, an I-frame may be dropped every X-frames. Wherein the value of X may be determined from a GOP of the downstream video data. For example, the value of X may be an integer multiple of the GOP of the downstream video data. For example, when the GOP is 30, the value of X is 30, 60, 90, etc.
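A receive-side filter implementing this rule might look like the sketch below; how the frame type is recognized from the packet payload is assumed to be handled elsewhere.

```java
enum FrameType { I, P, B }

// Sketch of the receive-side drop rule: under a slow downlink, drop the I frame
// that falls on each X-frame boundary, with X a multiple of the stream's GOP.
final class DownlinkFrameFilter {
    private final int dropPeriodX;   // e.g. GOP 30 -> X = 30, 60, 90, ...
    private long frameCount = 0;

    DownlinkFrameFilter(int gopLength, int multiple) {
        this.dropPeriodX = gopLength * multiple;
    }

    /** Returns true if this frame should be kept, false if discarded. */
    boolean accept(FrameType type, boolean downlinkBelowThreshold) {
        frameCount++;
        if (!downlinkBelowThreshold) return true;  // normal reception: keep everything
        return !(type == FrameType.I && frameCount % dropPeriodX == 0);
    }
}
```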
Illustratively, as shown in fig. 8, assume the GOP is 5. When the network downlink speed is greater than the second preset threshold, I frames, B frames, and P frames are received normally. When the network downlink speed is less than or equal to the second preset threshold, an I frame can be discarded once every 10 frames to reduce the amount of downlink video data, ensuring video fluency. Although the electronic device can no longer reference the dropped I frame, it can still decode the B frames and P frames. Then, if the network downlink speed recovers, i.e., becomes greater than the second preset threshold again, I frames need no longer be discarded, ensuring the quality (definition) of the video.
In addition, when the network downlink load is high or the network downlink speed is low, when the electronic device receives a packet through the network, part of the B frames or the P frames can be discarded to preferentially ensure the fluency of the video. In one possible design, a B frame or a P frame may be dropped every M frames. M is an integer greater than or equal to 2. Alternatively, the value of M may be an integer multiple of the GOP of the downstream video data. For example, when the GOP is 20, the value of M is 40, 60, 80, etc.
It should be noted that there is no necessary execution sequence between step 002-step 005 and step 006-step 007, and step 002-step 005 may be executed first, and then step 006-step 007 may be executed; alternatively, step 006-step 007 may be performed first, and then step 002-step 005 may be performed; step 002-step 005 and step 006-step 007 may be performed simultaneously, but this embodiment is not particularly limited thereto.
Based on the method provided by the embodiment of the application, when the network uplink speed is less than or equal to the first preset threshold (namely, the uplink network environment is poor), for the uplink video data, the electronic device can improve the GOP of the encoder on the basis that the frame rate is kept unchanged so as to reduce the data volume of the uplink video data, thereby ensuring the rate of the uplink video data. Also, the video may be enhanced based on the user utterance (e.g., enhancing a face region of the user). When the user speaks, the face area is coded by the high code rate, so that the face of the user is clearer, and other users can clearly see the facial expression of the user. When the user does not speak, the human face and the region except the human face are coded by adopting a lower code rate, so that the uplink video data volume is reduced, and the rate of the uplink video data is ensured. For downlink video data, the electronic device may reduce the amount of the downlink video data by discarding the I frame in the downlink video data, thereby ensuring the rate of the downlink video data.
An embodiment of the present application further provides a chip system, as shown in fig. 9, where the chip system includes at least one processor 901 and at least one interface circuit 902. The processor 901 and the interface circuit 902 may be interconnected by wires. For example, the interface circuit 902 may be used to receive signals from other devices (e.g., a memory of an electronic device). Also for example, the interface circuit 902 may be used to send signals to other devices, such as the processor 901.
For example, the interface circuit 902 may read instructions stored in a memory in the electronic device and send the instructions to the processor 901. The instructions, when executed by the processor 901, may cause an electronic device (e.g., the electronic device 200 shown in fig. 1) to perform the various steps in the embodiments described above.
Of course, the chip system may further include other discrete devices, which is not specifically limited in this embodiment of the present application.
Embodiments of the present application also provide a computer-readable storage medium, which includes computer instructions, and when the computer instructions are executed on an electronic device (such as the electronic device 200 shown in fig. 1), the electronic device 200 executes various functions or steps performed by the electronic device in the above-described method embodiments.
Embodiments of the present application further provide a computer program product, which, when running on a computer, causes the computer to execute each function or step performed by the electronic device in the above method embodiments.
The embodiment of the present application further provides a processing apparatus, where the processing apparatus may be divided into different logic units or modules according to functions, and each unit or module executes different functions, so that the processing apparatus executes each function or step executed by the electronic device in the foregoing method embodiments.
From the above description of the embodiments, it is obvious for those skilled in the art to realize that the above function distribution can be performed by different function modules according to the requirement, that is, the internal structure of the device is divided into different function modules to perform all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for dynamically adjusting video, applied to an electronic device, wherein the method comprises the following steps:
in response to a first operation of a user, starting a video session;
acquiring uplink video data and detecting the transmission rate of the uplink video data;
if the transmission rate of the uplink video data is greater than a first preset threshold, performing video coding on the uplink video data with a first group of pictures (GOP) length;
if the transmission rate of the uplink video data is less than or equal to the first preset threshold, performing video coding on the uplink video data with a second GOP length, wherein the second GOP length is greater than the first GOP length;
receiving downlink video data and detecting the transmission rate of the downlink video data;
if the transmission rate of the downlink video data is greater than a second preset threshold, receiving all coded frames in the downlink video data; and
if the transmission rate of the downlink video data is less than or equal to the second preset threshold, discarding part of the I frames in the downlink video data.
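The core decision logic of claim 1 can be illustrated with the following minimal sketch. The threshold values, GOP lengths, and helper names are assumptions introduced here for illustration only; none of them are values taken from the claims.

```python
# Illustrative sketch of the claim-1 logic; all constants are assumed.
UPLINK_THRESHOLD_BPS = 2_000_000    # "first preset threshold" (assumed value)
DOWNLINK_THRESHOLD_BPS = 2_000_000  # "second preset threshold" (assumed value)
FIRST_GOP_LENGTH = 30               # shorter GOP: more frequent I frames
SECOND_GOP_LENGTH = 120             # longer GOP, used when bandwidth is tight

def choose_uplink_gop(uplink_rate_bps: float) -> int:
    """Pick the GOP length for encoding uplink video.

    A fast link can afford frequent (large) I frames; a slow link
    stretches the GOP so fewer I frames have to be transmitted.
    """
    if uplink_rate_bps > UPLINK_THRESHOLD_BPS:
        return FIRST_GOP_LENGTH
    return SECOND_GOP_LENGTH

def accept_downlink_frame(frame_type: str, frame_index: int,
                          downlink_rate_bps: float,
                          should_drop_i_frame) -> bool:
    """Decide whether to keep a received downlink frame.

    Above the threshold, every coded frame is kept; below it, some
    I frames are discarded (claim 6 ties the drop cadence to the GOP).
    """
    if downlink_rate_bps > DOWNLINK_THRESHOLD_BPS:
        return True
    if frame_type == "I" and should_drop_i_frame(frame_index):
        return False
    return True
```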
2. The method of claim 1, further comprising:
detecting whether the user is speaking;
when the user is not speaking, performing video coding on the uplink video data at a first code rate; and
when the user is speaking, performing video coding on the uplink video data at a second code rate, wherein the second code rate is higher than the first code rate.
3. The method of claim 2, wherein the performing video coding on the uplink video data at the second code rate comprises:
performing video coding, at the second code rate, on both a region of interest (ROI) of the human eye and the region outside the ROI in the uplink video data; or
performing video coding on the ROI in the uplink video data at the second code rate, and performing video coding on the region outside the ROI at the first code rate.
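Claims 2 and 3 together describe a speech-driven, region-aware rate policy. The sketch below assumes a hypothetical encoder interface with per-region rate control; the bitrates and method names are illustrative stand-ins, not part of the patent.

```python
RATE_LOW = 500_000    # "first code rate", while the user is silent (assumed)
RATE_HIGH = 1_500_000 # "second code rate", while the user speaks (assumed)

def encode_uplink_frame(frame, roi, user_speaking: bool, encoder,
                        roi_only_boost: bool = True):
    """Encode one uplink frame under the claim-2/claim-3 policy.

    `encoder.encode` and `encoder.encode_regions` are hypothetical
    stand-ins for a codec API with region-level rate control.
    """
    if not user_speaking:
        # Claim 2: silent periods are encoded entirely at the lower rate.
        return encoder.encode(frame, bitrate=RATE_LOW)
    if not roi_only_boost:
        # Claim 3, first option: ROI and background both at the high rate.
        return encoder.encode(frame, bitrate=RATE_HIGH)
    # Claim 3, second option: high rate inside the ROI, low rate outside it.
    return encoder.encode_regions(frame, [(roi, RATE_HIGH),
                                          ("outside_roi", RATE_LOW)])
```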
4. The method according to claim 2 or 3, wherein
the ROI includes at least one of a face region, a body region, or an object region.
5. The method according to any one of claims 1 to 4, wherein
the video session comprises at least one of a video call, a video conference, or a live connection.
6. The method of any one of claims 1 to 5, wherein the discarding part of the I frames in the downlink video data comprises:
discarding an I frame every X frames, wherein the value of X is determined according to the GOP of the downlink video data.
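Claim 6 states only that X is derived from the GOP of the downlink stream. As one possible reading, offered here as an assumption rather than the patent's definition, setting X to twice the GOP length discards every second I frame:

```python
def is_dropped_i_frame(frame_index: int, gop_length: int) -> bool:
    """Return True if the I frame at `frame_index` should be discarded.

    With one I frame per GOP (indices 0, gop, 2*gop, ...), choosing
    X = 2 * gop_length discards every other I frame. The factor of 2
    is an illustrative assumption.
    """
    x = 2 * gop_length
    return frame_index % x == 0
```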
7. The method of any one of claims 1 to 6, wherein the electronic device comprises a video session application, a camera driver module, and a camera, and the acquiring uplink video data comprises:
the video session application calls the camera, through the camera driver module, to capture video data; and
the camera captures video data at a first frame rate.
8. The method of claim 7, wherein the electronic device further comprises an encoding module, and the method further comprises:
the encoding module sends the video-encoded uplink video data to the video session application; and
the video session application sends the video-encoded uplink video data to a cloud server.
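Claims 7 and 8 describe the uplink path through the device: the session application opens the camera via the driver module for capture, and the encoder's output flows back through the application to the cloud server. A schematic sketch follows, in which every object and method name is a hypothetical stand-in:

```python
FIRST_FRAME_RATE = 30  # assumed capture frame rate

def run_uplink_pipeline(session_app, camera_driver, encoder):
    # Claim 7: the video session application calls the camera through
    # the camera driver module; the camera captures at a fixed rate.
    camera = camera_driver.open_camera(frame_rate=FIRST_FRAME_RATE)
    for raw_frame in camera.frames():
        # GOP length and code rate would be chosen as in claims 1-3.
        coded_frame = encoder.encode(raw_frame)
        # Claim 8: the encoder hands the coded data to the session
        # application, which uploads it to the cloud server.
        session_app.send_to_cloud(coded_frame)
```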
9. The method of any one of claims 1 to 8, wherein the electronic device comprises a network module, and the receiving downlink video data comprises:
receiving, through the network, video data of other participating members in the video session.
10. The method according to any one of claims 1 to 9, wherein
the first preset threshold is determined according to the resolution and the frame rate of the uplink video data;
the second preset threshold is determined according to the resolution and the frame rate of the downlink video data.
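Claim 10 does not give a formula, but a natural reading (offered here purely as an assumption) is that each preset threshold scales with the stream's raw pixel throughput, i.e. resolution times frame rate, weighted by an empirical bits-per-pixel factor:

```python
def preset_threshold_bps(width: int, height: int, frame_rate: float,
                         bits_per_pixel: float = 0.1) -> float:
    """Estimate a rate threshold for a stream of the given geometry.

    0.1 bits per pixel is a common rule of thumb for H.264-class
    codecs; it is a tuning constant, not a value from the patent.
    For 1280x720 at 30 fps this gives about 2.76 Mbit/s.
    """
    return width * height * frame_rate * bits_per_pixel
```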
11. The method according to any one of claims 1 to 10, wherein
the performing video coding on the uplink video data with the first GOP length comprises:
performing video coding on the uplink video data with the first GOP length and a first frame rate; and
the performing video coding on the uplink video data with the second GOP length comprises:
performing video coding on the uplink video data with the second GOP length and the first frame rate.
12. An electronic device comprising a processor coupled to a memory, the memory storing program instructions that, when executed by the processor, cause the electronic device to implement the method of any of claims 1-11.
13. A chip system, comprising one or more interface circuits and one or more processors, wherein the interface circuits and the processors are interconnected through lines;
the chip system is applied to an electronic device comprising a communication module and a memory; the interface circuits are configured to receive signals from the memory and send the signals to the processors, the signals comprising computer instructions stored in the memory; and when the processors execute the computer instructions, the electronic device performs the method of any one of claims 1 to 11.
14. A computer-readable storage medium comprising computer instructions;
the computer instructions, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-11.
CN202111082259.1A 2021-09-15 2021-09-15 Method for dynamically adjusting video, electronic equipment, chip system and storage medium Active CN113726815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111082259.1A CN113726815B (en) 2021-09-15 2021-09-15 Method for dynamically adjusting video, electronic equipment, chip system and storage medium

Publications (2)

Publication Number Publication Date
CN113726815A 2021-11-30
CN113726815B 2022-12-09

Family

ID=78684037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111082259.1A Active CN113726815B (en) 2021-09-15 2021-09-15 Method for dynamically adjusting video, electronic equipment, chip system and storage medium

Country Status (1)

Country Link
CN (1) CN113726815B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023193126A1 (en) * 2022-04-05 2023-10-12 Citrix Systems, Inc. Enhanced video conferencing based on speech detection
CN117714693A (en) * 2024-02-06 2024-03-15 成都科玛奇信息科技有限责任公司 Medical image data compression transmission method, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1118961A (en) * 1994-04-06 1996-03-20 美国电报电话公司 Low bit rate audio-visual communication system having integrated perceptual speech and video coding
US20080043644A1 (en) * 2006-08-18 2008-02-21 Microsoft Corporation Techniques to perform rate matching for multimedia conference calls
US20130010858A1 (en) * 2010-04-02 2013-01-10 Panasonic Corporation Wireless communication device and wireless communication method
US20160119588A1 (en) * 2014-10-22 2016-04-28 Axis Ab Video encoding method and video encoder system
CN106131670A (en) * 2016-07-12 2016-11-16 块互动(北京)科技有限公司 A kind of adaptive video coding method and terminal

Also Published As

Publication number Publication date
CN113726815B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN113422903B (en) Shooting mode switching method, equipment and storage medium
US9871995B2 (en) Object of interest based image processing
CN113473005B (en) Shooting transfer live-action insertion method, equipment and storage medium
CN113726815B (en) Method for dynamically adjusting video, electronic equipment, chip system and storage medium
CN111050062B (en) Shooting method and electronic equipment
CN112835549A (en) Method and device for switching audio output device
CN113596319A (en) Picture-in-picture based image processing method, apparatus, storage medium, and program product
CN114610253A (en) Screen projection method and equipment
CN113596321A (en) Transition dynamic effect generation method, apparatus, storage medium, and program product
US20190306462A1 (en) Image processing apparatus, videoconference system, image processing method, and recording medium
US20100061657A1 (en) Method and device for stabilizing image and method for transmitting and receiving image using the same
JP2004320507A (en) Video signal processing apparatus, video signal processing method, imaging apparatus, reproducing apparatus, and receiver
CN113596320B (en) Video shooting variable speed recording method, device and storage medium
CN113923351B (en) Method, device and storage medium for exiting multi-channel video shooting
JP2020115299A (en) Virtual space information processing device, method and program
WO2023011408A1 (en) Multi-window video communication method, device and system
CN114697731B (en) Screen projection method, electronic equipment and storage medium
WO2023273535A1 (en) Video bitstream processing method, medium, program product, and electronic device
CN117193685A (en) Screen projection data processing method, electronic equipment and storage medium
CN115550683A (en) Video data transmission method and device
WO2021139706A1 (en) Image processing method, device, and system
CN113810595B (en) Encoding method, apparatus and storage medium for video shooting
WO2022042281A1 (en) Encoding and decoding method, device, and system
CN113923528B (en) Screen sharing method, terminal and storage medium
CN114697658A (en) Encoding and decoding method, electronic device, communication system, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant