CN113536866A - Person tracking display method and electronic device - Google Patents

Person tracking display method and electronic device

Info

Publication number: CN113536866A
Application number: CN202010323761.6A
Authority: CN (China)
Prior art keywords: frame picture, targets, latest frame, electronic device
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈泽曦, 邹双一, 王凡
Current and original assignee: Huawei Technologies Co., Ltd.

Landscapes

  • Studio Devices (AREA)

Abstract

A person tracking display method. In the method, an electronic device intelligently detects person targets in a video stream, crops the video stream according to the number and positions of the detected targets, and outputs the cropped stream for display. By implementing the technical solution provided by the present application, the electronic device can automatically track persons in a video scene, and the output video picture displays more person detail.

Description

Person tracking display method and electronic device
Technical Field
The present application relates to the field of terminal and image processing technologies, and in particular to a person tracking display method and an electronic device.
Background
With the development of network technology, more and more users have become accustomed to remote video interaction. For example, employees can hold online conferences through real-time video; in special periods, students can attend remote classes through real-time video; and family members, even thousands of miles apart, can conveniently stay in touch through online video calls and share their daily lives.
However, when shooting a video of a person, the person may end up at the edge of the captured picture because the camera is not facing the person or because the person is moving. The captured person then appears small, and it is difficult to clearly show the person's details.
Disclosure of Invention
The present application provides a person tracking display method and an electronic device that automatically track a person in a video scene, so that the output video picture displays more person detail.
In a first aspect, an embodiment of the present application provides a person tracking display method, including: an electronic device performs target detection on the latest frame picture in a video stream to obtain the number of targets and the positions of the targets in the latest frame picture, where the targets are persons in the latest frame picture; the electronic device determines a cropping region according to the number of targets and the positions of the targets in the latest frame picture, where the cropping region covers the positions of the targets in the latest frame picture and is smaller than the picture range of the latest frame picture; and the electronic device outputs the cropped frame picture according to the cropping region.
In this embodiment, by performing target detection on the video stream, the electronic device intelligently determines a cropping region that is smaller than the latest frame picture and covers the targets, according to the number of detected targets and their positions in the latest frame picture, and outputs the cropped frame picture according to that region. Person tracking is thus performed automatically, and because the cropping region is smaller than the latest frame picture in the video stream, the persons in the output video picture appear larger than in the original picture, so that more person detail can be displayed.
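The three claimed steps map onto a short per-frame routine. Below is a minimal Python sketch, not the patent's implementation: `detect_persons` stands in for any person detector returning (x, y, w, h) boxes (one concrete detector sketch appears later, in the SSD section), and the 10% margin is an illustrative choice.

```python
def track_and_crop(frame, detect_persons, margin=0.1):
    """One iteration of the claimed method: detect -> determine region -> crop."""
    boxes = detect_persons(frame)            # list of (x, y, w, h) person boxes
    if not boxes:
        return frame                         # no target: output the frame unchanged
    # Union of all target boxes: the area the cropping region must cover.
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    bw, bh = x1 - x0, y1 - y0
    H, W = frame.shape[:2]
    # Pad the union a little, clamping so the region stays inside the frame.
    x0 = max(0, int(x0 - margin * bw)); y0 = max(0, int(y0 - margin * bh))
    x1 = min(W, int(x1 + margin * bw)); y1 = min(H, int(y1 + margin * bh))
    return frame[y0:y1, x0:x1]
```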
With reference to the first aspect, in some embodiments, the determining, by the electronic device, of the cropping region according to the number of targets and the positions of the targets in the latest frame picture specifically includes: when the number of targets is 1, the electronic device determines the cropping region according to the position of that 1 target in the latest frame picture; the cropping region is centered on the position of the 1 target, covers it, and is smaller than the picture range of the latest frame picture; when there are multiple targets, the electronic device determines the cropping region according to the positions of the multiple targets in the latest frame picture; the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture.
In the above embodiment, the electronic device determines the cropping region differently depending on the number of targets obtained by target detection, so that an appropriate cropping region can be intelligently determined for different numbers of photographed targets, and the finally cropped frame picture better matches the user's expectations.
With reference to some embodiments of the first aspect, in some embodiments, the determining, by the electronic device, of the cropping region according to the positions of the multiple targets in the latest frame picture specifically includes: the electronic device determines whether there is a protagonist among the multiple targets; the electronic device determines the cropping region according to the positions of the multiple targets in the latest frame picture and the position of the protagonist in the latest frame picture; when there is no protagonist among the multiple targets, the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture; when there is a protagonist among the multiple targets, the cropping region is centered on the position of the protagonist in the latest frame picture, covers the positions of the multiple targets, and is smaller than the picture range of the latest frame picture.
In the above embodiment, when target detection yields multiple targets, the electronic device may determine a protagonist among them, and then intelligently determine a suitable cropping region according to whether a protagonist exists and the protagonist's position in the latest frame picture, so that the finally cropped frame picture better matches the actual scene requirements.
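A sketch of this region logic, under the same assumptions as the earlier sketch (boxes as (x, y, w, h); `protagonist_box` is None when no protagonist has been designated). Centering is done by growing the covering region symmetrically around the protagonist's centre; this is a sketch, not the claimed algorithm itself.

```python
def crop_region(boxes, protagonist_box, frame_w, frame_h, margin=0.1):
    """Smallest padded region covering every target; centered on the
    protagonist when one exists."""
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    if protagonist_box is not None:
        px, py, pw, ph = protagonist_box
        cx, cy = px + pw / 2, py + ph / 2      # protagonist centre
        half_w = max(cx - x0, x1 - cx)         # grow symmetrically so the
        half_h = max(cy - y0, y1 - cy)         # protagonist sits at the centre
        x0, x1 = cx - half_w, cx + half_w
        y0, y1 = cy - half_h, cy + half_h
    bw, bh = x1 - x0, y1 - y0
    x0 = max(0, int(x0 - margin * bw)); y0 = max(0, int(y0 - margin * bh))
    x1 = min(frame_w, int(x1 + margin * bw)); y1 = min(frame_h, int(y1 + margin * bh))
    return x0, y0, x1, y1
```

The single-target case of the previous embodiment falls out of the same function: passing the lone target's box as `protagonist_box` centers the region on it.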
With reference to some embodiments of the first aspect, in some embodiments, the determining, by the electronic device, of the protagonist among the multiple targets specifically includes: if no protagonist among the multiple targets has been determined in a history frame picture, the electronic device performs pose analysis on the multiple targets to determine the protagonist among them, where a history frame picture is a frame picture before the latest frame picture in the video stream; if the protagonist among the multiple targets has been determined in a history frame picture, the electronic device determines the protagonist among the multiple targets by tracking, according to the position and feature information of the protagonist in the history frame picture.
In the above embodiment, the electronic device may determine the protagonist in the latest frame picture in different ways according to whether the protagonist has already been determined in a history frame picture. When the protagonist has been determined in a history frame picture, there is no need to determine the protagonist by pose analysis again; the protagonist among the multiple targets only needs to be tracked using the protagonist's position and feature information in the history frame picture. Pose analysis therefore does not have to be run on every target in every frame picture, which reduces the power consumption of the electronic device and makes protagonist determination more efficient.
With reference to some embodiments of the first aspect, in some embodiments, the performing, by the electronic device, of pose analysis on the multiple targets to determine the protagonist among them specifically includes: the electronic device determines that the target, among the multiple targets, that maintains a preset protagonist action for a preset protagonist duration is the protagonist among the multiple targets.
In the above embodiment, the protagonist is determined as the target that maintains the preset protagonist action for the preset protagonist duration. On the one hand, this avoids misjudging a target's action and improves the accuracy of protagonist determination. On the other hand, a user can independently choose whether to maintain the preset protagonist action, which improves the interaction between the electronic device and the user.
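A sketch of the hold-a-pose-for-a-duration rule. `pose_of` is a hypothetical per-target pose classifier, and both the preset action and the duration are illustrative values, not ones specified by the patent.

```python
import time

PRESET_ACTION = "hand_raised"   # illustrative preset protagonist action
PRESET_SECONDS = 2.0            # illustrative preset protagonist duration


class ProtagonistSelector:
    """Select as protagonist the first target that holds the preset
    action continuously for the preset duration."""

    def __init__(self):
        self.hold_start = {}    # target_id -> time the pose was first seen

    def update(self, target_ids, pose_of):
        now = time.monotonic()
        for tid in target_ids:
            if pose_of(tid) == PRESET_ACTION:
                start = self.hold_start.setdefault(tid, now)
                if now - start >= PRESET_SECONDS:
                    return tid                       # protagonist found
            else:
                self.hold_start.pop(tid, None)       # pose broken: reset timer
        return None                                  # no protagonist yet
```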
With reference to some embodiments of the first aspect, in some embodiments, the determining, by the electronic device, of the protagonist among the multiple targets by tracking, according to the position and feature information of the protagonist in the history frame picture, specifically includes: the electronic device determines candidate targets in the latest frame picture, where a candidate target is a target, among the multiple targets, whose distance from the position of the protagonist in the previous frame picture is within a preset distance threshold; when there is one candidate target, the electronic device determines that this candidate target is the protagonist among the multiple targets; when there are multiple candidate targets, the electronic device determines that the candidate target whose feature information is closest to the feature information of the protagonist in the history frame picture is the protagonist among the multiple targets.
In the above embodiment, candidate targets are determined by their distance from the protagonist's position in the previous frame picture, and the protagonist among multiple candidate targets is determined by how closely their feature information matches the protagonist's feature information in the history frame picture. This reduces the power consumption of the electronic device while ensuring the accuracy of protagonist determination.
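A sketch of this two-stage rule. Representing positions as centre points and feature information as numeric vectors, and using Euclidean distance for both stages, are assumptions for illustration; the patent does not fix a particular feature or metric.

```python
import numpy as np


def track_protagonist(targets, prev_pos, prev_feat, dist_thresh=100.0):
    """targets: list of (target_id, (cx, cy), feature_vector).
    Stage 1: keep targets within dist_thresh of the protagonist's
    position in the previous frame. Stage 2: if several remain,
    pick the one whose features best match the recorded protagonist."""
    candidates = [
        t for t in targets
        if np.linalg.norm(np.asarray(t[1], float) - np.asarray(prev_pos, float))
        <= dist_thresh
    ]
    if not candidates:
        return None                        # protagonist lost in this frame
    if len(candidates) == 1:
        return candidates[0][0]
    return min(
        candidates,
        key=lambda t: np.linalg.norm(np.asarray(t[2], float) - np.asarray(prev_feat, float)),
    )[0]
```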
In combination with some embodiments of the first aspect, in some embodiments, the method further includes: the electronic device records the feature information of the protagonist among the multiple targets.
In the above embodiment, after the protagonist in the latest frame picture is determined, the protagonist's feature information is recorded. This facilitates subsequently tracking the protagonist in newly generated frame pictures without performing pose analysis on every frame picture, thereby reducing the power consumption of the electronic device.
With reference to some embodiments of the first aspect, in some embodiments, the performing, by the electronic device, of target detection on the latest frame picture in the video stream to obtain the number of targets and the positions of the targets in the latest frame picture specifically includes: the electronic device downsamples the original latest frame picture in the video stream to obtain the latest frame picture, where the resolution of the latest frame picture is lower than that of the original latest frame picture; the electronic device performs target detection on the latest frame picture to obtain the number of targets and the positions of the targets in the latest frame picture. The outputting, by the electronic device, of the cropped frame picture according to the cropping region specifically includes: the electronic device crops and upsamples the original latest frame picture according to the cropping region in the latest frame picture to obtain the cropped frame picture, where the resolution of the cropped frame picture is equal to that of the original latest frame picture; and the electronic device outputs the cropped frame picture.
In the above embodiment, the frame picture is downsampled before target detection, reducing the computational load of the electronic device, and the crop is upsampled afterwards, raising the resolution of the cropped frame picture so that person details are displayed more clearly.
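A sketch of this resolution handling with OpenCV. Boxes found on the downsampled copy are scaled back to the original frame before cropping; the 4x factor and the interpolation choices are assumptions.

```python
import cv2


def detect_crop_upsample(orig, detect_persons, scale=4):
    H, W = orig.shape[:2]
    # Downsample before detection to cut the detection workload.
    small = cv2.resize(orig, (W // scale, H // scale), interpolation=cv2.INTER_AREA)
    boxes = [(x * scale, y * scale, w * scale, h * scale)
             for (x, y, w, h) in detect_persons(small)]   # map boxes back
    if not boxes:
        return orig
    x0, y0, x1, y1 = crop_region(boxes, None, W, H)       # from the earlier sketch
    crop = orig[y0:y1, x0:x1]
    # Upsample the crop back to the original resolution for output.
    return cv2.resize(crop, (W, H), interpolation=cv2.INTER_CUBIC)
```

Note that stretching the crop to the full original resolution ignores aspect ratio; a real implementation would fix the crop's aspect ratio or letterbox the output first.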
With reference to some embodiments of the first aspect, in some embodiments, before the electronic device outputs the cropped frame picture, the method further includes: the electronic device performs distortion correction on the cropped frame picture.
In the above embodiment, distortion correction is performed on the cropped frame picture, correcting the deformation of a picture shot by a wide-angle camera so that the video picture reflects the photographed subject more faithfully.
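Distortion correction is a standard camera-calibration step; below is a sketch using OpenCV's undistortion API, where the camera matrix and distortion coefficients are illustrative placeholders (real values come from a prior calibration, e.g. cv2.calibrateCamera).

```python
import cv2
import numpy as np

# Illustrative calibration data for a 1920x1080 wide-angle camera.
CAMERA_MATRIX = np.array([[1000.0,    0.0, 960.0],
                          [   0.0, 1000.0, 540.0],
                          [   0.0,    0.0,   1.0]])
DIST_COEFFS = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])  # barrel distortion terms


def correct_distortion(frame):
    """Undo lens distortion before the frame is sent to display."""
    return cv2.undistort(frame, CAMERA_MATRIX, DIST_COEFFS)
```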
With reference to some embodiments of the first aspect, in some embodiments, before the electronic device performs target detection on the latest frame picture in the video stream, the method further includes: the electronic device merges the video streams shot by multiple cameras to obtain the video stream.
In this embodiment, the video streams shot by multiple cameras are merged before subsequent processing, which enlarges the viewing-angle range of the video picture.
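A deliberately naive sketch of the merge for two horizontally adjacent cameras: synchronized frames are simply placed side by side. Real stitching would align and blend the overlapping field of view, which is outside the scope of this illustration.

```python
import numpy as np


def merge_frames(left, right):
    """Naive side-by-side merge of two equally tall camera frames."""
    assert left.shape[0] == right.shape[0], "frames must share a height"
    return np.hstack([left, right])
```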
In a second aspect, an embodiment of the present application provides an electronic device, including: a camera, one or more processors, and a memory. The camera is used for shooting to obtain a video stream. The memory is coupled with the one or more processors and is used to store computer program code, the computer program code including computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to perform: performing target detection on the latest frame picture in the video stream to obtain the number of targets and the positions of the targets in the latest frame picture, where the targets are persons in the latest frame picture; determining a cropping region according to the number of targets and the positions of the targets in the latest frame picture, where the cropping region covers the positions of the targets in the latest frame picture and is smaller than the picture range of the latest frame picture; and outputting the cropped frame picture according to the cropping region.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: when the number of targets is 1, determining the cropping region according to the position of that 1 target in the latest frame picture, the cropping region being centered on the position of the 1 target, covering it, and being smaller than the picture range of the latest frame picture; when there are multiple targets, determining the cropping region according to the positions of the multiple targets in the latest frame picture, the cropping region covering the positions of the multiple targets and being smaller than the picture range of the latest frame picture.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: determining whether there is a protagonist among the multiple targets; and determining the cropping region according to the positions of the multiple targets in the latest frame picture and the position of the protagonist in the latest frame picture, where, when there is no protagonist among the multiple targets, the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture, and when there is a protagonist, the cropping region is centered on the position of the protagonist in the latest frame picture, covers the positions of the multiple targets, and is smaller than the picture range of the latest frame picture.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: when no protagonist among the multiple targets has been determined in a history frame picture, performing pose analysis on the multiple targets to determine the protagonist among them, where a history frame picture is a frame picture before the latest frame picture in the video stream; when the protagonist among the multiple targets has been determined in a history frame picture, determining the protagonist among the multiple targets by tracking, according to the position and feature information of the protagonist in the history frame picture.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: determining the target, among the multiple targets, that maintains the preset protagonist action for the preset protagonist duration to be the protagonist among the multiple targets.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: determining candidate targets in the latest frame picture, where a candidate target is a target, among the multiple targets, whose distance from the position of the protagonist in the previous frame picture is within a preset distance threshold; when there is one candidate target, determining that this candidate target is the protagonist among the multiple targets; when there are multiple candidate targets, determining that the candidate target whose feature information is closest to the feature information of the protagonist in the history frame picture is the protagonist among the multiple targets.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: recording the feature information of the protagonist among the multiple targets.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: downsampling the original latest frame picture in the video stream to obtain the latest frame picture, where the resolution of the latest frame picture is lower than that of the original latest frame picture; performing target detection on the latest frame picture to obtain the number of targets and the positions of the targets in the latest frame picture; cropping and upsampling the original latest frame picture according to the cropping region in the latest frame picture to obtain the cropped frame picture, where the resolution of the cropped frame picture is equal to that of the original latest frame picture; and outputting the cropped frame picture.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: performing distortion correction on the cropped frame picture.
With reference to some embodiments of the second aspect, in some embodiments, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: merging the video streams shot by multiple cameras to obtain the video stream.
In a third aspect, an embodiment of the present application provides a chip system. The chip system is applied to an electronic device and includes one or more processors, and the processors are configured to invoke computer instructions to cause the electronic device to perform the method described in the first aspect and any possible implementation manner of the first aspect.
It is understood that the chip system may include one processor 110 of the electronic device 100 shown in fig. 5, or a plurality of processors 110, and may further include one or more other chips, for example an image signal processing chip in the camera 193 or an image display chip in the display screen 194 of the electronic device 100 shown in fig. 5, which is not limited herein.
In a fourth aspect, embodiments of the present application provide a computer program product including instructions, which, when run on an electronic device, cause the electronic device to perform the method described in the first aspect and any possible implementation manner of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions that, when executed on an electronic device, cause the electronic device to perform the method described in the first aspect and any possible implementation manner of the first aspect.
It is understood that the electronic device provided in the second aspect, the chip system provided in the third aspect, the computer program product provided in the fourth aspect, and the computer-readable storage medium provided in the fifth aspect are all used to execute the method provided by the embodiments of the present application. Therefore, for the beneficial effects they achieve, reference may be made to the beneficial effects of the corresponding method, and details are not described here again.
Drawings
FIG. 1 is a schematic diagram of the relationship between a video stream and frame pictures in an embodiment of the present application;
FIG. 2 is a schematic diagram of an effect of the person tracking display method in an embodiment of the present application;
FIG. 3 is a schematic diagram of another effect of the person tracking display method in an embodiment of the present application;
FIG. 4 is a schematic diagram of another effect of the person tracking display method in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 6 is a block diagram of the software architecture of an electronic device in an embodiment of the present application;
FIG. 7 is an exemplary schematic diagram of a video shooting interface in an embodiment of the present application;
FIG. 8 is an exemplary schematic diagram of a shooting settings interface in an embodiment of the present application;
FIG. 9 is an exemplary schematic diagram of target detection in the person tracking display method in an embodiment of the present application;
FIG. 10 is an exemplary schematic diagram of determining a cropping region in the person tracking display method in an embodiment of the present application;
FIG. 11 is an exemplary schematic diagram of cropping and output in the person tracking display method in an embodiment of the present application;
FIG. 12 is another exemplary schematic diagram of target detection in the person tracking display method in an embodiment of the present application;
FIG. 13 is another exemplary schematic diagram of determining a cropping region in the person tracking display method in an embodiment of the present application;
FIG. 14 is another exemplary schematic diagram of cropping and output in the person tracking display method in an embodiment of the present application;
FIG. 15 is an exemplary schematic diagram of determining a protagonist in the person tracking display method in an embodiment of the present application;
FIG. 16 is another exemplary schematic diagram of determining a cropping region in the person tracking display method in an embodiment of the present application;
FIG. 17 is another exemplary schematic diagram of cropping and output in the person tracking display method in an embodiment of the present application;
FIG. 18 is an exemplary schematic diagram of determining candidate targets in the person tracking display method in an embodiment of the present application;
FIG. 19 is another exemplary schematic diagram of determining candidate targets in the person tracking display method in an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the listed items.
In the following, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, unless stated otherwise, "plurality" means two or more.
Since the embodiments of the present application involve image processing technology, for ease of understanding, the related terms and concepts are described below.
(1) SSD (Single Shot MultiBox Detector) algorithm:
The SSD algorithm is an artificial neural network algorithm for target detection. Compared with other target detection algorithms, it increases detection speed while maintaining accuracy, and can meet the requirements of real-time detection.
Before the SSD algorithm is used for target detection, labeled samples are required as a training set to train it until it meets the detection requirements.
For example, in the embodiments of the present application, a set of images annotated with person labels may be used to train the SSD algorithm, so that it can detect persons in an input image and output their position information. Generally, in the SSD algorithm, the position information of a person in the input image can be represented by a prediction box that locates the person. The size of the prediction box may be preset, or may be adjusted automatically according to preset rules based on the size of the detected target, which is not limited herein.
Depending on the requirements of the target detection scenario, the "persons" in the embodiments of the present application may include only humans, or humans and household pets (e.g., cats and dogs), or further person-related objects, which is not limited herein. For each target requirement, the corresponding SSD algorithm is trained using, as the training set, only samples labeled with the targets to be recognized.
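As a concrete stand-in for the trained detector described above, the sketch below runs the publicly released MobileNet-SSD Caffe model through OpenCV's dnn module. The model file names and thresholds are assumptions; the patent does not name a specific SSD implementation. This also provides the `detect_persons` helper assumed by the earlier sketches.

```python
import cv2
import numpy as np

# Assumed weights from the public MobileNet-SSD release (trained on VOC).
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")
PERSON_CLASS_ID = 15  # index of "person" in the VOC class list


def detect_persons(frame, conf_thresh=0.5):
    """Return (x, y, w, h) prediction boxes for persons in the frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()              # shape (1, 1, N, 7)
    boxes = []
    for i in range(detections.shape[2]):
        conf = float(detections[0, 0, i, 2])
        if conf > conf_thresh and int(detections[0, 0, i, 1]) == PERSON_CLASS_ID:
            x0, y0, x1, y1 = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
            boxes.append((x0, y0, x1 - x0, y1 - y0))
    return boxes
```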
(2) Downsampling (subsampling) and upsampling images:
Downsampling an image is the process of shrinking it. Its main purpose here is to reduce the resolution of the image used for target detection and thereby reduce the artificial intelligence (AI) computation load.
For example, if an image has a resolution of M × N, downsampling it by a factor of s yields an image with a resolution of (M/s) × (N/s); naturally, s should be a common divisor of M and N.
Upsampling an image is the process of enlarging it so that it can be displayed on a higher-resolution display device. Image enlargement almost always uses interpolation: on the basis of the original pixels, new pixels are inserted between existing pixel points using a suitable interpolation algorithm.
In the embodiments of the present application, an image is downsampled to reduce its resolution and thereby the computation load of image processing. For example, when target detection is performed on an image with the SSD algorithm, or when pose detection is performed on an image, a lower-resolution image requires less computation and is processed faster.
In the embodiments of the present application, upsampling is performed because cropping the original video stream according to the determined cropping region reduces the resolution of the resulting frame pictures. The frame pictures in the cropped video stream are therefore upsampled, so that their resolution is restored to that of the frame pictures in the original video stream.
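Both operations map directly onto OpenCV's resize; a minimal sketch of the M × N to (M/s) × (N/s) relationship, with an illustrative factor s = 4 and an assumed input file.

```python
import cv2

img = cv2.imread("frame.png")    # assume an M x N image (width M, height N)
N, M = img.shape[:2]             # OpenCV stores (rows, cols) = (height, width)
s = 4                            # illustrative factor; should divide M and N

small = cv2.resize(img, (M // s, N // s), interpolation=cv2.INTER_AREA)  # downsample
restored = cv2.resize(small, (M, N), interpolation=cv2.INTER_LINEAR)     # upsample
```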
(3) Video stream and frame picture:
the video stream is composed of frame pictures, for example, a video stream with a frame rate of 24 Frames Per Second (FPS), i.e., the video stream is represented to be composed of 24 frame pictures per second.
Fig. 1 is a schematic diagram of a relationship between a video stream and a frame picture. In the embodiment of the application, the video stream is generated by shooting in real time by the camera, so that the video stream obtained by shooting can generate new frame pictures continuously over time. Assuming that shooting is started at time T0, one frame picture is generated from time T1 to the latest time T13, wherein the frame pictures generated from time T1 to time T12 may be referred to as history frame pictures, and the frame picture generated from the latest time T13 may be referred to as the latest frame picture.
By using the character tracking display method in the embodiment of the application, each latest frame of picture is processed, and the processed frame of picture can form a new video stream.
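This per-frame flow is the usual capture loop; below is a sketch tying together the earlier helpers (`track_and_crop`, `detect_persons`), where camera index 0 and the window name are assumptions.

```python
import cv2

cap = cv2.VideoCapture(0)                 # real-time camera stream
while True:
    ok, latest = cap.read()               # the latest frame picture
    if not ok:
        break
    out = track_and_crop(latest, detect_persons)   # from the earlier sketches
    cv2.imshow("tracked", out)            # processed frames form the new stream
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```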
In the prior art, when shooting a video, a person may be located at the edge of the captured picture because the camera is not facing the person directly or the person is moving. The captured person appears small, and the person's details cannot be shown clearly.
With the person tracking display method in the embodiments of the present application, a user does not need to manually adjust the camera or the video picture. In a video scene, the electronic device automatically tracks the person, and the output video picture displays more person detail, which optimizes the user experience in video scenes.
The person tracking display method in the embodiments of the present application works well in a range of application scenarios.
For example, when the shooting target is a single person, FIG. 2 shows the effect of the person tracking display method in an embodiment of the present application. Part (a) of FIG. 2 shows the originally captured video, in which the single person is at the edge of the picture and too small for details to be seen. After processing by the person tracking display method, the video shown in part (b) of FIG. 2 is output: it displays the person's details, retains the background near the person, and keeps the person at the center of the picture.
As another example, when the shooting target is multiple people, FIG. 3 shows another effect of the person tracking display method in an embodiment of the present application. Part (a) of FIG. 3 shows the originally captured video, in which several captured persons are at the edge of the picture and their details cannot be observed. After processing, the video shown in part (b) of FIG. 3 is output: it contains all the persons in the original video and retains as much person detail as possible, improving the user's video experience.
As another example, when multiple cameras shoot simultaneously, FIG. 4 shows another effect of the person tracking display method in an embodiment of the present application. Part (a) of FIG. 4 shows the original videos captured separately by two cameras, in which a person is located at the junction of the two views and the person's motion cannot be observed completely. After processing, the video shown in part (b) of FIG. 4 is output: it contains all the persons in the videos shot by the two cameras, each person's motion can be observed clearly and completely, and the viewing-angle range of the video is enlarged.
An exemplary electronic device 100 provided by embodiments of the present application is first described below.
Fig. 5 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application.
The following describes an embodiment specifically by taking the electronic device 100 as an example. It should be understood that electronic device 100 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the structure illustrated in this embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be, among other things, a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves system efficiency.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K via an I2C interface, such that the processor 110 and the touch sensor 180K communicate via an I2C bus interface to implement the touch functionality of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The SIM interface may be used to communicate with the SIM card interface 195, implementing functions to transfer data to or read data from the SIM card.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, to transmit data between the electronic device 100 and a peripheral device, or to connect an earphone and play audio through it. The interface may also be used to connect other electronic devices, such as an AR device.
It should be understood that the interface connection relationships between the modules illustrated in this embodiment are merely illustrative and do not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt an interface connection manner different from those in the above embodiments, or a combination of multiple interface connection manners.
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used to process digital signals, including digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application (such as a face recognition function, a fingerprint recognition function, a mobile payment function, and the like) required by at least one function, and the like. The storage data area may store data (such as face information template data, fingerprint information template, etc.) created during the use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The pressure sensor 180A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic apparatus 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic apparatus 100 may also calculate the touched position from the detection signal of the pressure sensor 180A. In some embodiments, the touch operations that are applied to the same touch position but different touch operation intensities may correspond to different operation instructions. For example: and when the touch operation with the touch operation intensity smaller than the first pressure threshold value acts on the short message application icon, executing an instruction for viewing the short message. And when the touch operation with the touch operation intensity larger than or equal to the first pressure threshold value acts on the short message application icon, executing an instruction of newly building the short message.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, different from the position of the display screen 194.
In the embodiment of the present application, the camera 193 in the electronic device 100 may be triggered to record a video through the user operation acquired by the pressure sensor 180A and/or the touch sensor 180K. The recorded video may be displayed on the display screen 194 after the processor 110 calls the operation instructions stored in the internal memory 121.
Fig. 6 is a block diagram of a software configuration of the electronic device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 6, the application package may include applications (also referred to as apps) such as a person tracking display module, camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, and short message.
In the embodiment of the application, after the camera application is started, the person tracking display module in the application package can be called, so that the person tracking display method in the embodiment of the application is executed, and a person is tracked and displayed in a video scene.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 6, the application framework layer may include a window manager, content providers, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide communication functions of the electronic device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages that automatically disappear after a short stay without requiring user interaction, for example notifications of download completion or message alerts. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or notifications that appear on the screen as a dialog interface; for example, text information may be prompted in the status bar, a prompt tone may sound, the electronic device may vibrate, or an indicator light may flash.
The Android runtime includes a core library and a virtual machine, and is responsible for scheduling and management of the Android system.
The core library consists of two parts: functions that need to be called by the java language, and the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files, and is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries, a three-dimensional graphics processing library (e.g., OpenGL ES), and a two-dimensional graphics engine (e.g., SGL).
The surface manager is used to manage the display subsystem and provide a fusion of two-dimensional (2-dimensional, 2D) and three-dimensional (3-dimensional, 3D) layers for multiple applications.
The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. It may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing 3D graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver, a sensor driver and a virtual card driver.
The following describes exemplary workflow of the software and hardware of the electronic device 100 in connection with capturing a photo scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including touch coordinates, a time stamp of the touch operation, and other information). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation being a tap and the corresponding control being the camera application icon as an example: the camera application calls an interface of the application framework layer to start the camera application, which in turn starts the camera driver by calling the kernel layer, and a still image or video is captured through the camera 193. In this embodiment of the application, the camera application may pass the video captured by the camera 193 to the person tracking display module in the application package for processing, and the camera application may output the processed video to the display screen 194.
The following describes the person tracking display method of the embodiment of the present application with reference to the software and hardware architecture of the electronic device 100. Depending on the number of shooting targets and the mode selected by the user, the person tracking display method covers the following situations, described respectively below:
1. The shooting target is a single target.
Fig. 7 is an exemplary schematic diagram of a video shooting interface of an electronic device in an embodiment of the present application.
The video capture interface 700 may be the video shooting interface displayed when a user opens the camera application and clicks a video recording control. It may also be a video shooting interface displayed after a video chat function is started by another application in the electronic device 100, such as a video conference application or a chat application, which is not limited here.
The video capture interface 700 can include a capture screen display area 701, a settings control 702, and a recording control 703.
The captured picture display area 701 is used for displaying a picture captured by the camera 193 in the electronic device 100.
The setting control 702 is used to trigger the display of a shooting setting interface.
The recording control 703 is used to control the start, pause, and stop of video recording.
When the electronic device 100 receives the user's operation of clicking the setting control 702 in fig. 7, the electronic device 100 may display a shooting setting interface. Fig. 8 is an exemplary schematic diagram of the shooting setting interface in the embodiment of the present application.
The capture settings interface 800 may include a person tracking control 801, a distortion correction control 802, a hero mode control 803, and a video stitching control 804. It is understood that in practical applications the shooting setting interface 800 may further include many other functional controls, such as a portrait beautifying control or a time-lapse shooting control, which is not limited here.
The person tracking control 801 is used for triggering the activation of the person tracking function. After the person tracking function is activated, the electronic device 100 may execute the person tracking display method in the embodiment of the present application on the video stream captured by the camera 193.
The distortion correction control 802 is used to trigger the start of a picture distortion correction function. After this function is started, distortion correction is performed on the pictures in the video stream before they are sent to the display screen 194 for display, so that the output pictures are more stable and clearer.
The hero mode control 803 is used to trigger the start of the hero mode. In the hero mode, when multiple persons are photographed, one of them is determined to be the hero and displayed in the central area of the picture.
The video stitching control 804 is used to trigger a video stream stitching function. After this function is started, the electronic device 100 may merge the video streams captured by multiple cameras 193 into one video stream for processing and display.
If the current shooting target is a single target and the user clicks the person tracking control 801 in the shooting setting interface 800 shown in fig. 8, the person tracking function is started. Then, during video shooting, the electronic device 100 performs target detection on the frame pictures in the currently captured video stream and determines the number of targets and the positions of the targets in the frame picture.
Fig. 9 is a schematic diagram illustrating exemplary target detection in the person tracking display method according to the embodiment of the present application. The electronic device 100 may detect a target in the frame picture and determine the coordinates (Xm, Ym) of the target's center point in the frame picture.
It is understood that starting the person tracking function by clicking the person tracking control 801 is only an example; in practical applications there are many ways to start the person tracking function. For example, the electronic device may start the person tracking function by default, or the function may be started by a preset gesture operation instruction, which is not limited here.
It will be appreciated that fig. 9 is only one example of determining the position of a target through target detection; in practical applications there are many different ways to represent the position of a target in a frame picture. For example, if a coordinate system is established along the X and Y axes of the frame picture, the position of the target can be represented by feeding back the maximum X value, minimum X value, maximum Y value, and minimum Y value of the region where the target is located. Alternatively, the region where the target is located can be enclosed in a bounding frame, and the position of the target in the frame picture can be represented by feeding back the coordinates of that frame. This is not limited here.
There are many options for the artificial intelligence neural network used for target detection, such as the Fast R-CNN algorithm, the R-FCN algorithm, and the SSD algorithm. Preferably, in the embodiment of the present application, an SSD algorithm trained to detect persons may be used for target detection.
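As an illustrative sketch only (not the implementation of this application), the detection step can be organized as follows in Python. Here detect_persons is a hypothetical wrapper around any trained person detector, such as an SSD network; the box format and all names are assumptions made for the example.

    # Minimal sketch of the detection step. detect_persons() is a hypothetical
    # wrapper around a trained person detector (e.g., an SSD network) that is
    # assumed to return one (x_min, y_min, x_max, y_max) box per person.
    def detect_targets(frame, detect_persons):
        boxes = detect_persons(frame)
        # One center point (Xm, Ym) per target, as in fig. 9.
        centers = [((x0 + x1) / 2.0, (y0 + y1) / 2.0)
                   for (x0, y0, x1, y1) in boxes]
        return len(boxes), boxes, centers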
It will be appreciated that the target detection on the frame picture is data processing performed primarily by the processor 110 in the electronic device 100.
Optionally, in this embodiment of the application, before performing target detection on a frame picture in the currently captured video stream, the electronic device 100 may first downsample the frame picture to reduce its resolution, and then perform target detection on the downsampled frame picture, thereby reducing the computational load of target detection.
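A minimal sketch of this downsampling step, assuming OpenCV is available; the scale factor 0.25 is an illustrative value, not one specified by this application:

    import cv2

    def detect_on_downsampled(frame, detect_persons, scale=0.25):
        # Run detection on a lower-resolution copy to reduce computational load.
        small = cv2.resize(frame, None, fx=scale, fy=scale,
                           interpolation=cv2.INTER_AREA)
        boxes_small = detect_persons(small)  # coordinates in downsampled space
        return boxes_small, scale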
When the electronic device 100 performs target detection and determines that there is only one target in the frame picture, it may determine a cropping region according to the position of the target in the frame picture; the cropping region is centered on the target and smaller than the picture range of the original frame picture. Fig. 10 is a schematic diagram illustrating an exemplary determination of a cropping region in the person tracking display method according to the embodiment of the present application. The electronic device 100 may determine, as the cropping region, a picture range that is centered on the target, covers the target, and whose X-axis and Y-axis lengths are 1/2 (or some other fraction) of those of the original frame picture. For example, in fig. 10, where the X-axis length of the frame picture is X1 and the Y-axis length is Y1, the electronic device may determine the region 1001, centered on the target, with X-axis length X1/2 and Y-axis length Y1/2, as the cropping region.
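The following sketch shows one possible way to compute such a single-target cropping region; the half-frame fraction and the clamping behavior at the picture borders are assumptions for illustration:

    def single_target_crop(cx, cy, frame_w, frame_h, frac=0.5):
        # Window whose side lengths are a fraction (here 1/2) of the frame,
        # centered on the target center (cx, cy) and shifted as needed so
        # the whole window stays inside the frame.
        w, h = frame_w * frac, frame_h * frac
        x0 = min(max(cx - w / 2.0, 0), frame_w - w)
        y0 = min(max(cy - h / 2.0, 0), frame_h - h)
        return int(x0), int(y0), int(x0 + w), int(y0 + h)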
It is understood that the electronic device 100 may automatically determine the cropping region from the position of the target in the frame picture according to a preset composition rule. The preset composition rule may be a factory preset rule, a rule added by the user, or a trained artificial intelligence model, which is not limited here.
The electronic device 100 may crop and output the frame picture according to the cropping region. Fig. 11 is a schematic diagram illustrating cropping and outputting in the person tracking display method according to the embodiment of the present application. Compared with the frame picture in the original video stream, the cropped frame picture output and displayed on the display screen 194 shows the person at the center of the display screen 194 and larger than in the original frame picture, so it can display more details of the person.
When a frame picture is cropped, it is the frame picture in the captured video stream that is cropped. If the frame picture was downsampled before target detection to reduce the computational load, the coordinates of the cropping region in the frame picture of the captured video stream need to be determined from the coordinates of the cropping region in the downsampled frame picture, and the frame picture in the captured video stream is then cropped according to that cropping region.
Since the resolution of the cropped frame picture is smaller than that of the original frame picture, upsampling is required to restore the cropped frame picture to the resolution of the original frame picture.
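A sketch of these two steps together, assuming OpenCV and the same scale factor used before detection; the names are illustrative:

    import cv2

    def crop_and_restore(full_frame, crop_small, scale):
        # Map the cropping region from downsampled coordinates back to the
        # full-resolution frame of the captured video stream.
        x0, y0, x1, y1 = (int(round(v / scale)) for v in crop_small)
        h, w = full_frame.shape[:2]
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, w), min(y1, h)
        cropped = full_frame[y0:y1, x0:x1]
        # Upsample the cropped picture to the original frame resolution.
        return cv2.resize(cropped, (w, h), interpolation=cv2.INTER_LINEAR)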
2. The shooting targets are multiple targets, and the user turns on the hero mode.
If the current shooting targets are multiple targets, and the user has started the person tracking function by clicking the person tracking control 801 in the shooting setting interface 800 shown in fig. 8 and started the hero mode by clicking the hero mode control 803 in the same interface, then during video shooting the electronic device 100 performs target detection on the frame pictures in the currently captured video stream and determines the number of targets and the positions of the targets in the frame picture.
It is understood that starting the hero mode by clicking the hero mode control 803 is only an example; in practical applications there are many ways to start the hero mode. For example, the electronic device may start the hero mode by default, or the hero mode may be started by a preset gesture operation instruction, which is not limited here.
Fig. 12 is another exemplary schematic diagram illustrating target detection in the person tracking display method according to the embodiment of the present application. Through target detection, the electronic device 100 can determine that the number of targets is 3 and can obtain the coordinates of each target in the frame picture.
Assume the hero mode of the electronic device 100 is in the on state. In the hero mode, when the shooting targets are multiple targets, the electronic device 100 displays the hero among the multiple targets in the central area of the frame picture.
When the shooting targets are multiple targets and the user has turned on the hero mode, the electronic device 100 handles the frame picture in 3 cases, described below:
case 1: the electronic device has not determined a lead angle from the plurality of targets.
To determine a hero among multiple targets, whether any of the targets meets a preset hero condition needs to be determined through pose analysis. Specifically, the electronic device may perform pose analysis on the multiple targets in the frame picture, and when it determines that a certain target has maintained the preset hero action for the preset hero duration Tz, it determines that this target is the hero among the multiple targets.
The preset hero action may be preset at the factory or set by the user, and is not limited here. For example, the preset hero action may be an OK gesture, or an action recognized when the elbow key point is higher than the shoulder key point and the angle formed by the wrist, elbow, and shoulder key points is larger than a certain angle; many other actions or postures may also serve as the preset hero action, which is not limited here.
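The duration rule can be sketched as follows; the per-target action flags are assumed to come from a hypothetical pose classifier, and the target ids and all names are assumptions for the example:

    def update_hero(held_since, action_flags, now, tz):
        # held_since: dict target_id -> time when the current action streak began.
        # action_flags: dict target_id -> True if the preset hero action is
        # detected for that target in the latest frame picture.
        hero = None
        for tid, acting in action_flags.items():
            if acting:
                held_since.setdefault(tid, now)
                if now - held_since[tid] >= tz:  # held for the preset duration Tz
                    hero = tid
            else:
                held_since.pop(tid, None)  # streak broken; the timer restarts
        return hero

Called once per frame, this returns a target id only after that target has held the preset action continuously for at least Tz.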
It can be understood that, since the pose analysis takes at least the preset hero duration Tz to determine a hero among the multiple targets, and a target may perform the preset hero action only after shooting has proceeded for a while, there may be a period at the beginning of video shooting during which no hero can be determined among the multiple targets. During this period, the electronic device 100 may determine the cropping region according to the number of targets determined by target detection and the positions of the targets in the frame picture. The cropping region is smaller than the picture range of the original frame picture and covers all detected targets. Fig. 13 is another exemplary diagram illustrating the determination of a cropping region in the person tracking display method according to the embodiment of the present application. The electronic device 100 determines the region 1301 covering the multiple targets as the cropping region.
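One way to compute a cropping region that covers all detected targets is a padded union of their boxes, clamped to the frame; the margin value is an illustrative assumption:

    def crop_covering_all(boxes, frame_w, frame_h, margin=40):
        # Union of all target boxes, expanded by a margin and clamped
        # to the frame picture.
        x0 = max(min(b[0] for b in boxes) - margin, 0)
        y0 = max(min(b[1] for b in boxes) - margin, 0)
        x1 = min(max(b[2] for b in boxes) + margin, frame_w)
        y1 = min(max(b[3] for b in boxes) + margin, frame_h)
        return int(x0), int(y0), int(x1), int(y1)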
The electronic device 100 may crop and output the frame picture according to the cropping region. Fig. 14 is another exemplary diagram of cropping and outputting in the person tracking display method according to the embodiment of the present application. Compared with the frame picture in the original video stream, each person in the cropped frame picture displayed on the display screen 194 is larger than in the original frame picture, and all targets detected in the original frame picture are completely displayed, so more details of the persons can be shown.
Case 2: the electronic device determines a hero among the multiple targets in the latest frame picture.
If a certain target among the multiple targets in the latest frame picture has maintained the preset hero action through the historical frame pictures, and the duration of the action just reaches the preset hero duration Tz at the latest frame picture, the electronic device 100 may determine, in the latest frame picture, that this target is the hero among the multiple targets. Fig. 15 is a schematic diagram illustrating an exemplary determination of a hero in the person tracking display method according to the embodiment of the present application. Assume that the preset hero action is a one-hand akimbo (hand-on-hip) action. The captured video stream continuously generates new frame pictures over time T. In the frame pictures generated at times T1 and T2, the electronic device 100 does not detect any target holding the preset hero action; at time T3, the electronic device 100 determines through pose analysis that the middle target among the 3 targets is holding the preset hero action. The electronic device 100 continues to perform pose detection on each subsequently generated frame picture, and in the latest frame picture generated at time T13 determines that the middle target has maintained the preset hero action for the preset hero duration. The electronic device 100 then determines that the middle target is the hero among the multiple targets. After determining that a certain target in the frame picture is the hero among the multiple targets, the electronic device 100 may record the feature information of this target.
After determining the hero among the multiple targets in the frame picture, the electronic device 100 may determine the cropping region according to the number of targets, the positions of the targets in the frame picture, and the position of the hero in the frame picture. The cropping region is smaller than the picture range of the original frame picture, covers all detected targets, and is centered on the hero. Fig. 16 is another exemplary diagram illustrating the determination of a cropping region in the person tracking display method according to the embodiment of the present application. The electronic device 100 determines the region 1601, which covers the multiple targets and is centered on the hero, as the cropping region.
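A hero-centered variant can be sketched by choosing half-extents around the hero's center just large enough to cover every target box; the margin and clamping are assumptions, and near the frame borders the clamping may shift the effective center slightly:

    def hero_centered_crop(boxes, hero_center, frame_w, frame_h, margin=40):
        cx, cy = hero_center
        # Smallest half-width/half-height around (cx, cy) covering every box.
        half_w = max(max(cx - b[0], b[2] - cx) for b in boxes) + margin
        half_h = max(max(cy - b[1], b[3] - cy) for b in boxes) + margin
        x0, y0 = max(cx - half_w, 0), max(cy - half_h, 0)
        x1, y1 = min(cx + half_w, frame_w), min(cy + half_h, frame_h)
        return int(x0), int(y0), int(x1), int(y1)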
The electronic device 100 may crop and output the frame picture according to the cropping region. Fig. 17 is another exemplary diagram of cropping and outputting in the person tracking display method according to the embodiment of the present application. Compared with the frame picture in the original video stream, the cropped frame picture displayed on the display screen 194 is centered on the hero, and each person is larger than in the original frame picture. Not only are all targets detected in the original frame picture completely displayed with more person details, but the main person can also be clearly identified, improving the user experience in the video scene.
Case 3: the electronic device has determined the hero in the historical frame pictures and tracks the determined hero in the latest frame picture.
If the electronic device 100 has already determined the hero in the historical frame pictures, then in a newly generated frame picture it no longer needs to determine the hero through pose analysis; it can track and determine the hero in the latest frame picture based only on the feature information of the hero in the historical frame pictures and the position information of the hero in the previous frame picture.
Specifically, the electronic device 100 may first determine as candidate targets those targets in the latest frame picture whose positions are within a preset distance threshold s of the hero's position in the previous frame picture. If there is 1 candidate target, the electronic device 100 may determine that this candidate target is the hero in the latest frame picture. If there are multiple candidate targets, the electronic device 100 may determine that the candidate target whose feature information is closest to the feature information of the hero in the historical frame pictures is the hero in the latest frame picture.
Fig. 18 is a schematic diagram illustrating an exemplary determination of candidate targets in the person tracking display method according to the embodiment of the present application. In the historical frame pictures generated from time T1 to T13, the electronic device 100 has determined the hero among the multiple targets. In the frame picture immediately preceding the latest frame picture generated at time T14, that is, the frame picture generated at time T13, the coordinates of the center point of the hero are (Xz, Yz). It is understood that, due to camera deflection, target movement, or the like, the position of a target may not be exactly the same in the frame pictures generated at times T13 and T14. As shown in fig. 18, in the frame picture generated at time T14 there is only one candidate target within the preset distance threshold s of the hero's position in the previous frame picture. When the electronic device 100 determines that there is only one candidate target in the latest frame picture, it may determine that this candidate target is the hero in the latest frame picture, and may record the hero's feature information.
Fig. 19 is another exemplary diagram illustrating the determination of candidate targets in the person tracking display method according to the embodiment of the present application. At time T13, the coordinates of the center point of the hero in the frame picture are (Xz, Yz). In the frame picture generated at time T14, there are 2 candidate targets within the preset distance threshold s of the hero's position in the previous frame picture. In this case, the electronic device 100 compares the feature information of the hero, recorded when the hero was determined in the historical frame pictures, with the feature information of the 2 candidate targets, and the candidate target whose feature information is closest to that of the hero in the historical frame pictures is determined to be the hero in the frame picture generated at time T14.
Specifically, each time the hero is determined in a historical frame picture, the electronic device may record the hero's feature information. After determining multiple candidate targets in the latest frame picture, the electronic device may compare the recorded feature information of the hero with the feature information of each candidate target using the cosine distance, and lock the closest candidate target as the hero.
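A sketch of this tracking step, assuming NumPy and feature vectors produced elsewhere (for example by a person re-identification network); every name here is an assumption for illustration:

    import numpy as np

    def track_hero(centers, features, last_pos, hero_feat, s):
        # Gate: candidate targets are those within distance s of the hero's
        # position in the previous frame picture.
        cand = [i for i, (x, y) in enumerate(centers)
                if np.hypot(x - last_pos[0], y - last_pos[1]) <= s]
        if not cand:
            return None  # no candidate close enough in this frame
        if len(cand) == 1:
            return cand[0]
        # Several candidates: pick the one whose feature has the highest cosine
        # similarity (smallest cosine distance) to the recorded hero feature.
        sims = [float(np.dot(features[i], hero_feat) /
                      (np.linalg.norm(features[i]) * np.linalg.norm(hero_feat)))
                for i in cand]
        return cand[int(np.argmax(sims))]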
After determining the hero in the latest frame picture, the electronic device 100 may also record the hero's position information and feature information, thereby facilitating tracking of the determined hero in subsequently generated frame pictures.
After determining the hero among the multiple targets in the frame picture, the electronic device 100 may determine the cropping region and crop and output the picture according to the number of targets, the positions of the targets in the frame picture, and the position of the hero in the frame picture. For the specific manner, refer to fig. 16 and 17; it is similar to the determination of the cropping region, cropping, and outputting in case 2 and is not described again here.
3. The shooting targets are multiple targets, and the user does not start the hero mode.
If the current shooting targets are multiple targets, and the user has clicked the person tracking control 801 in the shooting setting interface 800 shown in fig. 8 to start the person tracking function but has not clicked the hero mode control 803 to start the hero mode, then during video shooting the electronic device 100 performs target detection on the frame pictures in the currently captured video stream and determines the number of targets and the positions of the targets in the frame picture.
The electronic device then determines the cropping region according to the number of targets determined by target detection and the positions of the targets in the frame picture, and crops and outputs the picture. For the specific manner, refer to fig. 13 and 14; it is similar to the determination of the cropping region, cropping, and outputting in case 1 and is not described again here.
It is understood that, when the shooting targets are multiple targets, the electronic device may, as in the single-target case, perform downsampling to reduce the frame resolution and the processing load before target detection, and after cropping may perform upsampling to restore the resolution of the cropped frame picture to that of the original frame picture, which is not described again here.
In the embodiment of the present application, in a video shooting scene, when the electronic device 100 crops and outputs a processed frame picture according to the cropping region, the picture may be output to the display screen 194 of the electronic device 100 for display. In a video conference or video chat scenario, the electronic device 100 may also output the processed frame picture through the mobile communication module 150 and/or the wireless communication module 160 to the electronic device at the other end of the communication, so that the picture is displayed on that device's display screen. The specific output destination may be determined according to the actual usage scenario and is not limited here.
In some embodiments of the present application, the person tracking display method may also be used to process a recorded video: according to the generation times of the frames in the video, the frame being processed is taken as the latest frame picture and the frames already processed are taken as the historical frame pictures. The specific processing is the same as the person tracking display method described above and is not repeated here.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
As used in the above embodiments, the term "when …" may be interpreted to mean "if …" or "after …" or "in response to a determination of …" or "in response to a detection of …", depending on the context. Similarly, depending on the context, the phrase "at the time of determination …" or "if (a stated condition or event) is detected" may be interpreted to mean "if the determination …" or "in response to the determination …" or "upon detection (a stated condition or event)" or "in response to detection (a stated condition or event)".
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others.
One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. And the aforementioned storage medium includes: various media capable of storing program codes, such as ROM or RAM, magnetic or optical disks, etc.

Claims (23)

1. A person tracking display method is characterized by comprising the following steps:
the electronic device performs target detection on the latest frame picture in a video stream to obtain the number of targets and the positions of the targets in the latest frame picture, wherein the targets are persons in the latest frame picture;
the electronic device determines a cropping region according to the number of the targets and the positions of the targets in the latest frame picture, wherein the cropping region covers the positions of the targets in the latest frame picture and is smaller than the picture range of the latest frame picture;
and the electronic device outputs the cropped frame picture according to the cropping region.
2. The method according to claim 1, wherein the electronic device determines the cropping region according to the number of the targets and the positions of the targets in the latest frame picture, specifically comprising:
when the number of the targets is 1, the electronic device determines the cropping region according to the position of the 1 target in the latest frame picture; the cropping region is centered on the position of the 1 target in the latest frame picture, covers that position, and is smaller than the picture range of the latest frame picture;
when the number of the targets is multiple, the electronic device determines the cropping region according to the positions of the multiple targets in the latest frame picture; the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture.
3. The method according to claim 2, wherein the electronic device determines the cropping region according to the positions of the multiple targets in the latest frame picture, specifically comprising:
the electronic device determining a hero among the multiple targets;
the electronic device determining the cropping region according to the positions of the multiple targets in the latest frame picture and the position of the hero in the latest frame picture; when there is no hero among the multiple targets, the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture; when there is a hero among the multiple targets, the cropping region is centered on the position of the hero in the latest frame picture, covers the positions of the multiple targets in the latest frame picture, and is smaller than the picture range of the latest frame picture.
4. The method according to claim 3, wherein the electronic device determines a hero among the multiple targets, specifically comprising:
if the hero among the multiple targets has not been determined in a historical frame picture, the electronic device performs pose analysis on the multiple targets to determine the hero among the multiple targets, wherein a historical frame picture is a frame picture before the latest frame picture in the video stream;
and if the hero among the multiple targets has been determined in a historical frame picture, the electronic device tracks and determines the hero among the multiple targets according to the position and feature information of the hero in the historical frame picture.
5. The method according to claim 4, wherein the electronic device performs pose analysis on the multiple targets to determine the hero among the multiple targets, specifically comprising:
the electronic device determining that a target among the multiple targets that has maintained the preset hero action for the preset hero duration is the hero among the multiple targets.
6. The method according to claim 4 or 5, wherein the electronic device tracks and determines the hero among the multiple targets according to the position and feature information of the hero in the historical frame picture, specifically comprising:
the electronic device determines candidate targets in the latest frame picture, wherein a candidate target is a target among the multiple targets whose position is within a preset distance threshold of the position of the hero in the previous frame picture;
when there is one candidate target, the electronic device determines that the candidate target is the hero among the multiple targets;
when there are multiple candidate targets, the electronic device determines that the candidate target whose feature information is closest to the feature information of the hero in the historical frame picture is the hero among the multiple targets.
7. The method according to any one of claims 3 to 6, further comprising:
the electronic device records feature information of the hero among the multiple targets.
8. The method according to any one of claims 1 to 7, wherein the electronic device performs target detection on the latest frame picture in the video stream to obtain the number of targets and the positions of the targets in the latest frame picture, specifically comprising:
the electronic device downsamples the original latest frame picture in the video stream to obtain the latest frame picture, wherein the resolution of the latest frame picture is smaller than that of the original latest frame picture;
the electronic device performs target detection on the latest frame picture to obtain the number of targets and the positions of the targets in the latest frame picture;
and the electronic device outputs the cropped frame picture according to the cropping region, specifically comprising:
the electronic device crops and upsamples the original latest frame picture according to the cropping region in the latest frame picture to obtain a cropped frame picture, wherein the resolution of the cropped frame picture is equal to that of the original latest frame picture;
and the electronic device outputs the cropped frame picture.
9. The method according to claim 8, wherein before the step of outputting the cropped frame picture by the electronic device, the method further comprises:
and the electronic equipment carries out distortion correction on the cut frame picture.
10. The method according to any one of claims 1 to 9, wherein before the electronic device performs target detection on the latest frame picture in the video stream to obtain the number of targets and the positions of the targets in the latest frame picture, the method further comprises:
the electronic device merging video streams captured by multiple cameras to obtain the video stream.
11. An electronic device, characterized in that the electronic device comprises: a camera, one or more processors, and a memory;
the camera is used for shooting to obtain a video stream;
the memory is coupled with the one or more processors and is used to store computer program code, the computer program code comprising computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to perform:
performing target detection on the latest frame picture in the video stream to obtain the number of targets and the positions of the targets in the latest frame picture, wherein the targets are persons in the latest frame picture;
determining a cropping region according to the number of the targets and the positions of the targets in the latest frame picture, wherein the cropping region covers the positions of the targets in the latest frame picture and is smaller than the picture range of the latest frame picture;
and outputting the cropped frame picture according to the cropping region.
12. The electronic device of claim 11, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
when the number of the targets is 1, determining the cropping region according to the position of the 1 target in the latest frame picture, wherein the cropping region is centered on the position of the 1 target in the latest frame picture, covers that position, and is smaller than the picture range of the latest frame picture;
when the number of the targets is multiple, determining the cropping region according to the positions of the multiple targets in the latest frame picture, wherein the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture.
13. The electronic device of claim 12, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
determining a hero among the multiple targets;
determining the cropping region according to the positions of the multiple targets in the latest frame picture and the position of the hero in the latest frame picture, wherein when there is no hero among the multiple targets, the cropping region covers the positions of the multiple targets in the latest frame picture and is smaller than the picture range of the latest frame picture; and when there is a hero among the multiple targets, the cropping region is centered on the position of the hero in the latest frame picture, covers the positions of the multiple targets in the latest frame picture, and is smaller than the picture range of the latest frame picture.
14. The electronic device of claim 13, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
when the hero among the multiple targets has not been determined in a historical frame picture, performing pose analysis on the multiple targets to determine the hero among the multiple targets, wherein a historical frame picture is a frame picture before the latest frame picture in the video stream;
when the hero among the multiple targets has been determined in a historical frame picture, tracking and determining the hero among the multiple targets according to the position and feature information of the hero in the historical frame picture.
15. The electronic device of claim 14, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
determining that a target among the multiple targets that has maintained the preset hero action for the preset hero duration is the hero among the multiple targets.
16. The electronic device according to claim 14 or 15, wherein the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
determining candidate targets in the latest frame picture, wherein a candidate target is a target among the multiple targets whose position is within a preset distance threshold of the position of the hero in the previous frame picture;
when there is one candidate target, determining that the candidate target is the hero among the multiple targets;
and when there are multiple candidate targets, determining that the candidate target whose feature information is closest to the feature information of the hero in the historical frame picture is the hero among the multiple targets.
17. The electronic device according to any one of claims 13 to 16, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
recording feature information of the hero among the multiple targets.
18. The electronic device according to any one of claims 11 to 17, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
downsampling the original latest frame picture in the video stream to obtain the latest frame picture, wherein the resolution of the latest frame picture is smaller than that of the original latest frame picture;
performing target detection on the latest frame picture to obtain the number of targets and the positions of the targets in the latest frame picture;
cropping and upsampling the original latest frame picture according to the cropping region in the latest frame picture to obtain a cropped frame picture, wherein the resolution of the cropped frame picture is equal to that of the original latest frame picture;
and outputting the cropped frame picture.
19. The electronic device of claim 18, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
and carrying out distortion correction on the cut frame picture.
20. The electronic device of any of claims 11-19, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:
and merging the video streams shot by the plurality of cameras to obtain the video streams.
21. A chip system for application to an electronic device, the chip system comprising one or more processors for invoking computer instructions to cause the electronic device to perform the method of any of claims 1-10.
22. A computer program product comprising instructions for causing an electronic device to perform the method according to any one of claims 1-10 when the computer program product is run on the electronic device.
23. A computer-readable storage medium comprising instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-10.
CN202010323761.6A 2020-04-22 2020-04-22 Character tracking display method and electronic equipment Pending CN113536866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010323761.6A CN113536866A (en) 2020-04-22 2020-04-22 Character tracking display method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010323761.6A CN113536866A (en) 2020-04-22 2020-04-22 Character tracking display method and electronic equipment

Publications (1)

Publication Number Publication Date
CN113536866A true CN113536866A (en) 2021-10-22

Family

ID=78094139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010323761.6A Pending CN113536866A (en) 2020-04-22 2020-04-22 Character tracking display method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113536866A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116792A1 (en) * 2021-12-22 2023-06-29 华为技术有限公司 Image display method, electronic device thereof, and medium
CN116112782A (en) * 2022-05-25 2023-05-12 荣耀终端有限公司 Video recording method and related device
CN116112780A (en) * 2022-05-25 2023-05-12 荣耀终端有限公司 Video recording method and related device
CN116112780B (en) * 2022-05-25 2023-12-01 荣耀终端有限公司 Video recording method and related device
CN116112782B (en) * 2022-05-25 2024-04-02 荣耀终端有限公司 Video recording method and related device
CN117714903A (en) * 2024-02-06 2024-03-15 成都唐米科技有限公司 Video synthesis method and device based on follow-up shooting and electronic equipment
CN117714903B (en) * 2024-02-06 2024-05-03 成都唐米科技有限公司 Video synthesis method and device based on follow-up shooting and electronic equipment

Similar Documents

Publication Publication Date Title
CN112717370B (en) Control method and electronic equipment
WO2021052458A1 (en) Machine translation method and electronic device
WO2021000881A1 (en) Screen splitting method and electronic device
WO2021027725A1 (en) Method for displaying page elements and electronic device
WO2021104485A1 (en) Photographing method and electronic device
CN113747085B (en) Method and device for shooting video
CN113536866A (en) Character tracking display method and electronic equipment
CN113099146B (en) Video generation method and device and related equipment
EP4109882A1 (en) Image processing method and electronic device
CN110830645B (en) Operation method, electronic equipment and computer storage medium
CN111553846A (en) Super-resolution processing method and device
WO2024055797A9 (en) Method for capturing images in video, and electronic device
CN115484380A (en) Shooting method, graphical user interface and electronic equipment
CN115689963A (en) Image processing method and electronic equipment
CN115643485B (en) Shooting method and electronic equipment
WO2022156473A1 (en) Video playing method and electronic device
CN111768352A (en) Image processing method and device
CN115115679A (en) Image registration method and related equipment
WO2022057384A1 (en) Photographing method and device
WO2022068522A1 (en) Target tracking method and electronic device
WO2021204103A1 (en) Picture preview method, electronic device, and storage medium
WO2023078133A1 (en) Video playback method and device
EP4258649A1 (en) Method for determining tracking target, and electronic device
CN116055861B (en) Video editing method and electronic equipment
CN117389745B (en) Data processing method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination