CN117319594A - Conference personnel tracking display method, device, equipment and readable storage medium - Google Patents

Conference personnel tracking display method, device, equipment and readable storage medium

Info

Publication number
CN117319594A
Authority
CN
China
Prior art keywords
person
target
image frame
speaking
conference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310740618.0A
Other languages
Chinese (zh)
Inventor
任威
徐金伟
江文祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Langyuan Electronic Technology Co Ltd
Original Assignee
Changsha Langyuan Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Langyuan Electronic Technology Co Ltd filed Critical Changsha Langyuan Electronic Technology Co Ltd
Priority to CN202310740618.0A priority Critical patent/CN117319594A/en
Publication of CN117319594A publication Critical patent/CN117319594A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/61 Control of cameras or camera modules based on recognised objects
    • H04N 23/611 Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • H04N 23/67 Focus control based on electronic image sensor signals
    • H04N 23/69 Control of means for changing angle of the field of view, e.g. optical zoom objectives or electronic zooming
    • H04N 23/695 Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects
    • H04N 23/698 Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a conference personnel tracking display method, device, and equipment, and a readable storage medium, belonging to the technical field of data processing. The method comprises the following steps: during a video conference, determining target positioning information of the speaking personnel; if one person is speaking, zooming and focusing on the target portrait of that person according to a first preset image frame selection rule based on the target positioning information, so that the target portrait is displayed centred in the corresponding output picture; and if a plurality of persons are speaking, zooming and focusing on the target portraits of the speaking persons according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side in the corresponding output picture. Applied to devices such as video conference terminals, the conference personnel tracking display method can improve the effect and efficiency of video conferences.

Description

Conference personnel tracking display method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a computer readable storage medium for tracking and displaying conference personnel.
Background
When a video conference terminal is used for a remote video conference, it is often necessary to track the speaker and output a close-up shot of the speaker, so as to improve the communication efficiency of the video conference.
In the conventional method of tracking and displaying a speaker, the main approach is to recognise the speaker's position, aim the lens at the speaker, and perform focusing processing to obtain a close-up. However, because tracking display steps such as lens movement and focusing are handled too coarsely, the speaker's position within the output picture is not considered: for example, the speaker often appears at the edge of the picture, or the speaker's face is not fully shown in the output picture. This seriously affects the look and feel and the effect of the teleconference.
Disclosure of Invention
The main objective of the present application is to provide a conference personnel tracking display method, apparatus, and device, and a computer readable storage medium, aiming to solve the technical problem that an overly simple tracking display process in a video conference degrades the look and feel and effect of a remote conference.
To achieve the above object, the present application provides a conference personnel tracking display method, comprising the following steps:
during a video conference, determining target positioning information of a speaking person;
if the speaking person is one person, zooming and focusing on the target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information, so that the target portrait is displayed centred in the corresponding output picture;
and if the speaking persons are a plurality of persons, zooming and focusing on the target portraits of the speaking persons according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side in the corresponding output picture.
Optionally, the step of determining target positioning information of the speaking person during the video conference includes:
detecting whether human voice exists in the video conference process;
if the voice exists, judging whether the voice persists for a preset duration threshold;
if the preset duration threshold is reached, determining that the person corresponding to the voice is a speaking person, and determining the target positioning information of that speaking person.
Optionally, the step of determining target location information of the speaking person includes:
acquiring a panoramic image in a first visual range, and determining positioning information of each person in the panoramic image;
Acquiring voice angle information of a speaking person;
and determining target positioning information of the speaking person according to the voice angle information and the positioning information.
Optionally, the step of determining target positioning information of the speaking person according to the voice angle information and the positioning information includes:
determining the positioning information closest to the voice angle information in position in the positioning information;
judging whether the absolute value of the angle difference between the closest positioning information and the voice angle information is smaller than or equal to a preset angle threshold value;
and if the absolute value of the angle difference is smaller than or equal to a preset angle threshold value, determining that the closest positioning information is the target positioning information of the speaking person.
Optionally, the step of zooming and focusing the target portrait of the speaking person according to the first preset image frame selection rule so as to make the target portrait be centrally displayed in the corresponding output picture includes:
zooming and focusing on the target portrait of the speaking person according to the first preset image frame selection rule, so that the target portrait is displayed centred in the corresponding output picture and the centre of the target portrait's face falls within a preset area of the output picture;
wherein the first preset image frame selection rule includes:
for the first preset image frame corresponding to the first preset image frame selection rule, determining the length of the first preset image frame based on the body width of the speaking person, and determining the width of the first preset image frame based on the length and a preset length-to-width ratio of the first preset image frame.
Optionally, the step of zooming and focusing the target portrait of the speaking person according to a second preset image frame selection rule so as to enable the target portrait to be displayed left and right in a corresponding output picture includes:
determining the target speaking persons who spoke most recently among the speaking persons, wherein the target speaking persons are two people;
scaling and focusing the target portrait of the target speaking person according to a second preset image frame selection rule so as to enable the target portrait to be displayed left and right in a corresponding output picture;
wherein the second preset image frame selection rule includes:
for the second preset image frame corresponding to the second preset image frame selection rule, determining the length of the second preset image frame based on the distance between the bodies of the target speaking persons, and determining the width of the second preset image frame based on the length and a preset length-to-width ratio of the second preset image frame.
Optionally, after the step of centering the target portrait in the corresponding output screen, the method further includes:
if no voice angle information of any speaking person is detected, zooming and focusing on the overall portrait of all conference personnel according to a third preset image frame selection rule, so that the overall portrait is displayed centred in the corresponding output picture;
wherein the third preset image frame selection rule includes:
and for a third preset image frame corresponding to the third preset image frame selection rule, the size of the third preset image frame is consistent with the size of the view finding frame of the panoramic view angle.
In addition, in order to achieve the above object, the present application further provides a conference person tracking display device, the conference person tracking display device including:
the personnel positioning module is used for determining target positioning information of speaking personnel in the video conference process;
the camera tracking module is used for zooming and focusing on the target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information if the speaking person is one person, so that the target portrait is displayed centred in the corresponding output picture; and, if the speaking persons are a plurality of persons, zooming and focusing on the target portraits of the speaking persons according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side in the corresponding output picture.
The application also provides a conference personnel tracking display device, which comprises a microphone array, a camera module, a processor, a memory and a conference personnel tracking display program stored on the memory and executable by the processor, wherein the conference personnel tracking display program realizes the steps of the conference personnel tracking display method when being executed by the processor.
The application also provides a computer readable storage medium, on which a conference person tracking display program is stored, wherein the conference person tracking display program, when executed by a processor, implements the steps of the conference person tracking display method as described above.
According to the conference personnel tracking display method of the present application, target positioning information of the speaking personnel is determined during a video conference; if one person is speaking, the target portrait of that person is zoomed and focused according to a first preset image frame selection rule based on the target positioning information, so that the target portrait is displayed centred in the corresponding output picture; and if a plurality of persons are speaking, the target portraits of the speaking persons are zoomed and focused according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side in the corresponding output picture. When one person speaks, that person's portrait is kept centred in the output picture and a close-up of the speaker is highlighted, giving a better viewing experience, a better conference effect, and more efficient communication. When multiple persons speak, the speakers can be framed together in the same output picture, ensuring a close-up of each speaker, improving the viewing experience, the conference effect, and the conference efficiency, and avoiding as far as possible the visual distraction of persons who are not currently speaking.
Drawings
Fig. 1 is a schematic structural diagram of a hardware running environment of a conference personnel tracking display device according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a first embodiment of a method for tracking and displaying conference personnel in the present application;
FIG. 3 is a detailed flowchart of step S10 of a preferred embodiment of the method for tracking and displaying conference personnel;
FIG. 4 is a detailed flowchart of step S13 of a preferred embodiment of the method for tracking and displaying conference personnel;
FIG. 5 is a detailed flowchart of step S10 of another preferred embodiment of the method for tracking and displaying conference personnel;
fig. 6 is a schematic diagram of a video conference display screen related to a conference personnel tracking display method in the present application;
FIG. 7 is a schematic diagram of an application example related to a method for tracking and displaying a conference person according to the present application;
fig. 8 is a schematic diagram of a frame structure of the conference personnel tracking display device.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The embodiment of the application provides a conference personnel tracking display device, which can be a video conference terminal.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware running environment of a conference person tracking display device according to an embodiment of the present application.
As shown in fig. 1, the conference person tracking display device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a camera module 1006, a microphone array 1007, and a communication bus 1002. The communication bus 1002 is used to realise connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a control panel, and optionally may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g. a WiFi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory, and may optionally be a storage device separate from the processor 1001. As a computer storage medium, the memory 1005 may contain a conference person tracking display program. The camera module comprises a main camera (hereinafter "main camera") and a secondary camera (hereinafter "secondary camera"). The main camera is used to track the speaking person; it is mounted together with a pan-tilt head, so it can perform fine adjustment movements as well as optical and digital zooming. The secondary camera is fixed and is used to capture the conference panorama and to determine the position (positioning) information of each conference person within its visual range.
Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not constitute a limitation of the apparatus, and may include more or fewer components than shown, or may combine certain components, or may be arranged in different components.
With continued reference to fig. 1, the memory 1005 in fig. 1, as a computer-readable storage medium, may include an operating system, a user interface module, a network communication module, and a conference person tracking display program.
In fig. 1, the network communication module is mainly used for connecting with a server and performing data communication with the server; and the processor 1001 may call the conference person tracking display program stored in the memory 1005 and perform the following operations:
during a video conference, determining target positioning information of a speaking person;
if the speaking person is one person, zooming and focusing on the target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information, so that the target portrait is displayed centred in the corresponding output picture;
and if the speaking persons are a plurality of persons, zooming and focusing on the target portraits of the speaking persons according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side in the corresponding output picture.
Further, the processor 1001 may call the conference person tracking display program stored in the memory 1005, and further perform the following operations:
detecting whether human voice exists in the video conference process;
if the voice exists, judging whether the voice persists for a preset duration threshold;
if the preset duration threshold is reached, determining that the person corresponding to the voice is a speaking person, and determining the target positioning information of that speaking person.
Further, the processor 1001 may call the conference person tracking display program stored in the memory 1005, and further perform the following operations:
acquiring a panoramic image in a first visual range, and determining positioning information of each person in the panoramic image;
acquiring voice angle information of a speaking person;
and determining target positioning information of the speaking person according to the voice angle information and the positioning information.
Further, the processor 1001 may call the conference person tracking display program stored in the memory 1005, and further perform the following operations:
determining the positioning information closest to the voice angle information in position in the positioning information;
judging whether the absolute value of the angle difference between the closest positioning information and the voice angle information is smaller than or equal to a preset angle threshold value;
And if the absolute value of the angle difference is smaller than or equal to a preset angle threshold value, determining that the closest positioning information is the target positioning information of the speaking person.
Further, the processor 1001 may call the conference person tracking display program stored in the memory 1005, and further perform the following operations:
zooming and focusing on the target portrait of the speaking person according to the first preset image frame selection rule, so that the target portrait is displayed centred in the corresponding output picture and the centre of the target portrait's face falls within a preset area of the output picture;
wherein the first preset image frame selection rule includes:
for the first preset image frame corresponding to the first preset image frame selection rule, determining the length of the first preset image frame based on the body width of the speaking person, and determining the width of the first preset image frame based on the length and a preset length-to-width ratio of the first preset image frame.
Further, the processor 1001 may call the conference person tracking display program stored in the memory 1005, and further perform the following operations:
determining the target speaking persons who spoke most recently among the speaking persons, wherein the target speaking persons are two people;
scaling and focusing the target portrait of the target speaking person according to a second preset image frame selection rule so as to enable the target portrait to be displayed left and right in a corresponding output picture;
Wherein the second preset image frame selection rule includes:
for the second preset image frame corresponding to the second preset image frame selection rule, determining the length of the second preset image frame based on the distance between the bodies of the target speaking persons, and determining the width of the second preset image frame based on the length and a preset length-to-width ratio of the second preset image frame.
Further, the processor 1001 may call the conference person tracking display program stored in the memory 1005, and further perform the following operations:
if no voice angle information of any speaking person is detected, zooming and focusing on the overall portrait of all conference personnel according to a third preset image frame selection rule, so that the overall portrait is displayed centred in the corresponding output picture;
wherein the third preset image frame selection rule includes:
and for a third preset image frame corresponding to the third preset image frame selection rule, the size of the third preset image frame is consistent with the size of the view finding frame of the panoramic view angle.
Based on the hardware structure of the conference personnel tracking display device, various embodiments of the conference personnel tracking display method are provided.
The embodiment of the application provides a conference personnel tracking display method.
Referring to fig. 2, fig. 2 is a flow chart of a first embodiment of a method for tracking and displaying conference personnel in the present application; in a first embodiment of the present application, the conference person tracking display method includes the following steps:
step S10, determining target positioning information of a speaking person in the process of a video conference;
during a video conference, participants can be divided into two categories: speaking staff and non-speaking staff.
When the video conference starts, that is, when the conference personnel enter, the sound source positioning function is enabled. First, the leftmost and rightmost faces within the viewing angle range are located according to the face recognition results of the secondary camera, and the image frame of the main camera is adjusted to the optimal panoramic viewing angle based on the positions of the face coordinates (the automatic framing function). All persons within the main camera's viewing range must be contained, and the whole formed by all persons should be displayed as centred as possible in the output picture. The automatic framing function keeps judging and working after sound source positioning is enabled; the judging interval is 2 s, and a picture change is completed within 2 s at the slowest.
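The automatic framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the 16:9 output ratio, the margin factor, and the pixel-box representation of faces are all assumptions.

```python
# Hypothetical sketch of automatic framing: given the face boxes detected by
# the secondary camera, compute a crop that contains every participant and
# centres the group in the output picture.

def auto_frame(face_boxes, aspect=16 / 9, margin=0.15):
    """face_boxes: list of (x, y, w, h) tuples in panorama pixels.
    Returns an (x, y, w, h) crop that contains every face, padded by
    `margin` on each side and expanded to the requested aspect ratio."""
    if not face_boxes:
        return None  # no participants: caller falls back to the full panorama
    left = min(x for x, _, _, _ in face_boxes)
    right = max(x + w for x, _, w, _ in face_boxes)
    top = min(y for _, y, _, _ in face_boxes)
    bottom = max(y + h for _, y, _, h in face_boxes)
    w = (right - left) * (1 + 2 * margin)
    h = max((bottom - top) * (1 + 2 * margin), w / aspect)
    w = h * aspect  # keep the output aspect ratio
    cx, cy = (left + right) / 2, (top + bottom) / 2  # centre on the group
    return (cx - w / 2, cy - h / 2, w, h)
```

In a real device the resulting crop would be clamped to the sensor bounds and handed to the pan-tilt-zoom controller; that plumbing is omitted here.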
During the automatic framing of a video conference:
if no face (i.e. no conference person) exists in the output picture of the display device of the video conference terminal, the maximum display area is shown; that is, a panoramic shot (maximum viewing angle) is captured by the main/secondary camera and transmitted to the display device;
if there is only one person in the output picture, the main camera's zoom magnification is no more than 2x (2x digital zoom on top of a 3x optical basis);
when the main camera's maximum viewing angle cannot frame all faces, only the maximum viewing angle is displayed, and the completeness of the faces is no longer considered;
when the picture switches between different speakers, it needs to pan with a smooth camera movement.
In addition, when no one is speaking and there is only one person in the output picture, the main camera tracks in real time, and when the person moves beyond 1/3 of the output picture (that is, the output picture can be divided into thirds horizontally and/or vertically), the person is immediately re-centred. When no one is speaking and there are two or more persons in the output picture, the main camera's viewing angle is refreshed every 1 second, and the output picture is updated once every 2 seconds.
Whether or not someone is speaking, if the displacement change of a person (face) captured by the main camera is too large and exceeds a certain displacement threshold (which can be set according to actual needs, e.g. 5 degrees), the current viewing angle is maintained and not refreshed. The aim is to avoid the tracking interference easily caused by people entering or leaving the meeting, and to improve the actual tracking effect. Likewise, if the displacement of a person (face) captured by the main camera changes continuously within a short time (e.g. 5 s), the current viewing angle is maintained and not refreshed, i.e. the person is not tracked. The aim is to prevent tracking interference caused by excessive movements of some personnel and to improve the conference effect.
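The two refresh-suppression rules above can be sketched as a small gate. The thresholds (5 degrees, 5 seconds) follow the examples given in the text; the class and method names are assumptions, not the patent's terminology.

```python
import time

# Illustrative sketch: suppress view refreshes when a face jumps too far
# (likely someone entering/leaving) or keeps moving for a sustained period
# (fidgeting), per the rules described in the text.

class ViewRefreshGate:
    def __init__(self, disp_threshold_deg=5.0, motion_window_s=5.0):
        self.disp_threshold = disp_threshold_deg
        self.motion_window = motion_window_s
        self.last_angle = None
        self.moving_since = None

    def should_refresh(self, face_angle_deg, now=None):
        now = time.monotonic() if now is None else now
        if self.last_angle is None:
            self.last_angle = face_angle_deg
            return True
        delta = abs(face_angle_deg - self.last_angle)
        if delta > self.disp_threshold:
            # Large jump: likely entering/leaving -- hold the current view.
            return False
        if delta > 0:
            # Track how long the subject has been continuously moving.
            if self.moving_since is None:
                self.moving_since = now
            if now - self.moving_since >= self.motion_window:
                return False  # sustained movement: hold the current view
        else:
            self.moving_since = None
        self.last_angle = face_angle_deg
        return True
```

Passing `now` explicitly makes the gate testable; a device would simply call `should_refresh(angle)` on each tracking update.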
After the automatic framing of the initial stage, the conference personnel are basically seated. When a speaker speaks, the speaker's position (target positioning information) can be determined through the microphone array (e.g. a four-microphone array), so that the main camera can be aimed, framed, and focused according to the speaker's position to produce a close-up of the speaker.
Referring to fig. 3, in an embodiment, the determining target positioning information of the speaking person in step S10 includes:
step S11, obtaining a panoramic image in a first visual range, and determining positioning information of each person in the panoramic image;
In this embodiment, to conveniently distinguish the maximum viewing angles of the main and secondary cameras, they can be distinguished as follows: the secondary camera corresponds to a first visual (viewing angle) range, and the main camera corresponds to a second visual (viewing angle) range.
The conference panoramic image within the maximum visual range is obtained through the secondary camera. Based on basic camera calibration and coordinate conversion, the world coordinates of each person in the actual conference scene, i.e. the positioning information of each person, can be determined from the position coordinates of each participant within the panoramic image. Specifically, a two-dimensional coordinate system can be established with the optical centre of the secondary camera as the origin, the optical axis as the y axis, and the left-right direction of the secondary camera as the x axis. The positioning information here may thus be the angular position information of each person.
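A minimal sketch of the coordinate-conversion step, assuming a simple pinhole model for the secondary camera: the horizontal pixel position of a detected person is mapped to a signed angle in the coordinate system described above (optical centre at the origin, 0 degrees on the optical axis). The function name and parameters are illustrative assumptions.

```python
import math

# Map a participant's pixel column in the panorama to an angular position:
# 0 deg on the optical axis, negative to the left, positive to the right.

def pixel_to_angle_deg(x_pixel, image_width, hfov_deg):
    # Offset from the image centre, normalised to [-1, 1].
    offset = (x_pixel - image_width / 2) / (image_width / 2)
    # Pinhole projection: tan(angle) grows linearly with the sensor offset.
    half_fov = math.radians(hfov_deg / 2)
    return math.degrees(math.atan(offset * math.tan(half_fov)))
```

A calibrated device would also correct for lens distortion before this mapping; that refinement is omitted here.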
Step S12, acquiring voice angle information of a speaking person;
The voice angle information of the speaker can be obtained and determined by the microphone array; the sound localisation technology of this part is not elaborated here. The voice angle information can be understood as being determined in a two-dimensional coordinate system similar to the one described above (it may be the same coordinate system, or a coordinate system set up based on the microphone array), e.g. -10°, +5°.
And step S13, determining target positioning information of the speaking person according to the voice angle information and the positioning information.
The voice angle information is compared against each piece of positioning information so as to find the target positioning information matching the voice angle information, and thereby determine the position of the person currently speaking.
In the embodiment of the present application, the speaker's position can be accurately determined based on the two layers of sound positioning and image positioning, so that a close-up of the speaker can be conveniently produced and the conference effect is ensured.
Referring to fig. 4, specifically, in an embodiment, the step S13 includes:
step S130, determining the positioning information closest to the voice angle information in position in the positioning information;
To determine the target positioning information of the speaker from a plurality of pieces of positioning information, the positioning information closest in position to the voice angle information may be determined first.
For example, the positioning information of-10 degrees, 5 degrees and 8 degrees exists, the voice angle information is 6 degrees, and obviously, the closest positioning information is 5 degrees.
Step S131, judging whether the absolute value of the angle difference between the closest positioning information and the voice angle information is smaller than or equal to a preset angle threshold;
the closest positioning information is subtracted from the voice angle information to obtain an angle difference, and the absolute value of that difference is compared with a preset angle threshold. The preset angle threshold can be set according to actual needs, for example 3 degrees.
Step S132, if the absolute value of the angle difference is smaller than or equal to a preset angle threshold, determining that the closest positioning information is the target positioning information of the speaking person.
If the absolute value of the angle difference is less than or equal to the preset angle threshold, the closest positioning information is considered to be the target positioning information of the speaker.
If the absolute value of the angle difference is greater than the preset angle threshold, the deviation is too large: the closest positioning information cannot be regarded as the target positioning information of the speaker, and no response is made to it, i.e. no participant is considered to be speaking. This avoids interference from other external sounds, prevents the camera from moving and focusing on irrelevant positions, and ensures that the conference proceeds smoothly.
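The matching logic of steps S130 to S132 can be sketched as follows. This is an illustrative sketch only: the function name is invented, and the 3-degree default threshold is merely the example value given above.

```python
def match_speaker_angle(voice_angle, person_angles, threshold_deg=3.0):
    """Match a microphone-array voice angle against image-localized
    participant angles. Returns the matched angle, or None when the
    closest participant deviates beyond the threshold (the voice is
    then treated as external interference and ignored)."""
    if not person_angles:
        return None
    # Step S130: find the positioning information closest in position.
    closest = min(person_angles, key=lambda a: abs(a - voice_angle))
    # Steps S131/S132: accept only if within the preset angle threshold.
    if abs(closest - voice_angle) <= threshold_deg:
        return closest
    return None  # deviation too large: no tracking response
```

With the example values from the text, a voice at 6 degrees matches the participant at 5 degrees, while a voice at 20 degrees matches no one.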
After the target positioning information of the speaker is confirmed, the main camera needs to be aimed at the speaker, driven by the pan-tilt head.
Step S20, if the speaking person is a person, zooming and focusing a target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information so as to enable the target portrait to be centrally displayed in a corresponding output picture;
the speaker being one person means that only one person in the conference is speaking; after the target positioning information is determined, the main camera is aimed at the speaker within its movable range.
The preset image frame corresponding to the first preset image frame selection rule may be regarded as a cropping frame or composition frame; it is also the image boundary of the output picture and determines the edges of the image displayable on the display. Its function is: for a specific object (the speaker), the panoramic image within the main camera's visual range (maximum viewing angle) is zoomed and focused, and the speaker together with some adjacent scenery and people are framed, so that their images are displayed in the output picture.
After the frame selection operations such as zooming and focusing are performed with the first preset image frame according to the first preset image frame selection rule, the target portrait of the speaker is displayed centrally in the output picture. This gives the close-up of the speaker the most attractive effect, so that both on-site and remote participants can focus on the speaker efficiently and in time, improving the effect and efficiency of the teleconference.
And step S30, if the speaking person is a plurality of persons, zooming and focusing the target portrait of the speaking person according to a second preset image frame selection rule based on the target positioning information so as to enable the target portrait to be displayed left and right in a corresponding output picture.
The speaking persons being multiple means that several people in the conference are speaking, and two situations need to be distinguished here: two people speaking at the same time, and more than two people speaking at the same time. "At the same time" may be understood as within a certain time range, which can be set according to actual needs, for example three people speaking within 5 s, or two people speaking within 1 s.
For the case of two people (or more than two people treated as two) speaking simultaneously, the second preset image frame selection rule may be: perform image frame selection (the frame selection process being zooming and focusing) at the positions of the two persons at the leftmost and rightmost, and output the picture accordingly. In other words, the two speakers serve as the left and right margins of the image frame, so that both their portraits are completely contained and the corresponding output picture also displays whatever lies between them. Further, "two persons speaking" here can cover the case of more than two speakers: the two most recent or clearest speakers may be selected according to the speaking order or the clarity of the voices, and the case is then treated as two people speaking.
For the case that more than two people are speaking, the second preset image frame selection rule may be:
no tracking operation is performed; the image frame is adjusted to a preset optimal panoramic viewing angle (auto-framing), which should include all persons within the main camera's visual range and display the group formed by all of them as centrally as possible in the output picture. Persons outside the visual range are not considered.
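The dispatch between the framing rules above might be sketched as follows. The rule labels are invented, and note the text offers two options for more than two simultaneous speakers (reduce to the two most recent, or fall back to panorama); this sketch uses the panorama fallback described immediately above.

```python
def select_framing_rule(num_speakers):
    """Choose a framing rule from the number of simultaneous speakers:
    one speaker -> centered close-up (first rule); two speakers ->
    left/right framing (second rule); more than two -> optimal
    panoramic viewing angle with auto-framing, no individual tracking."""
    if num_speakers == 1:
        return "close_up_centered"
    if num_speakers == 2:
        return "two_person_left_right"
    return "auto_framing_panorama"
```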
According to the conference personnel tracking display method above, target positioning information of the speaking person is determined during the video conference; if the speaker is one person, the target portrait is zoomed and focused according to the first preset image frame selection rule based on the target positioning information, so that it is displayed centrally in the corresponding output picture; if the speakers are multiple, the target portraits are zoomed and focused according to the second preset image frame selection rule, so that they are displayed on the left and right of the corresponding output picture. When one person speaks, keeping that portrait centered in the output picture highlights the close-up of the speaker, giving a better viewing and conference effect and better communication efficiency. When multiple people speak, the speakers can be framed in the same output picture, ensuring a close-up of each speaker, improving visibility, conference effect and conference efficiency, and avoiding as much as possible the visual interference of people who are not currently speaking.
Referring to fig. 5, in an embodiment, the step S10 includes:
step S100, detecting whether a voice exists in the video conference process;
whether a person in the conference is speaking can be identified through sound processing technologies such as noise reduction, voice recognition and voiceprint recognition; other irrelevant noise is excluded and no localization processing is performed on it.
Step S110, if voice exists, judging whether the voice continuously reaches a preset duration threshold;
if a voice is present, then in order to exclude occasional and irrelevant voice interference, such as coughs, interjections and simple acknowledgements unrelated to the speech content, it can further be judged whether the voice appearing in the conference lasts for a preset duration threshold. The preset duration threshold can be set according to actual needs, for example 2 s: only when the voice lasts 2 s is the person considered to be speaking.
And step S120, if the preset duration threshold is reached, determining that the person corresponding to the voice is the speaking person and the target positioning information of the speaking person.
If the voice reaches the preset duration threshold, the person corresponding to the voice is determined to be the speaker, and at the same time the target positioning information of the speaker is determined based on the combination of sound localization and image localization.
The number of speaking persons is not limited, but each must speak for the preset duration threshold.
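The duration gating of steps S100 to S120 can be sketched as a small debouncer. The class name and the sampling interface are assumptions; the 2 s default is the example value from the text.

```python
class VoiceGate:
    """Report a speaker only after continuous voice activity reaches a
    duration threshold (e.g. 2 s), filtering out coughs and short
    interjections. `update` is called periodically with the current
    voice-activity flag and a timestamp in seconds."""

    def __init__(self, threshold_s=2.0):
        self.threshold_s = threshold_s
        self.voice_start = None  # start time of the current voice run

    def update(self, has_voice, now_s):
        if not has_voice:
            self.voice_start = None  # run broken: reset
            return False
        if self.voice_start is None:
            self.voice_start = now_s  # a new voice run begins
        return (now_s - self.voice_start) >= self.threshold_s
```

Only once `update` returns True would the sound/image localization of step S120 be triggered for that voice.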
Furthermore, even though a detected voice may reach the preset duration threshold, the angle recognized by the microphone array may fall outside the image-recognized angular positions of all participants, i.e. the voice comes from outside the range of the sub-camera panoramic image; such a voice is treated like the out-of-threshold case above and no tracking response is made. In addition, regarding the maximum viewing angles corresponding to the panoramic images of the main camera and the sub camera: the two maximum viewing angles may be identical or slightly different.
Through this embodiment of the application, interference from irrelevant sounds during the conference is excluded, and the current speaker and the speaker's position can be determined more accurately, so that a close-up response to the speaker is made quickly, improving the efficiency of the whole conference and the close-up effect of tracking the speaker.
Based on the foregoing embodiments, in one embodiment, the step S20 of scaling and focusing the target portrait of the speaking person according to the first preset image frame selection rule so as to make the target portrait be centrally displayed in the corresponding output frame includes:
a, scaling and focusing a target portrait of the speaking person according to a first preset image frame selection rule so as to enable the target portrait to be centrally displayed in a corresponding output picture, and enabling the face center of the target portrait to be in a preset area in the output picture;
Wherein, the first preset image frame selection rule includes:
for a first preset image frame corresponding to the first preset image frame selection rule, determining the length of the first preset image frame based on the human body width of the speaking person, and determining the width of the first preset image frame based on the ratio between the first preset image frame and the length.
First, the first preset image frame is described. The length of the first preset image frame may be determined based on the human body width of the speaker: specifically based on the torso pixel width of the speaker's target portrait in the panoramic image, or based on the head (face) pixel width of the target portrait in the panoramic image. If determined from the torso pixel width, the length may be 2 times the torso pixel width; if determined from the head (face) pixel width, it may be 4 times that width. The width of the first preset image frame can then be determined from a 16:9 or other ratio to the length, thus fixing the size of the first preset image frame. 16:9 is preferred because the corresponding output picture can then conveniently fill the whole display screen.
Further, after or while determining the size of the first preset image frame, its position needs to be determined; the first preset image frame selection rule thus covers both the size and the position of the frame. The position is determined based on the target portrait, specifically based on the center of the face (around the eyes or nose). If the speaker is one person, the image frame is placed so that the target portrait is displayed centrally in the output picture: the portrait framed by the image frame is in the middle of the frame, and it is additionally ensured that the face center (which may be the eye area) of the target portrait (the speaker) is displayed in a preset area of the output picture, which may be the upper third. In other words, the output picture / first preset image frame can be divided into three equal areas, upper, middle and lower, and the speaker's eyes or nose are kept in the upper area.
Further, depending on the size of the first preset image frame and the positions of adjacent participants, framing the speaker may easily also frame the speaker's neighbors, partially or wholly.
In order to further enhance the tracking display effect of the video conference, the position of the first preset image frame is also determined based on the portraits of the neighbors adjacent to the speaker within the frame, specifically based on their complete faces: while keeping the speaker as centered as possible (without rigidly locking the position), the face integrity of the neighbors' portraits in the output picture is ensured, which can be achieved by zooming in and out during shooting by the main camera. In short, the face center is kept in the upper 1/3 area of the main camera's viewing angle but is not locked to the center, and the left and right boundaries of the image frame are finely adjusted according to the positions of the surrounding persons to keep their faces complete.
In addition, if the face of the speaker cannot be placed at the 1/3 position because of the picture or the maximum viewing angle, the speaker's target portrait should be kept as centered as possible and the image frame edge made to coincide with the visual limit edge; this is a hardware limitation.
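The single-speaker frame geometry described above can be sketched numerically. All specifics here are assumptions drawn from the examples in the text: length equals 2x the torso pixel width, 16:9 aspect, face center on the upper-third line; the fine adjustment for neighbors' faces and the visual-limit clamping are omitted.

```python
def single_speaker_frame(face_cx, face_cy, torso_w_px, aspect=16 / 9):
    """Compute an (left, top, width, height) crop frame for one speaker:
    frame width = 2x torso pixel width, height from the 16:9 ratio,
    portrait centered horizontally, eyes/nose on the upper-third line."""
    frame_w = 2 * torso_w_px
    frame_h = frame_w / aspect
    left = face_cx - frame_w / 2   # center the portrait horizontally
    top = face_cy - frame_h / 3    # face center one third from the top
    return (left, top, frame_w, frame_h)
```

For a face centered at (100, 50) with an 80 px torso width, this yields a 160x90 frame whose top-left corner is at (20, 20).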
Based on the foregoing embodiments, in one embodiment, the step S30 of scaling and focusing the target portrait of the speaking person according to the second preset image frame selection rule so as to display the target portrait left and right in the corresponding output frame includes:
Step b, determining the target speaking person which is the latest speaking person among the speaking persons, wherein the target speaking person is two persons;
c, zooming and focusing the target portrait of the target speaking person according to a second preset image frame selection rule so as to enable the target portrait to be displayed left and right in a corresponding output picture;
wherein the second preset image frame selection rule includes:
and for a second preset image frame corresponding to the second preset image frame selection rule, determining the length of the second preset image frame based on the human body distance of the target speaking person, and determining the width of the second preset image frame based on the ratio between the second preset image frame and the length.
First, the second preset image frame is described. The length of the second preset image frame may be determined based on the human body distance between the left and right target speakers, where the human body distance may refer to the distance between their portraits in the image; preferably the maximum human body distance, so that both the left and right speakers are completely framed. The width of the second preset image frame can then be determined from a 16:9 or other ratio to the length, thus fixing its size. 16:9 is preferred because the corresponding output picture can then conveniently fill the whole display screen.
Further, after or while determining the size of the second preset image frame, its position needs to be determined; the second preset image frame selection rule thus covers both the size and the position of the frame. The position may be determined based on the target portraits, specifically based on the centers of their faces (around the eyes or nose), and may also be determined based on the maximum human body distance. When the speakers are multiple, to ensure that the target portraits are completely displayed on the left and right of the corresponding output picture, it can further be ensured that the face centers (which may be the eye areas) of the target portraits (the speakers) are displayed in a preset area of the output picture, which may be the upper third. In other words, the output picture / second preset image frame can be divided into three equal areas, upper, middle and lower, and each speaker's eyes or nose are kept in the upper area.
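The two-speaker frame geometry might be sketched as follows. The margin value is an invented padding, and the inputs simplify the rule: the outer edges of the leftmost and rightmost portraits define the span, with the mean face-center height placed on the upper-third line.

```python
def two_speaker_frame(x_left, x_right, face_cy, aspect=16 / 9, margin_px=40):
    """Compute an (left, top, width, height) crop frame spanning two
    speakers: x_left/x_right are the outer edges of the leftmost and
    rightmost portraits, face_cy the mean face-center height. The span
    plus margins sets the width; height follows the 16:9 ratio."""
    frame_w = (x_right - x_left) + 2 * margin_px
    frame_h = frame_w / aspect
    left = x_left - margin_px
    top = face_cy - frame_h / 3  # keep the eyes on the upper-third line
    return (left, top, frame_w, frame_h)
```

Participants seated between the two speakers fall inside the span and therefore also appear in the output picture, as the text requires.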
Through this embodiment of the application, the two speakers can be completely displayed on the left and right of the output picture, together with the conference participants between them, which better improves the tracking display effect when multiple people speak and improves the display quality and communication efficiency of the remote video conference.
Based on the above embodiments, in an embodiment, after the step S20, the method further includes:
step d, if the voice-free angle information of the speaking person is detected, zooming and focusing all the human images of all the conference persons according to a third preset image frame selection rule so as to enable the all the human images to be centrally displayed in the corresponding output picture;
wherein the third preset image frame selection rule includes:
and for a third preset image frame corresponding to the third preset image frame selection rule, the size of the third preset image frame is consistent with the size of the view finding frame of the panoramic view angle.
A preset time limit (for example 5 s) can be set according to actual needs: if no voice angle information of a speaker is detected within it, i.e. the original speaker has stopped speaking and no one else is speaking, the picture gradually (over about 5 s) zooms out from the focused state until all faces enter the frame, i.e. to the preset optimal panoramic viewing angle with the whole group centered.
"The size of the third preset image frame being consistent with the size of the viewfinder of the panoramic viewing angle" refers to the main camera panorama. The position of the third preset image frame is determined based on the portraits of all persons, specifically so that the group is in the middle of the output picture / third preset image frame. By zooming and focusing the portraits of all conference persons according to the third preset image frame selection rule, the whole group is displayed centrally in the corresponding output picture.
If a new human voice appears during the process of zooming out to frame the whole group, movement tracking toward the sound source can be carried out once the 2 s duration requirement is met, entering the focusing or tracking switching process.
Specifically, the focusing switching process can be divided into two stages:
1. first, the image is enlarged and aligned with the new speaker's target;
2. then the zoom and focus are fine-tuned via the pan-tilt head.
If the sound disappears during the first stage, the camera returns to the optimal panoramic viewing angle after the first-stage action is finished; if the sound disappears during the second stage, the alignment is completed first and then the camera returns to the optimal panoramic viewing angle.
Likewise, after the step S30, the method further includes:
if the voice-free angle information of the speaking person is detected, zooming and focusing the whole human images of the whole conference person according to a third preset image frame selection rule so as to enable the whole human images to be centrally displayed in the corresponding output picture;
wherein the third preset image frame selection rule includes:
and for a third preset image frame corresponding to the third preset image frame selection rule, the size of the third preset image frame is consistent with the size of the view finding frame of the panoramic view angle.
Please refer to the above embodiments, and the description thereof is omitted herein.
In an embodiment, after the step S20 and/or the step S30, the method further includes:
step e, detecting the position change information of the target portrait of the speaking person in the output picture;
and f, if the position change information accords with a re-tracking rule, re-tracking the speaking person so as to enable the target portrait to be displayed in the middle or left and right in a corresponding output picture.
The re-tracking rule here is: the output picture is divided into three equal parts from left to right (or the pixel range of the display screen is so divided), called the first, second and third equal areas from left to right. When the center point of the speaker's face moves out of the equal area it occupied before the position change, for example out of the first equal area, the speaker is re-tracked. Re-tracking may include main-camera movement driven by the pan-tilt head, rescaling, focusing, and so on; its purpose is to ensure that when the speaker has moved too far, the target portrait is still displayed in the middle of the output picture (for one speaker) or on the left and right (for two speakers).
Conversely, if the position change information does not meet the re-tracking rule, i.e. the center point of the speaker's face has not left the equal area it occupied before the change, no processing is needed. This avoids the display interference and discomfort caused by continuously tracking a speaker who only moves within a small range, and ensures the stability of the video conference picture.
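The thirds-based re-tracking test can be sketched as follows; the region indexing (0, 1, 2 from left to right) and function name are illustrative assumptions.

```python
def needs_retrack(face_cx, frame_w, home_third):
    """Split the output picture into three equal horizontal regions
    (indexed 0..2 left to right) and re-track only when the face
    center leaves its original region, ignoring small movements."""
    third = min(int(face_cx // (frame_w / 3)), 2)  # clamp the right edge
    return third != home_third
```

For a 1920 px wide picture, a face that drifts from x=500 to x=700 crosses from the first region into the second and triggers re-tracking; a drift within one region does not.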
In an embodiment, in conjunction with the step S20 and the step S30, the method further includes:
and outputting the sub-camera panoramic image, acquired in real time, at the middle-lower part of the output picture as a picture-in-picture.
The picture-in-picture may be configured to be resident, or to appear only when the sound source changes and disappear after the movement tracking completes.
Referring to fig. 6, the resolution of the output picture corresponding to the main camera may be 3840×2160, and the resolution of the picture-in-picture corresponding to the sub camera may be 1280×520, with 20 px rounded corners and a 20 px distance from the lower edge of the output picture (main picture). The output picture resolution may also be 2560×1440, 1920×1080, 1280×720 or 640×360.
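The picture-in-picture placement from fig. 6 reduces to simple arithmetic; the defaults below are the example values just given, and the function name is an invented label.

```python
def pip_rect(main_w=3840, main_h=2160, pip_w=1280, pip_h=520, gap_px=20):
    """Place the sub-camera picture-in-picture at the middle-lower part
    of the main picture: centered horizontally, gap_px above the lower
    edge. Returns (x, y, width, height); corner rounding is a separate
    rendering concern."""
    x = (main_w - pip_w) // 2
    y = main_h - pip_h - gap_px
    return (x, y, pip_w, pip_h)
```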
By the embodiment, not only can the main picture be close-up to the speaker, but also the video pictures of other participants can be synchronously displayed, so that the effect of the remote video conference is further improved.
Additionally, the foregoing embodiments of the present application are supplemented with the following rules for the tracking process:
1. echo transmitted through the loudspeaker from the far end of the conference during the conference needs to be eliminated;
2. during sound source localization, a face cut in half is generally not allowed to appear; the integrity of the face must be guaranteed by zooming out or by sacrificing centering;
3. if the sound source appears at the edge, centering cannot acceptably place the face at the edge; in this case the cropping-ratio frame (image frame) should also frame in the other participant closest to the speaker, so that the picture looks more coordinated and comfortable. If the sound source's face is at the very edge, that edge can be used directly as the cropping edge;
4. when the participants are far from the main camera and the required magnification exceeds a preset magnification (for example 6×), the main camera operates at the preset magnification, the image frame is limited accordingly, and the face is placed at the 1/3 position as far as possible.
In order to further understand some processes of the technical solution of the present application, please refer to fig. 7, fig. 7 is a schematic flowchart of an application example related to a method for tracking and displaying a conference person of the present application.
In a practical scheme application process:
After the video conference terminal is started, judging whether user layout setting is stored;
if yes, enter the mode last selected by the user: sound source localization mode (entering panorama) or manual mode;
if not, starting a sound source positioning mode to enter panorama;
entering the sound source localization mode (panorama) can also be triggered by remote control;
for the flow on the left side of the figure:
after the sound source localization mode enters the panorama, face recognition (mainly recognizing the position of the face) is performed once every 2 seconds;
identifying the face by using the thumbnail;
the face coordinate information is collected into a "set", which is refreshed; the positions of the participants are recorded in the set;
judging leftmost and rightmost faces in the 'set';
cutting the leftmost face and the rightmost face;
for the flow on the right side of the figure:
after the sound source localization mode enters the panorama, audio works in real time to output an angle stream (sound localization);
judging whether an angle stream, or a change in the angle stream, exists (i.e. whether there is a speaker or a change of speaker);
matching faces on the angle stream (i.e. combining sound localization with image localization);
the matched face coordinate information is collected into a "set" containing at most 2 people, i.e. no matter how many people speak, at most two speakers are processed;
coordinate information of anyone who has not spoken for 10 s is removed from the "set", i.e. the camera returns to the optimal panoramic viewing angle after 10 s of silence.
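The "set" maintenance in the flow above can be sketched as a pruning step. The record shape (dict with an `id` and a `last_spoke_s` timestamp) is an assumption; the 10 s timeout and the 2-person cap are the values from the flow.

```python
def prune_speaker_set(speakers, now_s, silence_timeout_s=10.0, max_size=2):
    """Drop speakers silent for silence_timeout_s (after which the camera
    returns to the optimal panoramic view) and keep at most the two most
    recent speakers, matching the flow's two-person maximum."""
    alive = [s for s in speakers
             if now_s - s["last_spoke_s"] < silence_timeout_s]
    alive.sort(key=lambda s: s["last_spoke_s"], reverse=True)
    return alive[:max_size]
```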
Further, referring to fig. 8, fig. 8 is a schematic diagram of a frame structure of the conference person tracking display device of the present application. The application also provides a meeting personnel tracking display device, meeting personnel tracking display device includes:
the personnel positioning module A10 is used for determining target positioning information of speaking personnel in the process of the video conference;
the camera tracking module a20 is configured to zoom and focus a target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information if the speaking person is a person, so that the target portrait is centrally displayed in a corresponding output picture; and if the speaking person is a plurality of persons, scaling and focusing the target person image of the speaking person according to a second preset image frame selection rule based on the target positioning information so as to enable the target person image to be displayed left and right in a corresponding output picture.
Optionally, the personnel positioning module a10 is further configured to:
detecting whether human voice exists in the video conference process;
if the voice exists, judging whether the voice continuously reaches a preset duration threshold;
If the preset duration threshold is reached, determining that the person corresponding to the voice is speaking person and target positioning information of the speaking person.
Optionally, the personnel positioning module a10 is further configured to:
acquiring a panoramic image in a first visual range, and determining positioning information of each person in the panoramic image;
acquiring voice angle information of a speaking person;
and determining target positioning information of the speaking person according to the voice angle information and the positioning information.
Optionally, the personnel positioning module a10 is further configured to:
determining the positioning information closest to the voice angle information in position in the positioning information;
judging whether the absolute value of the angle difference between the closest positioning information and the voice angle information is smaller than or equal to a preset angle threshold value;
and if the absolute value of the angle difference is smaller than or equal to a preset angle threshold value, determining that the closest positioning information is the target positioning information of the speaking person.
Optionally, the camera tracking module a20 is further configured to:
scaling and focusing a target portrait of the speaking person according to a first preset image frame selection rule so as to enable the target portrait to be centrally displayed in a corresponding output picture, and enabling the face center of the target portrait to be in a preset area in the output picture;
Wherein, the first preset image frame selection rule includes:
for a first preset image frame corresponding to the first preset image frame selection rule, determining the length of the first preset image frame based on the human body width of the speaking person, and determining the width of the first preset image frame based on the ratio between the first preset image frame and the length.
Optionally, the camera tracking module a20 is further configured to:
determining target speaking staff which speak latest among the speaking staff, wherein the target speaking staff are two people;
scaling and focusing the target portrait of the target speaking person according to a second preset image frame selection rule so as to enable the target portrait to be displayed left and right in a corresponding output picture;
wherein the second preset image frame selection rule includes:
and for a second preset image frame corresponding to the second preset image frame selection rule, determining the length of the second preset image frame based on the human body distance of the target speaking person, and determining the width of the second preset image frame based on the ratio between the second preset image frame and the length.
Optionally, the camera tracking module a20 is further configured to:
if the voice-free angle information of the speaking person is detected, zooming and focusing the whole human images of the whole conference person according to a third preset image frame selection rule so as to enable the whole human images to be centrally displayed in the corresponding output picture;
Wherein the third preset image frame selection rule includes:
and for a third preset image frame corresponding to the third preset image frame selection rule, the size of the third preset image frame is consistent with the size of the view finding frame of the panoramic view angle.
The specific implementation of the conference personnel tracking display device is basically the same as the embodiments of the conference personnel tracking display method, and is not repeated here.
Furthermore, the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a conference person tracking display program, wherein the conference person tracking display program, when executed by a processor, implements the steps of the conference person tracking display method described above.
For the method implemented when the conference person tracking display program is executed, reference may be made to the embodiments of the conference person tracking display method, which are not described again here.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the claims, and all equivalent structural changes made in the present application and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the present application.

Claims (10)

1. A conference person tracking display method, characterized by comprising the following steps:
in the video conference process, determining target positioning information of a speaking person;
if the speaking person is one person, scaling and focusing a target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information, so that the target portrait is centered in a corresponding output picture;
and if the speaking persons are a plurality of persons, scaling and focusing target portraits of the speaking persons according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side (left and right) in a corresponding output picture.
2. The conference person tracking display method according to claim 1, wherein the step of determining target positioning information of a speaking person during the video conference comprises:
detecting whether human voice exists in the video conference process;
if the voice exists, determining whether the voice persists for a preset duration threshold;
and if the preset duration threshold is reached, determining that the person corresponding to the voice is a speaking person, and determining target positioning information of the speaking person.
3. The conference person tracking display method according to claim 1, wherein the step of determining target positioning information of a speaking person comprises:
acquiring a panoramic image in a first visual range, and determining positioning information of each person in the panoramic image;
acquiring voice angle information of a speaking person;
and determining target positioning information of the speaking person according to the voice angle information and the positioning information.
4. The conference person tracking display method according to claim 3, wherein the step of determining target positioning information of the speaking person according to the voice angle information and the positioning information comprises:
determining, among the positioning information, the positioning information whose position is closest to the voice angle information;
determining whether the absolute value of the angle difference between the closest positioning information and the voice angle information is less than or equal to a preset angle threshold;
and if the absolute value of the angle difference is less than or equal to the preset angle threshold, determining that the closest positioning information is the target positioning information of the speaking person.
5. The conference person tracking display method according to claim 1, wherein the step of scaling and focusing a target portrait of the speaking person according to a first preset image frame selection rule so that the target portrait is centered in a corresponding output picture comprises:
scaling and focusing the target portrait of the speaking person according to the first preset image frame selection rule so that the target portrait is centered in the corresponding output picture and the center of the face of the target portrait falls within a preset area of the output picture;
wherein the first preset image frame selection rule includes:
for a first preset image frame corresponding to the first preset image frame selection rule, determining the length of the first preset image frame based on the body width of the speaking person, and determining the width of the first preset image frame based on the length and a preset aspect ratio of the first preset image frame.
6. The conference person tracking display method according to claim 1, wherein the step of scaling and focusing target portraits of the speaking persons according to a second preset image frame selection rule so that the target portraits are displayed side by side (left and right) in a corresponding output picture comprises:
determining, among the speaking persons, the two target speaking persons who spoke most recently;
scaling and focusing the target portraits of the target speaking persons according to the second preset image frame selection rule, so that the target portraits are displayed side by side (left and right) in the corresponding output picture;
wherein the second preset image frame selection rule includes:
for a second preset image frame corresponding to the second preset image frame selection rule, determining the length of the second preset image frame based on the body-to-body distance between the target speaking persons, and determining the width of the second preset image frame based on the length and a preset aspect ratio of the second preset image frame.
7. The conference person tracking display method according to claim 1, wherein after the step of centering the target portrait in the corresponding output picture, the method further comprises:
if no voice angle information of any speaking person is detected, scaling and focusing the overall portraits of all conference persons according to a third preset image frame selection rule, so that the overall portraits are centered in a corresponding output picture;
wherein the third preset image frame selection rule includes:
for a third preset image frame corresponding to the third preset image frame selection rule, the size of the third preset image frame is consistent with the size of the viewfinder frame at the panoramic viewing angle.
8. A conference person tracking display device, the conference person tracking display device comprising:
the personnel positioning module is used for determining target positioning information of speaking personnel in the video conference process;
the camera tracking module is used for: if the speaking person is one person, scaling and focusing a target portrait of the speaking person according to a first preset image frame selection rule based on the target positioning information, so that the target portrait is centered in a corresponding output picture; and if the speaking persons are a plurality of persons, scaling and focusing target portraits of the speaking persons according to a second preset image frame selection rule based on the target positioning information, so that the target portraits are displayed side by side (left and right) in a corresponding output picture.
9. A conference person tracking display device comprising a microphone array, a camera module, a processor, a memory, and a conference person tracking display program stored on the memory that is executable by the processor, wherein the conference person tracking display program, when executed by the processor, implements the steps of the conference person tracking display method of any of claims 1 to 7.
10. A computer-readable storage medium, wherein a conference person tracking display program is stored on the computer-readable storage medium, wherein the conference person tracking display program, when executed by a processor, implements the steps of the conference person tracking display method according to any one of claims 1 to 7.
CN202310740618.0A 2023-06-21 2023-06-21 Conference personnel tracking display method, device, equipment and readable storage medium Pending CN117319594A (en)

Publications (1)

Publication Number: CN117319594A, Publication Date: 2023-12-29

Family ID: 89254185



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination