CN116866509B - Conference scene picture tracking method, device and storage medium - Google Patents

Conference scene picture tracking method, device and storage medium

Info

Publication number: CN116866509B
Authority: CN (China)
Prior art keywords: picture, semantic, sound, information, conference
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed): Active
Application number: CN202310839845.9A
Other languages: Chinese (zh)
Other versions: CN116866509A (en)
Inventors: 杜剑文, 李辉权
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list): Shenzhen Chuangzai Technology Co., Ltd.
Original Assignee: Shenzhen Chuangzai Network Technology Co., Ltd.
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by Shenzhen Chuangzai Network Technology Co., Ltd.; priority to CN202310839845.9A
Publication of CN116866509A (application); application granted; publication of CN116866509B (grant)
Legal status: Active

Classifications

    • H: ELECTRICITY
        • H04: ELECTRIC COMMUNICATION TECHNIQUE
            • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N7/00: Television systems
                    • H04N7/14: Systems for two-way working
                        • H04N7/15: Conference systems
                • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
                    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; operations thereof
                        • H04N21/21: Server components or server architectures
                            • H04N21/218: Source of audio or video content, e.g. local disk arrays
                                • H04N21/2187: Live feed
                    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
                        • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
                            • H04N21/4302: Content synchronisation processes, e.g. decoder synchronisation
                                • H04N21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
                            • H04N21/439: Processing of audio elementary streams
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
                • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a conference site picture tracking method, a computer device, and a storage medium. The method includes performing semantic recognition on the sound signals collected at each sound acquisition point to obtain semantic recognition information, comparing reference semantic information with each piece of semantic recognition information, determining a number of picture acquisition target points from the comparison result, and performing picture tracking on the picture acquisition target points. The invention can identify the position of a participant with a speaking tendency, i.e., the picture acquisition target point, and acquire that participant's picture data in advance, so that when the participant actually speaks, the pre-formed picture data can be called up quickly. This shortens the time from the start of a new speaker's speech to the display of the new speaker's picture, brings the new speaker's picture and sound close to synchronization, helps improve video conference quality and efficiency, and suits larger-scale conferences. The invention is widely applicable in the field of multimedia technology.

Description

Conference scene picture tracking method, device and storage medium
Technical Field
The invention relates to the field of multimedia technology, and in particular to a conference site picture tracking method, a computer device, and a storage medium.
Background
When a video conference is held, whether sound and picture can be kept synchronized greatly affects conference quality. If sound-picture synchronization is achieved, communication comes close to a face-to-face meeting; if synchronization is poor, for example when participants hear the speaker's voice but cannot see the speaker's video, they find it hard to concentrate, miss conference information, and communication quality drops. In a multi-person video conference in particular, there are many potential speakers while the number of cameras and display areas is limited, so the cameras can only shoot individual speakers for display. This creates the conference site picture tracking problem: when the speaker at the conference site changes, the picture of the previous speaker must be switched to the picture of the new speaker in time.
In a multi-person video conference, the identity of the speaker is generally not fixed; in free-discussion segments, any participant may become the speaker at random, which makes picture tracking of the conference site challenging. The prior art generally relies on manual tracking: a camera operator, conference host, or other staff at the site notices the change of speaker and manually switches the display to the new speaker. Because manual judgment and operation take time, picture tracking often lags far behind the change of speaker, so a new speaker may have been talking for quite a while before the switch completes, reducing the quality of information exchange at the conference.
Some related technologies detect a new speaker by measuring the sound of the conference site, e.g., by sound localization (determining the distance between a sound source and a microphone from the detected signal intensity and combining the distances to several microphones to locate the source), and then switch the picture automatically. Although these technologies speed up picture tracking to some extent, sound localization can only begin after the new speaker has started speaking and itself takes time, so a large picture-switching delay remains. Moreover, localization accuracy is limited by factors such as room size, the number of participants, and site noise, so such technologies are generally confined to small conference rooms with few participants.
Disclosure of Invention
In view of the technical problems of the existing conference scene picture tracking technology, such as slow picture tracking and heavily restricted application scenarios, the invention aims to provide a conference scene picture tracking method, a computer device, and a storage medium.
In one aspect, an embodiment of the present invention includes a conference scene picture tracking method, including:
acquiring sound signals collected by each sound acquisition point, the sound acquisition points being distributed at various positions of the conference site;
carrying out semantic recognition on each sound signal to obtain semantic recognition information corresponding to each sound signal;
acquiring reference semantic information;
comparing the reference semantic information with each piece of semantic recognition information;
determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each piece of semantic recognition information;
and carrying out picture tracking on the picture acquisition target points.
Further, the acquiring of the reference semantic information includes:
acquiring topic information of the conference conducted at the conference site as the reference semantic information;
or
carrying out semantic prediction according to each piece of semantic recognition information to obtain semantic prediction information;
and taking the semantic prediction information as the reference semantic information.
Further, the determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each piece of semantic recognition information includes:
respectively determining a first correlation between the reference semantic information and each piece of semantic recognition information;
determining a plurality of first sound acquisition target points, each first sound acquisition target point being a sound acquisition point whose corresponding first correlation is among the highest of all the sound acquisition points;
detecting a plurality of first sound sources, each first sound source being the sound source corresponding to the sound signal collected at a first sound acquisition target point;
and determining the picture acquisition target points according to the positions of the first sound sources.
Further, the acquiring of the reference semantic information includes:
clustering the sound acquisition points according to their corresponding semantic recognition information and/or positions;
and determining, according to the clustering result, the category to which each sound acquisition point belongs and the reference semantic information corresponding to each category.
Further, the determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each piece of semantic recognition information includes:
acquiring the most recently generated semantic recognition information among all the semantic recognition information;
respectively determining a second correlation between each piece of reference semantic information and the most recently generated semantic recognition information;
determining a plurality of second sound acquisition target points, each second sound acquisition target point being a sound acquisition point whose corresponding second correlation is among the highest of all the sound acquisition points;
detecting a plurality of second sound sources, each second sound source being the sound source corresponding to the sound signal collected at a second sound acquisition target point;
and determining the picture acquisition target points according to the positions of the second sound sources.
Further, the carrying out picture tracking on the picture acquisition target points includes:
controlling the picture shooting direction according to the position of each picture acquisition target point;
shooting each picture acquisition target point to obtain a plurality of corresponding pieces of picture data;
and mapping each piece of picture data to a candidate picture queue.
Further, the carrying out picture tracking on the picture acquisition target points further includes:
acquiring acoustic parameters of each sound signal;
treating all the acoustic parameters as a whole and a single acoustic parameter as an individual, determining a first deviation value corresponding to each acoustic parameter;
treating all the semantic recognition information as a whole and a single piece of semantic recognition information as an individual, determining a second deviation value corresponding to each piece of semantic recognition information;
determining a priority value corresponding to each sound signal according to the first deviation value and the second deviation value corresponding to that sound signal, the priority value being positively correlated with the first deviation value and negatively correlated with the second deviation value;
and sorting the picture data in the candidate picture queue according to the priority values, the priority value corresponding to any picture data being positively correlated with the degree to which that picture data is preferentially fetched from the candidate picture queue.
Further, the carrying out picture tracking on the picture acquisition target points further includes:
acquiring acoustic parameters of each sound signal;
tracking the variation of each acoustic parameter;
and when the variation of an acoustic parameter is detected to reach a threshold, reading the picture data corresponding to that acoustic parameter from the candidate picture queue and displaying it.
In another aspect, an embodiment of the present invention further includes a computer apparatus, including a memory for storing at least one program and a processor for loading the at least one program to perform the conference scene picture tracking method of the embodiments.
In another aspect, an embodiment of the present invention further includes a storage medium in which a processor-executable program is stored, the program, when executed by a processor, performing the conference scene picture tracking method of the embodiments.
The beneficial effects of the invention are as follows: with the conference scene picture tracking method of the embodiments, the semantic recognition information represents the conference from a microscopic angle, while the reference semantic information represents it from a macroscopic angle. Comparing the two reveals the difference between the microscopic-angle and macroscopic-angle semantic information of the conference, from which the position of a participant with a speaking tendency, i.e., a picture acquisition target point, can be identified. Performing picture tracking on the picture acquisition target points acquires the picture data of participants with a speaking tendency in advance, so that when such a participant actually speaks, the pre-formed picture data can be called up quickly. This shortens the time from the start of a new speaker's speech to the display of the new speaker's picture, brings the new speaker's picture and sound close to synchronization, helps improve video conference quality and efficiency, readily achieves good anti-interference capability, and can adapt to larger-scale conferences.
Drawings
FIG. 1 is a schematic diagram of a system in which the conference scene picture tracking method of an embodiment may be applied;
FIG. 2 is a flowchart of the conference scene picture tracking method of an embodiment;
FIG. 3 is a schematic diagram of a first implementation of the step of determining a plurality of picture acquisition target points in an embodiment;
FIG. 4 is a schematic diagram of a second implementation of the step of determining a plurality of picture acquisition target points in an embodiment.
Detailed Description
The conference site picture tracking method in this embodiment may be implemented by the system shown in fig. 1. Referring to fig. 1, tables and chairs for participants are provided at the conference site, along with a plurality of sound acquisition points. Specifically, one sound acquisition point may be provided at each seat, or the sound acquisition points may be arranged uniformly across the site in an array as shown in fig. 1, where each dotted-line crossing point is one sound acquisition point. In this embodiment, "sound acquisition point" may refer either to a position where a sound acquisition device such as a microphone is placed or to the device placed at that position; unless otherwise specified, the two senses need not be distinguished.
In this embodiment, the microphones at the sound acquisition points shown in fig. 1 may be dedicated to the conference site picture tracking method, i.e., used only when the steps of the method are performed, while the participants speak into separate microphones. In that case the microphones at the sound acquisition points may be omnidirectional, picking up sound from every direction with the same sensitivity. Alternatively, the microphones at the sound acquisition points may serve both purposes: used when performing the steps of the method, and also used by the participants when speaking. In that case the microphones may be unidirectional, picking up sound at normal sensitivity only from a specific direction, e.g., the direction of the corresponding participant, and at lower sensitivity from other directions.
Referring to fig. 1, the conference site is further provided with a plurality of cameras, each with its own field of view. Specifically, each camera can change its shooting field of view by means of rocker control, track movement, or zooming, so that one camera can select different participants to shoot.
After the sound acquisition points and cameras are arranged as shown in fig. 1, a host computer is provided and connected to each microphone and each camera through data lines or through wireless communication protocols such as WiFi. Once connected, the host computer can issue control instructions to a specific microphone or camera, calling it to perform the corresponding operation; each microphone can upload its collected sound signals to the host computer for processing, and each camera can upload its captured picture data. Accordingly, each step of the conference scene picture tracking method can be performed by the host computer.
In this embodiment, referring to fig. 2, the conference site picture tracking method includes the steps of:
S1, acquiring sound signals collected by each sound acquisition point;
S2, carrying out semantic recognition on each sound signal to obtain semantic recognition information corresponding to each sound signal;
S3, acquiring reference semantic information;
S4, comparing the reference semantic information with each piece of semantic recognition information;
S5, determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each piece of semantic recognition information;
S6, carrying out picture tracking on the picture acquisition target points.
In step S1, the host computer may number each microphone and distinguish different microphones by their numbers, e.g., as mike_1, mike_2, ..., mike_N, where N is the total number of microphones. Microphone mike_1 converts the sound it collects into a data form readable by the host computer, yielding sound signal audio_1; microphone mike_2 collects sound signal audio_2; ...; microphone mike_N collects sound signal audio_N. By executing step S1, the host computer obtains audio_1, audio_2, ..., audio_N.
Since this embodiment uses the effective information in the sound signal, such as human voice, when the sound signal obtained in step S1 contains background sound or noise, the host computer or the microphone may preprocess the collected signal with noise reduction and filtering; in subsequent steps, unless otherwise specified, the processed sound signal may be taken to contain only effective information such as human voice. Furthermore, the voiceprint features of specific persons can be enrolled and the sound signals filtered against them, so that the signals contain only the voice information of those persons, reducing the interference of irrelevant information with the processing in each step.
In step S2, the host computer runs a semantic recognition program Speech_Recognizer(); specifically, Speech_Recognizer() may be a wav2vec algorithm or the like. Performing semantic recognition on sound signal audio_1 yields the corresponding semantic recognition information semantic_1 = Speech_Recognizer(audio_1); on audio_2, semantic_2 = Speech_Recognizer(audio_2); ...; on audio_N, semantic_N = Speech_Recognizer(audio_N). By executing step S2, semantic recognition information semantic_1, semantic_2, ..., semantic_N in the form of vectors or the like is obtained.
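By way of illustration only, the following is a minimal Python sketch of step S2. The patent does not name concrete models; the wav2vec 2.0 checkpoint facebook/wav2vec2-base-960h (via the transformers library) for transcription, the all-MiniLM-L6-v2 sentence embedder for vectorizing the transcript, and the 16 kHz mono input are all assumptions of this sketch.

```python
# Sketch of step S2: per-microphone semantic recognition (assumed models, not specified by the patent).
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from sentence_transformers import SentenceTransformer

asr_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # maps a transcript to a semantic vector

def speech_recognizer(audio_i: np.ndarray, sampling_rate: int = 16000) -> np.ndarray:
    """Return semantic_i: a vector representation of the speech in audio_i."""
    inputs = asr_processor(audio_i, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    text = asr_processor.batch_decode(ids)[0]        # transcript of the sound signal
    return embedder.encode(text)                     # semantic_i as a dense vector

# semantics = [speech_recognizer(a) for a in mic_signals]   # audio_1 ... audio_N
```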
In step S3, the host computer acquires reference semantic information semantic_reference. In this embodiment, semantic_reference has the same data format and other attributes as semantic_1, semantic_2, ..., semantic_N, and it can represent, from a macroscopic angle, the semantic information of the conference site as a whole or of a part of it (a part in the spatial dimension, i.e., an area of the conference site, or a part in the temporal dimension, e.g., a certain agenda item of the conference). Step S4 can therefore compare semantic_1, semantic_2, ..., semantic_N with semantic_reference. Since semantic_1, semantic_2, ..., semantic_N are recognized from sounds collected by microphones mike_1, mike_2, ..., mike_N at specific locations and specific moments of the conference site, they represent semantic information of the conference from a microscopic angle, while semantic_reference represents semantic information from a macroscopic angle; the comparison can therefore reveal the difference between the two.
In step S5, a number of picture acquisition target points are determined according to the comparison result between semantic_reference and each of semantic_1, semantic_2, ..., semantic_N, i.e., according to the difference between the microscopic-angle and macroscopic-angle semantic information of the conference.
In step S6, the host computer may number the cameras as camera_1, camera_2, ..., camera_M, where M is the total number of cameras, and send a picture tracking instruction to each of them, so that the picture acquisition target points are tracked. Specifically, the host computer may first call up the picture data frame_1 captured by camera_1, frame_2 captured by camera_2, ..., frame_M captured by camera_M, and judge whether frame_1, frame_2, ..., frame_M cover all the picture acquisition target points of step S5. If so, it calls up the picture data containing the target points; if not, the host computer sends control instructions to some of the cameras, which adjust their shooting fields of view so that all the picture acquisition target points of step S5 are captured, realizing the tracking of the picture acquisition target points.
In this embodiment, steps S1-S6 may be performed while someone is formally speaking during the conference, or upon participants who are not speaking (e.g., who are discussing before a speech). In the latter case, the participants may be notified in advance of the existence of the sound acquisition points, so that they avoid disclosing private information toward them, or authorization for processing that may involve private information may be obtained from the participants. Alternatively, the pickup range of the sound acquisition points may be limited, e.g., so that participants are picked up only when they discuss facing the conference table, and can avoid being picked up by lowering their speaking volume or speaking sideways. Or the host computer may be configured to filter the semantic recognition information after performing steps S1-S2, so that the information processed in subsequent steps such as S3-S6 contains only conference-related information and no irrelevant information such as private information.
In this embodiment, the principle of performing steps S1-S6 is as follows. Steps S1-S6 may be performed during any period of the conference, including when no one is currently speaking. By performing steps S1-S2, semantic information representing the conference from a microscopic angle is obtained, while the reference semantic information represents the conference from a macroscopic angle. The macroscopic-angle semantic information reflects the overall progress of the conference agenda (e.g., the speeches of all participants who have already spoken), whereas the microscopic-angle semantic information represents individual events on the agenda (the utterances of individual participants). Comparing the two reveals the difference between them, and this difference is in practice produced by participants who are about to speak but have not yet spoken (who have a speaking tendency). The position of such a participant, i.e., the picture acquisition target point, can therefore be identified, and picture tracking of that point acquires the participant's picture data in advance. When the participant actually speaks, the pre-formed picture data can be called up quickly; processes such as camera view adjustment, camera parameter reconfiguration, and picture switching are completed before the new speaker starts. This shortens the time from the start of the new speaker's speech to the display of the new speaker's picture, brings picture and sound close to synchronization, and helps improve video conference quality and efficiency. Compared with related technologies such as sound localization, the accuracy of the speech semantic recognition used in steps S1-S6 is less affected by factors such as the size of the conference site, the number of participants, and site noise, so steps S1-S6 readily achieve better anti-interference capability and can adapt to larger-scale conferences.
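To make the flow of steps S1-S6 concrete, the sketch below strings them together as one host-computer cycle. Every name in it (mic.read, cam.point_at, cam.capture, and the helper callables) is a hypothetical placeholder for the operations described above, not an interface defined by the patent.

```python
# Hypothetical host-computer cycle covering steps S1-S6.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def host_computer_cycle(mics, cameras, speech_recognizer, get_reference_semantic,
                        locate_sound_source, top_k=3):
    signals = [mic.read() for mic in mics]                    # S1: collect sound signals
    semantics = [speech_recognizer(sig) for sig in signals]   # S2: semantic recognition
    reference = get_reference_semantic(semantics)             # S3: reference semantics
    correlations = [cosine(s, reference) for s in semantics]  # S4: compare micro vs. macro
    targets = np.argsort(correlations)[::-1][:top_k]          # S5: likely next speakers
    candidate_queue = {}
    for cam, i in zip(cameras, targets):                      # S6: pre-acquire pictures
        cam.point_at(locate_sound_source(mics[i]))
        candidate_queue[i] = cam.capture()
    return candidate_queue
```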
In this embodiment, when performing step S3, i.e., the step of acquiring the reference semantic information, the following step may be specifically performed:
S301A, acquiring the topic information of the conference conducted at the conference site and taking the topic information as the reference semantic information.
Step S301A is a first implementation manner of step S3.
When step S301A is performed, the topic information of the conference may be input into the host computer in advance by the conference organizer; it may include information such as the conference's topic, host, organizer, sponsor, venue, participants, and introduction. When the raw topic information is small in volume, it may be used directly as the reference semantic information semantic_reference; alternatively, an algorithm such as TextRank may be applied to extract keywords from the raw topic information to serve as semantic_reference. The topic information obtained in step S301A may be converted into vector form so that it shares the same data format as semantic_1, semantic_2, ..., semantic_N.
By executing step S301A, the obtained reference semantic information semantic_reference contains the topic information of the conference, so semantic_reference can represent semantic information of the conference from a macroscopic angle.
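A possible sketch of step S301A follows, using the summa library's TextRank implementation for keyword extraction. summa is one assumed choice (the patent only names TextRank), and `embed` stands for any text-to-vector function, e.g. the sentence embedder from the step S2 sketch, so that semantic_reference shares the data format of semantic_1, semantic_2, ..., semantic_N.

```python
# Sketch of step S301A: topic keywords as reference semantics (summa is an assumed library choice).
from summa import keywords

def reference_from_topic(topic_text: str, embed):
    """Extract TextRank keywords from the conference topic text and embed them."""
    kw = keywords.keywords(topic_text, ratio=0.2)    # newline-separated keyword string
    return embed(kw.replace("\n", " "))              # semantic_reference as a vector

# semantic_reference = reference_from_topic(conference_brief, embedder.encode)
```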
In this embodiment, when performing step S3, i.e., the step of acquiring the reference semantic information, the following steps may be specifically performed:
S301B, carrying out semantic prediction according to each piece of semantic recognition information to obtain semantic prediction information;
S302B, taking the semantic prediction information as the reference semantic information.
Steps S301B-S302B are the second implementation manner of step S3.
When executing step S301B, the host computer sorts the semantic recognition information semantic_1, semantic_2, ..., semantic_N obtained in step S2 by the collection time of the corresponding sound signals (assume the sorted sequence is still denoted semantic_1, semantic_2, ..., semantic_N), and then feeds the sorted sequence as known data into a prediction algorithm such as a long short-term memory network. The trained prediction algorithm can, from semantic_1, semantic_2, ..., semantic_N, predict the semantic information that participants are likely to express in the future; in step S302B, the prediction result output by the algorithm is used as the reference semantic information semantic_reference.
The semantic recognition information semantic_1, semantic_2, ..., semantic_N expressed by all the participants who have spoken can represent semantic information of the conference from a macroscopic angle, so the reference semantic information semantic_reference predicted by steps S301B-S302B can represent macroscopic-angle semantic information of the conference as it will occur at a future moment.
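A minimal PyTorch sketch of steps S301B-S302B follows. It assumes the semantic recognition information has already been converted to fixed-dimension vectors; the vector and hidden sizes are illustrative only.

```python
# Sketch of steps S301B-S302B: predicting semantic_reference from past utterances.
import torch
import torch.nn as nn

class NextSemanticPredictor(nn.Module):
    """LSTM that maps the time-ordered semantic_1..semantic_N to a predicted next vector."""
    def __init__(self, dim: int = 384, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(seq)           # seq shape: (batch, N, dim)
        return self.head(out[:, -1])      # semantic vector expected from the next speaker

# Usage sketch: after training on (utterance prefix -> next utterance) pairs,
# semantic_reference = model(ordered_semantics.unsqueeze(0)).squeeze(0)
```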
In this embodiment, on the basis of performing step S301A or steps S301B-S302B, when performing step S5, i.e., determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each piece of semantic recognition information, the following steps may be specifically performed:
S501A, respectively determining a first correlation between the reference semantic information and each piece of semantic recognition information;
S502A, determining a number of first sound acquisition target points, each being a sound acquisition point whose corresponding first correlation is among the highest of all the sound acquisition points;
S503A, detecting a number of first sound sources, each being the sound source corresponding to the sound signal collected at a first sound acquisition target point;
S504A, determining the picture acquisition target points according to the positions of the first sound sources.
In this embodiment, by executing step S301A or steps S301B-S302B, the reference semantic information semantic_reference is determined. When semantic_reference and semantic_1, semantic_2, ..., semantic_N are all vectors, a vector similarity algorithm readily computes the first correlation corr1_1 between semantic_1 and semantic_reference, the first correlation corr1_2 between semantic_2 and semantic_reference, ..., and the first correlation corr1_N between semantic_N and semantic_reference.
In step S502A, the largest few first correlations are selected from the corr1_1, corr1_2, ..., corr1_N obtained in step S501A. For example, suppose the 3 largest first correlations are required and those detected are corr1_1, corr1_3, and corr1_4. Since corr1_1 is the first correlation between semantic_reference and semantic_1, which was recognized from the sound signal audio_1 collected by microphone mike_1, the sound acquisition point corresponding to corr1_1 is microphone mike_1, and mike_1 is therefore determined as one of the first sound acquisition target points. The other first sound acquisition target points, mike_3 and mike_4, are determined in the same way.
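A sketch of steps S501A-S502A follows, taking cosine similarity as the vector similarity algorithm; the patent does not fix a particular similarity measure, so this choice is an assumption.

```python
# Sketch of steps S501A-S502A: first correlations and first sound acquisition target points.
import numpy as np

def first_correlations(semantics, semantic_reference):
    """Cosine similarity between each semantic_i and semantic_reference (step S501A)."""
    S = np.asarray(semantics, dtype=float)             # shape (N, dim)
    r = np.asarray(semantic_reference, dtype=float)
    return S @ r / (np.linalg.norm(S, axis=1) * np.linalg.norm(r) + 1e-12)

def first_sound_target_points(semantics, semantic_reference, k=3):
    """Indices of the k microphones with the highest first correlation (step S502A)."""
    corr1 = first_correlations(semantics, semantic_reference)
    return np.argsort(corr1)[::-1][:k]    # e.g. indices 0, 2, 3 for mike_1, mike_3, mike_4
```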
In step S503A, in the situation shown in fig. 1, by placing each sound acquisition point near the conference tables and chairs and appropriately limiting its pickup range, each sound acquisition point can be made to collect only one corresponding sound source (i.e., the participant sitting at the corresponding seat). For example, referring to fig. 3, microphone mike_3 may be set so that its only sound source is the participant in row 1, column 2 of fig. 3. When mike_3 is determined to be a first sound acquisition target point, that participant is determined to be a first sound source; in step S504A, the position of the participant in row 1, column 2 of fig. 3 is determined as a picture acquisition target point, and a nearby camera is controlled to capture that participant's picture data before the participant formally speaks.
In step S503A, when a sound acquisition point may collect sounds emitted by several different sound sources, the position of the first sound source may be determined by a sound localization technique (e.g., from the sound intensities detected respectively at the first sound acquisition target point and the sound acquisition points near it), and that position is determined as the picture acquisition target point in step S504A.
In this embodiment, the principle of performing steps S501A-S504A is as follows. Performing step S301A yields the topic information of the conference as the reference semantic information semantic_reference; performing steps S301B-S302B yields semantic prediction information as semantic_reference. These two kinds of reference semantic information respectively represent the macroscopic-angle semantic information of the conference at the current moment and at a future moment. The first sound acquisition target points screened out in steps S501A-S504A are, in effect, the sound acquisition points that collected the speech content closest to semantic_reference, so the participants at the first sound sources can be expected to be the participants most likely to speak next. Taking their positions as the picture acquisition target points allows their picture data to be acquired in advance and improves the hit rate of advance picture data acquisition.
In this embodiment, when performing step S3, i.e., the step of acquiring the reference semantic information, the following steps may be specifically performed:
S301C, clustering the sound acquisition points according to their corresponding semantic recognition information and/or positions;
S302C, determining, from the clustering result, the category to which each sound acquisition point belongs and the reference semantic information corresponding to each category.
Steps S301C-S302C are a third implementation manner of step S3.
When executing step S301C, the host computer may cluster according to the semantic recognition information semantic_1, semantic_2, ..., semantic_N corresponding to the sound acquisition points, gathering points with similar semantic recognition information into the same category, while points in different categories have dissimilar semantic recognition information.
Alternatively, when executing step S301C, the host computer may cluster according to the positions of the sound acquisition points. For example, the conference site may be divided into functional areas according to the conference organization arrangement, based on the tables and chairs where the sound acquisition points are located (areas may be divided by the industry, role, rank, or personnel type of the seated participants, e.g., a senior expert area, an invited guest area, a staff area, etc.), so that each sound acquisition point falls into one functional area of the conference site; the functional area in which a sound acquisition point is located then serves as the category to which it belongs.
Whichever form the clustering takes, executing step S301C divides the sound acquisition points mike_1, mike_2, ..., mike_N into corresponding categories cluster_1, cluster_2, ..., cluster_k, where k is the total number of categories formed by clustering (or of functional areas into which the conference site is divided).
In the case where the clustering in step S301C is performed according to the semantic recognition information, in step S302C the cluster centers produced by the clustering algorithm may be used as the reference semantic information of the corresponding categories: the cluster center of cluster_1 output by the clustering algorithm serves as the reference semantic information semantic_reference_1 of cluster_1, the cluster center of cluster_2 as semantic_reference_2, ..., and the cluster center of cluster_k as semantic_reference_k.
In the case where the clustering in step S301C is performed according to the positions of the sound acquisition points (in effect, a division of the conference site into functional areas), in step S302C the reference semantic information of each category (functional area) may be determined according to the conference organization arrangement for that area. For example, if category (functional area) cluster_1 is the "senior expert area", the sound acquisition points belonging to it will collect the sound signals of senior experts, so materials such as the speech drafts and speaking order of the senior experts attending the conference can be obtained and keywords extracted from them as the reference semantic information semantic_reference_1 of cluster_1. If category (functional area) cluster_2 is the "invited guest area", the sound acquisition points belonging to it will collect the sound signals of invited guests, so their speech drafts, speaking order, and similar materials can be obtained and keywords extracted as semantic_reference_2. Likewise, if category (functional area) cluster_k is the "staff area", the speech drafts and speaking order of the staff attending the conference can be obtained and keywords extracted as semantic_reference_k.
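For the semantic variant of steps S301C-S302C, k-means is one assumed clustering algorithm (the patent does not name one): its labels give each sound acquisition point's category, and its cluster centers serve directly as the per-category reference semantic information. The position-based variant would instead cluster seat coordinates or assign functional-area labels by hand.

```python
# Sketch of steps S301C-S302C: clustering sound acquisition points by their semantics.
import numpy as np
from sklearn.cluster import KMeans

def cluster_collection_points(semantics, k):
    S = np.asarray(semantics, dtype=float)             # semantic_1..semantic_N as rows
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S)
    return km.labels_, km.cluster_centers_             # categories; semantic_reference_1..k
```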
In this embodiment, on the basis of executing steps S301C-S302C, when performing step S5, i.e., determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each piece of semantic recognition information, the following steps may be specifically performed:
S501B, acquiring the most recently generated semantic recognition information among all the semantic recognition information;
S502B, respectively determining a second correlation between each piece of reference semantic information and the most recently generated semantic recognition information;
S503B, determining a number of second sound acquisition target points, each being a sound acquisition point whose corresponding second correlation is among the highest of all the sound acquisition points;
S504B, detecting a number of second sound sources, each being the sound source corresponding to the sound signal collected at a second sound acquisition target point;
S505B, determining the picture acquisition target points according to the positions of the second sound sources.
In step S501B, assume the semantic recognition information sorted by the collection time of the corresponding sound signals is semantic_1, semantic_2, ..., semantic_N; then the most recently generated semantic recognition information is semantic_N, i.e., the information recognized from audio_N, the sound signal collected most recently (the speech of the last speaker) before steps S1-S6 are executed.
In this embodiment, the reference semantic information semantic_reference_1, semantic_reference_2, ..., semantic_reference_k is determined by executing steps S301C-S302C. In step S502B, when the reference semantic information and the most recently generated semantic recognition information semantic_N are all vectors, a vector similarity algorithm readily computes the second correlation corr2_1 between semantic_reference_1 and semantic_N, the second correlation corr2_2 between semantic_reference_2 and semantic_N, ..., and the second correlation corr2_k between semantic_reference_k and semantic_N.
In step S503B, the largest few second correlations are selected from the corr2_1, corr2_2, ..., corr2_k obtained in step S502B. For example, suppose the 3 largest second correlations are required and those detected are corr2_1, corr2_3, and corr2_4. Since corr2_1 is the second correlation between semantic_N and the reference semantic information semantic_reference_1 of category (functional area) cluster_1, each sound acquisition point belonging to cluster_1 is determined as a second sound acquisition target point. Likewise, the sound acquisition points belonging to categories (functional areas) cluster_3 and cluster_4 are determined as second sound acquisition target points.
The principle of step S504B is the same as that of step S503A: for any second sound acquisition target point, the corresponding second sound source may be determined from its unique sound source or by a sound localization technique, so that in step S505B the position of each second sound source is determined as a picture acquisition target point.
In this embodiment, the principle of performing steps S501B-S505B is as follows. Steps S301C-S302C yield reference semantic information semantic_reference_1, semantic_reference_2, ..., semantic_reference_k for the respective categories (functional areas), representing the macroscopic-angle semantic information of each small-scale part into which the conference site at the current moment is divided. The second sound acquisition target points screened out in steps S501B-S505B are, in effect, the sound acquisition points whose collected speech content is closest to the most recently recognized semantic recognition information semantic_N, so the participants at the second sound sources can be expected to be the participants most likely to speak next, and taking their positions as picture acquisition target points allows their picture data to be acquired in advance. Furthermore, referring to fig. 4, because clustering is used, steps S501B-S505B can find at once multiple second sound sources that are related in semantic recognition information and/or position to serve as picture acquisition target points, so that more pre-acquired picture data are available as candidates for picture switching, improving the hit rate of advance picture data acquisition.
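A sketch of steps S501B-S503B under the same cosine-similarity assumption: pick the categories whose reference semantics best match the most recent utterance, then take every sound acquisition point in those categories as a second sound acquisition target point.

```python
# Sketch of steps S501B-S503B: second correlations over categories.
import numpy as np

def second_sound_target_points(references, labels, latest_semantic, k=3):
    R = np.asarray(references, dtype=float)            # semantic_reference_1..k as rows
    v = np.asarray(latest_semantic, dtype=float)       # semantic_N, the latest utterance
    corr2 = R @ v / (np.linalg.norm(R, axis=1) * np.linalg.norm(v) + 1e-12)
    top_categories = set(np.argsort(corr2)[::-1][:k])  # categories with the highest corr2
    return [mic for mic, cat in enumerate(labels) if cat in top_categories]
```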
In this embodiment, when performing step S6, i.e., the step of carrying out picture tracking on the picture acquisition target points, the following steps may be specifically performed:
s601, controlling the picture shooting direction according to the position of each picture acquisition target point;
s602, shooting each picture acquisition target point to obtain a plurality of corresponding picture data;
s603, mapping each picture data to a candidate picture queue.
In step S601, after determining the positions of the picture acquisition target points, the host computer may look up the cameras that are closest to these positions and able to adjust their shooting fields of view to them, and send control instructions so that each such camera's shooting field of view is adjusted to the position of a picture acquisition target point.
In step S602, the cameras shoot the participants sitting at the positions of the picture acquisition target points; each picture acquisition target point, i.e., each participant, yields a corresponding set of picture data in the form of a video stream.
Continuing the earlier example in which the first sound acquisition target points are mike_1, mike_3, and mike_4, assume that performing step S602 yields picture data comprising frame_1, obtained by shooting the participant at the position corresponding to mike_1; frame_3, for the participant corresponding to mike_3; and frame_4, for the participant corresponding to mike_4.
In step S603, the picture data frame_1, frame_3, and frame_4 obtained in step S602 are mapped to a candidate picture queue. The host computer or another device may call any one or more of frame_1, frame_3, and frame_4 out of the candidate picture queue for display. The order of the picture data within the candidate picture queue indicates the order in which they are preferentially called out for display.
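The patent describes the candidate picture queue only functionally; one possible sketch keeps a single pre-acquired frame per target, preserves call-out order, and supports the reordering needed later in step S608.

```python
# Sketch of the candidate picture queue of step S603 (the data structure is an assumption).
from collections import OrderedDict

class CandidatePictureQueue:
    def __init__(self):
        self._frames = OrderedDict()             # mic index -> latest pre-acquired frame

    def map_frame(self, mic_index, frame):
        self._frames[mic_index] = frame          # overwriting keeps only the newest frame

    def reorder(self, priority):                 # priority: mic index -> priority value (step S608)
        self._frames = OrderedDict(
            sorted(self._frames.items(), key=lambda kv: -priority[kv[0]]))

    def fetch(self, mic_index):
        return self._frames[mic_index]           # called out for display (step S611)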
On the basis of performing steps S601-S603, the following steps may also be performed:
S604, acquiring acoustic parameters of each sound signal;
S605, treating all the acoustic parameters as a whole and a single acoustic parameter as an individual, determining the first deviation value corresponding to each acoustic parameter;
S606, treating all the semantic recognition information as a whole and a single piece of semantic recognition information as an individual, determining the second deviation value corresponding to each piece of semantic recognition information;
S607, determining the priority value corresponding to each sound signal according to its first deviation value and second deviation value, the priority value being positively correlated with the first deviation value and negatively correlated with the second deviation value;
S608, sorting the picture data in the candidate picture queue according to the priority values, the priority value corresponding to any picture data being positively correlated with the degree to which that picture data is preferentially fetched from the queue.
When performing step S604, the sound signals collected by the N microphones are audio_1, audio_2, ..., audio_N. Taking sound intensity as the acoustic parameter, step S604 acquires the sound intensity intensity(audio_1) of sound signal audio_1, the sound intensity intensity(audio_2) of audio_2, ..., and the sound intensity intensity(audio_N) of audio_N.
In step S605, the first deviation value corresponding to any acoustic parameter can be expressed as the absolute value of the difference between that acoustic parameter and the average of all acoustic parameters. For example, with the average of all acoustic parameters denoted mean_intensity = (intensity(audio_1) + intensity(audio_2) + ... + intensity(audio_N)) / N, the first deviation value corresponding to acoustic parameter intensity(audio_i) is dev1_i = |intensity(audio_i) - mean_intensity|, where i = 1, 2, ..., N.
In step S606, the second deviation value corresponding to any piece of semantic recognition information can be expressed as the magnitude of the difference between that semantic recognition information and the average of all semantic recognition information. For example, if semantic_1, semantic_2, ..., semantic_N are all vectors with average mean_semantic = (semantic_1 + semantic_2 + ... + semantic_N) / N, the second deviation value corresponding to semantic_i is dev2_i = ||semantic_i - mean_semantic||, where i = 1, 2, ..., N.
In step S607, for a sound signal audio_i whose corresponding first deviation value is dev1_i and whose corresponding second deviation value is dev2_i, the corresponding priority value priority_i can be computed by a function that is positively correlated with the first deviation value and negatively correlated with the second deviation value. For example, the ratio of the first deviation value to the second deviation value may be used, i.e., priority_i = dev1_i / dev2_i, where i = 1, 2, ..., N.
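The deviation and priority computations of steps S605-S607 reduce to a few array operations; the small epsilon guarding against division by zero is an addition of this sketch, not part of the patent's formula.

```python
# Sketch of steps S605-S607: priority_i = dev1_i / dev2_i.
import numpy as np

def priority_values(intensities, semantics):
    x = np.asarray(intensities, dtype=float)            # intensity(audio_i) for each i
    S = np.asarray(semantics, dtype=float)              # semantic_i as row vectors
    dev1 = np.abs(x - x.mean())                         # first deviation values
    dev2 = np.linalg.norm(S - S.mean(axis=0), axis=1)   # second deviation values
    return dev1 / (dev2 + 1e-12)                        # priority values, one per sound signal
```

Passing the result to CandidatePictureQueue.reorder from the earlier sketch then performs step S608.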
When performing step S608, since the earlier example assumed that the obtained picture data comprise frame_1, frame_3, and frame_4, shot at the positions corresponding to mike_1, mike_3, and mike_4, step S608 computes the priority value priority_1 of sound signal audio_1, the priority value priority_3 of audio_3, and the priority value priority_4 of audio_4.
In step S608, suppose priority_3 > priority_1 > priority_4. Since priority_1, priority_3, and priority_4 correspond respectively to frame_1, frame_3, and frame_4 in the candidate picture queue, frame_3 is called out most preferentially, frame_1 next, and frame_4 last. For example, when pictures are switched manually, frame_3 may be placed at the most preferred recommended position; when pictures are switched automatically, frame_3 is switched to preferentially once a picture-switch trigger condition is detected.
In this embodiment, the principle of performing steps S601-S608 is as follows. Taking sound signal audio_i as an example, its first deviation value dev1_i indicates how far the speech (or non-speech discussion) collected by microphone mike_i deviates from the overall atmosphere of the conference site; it reflects subjective factors such as the participant's emotion and thus actively reflects the participant's speaking tendency, so the priority value priority_i is set to be positively correlated with the first deviation value. The second deviation value dev2_i indicates how far the substantive content of the speech (or non-speech discussion) collected by mike_i deviates from the conference content; intuitively, the larger the second deviation value, the less meaningful the substantive content of audio_i is to the conference, i.e., the less valid the participant's speaking tendency is with respect to the conference, so priority_i is set to be negatively correlated with the second deviation value. Finally, adjusting how preferentially frame_i is called out of the candidate picture queue according to the computed priority_i helps, on the basis of the picture data of several different participants obtained by steps S1-S6, to improve the hit rate of accurately calling out the right participant's picture data, and under automatic picture switching helps bring the new speaker's picture and sound toward synchronization.
In this embodiment, on the basis of executing steps S601 to S603 or S601 to S608, the following steps may be further executed:
S609, acquiring acoustic parameters of each sound signal;
S610, tracking the variation of each acoustic parameter;
S611, when it is detected that the variation of an acoustic parameter reaches a threshold, reading the picture data corresponding to that acoustic parameter from the candidate picture queue and displaying it.
In step S609, taking sound intensity as the acoustic parameter, as in step S604, the sound intensity intensity(audio_1) of the sound signal audio_1, the sound intensity intensity(audio_2) of the sound signal audio_2, ..., and the sound intensity intensity(audio_N) of the sound signal audio_N may be acquired. Specifically, since the candidate picture queue in this embodiment only contains the picture data frame_1, frame_3 and frame_4, only the sound intensities corresponding to these picture data, i.e. intensity(audio_1), intensity(audio_3) and intensity(audio_4), need to be acquired when step S609 is executed.
The sound intensities intensity(audio_1), intensity(audio_3) and intensity(audio_4) obtained in step S609 are random variables that change over time, so in step S610 the host computer can continuously monitor their changes.
In step S611, when the variation of intensity(audio_1), intensity(audio_3) or intensity(audio_4) reaches a threshold (for example, when the variation of intensity(audio_3) is detected to reach the threshold), the picture data frame_3 corresponding to that acoustic parameter is read from the candidate picture queue and displayed.
In this embodiment, the principle behind steps S609 to S611 is as follows. Acoustic parameters such as sound intensity characterize a participant's working state in the conference: the same participant often produces weaker sound intensity during non-speech discussion and stronger sound intensity during a formal speech. By monitoring changes in the acoustic parameters produced by the same participant, a change in that participant's working state (typically from non-speech discussion or silence to a formal speech) can be recognized, i.e. a new speaker can be identified. At that moment the picture data corresponding to the acoustic parameter is read from the candidate picture queue and displayed, so that the picture of the new speaker is switched in promptly. This shortens the time from the start of the new speaker's speech to the display of the new speaker's picture, brings the new speaker's picture and voice closer to synchrony, and helps improve the quality and efficiency of the video conference.
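A sketch of the fluctuation check in steps S609-S611, assuming a sliding window of intensity readings and a peak-to-trough variation measure (the window length, the threshold and the simulated readings are all illustrative; the patent only requires detecting that the variation reaches a threshold):

    import collections

    class FluctuationMonitor:
        # Tracks one acoustic parameter, e.g. intensity(audio_3), over a
        # sliding window and fires once its variation reaches the threshold.
        def __init__(self, threshold: float, window: int = 20):
            self.threshold = threshold
            self.samples = collections.deque(maxlen=window)

        def update(self, value: float) -> bool:
            self.samples.append(value)
            if len(self.samples) < 2:
                return False
            return max(self.samples) - min(self.samples) >= self.threshold

    # Simulated intensity(audio_3): quiet discussion, then a formal speech begins.
    monitor = FluctuationMonitor(threshold=12.0)
    for reading in [40.1, 40.6, 39.8, 41.0, 55.3]:
        if monitor.update(reading):
            print("variation reached threshold: display frame_3")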
The conference site picture tracking method in this embodiment may be implemented by writing a computer program that executes the method and writing that program into a computer device or a storage medium; when the program is read out and run, it achieves the same technical effects as the conference site picture tracking method in this embodiment.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly or indirectly fixed or connected to that other feature. Furthermore, descriptions such as upper, lower, left and right used in this disclosure refer merely to the mutual positional relationships of the components of this disclosure in the drawings. As used in this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this embodiment have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used in this embodiment includes any combination of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could also be termed a second element, and, similarly, a second element could also be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be appreciated that embodiments of the invention may be implemented or realized by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer readable storage medium configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, in accordance with the methods and drawings described in the specific embodiments. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Furthermore, the operations of the processes described in the present embodiments may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes (or variations and/or combinations thereof) described in this embodiment may be performed under control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications), by hardware, or combinations thereof, that collectively execute on one or more processors. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of suitable computing platform, including, but not limited to, a personal computer, mini-computer, mainframe, workstation, networked or distributed computing environment, or a separate or integrated computer platform, or a platform in communication with a charged particle tool or other imaging device, and so forth. Aspects of the invention may be implemented in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, an optically read and/or written storage medium, RAM, ROM, etc., so that it is readable by a programmable computer; when the storage medium or device is read by the computer, the code can configure and operate the computer to perform the processes described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. When such media include instructions or programs that, in conjunction with a microprocessor or other data processor, implement the steps described above, the invention of this embodiment includes these and other different types of non-transitory computer-readable storage media. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
The computer program can be applied to input data to perform the functions described in this embodiment, thereby transforming the input data to generate output data that is stored in non-volatile memory. The output information may also be applied to one or more output devices such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including specific visual depictions of physical and tangible objects produced on a display.
The present invention is not limited to the above embodiments; modifications, equivalent substitutions and improvements that achieve the technical effects of the invention by the same means, without departing from the spirit and principle of the invention, all fall within its scope. Various modifications and variations of the technical solution and/or of the embodiments are possible within the scope of the invention.

Claims (8)

1. A conference site picture tracking method, characterized in that the conference site picture tracking method comprises:
acquiring the sound signals collected by each sound acquisition point; the sound acquisition points are distributed at various positions of the conference site;
carrying out semantic recognition on each sound signal to obtain semantic identification information corresponding to each sound signal;
acquiring reference semantic information;
comparing the reference semantic information with each piece of semantic identification information;
determining a plurality of picture acquisition target points according to the comparison result of the reference semantic information and each semantic identification information;
performing picture tracking on the picture acquisition target point;
the performing image tracking on the image acquisition target point includes:
controlling the picture shooting direction according to the position of each picture acquisition target point;
shooting each picture acquisition target point to obtain a plurality of corresponding picture data;
mapping each picture data to a candidate picture queue;
the image tracking of the image acquisition target point further comprises:
acquiring acoustic parameters of each sound signal;
taking all the acoustic parameters as a whole, taking a single acoustic parameter as an individual, and determining a first deviation value corresponding to each acoustic parameter;
taking all the semantic identification information as a whole, taking single semantic identification information as an individual, and determining second deviation values corresponding to the semantic identification information;
determining a priority value corresponding to the sound signal according to the first deviation value and the second deviation value corresponding to the sound signal; wherein the priority value is positively correlated with the first offset value and negatively correlated with the second offset value;
sorting the picture data in the candidate picture queue according to the priority value; wherein the priority value corresponding to any picture data is positively correlated with the degree to which that picture data is preferentially fetched from the candidate picture queue.
2. The conference site picture tracking method according to claim 1, wherein said acquiring the reference semantic information comprises:
acquiring theme information of a conference conducted on the conference site as the reference semantic information;
or alternatively
carrying out semantic prediction according to each piece of semantic identification information to obtain semantic prediction information;
and taking the semantic prediction information as the reference semantic information.
3. The conference site picture tracking method according to claim 2, wherein said determining a plurality of picture acquisition target points according to the comparison result of said reference semantic information and each of said semantic identification information comprises:
respectively determining a first correlation degree between the reference semantic information and each semantic identification information;
determining a plurality of first sound collection target points; the first sound collection target point is the sound collection point with the highest corresponding first correlation degree in all the sound collection points;
detecting a plurality of first sound sources; the first sound source is a sound source corresponding to the sound signal collected by the first sound collection target point;
and determining the picture acquisition target point according to the position of the first sound source.
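For illustration only (not part of the claim): the first correlation degree of claim 3 could, under one common reading, be computed as cosine similarity between an embedding of the reference semantic information and an embedding of each piece of semantic identification information. The similarity measure and the embedding inputs are assumptions:

    import numpy as np

    def first_correlation(reference_vec: np.ndarray, recog_vecs: np.ndarray) -> np.ndarray:
        # Cosine similarity between the reference semantic information (one
        # vector) and each sound collection point's recognition text (N rows).
        ref = reference_vec / (np.linalg.norm(reference_vec) + 1e-9)
        unit = recog_vecs / (np.linalg.norm(recog_vecs, axis=1, keepdims=True) + 1e-9)
        return unit @ ref

    scores = first_correlation(np.array([1.0, 0.0]),
                               np.array([[0.9, 0.1], [0.1, 0.9]]))
    order = np.argsort(scores)[::-1]   # points ordered by first correlation degree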
4. The conference site picture tracking method according to claim 1, wherein said acquiring the reference semantic information comprises:
clustering the sound collection points according to the semantic identification information and/or the position corresponding to the sound collection points;
and determining the category to which each sound acquisition point belongs and the reference semantic information corresponding to each category according to the clustering result.
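Again for illustration only: clustering the sound collection points by semantic recognition information and/or position, as claim 4 recites, could be sketched with k-means over per-point feature rows (the algorithm choice, the scikit-learn dependency and the sample positions are assumptions; the claim names no particular clustering algorithm):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_collection_points(features: np.ndarray, k: int) -> np.ndarray:
        # features: one row per sound collection point, e.g. a semantic
        # embedding of its recognition text, its (x, y) position, or both.
        return KMeans(n_clusters=k, n_init=10).fit_predict(features)

    # Four points at two tables; the labels group them by position.
    positions = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]])
    labels = cluster_collection_points(positions, k=2)
    # Per cluster, the reference semantic information can then be derived from
    # the members' recognition texts (e.g. a centroid of their embeddings).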
5. The conference site picture tracking method according to claim 4, wherein said determining a plurality of picture acquisition target points according to the comparison result of said reference semantic information and each of said semantic identification information comprises:
acquiring the latest generated semantic identification information in all the semantic identification information;
determining a second correlation degree between each piece of reference semantic information and the latest generated semantic identification information respectively;
determining a plurality of second sound collection target points; the second sound collection target point is the sound collection point with the highest corresponding second correlation degree in all the sound collection points;
detecting a plurality of second sound sources; the second sound source is a sound source corresponding to the sound signal collected by the second sound collection target point;
and determining the picture acquisition target point according to the position of the second sound source.
6. The conference site picture tracking method according to claim 1, wherein said picture tracking of said picture acquisition target point further comprises:
acquiring acoustic parameters of each sound signal;
tracking variations of each of the acoustic parameters;
when detecting that the fluctuation of an acoustic parameter reaches a threshold value, reading the picture data corresponding to that acoustic parameter from the candidate picture queue and displaying it.
7. A computer apparatus comprising a memory for storing at least one program and a processor for loading the at least one program to perform the conference site picture tracking method of any one of claims 1-6.
8. A computer-readable storage medium in which a processor-executable program is stored, characterized in that the processor-executable program, when executed by a processor, is for performing the conference scene picture tracking method of any one of claims 1-6.
CN202310839845.9A 2023-07-10 2023-07-10 Conference scene picture tracking method, device and storage medium Active CN116866509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310839845.9A CN116866509B (en) 2023-07-10 2023-07-10 Conference scene picture tracking method, device and storage medium

Publications (2)

Publication Number Publication Date
CN116866509A CN116866509A (en) 2023-10-10
CN116866509B (en) 2024-02-23

Family

ID=88223052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310839845.9A Active CN116866509B (en) 2023-07-10 2023-07-10 Conference scene picture tracking method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116866509B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109164414A (en) * 2018-09-07 2019-01-08 深圳市天博智科技有限公司 Localization method, device and storage medium based on microphone array
CN111343411A (en) * 2020-03-20 2020-06-26 青岛海信智慧家居系统股份有限公司 Intelligent remote video conference system
CN111651632A (en) * 2020-04-23 2020-09-11 深圳英飞拓智能技术有限公司 Method and device for outputting voice and video of speaker in video conference
CN112689092A (en) * 2020-12-23 2021-04-20 广州市迪士普音响科技有限公司 Automatic tracking conference recording and broadcasting method, system, device and storage medium
CN112911195A (en) * 2021-01-15 2021-06-04 尹晓东 Online live television teleconference intelligent management system based on cloud computing and artificial intelligence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111314821A (en) * 2018-12-12 2020-06-19 深圳市冠旭电子股份有限公司 Intelligent sound box playing method and device and intelligent sound box

Also Published As

Publication number Publication date
CN116866509A (en) 2023-10-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 518000, 1007A, 10th Floor, Science and Technology Building, No. 105 Meihua Road, Meifeng Community, Meilin Street, Futian District, Shenzhen City, Guangdong Province, China

Patentee after: Shenzhen Chuangzai Technology Co.,Ltd.

Country or region after: China

Address before: 518000 2608, floor 6, building 2, Duoli Industrial Zone, Meihua Road, Beihuan Road, Futian District, Shenzhen, Guangdong

Patentee before: Shenzhen Chuangzai Network Technology Co.,Ltd.

Country or region before: China