CN113810653A - Audio and video based method and system for talkback tracking of multi-party network conference - Google Patents
Audio and video based method and system for talkback tracking of multi-party network conference
- Publication number
- CN113810653A CN113810653A CN202111094320.4A CN202111094320A CN113810653A CN 113810653 A CN113810653 A CN 113810653A CN 202111094320 A CN202111094320 A CN 202111094320A CN 113810653 A CN113810653 A CN 113810653A
- Authority
- CN
- China
- Prior art keywords
- audio
- data
- talkback
- video data
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 44
- 238000004891 communication Methods 0.000 claims abstract description 7
- 239000003999 initiator Substances 0.000 claims abstract description 7
- 230000000694 effects Effects 0.000 claims abstract description 6
- 238000012545 processing Methods 0.000 claims abstract description 5
- 230000006399 behavior Effects 0.000 claims description 14
- 230000002452 interceptive effect Effects 0.000 claims description 12
- 238000001514 detection method Methods 0.000 claims description 11
- 230000003993 interaction Effects 0.000 claims description 10
- 238000013519 translation Methods 0.000 claims description 9
- 238000004590 computer program Methods 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 238000012549 training Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000001629 suppression Effects 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/1066—Session management
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/1066—Session management
- H04L65/1083—In-session procedures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/403—Arrangements for multi-party communication, e.g. for conferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/60—Network streaming of media packets
- H04L65/75—Media network packet handling
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
- H04N7/152—Multipoint control units therefor
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses an audio and video based method and system for talkback tracking of a multi-party network conference. After communication connections are established between the conference center of a conference initiator and the terminals of a plurality of participants, the method comprises: recording the conference center as the main speaker; acquiring the audio and video data of the main speaker and transmitting it to each terminal; and identifying the audio and video data to judge whether talkback switching indication information exists, and if so, executing talkback switching processing. The talkback switching processing includes: identifying the further audio and video data that follows the talkback switching indication information, and comparing it with the participant identity information base to obtain the identity data of the next speaker; and searching for and determining the terminal of the new speaker according to the participant identity data pre-associated with the participant terminals, and acquiring the audio and video data of the new speaker. The method and system have the effect of improving the fluency of the conference process.
Description
Technical Field
The application relates to the technical field of online conferences, and in particular to an audio and video based method and system for talkback tracking of a multi-party network conference.
Background
The development of digital technology has integrated audio and video technology into communication and, further, into daily life and work. The technology is especially widespread in enterprise collaborative office, conveniently allowing users to work from home, communicate remotely, and so on.
Patent publication No. CN112689115A discloses a control method for a multiparty conferencing system. The method comprises the following steps: the conference center and the terminals are coupled to a communication network; the conference center authenticates the terminals; the conference center designates a certain terminal as the video access terminal Cv, processes the first video data sent by the video access terminal Cv together with a plurality of audio data to generate first mixed audio data, and sends the first mixed audio data to the terminals. The mixed audio data is also converted into a conference summary text, which is stored and output.
The above provides a multiparty conference system that can speed up terminal environment configuration during the conference, speed up audio/video transcoding, and improve the fluency of audio and video in a multiparty conference, but it has the following defect:
for a large conference, besides the participants who join online, many participants are present offline at the host venue, so the conference is usually presented on a large electronic screen. However, as the speaker changes, the content shown on the large screen is switched manually, which is rigid and not smooth in switching; therefore, this application proposes a new technical scheme.
Disclosure of Invention
In order to improve the fluency of the conference process, the application provides an audio and video based method and system for talkback tracking of a multi-party network conference.
In a first aspect, the present application provides an audio and video based method for talkback tracking of a multi-party network conference, which adopts the following technical scheme:
An audio and video based method for talkback tracking of a multi-party network conference comprises establishing communication connections between the conference center of a conference initiator and the terminals of a plurality of participants, and further comprises the following steps:
recording the conference center as a main speaker;
acquiring the audio and video data of the main speaker and transmitting it to each terminal; and
identifying audio and video data, judging whether talkback switching indication information exists or not, and executing talkback switching processing if the talkback switching indication information exists;
the talkback switching process includes:
identifying the further audio and video data following the talkback switching indication information, and comparing it with the participant identity information base to obtain the identity data of the next speaker; and
searching for and determining the terminal of the new speaker according to the participant identity data pre-associated with the participant terminals, and acquiring the audio and video data of the new speaker.
Optionally, the recognizing the audio/video data includes:
separating audio and video data to obtain audio data and video data;
performing audio-to-text translation on the audio data to identify text information; and/or
carrying out image recognition on the video data, and recognizing human behavior;
the talkback switching indication information comprises pre-selected text information and/or human body behavior information.
Optionally, the method further includes:
acquiring the manuscript data of the main speaker and the chapter, section or page switching setting data of the manuscript, wherein the chapter, section or page switching setting data includes the time-consumption data of each chapter, section or page; and
calculating the total time consumption of the manuscript according to the chapter, section or page switching setting data, and generating a dynamic progress bar based on the total time consumption of the manuscript;
wherein the progress bar is sent to the conference center and/or the terminals of the participants.
Optionally, the generation manner of the chapter, section, or page switching setting data includes:
setting the time consumption for each chapter, section or page individually; or
setting the time consumption for a plurality of chapters, sections or pages; or
setting the same time consumption for every chapter, section or page in a single operation.
Optionally, the method further includes:
acquiring the chapters, sections or pages where pre-arranged interaction activities occur, as interaction nodes; and
determining a pre-generated section of the progress bar as an interactive section according to the interaction nodes;
wherein the identifying of audio and video data is performed within the interactive section.
Optionally, the method further includes:
performing silence detection on the audio data; or
the acquired audio and video data is data that has already undergone silence detection;
wherein the special encoded frames produced by silence detection are removed before the audio-to-text translation is performed.
Optionally, the method further includes:
acquiring feedback information of a participant terminal;
judging whether a hand-raising behavior exists, and if so, sending a hand-raising prompt to the conference center;
and judging whether the feedback information from the conference center indicates approval; if so, transmitting the audio and video data in the participant terminal's feedback information to the conference center.
In a second aspect, the present application provides an audio and video based talkback tracking multi-party network conference system, which adopts the following technical scheme:
An audio and video based talkback tracking multi-party network conference system comprises a memory and a processor, the memory having stored thereon a computer program which can be loaded by the processor to perform any of the methods described above.
In summary, the present application includes at least one of the following beneficial technical effects:
1. through audio and video analysis and processing of a speaker, a speaker switching instruction is recognized, the identity of the speaker is further recognized according to the speaker switching instruction, the speaker and equipment are actively tracked and switched, and the audio and video of a new speaker are transmitted to a conference center and terminals of other participants, so that seamless switching of the speaker of the network conference is realized;
2. according to the time consumption of each chapter, section or page of the manuscript, a progress bar is generated based on the total time consumption, prompting the speaker about progress and time so as to help control the pace of the PPT, lecture and the like.
Drawings
FIG. 1 is a schematic diagram of a communication architecture of the present application;
FIG. 2 is a schematic overall flow diagram of the present application;
FIG. 3 is a schematic diagram of an interface of an existing conference, as referred to in the embodiments of the present application;
fig. 4 is a schematic diagram of a progress bar of the present application.
Detailed Description
The present application is described in further detail below with reference to figures 1-4.
The embodiment of the application discloses a method for tracking a multi-party network conference by a talkback based on audio and video.
Referring to fig. 1 and 2, an audio-video based method for a lecture tracking multiparty network conference includes:
s1, establishing communication connection between the conference center of the conference initiator and the terminals of a plurality of participants; and the number of the first and second groups,
and S2, interacting the multi-party audio and video data.
The conference center comprises a control computer, a multimedia large screen and an on-site camera deployed at the conference venue; the computer is used to connect to and control the multimedia large screen and the on-site camera; the multimedia large screen is used to display the manuscript and similar materials of the on-site main speaker selected by the conference initiator, and the on-site camera is used to capture video of the venue.
The terminals of the participants include computers, mobile phones or tablets; the terminals are connected to the conference-center computer through the Internet so that users can communicate remotely via audio and video.
The interactive multiparty audio and video data comprises:
101. recording the conference center as a main speaker;
102. acquiring the audio and video data of the main speaker and transmitting it to each terminal; and
103. identifying the audio and video data, judging whether talkback switching indication information exists, and if so, executing talkback switching processing.
The audio and video data of the main speaker is captured by the audio and video acquisition device at the main speaker's location (here, the conference center).
Since the talkback switching indication information is set to include pre-selected text information and/or human body behavior information, identifying the audio and video data includes: first, separating the audio and video data to obtain audio data and video data; then, performing audio-to-text translation on the audio data to identify text information, and/or performing image recognition on the video data to recognize human body behavior.
Specifically, the text information is obtained through a commercially available speech translation platform (such as iFLYTEK), with content such as "Please let XX speak now"; that is, after the main speaker utters this sentence, it is translated by the speech translation platform and the text is recognized, and when the talkback switching instruction is recognized, the talkback switching processing is executed.
Similarly, for the human body behavior information, the main speaker may, for example, stand at the front of the multimedia large screen and raise one arm obliquely upward at about 135 degrees; the presence of the talkback switching instruction is then recognized through action recognition technology. To improve accuracy, the two modes can be used together: if the text information and the action occur within ±3 s of each other, the talkback switching indication is judged to exist.
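The matched-mode check described above can be sketched as follows. This is a minimal illustration only: the event names and the list-of-timestamps input are assumptions, and the upstream cue detection (speech-to-text keyword matching and action recognition) is assumed to have already produced the timestamped cues.

```python
# Sketch of the combined switch-indication check described above:
# a text cue and a gesture cue must occur within a ±3 s window.
# Event names ("text_cue", "gesture_cue") are illustrative assumptions.

def has_switch_indication(events, window_s=3.0):
    """events: list of (timestamp_seconds, kind) pairs, with kind in
    {"text_cue", "gesture_cue"}. Returns True when some text cue and
    some gesture cue fall within `window_s` of each other."""
    text_ts = [t for t, kind in events if kind == "text_cue"]
    gesture_ts = [t for t, kind in events if kind == "gesture_cue"]
    return any(abs(t - g) <= window_s
               for t in text_ts for g in gesture_ts)
```

For example, a text cue at t = 10.0 s and a gesture cue at t = 11.5 s are within the window, so the indication is judged to exist; cues 10 s apart are not.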
In the above stage, participants other than the conference initiator temporarily have no speaking rights, and the conference center does not present the audio and video of other participants; this avoids the embarrassing incidents that occur in some existing conference systems when a participant forgets to mute the microphone.
The talkback switching processing includes:
201. identifying the further audio and video data following the talkback switching indication information, and comparing it with the participant identity information base to obtain the identity data of the next speaker; and
202. searching for and determining the terminal of the new speaker according to the participant identity data pre-associated with the participant terminals, and acquiring the audio and video data of the new speaker.
The participant identity information base is populated before the conference, either actively by the participants themselves or by the conference initiator. During the identity entry stage, the device identification code and network address are automatically bound to the identity, for later use in matching and acquiring the audio and video data of the main speaker.
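A minimal in-memory sketch of such an identity base follows; the class, method and field names are illustrative assumptions, not part of the patent.

```python
# Minimal participant identity base, as described above: each entry
# binds a participant identity to a device identification code and a
# network address at the identity-entry stage. Names are assumptions.

class ParticipantRegistry:
    def __init__(self):
        self._by_identity = {}

    def enroll(self, identity, device_id, network_address):
        # Called before the conference, during identity entry.
        self._by_identity[identity] = {
            "device_id": device_id,
            "network_address": network_address,
        }

    def terminal_for(self, identity):
        # Look up the terminal of the next speaker by identity data.
        return self._by_identity.get(identity)

registry = ParticipantRegistry()
registry.enroll("Zhang San", device_id="DEV-001", network_address="10.0.0.12")
```

After the next speaker's identity is recognized, `terminal_for` returns the pre-associated terminal record (or `None` if the person was never enrolled), from which the new speaker's audio and video stream can be requested.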
From the above, after the method is applied, the audio and video of the main speaker are monitored during the conference. Once a switching indication is given, the subsequent audio and video data is actively identified, and the system tracks and switches to the next speaker selected by the current main speaker. The new speaker's audio and video data is then played through the conference center, achieving intelligent, seamless switching of the large-screen content; at the same time, the audio and video data transmitted to the other participants is switched, improving conference fluency and experience.
The method also comprises the following steps:
301. acquiring the manuscript data of the main speaker and the chapter, section or page switching setting data of the manuscript; and
302. calculating the total time consumption of the manuscript according to the chapter, section or page switching setting data, and generating a dynamic progress bar based on the total time consumption of the manuscript.
Wherein the chapter, section or page switching setting data includes time-consuming data of each chapter, section or page; the progress bar is sent to the conference center and/or the terminals of the participants.
Refer to fig. 4, a schematic diagram of the progress bar; its implementation can draw on common clock and timer techniques. The numbers in the figure represent the time allotted to each PPT page and the total time for the PPT or lecture document. As the progress bar advances to the right, the time consumed on each PPT page can be tracked accurately, helping the speaker adjust and control the pace and progress of the speech.
The reason for providing this feature is as follows:
Referring to fig. 3, many existing network conferences display a clock counter on the right or left of the conference interface that accumulates the duration of the conference but serves no other purpose. For users with training requirements, users with specific speech-time requirements, or certain groups of speakers in conferences, helping them control training or speech time can effectively improve the quality of the related conferences.
The generation manner of the switching setting data for a chapter, section, or page includes:
setting the time consumption for each chapter, section or page individually; or
setting the time consumption for a plurality of chapters, sections or pages; or
setting the same time consumption for every chapter, section or page in a single operation.
Specifically, for example: a PPT file (manuscript) has 25 pages. The time spent on each page can be set individually, such as 1 minute for the first page and 2 minutes for the second; time can be set only for the pages whose time needs to be controlled; or a uniform setting can be made, such as 5 minutes per PPT page.
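The 25-page example and the total-time computation that drives the dynamic progress bar can be sketched as follows; the function names and the minutes-per-page data structure are illustrative assumptions.

```python
# Sketch of the time-setting modes and total-time computation above
# (25-page PPT example). Function names are assumptions.

def uniform_durations(pages, minutes_per_page):
    # Mode 3: one setting applied uniformly to every page.
    return {page: minutes_per_page for page in range(1, pages + 1)}

def total_minutes(durations):
    # Total time consumption of the manuscript, which scales the bar.
    return sum(durations.values())

durations = uniform_durations(25, 5)   # uniform: 5 minutes per page
durations[1] = 1                       # Mode 1: per-page overrides
durations[2] = 2
total = total_minutes(durations)       # drives the dynamic progress bar

def progress_fraction(elapsed_minutes, total):
    # Position of the progress bar, clamped to 100 %.
    return min(elapsed_minutes / total, 1.0)
```

With the overrides above the total is 23 × 5 + 1 + 2 = 118 minutes, and the bar's position at any moment is simply elapsed time divided by that total.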
In this application, the above progress bar and its related settings not only help the user control the talkback progress; they are also used in coordination with the talkback switching described above, specifically including:
401. acquiring the chapters, sections or pages where pre-arranged interaction activities occur, as interaction nodes; and
402. determining a pre-generated section of the progress bar as an interactive section according to the interaction nodes.
The pre-arranged interaction activities are determined by the relevant personnel at the manuscript uploading stage; or, more conveniently, interaction marks are made in the manuscript and recognized automatically. Subsequently, the identifying of audio and video data is performed within the interactive section.
The advantage of the above arrangement is that the audio and video data does not need to be identified throughout the whole conference when it is unnecessary; the talkback switching is thus more targeted, and resource waste is reduced.
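Mapping interaction nodes onto progress-bar intervals and gating recognition on them can be sketched as follows, reusing a per-page duration table; all names and the minutes-based timeline are illustrative assumptions.

```python
# Sketch: derive interactive sections of the progress bar from the
# pages that host pre-arranged interaction activities, then check
# whether recognition should run at a given elapsed time.
# Names and the minutes-based timeline are assumptions.

def page_intervals(durations):
    """durations: {page: minutes}. Returns {page: (start, end)}
    with cumulative start/end offsets on the progress bar."""
    intervals, start = {}, 0
    for page in sorted(durations):
        end = start + durations[page]
        intervals[page] = (start, end)
        start = end
    return intervals

def in_interactive_section(elapsed, durations, interaction_pages):
    # Recognition of switch indications runs only when this is True.
    intervals = page_intervals(durations)
    return any(s <= elapsed < e
               for p, (s, e) in intervals.items()
               if p in interaction_pages)

durations = {1: 2, 2: 3, 3: 5}   # minutes per page
interaction_pages = {2}          # page 2 hosts the planned interaction
```

With these numbers, page 2 occupies minutes 2 to 5 of the bar, so audio/video identification would run only in that window.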
In view of the above, the method further comprises:
performing silence detection on the audio data; or the acquired audio and video data is data after silence detection;
subsequently, the special encoded frames produced by silence detection are removed before the audio-to-text translation is performed.
Specifically, silence detection can be integrated into the encoding module that processes the captured speech (audio). The silence detection algorithm, combined with a noise suppression algorithm, identifies whether there is currently voice input; if there is none, a special encoded frame (for example, of length 0) is output. By removing these encoded frames, more accurate audio data is obtained, reducing translation errors, translation costs and the like.
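The removal step can be sketched as a simple filter; modeling a "special" frame as an empty payload follows the length-0 encoding mentioned above, and the frame representation is otherwise an illustrative assumption.

```python
# Sketch of dropping the special frames produced by silence detection
# before audio-to-text translation. A "special" frame is modeled here
# as an empty payload (the length-0 encoding mentioned above); the
# byte-string frame representation is an illustrative assumption.

def drop_silence_frames(frames):
    """frames: list of encoder output payloads (bytes). Frames emitted
    during detected silence are empty; keep only speech frames."""
    return [f for f in frames if len(f) > 0]

encoded = [b"\x10\x22", b"", b"\x31", b"", b"\x44\x55"]
speech_only = drop_silence_frames(encoded)   # fed to translation
```

Only `speech_only` is passed to the speech translation platform, which avoids translating silence and so reduces errors and cost.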
The above mainly concerns single-line interaction; during a conference there is also a need for simultaneous multi-party interaction, so the method further includes:
501. acquiring feedback information of a participant terminal;
502. judging whether a hand-raising behavior exists, and if so, sending a hand-raising prompt to the conference center;
503. judging whether the feedback information from the conference center indicates approval; if so, transmitting the audio and video data in the participant terminal's feedback information to the conference center.
The hand-raising behavior includes hand-raise trigger information generated by human-computer interaction between the participants and their terminals; after the conference center receives it, the trigger information is displayed on the interactive interfaces of the large screen and the computer to notify the main speaker and the relevant personnel.
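Steps 501 to 503 can be sketched as a small decision flow; the message field names and the callback used to model the conference center's reply are illustrative assumptions.

```python
# Sketch of the hand-raise flow (steps 501-503): participant feedback
# is checked for a hand-raise, the conference center is prompted, and
# on approval the participant's audio/video data is forwarded.
# Message field names are illustrative assumptions.

def handle_feedback(feedback, center_approves):
    """feedback: dict received from a participant terminal.
    center_approves: callable prompt -> bool modeling the conference
    center's reply. Returns the A/V payload to forward, or None."""
    if not feedback.get("hand_raised"):
        return None                        # step 502: no hand-raise
    prompt = {"type": "hand_raise", "from": feedback["participant"]}
    if center_approves(prompt):            # step 503: center agrees
        return feedback.get("av_data")
    return None

fb = {"participant": "P7", "hand_raised": True, "av_data": b"..."}
granted = handle_feedback(fb, center_approves=lambda p: True)
```

When the center declines (or no hand was raised), nothing is forwarded; when it approves, the participant's audio and video data is passed through to the conference center.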
According to the content, the method is also suitable for simultaneous online interaction of multiple users, and can meet the requirements of more users.
The embodiment of the application also discloses an audio and video based talkback tracking multi-party network conference system.
An audio and video based talkback tracking multi-party network conference system comprises a memory and a processor, the memory having stored thereon a computer program which can be loaded by the processor to perform any of the methods described above.
The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by the above embodiments, so: all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.
Claims (8)
1. An audio and video based method for talkback tracking of a multi-party network conference, comprising establishing communication connections between the conference center of a conference initiator and the terminals of a plurality of participants, characterized by further comprising:
recording the conference center as a main speaker;
acquiring the audio and video data of the main speaker and transmitting it to each terminal; and
identifying audio and video data, judging whether talkback switching indication information exists or not, and executing talkback switching processing if the talkback switching indication information exists;
the talkback switching process includes:
identifying the further audio and video data following the talkback switching indication information, and comparing it with the participant identity information base to obtain the identity data of the next speaker; and
searching for and determining the terminal of the new speaker according to the participant identity data pre-associated with the participant terminals, and acquiring the audio and video data of the new speaker.
2. The audio and video based talkback tracking multi-party network conference method of claim 1, wherein the identifying of audio and video data comprises:
separating the audio and video data to obtain audio data and video data;
performing audio-to-text translation on the audio data to identify text information; and/or
performing image recognition on the video data to recognize human body behavior;
wherein the talkback switching indication information comprises pre-selected text information and/or human body behavior information.
3. The audio and video based talkback tracking multi-party network conference method of claim 1, further comprising:
acquiring the manuscript data of the main speaker and the chapter, section or page switching setting data of the manuscript, wherein the chapter, section or page switching setting data includes the time-consumption data of each chapter, section or page; and
calculating the total time consumption of the manuscript according to the chapter, section or page switching setting data, and generating a dynamic progress bar based on the total time consumption of the manuscript;
wherein the progress bar is sent to the conference center and/or the terminals of the participants.
4. The audio and video based talkback tracking multi-party network conference method of claim 3, wherein the chapter, section or page switching setting data is generated in a manner comprising:
setting the time consumption for each chapter, section or page individually; or
setting the time consumption for a plurality of chapters, sections or pages; or
setting the same time consumption for every chapter, section or page in a single operation.
5. The audio and video based talkback tracking multi-party network conference method of claim 4, further comprising:
acquiring the chapters, sections or pages where pre-arranged interaction activities occur, as interaction nodes; and
determining a pre-generated section of the progress bar as an interactive section according to the interaction nodes;
wherein the identifying of audio and video data is performed within the interactive section.
6. The audio and video based talkback tracking multi-party network conference method of claim 5, further comprising:
performing silence detection on the audio data; or
the acquired audio and video data is data that has already undergone silence detection;
wherein the special encoded frames produced by silence detection are removed before the audio-to-text translation is performed.
7. The audio and video based talkback tracking multi-party network conference method of claim 1, further comprising:
acquiring feedback information of a participant terminal;
judging whether a hand-raising behavior exists, and if so, sending a hand-raising prompt to the conference center;
and judging whether the feedback information from the conference center indicates approval; if so, transmitting the audio and video data in the participant terminal's feedback information to the conference center.
8. An audio and video based talkback tracking multi-party network conference system, comprising a memory and a processor, the memory having stored thereon a computer program which can be loaded by the processor to carry out the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111094320.4A CN113810653A (en) | 2021-09-17 | 2021-09-17 | Audio and video based method and system for talkback tracking of multi-party network conference |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111094320.4A CN113810653A (en) | 2021-09-17 | 2021-09-17 | Audio and video based method and system for talkback tracking of multi-party network conference |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113810653A true CN113810653A (en) | 2021-12-17 |
Family
ID=78939742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111094320.4A Pending CN113810653A (en) | 2021-09-17 | 2021-09-17 | Audio and video based method and system for talkback tracking of multi-party network conference |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113810653A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030122863A1 (en) * | 2001-12-28 | 2003-07-03 | International Business Machines Corporation | Navigation tool for slide presentations |
CN107507474A (en) * | 2016-06-14 | 2017-12-22 | 广东紫旭科技有限公司 | A kind of long-distance interactive recording and broadcasting system video switching method |
CN111405232A (en) * | 2020-03-05 | 2020-07-10 | 深圳震有科技股份有限公司 | Video conference speaker picture switching processing method and device, equipment and medium |
CN112445556A (en) * | 2019-08-29 | 2021-03-05 | 珠海金山办公软件有限公司 | Presentation play time reminding method and device |
-
2021
- 2021-09-17 CN CN202111094320.4A patent/CN113810653A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030122863A1 (en) * | 2001-12-28 | 2003-07-03 | International Business Machines Corporation | Navigation tool for slide presentations |
CN107507474A (en) * | 2016-06-14 | 2017-12-22 | 广东紫旭科技有限公司 | A kind of long-distance interactive recording and broadcasting system video switching method |
CN112445556A (en) * | 2019-08-29 | 2021-03-05 | 珠海金山办公软件有限公司 | Presentation play time reminding method and device |
CN111405232A (en) * | 2020-03-05 | 2020-07-10 | 深圳震有科技股份有限公司 | Video conference speaker picture switching processing method and device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108615527B (en) | Data processing method, device and storage medium based on simultaneous interpretation | |
US10885318B2 (en) | Performing artificial intelligence sign language translation services in a video relay service environment | |
US8630854B2 (en) | System and method for generating videoconference transcriptions | |
US20070285505A1 (en) | Method and apparatus for video conferencing having dynamic layout based on keyword detection | |
US20100283829A1 (en) | System and method for translating communications between participants in a conferencing environment | |
US20100253689A1 (en) | Providing descriptions of non-verbal communications to video telephony participants who are not video-enabled | |
EP1486949A1 (en) | Audio video conversion apparatus and method, and audio video conversion program | |
CN112653902B (en) | Speaker recognition method and device and electronic equipment | |
CN111405234A (en) | Video conference information system and method with integration of cloud computing and edge computing | |
CN102984496B (en) | The processing method of the audiovisual information in video conference, Apparatus and system | |
US20120259924A1 (en) | Method and apparatus for providing summary information in a live media session | |
US10726247B2 (en) | System and method for monitoring qualities of teaching and learning | |
CN111554280A (en) | Real-time interpretation service system for mixing interpretation contents using artificial intelligence and interpretation contents of interpretation experts | |
CN116524791A (en) | Lip language learning auxiliary training system based on meta universe and application thereof | |
US20210312143A1 (en) | Real-time call translation system and method | |
CN116366800B (en) | Online conference method and device, storage medium and electronic equipment | |
JP2006229903A (en) | Conference supporting system, method and computer program | |
CN113810653A (en) | Audio and video based method and system for talkback tracking of multi-party network conference | |
US11848026B2 (en) | Performing artificial intelligence sign language translation services in a video relay service environment | |
US20220308825A1 (en) | Automatic toggling of a mute setting during a communication session | |
CN111028837B (en) | Voice conversation method, voice recognition system and computer storage medium | |
CN113225521B (en) | Video conference control method and device and electronic equipment | |
CN110996036A (en) | Remote online conference management system based on AI intelligent technology | |
CN112351238B (en) | Video conference all-in-one machine and image transmission detection system thereof | |
KR102041730B1 (en) | System and Method for Providing Both Way Simultaneous Interpretation System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20211217 |