CN111651632A - Method and device for outputting voice and video of speaker in video conference


Info

Publication number
CN111651632A
CN111651632A (application CN202010325419.XA)
Authority
CN
China
Prior art keywords: face; video; current speaker; speaker; real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010325419.XA
Other languages
Chinese (zh)
Inventor
晏冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Infineon Information Co ltd
Original Assignee
Shenzhen Infinova Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Infinova Intelligent Technology Co Ltd filed Critical Shenzhen Infinova Intelligent Technology Co Ltd
Priority to CN202010325419.XA priority Critical patent/CN111651632A/en
Publication of CN111651632A publication Critical patent/CN111651632A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/155Conference systems involving storage of or access to video conference sessions

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a method and a device for outputting the audio and video of a speaker in a video conference. The method comprises the following steps: when original video information of the local conference room in a video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information; performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker; matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker; cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data; and outputting the individual audio/video stream of the current speaker. The invention can output an individual audio/video stream of the speaker, thereby improving the user experience.

Description

Method and device for outputting voice and video of speaker in video conference
Technical Field
The invention relates to the technical field of video image processing, and in particular to a method, a device, and a readable storage medium for outputting the audio and video of a speaker in a video conference.
Background
At present, remote video conferences generally use a snapshot camera to capture images and upload them to the remote conference room. In a traditional video conference the snapshot camera serves a single function: the captured video covers everything within the camera's field of view, so the current speaker cannot be shown in a separate video picture, which degrades the user experience.
In view of the above, it is necessary to provide further improvements to the current video display technology.
Disclosure of Invention
To solve at least one of the above technical problems, the primary object of the present invention is to provide a method, an apparatus, and a readable storage medium for outputting the audio and video of a speaker in a video conference.
In order to achieve the above purpose, the first technical solution adopted by the present invention is to provide a method for outputting the audio and video of a speaker in a video conference, the method comprising the following steps:
when original video information of the local conference room in a video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data; and
outputting the individual audio/video stream of the current speaker.
The face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face positions specifically comprises:
performing face detection on all participants in the original video information to obtain a plurality of pieces of face information; and
numbering the pieces of face information and obtaining the corresponding face angle and face position information, wherein the face position information is the position of the face within the video picture.
The matching of the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker specifically comprises:
matching the face acoustic angle of the current speaker against each of the plurality of face angles in the local conference room;
if the face acoustic angle of the current speaker is successfully matched with a target face angle in the local conference room, obtaining the target face position information corresponding to that target face angle; and
determining the target face position information as the face position of the current speaker.
The cropping of a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and the generating of an individual video of the current speaker from the real-time face picture and the audio data, specifically comprise:
cropping a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker;
forming an individual video picture of the speaker from the plurality of frames of real-time face pictures; and
generating an individual audio/video stream of the current speaker from the collected audio data of the current speaker and the individual video picture.
The cropping of a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker specifically comprises:
obtaining the size ratio between the original video information and the video cropping area;
cropping the original video information according to the video cropping area to form a plurality of frames of real-time face pictures of the speaker while speaking; and
enlarging each frame of real-time face picture proportionally according to the size ratio.
The size ratio comprises a vertical ratio and a horizontal ratio, and the proportional enlargement of the frames of real-time face pictures according to the size ratio specifically comprises:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display margins of the enlarged real-time face picture.
The outputting of the individual audio/video stream of the current speaker specifically comprises:
transmitting the individual audio/video stream of the current speaker; and
outputting, through the remote conference room, the individual audio/video stream of the current speaker in the local conference room.
In order to achieve the above purpose, the second technical solution adopted by the present invention is to provide a device for outputting the audio and video of a speaker in a video conference, comprising:
a face detection module, configured to, when original video information of the local conference room in a video conference is acquired, perform face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
an acoustic localization module, configured to perform real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
an audio/video synchronization module, configured to match the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
a cropping and encoding module, configured to crop a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and to generate an individual video of the current speaker from the real-time face picture and the audio data; and
an output module, configured to output the individual audio/video stream of the current speaker.
In order to achieve the above object, the third technical solution adopted by the present invention is to provide an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method.
In order to achieve the above object, the fourth technical solution adopted by the present invention is to provide a readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the above method.
According to the technical scheme of the invention, when original video information of the local conference room in a video conference is acquired, face detection is first performed on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information; meanwhile, real-time acoustic localization of the current speaker in the local conference room yields the face acoustic angle and the audio data of the current speaker. The face acoustic angle of the current speaker is then matched against the plurality of face angles in the local conference room to obtain the face position information of the current speaker, a real-time face picture of the current speaker is cropped out of the original video information according to that position information, an individual audio/video stream of the current speaker is generated from the real-time face picture and the audio data, and finally the individual audio/video stream of the current speaker is output.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for outputting audio and video of a speaker in a video conference according to a first embodiment of the present invention;
fig. 2 is a flowchart of a method for outputting audio and video of a speaker in a video conference according to a second embodiment of the present invention;
FIG. 3 is a block diagram of a speaker audio/video output device in a video conference according to a third embodiment of the present invention;
fig. 4 is a block diagram of an electronic device according to a fourth embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
It should be noted that references to "first", "second", etc. in the description of the invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or an implicit number of the technical features indicated; a feature qualified as "first" or "second" may thus explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that the combination can be carried out by a person skilled in the art; when technical solutions are contradictory or a combination cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides a method for outputting the audio and video of a speaker in a video conference, which can output an individual audio/video stream for the speaker in the video conference; reference is made to the following embodiments.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for outputting audio and video of a speaker in a video conference according to a first embodiment of the present invention. In the embodiment of the invention, the method for outputting the voice and the video of the speaker in the video conference comprises the following steps:
s101, when the original video information of a local meeting place in the video conference is acquired, face detection is carried out on the participants in the original video information, and a plurality of face angles and corresponding face position information are obtained.
Specifically, video information of the local conference room is captured by a snapshot camera and used as the original video information. The original video information may be YUV video data, on which face detection is then performed to obtain the plurality of face angles and the corresponding face position information.
Further, the face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face positions specifically comprises:
performing face detection on all participants in the original video information to obtain a plurality of pieces of face information; and
numbering the pieces of face information and obtaining the corresponding face angle and face position information, wherein the face position information is the position of the face within the video picture.
Specifically, the original video information may be captured by at least one snapshot camera and includes all participants, whose face information can be obtained through face detection. In practice, face detection may be restricted to a set area of the original video information, which improves the image-processing efficiency. After the pieces of face information are obtained, they are numbered and the corresponding face angle and face position information are obtained. Each face has a unique number, from which the face angle and the position of the face within the video picture are retrieved.
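As a rough illustration of this step (not part of the patent), numbering the detected faces and deriving a horizontal face angle from each face's position in the frame could be sketched as follows. The linear pixel-to-angle mapping and the field-of-view value are simplifying assumptions, and `faces_from_detections` is a hypothetical helper fed with bounding boxes from any face detector:

```python
from dataclasses import dataclass

@dataclass
class Face:
    face_id: int        # unique number assigned to this face
    bbox: tuple         # (x, y, w, h) position within the video picture
    angle_deg: float    # horizontal angle relative to the camera axis

def faces_from_detections(detections, frame_width, fov_deg=90.0):
    """Number each detected face and derive its horizontal angle.

    `detections` is a list of (x, y, w, h) boxes as a face detector
    would return; the angle is estimated from the box centre under a
    simple linear mapping across an assumed field of view (a pinhole
    model would use arctangent instead).
    """
    faces = []
    for i, (x, y, w, h) in enumerate(detections):
        cx = x + w / 2.0
        # Map the centre pixel to [-fov/2, +fov/2] degrees.
        angle = (cx / frame_width - 0.5) * fov_deg
        faces.append(Face(face_id=i, bbox=(x, y, w, h), angle_deg=angle))
    return faces

# Example: three participants in a 1920-pixel-wide frame.
faces = faces_from_detections(
    [(100, 200, 160, 160), (880, 210, 160, 160), (1650, 190, 160, 160)],
    frame_width=1920)
```

A face centred in the frame maps to an angle of 0; faces left and right of centre map to negative and positive angles respectively.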
S102, performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker.
Specifically, when a speaker speaks in the local conference room at a certain time ta_y, the face acoustic angle and the audio data of the speaker at time ta_y can be obtained by acoustic localization; the face acoustic angle is obtained by the acoustic localization module detecting the speaker's speech.
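The patent does not specify the localization algorithm; as one common possibility, a minimal sketch under a far-field two-microphone model is shown below. `doa_from_tdoa`, the microphone spacing, and the time-difference value are illustrative assumptions only:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def doa_from_tdoa(tdoa_s, mic_spacing_m):
    """Estimate the direction of arrival, in degrees from broadside,
    from the time difference of arrival at a two-microphone pair.

    The far-field model gives sin(theta) = c * tdoa / d; the argument
    is clamped to [-1, 1] to tolerate measurement noise.
    """
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tdoa_s / mic_spacing_m))
    return math.degrees(math.asin(s))

# A speaker slightly to one side of a 10 cm microphone pair:
angle = doa_from_tdoa(tdoa_s=100e-6, mic_spacing_m=0.10)
```

A zero time difference corresponds to a speaker directly in front of the pair (0 degrees); larger differences map to larger angles.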
S103, matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker.
Specifically, after the face acoustic angle of the current speaker is obtained, it is matched one by one against all detected face angles; the corresponding face position information is then obtained from the successfully matched face angle.
Further, the matching of the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker specifically comprises:
matching the face acoustic angle of the current speaker against each of the plurality of face angles in the local conference room;
if the face acoustic angle of the current speaker is successfully matched with a target face angle in the local conference room, obtaining the target face position information corresponding to that target face angle; and
determining the target face position information as the face position of the current speaker.
Specifically, the face acoustic angle of the current speaker is matched against the plurality of face angles, the face angle of the current speaker is determined from the matching result, and the face position is determined from that face angle. The face position is the rectangular area occupied by the speaker's face in the original video picture.
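The matching step above can be sketched as a nearest-angle search within a tolerance; the function name, the tuple layout, and the 10-degree tolerance are assumptions for illustration, not taken from the patent:

```python
def match_speaker_face(acoustic_angle_deg, faces, tolerance_deg=10.0):
    """Match the acoustic angle of the current speaker against the
    detected face angles; return (face_id, bbox) of the closest face
    within the tolerance, or None if no face matches.

    `faces` is a list of (face_id, angle_deg, bbox) tuples as the
    face-detection step might produce.
    """
    best = None
    for face_id, angle_deg, bbox in faces:
        diff = abs(angle_deg - acoustic_angle_deg)
        if diff <= tolerance_deg and (best is None or diff < best[0]):
            best = (diff, face_id, bbox)
    return None if best is None else (best[1], best[2])

faces = [(0, -30.0, (100, 200, 160, 160)),
         (1, 0.5, (880, 210, 160, 160)),
         (2, 32.0, (1650, 190, 160, 160))]
match = match_speaker_face(2.0, faces)  # the face nearest the acoustic angle
```

Returning None when no face falls within the tolerance corresponds to the "successfully matched" condition in the claim: the target face position is used only when a match is found.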
S104, cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data.
Specifically, after the face position of the current speaker is obtained, a real-time face picture of the speaker is cropped out of the original video information and combined with the collected audio data of the speaker to form the individual audio/video stream of the current speaker.
Further, the cropping of a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and the generating of an individual video of the current speaker from the real-time face picture and the audio data, specifically comprise:
cropping a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker;
forming an individual video picture of the speaker from the plurality of frames of real-time face pictures; and
generating an individual audio/video stream of the current speaker from the collected audio data of the current speaker and the individual video picture.
Further, the cropping of a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker specifically comprises:
obtaining the size ratio between the original video information and the video cropping area;
cropping the original video information according to the video cropping area to form a plurality of frames of real-time face pictures of the speaker while speaking; and
enlarging each frame of real-time face picture proportionally according to the size ratio.
Specifically, because the cropped face picture is smaller than the original video picture, it needs to be enlarged. In this embodiment, the requirement of a consistent video output size is met by enlarging the frames of real-time face pictures proportionally. In practice the cropping area is rectangular; if its aspect ratio matches that of the original video picture, it can be enlarged uniformly according to the size ratio. If the aspect ratios differ, the cropping area is enlarged according to the vertical ratio or the horizontal ratio so that the enlarged real-time face picture is not distorted.
Further, the size ratio comprises a vertical ratio and a horizontal ratio, and the proportional enlargement of the frames of real-time face pictures according to the size ratio specifically comprises:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display margins of the enlarged real-time face picture. This prevents distortion of the real-time face picture.
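The enlargement-with-blank-margins rule described above reduces to simple arithmetic; the following sketch (function name and rounding policy are assumptions) computes the enlarged size and the blank side margins when the crop is scaled by the vertical ratio:

```python
def enlarge_crop(crop_w, crop_h, out_w, out_h):
    """Scale a cropped face region up to the output frame size while
    preserving its aspect ratio.

    The crop is enlarged by the vertical ratio out_h / crop_h, as the
    description above suggests; if the enlarged width falls short of
    the output width, the remainder is reported as blank side margins
    so the face is not stretched horizontally.
    """
    scale = out_h / crop_h
    new_w = round(crop_w * scale)
    new_h = out_h
    pad = max(0, out_w - new_w)
    left, right = pad // 2, pad - pad // 2
    return new_w, new_h, left, right

# A 300x400 face crop enlarged into a 1280x720 output frame:
dims = enlarge_crop(300, 400, 1280, 720)
```

When the crop already has the output aspect ratio, the margins are zero and the enlargement is a plain uniform scale, matching the "same aspect ratio" case in the description.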
S105, outputting the individual audio/video stream of the current speaker.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for outputting the audio and video of a speaker in a video conference according to a second embodiment of the present invention. In this embodiment, the method comprises:
S201, when original video information of the local conference room in a video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
S202, performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
S203, matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
S204, cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data;
S205, transmitting the individual audio/video stream of the current speaker; and
S206, outputting, through the remote conference room, the individual audio/video stream of the current speaker in the local conference room.
For S201 to S204, reference may be made to the detailed description in the above embodiment. For S205 to S206, specifically, the individual audio/video stream may be sent in real time to the terminal of the local conference room through a network or a USB interface and transmitted over an ADSL network, and the individual audio/video stream of the current speaker in the local conference room is then output through the remote conference room.
Referring to fig. 3, fig. 3 is a block diagram of a speaker audio/video output apparatus in a video conference according to a third embodiment of the present invention. In an embodiment of the present invention, the apparatus for outputting audio and video of a speaker in a video conference includes:
a face detection module 101, configured to, when original video information of the local conference room in a video conference is acquired, perform face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
an acoustic localization module 102, configured to perform real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
an audio/video synchronization module 103, configured to match the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
a cropping and encoding module 104, configured to crop a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and to generate an individual video of the current speaker from the real-time face picture and the audio data; and
an audio/video output module 105, configured to output the individual audio/video stream of the current speaker.
In this embodiment, the face detection module 101 performs face detection on the participants in the original video information when original video information of the local conference room in a video conference is acquired, obtaining a plurality of face angles and the corresponding face position information. The acoustic localization module 102 performs real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker. The audio/video synchronization module 103 matches the face acoustic angle of the current speaker against the plurality of face angles in the local conference room, obtaining the face position information of the current speaker. The cropping and encoding module 104 crops a real-time face picture of the current speaker out of the original video information according to that position information and generates an individual audio/video stream of the current speaker from the real-time face picture and the audio data. The output module then outputs the individual audio/video stream of the current speaker, so that the current speaker can be displayed and played individually, improving the user experience.
The face detection module 101 is specifically configured to:
carrying out face detection on all participants in the original video information to obtain a plurality of face information;
the method comprises the steps of numbering a plurality of face information, and obtaining corresponding face angles and face position information, wherein the face position information is the position of a face in a video picture.
The audio and video synchronization module 103 is specifically configured to:
matching the face acoustic angle of the current speaker with a plurality of face angles in a local meeting place respectively;
if the face acoustic angle of the current speaker is successfully matched with the face angle in the local meeting place, target face position information corresponding to the target face angle is obtained;
and determining the target face position information as the face position of the current speaker.
The cropping and encoding module 104 is specifically configured to:
cropping multiple frames of real-time face pictures of the current speaker from the original video information according to the face position of the current speaker;
composing an independent video picture of the speaker from the multiple frames of real-time face pictures; and
generating an independent audio/video of the current speaker from the acquired audio data of the current speaker and the independent video picture.
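Combining the cropped picture sequence with the speaker's audio amounts to merging two timestamped streams. The sketch below shows only that merge step; the `(timestamp, payload)` tuple layout is an assumption, and a real implementation would hand the merged stream to an encoder/muxer (e.g. H.264 video plus AAC audio) to produce the independent audio/video:

```python
def interleave_av(video, audio):
    """Merge timestamped video frames and audio chunks into one ordered stream.

    video and audio are lists of (timestamp, payload) tuples; the result is a
    list of (timestamp, kind, payload) sorted by timestamp, ready for a muxer.
    """
    tagged = ([(t, "video", p) for t, p in video] +
              [(t, "audio", p) for t, p in audio])
    tagged.sort(key=lambda entry: entry[0])  # stable sort keeps video first at ties
    return tagged
```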
Wherein, the cropping and encoding module 104 is further configured to:
acquiring the size ratio between the original video information and a video cropping area;
cropping the original video information according to the video cropping area to form multiple frames of real-time face pictures of the speaker while speaking; and
enlarging each of the multiple frames of real-time face pictures by the same factor according to the size ratio.
Wherein, the cropping and encoding module 104 is further configured to:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display edges of the enlarged real-time face picture.
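The vertical-ratio scaling with blank side margins reduces to simple geometry. The sketch below assumes the output resolution is given and that the blank edges are split symmetrically (left/right), which the disclosure does not specify:

```python
def enlarge_crop(crop_w, crop_h, out_w, out_h):
    """Scale a cropped face region by the vertical ratio and pad horizontally.

    Both dimensions are multiplied by the same factor (out_h / crop_h) so the
    face keeps its aspect ratio; if the scaled width falls short of the output
    width, the remainder becomes blank margins at the display edges.
    Returns (scaled_w, scaled_h, pad_left, pad_right).
    """
    scale = out_h / crop_h              # the vertical ratio drives the enlargement
    scaled_w = round(crop_w * scale)
    pad = max(out_w - scaled_w, 0)
    return scaled_w, out_h, pad // 2, pad - pad // 2
```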
The audio/video output module 105 is specifically configured to:
transmitting the independent audio/video of the current speaker; and
outputting the independent audio/video of the current speaker of the local conference room at the remote conference room.
Referring to fig. 4, fig. 4 is a block diagram of an electronic device according to a fourth embodiment of the invention. The electronic device can be used to implement the method for outputting a speaker's audio and video in a video conference described in the foregoing embodiments. As shown in fig. 4, the electronic device mainly includes: a memory 301, a processor 302, a bus 303, and a computer program stored in the memory 301 and executable on the processor 302, the memory 301 and the processor 302 being connected via the bus 303. When executing the computer program, the processor 302 implements the method for outputting a speaker's audio and video in a video conference of the foregoing embodiments. The number of processors may be one or more.
The memory 301 may be a random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 301 is configured to store executable program code, and the processor 302 is coupled to the memory 301.
Further, an embodiment of the present application also provides a readable storage medium, where the readable storage medium may be provided in the electronic device in the foregoing embodiments, and the readable storage medium may be the memory in the foregoing embodiment shown in fig. 4.
The readable storage medium stores a computer program which, when executed by a processor, implements the method for outputting a speaker's audio and video in a video conference of the foregoing embodiments. Further, the computer-readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a readable storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned readable storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents made by the contents of the specification and drawings or directly/indirectly applied to other related technical fields within the spirit of the present invention are included in the scope of the present invention.

Claims (10)

1. A method for outputting a speaker's audio and video in a video conference, characterized by comprising the following steps:
when original video information of a local conference room in the video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and corresponding face position information;
performing real-time acoustic positioning of the current speaker in the local conference room to obtain the face acoustic angle and audio data of the current speaker;
matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
cropping a real-time face picture of the current speaker from the original video information according to the face position information of the current speaker, and generating an independent audio/video of the current speaker from the real-time face picture and the audio data; and
outputting the independent audio/video of the current speaker.
2. The method according to claim 1, wherein performing face detection on the participants in the original video information to obtain a plurality of face angles and corresponding face position information specifically comprises:
performing face detection on all participants in the original video information to obtain a plurality of pieces of face information; and
numbering the plurality of pieces of face information and obtaining the corresponding face angles and face position information, wherein the face position information is the position of each face in the video picture.
3. The method according to claim 2, wherein matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker specifically comprises:
matching the face acoustic angle of the current speaker against each of the plurality of face angles in the local conference room;
if the face acoustic angle of the current speaker successfully matches a target face angle in the local conference room, obtaining the target face position information corresponding to that target face angle; and
determining the target face position information as the face position of the current speaker.
4. The method according to claim 3, wherein cropping a real-time face picture of the current speaker from the original video information according to the face position information of the current speaker, and generating an independent audio/video of the current speaker from the real-time face picture and the audio data specifically comprises:
cropping multiple frames of real-time face pictures of the current speaker from the original video information according to the face position of the current speaker;
composing an independent video picture of the speaker from the multiple frames of real-time face pictures; and
generating an independent audio/video of the current speaker from the acquired audio data of the current speaker and the independent video picture.
5. The method according to claim 4, wherein cropping multiple frames of real-time face pictures of the current speaker from the original video information according to the face position of the current speaker specifically comprises:
acquiring the size ratio between the original video information and a video cropping area;
cropping the original video information according to the video cropping area to form multiple frames of real-time face pictures of the speaker while speaking; and
enlarging each of the multiple frames of real-time face pictures by the same factor according to the size ratio.
6. The method according to claim 5, wherein the size ratio comprises a vertical ratio and a horizontal ratio, and enlarging each of the multiple frames of real-time face pictures by the same factor according to the size ratio specifically comprises:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display edges of the enlarged real-time face picture.
7. The method according to claim 1, wherein outputting the independent audio/video of the current speaker specifically comprises:
transmitting the independent audio/video of the current speaker; and
outputting the independent audio/video of the current speaker of the local conference room at the remote conference room.
8. A device for outputting a speaker's audio and video in a video conference, the device comprising:
a face detection module, configured to perform face detection on the participants in original video information of a local conference room in the video conference, when that video information is acquired, to obtain a plurality of face angles and corresponding face position information;
an acoustic positioning module, configured to perform real-time acoustic positioning of the current speaker in the local conference room to obtain the face acoustic angle and audio data of the current speaker;
an audio/video synchronization module, configured to match the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
a cropping and encoding module, configured to crop a real-time face picture of the current speaker from the original video information according to the face position information of the current speaker, and to generate an independent audio/video of the current speaker from the real-time face picture and the audio data; and
an output module, configured to output the independent audio/video of the current speaker.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010325419.XA 2020-04-23 2020-04-23 Method and device for outputting voice and video of speaker in video conference Pending CN111651632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010325419.XA CN111651632A (en) 2020-04-23 2020-04-23 Method and device for outputting voice and video of speaker in video conference


Publications (1)

Publication Number Publication Date
CN111651632A true CN111651632A (en) 2020-09-11

Family

ID=72352210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010325419.XA Pending CN111651632A (en) 2020-04-23 2020-04-23 Method and device for outputting voice and video of speaker in video conference

Country Status (1)

Country Link
CN (1) CN111651632A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102368816A (en) * 2011-12-01 2012-03-07 中科芯集成电路股份有限公司 Intelligent front end system of video conference
CN105592268A (en) * 2016-03-03 2016-05-18 苏州科达科技股份有限公司 Video conferencing system, processing device and video conferencing method
WO2016095244A1 (en) * 2014-12-15 2016-06-23 深圳Tcl新技术有限公司 Method and device for adjusting video window in video conference
CN107333090A (en) * 2016-04-29 2017-11-07 中国电信股份有限公司 Videoconference data processing method and platform
CN109257559A (en) * 2018-09-28 2019-01-22 苏州科达科技股份有限公司 A kind of image display method, device and the video conferencing system of panoramic video meeting
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 A kind of sound localization method, device, equipment and storage medium
CN110673811A (en) * 2019-09-27 2020-01-10 深圳看到科技有限公司 Panoramic picture display method and device based on sound information positioning and storage medium


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347849A (en) * 2020-09-29 2021-02-09 咪咕视讯科技有限公司 Video conference processing method, electronic device and storage medium
CN112347849B (en) * 2020-09-29 2024-03-26 咪咕视讯科技有限公司 Video conference processing method, electronic equipment and storage medium
CN112541402A (en) * 2020-11-20 2021-03-23 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN112487978A (en) * 2020-11-30 2021-03-12 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN114594892A (en) * 2022-01-29 2022-06-07 深圳壹秘科技有限公司 Remote interaction method, remote interaction device and computer storage medium
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium
CN114598819A (en) * 2022-03-16 2022-06-07 维沃移动通信有限公司 Video recording method and device and electronic equipment
CN116866509A (en) * 2023-07-10 2023-10-10 深圳市创载网络科技有限公司 Conference scene picture tracking method, device and storage medium
CN116866509B (en) * 2023-07-10 2024-02-23 深圳市创载网络科技有限公司 Conference scene picture tracking method, device and storage medium
CN116781856A (en) * 2023-07-12 2023-09-19 深圳市艾姆诗电商股份有限公司 Audio-visual conversion control method, system and storage medium based on deep learning

Similar Documents

Publication Publication Date Title
CN111651632A (en) Method and device for outputting voice and video of speaker in video conference
CN109754811B (en) Sound source tracking method, device, equipment and storage medium based on biological characteristics
CN102655585B (en) Video conference system and time delay testing method, device and system thereof
WO2019184650A1 (en) Subtitle generation method and terminal
US11076127B1 (en) System and method for automatically framing conversations in a meeting or a video conference
CN109819316B (en) Method and device for processing face sticker in video, storage medium and electronic equipment
CN112004046A (en) Image processing method and device based on video conference
CN110673811B (en) Panoramic picture display method and device based on sound information positioning and storage medium
US10468029B2 (en) Communication terminal, communication method, and computer program product
WO2015139562A1 (en) Method for implementing video conference, synthesis device, and system
CN107733874B (en) Information processing method, information processing device, computer equipment and storage medium
CN114531564A (en) Processing method and electronic equipment
CN111918127A (en) Video clipping method and device, computer readable storage medium and camera
CN108320331B (en) Method and equipment for generating augmented reality video information of user scene
CN112601099A (en) Live image processing method and device, storage medium and electronic equipment
CN104780341B (en) A kind of information processing method and information processing unit
CN113794814B (en) Method, device and storage medium for controlling video image output
CN111034187A (en) Dynamic image generation method and device, movable platform and storage medium
US10282633B2 (en) Cross-asset media analysis and processing
CN115514989A (en) Data transmission method, system and storage medium
US20200184973A1 (en) Transcription of communications
CN113784058A (en) Image generation method and device, storage medium and electronic equipment
CN114125365A (en) Video conference method, device and readable storage medium
CN112887653A (en) Information processing method and information processing device
CN111614928A (en) Positioning method, terminal device and conference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230111

Address after: 518000 Yingfei Haocheng Science Park, Guansheng 5th Road, Luhu Community, Guanhu Street, Longhua District, Shenzhen, Guangdong 1515

Applicant after: Shenzhen Infineon Information Co.,Ltd.

Address before: 518110 Room 301, Infineon Technology Co., Ltd., No. 12, Guanbao Road, Luhu community, Guanhu street, Longhua District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN INFINOVA INTELLIGENT TECHNOLOGY Co.,Ltd.
