CN111651632A - Method and device for outputting voice and video of speaker in video conference


Info

Publication number
CN111651632A
CN111651632A (application CN202010325419.XA)
Authority
CN
China
Prior art keywords: face; video; current speaker; speaker; real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010325419.XA
Other languages
Chinese (zh)
Inventor
晏冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Infineon Information Co ltd
Original Assignee
Shenzhen Infinova Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Infinova Intelligent Technology Co Ltd filed Critical Shenzhen Infinova Intelligent Technology Co Ltd
Priority to CN202010325419.XA priority Critical patent/CN111651632A/en
Publication of CN111651632A publication Critical patent/CN111651632A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/155Conference systems involving storage of or access to video conference sessions

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a method and a device for outputting the audio and video of a speaker in a video conference. The method comprises the following steps: when original video information of the local conference room in a video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information; performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker; matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker; cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data; and outputting the individual audio/video stream of the current speaker. The invention can output an individual audio/video stream of the speaker, thereby improving the user experience.

Description

Method and device for outputting voice and video of speaker in video conference
Technical Field
The invention relates to the technical field of video image processing, and in particular to a method, a device, and a readable storage medium for outputting the audio and video of a speaker in a video conference.
Background
At present, remote video conferences generally use a snapshot camera to capture images and upload them to the remote conference room. In a traditional video conference the snapshot camera serves a single function: the captured video covers everything within the camera's field of view, so the current speaker cannot be shown in a separate video picture, which degrades the user experience.
In view of the above, it is necessary to provide further improvements to the current video display technology.
Disclosure of Invention
To solve at least one of the above technical problems, the primary object of the present invention is to provide a method, an apparatus, and a readable storage medium for outputting the audio and video of a speaker in a video conference.
In order to achieve the above purpose, the first technical solution adopted by the present invention is to provide a method for outputting the audio and video of a speaker in a video conference, the method comprising the following steps:
when original video information of the local conference room in a video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data; and
outputting the individual audio/video stream of the current speaker.
The face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face positions specifically comprises:
performing face detection on all participants in the original video information to obtain a plurality of pieces of face information; and
numbering the pieces of face information and obtaining the corresponding face angle and face position information, wherein the face position information is the position of the face within the video picture.
The matching of the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker specifically comprises:
matching the face acoustic angle of the current speaker against each of the plurality of face angles in the local conference room;
if the face acoustic angle of the current speaker is successfully matched with a target face angle in the local conference room, obtaining the target face position information corresponding to that target face angle; and
determining the target face position information as the face position of the current speaker.
The cropping of a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and the generating of an individual video of the current speaker from the real-time face picture and the audio data, specifically comprise:
cropping a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker;
forming an individual video picture of the speaker from the plurality of frames of real-time face pictures; and
generating an individual audio/video stream of the current speaker from the collected audio data of the current speaker and the individual video picture.
The cropping of a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker specifically comprises:
obtaining the size ratio between the original video information and the video cropping area;
cropping the original video information according to the video cropping area to form a plurality of frames of real-time face pictures of the speaker while speaking; and
enlarging each frame of real-time face picture proportionally according to the size ratio.
The size ratio comprises a vertical ratio and a horizontal ratio, and the proportional enlargement of the frames of real-time face pictures according to the size ratio specifically comprises:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display margins of the enlarged real-time face picture.
The outputting of the individual audio/video stream of the current speaker specifically comprises:
transmitting the individual audio/video stream of the current speaker; and
outputting, through the remote conference room, the individual audio/video stream of the current speaker in the local conference room.
In order to achieve the above purpose, the second technical solution adopted by the present invention is to provide a device for outputting the audio and video of a speaker in a video conference, comprising:
a face detection module, configured to, when original video information of the local conference room in a video conference is acquired, perform face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
an acoustic localization module, configured to perform real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
an audio/video synchronization module, configured to match the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
a cropping and encoding module, configured to crop a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and to generate an individual video of the current speaker from the real-time face picture and the audio data; and
an output module, configured to output the individual audio/video stream of the current speaker.
In order to achieve the above object, the third technical solution adopted by the present invention is to provide an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method.
In order to achieve the above object, the fourth technical solution adopted by the present invention is to provide a readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the above method.
According to the technical scheme of the invention, when original video information of the local conference room in a video conference is acquired, face detection is first performed on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information; meanwhile, real-time acoustic localization of the current speaker in the local conference room yields the face acoustic angle and the audio data of the current speaker. The face acoustic angle of the current speaker is then matched against the plurality of face angles in the local conference room to obtain the face position information of the current speaker, a real-time face picture of the current speaker is cropped out of the original video information according to that position information, an individual audio/video stream of the current speaker is generated from the real-time face picture and the audio data, and finally the individual audio/video stream of the current speaker is output.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for outputting audio and video of a speaker in a video conference according to a first embodiment of the present invention;
fig. 2 is a flowchart of a method for outputting audio and video of a speaker in a video conference according to a second embodiment of the present invention;
FIG. 3 is a block diagram of a speaker audio/video output device in a video conference according to a third embodiment of the present invention;
fig. 4 is a block diagram of an electronic device according to a fourth embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
It should be noted that references to "first", "second", etc. in the description of the invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or an implicit number of the technical features indicated; a feature qualified as "first" or "second" may thus explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that the combination can be carried out by a person skilled in the art; when technical solutions are contradictory or a combination cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides a method for outputting the audio and video of a speaker in a video conference, which can output an individual audio/video stream for the speaker in the video conference; reference is made to the following embodiments.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for outputting audio and video of a speaker in a video conference according to a first embodiment of the present invention. In the embodiment of the invention, the method for outputting the voice and the video of the speaker in the video conference comprises the following steps:
s101, when the original video information of a local meeting place in the video conference is acquired, face detection is carried out on the participants in the original video information, and a plurality of face angles and corresponding face position information are obtained.
Specifically, video information of the local conference room is captured by a snapshot camera and used as the original video information. The original video information may be YUV video data, on which face detection is then performed to obtain the plurality of face angles and the corresponding face position information.
Further, the face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face positions specifically comprises:
performing face detection on all participants in the original video information to obtain a plurality of pieces of face information; and
numbering the pieces of face information and obtaining the corresponding face angle and face position information, wherein the face position information is the position of the face within the video picture.
Specifically, the original video information may be captured by at least one snapshot camera and includes all participants, whose face information can be obtained through face detection. In practice, face detection may be restricted to a set area of the original video information, which improves the image-processing efficiency. After the pieces of face information are obtained, they are numbered and the corresponding face angle and face position information are obtained. Each face has a unique number, from which the face angle and the position of the face within the video picture are retrieved.
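As a rough illustration of this step (not part of the patent), numbering the detected faces and deriving a horizontal face angle from each face's position in the frame could be sketched as follows. The linear pixel-to-angle mapping and the field-of-view value are simplifying assumptions, and `faces_from_detections` is a hypothetical helper fed with bounding boxes from any face detector:

```python
from dataclasses import dataclass

@dataclass
class Face:
    face_id: int        # unique number assigned to this face
    bbox: tuple         # (x, y, w, h) position within the video picture
    angle_deg: float    # horizontal angle relative to the camera axis

def faces_from_detections(detections, frame_width, fov_deg=90.0):
    """Number each detected face and derive its horizontal angle.

    `detections` is a list of (x, y, w, h) boxes as a face detector
    would return; the angle is estimated from the box centre under a
    simple linear mapping across an assumed field of view (a pinhole
    model would use arctangent instead).
    """
    faces = []
    for i, (x, y, w, h) in enumerate(detections):
        cx = x + w / 2.0
        # Map the centre pixel to [-fov/2, +fov/2] degrees.
        angle = (cx / frame_width - 0.5) * fov_deg
        faces.append(Face(face_id=i, bbox=(x, y, w, h), angle_deg=angle))
    return faces

# Example: three participants in a 1920-pixel-wide frame.
faces = faces_from_detections(
    [(100, 200, 160, 160), (880, 210, 160, 160), (1650, 190, 160, 160)],
    frame_width=1920)
```

A face centred in the frame maps to an angle of 0; faces left and right of centre map to negative and positive angles respectively.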
S102, performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker.
Specifically, when a speaker speaks in the local conference room at a certain time ta_y, the face acoustic angle and the audio data of the speaker at time ta_y can be obtained by acoustic localization; the face acoustic angle is obtained by the acoustic localization module detecting the speaker's speech.
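The patent does not specify the localization algorithm; as one common possibility, a minimal sketch under a far-field two-microphone model is shown below. `doa_from_tdoa`, the microphone spacing, and the time-difference value are illustrative assumptions only:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def doa_from_tdoa(tdoa_s, mic_spacing_m):
    """Estimate the direction of arrival, in degrees from broadside,
    from the time difference of arrival at a two-microphone pair.

    The far-field model gives sin(theta) = c * tdoa / d; the argument
    is clamped to [-1, 1] to tolerate measurement noise.
    """
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tdoa_s / mic_spacing_m))
    return math.degrees(math.asin(s))

# A speaker slightly to one side of a 10 cm microphone pair:
angle = doa_from_tdoa(tdoa_s=100e-6, mic_spacing_m=0.10)
```

A zero time difference corresponds to a speaker directly in front of the pair (0 degrees); larger differences map to larger angles.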
S103, matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker.
Specifically, after the face acoustic angle of the current speaker is obtained, it is matched one by one against all detected face angles; the corresponding face position information is then obtained from the successfully matched face angle.
Further, the matching of the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker specifically comprises:
matching the face acoustic angle of the current speaker against each of the plurality of face angles in the local conference room;
if the face acoustic angle of the current speaker is successfully matched with a target face angle in the local conference room, obtaining the target face position information corresponding to that target face angle; and
determining the target face position information as the face position of the current speaker.
Specifically, the face acoustic angle of the current speaker is matched against the plurality of face angles, the face angle of the current speaker is determined from the matching result, and the face position is determined from that face angle. The face position is the rectangular area occupied by the speaker's face in the original video picture.
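The matching step above can be sketched as a nearest-angle search within a tolerance; the function name, the tuple layout, and the 10-degree tolerance are assumptions for illustration, not taken from the patent:

```python
def match_speaker_face(acoustic_angle_deg, faces, tolerance_deg=10.0):
    """Match the acoustic angle of the current speaker against the
    detected face angles; return (face_id, bbox) of the closest face
    within the tolerance, or None if no face matches.

    `faces` is a list of (face_id, angle_deg, bbox) tuples as the
    face-detection step might produce.
    """
    best = None
    for face_id, angle_deg, bbox in faces:
        diff = abs(angle_deg - acoustic_angle_deg)
        if diff <= tolerance_deg and (best is None or diff < best[0]):
            best = (diff, face_id, bbox)
    return None if best is None else (best[1], best[2])

faces = [(0, -30.0, (100, 200, 160, 160)),
         (1, 0.5, (880, 210, 160, 160)),
         (2, 32.0, (1650, 190, 160, 160))]
match = match_speaker_face(2.0, faces)  # the face nearest the acoustic angle
```

Returning None when no face falls within the tolerance corresponds to the "successfully matched" condition in the claim: the target face position is used only when a match is found.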
S104, cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data.
Specifically, after the face position of the current speaker is obtained, a real-time face picture of the speaker is cropped out of the original video information and combined with the collected audio data of the speaker to form the individual audio/video stream of the current speaker.
Further, the cropping of a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and the generating of an individual video of the current speaker from the real-time face picture and the audio data, specifically comprise:
cropping a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker;
forming an individual video picture of the speaker from the plurality of frames of real-time face pictures; and
generating an individual audio/video stream of the current speaker from the collected audio data of the current speaker and the individual video picture.
Further, the cropping of a plurality of frames of real-time face pictures of the current speaker out of the original video information according to the face position of the current speaker specifically comprises:
obtaining the size ratio between the original video information and the video cropping area;
cropping the original video information according to the video cropping area to form a plurality of frames of real-time face pictures of the speaker while speaking; and
enlarging each frame of real-time face picture proportionally according to the size ratio.
Specifically, because the cropped face picture is smaller than the original video picture, it needs to be enlarged. In this embodiment, the requirement of a consistent video output size is met by enlarging the frames of real-time face pictures proportionally. In practice the cropping area is rectangular; if its aspect ratio matches that of the original video picture, it can be enlarged uniformly according to the size ratio. If the aspect ratios differ, the cropping area is enlarged according to the vertical ratio or the horizontal ratio so that the enlarged real-time face picture is not distorted.
Further, the size ratio comprises a vertical ratio and a horizontal ratio, and the proportional enlargement of the frames of real-time face pictures according to the size ratio specifically comprises:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display margins of the enlarged real-time face picture. This prevents distortion of the real-time face picture.
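The enlargement-with-blank-margins rule described above reduces to simple arithmetic; the following sketch (function name and rounding policy are assumptions) computes the enlarged size and the blank side margins when the crop is scaled by the vertical ratio:

```python
def enlarge_crop(crop_w, crop_h, out_w, out_h):
    """Scale a cropped face region up to the output frame size while
    preserving its aspect ratio.

    The crop is enlarged by the vertical ratio out_h / crop_h, as the
    description above suggests; if the enlarged width falls short of
    the output width, the remainder is reported as blank side margins
    so the face is not stretched horizontally.
    """
    scale = out_h / crop_h
    new_w = round(crop_w * scale)
    new_h = out_h
    pad = max(0, out_w - new_w)
    left, right = pad // 2, pad - pad // 2
    return new_w, new_h, left, right

# A 300x400 face crop enlarged into a 1280x720 output frame:
dims = enlarge_crop(300, 400, 1280, 720)
```

When the crop already has the output aspect ratio, the margins are zero and the enlargement is a plain uniform scale, matching the "same aspect ratio" case in the description.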
S105, outputting the individual audio/video stream of the current speaker.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for outputting the audio and video of a speaker in a video conference according to a second embodiment of the present invention. In this embodiment, the method comprises:
S201, when original video information of the local conference room in a video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
S202, performing real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
S203, matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
S204, cropping a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and generating an individual audio/video stream of the current speaker from the real-time face picture and the audio data;
S205, transmitting the individual audio/video stream of the current speaker; and
S206, outputting, through the remote conference room, the individual audio/video stream of the current speaker in the local conference room.
For S201 to S204, reference may be made to the detailed description in the above embodiment. For S205 to S206, specifically, the individual audio/video stream may be sent in real time to the terminal of the local conference room through a network or a USB interface and transmitted over an ADSL network, and the individual audio/video stream of the current speaker in the local conference room is then output through the remote conference room.
Referring to fig. 3, fig. 3 is a block diagram of a speaker audio/video output apparatus in a video conference according to a third embodiment of the present invention. In an embodiment of the present invention, the apparatus for outputting audio and video of a speaker in a video conference includes:
a face detection module 101, configured to, when original video information of the local conference room in a video conference is acquired, perform face detection on the participants in the original video information to obtain a plurality of face angles and the corresponding face position information;
an acoustic localization module 102, configured to perform real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker;
an audio/video synchronization module 103, configured to match the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
a cropping and encoding module 104, configured to crop a real-time face picture of the current speaker out of the original video information according to the face position information of the current speaker, and to generate an individual video of the current speaker from the real-time face picture and the audio data; and
an audio/video output module 105, configured to output the individual audio/video stream of the current speaker.
In this embodiment, the face detection module 101 performs face detection on the participants in the original video information when original video information of the local conference room in a video conference is acquired, obtaining a plurality of face angles and the corresponding face position information. The acoustic localization module 102 performs real-time acoustic localization of the current speaker in the local conference room to obtain the face acoustic angle and the audio data of the current speaker. The audio/video synchronization module 103 matches the face acoustic angle of the current speaker against the plurality of face angles in the local conference room, obtaining the face position information of the current speaker. The cropping and encoding module 104 crops a real-time face picture of the current speaker out of the original video information according to that position information and generates an individual audio/video stream of the current speaker from the real-time face picture and the audio data. The output module then outputs the individual audio/video stream of the current speaker, so that the current speaker can be displayed and played individually, improving the user experience.
The face detection module 101 is specifically configured to:
carrying out face detection on all participants in the original video information to obtain a plurality of face information;
the method comprises the steps of numbering a plurality of face information, and obtaining corresponding face angles and face position information, wherein the face position information is the position of a face in a video picture.
The audio and video synchronization module 103 is specifically configured to:
matching the face acoustic angle of the current speaker with a plurality of face angles in a local meeting place respectively;
if the face acoustic angle of the current speaker is successfully matched with the face angle in the local meeting place, target face position information corresponding to the target face angle is obtained;
and determining the target face position information as the face position of the current speaker.
The cropping and encoding module 104 is specifically configured to:
cropping multiple frames of real-time face pictures of the current speaker from the original video information according to the face position of the current speaker;
composing an independent video picture of the speaker from the multiple frames of real-time face pictures; and
generating an independent audio/video of the current speaker from the acquired audio data of the current speaker and the independent video picture.
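Combining the cropped picture sequence with the speaker's audio amounts to merging two timestamped streams. The sketch below shows only that merge step; the `(timestamp, payload)` tuple layout is an assumption, and a real implementation would hand the merged stream to an encoder/muxer (e.g. H.264 video plus AAC audio) to produce the independent audio/video:

```python
def interleave_av(video, audio):
    """Merge timestamped video frames and audio chunks into one ordered stream.

    video and audio are lists of (timestamp, payload) tuples; the result is a
    list of (timestamp, kind, payload) sorted by timestamp, ready for a muxer.
    """
    tagged = ([(t, "video", p) for t, p in video] +
              [(t, "audio", p) for t, p in audio])
    tagged.sort(key=lambda entry: entry[0])  # stable sort keeps video first at ties
    return tagged
```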
Wherein, the cropping and encoding module 104 is further configured to:
acquiring the size ratio between the original video information and a video cropping area;
cropping the original video information according to the video cropping area to form multiple frames of real-time face pictures of the speaker while speaking; and
enlarging each of the multiple frames of real-time face pictures by the same factor according to the size ratio.
Wherein, the cropping and encoding module 104 is further configured to:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display edges of the enlarged real-time face picture.
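The vertical-ratio scaling with blank side margins reduces to simple geometry. The sketch below assumes the output resolution is given and that the blank edges are split symmetrically (left/right), which the disclosure does not specify:

```python
def enlarge_crop(crop_w, crop_h, out_w, out_h):
    """Scale a cropped face region by the vertical ratio and pad horizontally.

    Both dimensions are multiplied by the same factor (out_h / crop_h) so the
    face keeps its aspect ratio; if the scaled width falls short of the output
    width, the remainder becomes blank margins at the display edges.
    Returns (scaled_w, scaled_h, pad_left, pad_right).
    """
    scale = out_h / crop_h              # the vertical ratio drives the enlargement
    scaled_w = round(crop_w * scale)
    pad = max(out_w - scaled_w, 0)
    return scaled_w, out_h, pad // 2, pad - pad // 2
```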
The audio/video output module 105 is specifically configured to:
transmitting the independent audio/video of the current speaker; and
outputting the independent audio/video of the current speaker of the local conference room at the remote conference room.
Referring to fig. 4, fig. 4 is a block diagram of an electronic device according to a fourth embodiment of the invention. The electronic device can be used to implement the method for outputting a speaker's audio and video in a video conference described in the foregoing embodiments. As shown in fig. 4, the electronic device mainly includes: a memory 301, a processor 302, a bus 303, and a computer program stored in the memory 301 and executable on the processor 302, the memory 301 and the processor 302 being connected via the bus 303. When executing the computer program, the processor 302 implements the method for outputting a speaker's audio and video in a video conference of the foregoing embodiments. The number of processors may be one or more.
The memory 301 may be a random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 301 is configured to store executable program code, and the processor 302 is coupled to the memory 301.
Further, an embodiment of the present application also provides a readable storage medium, where the readable storage medium may be provided in the electronic device in the foregoing embodiments, and the readable storage medium may be the memory in the foregoing embodiment shown in fig. 4.
The readable storage medium stores a computer program which, when executed by a processor, implements the method for outputting a speaker's audio and video in a video conference of the foregoing embodiments. Further, the computer-readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a readable storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned readable storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents made by the contents of the specification and drawings or directly/indirectly applied to other related technical fields within the spirit of the present invention are included in the scope of the present invention.

Claims (10)

1. A method for outputting a speaker's audio and video in a video conference, characterized by comprising the following steps:
when original video information of a local conference room in the video conference is acquired, performing face detection on the participants in the original video information to obtain a plurality of face angles and corresponding face position information;
performing real-time acoustic positioning of the current speaker in the local conference room to obtain the face acoustic angle and audio data of the current speaker;
matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
cropping a real-time face picture of the current speaker from the original video information according to the face position information of the current speaker, and generating an independent audio/video of the current speaker from the real-time face picture and the audio data; and
outputting the independent audio/video of the current speaker.
2. The method according to claim 1, wherein performing face detection on the participants in the original video information to obtain a plurality of face angles and corresponding face position information specifically comprises:
performing face detection on all participants in the original video information to obtain a plurality of pieces of face information; and
numbering the plurality of pieces of face information and obtaining the corresponding face angles and face position information, wherein the face position information is the position of each face in the video picture.
3. The method according to claim 2, wherein matching the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker specifically comprises:
matching the face acoustic angle of the current speaker against each of the plurality of face angles in the local conference room;
if the face acoustic angle of the current speaker successfully matches a target face angle in the local conference room, obtaining the target face position information corresponding to that target face angle; and
determining the target face position information as the face position of the current speaker.
4. The method according to claim 3, wherein cropping a real-time face picture of the current speaker from the original video information according to the face position information of the current speaker, and generating an independent audio/video of the current speaker from the real-time face picture and the audio data specifically comprises:
cropping multiple frames of real-time face pictures of the current speaker from the original video information according to the face position of the current speaker;
composing an independent video picture of the speaker from the multiple frames of real-time face pictures; and
generating an independent audio/video of the current speaker from the acquired audio data of the current speaker and the independent video picture.
5. The method according to claim 4, wherein cropping multiple frames of real-time face pictures of the current speaker from the original video information according to the face position of the current speaker specifically comprises:
acquiring the size ratio between the original video information and a video cropping area;
cropping the original video information according to the video cropping area to form multiple frames of real-time face pictures of the speaker while speaking; and
enlarging each of the multiple frames of real-time face pictures by the same factor according to the size ratio.
6. The method according to claim 5, wherein the size ratio comprises a vertical ratio and a horizontal ratio, and enlarging each of the multiple frames of real-time face pictures by the same factor according to the size ratio specifically comprises:
enlarging the vertical and horizontal dimensions of the real-time face picture by the same factor according to the vertical ratio; and
if the horizontal dimension of the enlarged real-time face picture is smaller than that of the original picture, using blank areas as the display edges of the enlarged real-time face picture.
7. The method according to claim 1, wherein outputting the independent audio/video of the current speaker specifically comprises:
transmitting the independent audio/video of the current speaker; and
outputting the independent audio/video of the current speaker of the local conference room at the remote conference room.
8. A device for outputting a speaker's audio and video in a video conference, the device comprising:
a face detection module, configured to perform face detection on the participants in original video information of a local conference room in the video conference, when that video information is acquired, to obtain a plurality of face angles and corresponding face position information;
an acoustic positioning module, configured to perform real-time acoustic positioning of the current speaker in the local conference room to obtain the face acoustic angle and audio data of the current speaker;
an audio/video synchronization module, configured to match the face acoustic angle of the current speaker against the plurality of face angles in the local conference room to obtain the face position information of the current speaker;
a cropping and encoding module, configured to crop a real-time face picture of the current speaker from the original video information according to the face position information of the current speaker, and to generate an independent audio/video of the current speaker from the real-time face picture and the audio data; and
an output module, configured to output the independent audio/video of the current speaker.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010325419.XA 2020-04-23 2020-04-23 Method and device for outputting voice and video of speaker in video conference Pending CN111651632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010325419.XA CN111651632A (en) 2020-04-23 2020-04-23 Method and device for outputting voice and video of speaker in video conference


Publications (1)

Publication Number Publication Date
CN111651632A true CN111651632A (en) 2020-09-11

Family

ID=72352210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010325419.XA Pending CN111651632A (en) 2020-04-23 2020-04-23 Method and device for outputting voice and video of speaker in video conference

Country Status (1)

Country Link
CN (1) CN111651632A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102368816A (en) * 2011-12-01 2012-03-07 中科芯集成电路股份有限公司 Intelligent front end system of video conference
CN105592268A (en) * 2016-03-03 2016-05-18 苏州科达科技股份有限公司 Video conferencing system, processing device and video conferencing method
WO2016095244A1 (en) * 2014-12-15 2016-06-23 深圳Tcl新技术有限公司 Method and device for adjusting video window in video conference
CN107333090A (en) * 2016-04-29 2017-11-07 中国电信股份有限公司 Videoconference data processing method and platform
CN109257559A (en) * 2018-09-28 2019-01-22 苏州科达科技股份有限公司 A kind of image display method, device and the video conferencing system of panoramic video meeting
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 A kind of sound localization method, device, equipment and storage medium
CN110673811A (en) * 2019-09-27 2020-01-10 深圳看到科技有限公司 Panoramic picture display method and device based on sound information positioning and storage medium


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347849A (en) * 2020-09-29 2021-02-09 咪咕视讯科技有限公司 Video conference processing method, electronic device and storage medium
CN112347849B (en) * 2020-09-29 2024-03-26 咪咕视讯科技有限公司 Video conference processing method, electronic equipment and storage medium
CN112541402A (en) * 2020-11-20 2021-03-23 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN112487978A (en) * 2020-11-30 2021-03-12 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN114594892A (en) * 2022-01-29 2022-06-07 深圳壹秘科技有限公司 Remote interaction method, remote interaction device and computer storage medium
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium
CN114598819A (en) * 2022-03-16 2022-06-07 维沃移动通信有限公司 Video recording method and device and electronic equipment
CN116866509A (en) * 2023-07-10 2023-10-10 深圳市创载网络科技有限公司 Conference scene picture tracking method, device and storage medium
CN116866509B (en) * 2023-07-10 2024-02-23 深圳市创载网络科技有限公司 Conference scene picture tracking method, device and storage medium
CN116781856A (en) * 2023-07-12 2023-09-19 深圳市艾姆诗电商股份有限公司 Audio-visual conversion control method, system and storage medium based on deep learning

Similar Documents

Publication Publication Date Title
CN111651632A (en) Method and device for outputting voice and video of speaker in video conference
CN109754811B (en) Sound source tracking method, device, equipment and storage medium based on biological characteristics
CN102655585B (en) Video conference system and time delay testing method, device and system thereof
WO2019184650A1 (en) Subtitle generation method and terminal
US11076127B1 (en) System and method for automatically framing conversations in a meeting or a video conference
CN109819316B (en) Method and device for processing face sticker in video, storage medium and electronic equipment
CN112004046A (en) Image processing method and device based on video conference
CN110673811B (en) Panoramic picture display method and device based on sound information positioning and storage medium
US10468029B2 (en) Communication terminal, communication method, and computer program product
WO2015139562A1 (en) Method for implementing video conference, synthesis device, and system
CN107733874B (en) Information processing method, information processing device, computer equipment and storage medium
CN114531564A (en) Processing method and electronic equipment
CN111918127A (en) Video clipping method and device, computer readable storage medium and camera
CN108320331B (en) Method and equipment for generating augmented reality video information of user scene
CN112601099A (en) Live image processing method and device, storage medium and electronic equipment
CN104780341B (en) A kind of information processing method and information processing unit
CN113794814B (en) Method, device and storage medium for controlling video image output
CN111034187A (en) Dynamic image generation method and device, movable platform and storage medium
US10282633B2 (en) Cross-asset media analysis and processing
CN115514989A (en) Data transmission method, system and storage medium
US20200184973A1 (en) Transcription of communications
CN113784058A (en) Image generation method and device, storage medium and electronic equipment
CN114125365A (en) Video conference method, device and readable storage medium
CN112887653A (en) Information processing method and information processing device
CN111614928A (en) Positioning method, terminal device and conference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230111

Address after: 518000 Yingfei Haocheng Science Park, Guansheng 5th Road, Luhu Community, Guanhu Street, Longhua District, Shenzhen, Guangdong 1515

Applicant after: Shenzhen Infineon Information Co.,Ltd.

Address before: 518110 Room 301, Infineon Technology Co., Ltd., No. 12, Guanbao Road, Luhu community, Guanhu street, Longhua District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN INFINOVA INTELLIGENT TECHNOLOGY Co.,Ltd.
