CN115150580A - Information processing method, electronic equipment, system and medium

Info

Publication number: CN115150580A
Application number: CN202210924335.7A
Authority: CN (China)
Prior art keywords: information, audio, video, user, video information
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 李勤
Current Assignee: Guangzhou Maile Information Technology Co., Ltd.
Original Assignee: Guangzhou Maile Information Technology Co., Ltd.
Application filed by Guangzhou Maile Information Technology Co., Ltd.

Classifications

    • H04N 7/15 Conference systems (under H04N 7/14 Systems for two-way working; H04N 7/00 Television systems)
    • G06F 9/451 Execution arrangements for user interfaces
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/161 Human faces: Detection; Localisation; Normalisation
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G10L 25/57 Speech or voice analysis specially adapted for processing of video signals
    • H04S 7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space

Abstract

The invention discloses an information processing method, an electronic device, a system and a medium. The method comprises the following steps: acquiring audio and video information of at least one user; determining the state information of the corresponding user according to the audio and video information; determining information to be processed based on target parameters, wherein the audio and video information it contains is associated audio information and video information; encoding the information to be processed to obtain encoded information, wherein the audio and video information of each user in the information to be processed is encoded independently; and transmitting the encoded information. With this method, the state information of the corresponding user can be determined from the acquired audio and video information, so that the information to be processed is accurately selected from the audio and video information based on the target parameters containing the state information. This provides a signal basis for the receiving end, allows the sound image position to be determined flexibly based on the information to be processed, and ensures the spatial effect of the audio.

Description

Information processing method, electronic equipment, system and medium
Technical Field
The present invention relates to the field of video conferencing technologies, and in particular, to an information processing method, an electronic device, a system, and a medium.
Background
With the accelerating pace of work, video conferencing applications have emerged, allowing users to communicate online by audio and video. The human auditory system has strong spatial perception and analysis capability: it can perceive characteristics of sound such as timbre, as well as the direction the sound comes from.
In current video conference systems, all sounds (including the sounds of multiple persons in the same conference room, or the sounds of multiple conference rooms) are generally mixed together, then decoded and played back at the far end after signal processing, encoding compression and network transmission. The far-end participants therefore cannot fully perceive the spatial effect of the audio, so there is a large gap between the video conference experience and a face-to-face offline conference.
Disclosure of Invention
The invention provides an information processing method, an electronic device, a system and a medium, which determine a sound image position based on the corresponding image position, ensuring the spatial effect of audio and providing an immersive conference experience for the user.
According to an aspect of the present invention, there is provided an information processing method including:
acquiring audio and video information of at least one user, wherein the audio and video information comprises audio information and video information;
determining state information of the corresponding user according to the audio and video information, wherein the state information comprises indication information indicating whether the corresponding user is speaking;
determining information to be processed based on a target parameter, wherein the information to be processed comprises audio and video information selected from the audio and video information, the target parameter comprises the state information, and the included audio and video information is the associated audio information and video information;
coding the information to be processed to obtain coded information, wherein the audio and video information of each user in the information to be processed is coded independently;
and transmitting the coded information, wherein the coded information is used for determining the sound image position of the audio information.
According to another aspect of the present invention, there is provided an information processing method including:
acquiring and decoding encoded information to obtain information to be processed, wherein the information to be processed comprises selected and obtained audio and video information, the audio and video information comprises audio information and video information, and the encoded information is obtained based on any one of the methods in the first aspect;
determining the image position of an image corresponding to the video information;
determining the sound image position of a virtual sound image based on the image position, wherein the virtual sound image is the virtual sound image of the audio information corresponding to the video information;
rendering and displaying an image corresponding to the video information based on the image position;
generating a multi-channel signal corresponding to the audio information based on the sound image position and the audio information;
and playing the multi-channel signal.
According to another aspect of the present invention, there is provided an information processing apparatus comprising:
the first acquisition module is used for acquiring audio and video information of at least one user, wherein the audio and video information comprises audio information and video information;
the first determining module is used for determining the state information of the corresponding user according to the audio and video information, wherein the state information comprises indicating information indicating whether the corresponding user is speaking;
the second determining module is used for determining information to be processed based on a target parameter, wherein the information to be processed comprises audio and video information selected from the audio and video information, the target parameter comprises the state information, and the included audio and video information is the associated audio information and video information;
the encoding module is used for encoding the information to be processed to obtain encoded information, and the audio and video information of each user in the information to be processed is encoded independently;
and the transmission module is used for transmitting the coded information, and the coded information is used for determining the sound image position of the audio information.
According to another aspect of the present invention, there is provided an information processing apparatus comprising:
the second acquisition module is used for acquiring and decoding the coded information to obtain information to be processed, wherein the information to be processed comprises selected audio and video information, and the audio and video information comprises audio information and video information;
the third determining module is used for determining the image position of the image corresponding to the video information;
a fourth determining module, configured to determine, based on the image position, a sound image position of a virtual sound image, where the virtual sound image is a virtual sound image of audio information corresponding to the video information; generating a multi-channel signal corresponding to the audio information based on the sound image position and the audio information;
the display module is used for rendering and displaying the image corresponding to the video information based on the image position;
and the playing module is used for playing the multi-channel signal.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the information processing method according to any of the embodiments of the present invention.
According to another aspect of the present invention, a conference system is provided, which includes an electronic device, an acquisition device, and an output device, where the electronic device executes the information processing method provided by the embodiments of the present invention and is applied in a conference room scenario.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the information processing method according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, audio and video information of at least one user is obtained, wherein the audio and video information comprises audio information and video information; the state information of the corresponding user is determined according to the audio and video information, wherein the state information comprises indication information indicating whether the corresponding user is speaking; information to be processed is determined based on a target parameter, wherein the information to be processed comprises audio and video information selected from the audio and video information, the target parameter comprises the state information, and the included audio and video information is associated audio information and video information; the information to be processed is encoded to obtain encoded information, wherein the audio and video information of each user in the information to be processed is encoded independently; and the encoded information is transmitted and used for determining the sound image position of the audio information. By utilizing this technical scheme, the state information of the corresponding user can be determined according to the acquired audio and video information, so that the information to be processed is accurately selected from the audio and video information based on the target parameter containing the state information; this provides a signal basis for the receiving end, allows the sound image position to be determined flexibly based on the information to be processed, and ensures the spatial effect of the audio.
According to the information processing method provided by the embodiment of the invention, the sound image position of the virtual sound image is determined based on the image position, a multi-channel audio signal is generated accordingly, and the virtual sound image is produced by playing the signal through the loudspeaker array, so that the image position is consistent with the sound image position, providing an immersive conference experience for users.
According to the conference system provided by the embodiment of the invention, the consistency of the image position and the sound-image position is realized through the electronic equipment provided by the embodiment of the invention, and the immersive conference experience is provided for the user.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a flowchart of an information processing method according to an embodiment of the present invention;
fig. 1b is a schematic structural diagram of a transmitting apparatus according to an embodiment of the present invention;
fig. 1c is a schematic structural diagram of a receiving apparatus according to an embodiment of the present invention;
fig. 1d is a schematic view of a conference room according to an embodiment of the present invention;
fig. 1e is a schematic view of a topology structure of a conference system according to an embodiment of the present invention;
fig. 1f is a schematic view of a topology structure of another conference system according to an embodiment of the present invention;
FIG. 2 is a flowchart of an information processing method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image segmentation module according to a second embodiment of the present invention;
fig. 4 is a schematic diagram of a multi-modal endpoint detection module according to a second embodiment of the present invention;
fig. 5 is a schematic diagram of a window arrangement manner according to a second embodiment of the present invention;
fig. 6 is a schematic diagram of a window arrangement manner according to a second embodiment of the present invention;
fig. 7 is a schematic diagram illustrating a window arrangement manner according to a second embodiment of the present invention;
fig. 8 is a schematic diagram of a speaker array according to a second embodiment of the present invention;
fig. 9 is a schematic structural diagram of an information processing apparatus according to a third embodiment of the present invention;
FIG. 10 is a block diagram of an information processing apparatus according to a third embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device that implements an information processing method of an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a conference system according to a sixth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1a is a flowchart of an information processing method according to an embodiment of the present invention. This embodiment is applicable to processing audio/video sent from a meeting room, specifically the audio/video of one or more users in the meeting room, so as to ensure that the audio has a spatial effect when presented at the receiving end. The method may be performed by an information processing apparatus, which may be implemented in the form of hardware and/or software and may be configured in an electronic device. The electronic device may be a conference terminal in a conference room scenario.
It is understood that the human auditory system has strong spatial perception and analysis capability, and can perceive the direction of sound in addition to characteristics such as loudness, timbre and pitch. Although a person has only two ears, the auditory system can perceive the orientation of a sound source in three-dimensional space through the reflection, scattering and occlusion effects of the various reflective surfaces in the space, including the environment and the person's own head and torso. Because of this spatial perception ability, a person can concentrate on the voice of one speaker in a noisy environment while ignoring the voices of others or the ambient noise, the so-called "cocktail party effect".
A person's spatial localization of sound is mainly based on the following cues: the time and phase differences of sound arriving at the two ears caused by distance; the difference in sound intensity at the two ears caused by the shadowing effect of the head; spectral differences caused by reflection and scattering of sound in different directions by the head, pinna and torso; and dynamic cues from involuntary small head rotations, which help distinguish front-back mirror images and vertical orientation.
In current video conference systems, all sounds (including the sounds of multiple persons in the same conference room, or the sounds of multiple conference rooms) are generally mixed together, then decoded and played back at the far end after signal processing, encoding compression and network transmission. The far-end participants cannot fully perceive the spatial effect of the audio, so there is a large gap between the remote conference experience and the face-to-face offline conference experience.
The information processing method provided by the embodiment of the present invention can be used in a communication scenario between a plurality of (two or more) conference rooms, where each conference room has one or more participants and includes a sending device and a receiving device. The sending device is configured to collect the audio and video information in its conference room and send the encoded information to the other conference rooms; the receiving device is used to receive and decode the encoded information from the other conference rooms, display the images corresponding to the video information, and play the multi-channel signals corresponding to the audio information.
Fig. 1b is a schematic structural diagram of a transmitting apparatus according to an embodiment of the present invention, and as shown in fig. 1b, the transmitting apparatus includes the following units:
and the audio acquisition unit is used for acquiring the sound of each individual participant. Fig. 3 may be one scheme, in which each participant is presented with a near-field directional pickup microphone (M1-M6) to pick up only the sound of the corresponding individual participant.
And the video acquisition unit is used for acquiring the panoramic picture of the conference room or the video information of the individual participants.
The transmitting processing unit is used for segmenting and extracting the image of each participant and associating the extracted images with the sounds in one-to-one correspondence; it is also used for channel selection of the audio and video streams, encoding of the audio and video, and so on.
Fig. 1c is a schematic structural diagram of a receiving apparatus according to an embodiment of the present invention, and as shown in fig. 1c, the receiving apparatus includes the following units:
and the receiving and processing unit is used for carrying out audio and video decoding, video window layout control and image position determination according to the image position, and generating a multi-channel spatial audio signal through a spatial audio algorithm (such as a VBAP (visual basic application) based algorithm).
And the image rendering unit is used for displaying the images of each person at the far end on the screen according to the set arrangement and position.
And the sound rendering unit is used for playing the multi-channel audio corresponding to the virtual sound image through the multi-channel DAC and the loudspeaker array.
Fig. 1d is a scene schematic diagram of a conference room according to an embodiment of the present invention. As shown in fig. 1d, there are 6 participants in the conference room, that is, 6 users (P1-P6 in fig. 1d), and a corresponding audio acquisition unit (M1-M6) is arranged in front of each user to acquire that user's audio information. A video acquisition unit may be arranged on the display screen to acquire the users' video information, and the conference room is provided with a display screen and a loudspeaker array to respectively display the images corresponding to the video information of the other conference rooms and play the multi-channel signals corresponding to the audio information.
Fig. 1e is a schematic diagram of a topology structure of a conference system according to an embodiment of the present invention. The conference system may be used across two or more conference rooms, and the setup of each conference room may be identical; the audio and video streams between the conference rooms can be forwarded by a Selective Forwarding Unit (SFU), which selectively forwards the streams to the other conference rooms according to the speaking state of the participants.
Fig. 1f is a schematic view of a topology structure of another conference system according to the first embodiment of the present invention, and as shown in fig. 1f, when a conference is performed between two conference rooms, the conference can be simplified to direct connection, that is, audio and video streams are directly transmitted from point to point.
It should be noted that the conference room in which the participant is located may be referred to as a "local conference room", and the participant in the local conference room is referred to as a "local participant"; while the other conference rooms are referred to as "far-end conference rooms" and the participants in the other conference rooms are referred to as "far-end participants".
In order to solve the above technical problem, an embodiment of the present invention provides an information processing method that collects, segments, encodes and transmits the sound and image of each participant (i.e., user) in a conference room, so that when the image and sound are played back, a virtual sound image algorithm can make the sound come from the position where the image is located; that is, the image position and the sound image position coincide. Since the position at which an image is displayed on the screen (i.e., the image position) is a position without a physical sound source, it can be treated as a virtual sound image: a multi-channel spatial audio signal is generated by a spatial audio algorithm and played through a speaker array surrounding the screen to create the virtual sound image. As shown in fig. 1a, the method comprises:
s110, audio and video information of at least one user is obtained, wherein the audio and video information comprises audio information and video information.
The audio and video information may refer to audio information and video information of a user in a video conference process. Audio information may be understood as information determined on the basis of audio, and video information may be understood as information determined on the basis of video. For example, the audio information may include the speech content of the user during the video conference, the video information may include the image information of the user, and the like. In this embodiment, the users may refer to participants participating in the video conference, and the number of the users is not limited, and may be one or more. The user may be located in a conference room.
The manner of acquiring the audio information and the video information is not limited; for example, the audio information and/or the video information of at least one user may be acquired simultaneously, the audio information and the video information of a user may be acquired separately, or the audio information and the video information of each user may be acquired individually.
For the audio information, the audio information of each user can be acquired separately through a near-field directional pickup microphone in front of that user. Alternatively, multiple users can share one microphone array: the audio information of all users is first acquired by the microphone array, and each beam of a multi-beam forming algorithm then tracks and extracts the audio information of its corresponding user, yielding per-user audio information. The number and arrangement of the microphones in the array may be set based on the actual meeting room conditions (such as the size of the meeting room), and this embodiment does not limit the specific implementation of the multi-beam forming algorithm.
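As an illustration of the shared-array option, the following Python sketch steers one delay-and-sum beam per participant seat. It is a minimal sketch under assumed parameters (array geometry, sample rate, seat azimuths, nearest-sample delays); the embodiment itself does not limit the beamforming algorithm.

```python
import numpy as np

C = 343.0           # speed of sound in air, m/s
FS = 16000          # sample rate, Hz (assumed)
MIC_SPACING = 0.05  # uniform linear array spacing, m (assumed)

def delay_and_sum(mic_signals: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Steer one beam toward azimuth_deg; mic_signals has shape (n_mics, n_samples)."""
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for i in range(n_mics):
        # Far-field arrival delay at mic i for this azimuth, rounded to whole
        # samples; np.roll's wrap-around is negligible for a sketch.
        tau = i * MIC_SPACING * np.sin(np.radians(azimuth_deg)) / C
        out += np.roll(mic_signals[i], -int(round(tau * FS)))
    return out / n_mics

# One beam per participant seat: the beam output is that user's audio stream.
SEAT_AZIMUTHS = {"P1": -50.0, "P2": -30.0, "P3": -10.0,
                 "P4": 10.0, "P5": 30.0, "P6": 50.0}

def per_user_audio(mic_signals: np.ndarray) -> dict:
    return {user: delay_and_sum(mic_signals, az)
            for user, az in SEAT_AZIMUTHS.items()}
```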
For video information, the video information of all users can be acquired through a video acquisition unit (such as a camera); the video information of a specific user can also be acquired through a plurality of video acquisition units, all the video acquisition units can cover all the users, for example, two video acquisition units are included, one video acquisition unit acquires the video information of a part of users, and the other video acquisition unit acquires the video information of the rest of users; the video information of all users can be acquired simultaneously through the plurality of video acquisition units, so long as all users can be covered and the definition of the video information of all users is ensured.
And S120, determining the state information of the corresponding user according to the audio and video information, wherein the state information comprises indication information indicating whether the corresponding user is speaking.
The state information may be used to characterize the state of the user, such as whether speaking, audio intensity, audio size, meeting order, and/or duration of speaking.
The status information may include indication information indicating whether the corresponding user is speaking, and the type of the indication information is not limited, for example, the indication information may be text information or symbol information. For example, when the corresponding user is speaking, the status information may include indication information 0 indicating that the corresponding user is speaking; when the corresponding user does not speak, the status information may include indication information 1 indicating that the corresponding user does not speak.
The audio intensity may be the intensity of the audio, the audio size may be the volume of the audio, and the meeting order may characterize the order in which the corresponding users joined the meeting: for example, the user's joining order among all local participants in the local conference room, or among the participants of all the conference rooms involved. The joining order may be determined based on the content captured by the audio acquisition unit and/or the video acquisition unit corresponding to the user.
Specifically, the state information of the corresponding user may be determined according to the audio-video information of at least one user, and the specific determination method is not limited, for example, the state information of the user may be determined based on only the audio information of the user.
S130, determining information to be processed based on a target parameter, wherein the information to be processed comprises audio and video information selected from the audio and video information, the target parameter comprises the state information, and the included audio and video information is the associated audio information and video information.
The target parameter may be considered a parameter used to determine the information to be processed; for example, it may include the state information, or a preset parameter such as user identification information. The information to be processed may be understood as the information that is to be transmitted to the receiving end. Since not all of the users' audio and video information needs to be processed and transmitted, the information to be processed must be selected from all of the audio and video information; it therefore comprises audio and video information selected from the whole, and the selected audio information and video information are associated, i.e., they belong to the same user.
In an embodiment, this step may determine the information to be processed based on the target parameter; this embodiment does not limit the determination process. Illustratively, according to the state information of a user, such as whether the user is speaking, the audio and video information of speaking users may be selected as the information to be processed; corresponding audio and video information may also be selected according to preset parameters; and the two methods may be combined to determine the final information to be processed.
In addition, the number of audio/video information contained in the information to be processed needs to be determined according to actual conditions, such as the maximum window number of the receiving end and/or the number of conference rooms. The maximum window number may be considered as the maximum number of windows displayed on the display screen of the receiving end, and the number of conference rooms may be considered as the number of all conference rooms participating in the current conference. The receiving end can be regarded as a client end for receiving the information to be processed after the local end codes. Such as the clients of the remaining conference rooms.
In one embodiment, the receiving end can determine the number of video information that the receiving end needs to display according to its performance (such as bandwidth), usage scenario, and the like.
Meanwhile, the receiving end can determine how many participants' video information is available at the far end (i.e., the transmitting end); the participants may come from the same conference room or from different conference rooms. The receiving end can determine the video information to be displayed according to a set strategy and send a display request to the SFU server, and the SFU server can then combine the display requests of all receiving ends and send the combined request to the corresponding sending end, so that the sending end transmits the corresponding audio and video information. The set strategy may, for example, determine the selection priority entirely by speaking time; or always select, for each conference room, the audio and video information of the top n users with the longest speaking time, where n is a positive integer; or let an administrator manually specify which audio and video information to select.
Correspondingly, when the sending end receives the display request, the corresponding audio and video information (namely the information to be processed) can be selected and sent. After receiving the information to be processed from the sending end, the SFU server may forward the information to the corresponding receiving end, thereby implementing the determination of the information to be processed.
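The request-merging step at the SFU can be pictured with a small sketch. The message shapes used here (a receiver id and (room, user) stream identifiers) are assumptions for illustration; the embodiment does not fix a message format.

```python
from collections import defaultdict

def merge_display_requests(requests):
    """requests: list of (receiver_id, wanted), where wanted is a list of
    (sender_room, user_id) pairs naming the streams the receiver wants shown."""
    per_sender = defaultdict(set)
    for _receiver, wanted in requests:
        for sender_room, user_id in wanted:
            per_sender[sender_room].add(user_id)  # union over all receivers
    # Each sending room is asked once for the union of the requested users.
    return {room: sorted(users) for room, users in per_sender.items()}

merged = merge_display_requests([
    ("roomB", [("roomA", "P1"), ("roomA", "P3")]),
    ("roomC", [("roomA", "P1"), ("roomA", "P2")]),
])
# merged == {"roomA": ["P1", "P2", "P3"]}
```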
And S140, encoding the information to be processed to obtain encoded information.
After determining the information to be processed, the content in the information to be processed may be encoded to obtain encoded information. The specific process of encoding is not limited here, for example, encoding may be performed separately according to users, and for each user, the specific process of encoding may also be differentiated according to different contents in the information to be processed.
And S150, transmitting the coded information.
In this step, the obtained encoded information may be transmitted to be used for subsequently determining the sound image position of the audio information, and the transmission method is not limited, and the encoded information may be directly transmitted to the receiving end, or may be forwarded to the receiving end through the media forwarding server, as long as the encoded information can be transmitted to the receiving end.
The information processing method provided by the embodiment acquires audio and video information of at least one user, wherein the audio and video information comprises audio information and video information; according to the audio and video information, determining the state information of the corresponding user, wherein the state information comprises indication information indicating whether the corresponding user is speaking; determining information to be processed based on a target parameter, wherein the information to be processed comprises audio and video information selected from the audio and video information, the target parameter comprises the state information, and the included audio and video information is the associated audio information and video information; coding the information to be processed to obtain coded information; and transmitting the coded information. According to the method, the state information of the corresponding user can be determined according to the acquired audio and video information, so that the information to be processed is accurately selected from the audio and video information based on the target parameters containing the state information, a signal basis is provided for a receiving end, the position of a sound image can be flexibly determined based on the information to be processed, and the spatial effect of the audio is ensured.
In one embodiment, the determining the state information of the corresponding user according to the audio/video information includes:
and determining corresponding state information according to the audio information and/or the video information.
In this embodiment, the status information of the user may be determined based on only the audio information of the user, may be determined based on only the video information of the user, or may be determined based on both the audio information and the video information. For example, when determining the status information of the user based on the audio information of the user, it may be determined whether the user is speaking based on the content of the audio information of the user, thereby obtaining the status information of the user.
In one embodiment, determining the state information of the corresponding user according to the audio/video information includes:
based on the video information, segmenting the image corresponding to the video information to obtain the image information of each user;
associating audio information and image information of the same user;
for the audio information and the image information which are associated with each user, determining an audio detection result based on the audio information, and determining an image detection result based on the image information;
and determining corresponding state information based on the audio detection result and the image detection result corresponding to each user.
The image information may refer to information of each image in the video information, and the image information includes a face of the user; the audio detection result may be regarded as a result of detecting the audio information, and the image detection result may be regarded as a result of detecting the image information.
In this step, the video information may first be segmented to obtain the image information of each user; for example, the video information may be segmented according to each user's face. The audio information of the same user is then associated with the corresponding image information, giving associated audio information and image information for each user. The association process may, for example, find the image information corresponding to each piece of audio information among the audio information of multiple users and associate the two; one way of finding the image information is to find the image whose mouth shape corresponds to the speech content in the audio information. When the acquired video information contains the video information of multiple users, the image of each user needs to be segmented separately to obtain per-user image information; the specific segmentation steps are not expanded upon here.
In one embodiment, the association process may instead determine, for each piece of audio information, the azimuth angle of the corresponding user; determine, for each piece of image information, the azimuth angle of the corresponding user; and associate the image information and audio information belonging to the same user by matching the azimuth angles.
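A minimal sketch of this azimuth-matching association, assuming each audio beam and each segmented face image carries an estimated azimuth in degrees and that a nearest-azimuth match within a tolerance is acceptable:

```python
def associate_by_azimuth(audio_azimuths: dict, image_azimuths: dict,
                         tolerance: float = 10.0) -> list:
    """audio_azimuths / image_azimuths: stream id -> estimated azimuth (deg).
    Returns (audio_id, image_id) pairs matched to the nearest azimuth."""
    pairs = []
    for a_id, a_az in audio_azimuths.items():
        img_id, img_az = min(image_azimuths.items(),
                             key=lambda kv: abs(kv[1] - a_az))
        if abs(img_az - a_az) <= tolerance:  # reject implausible matches
            pairs.append((a_id, img_id))
    return pairs
```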
Then, for the audio information and the image information associated with each user, an audio detection result can be determined based on the audio information, and an image detection result based on the image information; finally, the corresponding state information is determined by combining the audio detection result and the image detection result of each user.
The method for determining the audio detection result and the image detection result is not limited; for example, whether the user is speaking in the audio information may be detected by a Voice Activity Detection (VAD) algorithm, and whether the user is speaking in the image information may be detected by a lip motion detection algorithm, so that the state information can be determined by combining the two results.
For example, when the audio detection result indicates that the user is speaking and the image detection result also indicates that the user is speaking, the indication information included in the state information is determined to be information indicating that the user is speaking.
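A minimal sketch of such a fusion rule, marking a user as speaking only when both detectors agree; the hangover counter that smooths brief pauses is an added assumption, not part of the text above:

```python
def fuse_speaking_state(vad_active: bool, lips_moving: bool,
                        state: dict, hangover_frames: int = 15) -> bool:
    """Return the fused speaking flag for one analysis frame.

    Usage: keep one state dict per user and call once per frame, e.g.
    state = {}; speaking = fuse_speaking_state(vad, lips, state)
    """
    if vad_active and lips_moving:
        state["hang"] = hangover_frames  # both detectors agree: speaking
        return True
    state["hang"] = max(0, state.get("hang", 0) - 1)
    return state["hang"] > 0             # hold briefly across short pauses
```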
In one embodiment, the determining information to be processed based on the target parameter includes:
for the state information of each user, accumulating the state information and determining the corresponding speaking duration, wherein the state information further comprises the speaking duration;
and selecting information to be processed from the audio and video information of the at least one user based on the determined speaking duration.
In one embodiment, the state information may include a speaking duration. Accordingly, for each user's state information, the user's speaking time within a certain period may be accumulated to determine the corresponding speaking duration. After the speaking durations are obtained, they can be sorted by size, and the audio and video information whose speaking duration exceeds a set threshold, or is the longest, is selected from the sorted result as the information to be processed. Alternatively, a set number of audio and video information items may be selected in descending order of speaking duration, where the set number can depend on the maximum number of windows the receiving end's image rendering can display and on the number of conference rooms.
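A sketch of this accumulation and selection, assuming the state information arrives as per-frame speaking flags and that each flag represents a fixed frame duration:

```python
FRAME_SEC = 0.02  # assumed duration represented by one state-information sample

def accumulate_speaking(durations: dict, states: dict) -> None:
    """states: user -> bool (speaking this frame); updates durations in place."""
    for user, speaking in states.items():
        if speaking:
            durations[user] = durations.get(user, 0.0) + FRAME_SEC

def select_to_process(durations: dict, max_streams: int) -> list:
    """Pick up to max_streams users, longest speaking time first; max_streams
    would come from the receiver's maximum window count and the room count."""
    ranked = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _t in ranked[:max_streams]]
```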
In one embodiment, the target parameter further includes user identification information, and accordingly, determining information to be processed based on the target parameter includes:
determining the audio and video information corresponding to the user identification information as information to be processed;
and selecting information to be processed from the audio and video information except the audio and video information corresponding to the user identification information.
The user identification information can uniquely identify the corresponding user, can be specified by the receiving end, and can also be determined based on conference setting in a conference room of the local end. If a user is set as a speaker in the local conference room, the user identification information of the user is contained in the target parameter, and the user identification information can be used for uniquely associating the audio and video information corresponding to the user. The specific manner of the receiver designation is not limited herein.
In one embodiment, the target parameter may include user identification information. Accordingly, the audio and video information corresponding to the user identification information contained in the target parameter may first be directly determined as information to be processed, or that audio and video information may be examined to judge whether it should be used as information to be processed. Then, further information to be processed is selected from the remaining audio and video information; the selection condition is not limited, for example, suitable audio and video information can be selected as information to be processed based on the state information. The condition for judging whether the audio and video information corresponding to the user identification information is used as information to be processed may be less strict than the condition for selecting information to be processed from the remaining audio and video information.
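A sketch of selection with such pinned user identification information, where pinned users (e.g. a designated speaker) are taken first and any remaining slots are filled by speaking duration; the ordering details are assumptions:

```python
def select_with_pins(pinned_ids: list, durations: dict, max_streams: int) -> list:
    """pinned_ids: users fixed by user identification information; durations:
    user -> accumulated speaking time used to fill the remaining slots."""
    chosen = [u for u in pinned_ids if u in durations][:max_streams]
    rest = sorted((u for u in durations if u not in chosen),
                  key=lambda u: durations[u], reverse=True)
    return chosen + rest[: max_streams - len(chosen)]
```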
In one embodiment, the information to be processed further includes state information corresponding to the selected audio/video information.
In this embodiment, the information to be processed may further include state information corresponding to the selected audio/video information, so that the receiving end determines the position of the subsequent image.
In one embodiment, the encoding the information to be processed to obtain encoded information includes:
for the audio and video information and the state information of each user in the information to be processed, encoding the audio and video information and the state information together to obtain the encoded information of the user; or,
and aiming at the audio and video information of each user in the information to be processed, encoding the audio and video information to obtain the encoded information of the user.
In this embodiment, the audio and video information and the state information of each user in the information to be processed may be encoded together to obtain the encoded information of the corresponding user; alternatively, for each user in the information to be processed, the corresponding audio and video information alone can be encoded to obtain the encoded information of that user.
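A sketch of the per-user independent encoding, covering both variants (with and without state information). The encode_audio/encode_video helpers are hypothetical placeholders standing in for real codecs such as an Opus or H.264 encoder:

```python
def encode_audio(frames):  # placeholder: a real system would use e.g. Opus
    return bytes(frames)

def encode_video(frames):  # placeholder: a real system would use e.g. H.264
    return bytes(frames)

def encode_to_process(to_process: list, include_state: bool = True) -> list:
    """to_process: one dict per selected user with 'user', 'audio', 'video'
    and 'state' entries; each user's streams are encoded independently."""
    packets = []
    for item in to_process:
        packet = {
            "user": item["user"],
            "audio": encode_audio(item["audio"]),  # independent per-user audio
            "video": encode_video(item["video"]),  # independent per-user video
        }
        if include_state:  # first variant above also carries state information
            packet["state"] = item["state"]
        packets.append(packet)
    return packets
```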
Example two
Fig. 2 is a flowchart of an information processing method according to a second embodiment of the present invention. The method is applicable to processing audio and video received from one or more conference rooms, and may be executed by an information processing apparatus, which may be implemented in the form of hardware and/or software and configured in an electronic device; the electronic device may be the conference terminal at the receiving end. As shown in fig. 2, the method includes:
s210, obtaining and decoding the coded information to obtain information to be processed, wherein the information to be processed comprises selected and obtained audio and video information, and the audio and video information comprises audio information and video information.
The encoded information may be obtained based on the method described in any of the above embodiments, and it may be obtained directly from the sending end or via the media forwarding server.
The encoded information may be at user granularity, with each user corresponding to one piece of encoded information, or at conference room granularity, with each conference room corresponding to one piece of encoded information. When one conference room includes multiple participants, the information to be processed corresponding to each user can be obtained after the encoded information is decoded.
S220, determining the image position of the image corresponding to the video information.
The image position may be understood as the position at which the image is displayed on the receiving-end display screen. After the receiving end decodes the information to be processed, it can determine the image position of the image corresponding to each piece of video information; for example, when the audio and video information of multiple users is obtained, the arrangement order of the images corresponding to their video information is determined.
Therefore, after the encoded information of all users is decoded to obtain the audio and video information of each user, the images corresponding to all users' video information can be displayed directly; or suitable audio and video information can be selected from that of all users and the images corresponding to the selected information displayed. The selection rule can be set by the relevant personnel and is not further limited here.
Then, the image position of the image corresponding to the video information needs to be determined. When there is only one piece of video information, a default display position may be used as the image position of the corresponding image; the default display position is not limited and may be set based on the actual situation, and it may be adjusted dynamically or manually.
When the number of the video information is at least two, the image position of the corresponding video information needs to be determined according to the video arrangement sequence and the window arrangement information. In one embodiment, first, a video arrangement sequence of each piece of video information may be determined based on each piece of audio/video information in the to-be-processed information, and then each piece of video information in the video arrangement sequence is matched with position information of each window in the window arrangement information to obtain an image position of an image corresponding to the video information.
And S230, determining the sound image position of a virtual sound image based on the image position, wherein the virtual sound image is the virtual sound image of the audio information corresponding to the video information.
The virtual sound image can be regarded as the sound source position of the audio information corresponding to the video information, and the sound image position is used for representing the sound source position perceived by the user. After the image position is determined, the sound image position of the virtual sound image of the audio information corresponding to each piece of video information may be determined, for example, the image position may be determined as the sound image position of the corresponding virtual sound image, or any position of the window where the image is located may be used as the sound image position.
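A sketch combining these two steps: the video arrangement order is mapped onto a preset window grid (the grid size is an assumption), and each window's center is taken as the virtual sound image anchor, one natural reading of "any position of the window":

```python
def layout_windows(ordered_videos: list, screen_w: float, screen_h: float,
                   cols: int, rows: int) -> dict:
    """Map the video arrangement order onto a cols x rows window grid.
    Returns video id -> ((x, y, w, h) window rect, window-center point);
    the window center doubles as the virtual sound image position here."""
    cell_w, cell_h = screen_w / cols, screen_h / rows
    placement = {}
    for idx, vid in enumerate(ordered_videos[: cols * rows]):
        r, c = divmod(idx, cols)
        rect = (c * cell_w, r * cell_h, cell_w, cell_h)
        center = (c * cell_w + cell_w / 2, r * cell_h + cell_h / 2)
        placement[vid] = (rect, center)
    return placement
```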
And S240, rendering and displaying the image corresponding to the video information based on the image position.
After the image position of the image corresponding to the video information is determined in the above steps, the image can be rendered based on the image position and the rendered image displayed. The rendering process is not elaborated here and can be implemented by the relevant personnel according to the actual situation.
And S250, generating a multi-channel signal corresponding to the audio information based on the sound image position and the audio information.
The multi-channel signal may refer to a multi-channel signal obtained by processing audio information through an algorithm.
In one embodiment, the multi-channel signal corresponding to the audio information may be generated based on the audio information and the sound image position; for example, it may be generated according to the actual situation (such as the layout of the speaker array and the number of speakers) and the sound image position, so that when the multi-channel signal is played, a virtual sound image is generated at the sound image position.
It is believed that the human auditory system's perception of sound localization is derived primarily from three types of information: the time and phase differences of sound arriving at the two ears caused by distance; the difference in sound intensity at the two ears caused by the occlusion effect of the head; and the spectral differences caused by reflection and scattering of sound in different directions by the head, pinna and torso. Therefore, the present embodiment may generate a virtual sound image based on any one of these types of information or a combination of them; for example, Vector Base Amplitude Panning (VBAP) realizes positioning and moving of a sound image in the two-dimensional plane in front of the user, and virtual sound image generation and positioning can also be realized by the Ambisonics method. This embodiment does not limit the specific algorithm for virtual sound image generation. If the arrangement of the image windows is two-dimensional, the virtual sound image algorithm has to support position changes in two dimensions; in general, the dimensions supported by the virtual sound image algorithm should be consistent with the arrangement dimensions of the image windows. An image window can be understood as a window in which the receiving end displays an image; the video information is rendered into the image window.
This embodiment can adopt a spatial audio algorithm to generate the multi-channel audio signal, i.e., the multi-channel signal; when played through a speaker array, the multi-channel signal creates a virtual sound image at the image position.
The execution order of the steps of the present invention is not limited; for example, S250 may be executed before S240, and S240 and S260 may be performed synchronously.
And S260, playing the multi-channel signal.
After the multi-channel signal corresponding to the audio information is determined, the multi-channel signal can be played in this step, so that the image position and the sound image position are consistent. A specific playback arrangement may be, for example: a loudspeaker array of m units is arranged around the display screen of the receiving end; when a user in a certain window speaks, after the m-channel signal corresponding to the audio information is obtained through the above steps, the m loudspeaker units simultaneously play their respective channel signals, so that participants at the receiving end experience consistent image and sound image positions, where m is a positive integer. According to the information processing method provided by this embodiment of the present invention, the sound image position of the virtual sound image is determined based on the image position, a multi-channel audio signal is generated by a spatial audio algorithm and played through the loudspeaker array to create the virtual sound image, so that the image position and the sound image position are consistent, providing an immersive conference experience for users.
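For illustration only, a minimal playback sketch in Python, assuming the sounddevice package and an audio device that actually exposes m output channels; the helper name and the per-unit gain vector are assumptions, not part of the embodiment.

```python
import numpy as np
import sounddevice as sd  # PortAudio-based playback package

def play_virtual_source(mono, fs, gains):
    """Scale one decoded speech signal by each unit's panning gain and
    play the resulting (samples, m) block on the m-unit array, which
    reproduces a virtual sound image at the panned position."""
    multichannel = np.outer(mono, gains).astype(np.float32)
    sd.play(multichannel, fs, blocking=True)
```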
In one embodiment, the determining the image position of the image corresponding to the video information includes:
and determining the image position of the image corresponding to the video information based on the information to be processed and preset window arrangement information.
The window arrangement information may be regarded as preset window information, such as the window arrangement dimension, window size, window arrangement order, and/or window position information. The window arrangement dimension refers to the dimensionality in which windows are arranged, such as one or two dimensions; the window size is the size of each window; the window position information represents the position of each window; and the window arrangement order characterizes the order of the included windows.
In this step, the image position of the image corresponding to the video information may be determined based on the information to be processed and the preset window arrangement information. For example, the image position may be determined based on the state information corresponding to the selected audio and video information in the information to be processed together with the preset window arrangement information; or the priority of each piece of audio and video information in the information to be processed may be matched against the preset window arrangement information to obtain the image position of the corresponding image.
In an embodiment, the determining, based on the to-be-processed information and preset window arrangement information, an image position of an image corresponding to the video information further includes:
when the number of the audio and video information is at least two, determining the video arrangement sequence of the corresponding video information based on the state information;
and determining the image position of the corresponding video information based on preset window arrangement information and the video arrangement sequence, wherein the window arrangement information comprises the position information of the included window.
The video arrangement order can be regarded as the order in which the video information in the information to be processed is arranged. In this embodiment, the information to be processed may further include the state information corresponding to the selected audio and video information. Therefore, when there are at least two pieces of audio and video information, the video arrangement order of the corresponding video information may be determined based on the state information corresponding to each piece of audio and video information; the manner of determining the order may depend on the content included in each piece of state information. The video information corresponding to each user is then matched to a window based on the preset window arrangement information and the video arrangement order, and the corresponding image position is determined from the window in which the video information is placed.
In one embodiment, the determining, based on each of the status information, a video arrangement order of the corresponding video information includes:
sequencing each state information based on the indication information and the speaking duration included in each state information;
and determining the state arrangement sequence of the state information as the video arrangement sequence of the corresponding video information.
Specifically, the state information may first be sorted based on the indication information and the speaking duration included in each piece of state information, and the resulting state arrangement order is then used as the video arrangement order of the corresponding video information. For example, the sorting may place first the state information indicating that the user is currently speaking, and then sort all other state information by speaking duration from long to short; alternatively, all the state information may be sorted by speaking duration from long to short. This embodiment does not limit the sorting method.
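The first sorting rule above can be expressed compactly; in the following sketch the per-user record and field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class UserState:
    user_id: str
    speaking: bool        # the indication information
    speak_seconds: float  # the accumulated speaking duration

def video_arrangement_order(states):
    """First rule above: currently speaking users come first, everyone
    else follows by accumulated speaking duration, longest first."""
    return sorted(states, key=lambda s: (not s.speaking, -s.speak_seconds))
```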
In one embodiment, the determining, based on preset window arrangement information and the video arrangement order, an image position of corresponding video information includes:
according to the video arrangement sequence and the window arrangement information, sequentially associating the video information with the window corresponding to the window arrangement information;
and determining the window center position of the window corresponding to the window arrangement information as the image position of the associated video information.
After the video arrangement order of the video information is determined, each piece of video information can be associated, in that order, with the window corresponding to the window arrangement information; the window center position of the associated window is then determined as the image position of the video information, completing the determination of the image position corresponding to each piece of video information.
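A possible sketch of this association, assuming the window centres have been precomputed in pixels; the stream identifiers and layout are illustrative.

```python
def assign_image_positions(ordered_stream_ids, window_centers):
    """Associate each video stream, in its arrangement order, with the
    next preset window and use that window's centre as the stream's
    image position."""
    return {sid: window_centers[i]
            for i, sid in enumerate(ordered_stream_ids)}

# e.g. three remote streams laid out on a preset 2x2 window grid
positions = assign_image_positions(
    ["alice", "bob", "carol"],
    [(480, 270), (1440, 270), (480, 810), (1440, 810)])
```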
The following describes specific schemes of the above units:
1. An audio acquisition unit. The audio acquisition unit aims to acquire the sound of each individual participant (i.e., to acquire the audio information of at least one user); the specific type and number of audio acquisition units are not limited.
For example, in one embodiment, a near-field directional pickup microphone may be placed in front of each participant to collect that participant's individual sound; after passing through a multi-channel acquisition device (analog-to-digital converter, ADC), each participant's sound is output separately, that is, sent individually to the sending processing unit. The microphone may be a single microphone with physical directivity, or directional pickup may be achieved by a microphone array through a beamforming algorithm.
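As an assumed sketch of this capture path in Python, using the sounddevice package with one directional microphone per ADC channel (block length and sample rate are placeholders):

```python
import sounddevice as sd

def capture_block(n_participants, fs=16000, seconds=0.02):
    """Read one block from a multi-channel ADC with one directional
    microphone per participant; channel i then carries participant i's
    individual signal, ready to be sent on separately."""
    frames = sd.rec(int(seconds * fs), samplerate=fs,
                    channels=n_participants, blocking=True)
    return [frames[:, i] for i in range(n_participants)]
```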
In one embodiment, multiple participants can share one microphone array: the array collects the sound of all participants, and a multi-beam algorithm is then applied so that each beam tracks and picks up the sound of one participant while suppressing the sound of other participants outside the beam; the sound of each participant is then output, i.e., sent individually to the sending processing unit. The number of beams may be set according to the number of participants in the conference room.
This embodiment is not limited to a particular array geometry or beamforming algorithm. The number and layout of the microphone units can be designed according to the size of the conference room, the number of participants, the arrangement of the conference table, and so on; the key requirement is that the beam width corresponds to the spacing between participants, so that only one participant lies in each beam direction. For example, the microphone array may be a linear array or a circular array. The beamforming algorithm may also be implemented in various ways, including fixed beamforming, such as a super-directional design based on a White Noise Gain Constraint (WNGC), or adaptive beamforming, such as the Minimum Variance Distortionless Response (MVDR) algorithm and the Generalized Sidelobe Canceller (GSC).
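The embodiment leaves the beamformer design open; as one concrete possibility (simpler than the WNGC, MVDR, or GSC designs named above), a fixed delay-and-sum beamformer can be sketched as follows, where the geometry, sign convention, and function name are assumptions.

```python
import numpy as np

def delay_and_sum(frames, mic_positions, look_dir, fs, c=343.0):
    """Fixed delay-and-sum beamformer: time-align every microphone to a
    plane wave arriving from look_dir using frequency-domain fractional
    delays, then average across microphones.
    frames: (n_mics, n_samples); mic_positions: (n_mics, 3) in metres;
    look_dir: unit vector pointing from the array toward the talker."""
    n = frames.shape[1]
    lead = mic_positions @ look_dir / c          # arrival lead of each mic (s)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    aligned = spectra * np.exp(-2j * np.pi * freqs * lead[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n)
```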
2. A video acquisition unit. The video acquisition unit aims to cover all participants in the conference room and provide a resolution high enough that the segmented images remain sufficiently sharp. The video acquisition unit may include one or more cameras for acquiring video information of one or more users, depending on the size of the conference room and the seating arrangement of the participants.
3. A sending processing unit. The sending processing unit may mainly include the following modules:
(1) An image segmentation module. Fig. 3 is a schematic structural diagram of an image segmentation module according to the second embodiment of the present invention. As shown in Fig. 3, the image segmentation module may detect the face and torso in one or more camera streams and separately segment the image (including head and torso) of each participant (i.e., determine the image information of each user based on the video information).
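The embodiment does not fix a particular detector; purely as an assumed illustration, OpenCV's stock Haar frontal-face detector with rough box-expansion ratios can approximate the head-and-torso crop.

```python
import cv2

def segment_participants(frame):
    """Approximate head-and-torso segmentation: detect frontal faces,
    then expand each face box sideways and downward to include the
    torso, returning one crop per detected participant."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    crops = []
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        x0, x1 = max(x - w // 2, 0), x + w + w // 2   # widen for shoulders
        y0, y1 = max(y - h // 2, 0), y + 3 * h        # extend down for torso
        crops.append(frame[y0:y1, x0:x1].copy())
    return crops
```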
(2) A multi-modal endpoint detection module. Fig. 4 is a schematic diagram of the multi-modal endpoint detection module according to the second embodiment of the present invention. As shown in Fig. 4, the multi-channel audio information (i.e., the audio information) is first associated with the segmented images (i.e., the image information), so that the audio information and image information of the same user are associated. Then, for each person's audio and video information, a lip motion detection algorithm is combined with an audio voice activity detection (VAD) algorithm to judge whether the corresponding participant is speaking, i.e., to determine the speaking state (the state information, which includes indication information indicating whether the corresponding user is speaking). Specifically, voice detection on the audio information yields an audio detection result, lip motion detection on the video information yields an image detection result, and the two results are fused by multi-modal logic to obtain the corresponding state information.
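The exact fusion logic is left open by the embodiment; one assumed rule, shown below, takes a frame-wise AND of the audio VAD flags and lip-motion flags and holds the result for a short hangover so momentary detection dropouts do not chop the speaking state.

```python
def fuse_endpoints(vad_flags, lip_flags, hangover=5):
    """Frame-wise fusion of the audio detection result (VAD) and the
    image detection result (lip motion): declare 'speaking' when both
    fire, then hold the state for `hangover` frames."""
    speaking, count, out = False, 0, []
    for vad, lip in zip(vad_flags, lip_flags):
        if vad and lip:
            speaking, count = True, hangover
        elif count > 0:
            count -= 1
        else:
            speaking = False
        out.append(speaking)
    return out
```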
(3) A path selection module. The audio and video streams (i.e., the audio and video information) of all participants in the local conference room are not all encoded and transmitted to the remote conference rooms; a selection must be made. The priority of the path selection depends on the speaking duration of the participants (i.e., the information to be processed is selected from the audio and video information of at least one user based on the determined speaking durations), and the number of simultaneously encoded and transmitted audio and video streams depends on the maximum number of windows that the receiving-end image rendering can display and the number of conference rooms. For example, one possible configuration is that the receiving-end image rendering can display at most 16 windows, each conference room can simultaneously transmit the audio and video streams of at most 4 participants, and at most 10 conference rooms can be connected.
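Under the example configuration above, the selection rule can be sketched as follows; the record fields and the ranking key (current speakers first, then accumulated speaking time) are assumptions consistent with the speaking-duration priority described.

```python
MAX_PER_ROOM = 4   # streams encoded simultaneously per conference room

def select_streams(room_states, max_streams=MAX_PER_ROOM):
    """Rank participants by current speaking state, then accumulated
    speaking time, and keep at most max_streams of them for encoding.
    room_states: list of dicts like {'user': 'A', 'speaking': True,
    'seconds': 312.0} (field names are illustrative)."""
    ranked = sorted(room_states,
                    key=lambda s: (not s['speaking'], -s['seconds']))
    return ranked[:max_streams]
```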
(4) An audio and video encoding module, which encodes the audio streams (i.e., audio information), video streams (i.e., video information), and speaking states (i.e., state information) output by the path selection module and transmits them to other conference rooms over the network; the transmission may be direct, or relayed through a Selective Forwarding Unit (SFU) or similar media server.
4. A receiving processing unit. The receiving processing unit mainly includes the following modules:
(1) An audio and video decoding module, which decodes the received audio streams, video streams, and speaking states (transmitted directly or forwarded by the SFU).
(2) A video window layout control module, which has three functions: determining which participants' images from other conference rooms are displayed; determining the arrangement order of the images of the remote participants (i.e., determining the video arrangement order of the corresponding video information based on the state information); and calculating the center position of each image (i.e., determining the image position of the corresponding video information) according to the arrangement order of the images (i.e., the video arrangement order) and the window arrangement mode (i.e., the preset window arrangement information). The center position of each image is used to generate the virtual sound image (i.e., the image position is determined as the sound image position of the corresponding virtual sound image).
The window layout control may follow various strategies, such as an automatic strategy, a manual strategy, or a hybrid strategy. In the automatic strategy, according to the speaking states obtained from the audio and video decoding module, the person currently speaking is ranked first and the others are ordered by accumulated speaking duration. In the manual strategy, relevant personnel assign fixed priorities to participants in advance, or dynamically adjust the priorities by hand. The hybrid strategy can be regarded as a combination of the manual and automatic strategies.
The size and position of the window arrangement are not limited; it is only necessary to pass the center position of each window to the virtual sound image generation module, so that, from the user's perspective, the viewed image position and the heard sound image position are consistent. For example, the window arrangement may be one-dimensional or two-dimensional. This embodiment also supports mixed arrangements of document windows and portrait windows.
Fig. 5 is a schematic diagram of a window arrangement according to the second embodiment of the present invention. As shown in Fig. 5, the image corresponding to the video information of the currently speaking user may be placed in the upper-left corner as the state information changes in real time, and the images corresponding to the video information of the other users may be placed in the other windows, for example clockwise in order of accumulated speaking duration.
Fig. 6 is a schematic diagram of another window arrangement according to the second embodiment of the present invention. As shown in Fig. 6, the windows for the images corresponding to all users' video information may be the same size. In this case, the state information may be sorted based on the included indication information and speaking duration, and the video information placed into the corresponding windows from left to right and top to bottom according to the resulting state arrangement order. After the image positions of all the images are determined, the window of the currently speaking user can be visually distinguished from the other windows, for example by flashing the window or adding a shaded border around it.
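For the equal-sized grid of Fig. 6, the window centres that are later handed to the virtual sound image generation module can be computed as in the following sketch (the grid shape and pixel sizes are placeholders):

```python
def grid_window_centers(n_windows, cols, win_w, win_h):
    """Centre of each equal-sized window when windows are filled
    left-to-right, top-to-bottom; these centres serve as the image
    positions and hence as the virtual sound image positions."""
    return [((i % cols + 0.5) * win_w, (i // cols + 0.5) * win_h)
            for i in range(n_windows)]

# e.g. a 4x4 grid of 480x270 windows on a 1920x1080 screen
centers = grid_window_centers(16, 4, 480, 270)
```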
Fig. 7 is a schematic diagram of a further window arrangement according to the second embodiment of the present invention. As shown in Fig. 7, shared content (e.g., a document) in the conference may be displayed on the left side, and the images corresponding to the video information may be displayed in the windows on the right side, placed in priority order from top to bottom.
The windows can be displayed in full screen, or share the display screen with the shared content.
(3) A virtual sound image generation module, which generates the multi-channel signal using a spatial audio algorithm according to the image position and the number and arrangement of the loudspeakers in the array (see "audio rendering unit"). The generated multi-channel signal, when played on the given loudspeaker array, produces a virtual sound image. The number of channels is the same as the number of loudspeakers.
5. An image rendering unit. The main function of image rendering is to render each decoded video stream in a small window; the arrangement of the windows is described under the video window layout control module.
6. An audio rendering unit. The rendering of audio includes the following two parts:
(1) A loudspeaker array. Typically, the number and arrangement of loudspeaker units are chosen to match the display screen (for example, a 16:9 screen); the present embodiment employs a 7-unit speaker array.
Fig. 8 is a schematic diagram of a speaker array according to the second embodiment of the present invention. As shown in Fig. 8, a 7-unit speaker array is arranged around the display screen. When a user in a certain window speaks, a 7-channel signal corresponding to the audio information is obtained by rendering based on the sound image position and that user's audio information, and the 7 units simultaneously play their respective channels, so that the played sound image position is consistent with the image position.
(2) A multi-channel digital-to-analog converter (DAC), i.e., a sound card. The multi-channel signal generated by the virtual sound image generation module is played through the sound card, amplified by the power amplifier, and finally drives the loudspeaker array, so that users in the local conference room experience consistent image and sound image positions.
In summary, this embodiment provides an end-to-end immersive conference system covering both audio and video: the sound and image of each person in the conference room are extracted and rendered in the far-end conference room, and a spatial audio algorithm is applied to generate a virtual sound image, so that the image and sound image positions are consistent, providing an immersive conference experience for users. Moreover, in this embodiment the control of the image and sound image positions is not dictated by the sending end but is handled by a dedicated conference control module at the receiving end, which better matches users' habits.
EXAMPLE III
Fig. 9 is a schematic structural diagram of an information processing apparatus according to a third embodiment of the present invention. As shown in Fig. 9, the apparatus includes:
the first obtaining module 310 is configured to obtain audio and video information of at least one user, where the audio and video information includes audio information and video information;
the first determining module 320 is configured to determine, according to the audio and video information, status information of a corresponding user, where the status information includes indication information indicating whether the corresponding user is speaking;
a second determining module 330, configured to determine information to be processed based on a target parameter, where the information to be processed includes audio and video information selected from the audio and video information, the target parameter includes the state information, and the included audio and video information is associated audio information and video information;
the encoding module 340 is configured to encode the information to be processed to obtain encoded information, where audio and video information of each user in the information to be processed is encoded separately;
a transmission module 350, configured to transmit the encoded information, where the encoded information is used to determine a sound image position of the audio information.
In the information processing apparatus provided in the third embodiment of the present invention, the first obtaining module 310 obtains the audio and video information of at least one user, where the audio and video information includes audio information and video information; the first determining module 320 determines the state information of the corresponding user according to the audio and video information, where the state information includes indication information indicating whether the corresponding user is speaking; the second determining module 330 determines the information to be processed based on a target parameter, where the information to be processed includes audio and video information selected from the audio and video information, the target parameter includes the state information, and the included audio and video information is associated audio information and video information; the encoding module 340 encodes the information to be processed to obtain encoded information, where the audio and video information of each user in the information to be processed is encoded separately; and the transmission module 350 transmits the encoded information, which is used to determine the sound image position of the audio information. The apparatus accurately selects the information to be processed from the audio and video information based on target parameters containing state information, providing the receiving end with a signal basis from which the sound image position can be flexibly determined, thereby ensuring the spatial effect of the audio.
Optionally, the first determining module 320 includes:
determining image information of each user based on the video information;
associating audio information and image information of the same user;
for the audio information and the image information which are associated with each user, determining an audio detection result based on the audio information, and determining an image detection result based on the image information;
and determining corresponding state information based on the audio detection result and the image detection result corresponding to each user.
Optionally, the second determining module 330 includes:
accumulating the state information aiming at the state information of each user, and determining the corresponding speaking duration, wherein the state information also comprises the speaking duration;
and selecting information to be processed from the audio and video information of the at least one user based on the determined speaking duration.
Optionally, the target parameter further includes user identification information, and accordingly, the second determining module 330 includes:
determining the audio and video information corresponding to the user identification information as information to be processed;
and selecting information to be processed from the audio and video information except the audio and video information corresponding to the user identification information.
Optionally, the encoding module 340 includes:
for the audio and video information and the state information of each user in the information to be processed, encoding the audio and video information and the state information to obtain the encoded information of the user; or,
and aiming at the audio and video information of each user in the information to be processed, encoding the audio and video information to obtain the encoded information of the user.
The information processing apparatus provided by this embodiment of the present invention can execute the information processing method provided by the first embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the executed method.
Example four
Fig. 10 is a schematic structural diagram of an information processing apparatus according to a fourth embodiment of the present invention. As shown in Fig. 10, the apparatus includes:
a second obtaining module 410, configured to obtain and decode the encoded information to obtain information to be processed, where the information to be processed includes selected audio and video information, and the audio and video information includes audio information and video information;
a third determining module 420, configured to determine an image position of an image corresponding to the video information;
a fourth determining module 430, configured to determine, based on the image position, a sound image position of a virtual sound image, where the virtual sound image is the virtual sound image of the audio information corresponding to the video information, and to generate a multi-channel signal corresponding to the audio information based on the sound image position and the audio information; the multi-channel signal is generated by a spatial audio algorithm and, when played through a speaker array, creates a virtual sound image at the image position.
A display module 440, configured to render and display an image corresponding to the video information based on the image position;
the playing module 450 is configured to play the multi-channel signal.
In the information processing apparatus provided in the fourth embodiment of the present invention, the second obtaining module 410 obtains and decodes the encoded information to obtain the information to be processed, where the information to be processed includes the selected audio and video information, and the audio and video information includes audio information and video information; the third determining module 420 determines the image position of the image corresponding to the video information; the fourth determining module 430 determines, based on the image position, the sound image position of the virtual sound image, where the virtual sound image is the virtual sound image of the audio information corresponding to the video information, and generates the multi-channel signal corresponding to the audio information based on the sound image position and the audio information; the display module 440 renders and displays the image corresponding to the video information based on the image position; and the playing module 450 plays the multi-channel signal. The apparatus determines the sound image position of the virtual sound image from the image position, generates a multi-channel audio signal with a spatial audio algorithm, and creates a virtual sound image by playing it through the loudspeaker array, so that the image position and the sound image position are consistent, providing an immersive conference experience for users.
Optionally, the third determining module 420 includes:
and the second determining unit is used for determining the image position of the image corresponding to the video information based on the information to be processed and preset window arrangement information.
Optionally, the information to be processed further includes state information corresponding to the selected audio/video information, and correspondingly, the second determining unit includes:
the first determining subunit is configured to determine, based on each piece of the state information, a video arrangement order of the corresponding video information when the number of the audio/video information is at least two;
and the second determining subunit is used for determining the image position of the corresponding video information based on preset window arrangement information and the video arrangement sequence, wherein the window arrangement information comprises position information of the included window.
Optionally, the first determining subunit includes:
sequencing each state information based on the indication information and the speaking duration included in each state information;
and determining the state arrangement sequence of the state information as the video arrangement sequence of the corresponding video information.
Optionally, the second determining subunit includes:
according to the video arrangement sequence and the window arrangement information, sequentially associating the video information with the window corresponding to the window arrangement information;
and determining the window center position of the window corresponding to the window arrangement information as the image position of the associated video information.
Optionally, the fourth determining module 430 includes:
and determining the image position as the sound image position of the corresponding virtual sound image.
The information processing apparatus provided by this embodiment of the present invention can execute the information processing method provided by the second embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the executed method.
EXAMPLE five
Fig. 11 is a schematic structural diagram of an electronic device implementing an information processing method according to an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in Fig. 11, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the respective methods and processes described above, such as an information processing method.
In some embodiments, the information processing method may be implemented as a computer program that is tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the information processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the information processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
EXAMPLE six
Fig. 12 is a schematic structural diagram of a conference system according to a sixth embodiment of the present invention. As shown in Fig. 12, the conference system includes the electronic device 1 of the fifth embodiment, an acquisition device 2, and an output device 3, where the electronic device 1 is applied to a conference room scenario;
the acquisition equipment 2 is used for acquiring audio and video information of at least one user;
the output device 3 is used to output images and multi-channel signals.
The electronic device 1 may be configured to execute the information processing method of the first or second embodiment. The acquisition device 2 is a device for collecting audio and video, used to collect the audio and video information of at least one user as required by the information processing method; it may be an integrated device that collects both simultaneously, or it may include an audio acquisition unit and a video acquisition unit that collect the audio information and the video information of at least one user respectively.
The output device 3 may be used to output the images and the multi-channel signal of the information processing method. For example, the output device 3 may include a display unit for displaying the images output by the information processing method, and a loudspeaker array for playing and amplifying the multi-channel signal.
In one embodiment, the capture device includes an audio capture unit for capturing audio information of each user and a video capture unit for capturing video information of one or more users.
In one embodiment, each user may correspond to one audio acquisition unit so that each user's audio information is collected separately, or one audio acquisition unit may collect the audio information of all users simultaneously. Likewise, each user may correspond to one video acquisition unit, or all users may share one video acquisition unit that captures the video information of all users together. In practice, these configurations can be freely combined to collect the audio and video information of at least one user. The types and numbers of audio acquisition units and video acquisition units are not limited and can be set as needed.
In one embodiment, the audio acquisition units are directional pickup microphones, the number of the audio acquisition units is one or more, and each audio acquisition unit is used for acquiring the audio information of one user; or,
the audio acquisition unit is one or more microphone arrays, and each microphone array is used for acquiring the audio information of one user; or,
the audio acquisition unit is one or more microphone arrays, and the microphone arrays are used for acquiring audio information of a plurality of users and outputting the acquired audio information of the plurality of users independently.
In this embodiment, the audio acquisition units may be directional pickup microphones, each acquiring the audio information of a corresponding user, with one or more such units. The audio acquisition units may instead be microphone arrays, each directionally acquiring the audio information of one user, again with one or more such units. When the audio acquisition unit is a microphone array, it may also acquire the audio information of multiple users and output each user's audio information separately.

Claims (16)

1. An information processing method characterized by comprising:
acquiring audio and video information of at least one user, wherein the audio and video information comprises audio information and video information;
according to the audio and video information, determining the state information of the corresponding user, wherein the state information comprises indication information indicating whether the corresponding user is speaking;
determining information to be processed based on a target parameter, wherein the information to be processed comprises audio and video information selected from the audio and video information, the target parameter comprises the state information, and the included audio and video information is the associated audio information and video information;
coding the information to be processed to obtain coded information, wherein the audio and video information of each user in the information to be processed is coded independently;
and transmitting the coded information, wherein the coded information is used for determining the sound image position of the audio information.
2. The method according to claim 1, wherein determining the status information of the corresponding user according to the audio-video information comprises:
based on the video information, segmenting the image corresponding to the video information to obtain the image information of each user;
associating audio information and image information of the same user;
for the audio information and the image information which are associated with each user, determining an audio detection result based on the audio information, and determining an image detection result based on the image information;
and determining corresponding state information based on the audio detection result and the image detection result corresponding to each user.
3. The method of claim 1, wherein determining information to be processed based on the target parameter comprises:
accumulating the state information aiming at the state information of each user, and determining the corresponding speaking duration, wherein the state information also comprises the speaking duration;
and selecting information to be processed from the audio and video information of the at least one user based on the determined speaking duration.
4. The method of claim 1, wherein the target parameters further include user identification information, and accordingly, determining the information to be processed based on the target parameter comprises:
determining the audio and video information corresponding to the user identification information as information to be processed;
and selecting information to be processed from the audio and video information except the audio and video information corresponding to the user identification information.
5. The method of claim 1, wherein the encoding the information to be processed to obtain encoded information comprises:
for the audio and video information and the state information of each user in the information to be processed, encoding the audio and video information and the state information to obtain encoded information of the user; or,
and aiming at the audio and video information of each user in the information to be processed, encoding the audio and video information to obtain the encoded information of the user.
6. An information processing method characterized by comprising:
acquiring and decoding encoded information to obtain information to be processed, wherein the information to be processed comprises selected and obtained audio and video information, the audio and video information comprises audio information and video information, and the encoded information is obtained based on the method of any one of claims 1-5;
determining the image position of an image corresponding to the video information;
determining the sound image position of a virtual sound image based on the image position, wherein the virtual sound image is the virtual sound image of the audio information corresponding to the video information;
rendering and displaying an image corresponding to the video information based on the image position;
generating a multi-channel signal corresponding to the audio information based on the sound image position and the audio information;
and playing the multi-channel signal.
7. The method according to claim 6, wherein the determining the image position of the image corresponding to the video information comprises:
and determining the image position of the image corresponding to the video information based on the information to be processed and preset window arrangement information.
8. The method according to claim 7, wherein the information to be processed further includes state information corresponding to the selected audio/video information, and correspondingly, determining the image position of the image corresponding to the video information based on the information to be processed and preset window arrangement information includes:
when the number of the audio and video information is at least two, determining the video arrangement sequence of the corresponding video information based on each state information;
and determining the image position of the corresponding video information based on preset window arrangement information and the video arrangement sequence, wherein the window arrangement information comprises the position information of the included window.
9. The method according to claim 8, wherein determining the video arrangement order of the corresponding video information based on each of the status information comprises:
sequencing each state information based on the indication information and the speaking duration included in each state information;
and determining the state arrangement sequence of the state information as the video arrangement sequence of the corresponding video information.
10. The method according to claim 8, wherein the determining the image position of the corresponding video information based on the preset window arrangement information and the video arrangement order comprises:
according to the video arrangement sequence and the window arrangement information, sequentially associating the video information with the window corresponding to the window arrangement information;
and determining the window center position of the window corresponding to the window arrangement information as the image position of the associated video information.
11. The method according to claim 6, wherein determining the sound image position of a virtual sound image based on the image position comprises:
and determining the image position as the sound image position of the corresponding virtual sound image.
12. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
13. A conferencing system, comprising the electronic device according to claim 12, an acquisition device, and an output device, wherein the electronic device is applied to a conference room scenario;
the acquisition equipment is used for acquiring the audio and video information of at least one user;
the output device is used for outputting images and multi-channel signals.
14. The conferencing system of claim 13, wherein the capture device comprises an audio capture unit for capturing audio information for each user and a video capture unit for capturing video information for one or more users.
15. The conferencing system of claim 14,
the audio acquisition units are directional pickup microphones, the number of the audio acquisition units is one or more, and each audio acquisition unit is used for acquiring the audio information of a user; or,
the audio acquisition unit is one or more microphone arrays, and each microphone array is used for acquiring the audio information of a user; or,
the audio acquisition unit is one or more microphone arrays, and the microphone arrays are used for acquiring audio information of a plurality of users and outputting the acquired audio information of the plurality of users independently.
16. A computer-readable storage medium, characterized in that it stores computer instructions for causing a processor to implement, when executed, the information processing method of any one of claims 1-11.