WO2023142266A1 - Remote interaction method, remote interaction device, and computer storage medium - Google Patents

Remote interaction method, remote interaction device, and computer storage medium

Info

Publication number
WO2023142266A1
WO2023142266A1 PCT/CN2022/084908 CN2022084908W WO2023142266A1 WO 2023142266 A1 WO2023142266 A1 WO 2023142266A1 CN 2022084908 W CN2022084908 W CN 2022084908W WO 2023142266 A1 WO2023142266 A1 WO 2023142266A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
image
range
sound source
preset
Prior art date
Application number
PCT/CN2022/084908
Other languages
English (en)
French (fr)
Inventor
张世明
张正道
倪世坤
李达钦
陈永金
Original Assignee
深圳壹秘科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹秘科技有限公司
Publication of WO2023142266A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842Selection of displayed objects or displayed text elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/547Remote procedure calls [RPC]; Web services

Definitions

  • the present application relates to the technical field of remote interaction, in particular to a remote interaction method, a remote interaction device and a computer storage medium.
  • remote interactive devices are more and more widely used in daily life and work.
  • remote interactive devices can be applied to scenarios such as live video broadcasting, video interaction, and audio and video conferences.
  • the remote interactive device generally obtains spatial panoramic video data through a shooting module, and displays the panoramic video data on an interactive interface to realize interaction with remote users.
  • however, outputting the panoramic video tends to make the interactive interface show interaction details, such as facial expressions and body movements, only for users who are close to the camera module, while the interaction details of users farther from the camera module cannot be displayed on the interactive interface, and it is difficult for the remote user to tell from the panoramic video who is currently speaking, which leads to a poor user interaction experience.
  • the main purpose of the present application is to provide a remote interaction method, a remote interaction device and a computer storage medium, aiming to enable the interactive interface to highlight the video data of the sounding object and to improve the user's interaction experience during remote interaction.
  • the present application provides a remote interaction method, which includes the following steps:
  • the present application also proposes a remote interaction device, the remote interaction device includes:
  • the control device includes: a memory, a processor, and a remote interaction program stored on the memory and operable on the processor; when the remote interaction program is executed by the processor, it implements the steps of the remote interaction method described in any one of the above items.
  • the present application also proposes a computer storage medium, on which a remote interaction program is stored; when the remote interaction program is executed by a processor, the steps of the remote interaction method described in any one of the above items are implemented.
  • the remote interaction method proposed by the present application determines the target image range of the sound source object in the panoramic video of the target space according to the sounding range of the sound source object in the target space, and outputs the sub-video data within the target image range in the first target display window of the remote device.
  • outputting the sub-video data allows the sounding objects in the target space to be highlighted in the interactive interface of the remote device, which reflects the interaction details of the sounding objects better than the panoramic video, thereby effectively improving the user's interaction experience during remote interaction.
  • FIG. 1 is a schematic diagram of a remote interaction scene to which the remote interaction device of the present application is applied;
  • FIG. 2 is a schematic diagram of the hardware structure involved in the operation of an embodiment of the remote interaction device of the present application
  • FIG. 3 is a schematic flowchart of the first embodiment of the remote interaction method of the present application.
  • FIG. 4 is a schematic flow diagram of the second embodiment of the remote interaction method of the present application.
  • FIG. 5 is a schematic flowchart of a third embodiment of the remote interaction method of the present application.
  • FIG. 6 is a schematic flowchart of a fourth embodiment of the remote interaction method of the present application.
  • FIG. 7 is a schematic flowchart of a fifth embodiment of the remote interaction method of the present application.
  • FIG. 8 is a schematic flowchart of the sixth embodiment of the remote interaction method of the present application.
  • FIG. 9 is a schematic diagram of the determination process of the target object and the sorting process of the sub-windows in the vocalization process of different sound source objects involved in the embodiment of the remote interaction method of the present application;
  • FIG. 10 is a schematic flowchart of the seventh embodiment of the remote interaction method of the present application.
  • FIG. 11 is a schematic diagram of panoramic videos collected before and after adjustment of the shooting range involved in the embodiment of the remote interaction method of the present application;
  • Fig. 12 is a schematic diagram of the interface when the first target display window and the second target display window are simultaneously displayed during the remote interaction process according to the embodiment of the remote interaction method of the present application;
  • FIG. 13 is a schematic diagram of the spatial range involved in the process of determining the vocal range and the target spatial range in the embodiment of the remote interaction method of the present application;
  • FIG. 14 is a schematic diagram of the image range involved in the determination and adjustment of the target image range in the remote interaction method embodiment of the present application.
  • FIG. 15 is a schematic diagram of the range adjustment of the target image triggered by the movement of the sound source object involved in the embodiment of the remote interaction method of the present application;
  • Fig. 16 is a schematic diagram of the space azimuth and the space elevation involved in the remote interaction method embodiment of the present application.
  • the main solution of the embodiment of the present application is: acquire the sounding range of the sound source object in the target space, and obtain the panoramic video of the target space; determine the target image range where the sound source object is located in the panoramic video according to the sounding range; output the sub-video data within the target image range in the first target display window, and send the first target display window to the remote device, so that the remote device displays the first target display window.
  • since the output of the panoramic video tends to show on the interactive interface only the interaction details, such as facial expressions and body movements, of users who are close to the shooting module, the interaction details of users farther from the shooting module cannot be displayed on the interactive interface, and it is difficult for remote users to tell from the panoramic video who is currently speaking, which leads to a poor user interaction experience.
  • the present application provides the above solution, aiming to enable the interactive interface to highlight the video data of the sounding object and improve the user's interaction experience during remote interaction.
  • the embodiment of the present application proposes a remote interaction device, which is applied to a remote interaction scene, where the remote interaction scene may be a long-distance interaction scene in the same space, or a long-distance interaction scene between different spaces or different regions.
  • remote interactive devices can be applied to scenarios such as live video broadcasting, video interaction, and remote conferences.
  • a table can be set in the space where the remote interaction device is located, and the table can be a square table, a round table, or any shape table.
  • the remote interaction device can be placed on the table, for example, the remote interaction device can be placed at the center of the table, the edge of the table or any other position on the table. People who need to interact remotely (such as multiple participants) surround the table.
  • devices configured to output information required for interaction, such as monitors, audio playback devices, tablet computers, and mobile phones, can also be arranged on the side of the table or at the edge of the table.
  • the remote interactive device includes a camera module 2, a microphone array 3 and a control device 1. Both the camera module 2 and the microphone array 3 are connected to the control device 1.
  • the remote interaction device may include a casing, and the camera module 2 and the microphone array 3 are fixed to the casing.
  • the shooting module 2 is configured to collect panoramic video data of the space where it is located.
  • the photographing module 2 can also be configured to collect scene pictures and the like of the space where it is located.
  • the camera module 2 is arranged on the top of the casing. In other embodiments, the camera module 2 can also be arranged around the casing.
  • the photographing module 2 is a fisheye camera.
  • the shooting module 2 can also include multiple cameras or a movable camera, so as to obtain the panoramic video by splicing together the multiple video streams collected by the multiple cameras or by the movable camera.
  • the viewing angle range of the photographing module 2 may include the maximum azimuth range that the captured images of the photographing module 2 can cover and/or the maximum elevation angle range that the captured images of the photographing module 2 can cover.
  • the definition of the image azimuth angle of the shooting module 2 is as follows: take the line from the preset imaging center of the shooting module 2 pointing in the first preset direction on the horizontal plane as the first baseline, and the line connecting the image position to the preset imaging center as the first target direction line; the horizontal angle formed by the first target direction line and the first baseline is the image azimuth angle.
  • the first preset direction here can be determined according to the installation position of the camera module 2 or the installation positions of other functional modules on the remote interactive device.
  • the first preset direction is a direction in which the preset imaging center faces the back of the remote interactive device.
  • the image azimuth angle formed by the first baseline along the clockwise direction and the first target direction line is defined as a positive value
  • the image azimuth angle formed by the first baseline along the counterclockwise direction and the first target direction line is defined as a negative value.
  • a positive image azimuth angle is used to define the maximum azimuth angle range
  • the maximum azimuth angle range of the photographing module 2 may be 0° to 360°.
  • the definition of the image pitch angle of the shooting module 2 is as follows: the line from the preset imaging center of the shooting module 2 pointing in the second preset direction on the vertical plane is the second baseline, and the line connecting the image position to the preset imaging center is the second target direction line; the angle formed by the second target direction line and the second baseline in the vertical plane is the image pitch angle.
  • the second preset direction is the direction in which the preset imaging center points to the image position corresponding to the edge of the table captured by the photographing module 2 when the remote interactive device is placed on the table.
  • the maximum pitch angle range of the shooting module 2 can be 0 degrees to 69 degrees; the 69 degrees here can be set to other values according to actual conditions such as differences in table size, differences in the height of the sound source object, and differences in the installation position of the camera module 2.
  • the second preset direction may also be a horizontal direction.
  • the photographing module 2 can be preset with an image coordinate system for characterizing the image position of the image data it collects.
  • the image coordinate system can be a polar coordinate system or a rectangular coordinate system, and the preset imaging center here is the coordinate origin of the image coordinate system.
  • the photographing module 2 is a fisheye camera, and the maximum azimuth angle range of its framing is between 200 degrees and 230 degrees. In other embodiments, the range of the maximum azimuth angle may also be larger, such as 360 degrees, 270 degrees and so on.
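  • as a minimal illustration of such an image coordinate system, the sketch below converts between a polar image coordinate (radius, image azimuth angle) and a rectangular image coordinate with the preset imaging center as the origin; the function names, units, and example values are assumptions for illustration only, while the clockwise-positive azimuth convention follows the description above.

```python
import math

def polar_to_rect(radius, azimuth_deg, origin=(0.0, 0.0)):
    """Convert a polar image coordinate (radius, azimuth) to rectangular (x, y).

    The preset imaging center is taken as the coordinate origin; the azimuth is
    treated as positive clockwise from the first baseline. The relation between
    the pitch angle and the radius depends on the lens projection and is not
    assumed here.
    """
    theta = math.radians(azimuth_deg)
    x = origin[0] + radius * math.sin(theta)
    y = origin[1] + radius * math.cos(theta)
    return x, y

def rect_to_polar(x, y, origin=(0.0, 0.0)):
    """Inverse conversion: rectangular image coordinates back to (radius, azimuth)."""
    dx, dy = x - origin[0], y - origin[1]
    radius = math.hypot(dx, dy)
    azimuth_deg = math.degrees(math.atan2(dx, dy)) % 360.0
    return radius, azimuth_deg

# Example: a point 250 px from the imaging center at image azimuth 120 deg.
x, y = polar_to_rect(250.0, 120.0)
print((x, y), rect_to_polar(x, y))
```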
  • the microphone array 3 is specifically configured to collect sound signals from different spatial directions in the space where it is located.
  • the control device 1 can locate the position of the sound source in the space according to the sound data collected by the microphone array 3 .
  • the microphone array 3 specifically includes a plurality of microphones arranged in an array. Specifically, in this embodiment, multiple microphones are arranged in a circular array. In other embodiments, multiple microphones may also be arranged in a triangular array or in an irregular shape.
  • the casing may be provided with a plurality of holes configured for installing the microphone array 3, the holes being set in one-to-one correspondence with the microphones in the microphone array 3; the plurality of holes may be arranged on the top wall of the casing, or may be arranged on the side wall of the casing along its circumference.
  • the range of the azimuth angle for sound pickup by the microphone array 3 is 0° to 360°, and the angle range for the pitch angle of sound pickup by the microphone array 3 is 16° to 69°. It should be noted that the angle range of sound pickup by the microphone array 3 is not limited to the above numerical range, and a larger or smaller angle range can be set according to actual conditions.
  • the control device 1 includes: a processor 1001 (such as a CPU), a memory 1002 , a timer 1003 and the like.
  • the components in the control device 1 are connected via a communication bus.
  • the memory 1002 can be a high-speed RAM memory, or a stable memory (non-volatile memory), such as a disk memory.
  • the memory 1002 includes an embedded multimedia memory card (eMMC) and a double-rate synchronous dynamic random access memory (DDR).
  • the memory 1002 may optionally also be a storage device independent of the foregoing processor 1001 .
  • FIG. 2 does not constitute a limitation on the device, which may include more or fewer components than shown in the figure, or combine some components, or have a different arrangement of components.
  • the memory 1002 as a computer storage medium may include a remote interactive program.
  • the processor 1001 may be configured to call a remote interaction program stored in the memory 1002 and execute related steps of the remote interaction method in the following embodiments.
  • the remote interaction device may further include a speaker 4 connected to the control device 1 .
  • the loudspeaker 4 can be configured to play audio data, where the audio data can be sound data collected and sent by the remote device, or sound data input by other terminals in the space where the remote interactive device is located, obtained over a wired or wireless communication connection.
  • the loudspeaker 4 can be installed inside the housing, and the housing can be provided with a plurality of sound outlets communicating with the cavity where the loudspeaker 4 is located.
  • the emitted sound can spread evenly in all directions over 360 degrees through the multiple sound outlets.
  • when the speaker 4 plays sound at maximum volume, the sound pressure level detected at a spatial position at a preset distance from the speaker 4 is greater than or equal to a preset decibel value.
  • the preset distance is 1 meter
  • the preset decibel value is 60 dB.
  • the preset distance may also be 1.3 meters, 1.5 meters, 2 meters, etc.
  • the preset decibel value may also be 70 dB, 75 dB, etc.
  • the remote interaction device further includes a button module 5 .
  • the button module 5 is connected with the control device 1 .
  • the button module 5 can be a mechanical button installed on the casing, or a touch module installed on the casing and configured to display virtual buttons, or other button modules 5 that can generate high and low level electrical signals.
  • the button module 5 is specifically configured for human-computer interaction between the user and the remote interactive device.
  • specifically, the button module 5 can generate a corresponding key value in response to a user operation, and the control device 1 can be configured to acquire the key value generated by the button module 5 and operate according to the instruction corresponding to that key value.
  • the remote interaction device may further include a communication module 6, which is specifically a wireless communication module 6, and may be configured to realize a wireless communication connection between the remote interaction device and an external device.
  • the wireless communication module 6 is a Bluetooth module.
  • the wireless communication module 6 may also be a WIFI module, a ZigBee module, a radio frequency communication module 6 and other wireless communication modules 6 of any type.
  • a control terminal of the remote interactive device (such as a mobile phone, notebook computer, tablet computer, smart watch, etc.) can establish a wireless communication connection with the remote interactive device based on the wireless communication module 6, and the remote interactive device can receive, over the wireless communication connection, control commands input by the user or audio and video data acquired by the control terminal.
  • the remote interaction device may further include a data interface 7 connected to the control device 1 .
  • the data interface 7 may be configured to establish a wired communication connection with a computer device that is external to the remote interaction device and accesses the Internet.
  • the data interface 7 is a USB interface.
  • the data interface 7 may also be other types of interfaces, such as IEEE interfaces and the like.
  • the control device 1 can send the audio and video data that needs to be output by the remote device to the computer device through the data interface 7, and the computer device can send the data to the remote device through the Internet, so that the remote device can output the audio and video data collected by the remote interactive device.
  • control signals between the computer device and the remote interactive device can also be transmitted bidirectionally through the data interface 7.
  • preset application programs (such as live broadcast software, conference software, etc.) can be installed in the computer device connected to the remote interactive device, and the preset application program can perform the two-way transmission over the Internet of the audio and video data generated by the remote interactive device and by the remote device.
  • the embodiment of the present application also provides a remote interaction method, which is applied to the above-mentioned remote interaction device.
  • the remote interaction method includes:
  • Step S10 acquiring the sound range of the sound source object in the target space, and acquiring the panoramic video of the target space;
  • the target space is specifically a limited space where the remote interaction device is located.
  • the sound source object is specifically an object that emits sound in the target space, which may be a human body or a device that emits sound (such as a mobile phone, a speaker, a tablet computer, etc.).
  • the sounding range specifically refers to the maximum spatial range in which the sounding position (such as the mouth of a human body, etc.) moves during the sounding process of the sound source object.
  • the sound emitting range may be determined by detecting the sound signal of the sound source object, or may be determined by detecting the image signal of the sound source object.
  • the panoramic video is specifically multimedia data formed by multiple curved image frames (such as spherical images or cylindrical images) continuously collected by the shooting module, and the center of each curved image frame is the preset imaging center of the shooting module.
  • the panoramic video here can be obtained by obtaining the data collected by the shooting module in real time.
  • Step S20 determining the target image range where the sound source object is located in the panoramic video according to the sounding range
  • a conversion relationship between the spatial position of the target space and the image position in the panoramic video may be preset. Based on the conversion relationship, the spatial position feature parameters corresponding to the vocal range can be directly converted into image position feature parameters, and the image range corresponding to the converted image position feature parameters can be used as the target image range.
  • the sounding range can also be enlarged according to preset rules to obtain the spatial range corresponding to the target area of the sound source object (such as the head of the human body, the upper body of the human body, the entire playback device, etc.); based on the conversion relationship, the spatial position feature parameters corresponding to the obtained spatial range are converted into image position feature parameters, and the image range corresponding to the converted image position feature parameters is used as the target image range.
  • Step S30 output the sub-video data within the range of the target image in the first target display window, and send the first target display window to the remote device, so that the remote device displays the first target display window.
  • the remote device is specifically configured to perform two-way transmission of audio and video data with the remote interactive device, and output the audio and video data received from the remote interactive device, so as to realize the remote interaction between the user of the remote device and the user in the target space .
  • the first target display window is specifically a window configured to display the video data of the sound source object among all objects allowed to make sounds in the target space, so that the user of the remote device can visually experience close-distance communication with the sound source object in the target space.
  • each sound source object corresponds to a target image range, and each sound source object corresponds to a sub-window in the first target display window; the sub-video data within the target image range corresponding to each sound source object is output in the corresponding sub-window, and the sub-windows are merged to form the first target display window.
  • the first target display window is sent to any remote device that has installed and opened the preset application, and the remote device can display the first target display window and the sub-video data therein when the preset application is opened.
  • alternatively, the sub-video data extracted from the panoramic video can be sent directly to the remote device over the Internet; the remote device can adjust the sub-video data into display data adapted to the first target display window, which in this case is specifically a window in a preset application configured for remote interaction, and can display the adjusted display data in the first target display window of the preset application installed on it.
  • alternatively, the target image range and the panoramic video can be sent to the remote device, and the remote device can extract the video data at the corresponding position in the panoramic video based on the received target image range to obtain the sub-video data, and output the extracted sub-video data in the first target display window of its preset application.
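  • a minimal sketch of extracting sub-video data from a panoramic frame according to a rectangular target image range and merging the resulting sub-windows into a first target display window; the pixel-rectangle representation, NumPy array layout, and function names are assumptions for illustration and not the application's implementation.

```python
import numpy as np

def crop_target_image_range(panorama_frame, target_image_range):
    """Extract the sub-video data for one sound source object from a panoramic
    frame; `target_image_range` is a (left, top, right, bottom) pixel rectangle."""
    left, top, right, bottom = target_image_range
    return panorama_frame[top:bottom, left:right]

def resize_nearest(frame, out_h, out_w):
    """Nearest-neighbour resize with plain indexing, to keep the sketch dependency-free."""
    h, w = frame.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows][:, cols]

def compose_first_target_display_window(sub_frames, sub_h=360, sub_w=640):
    """Merge one sub-window per sound source object into a single display window
    by resizing every crop to the same size and placing them side by side."""
    return np.hstack([resize_nearest(f, sub_h, sub_w) for f in sub_frames])

# Example with a synthetic panoramic frame and two target image ranges.
panorama = np.zeros((1000, 1000, 3), dtype=np.uint8)
ranges = [(100, 200, 400, 500), (600, 250, 900, 550)]
window = compose_first_target_display_window(
    [crop_target_image_range(panorama, r) for r in ranges])
print(window.shape)  # (360, 1280, 3)
```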
  • the remote interaction method proposed in the embodiment of the present application determines the target image range of the sound source object in the panoramic video of the target space according to the sounding range of the sound source object in the target space, and outputs the sub-video data within the target image range in the first target display window.
  • outputting the sub-video data allows the sounding object in the target space to be highlighted in the interactive interface of the remote device, which reflects the interaction details of the sounding object better than the panoramic video, thereby effectively improving the user's interaction experience during remote interaction.
  • the step S10 includes:
  • Step S11 detecting a plurality of first spatial position information of the sound emitting position of the sound source object within a preset time period, and obtaining a plurality of sound source position information
  • when the sound source object is a human body, the sound emitting position may refer to the mouth (as shown at 01 in Figure 13(b)); when the sound source object is a sound emitting device, the sound emitting position may refer to the speaker of the device.
  • specifically, the spatial position information of the sounding position of the sound source object is detected at multiple consecutive moments within the preset duration (such as (X1, Y1) in Figure 13(a)), and the time interval between two consecutive moments can be a preset value.
  • the preset duration may be 0.5 seconds
  • the first spatial position information of the sound emitting position of the sound source object may be continuously detected multiple times within 0.5 seconds to obtain multiple sound source position information.
  • a spatial coordinate system representing different spatial positions in the target space can be established in advance, and the spatial coordinate system can be a polar coordinate system or a rectangular coordinate system.
  • the spatial position information may specifically be represented by coordinates in the spatial coordinate system.
  • specifically, the sound signal detected by the microphone array is acquired, the acquired sound signal is processed according to a preset sound source localization algorithm, and the calculated spatial position information can be used as the sound source position information.
  • the preset sound source localization algorithm can be an algorithm that localizes the sound source based on the time differences with which the microphones in the microphone array receive the sound signal, such as a TDOA algorithm (which may specifically include the GCC-PHAT algorithm or the SRP-PHAT algorithm); the preset sound source localization algorithm may also be a method that performs sound source localization using spatial spectrum estimation, such as the MUSIC algorithm.
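  • as an illustration of the TDOA idea mentioned above, the sketch below estimates the time difference of arrival between two microphone signals with a GCC-PHAT weighted cross-correlation; the sampling rate, signal lengths, and helper name are illustrative assumptions, not the application's implementation.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the time difference of arrival (seconds) of `sig` relative to
    `ref` using GCC-PHAT (phase-transform weighted cross-correlation)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-15             # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # lag in samples (positive: sig lags ref)
    return shift / fs

# Example: broadband noise delayed by 10 samples at a 16 kHz sampling rate.
fs = 16000
ref = np.random.default_rng(0).standard_normal(2048)
sig = np.concatenate((np.zeros(10), ref[:-10]))  # ref delayed by 10 samples
print(gcc_phat_tdoa(sig, ref, fs))               # ~ 10 / 16000 = 0.000625 s
```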
  • specifically, the azimuth angle and the elevation angle of the sounding position of the sound source object in the target space are detected multiple times within a preset time period, with the shooting module as the base point, to obtain multiple first spatial azimuth angles and multiple first spatial elevation angles; the multiple pieces of sound source position information include the multiple first spatial azimuth angles and the multiple first spatial elevation angles, and the shooting module is configured to capture the panoramic video.
  • the spatial azimuth angle (shown in Figure 16(a)) is defined as follows: take the spatial position of the camera module as the base point, the line pointing in the third preset direction on the horizontal plane as the third baseline, and the line connecting the spatial position to the base point as the third target direction line; the horizontal angle formed by the third target direction line and the third baseline is the spatial azimuth angle.
  • the third preset direction here can be determined according to the installation position of the camera module or the installation position of other functional modules on the remote interaction device.
  • the third preset direction is a direction in which the preset imaging center faces the back of the remote interactive device.
  • the spatial orientation angle formed by the third baseline along the clockwise direction and the third target direction line is defined as a positive value
  • the spatial orientation angle formed by the third baseline along the counterclockwise direction and the third target direction line is defined as a negative value.
  • the spatial pitch angle (shown in Figure 16(b)) is defined as follows: take the spatial position of the shooting module as the base point, the line pointing in the fourth preset direction on the vertical plane as the fourth baseline, and the line connecting the spatial position to the base point as the fourth target direction line; the angle formed between the fourth target direction line and the fourth baseline in the vertical plane is the spatial pitch angle.
  • the fourth preset direction is a direction in which the preset imaging center points to the spatial position corresponding to the edge of the table captured by the photographing module when the remote interactive device is placed on the table.
  • the maximum pitch angle range of the shooting module can be 0° to 69°, and the 69° here can be set to other values according to the size of the table (as shown in Figure 16(b)), the height of the sound source object (H3 in Figure 16(b)), and the installation position of the shooting module (H2 in Figure 16(b)).
  • the fourth preset direction may also be a horizontal direction.
  • the first spatial location information may also include one of a spatial azimuth and a spatial elevation angle; or, the first spatial location information may also include a direction and/or a distance of a sound emitting location relative to a base point.
  • the first spatial position information can also be obtained based on the image corresponding to the sound source object in the panoramic video, for example by identifying the image position information of the image area where the sound emitting position is located in the image of the sound source object in the panoramic video, and determining the first spatial position information here based on that image position information.
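  • a minimal sketch of computing a first spatial azimuth angle and first spatial pitch angle from a 3D offset of the sound emitting position relative to the shooting module; the rectangular axis conventions (baseline along +y, clockwise-positive azimuth, pitch measured from the horizontal plane) are assumptions chosen only to illustrate the definitions above.

```python
import math

def spatial_azimuth_elevation(dx, dy, dz):
    """Return (azimuth_deg, pitch_deg) of a point offset (dx, dy, dz) from the
    shooting module base point.

    Illustrative assumptions: +y points along the third/fourth baseline
    (horizontal reference direction), +x points to its right (so the azimuth is
    positive clockwise when viewed from above), and +z points up; the pitch
    angle is measured from the horizontal plane, i.e. the variant in which the
    fourth preset direction is horizontal.
    """
    azimuth = math.degrees(math.atan2(dx, dy)) % 360.0
    horizontal_dist = math.hypot(dx, dy)
    pitch = math.degrees(math.atan2(dz, horizontal_dist))
    return azimuth, pitch

# Example: a speaker's mouth 1.2 m in front, 0.5 m to the right, 0.3 m above the module.
print(spatial_azimuth_elevation(0.5, 1.2, 0.3))
```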
  • Step S12 determining the sound emitting range according to the plurality of sound source location information.
  • one or more than one characteristic position point in the sound emitting range may be determined according to a plurality of sound source position information, and the sound emitting range here is calculated according to the determined characteristic position points.
  • the sound emitting range is a square area, and in other embodiments, the sound emitting range may also be a circular area, a triangular area, or an area of other shapes.
  • the area shape of the sound emitting range may be specifically determined according to the window shape of the first target display window or the window shape for displaying the sub-window corresponding to the sound source object in the first target display window.
  • when the multiple pieces of sound source position information include the above-mentioned multiple first spatial azimuth angles and multiple first spatial elevation angles, the minimum spatial azimuth angle and the maximum spatial azimuth angle among the multiple first spatial azimuth angles are determined, and the minimum spatial elevation angle and the maximum spatial elevation angle among the multiple first spatial elevation angles are determined; a plurality of first spatial corner positions corresponding to the sounding range are determined according to the minimum spatial azimuth angle, the maximum spatial azimuth angle, the minimum spatial elevation angle and the maximum spatial elevation angle; and the spatial range enclosed by the plurality of first spatial corner positions is determined as the sounding range.
  • for example, if the minimum spatial azimuth angle is X2, the maximum spatial azimuth angle is X3, the minimum spatial pitch angle is Y2 and the maximum spatial pitch angle is Y3, the four first spatial corner positions of the sounding range can be determined as (X2, Y2), (X2, Y3), (X3, Y2) and (X3, Y3), so the square spatial area enclosed by these four spatial corner positions can be determined as the sounding range.
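  • a minimal sketch of this step, assuming each sound source position sample is an (azimuth, pitch) pair in degrees; azimuth wrap-around at 0°/360° is ignored for brevity.

```python
def sounding_range_from_samples(samples):
    """Compute the square sounding range from multiple (azimuth_deg, pitch_deg)
    sound source position samples collected within the preset duration.

    Returns the min/max spatial azimuth (X2, X3), the min/max spatial pitch
    (Y2, Y3), and the four first spatial corner positions enclosing the range.
    """
    azimuths = [a for a, _ in samples]
    pitches = [p for _, p in samples]
    x2, x3 = min(azimuths), max(azimuths)
    y2, y3 = min(pitches), max(pitches)
    corners = [(x2, y2), (x2, y3), (x3, y2), (x3, y3)]
    return (x2, x3), (y2, y3), corners

# Example: positions sampled over ~0.5 s while the speaker turns their head.
samples = [(118.0, 33.5), (121.5, 34.0), (120.2, 32.8), (123.1, 33.2)]
print(sounding_range_from_samples(samples))
```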
  • the midpoint position of the sounding range may also be determined according to the multiple pieces of sound source position information, for example by determining a first average value of the multiple first spatial azimuth angles and a second average value of the multiple first spatial elevation angles, and taking the spatial position whose spatial azimuth angle is the first average value and whose spatial elevation angle is the second average value as the midpoint position; the spatial area centered on the midpoint position and whose regional characteristic parameters are preset values is then determined as the sounding range, for example the circular area centered on the midpoint position with the preset value as its radius.
  • determining the target image range of the sound source object in the panoramic video from a sounding range obtained through multiple sound source localizations helps improve the accuracy of the determined target image range, thereby ensuring that even if the sounding position moves while the sound source object is making a sound (for example, the speaker turns their head while speaking), the sub-video data corresponding to the sound source object can still be accurately obtained from the panoramic video, so that the interaction details of the sound source object are highlighted and the user experience during remote interaction is further improved.
  • the step S20 includes:
  • Step S21: determine the target space range containing the target area of the sound source object according to the sounding range, where the target area is the minimum area of the sound source object that needs to be displayed during interaction, and the target space range is greater than or equal to the sounding range;
  • the target area here may be a preset fixed area, or an area determined based on user setting parameters, or an area determined according to the type of sound source object (different types may correspond to different target areas).
  • the target area may be the head, upper body, or the area above the shoulders, etc.; when the sound source object is a device, the target area may be a display area on the device.
  • the target area is larger than the sound emitting range, and the target space range is larger than or equal to the target area.
  • the target space range is a square area.
  • the target spatial range may be a circular area, a triangular area or other irregularly shaped areas.
  • the sound emitting range may be directly used as the target spatial range; or the spatial range obtained by amplifying the sound range according to preset rules may be used as the target spatial range.
  • the target spatial range is an area range characterized based on the spatial coordinate system in the foregoing embodiments.
  • the area adjustment value corresponding to the sounding range may be obtained, and the target space range may be obtained after the sounding range is enlarged according to the area adjustment value.
  • the area adjustment value here may be a preset fixed parameter, or a parameter determined according to the actual scene conditions in the target space.
  • Step S22: determine, according to a preset corresponding relationship, that the image range corresponding to the target space range in the panoramic video is the target image range; wherein the preset corresponding relationship is the correspondence between spatial positions in the target space and image positions in the panoramic video.
  • the preset corresponding relationship here is the coordinate transformation relationship between the image coordinate system and the space coordinate system mentioned in the above embodiments.
  • the spatial position feature parameters corresponding to the target spatial range are converted into image position feature parameters based on the preset corresponding relationship, and the target image range is determined based on the converted image position parameters.
  • for example, multiple spatial corner positions of the target spatial range can be converted into multiple image corner positions based on the preset corresponding relationship, and the image area enclosed by the multiple image corner positions in the panoramic video is used as the target image range; or, when the target spatial range is a circular area, the spatial midpoint position of the target spatial range is converted into an image midpoint position based on the preset corresponding relationship, the spatial radius corresponding to the target spatial range is converted into an image radius, and the circular image area in the panoramic video centered on the image midpoint position with the image radius as its radius is used as the target image range.
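  • a minimal sketch of step S22 for the corner-position case, where the preset corresponding relationship is represented as a callable; the linear example mapping is purely an illustrative assumption, since the real correspondence depends on calibration between the spatial coordinate system and the image coordinate system.

```python
def target_image_range(space_corners, space_to_image):
    """Convert the corner positions of the target space range into image corner
    positions via the preset space->image correspondence, and return the
    enclosing image range as (min_azimuth, max_azimuth, min_pitch, max_pitch)."""
    image_corners = [space_to_image(az, pitch) for az, pitch in space_corners]
    azs = [c[0] for c in image_corners]
    pitches = [c[1] for c in image_corners]
    return min(azs), max(azs), min(pitches), max(pitches)

# Illustrative correspondence: image angles equal spatial angles up to a fixed
# offset and scale (a placeholder for the calibrated preset relationship).
def example_space_to_image(az, pitch):
    return (az - 5.0) % 360.0, pitch * 0.95

corners = [(100.0, 10.0), (100.0, 40.0), (140.0, 10.0), (140.0, 40.0)]
print(target_image_range(corners, example_space_to_image))
```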
  • using the image area corresponding to the determined target space range in the panoramic video as the target image range helps ensure that the sub-video data within the determined target image range contains the image of the target area of the sound source object, so that the extracted sub-video data accurately contains all the details required for the interaction of the sound source object, further improving the user experience during remote interaction.
  • step S21 includes: obtaining the total number of objects allowed to make sounds in the target space, and obtaining second spatial position information of a target spatial position within the sounding range; determining a size characteristic value of the target spatial range according to the total number; and determining the target spatial range according to the second spatial position information and the size characteristic value.
  • the objects that are allowed to emit sound include devices and human bodies that have the ability to emit sound.
  • the total number here is determined by obtaining the parameters input by the user, and can also be determined by performing target recognition on the panoramic video. For example, if there are 8 people, 1 mobile phone, and 1 display and playback device in the target space, it can be determined that there are 10 objects that are allowed to make sounds.
  • the target spatial position is specifically a position that characterizes the area position of the sound emitting range.
  • the target spatial position is the center position of the sound emitting range.
  • the target spatial position may also be an edge position, a corner position, a center of gravity position, or any other position of the sound emitting range.
  • the size characteristic value may be a characteristic parameter representing the size of the region, such as the area, radius, length or width of the target spatial range. When the total number is greater than a set value, the size characteristic value is a preset size characteristic value; when the total number is less than or equal to the set value, the size characteristic value can be calculated according to the total number, so that the interaction details of the sound source objects are accurately displayed.
  • part or all of the spatial position information corresponding to the target spatial range can be obtained, and the target spatial range can be determined based on the obtained spatial position information.
  • in an embodiment, the target spatial position is the central position of the sounding range, the second spatial position information includes a second spatial azimuth angle of the target spatial position with the shooting module as the base point, and the shooting module is configured to collect the panoramic video; the step of determining the target spatial range according to the second spatial position information and the size characteristic value includes: determining a spatial azimuth angle adjustment value according to the size characteristic value; adjusting the second spatial azimuth angle according to the spatial azimuth angle adjustment value to obtain the maximum critical value and the minimum critical value of the azimuth angle range of the target spatial range with the shooting module as the base point; determining a plurality of second spatial corner positions of the target spatial range according to the maximum critical value, the minimum critical value and the preset pitch angle range of the target spatial range with the shooting module as the base point; and determining the spatial range enclosed by the plurality of second spatial corner positions as the target spatial range.
  • the size characteristic value is the width of the target spatial range, and the larger the width, the larger the spatial azimuth adjustment value; the smaller the width, the smaller the spatial azimuth adjustment value.
  • the size characteristic value may also be the radius of the target spatial range.
  • reducing the second space azimuth angle according to the space azimuth adjustment value can obtain the minimum critical value of the space azimuth angle corresponding to the target space range, and enlarging the second space azimuth angle according to the space azimuth adjustment value can obtain the space corresponding to the target space range Maximum threshold for azimuth.
  • the range of the preset pitch angle can be determined in combination with information such as the installation position of the shooting module, the size of the table for placing the remote interactive device, and the maximum height allowed by the sound source object.
  • the minimum pitch angle value in the preset pitch angle range may be the angle between the line connecting the edge position of the table for placing the remote interaction device and the shooting module and the above-mentioned fourth baseline (for example, 0 degrees, etc.);
  • the maximum pitch angle value in the preset pitch angle range may be the angle between the line connecting the highest position of the sound source object and the camera module and the fourth baseline (for example, 69 degrees, etc.).
  • the preset pitch angle range may also be determined according to the preset image ratio and the above determined maximum critical value and minimum critical value.
  • the minimum value of the preset pitch angle range is the minimum spatial pitch angle of the target spatial range
  • the maximum value of the preset pitch angle range is the maximum spatial pitch angle of the target spatial range
  • for example, if the total number of objects allowed to make sounds in the target space is n and the maximum azimuth angle range of sound recognition by the microphone array is 0° to 360°, then the width of the target spatial range is 360°/n; since the target spatial position is the central position, the spatial azimuth angle adjustment value can be determined to be 360°/2n, and the second spatial azimuth angle of the central position of the sounding range can be determined as (X2 + X3)/2; the four spatial corner positions that determine the target spatial range are then (X4, Y4), (X4, Y5), (X5, Y4) and (X5, Y5), and the quadrilateral spatial area enclosed by these four spatial corner positions is the target spatial range.
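  • a minimal sketch of the example above; variable names mirror the description, and the preset pitch angle range is supplied as an assumed placeholder value.

```python
def target_space_range(x2, x3, total_objects, preset_pitch_range=(0.0, 69.0)):
    """Determine the target space range for one sound source object.

    x2, x3: min/max spatial azimuth of the sounding range (degrees).
    total_objects: total number n of objects allowed to make sounds, so the
    range width is 360/n degrees and the azimuth adjustment value is 360/(2n).
    Returns the four second spatial corner positions built from X4/X5 and Y4/Y5.
    """
    center_azimuth = (x2 + x3) / 2.0          # second spatial azimuth of the range center
    adjust = 360.0 / (2 * total_objects)      # spatial azimuth angle adjustment value
    x4 = (center_azimuth - adjust) % 360.0    # minimum critical value of the azimuth range
    x5 = (center_azimuth + adjust) % 360.0    # maximum critical value of the azimuth range
    y4, y5 = preset_pitch_range               # preset pitch angle range (assumed values)
    return [(x4, y4), (x4, y5), (x5, y4), (x5, y5)]

# Example: sounding range azimuths 118-124 deg, 10 objects allowed to make sounds.
print(target_space_range(118.0, 124.0, 10))
```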
  • a fourth embodiment of the remote interaction method of the present application is proposed.
  • in this embodiment, referring to FIG. 6, after step S20 the method further includes:
  • Step S201 identifying the image area where the human body image is located within the range of the target image
  • a human body recognition algorithm may be used to identify the image data within the range of the target image to determine the image area. For example, face recognition is performed on the image data within the target image range to determine the face image, and human figure estimation is performed based on the face image to obtain the image area here.
  • the image area is a quadrilateral area. In other embodiments, the image area can also be a circular area or a human-shaped area.
  • Step S202 determining the ratio of the area of the image region to the area of the target image range
  • Step S203 judging whether the ratio is smaller than a preset value
  • in response to the ratio being smaller than the preset value, step S30 is executed after step S204 is executed; in response to the ratio being greater than or equal to the preset value, step S30 is executed directly.
  • the preset value is specifically the minimum value of the area ratio between the image area allowed by the comfortable distance and the target image range during face-to-face interaction between people.
  • a ratio smaller than the preset value indicates that the user of the remote device will perceive the sound source object in the sub-video data as too far away and cannot obtain the required interaction details from the output sub-video data; a ratio greater than or equal to the preset value indicates that the user of the remote device can clearly see the interaction details of the sound source object when viewing the sub-video data.
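  • a minimal sketch of steps S201 to S203, using an OpenCV Haar-cascade face detector and a crude face-to-figure heuristic as stand-ins for the human body recognition described above; both are illustrative assumptions, not the application's algorithm.

```python
import cv2

def body_image_area_ratio(panorama_bgr, img_range):
    """Steps S201-S203: locate the human body image area inside the target image
    range and return the ratio of its area to the area of that range.

    `img_range` is a (left, top, right, bottom) pixel rectangle; the face
    detector and the face-to-figure estimate are illustrative stand-ins only.
    """
    left, top, right, bottom = img_range
    crop = panorama_bgr[top:bottom, left:right]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                  # no human body image recognised
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    body_area = (3 * w) * (4 * h)                    # crude quadrilateral figure estimate
    range_area = (right - left) * (bottom - top)
    return body_area / range_area

# If the returned ratio is smaller than the preset value, the target image range
# is reduced in step S204 (see the worked angle computation at the end of this section).
```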
  • Step S204 reducing the range of the target image so that the ratio is greater than or equal to the preset value.
  • the range of the target image can be reduced according to a preset fixed range adjustment parameter, or the range of the target image can be reduced according to a determined range adjustment parameter such as a size characteristic or a ratio of the image area.
  • the method may then return to step S201 to ensure that the above-mentioned ratio corresponding to the adjusted target image range is greater than or equal to the preset value.
  • the image area is enlarged according to the preset value to obtain a reduced target image range.
  • the image position adjustment value for enlarging the image area may be determined according to a preset value, and the image area is adjusted according to the image position adjustment value to obtain a reduced target image range.
  • specifically, the process of enlarging the image area according to the preset value to obtain the reduced target image range is as follows: determine the image position parameter of the target image position in the image area and determine the image position adjustment value for enlarging the image area; adjust the image position parameter according to the image position adjustment value to obtain a target image position parameter; and determine the reduced target image range according to the target image position parameter.
  • the target image position is the image center position of the image area.
  • the target image position may also be an image position corresponding to the sound emitting position of the sound source object in the image area, an edge position, a corner position, or any other arbitrary position of the image area.
  • the image position parameter may specifically be a characteristic parameter of the image position characterized by the image coordinate system mentioned in the above embodiment.
  • the image position parameters include the first image azimuth and/or the first image elevation angle of the target image position based on the preset imaging center.
  • the target image position may also include the distance and/or direction between the target image position and the preset imaging center.
  • the width of the image area specifically refers to the difference between the maximum azimuth angle and the minimum azimuth angle corresponding to the image area.
  • the width of the image area may also be the distance between edges on both sides of the image area along the horizontal direction.
  • the target width of the enlarged image area can be calculated according to the preset value and the width of the image area, and the image position adjustment value here can be determined according to the target width.
  • 1/2 of the target width can be used as the image position adjustment value; when the target image position is the image edge position of the upper edge of the image area along the horizontal direction, the target width can be directly used as the image position adjustment value.
  • the image position parameter adjusted according to the image position adjustment value may be directly used as the target image position parameter.
  • for example, the image position parameters include an image azimuth angle and an image pitch angle, and the image position adjustment value includes an image azimuth angle adjustment value and an image pitch angle adjustment value; the target image azimuth angle is obtained after adjusting the image azimuth angle according to the image azimuth angle adjustment value, the target image pitch angle is obtained after adjusting the image pitch angle according to the image pitch angle adjustment value, and the target image position parameters include the target image azimuth angle and the target image pitch angle.
  • alternatively, a first image position parameter can be obtained after adjusting the image position parameter according to the image position adjustment value, and the target image position parameter can then be calculated from the first image position parameter and preset parameters.
  • for example, the image position parameter includes the image azimuth angle and the image position adjustment value includes an azimuth angle adjustment value; the target image azimuth angle is obtained after adjusting the image azimuth angle, the target image pitch angle is determined according to the target image azimuth angle and the target image ratio of the reduced target image range, and the target image position parameters include the target image azimuth angle and the target image pitch angle. For another example, the image position parameter includes the image pitch angle and the image position adjustment value includes a pitch angle adjustment value; the target image pitch angle is obtained after adjusting the image pitch angle according to the pitch angle adjustment value, the target image azimuth angle is determined according to the target image pitch angle and the target image ratio of the reduced target image range, and the target image position parameters include the target image azimuth angle and the target image pitch angle.
  • reducing the target image range increases the proportion of the human figure image within the target image range, ensuring that the proportion of the human figure image in the output sub-video data is not too small, so that the user of the remote device can visually experience face-to-face communication with the target space based on the output sub-video data, can clearly see the interaction details of the sound source object corresponding to the sub-video data during remote interaction, and thus has a further improved user experience during remote interaction.
  • the enlarged image area based on the preset value is used as the reduced target image range, which can ensure that the range of the human body presented by the human body image remains unchanged before and after the target image range is reduced, and ensure that the interaction details of the sound source object can be enlarged and presented.
  • in an embodiment, the image position parameter includes the first image azimuth angle of the target image position with the preset imaging center corresponding to the panoramic video as the base point, and the image position adjustment value includes an image azimuth angle adjustment value; the step of determining the target image position parameter according to the image position adjustment value and the image position parameter includes: adjusting the first image azimuth angle according to the image azimuth angle adjustment value to obtain the maximum image azimuth angle and the minimum image azimuth angle of the adjusted target image range with the preset imaging center as the base point; determining the maximum image pitch angle and the minimum image pitch angle of the reduced target image range according to the maximum image azimuth angle, the minimum image azimuth angle, the position feature parameter of the target image position in the vertical direction of the image area, and the image ratio of the target image range; and determining the maximum image azimuth angle, the minimum image azimuth angle, the maximum image pitch angle and the minimum image pitch angle as the target image position parameters. Based on this, the reduced target image range is determined according to the target image position parameters.
  • the target image position is a position located on the vertical bisector of the image area, at the same distance from the edges on both sides of the image area; for example, it may be the midpoint of the image area, or any position on the vertical bisector other than the midpoint.
  • the minimum image azimuth is obtained after reducing the first image azimuth according to the image azimuth adjustment value
  • the maximum image azimuth is obtained after increasing the first image azimuth according to the image azimuth adjustment value.
  • define the difference between the maximum pitch angle and the minimum pitch angle corresponding to the image area as the target angle amplitude, define the difference between the maximum pitch angle corresponding to the image area and the image pitch angle of the target image position as a first difference, and define the difference between the image pitch angle of the target image position and the minimum pitch angle corresponding to the image area as a second difference; the position feature parameter of the target image position in the vertical direction of the image area is specifically the ratio of the first difference to the target angle amplitude or the ratio of the second difference to the target angle amplitude.
  • the target angle amplitude is a fixed value, and in other embodiments, the target angle amplitude may also be a value determined according to actual scene parameters in the target space.
  • the target image position is an image position corresponding to the central position of the sound emitting range in the image area.
  • the image scale is the ratio of the length to the width of the image area.
  • the image ratio of the target image range is the ratio between its width and its length before the reduction: define the third difference as the difference between the maximum and minimum image azimuth angles of the target image range before reduction, and the fourth difference as the difference between the maximum and minimum image pitch angles before reduction; the image ratio of the target image range is then the ratio between these two differences.
  • after the minimum and maximum image azimuth angles are obtained, the target width of the reduced target image range (the difference between the maximum and minimum image azimuth angles) can be calculated; since the image ratio is the same before and after scaling, the target length of the reduced range (the difference between the maximum and minimum image pitch angles) can be calculated from the target width and the image ratio, and the maximum and minimum image pitch angles can then be calculated from the target length and the position feature parameter of the target image position in the vertical direction.
  • for example, define the minimum image azimuth as X8 and the maximum image azimuth as X9, with the preset value 0.9:1, i.e. the area ratio of the image area to the target image range; as shown in Figure 14(a), the corner positions of the image area where the human body image is located are (X6, Y6), (X6, Y7), (X7, Y6) and (X7, Y7). Since the horizontal center of the image area is unchanged before and after enlargement:
  • minimum image azimuth X8 = (X7-(X7-X6)/2) - ((X7-X6)/0.9)/2;
  • maximum image azimuth X9 = (X7-(X7-X6)/2) + ((X7-X6)/0.9)/2;
  • where X7-(X7-X6)/2 is the first image azimuth and ((X7-X6)/0.9)/2 is the image azimuth adjustment value.
  • define the minimum image pitch angle as Y8 and the maximum image pitch angle as Y9; the corner positions of the sound emitting range are (X2, Y2), (X2, Y3), (X3, Y2) and (X3, Y3) in Figure 13, and the target image position is the center position of the sound emitting range, so its image pitch angle is Y3-(Y3-Y2)/2; the corner positions of the target image range before reduction are (X4', Y4'), (X4', Y5'), (X5', Y4') and (X5', Y5') in Figure 14. Since the image ratio is unchanged before and after scaling, and the vertical position feature of the center of the sound range in the reduced target image range matches its position feature parameter in the vertical direction of the image area (e.g. 0.65), then:
  • minimum image pitch angle Y8 = (Y3-(Y3-Y2)/2) - ((X9-X8)*(Y5'-Y4')/(X5'-X4')*0.65);
  • maximum image pitch angle Y9 = Y8 + (X9-X8)*(Y5'-Y4')/(X5'-X4');
  • where (Y5'-Y4')/(X5'-X4') is the image ratio of the target image range.
  • the image corner positions of the image area where the enlarged human body image is located are (X8, Y8), (X8, Y9), (X9, Y8) and (X9, Y9), respectively.
  • the quadrilateral image region enclosed by these four image corner points is the reduced target image range.
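  • as a rough illustration of the formulas above (not part of the claimed method), the reduced target image range can be computed with a short Python sketch; the function name, the argument layout and the sample defaults 0.9 and 0.65 are assumptions taken from the example:

      def reduce_target_image_range(body_box, sound_box, pre_box,
                                    preset_ratio=0.9, vertical_feature=0.65):
          # body_box:  (X6, Y6, X7, Y7) - image area of the human body image
          # sound_box: (X2, Y2, X3, Y3) - sound emitting range
          # pre_box:   (X4p, Y4p, X5p, Y5p) - target image range before reduction
          # Angles are in degrees; all names are illustrative assumptions.
          X6, Y6, X7, Y7 = body_box
          X2, Y2, X3, Y3 = sound_box
          X4p, Y4p, X5p, Y5p = pre_box

          first_azimuth = X7 - (X7 - X6) / 2            # horizontal center of body area
          azimuth_adjust = ((X7 - X6) / preset_ratio) / 2  # from the 0.9:1 preset value

          X8 = first_azimuth - azimuth_adjust            # minimum image azimuth
          X9 = first_azimuth + azimuth_adjust            # maximum image azimuth

          image_ratio = (Y5p - Y4p) / (X5p - X4p)        # ratio before reduction
          centre_pitch = Y3 - (Y3 - Y2) / 2              # pitch of sound-range center

          Y8 = centre_pitch - (X9 - X8) * image_ratio * vertical_feature  # min pitch
          Y9 = Y8 + (X9 - X8) * image_ratio                               # max pitch

          # Corner positions of the reduced target image range
          return [(X8, Y8), (X8, Y9), (X9, Y8), (X9, Y9)]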
  • in this way, the size of the human body image after the target image range is reduced is roughly the same as before the reduction, while the portrait occupies a larger proportion of the sub-video data and is rendered well, further improving the user experience of remote interaction.
  • after the step of determining the image corner positions of the adjusted target image range from the maximum image azimuth angle, the minimum image azimuth angle, the maximum image pitch angle and the minimum image pitch angle, the method further includes: determining the magnification of the area enclosed by these image corner positions relative to the area of the image area; in response to the magnification being less than or equal to a preset multiple, using the image range enclosed by the image corner positions as the reduced target image range; in response to the magnification being greater than the preset multiple, using the image area enlarged by the preset multiple as the reduced target image range.
  • the magnification of the image area is limited to prevent an excessive magnification from leaving the portrait in the sub-video data too blurred after the target image range is reduced, ensuring that the interaction details of the sound source object are presented clearly when the sub-video data is output and further improving the user experience of remote interaction.
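  • a minimal sketch of this magnification limit is shown below; the preset multiple of 4, the axis-aligned rectangle representation and the reading that the area (rather than each side) is scaled by the preset multiple are assumptions, since the text leaves these details open:

      import math

      def _area(corners):
          xs = [c[0] for c in corners]
          ys = [c[1] for c in corners]
          return (max(xs) - min(xs)) * (max(ys) - min(ys))

      def cap_magnification(enlarged_corners, body_corners, preset_multiple=4.0):
          # If the enlarged range does not exceed the preset magnification, keep it.
          if _area(enlarged_corners) / _area(body_corners) <= preset_multiple:
              return enlarged_corners
          # Otherwise enlarge the body image area by exactly the preset multiple
          # about its center (assumed reading of "enlarged by a preset multiple").
          xs = [c[0] for c in body_corners]
          ys = [c[1] for c in body_corners]
          cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
          s = math.sqrt(preset_multiple)        # per-dimension scale for an area multiple
          half_w = (max(xs) - min(xs)) / 2 * s
          half_h = (max(ys) - min(ys)) / 2 * s
          return [(cx - half_w, cy - half_h), (cx - half_w, cy + half_h),
                  (cx + half_w, cy - half_h), (cx + half_w, cy + half_h)]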
  • after the step of identifying the image area where the human body image is located within the target image range, the method further includes: in response to a human body image existing within the target image range, performing the step of determining the ratio of the area of the image area to the area of the target image range; in response to no human body image existing within the target image range, performing the step of outputting the target image range in the first target display window and sending the first target display window to the remote device, so that the remote device displays the first target display window.
  • a fifth embodiment of the remote interaction method of the present application is proposed.
  • in this embodiment, referring to FIG. 7, after step S30 the method further includes:
  • Step S40 acquiring the spatial position change parameters of the utterance range or the image position change parameters of the human body image area within the target image range;
  • the spatial position change parameter includes the spatial azimuth change value and/or the spatial pitch angle change value of the sound emitting range with the shooting module as the base point, the shooting module being configured to collect the panoramic video; the image position change parameter includes the image azimuth change value and/or the image pitch angle change value of the image area with the preset imaging center of the panoramic video as the base point.
  • the spatial position change parameter includes the spatial azimuth change value and/or the spatial pitch angle change value of a first target position (for example, the center position) in the sound emitting range; the image position change parameter includes the image azimuth change value and/or the image pitch angle change value of a second target position (for example, the center position) of the human body image area within the target image range.
  • Step S50 adjusting the range of the target image according to the spatial position change parameter or the image position change parameter
  • the first image position parameters of some or all corner points of the current target image range can be adjusted according to the spatial position change parameter or the image position change parameter to obtain the second image position parameters of each image corner point position of the adjusted target image range.
  • when the sound source object is a human body, the target image range can be adjusted according to the image position change parameter; when the sound source object is a device with a sound function (such as a mobile phone or a speaker), the target image range can be adjusted according to the spatial position change parameter.
  • Step S60 output the sub-video data within the adjusted target image range in the first target display window, and send the adjusted first target display window to the remote device, so that the remote device displays the adjusted first target display window.
  • for example, as shown in Figure 15, define the image corner positions of the current target image range as (X8, Y8), (X8, Y9), (X9, Y8) and (X9, Y9); when the sound source object moves left or right so that the image azimuth of the humanoid image area within the target image range changes, the image azimuth (X12-X11)/2 of the center of the moved humanoid image area can be calculated from the spatial position change parameter or the image position change parameter; define the minimum image azimuth of the adjusted target image range as X13, its maximum image azimuth as X14, its minimum image pitch angle as Y13 and its maximum image pitch angle as Y14; since the size of the target image range is unchanged before and after the adjustment:
  • X13 = (X12-X11)/2 - (X9-X8)/2;
  • X14 = (X12-X11)/2 + (X9-X8)/2;
  • Y13 = Y8;
  • Y14 = Y9;
  • the image corner positions of the adjusted target image range are therefore (X13, Y13), (X13, Y14), (X14, Y13) and (X14, Y14), and the image area they enclose is the adjusted target image range.
  • when the sound source object moves up and down, the image pitch angle of the humanoid image area changes, and when it moves diagonally, both the image pitch angle and the image azimuth change; the image pitch angle range and image azimuth range of the adjusted target image range can be determined by analogy with the method above and are not elaborated here.
  • in this way, even if the sound source object moves during the interaction, its image remains fully displayed in the sub-video data output in the first target display window, effectively improving the user interaction experience during remote interaction.
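  • the horizontal-movement case above can be sketched as follows; the function and parameter names are assumptions, and the new center azimuth is assumed to have already been derived from the spatial or image position change parameter:

      def adjust_range_on_horizontal_move(current_corners, new_centre_azimuth):
          # Shift the target image range horizontally so it stays centered on the
          # moved humanoid image area while keeping its size unchanged
          # (the X13/X14, Y13/Y14 relations above).
          xs = sorted({c[0] for c in current_corners})   # X8 ... X9
          ys = sorted({c[1] for c in current_corners})   # Y8 ... Y9
          X8, X9 = xs[0], xs[-1]
          Y8, Y9 = ys[0], ys[-1]

          half_width = (X9 - X8) / 2
          X13 = new_centre_azimuth - half_width          # new minimum image azimuth
          X14 = new_centre_azimuth + half_width          # new maximum image azimuth
          Y13, Y14 = Y8, Y9                              # pitch range unchanged

          return [(X13, Y13), (X13, Y14), (X14, Y13), (X14, Y14)]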
  • outputting the sub-video data within the target image range in the first target display window includes:
  • Step S31 in response to the fact that the number of sound source objects is more than one, acquire the target number of sound source objects to be displayed in the first target display window;
  • the sound source object here may specifically include a sound source object that is currently sounding and a sound source object that was sounding before the current moment.
  • the number of targets here can be set by the user, or can be a fixed parameter set by default.
  • the number of sound source objects is greater than or equal to the target number here.
  • Step S32 determining the target number of sound source objects among more than one sound source objects as target objects
  • the target number of sound source objects may be selected by the user, or may be selected from more than one sound source objects according to preset rules, or may be randomly selected.
  • Step S33 output sub-video data within the target image range corresponding to the target object in each sub-window corresponding to the target object, and merge the target number of sub-windows in the first target display window.
  • Different target objects correspond to different sub-windows in the first target display window, and different sub-windows respectively output sub-video data of different target objects.
  • the target object is set in one-to-one correspondence with the sub-window.
  • before step S30, the panoramic video of the target space and the sound emitting range of each sound source object in the target space can be obtained, and the target image range of each sound source object in the panoramic video can be determined according to its corresponding sound emitting range.
  • the sub-video data corresponding to the target image range of the target object in the panoramic video is output in the sub-window corresponding to the target object.
  • the target number of sub-windows can be randomly arranged in the first target display window, or the target number of sub-windows can be arranged according to preset rules and then displayed in the first target display window.
  • during remote interaction, the remote user can thus obtain, at the same time, the interaction details of more than one sounding object in the target space from the video data displayed in the first target display window, further improving the user experience of remote interaction.
  • step S32 includes: acquiring the utterance state parameter corresponding to each of the sound source objects, the utterance state parameter representing the interval between the utterance time of the corresponding sound source object and the current time; Among the more than one sound source objects, the target number of sound source objects are determined as target objects according to the vocalization state parameters of each of the sound source objects.
  • the sounding state parameters are acquired as follows: acquire the current label value corresponding to each sound source object, where each label value is greater than or equal to a first preset value and characterizes the number of consecutive times the corresponding sound source object has not sounded before the current moment; update the current label value of each sound source object according to preset rules, and take each updated label value as that sound source object's sounding state parameter. The preset rules include: the label value of a sound source object currently sounding is set to the first preset value, and the label value of a sound source object not currently sounding is increased by a second preset value.
  • the tag value is updated according to the preset rules here every time there is a sound source object making a sound. If none of the sound source objects makes a sound, the tag value corresponding to each sound source object can be initialized, and the initial value of the tag value corresponding to each sound source object can be the same or different.
  • the first preset value is 0, and the second preset value is 1.
  • the first preset value and the second preset value can also be set to other values according to actual needs, such as the first preset value is 1, the second preset value is 2, and so on.
  • the minimum value allowed by the tag value is a first preset value.
  • the step of taking the target number of sound source objects as target objects according to the sounding state parameters of the sound source objects includes: arranging all sounding state parameters in ascending order to obtain an arrangement result, and determining the sound source objects corresponding to the first target number of sounding state parameters in the arrangement result as target objects; the earlier a sounding state parameter is ranked, the shorter the interval between the moment the corresponding target object sounded and the current moment.
  • the utterance state parameter may also be the interval between the utterance time of each sound source object and the current time. Based on the sequence of all interval durations from small to large, the sound source objects corresponding to the first target number of interval durations are determined as target objects, and the target number of target objects are obtained.
  • in this way, the sound source objects displayed in the first target display window are the target number of objects that sounded most recently, ensuring the real-time responsiveness and convenience of the remote interaction and further improving the user experience during remote interaction.
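  • the label-value bookkeeping and the selection of target objects can be sketched roughly as below; the dictionary layout and function names are assumptions, while the defaults follow the first preset value of 0 and second preset value of 1 mentioned above:

      def update_label_values(labels, speaking_ids, first_preset=0, second_preset=1):
          # A currently sounding object is reset to the first preset value;
          # every other object's label grows by the second preset value.
          for obj_id in labels:
              if obj_id in speaking_ids:
                  labels[obj_id] = first_preset
              else:
                  labels[obj_id] += second_preset
          return labels

      def pick_target_objects(labels, target_number):
          # Smallest label value = sounded most recently; take the first
          # target_number objects in ascending order of label value.
          return sorted(labels, key=lambda obj_id: labels[obj_id])[:target_number]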
  • the target number of sub-windows are combined and displayed in the first target display window as follows: determine, for each target object, the second image azimuth of a preset image position on its target image range, with the preset imaging center of the panoramic video as the base point; determine the arrangement order of the target number of sub-windows according to the relative sizes of the second image azimuths of the target objects; and combine and display the target number of sub-windows in the first target display window according to that arrangement order.
  • the preset image position is the center position of the target image range; in other embodiments, the preset image position may also be an edge position or other positions of the target image range.
  • the arrangement order here can be obtained by arranging the target number of sub-windows in descending order of the second image azimuth, or in ascending order of the second image azimuth.
  • the step of arranging the target number of sub-windows includes: sequentially arranging the target number of sub-windows in ascending order of the azimuth angle of the second image to obtain the arrangement order.
  • arranging and displaying the target number of sub-windows according to the relative sizes of the second image azimuths ensures that the sub-video data of the target objects output in the first target display window is arranged in the same order as the relative positions of those target objects in the target space, so the remote user can visually simulate, from the output video data, the interaction scene as if present in the target space.
  • arranging the target number of sub-windows in ascending order of the second image azimuth simulates, to the greatest extent, the scene in which the remote user interacts face-to-face in the target space, further improving the user experience of the remote interaction process.
  • the communication window in Fig. 9 is the first target display window in this embodiment
  • W1, W2, W3 are the sub-windows arranged sequentially in the first target display window
  • W4 is the virtual sub-window corresponding to a newly added sound source object, i.e. a sub-window that is not displayed in the first target display window
  • in Fig. 9 and Fig. 12, P2, P3, P5, P7, etc. represent different sound source objects.
  • W1, W2, W3, and W4 respectively correspond to a label value
  • the initial values of the label values corresponding to the target objects on W1, W2, and W3 are 1, 2, and 3 in sequence;
  • the label value of the sub-window of the sound source object that is currently sounding is 0, and the label value of the sub-window of the sound source object that is not currently sounding increases by 1; when the same sound source object makes sound continuously, the label value of its sub-window continues to be 0;
  • the maximum tag value corresponding to each sound source object is 4, and the minimum value is 0.
  • when a sound source object already in the first target display window is currently sounding, the order of the sub-windows is not adjusted and the state value of each sound source object is updated according to the rules above; when a newly added sound source object outside the first target display window is currently sounding, the sub-window corresponding to the sound source object with the largest state value in the first target display window is deleted, the sub-window of the newly added sound source object and the remaining sub-windows are sorted by image azimuth, and the state value of each sound source object is updated according to the rules above. The target number of sub-windows are then displayed in the first target display window in the latest order.
  • for example, if the sub-windows corresponding to P2, P3 and P5 are currently displayed in the first target display window, and P3, P5, P7 and P2 then sound in that order, the resulting display in the first target display window, the state value of each sound source object, and the sorting of the sub-windows of the target objects by image azimuth can be seen in FIG. 9.
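  • the window-replacement behaviour illustrated by Fig. 9 can be sketched roughly as follows; the data structures, the cap of 4 on the state value and the function name are assumptions made for illustration:

      def on_sound(windows, azimuths, states, speaker, cap=4):
          # windows:  object ids currently shown, in display order
          # azimuths: object id -> second image azimuth of its target image range
          # states:   object id -> current state (label) value
          if speaker not in windows:
              # Replace the shown object that has been silent the longest
              # (largest state value) with the new speaker, then re-sort
              # the sub-windows by image azimuth.
              dropped = max(windows, key=lambda obj_id: states[obj_id])
              windows = [w for w in windows if w != dropped] + [speaker]
              windows.sort(key=lambda obj_id: azimuths[obj_id])
          # Update every object's state value: sounding -> 0, others +1 (capped).
          for obj_id in states:
              states[obj_id] = 0 if obj_id == speaker else min(states[obj_id] + 1, cap)
          return windows, states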
  • a seventh embodiment of the remote interaction method of the present application is proposed.
  • in this embodiment, referring to FIG. 10, after the step of acquiring the panoramic image of the target space, the method further includes:
  • Step S101 identify the human-shaped image area corresponding to the reference position of the panoramic video, where the reference position is the position whose image azimuth, with the preset imaging center of the panoramic video as the base point, equals a preset angle value, and the human-shaped image area includes the complete image corresponding to the target area on the human body, the target area being the smallest area that needs to be displayed when the human body interacts;
  • the image range where the reference position is located may be a set of image positions where the difference between the corresponding image azimuth angle and the preset angle value is less than or equal to the set value.
  • the human body part recognition can be performed on the image range where the reference position is located to obtain the characteristic image of the human body part, and the human figure image area can be obtained based on the characteristic image.
  • for example, when the target area is the region of the human body at the shoulders and above, if the image range where the reference position is located contains feature images of the left shoulder and the left half of the head, the complete image of the entire shoulders and the entire head above them can be computed from those feature images and used as the humanoid image area.
  • the reference position is an image edge position of the panoramic video
  • the preset angle value is 0 degree.
  • the preset angle value may also be an angle whose offset from 0 degrees is an integer multiple of the image azimuth span of a single human body image area; for example, if the difference between the maximum and minimum image azimuth angles of a single human body image area is a, any angle whose offset from 0 degrees is an integer multiple of a can be used as the preset angle value.
  • Step S102 determining the minimum value of the image azimuth angle of the human figure image area based on the preset imaging center
  • Step S103 in response to the minimum value being smaller than the preset angle value, adjust the shooting range corresponding to the panoramic video according to the difference between the minimum value and the preset angle value, and return to the step of acquiring the panoramic video of the target space;
  • Step S104 in response to the fact that the minimum value is greater than or equal to the preset angle value, perform the step of determining the target image range where the sound source object is located in the panoramic video according to the sound emitting range.
  • the preset angle value is specifically a critical image azimuth angle used to represent whether the human body image can be completely displayed in the panoramic video.
  • the difference between the minimum value and the preset angle value can be used as the target rotation angle value, or that difference increased by a set angle value can be used as the target rotation angle value.
  • the shooting module that captures the panoramic video is controlled to rotate its shooting range horizontally by the target rotation angle value, so that the complete image corresponding to the target area on the human body is displayed in the panoramic video and no images of human body parts remain in the image range corresponding to the reference position.
  • the panoramic video can be adjusted from FIG. 11(a) to the state shown in FIG. 11(b) based on the above method.
  • the above method can ensure that the human figure image can be completely displayed in the panoramic video, so as to further improve the user interaction experience of remote interaction.
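  • steps S101 to S104 can be summarized in a small sketch; rotate_shooting_range is a hypothetical camera-control call, and the margin argument stands for the optional "set angle value" mentioned above:

      def keep_figure_intact(figure_min_azimuth, preset_angle=0.0, margin=0.0):
          # If the humanoid image area crosses the image edge (its minimum image
          # azimuth falls below the preset angle value), rotate the shooting range
          # horizontally and re-capture the panoramic video; otherwise proceed.
          if figure_min_azimuth < preset_angle:
              target_rotation = (preset_angle - figure_min_azimuth) + margin
              rotate_shooting_range(target_rotation)   # hypothetical control call
              return "re-acquire panoramic video"      # back to the acquisition step
          return "determine target image range"        # proceed as in step S104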
  • while step S30 is performed, the method further includes: outputting the panoramic video in a second target display window, and sending the second target display window to the remote device, so that the remote device displays the first target display window and the second target display window in combination.
  • the first target display window and the second target display window can be combined and then sent to the remote device, or they can be sent to the remote device separately, in which case the remote device combines the two windows for display after receiving them.
  • the first target display window A and the second target display window B are combined and displayed on the remote device.
  • before the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device so that the remote device displays it, the method further includes: acquiring a sensitivity parameter corresponding to the first target display window, the sensitivity parameter characterizing the update frequency of the video data in the first target display window, and determining from the sensitivity parameter the target duration of the interval required between sound recognitions; after that step, the method further includes: waiting for the target duration and then returning to the step of acquiring the sound emitting range of the sound source object in the target space and acquiring the panoramic video of the target space.
  • the step of acquiring the sensitivity parameter corresponding to the first target display window includes: acquiring a scene feature parameter of the current remote interaction scene or a user setting parameter, and determining the sensitivity parameter according to the scene feature parameter or the user setting parameter.
  • the scene feature parameters here may specifically include the user situation in the target space or the scene type of the remote interaction scene (such as video conference or live video broadcast, etc.).
  • the user setting parameter is a parameter about the update frequency of the video data in the first target display window input by the user to the remote interaction device based on the user's actual interaction requirement.
  • a plurality of preset sensitivity parameters can be configured in advance, with different preset sensitivity parameters corresponding to different preset durations; the sensitivity parameter for the current first target display window is selected from the preset sensitivity parameters according to the scene feature parameter or the user setting parameter, and its corresponding preset duration is used as the target duration.
  • for example, the preset sensitivity parameters may be level-1, level-2 and level-3 sensitivity, with corresponding preset durations of 0.5 seconds, 1 second and 1.5 seconds.
  • after several pieces of sub-video data have been output in the first target display window, once the target duration has elapsed the sound emitting range of the sound source object is re-identified in the target space and new sub-video data is output in the first target display window, ensuring that the update frequency of the video in the first target display window accurately matches the actual interaction requirements of the user in the current remote interaction scenario and further improving the user's interaction experience.
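  • the relationship between the sensitivity parameter and the re-detection interval can be sketched as below; the three levels and their durations follow the example above, while the loop structure and callback names are assumptions:

      import time

      PRESET_DURATIONS = {1: 0.5, 2: 1.0, 3: 1.5}   # seconds per sensitivity level

      def interaction_loop(get_sensitivity, run_steps_s10_to_s30, stop_requested):
          # Repeat sound-range detection and sub-video output, waiting the target
          # duration implied by the current sensitivity between rounds.
          while not stop_requested():
              run_steps_s10_to_s30()                 # steps S10-S30: locate, crop, send
              level = get_sensitivity()              # from scene features or user setting
              time.sleep(PRESET_DURATIONS.get(level, 1.0))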
  • the remote interaction method in the embodiment of the present application further includes: detecting a mute instruction, and stopping the remote device from outputting the audio data collected in the target space.
  • after the remote device stops outputting the audio data collected in the target space, the remote device can also output a mute prompt, so that the remote user knows the mute state in the target space from the prompt.
  • the mute command can be input through a button, a mobile phone or a computer.
  • the remote interaction method in the embodiment of the present application further includes: detecting an instruction to turn off the video, and stopping the remote device from outputting the video data collected in the target space.
  • after the remote device stops outputting the video data collected in the target space, the remote device can also output a video-off prompt, so that the remote user knows the video-off state in the target space from the prompt.
  • the command to turn off the video can be input through a button, a mobile phone or a computer.
  • the remote interaction method in the embodiment of the present application further includes: upon detecting a preset instruction, stopping the execution of steps S10 to S30 and displaying only the second target display window on the remote device, thereby protecting the privacy of people in the target space.
  • the preset command can be input through buttons, mobile phone or computer.
  • the remote interaction method in the embodiment of the present application further includes: when a user in the scene where the remote device is located is speaking, stopping the execution of steps S10 to S30; when no user in that scene is speaking, executing steps S10 to S30.
  • whether a user in the scene where the remote device is located is speaking can be determined from information sent by the remote device.
  • the embodiment of the present application also proposes a computer program, and when the computer program is executed by a processor, the relevant steps in any embodiment of the above remote interaction method are implemented.
  • the embodiment of the present application also proposes a computer storage medium, on which a remote interaction program is stored, and when the remote interaction program is executed by a processor, the relevant steps of any embodiment of the above remote interaction method are implemented.
  • the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disk) and includes several instructions to enable a terminal device (which may be a mobile phone, computer, server, remote interaction device, or network device, etc.) to execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A remote interaction method, a remote interaction device and a computer storage medium, the method including: acquiring the sound emitting range of a sound source object in a target space and acquiring a panoramic video of the target space (S10); determining, according to the sound emitting range, the target image range where the sound source object is located in the panoramic video (S20); and outputting the sub-video data within the target image range in a first target display window and sending the first target display window to a remote device, so that the remote device displays the first target display window (S30).

Description

远程交互方法、远程交互设备以及计算机存储介质
相关申请
本申请要求于2022年1月29号申请的、申请号为202210111658.4的中国专利申请的优先权,其全部内容通过引用结合于此。
技术领域
本申请涉及远程交互技术领域,尤其涉及远程交互方法、远程交互设备和计算机存储介质。
背景技术
随着经济技术的发展,远程交互设备在日常生活、工作中的应用越来越广泛。例如,远程交互设备可应用于视频直播、视频互动、音视频会议等场景。目前,远程交互设备一般通过拍摄模块获取空间的全景视频数据,将全景视频数据显示于交互界面上,以实现与远端用户的交互。
然而,在交互场景中涉及的人员较多时,全景视频的输出容易导致交互界面所展现的只有距离拍摄模块较近用户的面部表情、肢体动作等交互细节,距离拍摄模块较远的用户的交互细节则无法在交互界面上展现,并且远端用户难以从全景视频中分辨当前发言的人员,这导致用户交互体验不佳。
申请内容
本申请的主要目的在于提供一种远程交互方法、远程交互设备以及计算机存储介质,旨在实现交互界面可突出显示发声对象的视频数据,提高远程交互过程中用户的交互体验。
为实现上述目的,本申请提供一种远程交互方法,所述远程交互方法包括以下步骤:
获取目标空间内声源对象的发声范围,获取所述目标空间的全景视频;
根据所述发声范围确定所述声源对象在所述全景视频中所处的目标图像范围;
在第一目标显示窗口内输出所述目标图像范围内的子视频数据,并发送所述第一目标显示窗口至远端设备,以使所述远端设备显示所述第一目标显示窗口。
此外,为了实现上述目的,本申请还提出一种远程交互设备,所述远程交互设备包括:
拍摄模块;
麦克风阵列;以及
控制装置,所述全景拍摄模块和所述麦克风阵列均与所述控制装置连接,所述控制装置包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的远程交互程序,所述远程交互程序被所述处理器执行时实现如上任一项所述的远程交互方法的步骤。
此外,为了实现上述目的,本申请还提出一种计算机存储介质,所述计算机存储介质上存储有远程交互程序,所述远程交互程序被处理器执行时实现如上任一项所述的远程交互方法的步骤。
本申请提出的一种远程交互方法,该方法根据目标空间内声源对象的发声范围确定声源对象在所述目标空间的全景视频中所处的目标图像范围,在远端设备的第一目标显示窗口内输出目标图像范围内的子视频数据,子视频数据的输出可实现目标空间内发声对象可在远端设备的交互界面中突出显示,相比于全景视频更能体现目标空间内发声对象的交互细节,从而有效提高远程交互过程中用户的交互体验。
附图说明
图1为本申请远程交互设备所应用的远程交互场景的场景示意图;
图2为本申请远程交互设备一实施例运行涉及的硬件结构示意图;
图3为本申请远程交互方法第一实施例的流程示意图;
图4为本申请远程交互方法第二实施例的流程示意图;
图5为本申请远程交互方法第三实施例的流程示意图;
图6为本申请远程交互方法第四实施例的流程示意图;
图7为本申请远程交互方法第五实施例的流程示意图;
图8为本申请远程交互方法第六实施例的流程示意图;
图9为本申请远程交互方法实施例涉及的不同声源对象发声过程中目标对象的确定过程及其子窗口的排序过程的示意图;
图10为本申请远程交互方法第七实施例的流程示意图;
图11为本申请远程交互方法实施例涉及的拍摄范围调整前后采集的全景视频的示意图;
图12为本申请远程交互方法实施例涉及远程交互过程中第一目标显示窗口与第二目标显示窗口同时显示时的界面示意图;
图13为本申请远程交互方法实施例中发声范围以及目标空间范围确定过程涉及的空间范围示意图;
图14为本申请远程交互方法实施例中目标图像范围的确定以及调整涉及的图像范围示意图;
图15为本申请远程交互方法实施例涉及的声源对象移动触发目标图像范围调整的示意图;
图16为本申请远程交互方法实施例涉及的空间方位角和空间俯仰角的示意图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
本申请实施例的主要解决方案是:获取目标空间内声源对象的发声范围,获取所述目标空间的全景视频;根据所述发声范围确定所述声源对象在所述全景视频中所处的目标图像范围;在第一目标显示窗口内输出所述目标图像范围内的子视频数据,并发送所述第一目标显示窗口至远端设备,以使所述远端设备显示所述第一目标显示窗口。
由于现有技术中,在交互场景中涉及的人员较多时,全景视频的输出容易导致交互界面所展现的只有距离拍摄模块较近用户的面部表情、肢体动作等交互细节,距离拍摄模块较远的用户的交互细节则无法在交互界面上展现,并且远端用户难以从全景视频中分辨当前发言的人员,这导致用户交互体验不佳。
本申请提供上述的解决方案,旨在实现交互界面可突出显示发声对象的视频数据,提高远程交互过程中用户的交互体验。
本申请实施例提出一种远程交互设备,应用于远程交互场景,这里的远程交互场景可以是同一空间内远距离的交互场景,也可以是不同空间或不同地域之间远距离的交互场景。例如,远程交互设备可应用于视频直播、视频互动、远程会议等场景。
其中,结合图1,对远程交互设备一实施例中所应用的交互场景进行介绍:远程交互设备所在空间内可设置有桌子,桌子可以是方形的桌子、也可以圆桌、还可以是任意形状的桌子。远程交互设备可放置于桌子上,例如远程交互设备可放置于桌子的中心、桌子的边缘或桌子上其他任意位置。需要进行远程交互的人员(例如多个参会人员)围绕在桌子的周围,另外除了人员以外设置为输出交互所需信息的设备(例如显示器、音频播放设备、平板电脑、手机登)也可设于桌子的一侧或桌子的边缘。
在本实施例中,参照图2,远程交互设备包括拍摄模块2、麦克风阵列3以及控制装置1。全景拍摄模块2和麦克风阵列3均与控制装置1连接。具体的,远程交互设备可包括壳体,拍摄模块2和麦克风阵列3均固定于壳体。
拍摄模块2被配置为采集其所在空间的全景视频数据。拍摄模块2还可被配置为采集其所在空间的场景图片等。在本实施例中,拍摄模块2设于壳体的顶部。在其他实施例中,拍摄模块2也可环绕壳体周向设置。
在本实施例中,拍摄模块2为鱼眼摄像头。在另一实施例中,拍摄模块2还可包括多个摄像头或可移动的摄像头,以通过多个摄像头采集的多个视频数据或可移动摄像头采集的多个视频数据进行拼接得到这里的全景视频数据。
具体的,拍摄模块2的取景角度范围可包括拍摄模块2所允许采集的图像所能覆盖的最大方位角范围和/或拍摄模块2所允许采集的图像所能覆盖的最大俯仰角范围。拍摄模块2的图像方位角的定义如下:以拍摄模块2的预设成像中心指向水平面上第一预设方向的线为第一基线,图像位置与预设成像中心的连线为第一目标方向线,第一目标方向线与第一基线形成的水平夹角则为图像方位角。这里的第一预设方向可根据拍摄模块2的安装位置或远程交互设备上其他功能模块的安装位置进行确定。在本实施例中,第一预设方向为预设成像中心朝向远程交互设备背面的方向。其中,定义第一基线沿顺时针方向与第一目标方向线形成的图像方位角为正值,定义第一基线沿逆时针方向与第一目标方向线形成的图像方位角为负值。基于此,为了便于计算,采用正值的图像方位角定义最大方位角范围,则拍摄模块2的最大方位角范围可为0度至360度。
拍摄模块2的图像俯仰角的定义如下:以拍摄模块2的预设成像中心指向垂直面上第二预设方向的线为第二基线,图像位置与预设成像中心的连线为第二目标方向线,第二目标方向线与第二基线在垂直面上形成夹角则为图像俯仰角。其中,第二目标方向线在第二基线下方时图像俯仰角为负值;第二目标方向线在第二基线上方时图像俯仰角为正值。在本实施例中,第二预设方向为远程交互设备放置于桌子上时,预设成像中心指向拍摄模块2所拍摄到的桌子边缘对应的图像位置的方向。基于此,为了便于计算,采用正值的图像俯仰角定义最大俯仰角范围,则拍摄模块2的最大俯仰角范围可为0度至69度,这里的69度可根据桌子尺寸的不同、声源对象高度的不同以及拍摄模块2安装位置的不同等实际情况设置为其他数值。另外,在其他实施例中,第二预设方向也可以是水平方向。
需要说明的是,拍摄模块2可预先设置有对其采集的图像数据进行图像位置表征的图像坐标系,图像坐标系可为极坐标或直角坐标系,这里的预设成像中心为图像坐标系的坐标原点。
进一步的,在本实施例中,拍摄模块2为鱼眼摄像头,其取景的最大方位角范围在200度至230度之间。在其他实施例中,最大方位角范围也可以更大,例如360度、270度等。
麦克风阵列3具体被配置为采集其所在空间中来自于不同空间方向的声音信号。控制装置1可根据麦克风阵列3采集的声音数据对空间内声源所在位置进行定位。麦克风阵列3具体包括多个阵列排布的麦克风。具体的,在本实施例中,多个麦克风呈环形阵列排布。在其他实施例中,多个麦克风也可呈三角形阵列排布或不规则形状排布。
具体的,壳体上可设有被配置为安装麦克风阵列3的多个孔位,孔位与麦克风阵列3中的麦克风一一对应设置,多个孔位可设于壳体的顶壁或多个孔位可设于壳体的侧壁且沿壳体的周向设置。
在本实施例中,麦克风阵列3拾音的方位角的角度范围为0度至360度,麦克风阵列3拾音的俯仰角的角度范围为16度至69度。需要说明的是,麦克风阵列3拾音的角度范围不限制于上述的数值范围内,可根据实际情况设置更大或更小的角度范围。
其中,参照图2,控制装置1包括:处理器1001(例如CPU),存储器1002,计时器1003等。控制装置1中的各部件通过通信总线连接。存储器1002可以是高速RAM存储器,也可以是稳定的存储器(non-volatile memory),例如磁盘存储器。具体的,在本实施例中,存储器1002包括嵌入式多媒体存储卡(eMMC)和双倍速率同步动态随机存储器(DDR)。在其他实施例中,存储器1002可选的还可以是独立于前述处理器1001的存储装置。
本领域技术人员可以理解,图2中示出的装置结构并不构成对装置的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
如图2所示,作为一种计算机存储介质的存储器1002中可以包括远程交互程序。在图2所示的装置中,处理器1001可以被配置为调用存储器1002中存储的远程交互程序,并执行以下实施例中远程交互方法的相关步骤操作。
进一步的,在另一实施例中,参照图2,远程交互设备还可包括扬声器4,扬声器4与控制装置1连接。扬声器4可被配置为播放声频数据,这里的声频数据可为远端设备发送的远端设备所采集的声音数据,也可以是远程交互设备基于有线通信连接或无线通信连接获取到其所在空间内其他终端输入的声音数据。
具体的,扬声器4可安装于壳体内部,壳体上可设有与扬声器4所在腔体连通的多个出声孔,多个出声孔呈环形排布设于壳体的侧壁,扬声器4发出的声音可通过多个出声孔均匀地朝360度不同的方向传播。
具体的,扬声器4以最大音量播放声音时,确定与扬声器4距离为等于预设距离的空间位置上检测的声压级大于或等于预设分贝值。在本实施例中,预设距离为1米,预设分贝值为60dB。在其他实施例中,预设距离也可为1.3米、1.5米、2米等,预设分贝值也可为70dB、75dB等。
进一步的,在另一实施例中,参照图2,远程交互设备还包括按键模块5。按键模块5与控制装置1连接。按键模块5可为安装于壳体上的机械按键,也可以为安装于壳体上可设置为显示虚拟按键的触控模块,还可以其他可生成高低平电信号的按键模块5。按键模块5具体被配置为用户与远程交互设备之间的人机交互,具体的按键模块5响应于用户操作可生成相应的键值,控制装置1可被配置为获取按键模块5所生成的键值并根据键值对应的指令运行。
进一步的,在另一实施例中,参照图2,远程交互设备还可包括通信模块6,通信模块6具体为无线通信模块6,可被配置为实现远程交互设备与外部设备的无线通信连接。在本实施例中,无线通信模块6为蓝牙模块。在其他实施例中,无线通信模块6也可为WIFI模块、ZigBee模块、射频通信模块6等其他任意类型无线通信模块6。远程交互设备的控制终端(如手机、笔记本电脑、平板电脑、智能手表等)可基于无线通信模块6与远程交互设备建立无线通信连接,远程交互设备可基于无线通信连接接收控制终端发送的用户输入的控制指令或所获取的音视频数据。
进一步的,在另一实施例中,参照图2,远程交互设备还可包括数据接口7,数据接口7与控制装置1连接。数据接口7可被配置为与远程交互设备外部接入互联网的计算机设备有线通信连接。在本实施例中,数据接口7为USB接口。在其他实施例中,数据接口7也可为其他类型的接口,例如IEEE接口等。控制装置1可将需要远端设备输出的音视频数据基于数据接口7发送至计算机设备,计算机设备可通过互联网发送至远端设备,以使远端设备可输出远程交互设备所采集的音视频数据。此外,计算机设备与远程交互设备之间的控制信号可基于数据接口7双向传输。
其中,与远程交互设备连接的计算机设备中可安装有预设应用程序(例如直播软件、会议软件等),预设应用程序可完成远程交互设备与远端设备各自产生的音视频数据在互联网的双向传输。
本申请实施例还提供一种远程交互方法,应用于上述远程交互设备。
参照图3,提出本申请远程交互方法第一实施例。在本实施例中,所述远程交互方法包括:
步骤S10,获取目标空间内声源对象的发声范围,获取所述目标空间的全景视频;
目标空间具体为远程交互设备所处的有限空间范围。
声源对象具体为目标空间内发出声音的对象,可为人体、也可为发出声音的装置(如手机、音箱、平板电脑等)。
发声范围具体为声源对象发声过程中其发声位置(如人体的嘴部等)活动的最大空间范围。发声范围可通过检测声源对象的声音信号进行确定,也可以通过检测声源对象的图像信号进行确定。
全景视频具体为拍摄模块连续采集的多个曲面图像帧(如球面图像或柱面图像)形成的多媒体数据,每个曲面图像帧的曲面中心为拍摄模块的预设成像中心。具体的,实时获取拍摄模块采集的数据可获得这里的全景视频。
需要说明的是,声音范围和全景视频在同一时间检测。
步骤S20,根据所述发声范围确定所述声源对象在所述全景视频中所处的目标图像范围;
具体的,可预先设置有目标空间的空间位置与全景视频中图像位置的之间的转换关系。基于该转换关系,可直接将发声范围对应的空间位置特征参数转换为图像位置特征参数,将转换得到的图像位置特征参数所对应的图像范围作为目标图像范围。另外,也可按照预设规则对发声范围进行放大后获得声源对象的目标区域(如人体的头部、人体的上身、整个播放设备等)对应的空间范围,基于该转换关系将所获得的空间范围对应的空间位置特征参数转换为图像位置特征参数,将转换得到的图像位置特征参数所对应的图像范围作为目标图像范围。
步骤S30,在第一目标显示窗口内输出所述目标图像范围内的子视频数据,并发送所述第一目标显示窗口至远端设备,以使所述远端设备显示所述第一目标显示窗口。
远端设备具体为设置为与远程交互设备进行音视频数据双向传输、将接收到远程交互设备发送的音视频数据进行输出的设备,以实现远端设备的用户与目标空间内的用户进行远程交互。
第一目标显示窗口具体为设置为显示目标空间内所有允许发声的对象中声源对象的视频数据的窗口,以实现远端设备的用户在视觉上实现与目标空间内的声源对象进行近距离交流。
其中,在存在多于一个声源对象时,每个声源对象对应一个目标图像范围,每个声源对象在第一目标显示窗口内对应一个子窗口,则将每个声源对象对应的目标图像范围内的子视频数据在对应的子窗口输出,多于一个子窗口合并形成第一目标显示窗口。
具体的,可提取全景视频在目标图像范围内的子视频数据,将子视频数据添加至设置为远程交互的预设应用中的第一目标显示窗口中输出,将显示有子视频数据的第一目标显示窗口发送至任意安装并开启预设应用的远端设备,远端设备打开预设应用均可将第一目标显示窗口及其中的子视频数据进行显示。另外,从全景视频中提取得到的子视频数据也可直接基于互联网发送至远端设备,远端设备可将子视频数据调整成与第一目标显示窗口适配的显示数据,第一目标显示窗口具体为设置为远程交互的预设应用中的窗口,远端设备可在其安装的预设应用的第一目标显示窗口中对调整后的显示数据进行显示。又或者,在确定目标图像范围之后,可将目标图像范围和全景视频发送至远端设备,远端设备可基于所接收的目标图像范围对全景视频中相应位置的视频数据进行提取得到子视频数据,将提取得到的子视频数据在其预设应用的第一目标显示窗口内输出。
本申请实施例提出的一种远程交互方法,该方法根据目标空间内声源对象的发声范围确定声源对象在所述目标空间的全景视频中所处的目标图像范围,在远端设备的第一目标显示窗口内输出目标图像范围内的子视频数据,子视频数据的输出可实现目标空间内发声对象可在远端设备的交互界面中突出显示,相比于全景视频更能体现目标空间内发声对象的交互细节,从而有效提高远程交互过程中用户的交互体验。
进一步的,基于上述实施例,提出本申请远程交互方法第二实施例。在本实施例中,参照图4,所述步骤S10包括:
步骤S11,在预设时长内检测所述声源对象的发声位置的多个第一空间位置信息,获得多个声源位置信息;
声源对象为人体时发声位置可指的是嘴部(如图13(b)中的01);声源对象为发声的设备时,发声位置可指的是声源对象的喇叭。
具体的,在预设时长内连续多个时刻检测声源对象的发声位置的空间位置信息(如图13(a)中的(X1,Y1)),时间先后相邻的两个时刻之间的时间间隔可为预设值。例如,可预设时长可为0.5秒,可在0.5秒内连续多次检测声源对象的发声位置的第一空间位置信息,获得多个声源位置信息。
具体的,可预先建立有表征目标空间内不同空间位置的空间坐标系,空间坐标系可以是极坐标系 或直角坐标系。这里空间位置信息具体为空间位置信息可采用空间坐标系中的坐标进行表示。
在本实施例中,在每个声源位置信息检测的过程中,获取麦克风阵列检测的声音信号,根据预设声源定位算法对所获取的声音信号进行计算,计算得到的空间位置信息可作为声源位置信息。这里的预设声源定位算法可为基于麦克风阵列中各个麦克风接收声音信号的时间差对声源进行定位的算法,例如TDOA算法,TDOA算法可具体包括GCC-PHAT算法或SRP-PHAT算法等;预设声源定位算法也可为使用空间谱估计进行声源定位的方法,例如MUSIC算法等。
在本实施例中,在预设时长内多次检测所述声源对象的发声位置在所述目标空间中以拍摄模块为基点的方位角和俯仰角,获得多个第一空间方位角和多个第一空间俯仰角;其中,所述多个声源位置信息包括所述多个第一空间方位角和所述多个第一空间俯仰角,所述拍摄模块被配置为采集所述全景视频。
空间方位角(如图16(a)中的α)的定义如下:以拍摄模块所在空间位置为基点,基点指向水平面上第三预设方向的线为第三基线,空间位置与基点的连线为第三目标方向线,第三目标方向线与第三基线形成的水平夹角则为空间方位角。这里的第三预设方向可根据拍摄模块的安装位置或远程交互设备上其他功能模块的安装位置进行确定。在本实施例中,第三预设方向为预设成像中心朝向远程交互设备背面的方向。其中,定义第三基线沿顺时针方向与第三目标方向线形成的空间方位角为正值,定义第三基线沿逆时针方向与第三目标方向线形成的空间方位角为负值。
空间俯仰角(如图16(b)中的β)的定义如下:以拍摄模块所在空间位置为基点,基点指向垂直面上第四预设方向的线为第四基线,空间位置与基点的连线为第四目标方向线,第四目标方向线与第四基线在垂直面上形成夹角则为空间俯仰角。其中,第四目标方向线在第四基线下方时空间俯仰角为负值;第四目标方向线在第四基线上方时空间俯仰角为正值。在本实施例中,第四预设方向为远程交互设备放置于桌子上时,预设成像中心指向拍摄模块所拍摄到的桌子边缘对应的空间位置的方向。基于此,为了便于计算,采用正值的空间俯仰角定义最大俯仰角范围,则拍摄模块的最大俯仰角范围可为0度至69度,这里的69度可根据桌子尺寸(如图16(b)中的H1、W)的不同、声源对象高度(如图16(b)中的H3)的不同以及拍摄模块安装位置(如图16(b)中的H2)的不同等实际情况设置为其他数值。另外,在其他实施例中,第四预设方向也可以是水平方向。
在其他实施例中,第一空间位置信息也可包括空间方位角和空间俯仰角的其中一个;或者,第一空间位置信息也可包括发声位置相对于基点的方向和/或距离。
在其他实施例中,第一空间位置信息也可基于全景视频中声源对象对应的图像进行识别得到,例如识别全景视频中声源对象的图像中发声位置所在图像区域的图像位置信息,基于图像位置信息来确定这里的第一空间位置信息。
步骤S12,根据所述多个声源位置信息确定所述发声范围。
具体的,可根据多个声源位置信息确定发声范围中一个或多于一个特征位置点,根据所确定的特征位置点来计算这里的发声范围。
在本实施例中,发声范围为方形区域,在其他实施例中,发声范围也可为圆形区域、三角区域或其他形状的区域。发声范围的区域形状具体可根据第一目标显示窗口的窗口形状或第一目标显示窗口内用于显示与声源对象对应的子窗口的窗口形状所确定。
在本实施例中,多个声源位置信息包括上述的多个第一空间方位角和多个第一空间俯仰角时,可确定所述多个第一空间方位角中的最小空间方位角和最大空间方位角,确定所述多个第一空间俯仰角中的最小空间俯仰角和最大空间俯仰角;根据所述最小空间方位角、所述最大空间方位角、所述最小空间俯仰角以及所述最大空间俯仰角确定所述发声范围对应的多个第一空间角点位置;将所述多个第一空间角点位置围合形成的空间范围确定为所述发声范围。例如,如图13(b)所示,最小空间方位角为X2,最大空间方位角为X3,最小空间俯仰角为Y2,最大空间俯仰角为Y3,则可确定发声范围的四个第一空间角点位置分别为(X2,Y2)、(X2,Y3)、(X3,Y2)以及(X3,Y3),这四个空间角点位置为何形成的方形空间区域可确定为发声范围。
在其他实施例中,也可根据多个声源位置信息确定发声范围的中点位置,例如确定多个第一空间方位角的第一均值和确定多个第一空间俯仰角的第二均值,空间方位角为第一均值且空间俯仰角为第二均值的空间位置可确定为中点位置。以中点位置为中心、且区域特征参数为预设值(如预设区域形状和/或预设空间尺寸等)的空间区域确定为发声范围,例如,将以中点位置为圆心、预设值为半径的圆形区域确定为发声范围。
在本实施例中,通过多次声源定位所确定的发声范围来声源对象在全景视频中的目标图像范围,有利于提高所确定的目标图像范围的准确性,从而保证即使声源对象发声过程中声源位置移动(例如发声人在发声过程中扭头等),也可准确地获取到声源对象在全景视频中所对应的子视频数据,以保证声 源对象的交互细节的突出显示,进一步远程交互过程中的用户体验。
进一步的,基于上述任一实施例,提出本申请远程交互方法第三实施例。在本实施例中,参照图5,所述步骤S20包括:
步骤S21,根据所述发声范围确定包含所述声源对象的目标区域的目标空间范围,所述目标区域为所述声源对象在交互时需展示的最小区域,目标空间范围大于或等于发声范围;
这里的目标区域可为预先设置的固定区域,也可为基于用户设置参数所确定的区域,还可根据声源对象的类型所确定的区域(不同类型可对应不同的目标区域)。例如,声源对象为人体时,目标区域可为头部或上身或肩部以上区域等;声源对象为设备时,目标区域可为设备上的显示区域。其中,目标区域大于发声范围,目标空间范围大于或等于目标区域。
在本实施例中,目标空间范围为方形区域。在其他实施例中,目标空间范围可为圆形区域、三角形区域或其他不规则形状的区域。
具体的,可直接将发声范围作为目标空间范围;也可按照预设规则对声音范围进行放大后的空间范围作为目标空间范围。需要说明的是,目标空间范围为基于上述实施例中的空间坐标系进行表征的区域范围。
具体的,可获取发声范围对应的区域调整值,根据区域调整值放大发声范围后获得目标空间范围。这里的区域调整值可为预先设置的固定参数,也可根据目标空间内实际场景情况所确定的参数。
步骤S22,根据预设对应关系,确定所述目标空间范围在所述全景视频中对应的图像范围为所述目标图像范围;其中,所述预设对应关系为预先设置的所述目标空间内的空间位置与所述全景视频对应的图像位置之间的对应关系。
具体的,这里的预设对应关系为上述实施例提及的图像坐标系与空间坐标系之间的坐标转换关系。
基于预设对应关系将目标空间范围对应的空间位置特征参数转换成图像位置特征参数,基于转换得到的图像位置参数确定目标图像范围。例如,可基于预设对应关系将目标空间范围的多个空间角点位置转换成多个图像角点位置,将全景视频中多个图像角点位置围合形成的图像区域作为目标图像范围;又如,目标空间范围为圆形区域,基于预设对应关系将目标空间范围的空间中点位置转换成图像中点位置,将目标空间范围对应的空间半径转换成图像半径,将全景视频中以图像中点位置为圆心、图像半径为半径的圆形图像区域作为目标图像范围。
在本实施例中,通过上述方式,基于发声范围确定包含有声源对象需显示的最小区域的目标空间范围后,基于所确定的目标空间范围在全景视频中对应的图像区域作为目标图像范围,有利于保证所确定的目标图像范围内子视频数据可包含有声源对象的目标区域的图像,以确保所提取的子视频数据可准确地包含声源对象交互所需的全部细节,以进一步提高远程交互过程中的用户体验。
进一步的,在本实施例中,步骤S21包括:获取所述目标空间内允许发声的对象的总数,获取所述发声范围内目标空间位置的第二空间位置信息;根据所述总数确定所述目标空间范围的大小特征值;根据所述第二空间位置信息和所述大小特征值确定所述目标空间范围。
这里的允许发声的对象包括具有发声功能的设备和人体。这里的总数由获取用户输入的参数确定,也可通过对全景视频进行目标识别确定。例如,目标空间内有8个人、1个手机以及1个显示播放设备,则可确定允许发声的对象的总数有10个。
目标空间位置具体为对发声范围的区域位置进行表征的位置。在本实施例中,目标空间位置为发声范围的中心位置。在其他实施例中,目标空间位置也可为发声范围的边缘位置、角点位置、重心位置或其他任意的位置。
不同的总数对应的不同的大小特征值,大小特征值表征的目标空间范围的大小与总数呈负相关,也就是说,总数越多则目标空间范围的尺寸越小。大小特征值可为目标空间范围的面积、半径、长和宽等表征区域大小的特征参数。其中,总数大于设定值时,大小特征值为预设大小特征值,总数小于或等于设定值时可根据总数计算大小特征值,基于此,可有效避免目标空间范围太小,从而保证声源对象的交互细节可准确展示。
具体的,根据大小特征值对第二空间位置信息进行调整后可获得目标空间范围对应的部分或全部空间位置信息,基于所获得的空间位置信息可确定目标空间范围。
在本实施例中,所述目标空间位置为发声范围的中心位置,所述第二空间位置信息包括所述目标空间位置以拍摄模块为基点的第二空间方位角,所述拍摄模块被配置为采集所述全景视频,所述根据所述第二空间位置信息和所述大小特征值确定所述目标空间范围的步骤包括:根据所述大小特征值确定空间方位角调整值;根据所述空间方位角调整值调整所述第二空间方位角,获得所述目标空间范围以所述拍摄模块为基点的方位角范围的最大临界值和最小临界值;根据所述最大临界值、所述最小临界值以及 所述目标空间范围以所述拍摄模块为基点的预设俯仰角范围确定所述目标空间范围的多个第二空间角点位置;将多个所述第二空间角点位置围合形成的空间范围确定为所述目标空间范围。
其中,在本实施例中,大小特征值为目标空间范围的宽度,宽度越大则空间方位角调整值越大;宽度越小则空间方位角调整值越小。在其他实施例中,大小特征值也为目标空间范围的半径。
具体的,根据空间方位角调整值缩小第二空间方位角可获得目标空间范围对应的空间方位角的最小临界值,根据空间方位角调整值放大第二空间方位角可获得目标空间范围对应的空间方位角的最大临界值。
预设俯仰角范围可结合拍摄模块的安装位置、用于放置远程交互设备的桌子的尺寸以及声源对象允许出现的最大高度等信息确定。具体的,预设俯仰角范围中的最小俯仰角值可为用于放置远程交互设备的桌子的边缘位置与拍摄模块的连线与上述第四基线之间的夹角(例如0度等);预设俯仰角范围中的最大俯仰角值可为声源对象的最高位置与拍摄模块的连线与第四基线的夹角(例如69度等)。在其他实施例中,预设俯仰角范围也可根据预设图像比例与上述确定的最大临界值和最小临界值确定。
预设俯仰角范围的最小值为目标空间范围的最小空间俯仰角,预设俯仰角范围的最大值为目标空间范围的最大空间俯仰角。
例如,以下列例子说明本实施例方案:
1)目标空间内允许发声的对象的总数为n,麦克风阵列的声音识别的最大方位角范围为0度至360度,则目标空间范围的宽度为360度/n,由于目标空间位置为中心位置可确定空间方位角调整值为360度/2n;
2)基于上述确定的声音范围(X2,Y2)、(X2,Y3)、(X3,Y2)以及(X3,Y3)(如图13(b)所示)可确定声音范围的中心位置的第二空间方位角为(X2+X3)/2,目标空间范围的空间方位角的最小临界值为X4=(X2+X3)/2-360度/2n,目标空间范围的空间方位角的最大临界值为X5=(X2+X3)/2+360度/2n;
3)预设俯仰角范围为0度至P度(如69度),目标空间范围的空间俯仰角的最小临界值为Y4=0,目标空间范围的空间俯仰角的最大临界值Y5=P;
4)基于此,如图13(c)所示,可确定目标空间范围的四个空间角点位置分别为(X4,Y4),(X4,Y5),(X5,Y4)以及(X5,Y5),这四个空间角点位置围合形成的四边形空间区域则为目标空间范围。
进一步的,基于上述任一实施例,提出本申请远程交互方法第四实施例。在本实施例中,参照图6,所述步骤S20之后,还包括:
步骤S201,识别所述目标图像范围内人体图像所在的图像区域;
具体的,可采用人体识别算法对目标图像范围内图像数据进行识别确定图像区域。例如,对目标图像范围内的图像数据进行人脸识别确定人脸图像,基于人脸图像进行人形推算得到这里的图像区域。
在本实施例中,图像区域为四边形区域。在其他实施例中,图像区域也可为圆形区域或人形形状的区域。
步骤S202,确定所述图像区域的面积与所述目标图像范围的面积的比值;
步骤S203,判断所述比值是否小于预设值;
响应于所述比值小于预设值的情况,执行步骤S204后执行步骤S30;响应于所述比值大于或等于所述预设值的情况,执行步骤S30。
预设值具体为人与人面对面交互时舒适距离所允许的图像区域与目标图像范围之间面积比的最小值。比值小于预设值表明远端设备的用户在看到子视频数据时会觉得其与声源对象的距离过远,用户无法基于子视频数据的输出获取到所需的交互细节;比值大于或等于预设值表明远端设备的用户在看到子视频数据时可清楚的看到声源对象的交互细节。
步骤S204,缩小所述目标图像范围,以使所述比值大于或等于所述预设值。
具体的,可根据预先设置的固定范围调整参数缩小目标图像范围,也可根据图像区域的尺寸特征或比值等确定的范围调整参数来缩小目标图像范围。
在缩小目标图像范围之后,可返回执行步骤S201,以确保调整后的目标图像范围对应的上述比值可大于或等于预设值。
在本实施例中,根据所述预设值放大所述图像区域获得缩小后的目标图像范围。具体的,可根据预设值确定用于放大图像区域的图像位置调整值,根据图像位置调整值对图像区域进行调整后获得缩小后的目标图像范围。
在本实施例中,根据预设值放大图像区域获得缩小后的目标图像范围的过程具体如下:确定所述图像区域内目标图像位置的图像位置参数,根据所述预设值和所述图像区域的宽度确定用于放大所述图像区域的图像位置调整值;根据所述图像位置调整值调整所述图像位置参数获得目标图像位置参数;根 据所述目标图像位置参数确定缩小后的目标图像范围。
在本实施例中,目标图像位置为图像区域的图像中心位置。在其他实施例中,目标图像位置也可为声源对象的发声位置在所述图像区域内对应的图像位置、图像区域的边缘位置、角点位置或其他任意位置等。图像位置参数具体可为以上述实施例提及的图像坐标系进行表征的图像位置的特征参数。在本实施例中,图像位置参数包括目标图像位置以预设成像中心为基点的第一图像方位角和/或第一图像俯仰角。在其他实施例中,目标图像位置还可以包括目标图像位置与预设成像中心之间的距离和/或方向。
在本实施例中,图像区域的宽度具体指的是图像区域对应最大方位角与最小方位角的差值。在其他实施例中,图像区域的宽度也可为图像区域沿水平方向上两侧边缘之间的距离。具体的,可根据预设值和图像区域的宽度计算图像区域放大后的目标宽度,根据目标宽度确定这里的图像位置调整值。目标图像位置为图像中心位置时,可将目标宽度的1/2作为图像位置调整值;目标图像位置为图像区域沿水平方向上一侧边缘的图像边缘位置时,将目标宽度直接作为图像位置调整值。
具体的,可根据图像位置调整值对图像位置参数调整后作为目标图像位置参数。例如,图像位置参数包括图像方位角和图像俯仰角,图像位置调整值包括图像方位角调整值和图像俯仰角调整值,根据图像方位角调整值调整图像方位角后获得目标图像方位角,根据图像俯仰角调整值调整图像俯仰角后获得目标图像俯仰角,目标图像位置参数包括目标图像方位角和目标图像俯仰角。另外,还可根据图像位置调整值对图像位置参数调整后获得第一图像位置参数,根据第一图像位置参数和预设参数计算得到目标图像位置参数。例如,图像位置参数包括图像方位角,图像位置调整值包括方位角调整值,根据方位角调整值对图像方位角进行调整后获得目标图像方位角,根据目标图像方位角和缩小后的目标图像范围的目标图像比例确定目标图像俯仰角,目标图像位置参数包括目标图像方位角和目标图像俯仰角;又如,图像位置参数包括图像俯仰角,图像位置调整值包括俯仰角调整值,根据俯仰角调整值对图像俯仰角进行调整后获得目标图像俯仰角,根据目标图像俯仰角和缩小后的目标图像范围的目标图像比例确定目标图像方位角,目标图像位置参数包括目标图像方位角和目标图像俯仰角。
在本实施例中,通过上述方式,在人形图像比例较小时缩小目标图像范围,使目标图像范围中的人形图像比例可增大,保证所输出的子视频数据中人形图像的比例不会太小,以确保远端设备的用户基于输出的子视频数据可在视觉上实现与目标空间内的面对面交流,以保证远端设备的用户可在远程交互过程中清楚地看到子视频数据对应的声源对象的交互细节,以实现远程交互过程中用户体验的进一步提高。其中,基于预设值放大图像区域后作为缩小后的目标图像范围,可保证目标图像范围缩小前后人体图像所呈现的人体范围不变,确保声源对象的交互细节可放大呈现。
进一步的,在本实施例中,所述图像位置参数包括所述目标图像位置以所述全景视频对应的预设成像中心为基点的第一图像方位角,所述图像位置调整值包括图像方位角调整值,所述根据所述图像位置调整值和所述图像位置参数确定目标图像位置参数的步骤包括:根据所述图像方位角调整值调整所述第一图像方位角,获得调整后的目标图像范围以所述预设成像中心为基点的最大图像方位角和最小图像方位角;根据所述最大图像方位角、所述最小图像方位角、所述目标图像位置在所述图像区域的竖直方向上的位置特征参数以及所述目标图像范围的图像比例确定缩小后的目标图像范围以所述预设成像中心为基点的最大图像俯仰角和最小图像俯仰角;确定所述最大图像方位角、所述最小图像方位角、所述最大图像俯仰角以及所述最小图像俯仰角为所述目标图像位置参数。基于此,所述根据所述目标图像位置参数确定缩小后的目标图像范围的步骤包括:根据所述最大图像方位角、所述最小图像方位角、所述最大图像俯仰角以及所述最小图像俯仰角确定调整后的目标图像范围的多个图像角点位置;将所述多个图像角点位置围合形成的图像范围作为缩小后的目标图像范围。
在本实施例中,目标图像位置为位于所述图像区域的垂直平分线上的位置,其与图像区域两侧边缘的距离相等,例如可为图像区域的中点位置,也可为垂直平分线上除了中点位置以外的其他位置。具体的,根据图像方位角调整值缩小第一图像方位角后获得最小图像方位角,根据图像方位角调整值增大第一图像方位角后获得最大图像方位角。
定义图像区域对应的最大俯仰角与最小俯仰角之间的差值为目标角度幅值,定义图像区域对应的最大俯仰角与目标图像位置的图像俯仰角的差值为第一差值,定义目标图像位置的图像俯仰角与图像区域对应的最小俯仰角之间的差值为第二差值,所述目标图像位置在所述图像区域的竖直方向上的位置特征参数具体为第一差值与目标角度幅值的比值或第二差值与目标角度幅值的比值。其中,在本实施例中,目标角度幅值为固定值,在其他实施例中,目标角度幅值也可根据目标空间内的实际场景参数所确定的数值。
在本实施例中,所述目标图像位置为所述发声范围的中心位置在所述图像区域内对应的图像位置。图像比例为图像区域的长度与宽度的比值。
所述目标图像范围的图像比例具体为目标图像范围未缩小之前的宽度与长度的比例,定义目标图 像范围缩小前的图像方位角度的最大值与图像方位角度的最小值之间的为第三差值,定义目标图像范围缩小前的图像俯仰角度的最大值与图像方位角度的最小值之间的为第四差值,目标图像范围的图像比例为第三差值与第四差值的比值。
在获得最小图像方位角和最大图像方位角之后,可根据最小图像方位角和最大图像方位角计算缩小后的目标图像范围的目标宽度(即最大图像方位角与最小图像方位角之间的差值),基于目标图像范围缩放前后的图像比例相同,可根据目标宽度和图像比例计算缩小后的目标图像范围的目标长度(即最大图像俯仰角与最小图像俯仰角之间的差值),根据目标图像位置在数值方向上对应的位置特征参数和目标长度可计算得到最大图像俯仰角和最小图像俯仰角。
在获得最大图像俯仰角、最小图像俯仰角、最大图像方位角以及最小图像方位角之后,确定缩小后目标图像范围的四个角点位置,将四个角点位置所围合形成的四边形图像区域可作为缩小后的目标图像范围。
为了更好理解本实施例涉及的缩小后的目标图像范围的确定过程(即图像区域的放大过程),结合图13和图14,下面以具体应用进行说明:
1)、最小图像方位角定义为X8,最大图像方位角定义为X9,预设值为0.9:1,即图像区域与目标图像区域的面积比;如图14(a)所示,人体图像所在的图像区域的多个角点位置分别为(X6,Y6)、(X6,Y7)、(X7,Y6)以及(X7,Y7),基于图像区域放大前后的沿水平方向上的中心不变,则:
最小图像方位角X8=(X7-(X7-X6)/2)-((X7-X6)/0.9)/2;
最大图像方位角X9=(X7-(X7-X6)/2)+((X7-X6)/0.9)/2;
其中,X7-(X7-X6)/2为第一图像方位角,((X7-X6)/0.9)/2为图像方位角调整值。
2)、最小图像俯仰角定义为Y8,最大图像俯仰角定义为Y9,发声范围的多个角点位置分别为图13中的(X2,Y2)、(X2,Y3)、(X3,Y2)以及(X3,Y3),目标图像位置为发声范围的中心位置,则目标图像位置为Y3-(Y3-Y2)/2,未缩小之前的目标图像范围的多个角点位置分别为图14中的(X4’,Y4’)(X4’,Y5’)(X5’,Y4’)(X5’,Y5’),基于目标图像范围缩放前后的图像比例不变,发声范围的中心位置在缩小后的目标图像范围的垂直方向的位置特征与发声范围的中心位置在所述图像区域的竖直方向上的位置特征参数(如0.65)一致,则:
最小图像俯仰角Y8=(Y3-(Y3-Y2)/2)-((X9-X8)*(Y5’-Y4’)/(X5’-X4’)*0.65);
最大图像俯仰角Y9=Y8+(X9-X8)*(Y5’-Y4’)/(X5’-X4’);
其中,(Y5’-Y4’)/(X5’-X4’)为目标图像范围的图像比例。
3)如图14(b)所示,放大后人体图像所在图像区域的图像角点位置分别为(X8,Y8)、(X8,Y9)、(X9,Y8)以及(X9,Y9),这4个图像角点位置为围合形成的四边形图像区域则为缩小后的目标图像范围。
在本实施例中,通过上述方式,可保证目标图像范围缩小后人体图像的规格可与目标图像范围缩小前大致相同,保证目标图像范围缩小后子视频数据所呈现的人像比例较大同时人像可具有较佳的呈现效果,以进一步提高远程交互的用户体验。
进一步的,在本实施例中,所述根据所述最大图像方位角、所述最小图像方位角、所述最大图像俯仰角以及所述最小图像俯仰角确定调整后的目标图像范围的多个图像角点位置的步骤之后,还包括:确定所述多个图像角点位置围合形成的图像范围的区域面积相对于所述图像区域的区域面积的放大倍数;响应于所述放大倍数小于或等于预设倍数的情况,执行所述将所述多个图像角点位置围合形成的图像范围作为缩小后的目标图像范围的步骤;响应于所述放大倍数大于所述预设倍数的情况,将所述图像区域放大预设倍数后的图像范围作为缩小后的目标图像范围。这里,对图像区域的放大倍数进行限制,避免放大倍数过大导致目标图像范围缩小后子视频数据中的人像过于模糊,保证声源对象的交互细节可在子视频数据输出时清晰呈现,以进一步提高远程交互的用户体验。
进一步的,在本实施例中,所述识别所述目标图像范围内人体图像所在的图像区域的步骤之后,还包括:响应于所述目标图像范围内存在人体图像的情况,执行所述确定所述图像区域的面积与所述目标图像范围的面积的比值的步骤;响应于所述目标图像范围内不存在人体图像的情况,执行所述在第一目标显示窗口内输出所述目标图像范围内的子视频数据,并发送所述第一目标显示窗口至远端设备,以使所述远端设备显示所述第一目标显示窗口的步骤。
进一步的,基于上述任一实施例,提出本申请远程交互方法第五实施例。在本实施例中,参照图7,所述步骤S30之后,还包括:
步骤S40,获取所述发声范围的空间位置变化参数或所述目标图像范围内人体图像区域的图像位置变化参数;
所述空间位置变化参数包括所述发声范围以拍摄模块为基点的空间方位角变化值和/或空间俯仰 角变化值,所述拍摄模块被配置为采集所述全景视频;所述图像位置变化参数包括所述图像区域以所述全景视频的预设成像中心为基点的图像方位角变化值和/或图像俯仰角变化值。
在本实施例中,空间位置变化参数包括发声范围中的第一目标位置(例如中心位置)的空间方位角变化值和/或空间俯仰角变化值,图像位置变化参数包括目标图像范围内人体图像区域的第二目标位置(例如中心位置)的图像方位角变化值和/或图像俯仰角变化值。
步骤S50,根据所述空间位置变化参数或所述图像位置变化参数调整所述目标图像范围;
具体的,可根据空间位置变化参数或图像位置变化参数调整当前目标图像范围的部分或全部角点的第一图像位置参数后获得调整后目标图像范围各个图像角点位置的第二图像位置参数。
其中,当声源对象为人体时可根据图像位置变化参数调整所述目标图像范围;当声源对象为具有发声功能的装置(例如手机、音箱等)时可根据空间位置变化参数调整目标图像范围。
步骤S60,在所述第一目标显示窗口内输出调整后的目标图像范围内的子视频数据,并发送调整后的第一目标显示窗口至所述远端设备,以使所述远端设备显示调整后的第一目标显示窗口。
例如,如图15所示,定义当前的目标图像范围的图像角点位置分别为(X8,Y8)、(X8,Y9)、(X9,Y8)以及(X9,Y9),当声源对象左右移动时会导致目标图像范围内人形图像区域的图像方位角发生变化时,可基于空间位置变化参数或图像位置变化参数计算移动后人形图像区域的中心位置的图像方位角(X12-X11)/2,定义调整后的目标图像范围的最小图像方位角为X13、调整后的目标图像范围的最大图像方位角为X14、调整后的目标图像范围的最小图像俯仰角为Y13、调整后的目标图像范围的最大图像俯仰角为Y14,基于调整前后目标图像范围的尺寸不变,则:
X13=(X12-X11)/2-(X9-X8)/2;
X14=(X12-X11)/2+(X9-X8)/2;
Y13=Y8;
Y14=Y9;
基于此,可确定调整后的目标图像范围对应的多个图像角点位置分别为(X13,Y13)、(X13,Y14)、(X14,Y13)以及(X14,Y14),多个图像角点位置围合形成的图像区域为调整后的目标图像范围。
另外,当声源对象上下移动时会导致目标图像范围内人形图像区域的图像俯仰角发生变化时或当声源对象斜向移动时会导致目标图像范围内人形图像区域的图像俯仰角和图像方位角同时发生变化时,可类比这里的方式确定调整后的目标图像范围的图像俯仰角范围和图像方位角范围,在此不作追踪。
在本实施例中,通过上述方式,可保证声源对象即使在交互过程中移动,第一目标显示窗口中输出的子视频数据中声源对象的图像也可完整显示,以有效提高远程交互过程中的用户交互体验。
进一步的,基于上述任一实施例,提出本申请远程交互方法第六实施例。在本实施例中,参照图8,所述在第一目标显示窗口内输出所述目标图像范围内的子视频数据包括:
步骤S31,响应于所述声源对象的数量多于一个的情况,获取所述第一目标显示窗口中需显示的声源对象的目标数量;
需要说明的是,这里的声源对象具体可包括当前发声的声源对象和当前时刻之前发声的声源对象。
这里的目标数量可为用户自行设置,也可为默认设置的固定参数。声源对象的数量大于或等于这里的目标数量。
步骤S32,在多于一个所述声源对象中确定所述目标数量个声源对象作为目标对象;
这里目标数量的声源对象可由用户自行选择,也可根据预设规则从多于一个声源对象中选择,还可随机选取。
步骤S33,在每个所述目标对象对应的子窗口内输出所述目标对象对应的目标图像范围内的子视频数据,并在所述第一目标显示窗口内合并所述目标数量个子窗口。
不同的目标对象在第一目标显示窗口中对应不同的子窗口,不同子窗口分别输出不同目标对象的子视频数据。目标对象与子窗口一一对应设置。
具体的,在步骤S30之前,可获取所述目标空间的全景视频并获取目标空间内每个声源对象的发声范围,根据每个声源对象对应的发声范围确定声源对象在所述全景视频中对应的目标图像范围。基于此,目标对象在全景视频中对应目标图像范围内的子视频数据在目标对象所对应的子窗口中输出。
目标数量个子窗口可在第一目标显示窗口中随机排列,也可按照预设规则对目标数量个子窗口进行排列后显示于第一目标显示窗口内。
在本实施例中,通过上述方式,可保证远程交互过程中远端用户可基于第一目标显示窗口中显示的视频数据同时获取到目标空间内多于一个发声对象的交互细节,进一步提高远程交互的用户体验。
进一步的,在本实施例中,步骤S32包括:获取每个所述声源对象对应的发声状态参数,所述发 声状态参数表征对应的声源对象的发声时间与当前时间之间的间隔时长;在多于一个所述声源对象中,根据各所述声源对象的发声状态参数确定所述目标数量个声源对象作为目标对象。
在本实施例中,发声状态参数的获取过程具体如下:获取每个所述声源对象当前分别对应的标签值,每个声源对象的标签值均大于或等于第一预设值,标签值表征对应的声源对象在当前时刻之前未发声的连续次数;根据预设规则更新每个所述声源对象当前的标签值,获得每个所述声源对象更新后的标签值作为每个所述声源对象的发声状态参数;其中,所述预设规则包括:当前处于发声状态的声源对象的标签值设置为所述第一预设值,当前未处于发声状态的声源对象的标签值增加第二预设值。其中,标签值在每次存在声源对象发声的过程中按照这里的预设规则进行更新。若所有声源对象均未有发声时可对每个声源对象对应的标签值进行初始化,每个声源对象对应的标签值的初始值可相同或不同。在本实施例中,第一预设值为0,第二预设值为1。在其他实施例中,第一预设值和第二预设值也可根据实际需求设置为其他数值,如第一预设值为1,第二预设值为2等。标签值所允许存在最小值为第一预设值。
基于按照预设规则更新得到各个声源对象的标签值,则在多于一个所述声源对象中,根据各所述声源对象的发声状态参数将所述目标数量个声源对象作为目标对象的步骤包括:将所有发声状态参数按照从小到大的顺序依次排列,获得排列结果;将所述排列结果中排列位次在前的目标数量个发声状态参数分别对应的声源对象确定为目标对象。其中,发声状态参数排列位次越前则表明对应的目标对象发声的时刻与当前时刻的间隔时长越短。
在其他实施例中,发声状态参数也可为每个声源对象的发声时间与当前时间之间的间隔时长。基于所有间隔时长从小到大的顺序依次排列,将排列位次在前的目标数量个间隔时长分别对应的声源对象确定为目标对象,获得所述目标数量个目标对象。
在本实施例中,按照上面的方式,可保证第一目标显示窗口内显示的是最近发声的目标数量个声源对象,从而保证远程交互过程中交互的实时性和便利性,以进一步提高远程交互过程中的用户体验。
进一步的,在本实施例中,目标数量个子窗口在第一目标显示窗口中合并显示的过程具体如下:确定每个所述目标对象的目标图像范围上的预设图像位置以所述全景视频的预设成像中心为基点的第二图像方位角;所述根据各所述目标对象对应的第二图像方位角之间的大小关系确定所述目标数量个子窗口的排列顺序;在所述第一目标显示窗口内按照所述排列顺序合并显示所述目标数量个子窗口。
在本实施例中,预设图像位置为目标图像范围的中心位置;在其他实施例中预设图像位置也可为目标图像范围的边缘位置或其他位置。
具体的,可按照第二图像方位角从大到小的顺序对目标数量个子窗口进行排列得到这里的排列顺序;也可按照第二图像方位角从小到大的顺序对目标数量个子窗口进行排列得到这里的排列顺序。
定义所述预设成像中心指向预设水平方向的射线为基准线,定义每个目标对象对应的预设图像位置与所述预设成像中心的连线为目标线,每个目标对象对应的第二图像方位角为所述基准线沿顺时针方向到所述目标对象对应的目标线的水平夹角,所述根据各所述目标对象对应的第二图像方位角之间的大小关系确定所述目标数量个子窗口的排列顺序的步骤包括:根据第二图像方位角从小到大的顺序依次排列所述目标数量个子窗口,获得所述排列顺序。
在本实施例中,基于第二图像方位角之间的大小关系对目标数量个子窗口进行排列显示,从而保证第一目标显示窗口输出的各个目标对象的子视频数据的排列顺序与各个目标对象在目标空间内的相对位置相同,以保证远端用户可基于输出的视频数据从视觉上模拟其身临目标空间内时的交互场景。其中,根据第二图像方位角从小到大的顺序依次排列所述目标数量个子窗口,从而最大程度地模拟远端用户在目标空间现场进行面对面交互时的场景,以进一步提高远程交互过程的用户体验。
为了更好地理解本实施例涉及的目标数量个目标对象的确定过程,结合图9和图12对本实施例方案进行说明:
图9中的交流窗口为本实施例中的第一目标显示窗口,W1、W2、W3为第一目标显示窗口中依次排列的子窗口,W4为当前新增的声源对象对应的虚拟子窗口、为在第一目标显示窗口中不显示的子窗口;图9和图12P2、P3、P5、P7等分别表征的是不同的声源对象。
其中,W1、W2、W3、W4分别对应一个标签值,W1、W2、W3上的目标对象对应的标签值的初始值依次为1、2、3;在当前存在声源对象发声的过程中,当前发声的声源对象的子窗口的标签值为0,当前未发声的声源对象的子窗口的标签值均增加1;同一声源对象连续发声时,其子窗口的标签值持续为0;每个声源对象对应的标签值最大为4,最小为0。
基于此,第一目标显示窗口内的声源对象当前处于发声状态时,则子窗口的排序不调整,每个声源对象对应的状态值按照上述规则更新;当第一目标显示窗口内的声源对象以外新增的声源对象当前处于发声状态时,确定,第一目标显示窗口内状态值最大的声源对象对应的子窗口删除,将新增的声源对象对应的子窗口与第一目标显示窗口内被删除的子窗口以外的其他子窗口按照图像方位角进行排序,每 个声源对象对应的状态值按照上述规则更新。按照子窗口的最新排序将目标数量个子窗口在第一目标显示窗口内依次显示。
例如,第一目标显示窗口中当前显示P2、P3和P5对应的子窗口,之后P2、P3、P5以及P7的发声顺序依次为P3、P5、P7以及P2,则第一目标显示窗口内的显示情况、各声源对象对应的状态值以及基于图像方位角对各个目标对象对应的子窗口的排序结果可参照图9。
进一步的,基于上述任一实施例,提出本申请远程交互方法第七实施例。在本实施例中,参照图10,所述获取目标空间的全景图像的步骤之后,还包括:
步骤S101,识别所述全景视频的基准位置对应的人形图像区域,所述基准位置以所述全景视频的预设成像中心为基点的图像方位角为预设角度值,所述人形图像区域包括人体上目标区域对应的完整图像,所述目标区域为人体在交互时需展示的最小区域;
基准位置所在图像范围可为对应的图像方位角与预设角度值之间的差值小于或等于设定值的图像位置集合。具体的,可对基准位置所在图像范围内进行人体部位识别获得人体部位的特征图像,基于特征图像推算得到人形图像区域。
例如,目标区域为人体在肩部及其以上的区域时,若图像范围内存在人体的左肩和左边的半个头部对应的特征图像,可基于特征图像计算得到人体整个肩部和肩部以上的整个头部所对应的完整图像作为人形图像区域。
在本实施例中,所述基准位置为所述全景视频的图像边缘位置,预设角度值为0度。在其他实施例中,预设角度值也可为与0度夹角为单个人体图像区域对应的图像方位角幅度的整数倍的角度值。例如单个人体图像区域的最大图像方位角与最小图像方位角之间的角度差为a,则与0度夹角为a的整数倍的角度值可作为预设角度值。
步骤S102,确定所述人形图像区域以所述预设成像中心为基点的图像方位角的最小值;
步骤S103,响应于所述最小值小于所述预设角度值的情况,根据所述最小值与所述预设角度值的差值调整所述全景视频对应的拍摄范围,返回执行所述获取所述目标空间的全景视频的步骤;
步骤S104,响应于所述最小值大于或等于所述预设角度值的情况,执行所述根据所述发声范围确定所述声源对象在所述全景视频中所处的目标图像范围的步骤。
预设角度值具体为用于表征人体图像能否在全景视频中完整显示的临界图像方位角。最小值小于预设角度值时,表明人体图像不能在全景视频中完整显示;最小值大于或等于预设角度值时,表明人体图像可在全景视频中完整显示。
具体的,可将最小值与所述预设角度值之间差值作为目标旋转角度值或将最小值与所述预设角度值之间差值增大设定角度值后的数值作为目标旋转角度值。控制全景视频的拍摄模块将其拍摄范围沿水平方向旋转与目标旋转角度值一致的角度值,使人体上目标区域对应的完整图像可在全景视频中全部显示,基准位置对应的图像范围内不存在人体部位对应的图像。
例如,可基于上述方式将全景视频从图11(a)调整至图11(b)的状态。
在本实施例中,通过上述方式可保证人形图像可在全景视频中完整显示,以进一步提高远程交互的用户交互体验。
进一步的,基于上述任一实施例,步骤S30执行的同时还包括:在第二目标显示窗口内输出所述全景视频,并发送所述第二目标显示窗口至所述远端设备,以使所述远端设备合并显示所述第一目标显示窗口和所述第二目标显示窗口。
具体的,可将第一目标显示窗口和第二目标显示窗口合并后发送至远端设备;也可将第一目标显示窗口和第二目标显示窗口单独发送至远端设备,远端设备接收到第一目标显示窗口和第二目标显示窗口后对两个窗口进行合并显示。
例如,如图12所示,将第一目标显示窗口A与第二目标显示窗口B在远端设备上合并显示。
基于此,可保证远端用户可基于输出的视频数据同时知晓目标空间内整体场景情况和发声对象的交互细节,有利于进一步提高远程交互的用户体验。
进一步的,基于上述任一实施例,所述在第一目标显示窗口内输出所述目标图像范围内的子视频数据,并发送所述第一目标显示窗口至远端设备,以使所述远端设备显示所述第一目标显示窗口的步骤之前,还包括:获取所述第一目标显示窗口对应的灵敏度参数;所述灵敏度参数表征所述第一目标显示窗口内视频数据的更新频率;根据所述灵敏度参数确定声音识别所需间隔的目标时长;所述在第一目标显示窗口内输出所述目标图像范围内的子视频数据,并发送所述第一目标显示窗口至远端设备,以使所述远端设备显示所述第一目标显示窗口的步骤之后,还包括:间隔所述目标时长,返回执行所述获取目标空间内声源对象的发声范围,获取所述目标空间的全景视频的步骤。
其中,所述获取所述第一目标显示窗口对应的灵敏度参数的步骤包括:获取当前远程交互场景的 场景特征参数或用户设置参数;根据所述场景特征参数或所述用户设置参数确定所述灵敏度参数。这里的场景特征参数可具体包括目标空间内的用户情况或远程交互场景的场景类型(如视频会议或视频直播等)。用户设置参数为用户基于其实际交互需求向远程交互设备输入关于第一目标显示窗口中视频数据的更新频率的参数。
具体的,可预先设置有多个预设灵敏度参数,不同的预设灵敏度参数对应不同的预设时长,根据场景特征参数或用户设置参数从多个预设灵敏度参数中确定当前的第一目标显示窗口对应的灵敏度参数,将当前的第一目标显示窗口对应的灵敏度参数对应的预设时长作为目标时长。
例如,多个预设灵敏度参数范围为1档灵敏度、2档灵敏度以及3档灵敏度,依次对应的预设时长为0.5秒、1秒以及1.5秒。
基于此,在第一目标显示窗口对若干个子视频数据进行输出后,在间隔目标时长后可重新基于目标空间进行声源对象的发声范围识别确定新的子视频数据在第一目标显示窗口中输出,从而保证第一目标显示窗口中视频的更新频率和当前远程交互场景下用户的实际交互需求精准匹配,以进一步提高用户的交互体验。
进一步的,基于上述任一实施例,本申请实施例中远程交互方法还包括:检测到静音指令,停止在远端设备输出目标空间内采集的音频数据。
在停止在远端设备输出目标空间内采集的音频数据之后,还可在远端设备中输出静音提示信息,以使远端用户可基于静音提示信息知晓目标空间内的静音状态。
其中,静音指令可通过按键、手机或电脑输入。
进一步的,基于上述任一实施例,本申请实施例中远程交互方法还包括:检测到关闭视频指令,停止在远端设备输出目标空间内采集的视频数据。
在停止在远端设备输出目标空间内采集的视频数据之后,还可在远端设备中输出视频关闭的提示信息,以使远端用户可基于视频关闭的提示信息知晓目标空间内的视频关闭状态。
其中,关闭视频指令可通过按键、手机或电脑输入。
进一步的,基于上述任一实施例,本申请实施例中远程交互方法还包括:检测到预设指令,停止执行所述步骤S10至步骤S30,只在远端设备上显示第二目标显示窗口,从而保护目标空间内人员的隐私。
其中,预设指令可通过按键、手机或电脑输入。
进一步的,基于上述任一实施例,本申请实施例中远程交互方法还包括:远端设备所在场景内的用户处于发声状态,停止执行S10至步骤S30;当远端设备所在场景内用户处于未发声状态,执行步骤S10至步骤S30。
这里远端设备所在场景内是否处于发声状态可由获取远端设备发送的信息确定。
In addition, an embodiment of the present application further provides a computer program which, when executed by a processor, implements the relevant steps of any of the above embodiments of the remote interaction method.

In addition, an embodiment of the present application further provides a computer storage medium storing a remote interaction program which, when executed by a processor, implements the relevant steps of any of the above embodiments of the remote interaction method.
It should be noted that, as used herein, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or system that includes that element.

The numbering of the above embodiments of the present application is for description only and does not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disk) and including several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a remote interaction device, a network device, or the like) to perform the methods described in the embodiments of the present application.

The above are only optional embodiments of the present application and do not limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (32)

  1. A remote interaction method, wherein the remote interaction method comprises the following steps:
    acquiring a sound emission range of a sound source object in a target space, and acquiring a panoramic video of the target space;
    determining, according to the sound emission range, a target image range occupied by the sound source object in the panoramic video;
    outputting sub-video data within the target image range in a first target display window, and sending the first target display window to a remote device, so that the remote device displays the first target display window.
  2. The remote interaction method according to claim 1, wherein the step of acquiring the sound emission range of the sound source object in the target space comprises:
    detecting, within a preset duration, a plurality of pieces of first spatial position information of the sound emission position of the sound source object to obtain a plurality of pieces of sound source position information;
    determining the sound emission range according to the plurality of pieces of sound source position information.
  3. The remote interaction method according to claim 2, wherein the step of detecting, within a preset duration, a plurality of pieces of first spatial position information of the sound emission position of the sound source object to obtain a plurality of pieces of sound source position information comprises:
    detecting, a plurality of times within the preset duration, the azimuth angle and the pitch angle of the sound emission position of the sound source object in the target space taking a shooting module as the base point, to obtain a plurality of first spatial azimuth angles and a plurality of first spatial pitch angles;
    wherein the plurality of pieces of sound source position information comprise the plurality of first spatial azimuth angles and the plurality of first spatial pitch angles, and the shooting module is configured to capture the panoramic video.
  4. The remote interaction method according to claim 3, wherein the step of determining the sound emission range according to the plurality of pieces of sound source position information comprises:
    determining the minimum spatial azimuth angle and the maximum spatial azimuth angle among the plurality of first spatial azimuth angles, and determining the minimum spatial pitch angle and the maximum spatial pitch angle among the plurality of first spatial pitch angles;
    determining, according to the minimum spatial azimuth angle, the maximum spatial azimuth angle, the minimum spatial pitch angle and the maximum spatial pitch angle, a plurality of first spatial corner positions corresponding to the sound emission range;
    determining the spatial range enclosed by the plurality of first spatial corner positions as the sound emission range.
  5. The remote interaction method according to claim 1, wherein the step of determining, according to the sound emission range, the target image range occupied by the sound source object in the panoramic video comprises:
    determining, according to the sound emission range, a target spatial range containing a target region of the sound source object, the target region being the minimum region of the sound source object that needs to be shown during interaction, and the target spatial range being greater than or equal to the sound emission range;
    determining, according to a preset correspondence, the image range corresponding to the target spatial range in the panoramic video as the target image range;
    wherein the preset correspondence is a pre-configured correspondence between spatial positions in the target space and image positions of the panoramic video.
  6. The remote interaction method according to claim 5, wherein the step of determining, according to the sound emission range, the target spatial range in which the sound source object is located comprises:
    acquiring the total number of objects allowed to speak in the target space, and acquiring second spatial position information of a target spatial position within the sound emission range;
    determining a size characteristic value of the target spatial range according to the total number;
    determining the target spatial range according to the second spatial position information and the size characteristic value.
  7. The remote interaction method according to claim 6, wherein the target spatial position is a center position, the second spatial position information comprises a second spatial azimuth angle of the target spatial position taking a shooting module as the base point, the shooting module is configured to capture the panoramic video, and the step of determining the target spatial range according to the second spatial position information and the size characteristic value comprises:
    determining a spatial azimuth angle adjustment value according to the size characteristic value;
    adjusting the second spatial azimuth angle according to the spatial azimuth angle adjustment value to obtain the maximum critical value and the minimum critical value of the azimuth angle range of the target spatial range taking the shooting module as the base point;
    determining a plurality of second spatial corner positions of the target spatial range according to the maximum critical value, the minimum critical value and a preset pitch angle range of the target spatial range taking the shooting module as the base point;
    determining the spatial range enclosed by the plurality of second spatial corner positions as the target spatial range.
  8. The remote interaction method according to claim 1, wherein after the step of determining, according to the sound emission range, the target image range occupied by the sound source object in the panoramic video, the method further comprises:
    identifying the image region occupied by a human-body image within the target image range;
    determining the ratio of the area of the image region to the area of the target image range;
    in response to the ratio being smaller than a preset value, reducing the target image range so that the ratio becomes greater than or equal to the preset value;
    performing the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device so that the remote device displays the first target display window.
  9. The remote interaction method according to claim 8, wherein the step of reducing the target image range comprises:
    enlarging the image region according to the preset value to obtain the reduced target image range.
  10. The remote interaction method according to claim 9, wherein the step of enlarging the image region according to the preset value to obtain the reduced target image range comprises:
    determining an image position parameter of a target image position within the image region, and determining, according to the preset value and the width of the image region, an image position adjustment value for enlarging the image region;
    determining a target image position parameter according to the image position adjustment value and the image position parameter;
    determining the reduced target image range according to the target image position parameter.
  11. The remote interaction method according to claim 10, wherein the image position parameter comprises a first image azimuth angle of the target image position taking a preset imaging center corresponding to the panoramic video as the base point, the image position adjustment value comprises an image azimuth angle adjustment value, and the step of determining the target image position parameter according to the image position adjustment value and the image position parameter comprises:
    adjusting the first image azimuth angle according to the image azimuth angle adjustment value to obtain the maximum image azimuth angle and the minimum image azimuth angle of the adjusted target image range taking the preset imaging center as the base point;
    determining the maximum image pitch angle and the minimum image pitch angle of the reduced target image range taking the preset imaging center as the base point, according to the maximum image azimuth angle, the minimum image azimuth angle, a position characteristic parameter of the target image position in the vertical direction of the image region, and the image ratio of the target image range;
    determining the maximum image azimuth angle, the minimum image azimuth angle, the maximum image pitch angle and the minimum image pitch angle as the target image position parameter.
  12. The remote interaction method according to claim 11, wherein the target image position is the image position, within the image region, corresponding to the center position of the sound emission range.
  13. The remote interaction method according to claim 11, wherein the step of determining the reduced target image range according to the target image position parameter comprises:
    determining a plurality of image corner positions of the adjusted target image range according to the maximum image azimuth angle, the minimum image azimuth angle, the maximum image pitch angle and the minimum image pitch angle;
    taking the image range enclosed by the plurality of image corner positions as the reduced target image range.
  14. The remote interaction method according to claim 13, wherein after the step of determining the plurality of image corner positions of the adjusted target image range according to the maximum image azimuth angle, the minimum image azimuth angle, the maximum image pitch angle and the minimum image pitch angle, the method further comprises:
    determining the magnification of the area of the image range enclosed by the plurality of image corner positions relative to the area of the image region;
    in response to the magnification being smaller than or equal to a preset multiple, performing the step of taking the image range enclosed by the plurality of image corner positions as the reduced target image range;
    in response to the magnification being greater than the preset multiple, taking the image range obtained by enlarging the image region by the preset multiple as the reduced target image range.
  15. The remote interaction method according to claim 8, wherein after the step of determining the ratio of the area of the image region to the area of the target image range, the method further comprises:
    in response to the ratio being greater than or equal to the preset value, performing the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device so that the remote device displays the first target display window.
  16. The remote interaction method according to claim 1, wherein after the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device so that the remote device displays the first target display window, the method further comprises:
    acquiring a spatial position change parameter of the sound emission range or an image position change parameter of the human-body image region within the target image range;
    adjusting the target image range according to the spatial position change parameter or the image position change parameter;
    outputting the sub-video data within the adjusted target image range in the first target display window, and sending the adjusted first target display window to the remote device, so that the remote device displays the adjusted first target display window.
  17. The remote interaction method according to claim 16, wherein the spatial position change parameter comprises a spatial azimuth angle change value and/or a spatial pitch angle change value of the sound emission range taking a shooting module as the base point, the shooting module being configured to capture the panoramic video;
    the image position change parameter comprises an image azimuth angle change value and/or an image pitch angle change value of the image region taking the preset imaging center of the panoramic video as the base point.
  18. The remote interaction method according to any one of claims 1 to 17, wherein the step of outputting the sub-video data within the target image range in the first target display window comprises:
    in response to there being more than one sound source object, acquiring the target number of sound source objects to be displayed in the first target display window;
    determining, among the more than one sound source objects, the target number of sound source objects as target objects;
    outputting, in the sub-window corresponding to each target object, the sub-video data within the target image range corresponding to that target object, and merging the target number of sub-windows in the first target display window.
  19. The remote interaction method according to claim 18, wherein the step of determining, among the more than one sound source objects, the target number of sound source objects as target objects comprises:
    acquiring a sound emission state parameter corresponding to each sound source object, the sound emission state parameter characterizing the interval between the sound emission time of the corresponding sound source object and the current time;
    determining, among the more than one sound source objects, the target number of sound source objects as target objects according to the sound emission state parameter of each sound source object.
  20. The remote interaction method according to claim 19, wherein the step of acquiring the sound emission state parameter corresponding to each sound source object comprises:
    acquiring the current label value of each sound source object, the label value of each sound source object being greater than or equal to a first preset value, the label value characterizing the number of consecutive times the corresponding sound source object has not spoken before the current moment;
    updating the current label value of each sound source object according to a preset rule, and taking the updated label value of each sound source object as the sound emission state parameter of that sound source object;
    wherein the preset rule comprises: setting the label value of a sound source object currently in a speaking state to the first preset value, and increasing the label value of a sound source object not currently in a speaking state by a second preset value.
  21. The remote interaction method according to claim 20, wherein the step of determining, among the more than one sound source objects, the target number of sound source objects as target objects according to the sound emission state parameter of each sound source object comprises:
    arranging all the sound emission state parameters in ascending order to obtain an arrangement result;
    determining, as the target objects, the sound source objects respectively corresponding to the target number of sound emission state parameters ranked first in the arrangement result.
  22. The remote interaction method according to claim 18, wherein the step of merging the target number of sub-windows in the first target display window comprises:
    determining, for each target object, a second image azimuth angle of a preset image position on the target image range of that target object, taking the preset imaging center of the panoramic video as the base point;
    determining the arrangement order of the target number of sub-windows according to the relative magnitudes of the second image azimuth angles corresponding to the target objects;
    merging and displaying the target number of sub-windows in the first target display window in the arrangement order.
  23. The remote interaction method according to claim 22, wherein a ray from the preset imaging center pointing in a preset horizontal direction is defined as a reference line, the line connecting the preset image position corresponding to each target object and the preset imaging center is defined as a target line, the second image azimuth angle of each target object is the horizontal angle from the reference line, measured clockwise, to the target line of that target object, and the step of determining the arrangement order of the target number of sub-windows according to the relative magnitudes of the second image azimuth angles corresponding to the target objects comprises:
    arranging the target number of sub-windows in ascending order of the second image azimuth angle to obtain the arrangement order.
  24. The remote interaction method according to any one of claims 1 to 17, wherein, while the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device is being performed, the method further comprises:
    outputting the panoramic video in a second target display window, and sending the second target display window to the remote device, so that the remote device displays the first target display window and the second target display window together.
  25. The remote interaction method according to any one of claims 1 to 17, wherein after the step of acquiring the panoramic video of the target space, the method further comprises:
    identifying a human-figure image region corresponding to a reference position of the panoramic video, wherein the image azimuth angle of the reference position taking the preset imaging center of the panoramic video as the base point is a preset angle value, the human-figure image region comprises the complete image corresponding to a target region of the human body, and the target region is the minimum region of the human body that needs to be shown during interaction;
    determining the minimum image azimuth angle of the human-figure image region taking the preset imaging center as the base point;
    in response to the minimum value being smaller than the preset angle value, adjusting the shooting range corresponding to the panoramic video according to the difference between the minimum value and the preset angle value, and returning to the step of acquiring the panoramic video of the target space;
    in response to the minimum value being greater than or equal to the preset angle value, performing the step of determining, according to the sound emission range, the target image range occupied by the sound source object in the panoramic video.
  26. The remote interaction method according to claim 25, wherein the reference position is an image edge position of the panoramic video.
  27. The remote interaction method according to any one of claims 1 to 17, wherein before the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device so that the remote device displays the first target display window, the method further comprises:
    acquiring a sensitivity parameter corresponding to the first target display window, the sensitivity parameter characterizing the update frequency of the video data in the first target display window;
    determining, according to the sensitivity parameter, the target duration by which sound recognition should be spaced;
    and after the step of outputting the sub-video data within the target image range in the first target display window and sending the first target display window to the remote device so that the remote device displays the first target display window, the method further comprises:
    after the target duration has elapsed, returning to the step of acquiring the sound emission range of the sound source object in the target space and acquiring the panoramic video of the target space.
  28. The remote interaction method according to claim 27, wherein the step of acquiring the sensitivity parameter corresponding to the first target display window comprises:
    acquiring a scene feature parameter of the current remote interaction scene or a user setting parameter;
    determining the sensitivity parameter according to the scene feature parameter or the user setting parameter.
  29. The remote interaction method according to any one of claims 1 to 17, wherein the remote interaction method further comprises:
    upon detecting a mute instruction, stopping the output at the remote device of the audio data collected in the target space;
    and/or, the remote interaction method further comprises:
    upon detecting a video-off instruction, stopping the output at the remote device of the video data collected in the target space;
    and/or, the remote interaction method further comprises:
    upon detecting a preset instruction, stopping execution of the step of acquiring the sound emission range of the sound source object in the target space;
    and/or, the remote interaction method further comprises:
    in response to a user in the scene where the remote device is located being in a speaking state, stopping execution of the step of acquiring the sound emission range of the sound source object in the target space; when no user in the scene where the remote device is located is in a speaking state, performing the step of acquiring the sound emission range of the sound source object in the target space.
  30. A remote interaction device, wherein the remote interaction device comprises:
    a shooting module;
    a microphone array; and
    a control apparatus, the shooting module and the microphone array both being connected to the control apparatus, the control apparatus comprising a memory, a processor, and a remote interaction program stored on the memory and executable on the processor, wherein the remote interaction program, when executed by the processor, implements the steps of the remote interaction method according to any one of claims 1 to 29.
  31. The remote interaction device according to claim 30, wherein the remote interaction device further comprises a speaker, a key module, a communication module and a data interface, all of which are connected to the control apparatus.
  32. A computer storage medium, wherein a remote interaction program is stored on the computer storage medium, and the remote interaction program, when executed by a processor, implements the steps of the remote interaction method according to any one of claims 1 to 29.
PCT/CN2022/084908 2022-01-29 2022-04-01 Remote interaction method, remote interaction device and computer storage medium WO2023142266A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210111658.4A CN114594892B (zh) 2022-01-29 2022-01-29 Remote interaction method, remote interaction device and computer storage medium
CN202210111658.4 2022-01-29

Publications (1)

Publication Number Publication Date
WO2023142266A1 true WO2023142266A1 (zh) 2023-08-03

Family

ID=81805004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/084908 WO2023142266A1 (zh) 2022-01-29 2022-04-01 Remote interaction method, remote interaction device and computer storage medium

Country Status (2)

Country Link
CN (1) CN114594892B (zh)
WO (1) WO2023142266A1 (zh)

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8289363B2 (en) * 2006-12-28 2012-10-16 Mark Buckler Video conferencing
US8319819B2 (en) * 2008-03-26 2012-11-27 Cisco Technology, Inc. Virtual round-table videoconference
CN101442654B (zh) * 2008-12-26 2012-05-23 华为终端有限公司 Method, apparatus and system for switching video objects in video communication
CN101866215B (zh) * 2010-04-20 2013-10-16 复旦大学 Human-computer interaction apparatus and method using gaze tracking in video surveillance
TW201901527A (zh) * 2017-05-26 2019-01-01 和碩聯合科技股份有限公司 Video conference device and video conference management method
JP2019012509A (ja) * 2018-02-23 2019-01-24 株式会社コロプラ Program for providing a virtual space via a head-mounted device, method, and information processing apparatus for executing the program
CN110166920B (zh) * 2019-04-15 2021-11-09 广州视源电子科技股份有限公司 Desktop conference sound amplification method, system, apparatus, device and storage medium
CN110460729A (zh) * 2019-08-26 2019-11-15 延锋伟世通电子科技(南京)有限公司 Omnidirectional voice interaction system and method for an in-vehicle conference mode
CN111432115B (zh) * 2020-03-12 2021-12-10 浙江大华技术股份有限公司 Face tracking method based on sound-assisted localization, terminal and storage device
CN111651632A (zh) * 2020-04-23 2020-09-11 深圳英飞拓智能技术有限公司 Method and apparatus for outputting audio and video of a speaker in a video conference
CN111650558B (zh) * 2020-04-24 2023-10-10 平安科技(深圳)有限公司 Method, apparatus and computer device for locating a sound source user
CN111818294A (zh) * 2020-08-03 2020-10-23 上海依图信息技术有限公司 Method, medium and electronic device for real-time presentation of a multi-person conference combining audio and video
CN112487246A (zh) * 2020-11-30 2021-03-12 深圳卡多希科技有限公司 Method and apparatus for identifying a speaker in multi-person video
CN112614508B (zh) * 2020-12-11 2022-12-06 北京华捷艾米科技有限公司 Audio-video combined localization method, apparatus, electronic device and storage medium
CN113225515A (zh) * 2020-12-28 2021-08-06 南京愔宜智能科技有限公司 Liftable audio and video conference system
CN113093106A (zh) * 2021-04-09 2021-07-09 北京华捷艾米科技有限公司 Sound source localization method and system
CN113630556A (zh) * 2021-09-26 2021-11-09 北京市商汤科技开发有限公司 Focusing method, apparatus, electronic device and storage medium
CN113794814B (zh) * 2021-11-16 2022-02-08 珠海视熙科技有限公司 Method, apparatus and storage medium for controlling video image output

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130300820A1 (en) * 2009-04-14 2013-11-14 Huawei Device Co., Ltd. Remote presenting system, device, and method
CN107948577A (zh) * 2017-12-26 2018-04-20 深圳市保千里电子有限公司 Panoramic video conferencing method and system
CN110191303A (zh) * 2019-06-21 2019-08-30 Oppo广东移动通信有限公司 Video call method based on screen sound emission and related products
CN112578338A (zh) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 Sound source localization method, apparatus, device and storage medium
CN113676622A (zh) * 2020-05-15 2021-11-19 杭州海康威视数字技术股份有限公司 Video processing method, camera apparatus, video conference system and storage medium
CN112804455A (zh) * 2021-01-08 2021-05-14 重庆创通联智物联网有限公司 Remote interaction method and apparatus, video device and computer-readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116866720A (zh) * 2023-09-04 2023-10-10 国网山东省电力公司东营供电公司 Camera angle adaptive control method, system and terminal based on sound source localization
CN116866720B (zh) * 2023-09-04 2023-11-28 国网山东省电力公司东营供电公司 Camera angle adaptive control method, system and terminal based on sound source localization

Also Published As

Publication number Publication date
CN114594892A (zh) 2022-06-07
CN114594892B (zh) 2023-11-24

Similar Documents

Publication Publication Date Title
US20140376740A1 (en) Directivity control system and sound output control method
US20170047076A1 (en) Method and device for achieving object audio recording and electronic apparatus
US10798483B2 (en) Audio signal processing method and device, electronic equipment and storage medium
CN111107389B (zh) 确定观看直播时长的方法、装置和系统
CN108924375B (zh) 铃声音量的处理方法、装置、存储介质及终端
US20230090916A1 (en) Display apparatus and processing method for display apparatus with camera
JP2023519422A (ja) オーディオ処理方法、装置、可読媒体及び電子機器
US20240094970A1 (en) Electronic system for producing a coordinated output using wireless localization of multiple portable electronic devices
WO2023142266A1 (zh) 远程交互方法、远程交互设备以及计算机存储介质
CN111045945B (zh) 模拟直播的方法、装置、终端、存储介质及程序产品
CN112269559A (zh) 音量调整方法、装置、电子设备及存储介质
CN113556481A (zh) 视频特效的生成方法、装置、电子设备及存储介质
WO2023066373A1 (zh) 确定样本图像的方法、装置、设备及存储介质
CN109325219B (zh) 一种生成记录文档的方法、装置及系统
US11902754B2 (en) Audio processing method, apparatus, electronic device and storage medium
CN112927718B (zh) 感知周围环境的方法、装置、终端和存储介质
CN113301444B (zh) 视频处理方法、装置、电子设备及存储介质
CN111982293B (zh) 体温测量方法、装置、电子设备及存储介质
CN114550393A (zh) 门铃控制方法、电子设备和可读存储介质
CN114422743A (zh) 视频流显示方法、装置、计算机设备和存储介质
US20210397332A1 (en) Mobile device and control method for mobile device
CN114245148A (zh) 直播互动方法、装置、终端、服务器及存储介质
CN114093020A (zh) 动作捕捉方法、装置、电子设备及存储介质
WO2022161146A1 (zh) 视频录制方法及电子设备
CN113259654B (zh) 视频帧率的检测方法、装置、电子设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923084

Country of ref document: EP

Kind code of ref document: A1