WO2024087641A1 - Audio and video control method with wireless microphone intelligent tracking function - Google Patents

Audio and video control method with wireless microphone intelligent tracking function

Info

Publication number
WO2024087641A1
WO2024087641A1 (PCT/CN2023/099068, CN2023099068W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
information
person
module
digital signal
Prior art date
Application number
PCT/CN2023/099068
Other languages
English (en)
French (fr)
Inventor
陈炳佐
黄文玲
Original Assignee
深圳奥尼电子股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳奥尼电子股份有限公司
Publication of WO2024087641A1

Links

Definitions

  • the invention relates to the technical field of audio and video control, and in particular to an audio and video control method with a wireless microphone intelligent tracking function.
  • the audio or video conference host system mainly uses microphones and speakers as carriers for sound signal transmission;
  • video conferencing refers to a meeting in which people in two or more locations have face-to-face conversations through communication equipment and networks.
  • video conferencing can be divided into point-to-point conferencing and multi-point conferencing.
  • in daily life, individuals have no particular requirements for conversation security, conference quality, or conference scale, and can simply use video software for video chat.
  • business video conferencing of government agencies, enterprises and institutions requires stable and secure networks, reliable conference quality, and formal conference environments. Therefore, professional video conferencing equipment is needed to form a special video conferencing system. Since such video conferencing systems all use televisions for display, they are also called television conferences or video conferences.
  • however, when most enterprises hold video conferences, several departments often meet at the same time, each with multiple participants who must discuss with one another during the meeting; it is therefore difficult for the microphone to associate the audio it picks up with the actual person, the meeting rooms and the fixed positions of the participants differ from case to case, the position of every person present cannot be determined quickly and efficiently, and the camera function of the intelligent microphone cannot precisely find the video image of the person who is speaking.
  • the purpose of the present invention is to provide an audio and video control method with a wireless microphone intelligent tracking function to solve the problems raised in the above background technology.
  • an audio and video control method with a wireless microphone intelligent tracking function comprising the following specific processes:
  • Step S100 the audio and video control system obtains audio information and video information of the space where the wireless microphone is located; the audio information includes first audio information and second audio information, and the video information includes global person image information and local person image information;
  • Step S200 Based on the audio information in step S100, the first audio information is parsed to obtain first audio attributes, and audio with different attributes is distinguished; the distinguished audio attributes are analyzed in combination with the global person image information to match the audio of each attribute in the audio information to the specific person information in the global person image information;
  • Step S300 After the matching is completed, the positions of the different persons are located according to the second audio information and the local person image information, and each person's position data is sent to the audio and video control system;
  • Step S400 After the audio and video control system receives the location data of all persons, it monitors whether the second audio information of the person corresponding to each piece of location data is updated; when the system detects a person whose second audio information has been updated, it sends that person's location data and globally enlarges the local person image information corresponding to that person to obtain the audio and video monitoring information of that person.
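Read as a control loop, steps S100–S400 reduce to watching each located person for updated second audio information. The following minimal Python sketch illustrates that loop under assumed data structures; every identifier is illustrative, not taken from the patent.

```python
# Minimal sketch of the S100-S400 flow; every identifier is illustrative.
from dataclasses import dataclass

@dataclass
class TrackedPerson:
    person_id: int
    position: tuple          # position data reported in step S300
    audio_version: int = 0   # counter for this person's second audio information

def watch_for_updates(people, read_audio_version, send_position, enlarge_local_image):
    """Step S400: after all position data has been received, watch each
    person's second audio information and react when it is updated."""
    for person in people:
        latest = read_audio_version(person.person_id)
        if latest != person.audio_version:         # second audio info was updated
            person.audio_version = latest
            send_position(person.position)         # send the speaker's position data
            enlarge_local_image(person.person_id)  # globally enlarge the local image
```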
  • the audio control system includes a debugging mode and a conference mode; the debugging mode is used to collect the first audio information and the global person image information, and the conference mode is used to collect the second audio information and the local person image information;
  • the debugging mode is used when the wireless microphone is put in place: the wireless microphone is connected to the power supply of the audio and video conference host.
  • the wireless microphone has a power button, and when powered on its camera pans and tilts up, down, left, and right without a fixed target to capture the global and local person image information.
  • when the audio control system identifies a specific person, the camera on the wireless microphone aims at that participant, obtains the person's location address, and sends the location address to the audio and video control system; this positioning is repeated.
  • the wireless microphone records each person's address in the audio and video control system, completing the initial positioning of the system.
  • the conference mode is used when a participant speaks: based on the position information already confirmed in the audio and video system, the camera is turned toward the participant corresponding to that position information, and the participant's second audio information and local person image information are amplified.
  • the division of the first audio information and the second audio information includes the following process:
  • Step S110 the audio and video control system obtains the audio information of the audio collection stage, converts the audio information into digital signals, obtains the total time interval t0 between adjacent digital signals and the total information length p0 of the digital signals, and calculates the overall sound fluctuation frequency index of the digital signals, w = p0/t0;
  • the time interval reflects the two situations that the sound collection stage may contain: in one, all sounds are disorderly and irregular; in the other, sound is emitted regularly once planning is complete; the two are distinguished in order to tell apart the sound-change patterns applicable to many sound-monitoring scenarios, while the sound fluctuation frequency index represents the average change over the entire sound collection stage;
  • Step S120 Based on the sound fluctuation frequency index of step S110, the audio and video control system traverses the digital signals from the first one of the sound collection stage, obtains the sound fluctuation frequency of the first digital signal and its adjacent digital signal, and takes the difference between that sound fluctuation frequency and the overall sound fluctuation frequency index to obtain the frequency fluctuation difference;
  • the frequency fluctuation difference indicates the degree of deviation between the sound fluctuation frequency and the average fluctuation frequency index in the monitoring scene over time;
  • the change of sound is divided by a critical point, and the relative magnitudes of the frequency fluctuation differences before and after this point are deliberately left undefined, which accommodates more sound-change patterns across scenes: sometimes the frequency fluctuation is large in the stage before sound collection, and sometimes it is small;
  • Step S130 sequentially obtain the sound fluctuation frequencies of adjacent digital signals in the sound collection stage and mark the turning digital signal; the turning digital signal is the digital signal at which the quotient of the frequency fluctuation difference of the preceding adjacent digital signal and that of the following adjacent digital signal is negative (a sign change), and the frequency fluctuation differences of all digital signals after the turning digital signal have the same sign as the frequency fluctuation difference of the turning digital signal;
  • Step S140 Based on the judgment rule of step S130, the audio and video control system divides the audio information before the turning digital signal into the first audio information, and divides the audio information after the turning digital signal into the second audio information.
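To make the division concrete, here is a minimal Python sketch of steps S110–S140 under assumed inputs: `freqs` holds the sound fluctuation frequency of each adjacent-signal pair, and `t0`/`p0` are the totals defined in step S110. All identifiers are illustrative; the patent specifies no implementation.

```python
# Minimal sketch of steps S110-S140. `freqs[i]` is assumed to be the sound
# fluctuation frequency of the i-th adjacent-signal pair; t0 and p0 are the
# totals defined in step S110. All identifiers are illustrative.

def split_audio(freqs, t0, p0):
    w = p0 / t0                          # S110: overall fluctuation frequency index
    diffs = [f - w for f in freqs]       # S120: frequency fluctuation differences

    # S130: the turning signal is where the quotient of consecutive differences
    # is negative (a sign change) and every later difference keeps the new sign.
    for i in range(1, len(diffs)):
        if diffs[i - 1] * diffs[i] < 0:  # product < 0 iff the quotient < 0
            if all(d * diffs[i] > 0 for d in diffs[i:]):
                # S140: audio before the turn is first audio, after it second
                return freqs[:i], freqs[i:]
    return freqs, []                     # no turning signal: everything is first audio

# fluctuations sit below the index, then stay above it -> split at the change
first, second = split_audio([1.0, 0.8, 1.2, 1.4, 1.5], t0=10.0, p0=11.0)
print(first, second)                     # [1.0, 0.8] [1.2, 1.4, 1.5]
```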
  • step S200 includes the following process:
  • Step S210 the digital signals corresponding to the first audio attributes are distinguished by frequency similarity, and the audio attributes of digital signals whose frequency similarity is greater than 95% are grouped into one class, recorded as the jth first audio attribute, j = {1, 2, ..., k}; the decibel feature of each audio occurrence is recorded, where s, any nonzero natural number, counts the appearances of the jth distinguished first audio attribute during the sound collection stage, giving a decibel feature for the sth occurrence of the jth first audio attribute;
  • frequency reflects the character of a sound, and every person's voice frequency is different, so the number of monitored persons is first divided by frequency and each person's decibel features are then analyzed separately, because decibel level is affected by the distance between the receiving end and the producing end;
  • Step S220 record the different first audio attributes and their corresponding decibel features as a set A, and calculate, for the jth first audio attribute in set A, the average decibel difference ratio of the decibel features over time, where each term is the difference between two adjacent decibel features of the jth first audio attribute and n is the number of decibel feature differences, n at least 1; then calculate the overall deviation index Q of the different first audio attributes in set A;
  • Step S230 Classify the global images of the different persons in the global person image information to obtain the jth class of global person image, record the person proportions of the different classes of global person images as a set B, and calculate, for the jth class of global person image in set B, the average person-image proportion difference over time, where each term is the person-image proportion difference between two adjacent images of the jth class and m is the number of person-image proportion differences, m at least 1; then calculate the overall deviation index Z of the different classes of global person images in set B;
  • the overall deviation indexes reflect the span of decibel levels and the span of distances reflected by the images across all monitored persons; if these two spans are basically consistent, the change in decibels can be said to be connected with distance.
  • Step S240 Calculate the deviation index similarity from the overall deviation index Q of step S220 and the overall deviation index Z of step S230; if the deviation index similarity is greater than the similarity threshold, the change in a person's decibel level is related to the movement of that person's position in the global image; the average decibel difference ratios in set A and the average person-image proportion differences in set B are then matched one to one at a similarity greater than 99%, yielding the audio attribute corresponding to each person in the global person image information.
  • once the change in decibels has been shown to correlate with distance, the decibel-change pattern of each class of audio attribute and the change pattern of the global images are analyzed further; after the correspondence is made, it is possible to determine which person emits which frequency. For example, with three classes of first audio attributes on site, each containing three decibel features, set A holds the attributes with their decibel features and set B holds the person proportions of the three classes of global person images; the one-to-one correspondence then compares, class by class, the similarity between the entries of set A and the entries of set B.
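The patent states the formulas of steps S220–S240 only in figures that are not reproduced in the text, so the sketch below substitutes plausible stand-ins: the per-class statistic is the mean of adjacent-feature differences, each set is normalised to its own maximum so decibels and image proportions become comparable, and the deviation index is the span of the normalised averages. These choices, and all names, are assumptions made for illustration.

```python
# Minimal sketch of steps S220-S240 under assumed stand-in formulas.

def avg_adjacent_diff(values):
    diffs = [b - a for a, b in zip(values, values[1:])]  # n >= 1 differences
    return sum(diffs) / len(diffs)

def normalize(stats):
    top = max(stats.values())
    return {k: v / top for k, v in stats.items()}

def similarity(x, y):
    return min(x, y) / max(x, y)         # 1.0 means identical magnitudes

# Set A: decibel features per first audio attribute (S220).
# Set B: person-image proportions per class of global person image (S230).
A = {1: [60.0, 63.0, 66.0], 2: [50.0, 51.0, 52.0]}
B = {"near": [0.30, 0.33, 0.36], "far": [0.10, 0.11, 0.12]}

a_avg = normalize({j: avg_adjacent_diff(v) for j, v in A.items()})
b_avg = normalize({g: avg_adjacent_diff(v) for g, v in B.items()})

Q = max(a_avg.values()) - min(a_avg.values())  # overall deviation index of A
Z = max(b_avg.values()) - min(b_avg.values())  # overall deviation index of B

if similarity(Q, Z) > 0.95:                    # deviation index similarity (S240)
    # one-to-one correspondence at > 99% similarity between A and B entries
    matches = {j: g for j in a_avg for g in b_avg
               if similarity(a_avg[j], b_avg[g]) > 0.99}
    print(matches)                             # {1: 'near', 2: 'far'}
```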
  • step S300 includes the following process:
  • based on the data matched one to one in step S240, the person who first speaks in the second audio information is taken as the starting person; the person proportions of all person images in the local person image information are obtained and sorted from large to small, and the position coordinates corresponding to the image with the smallest person proportion are set as the starting coordinates;
  • when any monitored person makes a sound, sector adaptation is performed on the starting coordinates, based on the relation between that speaker's person proportion in the local person image and the starting person's person proportion, to obtain the position coordinates of the second person; sector adaptation means that, with the starting coordinates as the center of a sector, the relation between the speaker's person proportion and the starting person's person proportion is converted into numerical data, and an estimate in the same direction is made with that data as the radius.
  • during wireless-microphone monitoring, people in a settled state may be distributed in many layouts, such as a matrix or a circle; the images of everyone are first captured from a single camera position, which yields each person's proportion in the image, and since each person's position differs while the capture position stays fixed, the person-image proportions differ too, so sector adaptation increases the tolerance of the position coordinates to different layouts.
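A minimal sketch of sector adaptation follows. It assumes the relation between the starting person's proportion and the speaker's proportion maps directly to a radius, and that the direction is supplied externally; both assumptions go beyond what the patent text specifies, and all names are illustrative.

```python
# Minimal sketch of sector adaptation (step S300) under assumed geometry.
import math

def sector_adapt(start_xy, start_ratio, speaker_ratio, direction_rad):
    # A larger person-image proportion means the speaker is closer to the
    # camera, so the radius shrinks as speaker_ratio grows.
    radius = start_ratio / speaker_ratio
    return (start_xy[0] + radius * math.cos(direction_rad),
            start_xy[1] + radius * math.sin(direction_rad))

# starting coordinates come from the image with the smallest person proportion
second_person_xy = sector_adapt(start_xy=(0.0, 0.0), start_ratio=0.10,
                                speaker_ratio=0.25, direction_rad=math.pi / 4)
```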
  • the audio and video control method includes an audio and video control system, and the audio and video control system includes a spatial information acquisition module, a spatial information analysis module, a position data acquisition module and a monitoring data amplification module;
  • the spatial information acquisition module is used to obtain the data information of the space where the wireless microphone is located and transmit the data information to the spatial information analysis module; the spatial information analysis module is used to analyze the data information from the spatial information acquisition module; the position data acquisition module is used to determine the position information of the people in the space based on the analyzed data information; the monitoring data amplification module is used, when the spatial information acquisition module adds new data information, to globally enlarge the persons in the new data information to obtain the audio and video monitoring information of the corresponding persons;
  • the spatial information acquisition module includes an audio information acquisition module and a video information acquisition module; the audio information acquisition module acquires the audio information, which includes the first audio information and the second audio information; the video information acquisition module acquires the video information and includes a global person image information acquisition module and a local person image information acquisition module;
  • the audio information acquisition module includes a digital signal conversion module, a sound fluctuation frequency index calculation module, a turning digital signal marking module and an audio information division module;
  • the digital signal conversion module is used to convert the audio information into digital signals; the sound fluctuation frequency index calculation module obtains the total time interval t0 between adjacent digital signals and the total information length p0 of the digital signals, and calculates the overall sound fluctuation frequency index of the digital signals, w = p0/t0.
  • the turning digital signal marking module traverses the sound fluctuation frequencies of the first digital signal and its adjacent digital signals, and takes the difference between the first digital signal's sound fluctuation frequency and the overall sound fluctuation frequency index to obtain the frequency fluctuation difference; it then obtains the sound fluctuation frequencies of adjacent digital signals in the sound collection stage in sequence and marks the turning digital signal, the digital signal at which the quotient of the frequency fluctuation difference of the preceding adjacent digital signal and that of the following adjacent digital signal is negative, the frequency fluctuation differences of all digital signals after the turning digital signal having the same sign as that of the turning digital signal;
  • the audio information division module is used to divide the audio information before the turning digital signal into the first audio information, and divide the audio information after the turning digital signal into the second audio information.
  • the spatial information analysis module includes an audio information analysis module, a video information analysis module and a person matching module;
  • the audio information analysis module includes an audio attribute classification module, an average decibel difference ratio calculation module, and an audio attribute deviation index calculation module;
  • the video information analysis module includes a global person image classification module, a person-image proportion difference calculation module, and a global person image deviation index calculation module;
  • the person matching module includes a deviation index similarity calculation module and a person audio attribute correspondence module;
  • the audio attribute classification module classifies the different audio attributes.
  • the average decibel difference ratio calculation module is used to record the decibel features of each audio occurrence, record the different first audio attributes and their corresponding decibel features as a set A, and calculate the average decibel difference ratio over time of the decibel features corresponding to the jth first audio attribute in set A;
  • the audio attribute deviation index calculation module is used to calculate the overall deviation index of the different first audio attributes in set A;
  • the global person image classification module is used to classify the person images in the global person image information acquisition module, and record the person-image proportions corresponding to the different person images as a set B;
  • the person-image proportion difference calculation module is used to calculate the average person-image proportion differences over time of the different classes of global person images in set B, and the global person image deviation index calculation module is used to calculate the overall deviation index of the different classes of global person images;
  • the deviation index similarity calculation module is used to compare the numerical similarity between the outputs of the global person image deviation index calculation module and the audio attribute deviation index calculation module; when the similarity is greater than the similarity threshold, the change in a person's decibel level is related to the movement of that person's position in the global image; the person audio attribute correspondence module matches the average decibel difference ratios in set A and the average person-image proportion differences in set B one to one at a similarity greater than 99%, obtaining the audio attribute corresponding to each person in the global person image information.
  • the position data acquisition module includes a person-image proportion sorting module, a starting coordinate setting module and a sector adaptation module;
  • the person-image proportion sorting module is used to take the person who first speaks in the second audio information as the starting person, obtain the person proportions of all person images in the local person image information, and sort the person proportions from large to small;
  • the starting coordinate setting module sets the position coordinates corresponding to the image with the smallest person proportion as the starting coordinates;
  • the sector adaptation module is used, when any monitored person makes a sound, to perform sector adaptation on the starting coordinates based on the relation between that speaker's person proportion in the local person image and the starting person's person proportion, obtaining the position coordinates of the second person.
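As a structural sketch only, the four top-level modules can be wired as a pipeline; every class and method name below is illustrative rather than taken from the patent.

```python
# Minimal sketch of the four top-level modules wired as a pipeline.
class SpatialInfoAcquisition:
    def acquire(self) -> dict:
        # returns first/second audio information and global/local person images
        return {"audio": [], "video": []}

class SpatialInfoAnalysis:
    def analyze(self, data: dict) -> dict:
        # matches audio attributes to persons (steps S210-S240)
        return {"matches": {}}

class PositionDataAcquisition:
    def locate(self, analyzed: dict) -> dict:
        # derives per-person position coordinates (step S300)
        return {"positions": {}}

class MonitoringDataAmplification:
    def amplify(self, person_id: int) -> None:
        # globally enlarges the person's local image (step S400)
        print(f"enlarging local image of person {person_id}")

# data flows acquisition -> analysis -> position -> amplification
acq, ana, pos, amp = (SpatialInfoAcquisition(), SpatialInfoAnalysis(),
                      PositionDataAcquisition(), MonitoringDataAmplification())
positions = pos.locate(ana.analyze(acq.acquire()))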
  • the present invention solves the problem that, in scenes suited to intelligent wireless-microphone tracking, the correspondence between the source of a sound and a specific person cannot be distinguished simply and efficiently because of the number of people present, so that sound is captured during monitoring but its position cannot be clearly determined.
  • the present invention adapts its adjustable positioning to any environment monitored by microphones, and uses combined analysis of audio information and video information to determine the correlation of sound positions in the space and thereby obtain the correspondence between sounds and people.
  • the present invention makes the real-time relation between the sound a person emits and that person's position distinguishable in different environments, improving the efficiency with which the system identifies people; and by determining coordinates through image proportions, it increases the tolerance of the position coordinates to different scenes.
  • FIG. 1 is a system structure diagram of the audio and video control method with a wireless microphone intelligent tracking function according to the present invention;
  • FIG. 2 is a step diagram of the audio and video control method with a wireless microphone intelligent tracking function according to the present invention;
  • FIG. 3 is a block diagram of the audio and video control principle of the audio and video control method with a wireless microphone intelligent tracking function according to the present invention;
  • FIG. 4 is a block diagram of the conference microphone principle of the audio and video control method with a wireless microphone intelligent tracking function according to the present invention;
  • FIG. 5 is a schematic block diagram of the camera device host of the audio and video control method with a wireless microphone intelligent tracking function according to the present invention.
  • in an embodiment, referring to FIG. 1 to FIG. 5, the audio and video control method with the wireless microphone intelligent tracking function proceeds through steps S100 to S400 described above;
  • the audio and video control system includes a camera device host and a plurality of conference microphones, each of which is connected to the camera device host by wireless communication;
  • the conference microphone includes a microphone control main chip, a microphone input circuit, a first 2.4G transceiver circuit, a key circuit, and a microphone power supply circuit.
  • the microphone input circuit, the first 2.4G transceiver circuit, and the key circuit are respectively connected to the microphone control main chip, and the microphone power supply circuit supplies power to the microphone control main chip, the microphone input circuit, the first 2.4G transceiver circuit, and the key circuit.
  • the camera device host has a camera and a speaker;
  • the circuit structure of the camera device host includes a camera device host chip, an audio output circuit, a host button, a second 2.4G transceiver circuit, a speaker output circuit, a USB interface circuit, a camera, a track control circuit, and a power supply circuit;
  • the camera device host chip is connected to the track control circuit, and the track control circuit is connected to the camera;
  • the camera device host chip is also electrically connected to the USB interface circuit, and the USB interface circuit is also connected to the camera;
  • the camera device host chip is electrically connected to the audio output circuit, and the audio output circuit is connected to the audio communication device;
  • the camera device host chip is electrically connected to the speaker output circuit, and the speaker output circuit is used to connect the speaker;
  • the camera device host chip is also electrically connected to the second 2.4G transceiver circuit;
  • the microphone control main chip and the camera device host chip are Bluetooth 5.3 LE Audio chips;
  • the camera device host has three axes, XYZ, and drives the camera through the three axes;
  • after the camera device host starts and connects the camera, the XYZ axis track scans; when the camera comes to face a conference microphone, that microphone sends a command to the camera device host, and the host records the XYZ axis track position corresponding to this conference microphone.
  • the conference microphone later sends a set of control codes to the camera device host over the wireless link; on receiving the codes, the host uses the stored XYZ axis track position of that microphone to turn the camera toward it.
  • the conference microphone encodes the received audio signal with the LE Audio LC3 and LC3+ codecs, and wirelessly transmits the resulting voice packets to the camera device host using TDMA time-division multiplexing.
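The scan-and-recall behaviour described above can be sketched as follows; the wireless link, LC3 coding, and TDMA transport are abstracted away, and every identifier is illustrative rather than taken from the patent.

```python
# Minimal sketch: the host records an XYZ track position per conference
# microphone during the scan, then turns the camera back to that position
# when the microphone sends its control code.
from dataclasses import dataclass

@dataclass
class TrackPosition:
    x: float
    y: float
    z: float

class CameraHost:
    def __init__(self):
        self.positions = {}                       # microphone id -> TrackPosition

    def on_scan_hit(self, mic_id, pos):
        """Scan phase: the camera faces a microphone, which commands the
        host to record the current XYZ axis track position."""
        self.positions[mic_id] = pos

    def on_control_code(self, mic_id):
        """Conference phase: a microphone's control code makes the host turn
        the camera to that microphone's recorded position."""
        self.move_camera(self.positions[mic_id])

    def move_camera(self, pos):
        print(f"turning camera to x={pos.x}, y={pos.y}, z={pos.z}")

host = CameraHost()
host.on_scan_hit("mic-1", TrackPosition(10.0, 5.0, 0.0))
host.on_control_code("mic-1")
```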

Abstract

The present invention discloses an audio and video control method with a wireless microphone intelligent tracking function, comprising the following processes: Step S100: an audio and video control system obtains audio information and video information of the space where the wireless microphone is located, the audio information including first audio information and second audio information, and the video information including global person image information and local person image information; Step S200: the first audio information is parsed to obtain first audio attributes, and the distinguished audio attributes are matched with the global person image information; Step S300: the positions of the different persons are located according to the second audio information and the local person image information; Step S400: whether the second audio information of the person corresponding to each piece of position data is updated is monitored, and the updated person's position data is sent while the local person image information corresponding to that person is globally enlarged to obtain the audio and video monitoring information of the corresponding person.


Claims (9)

  1. An audio and video control method with a wireless microphone intelligent tracking function, characterized by comprising the following specific processes:
    Step S100: an audio and video control system obtains audio information and video information of the space where the wireless microphone is located, said audio information including first audio information and second audio information, and said video information including global person image information and local person image information;
    Step S200: based on the audio information of said step S100, the first audio information is parsed to obtain first audio attributes, and audio with different attributes is distinguished; the distinguished audio attributes are analyzed in combination with the global person image information to match the audio of each attribute in the audio information to the specific person information in the global person image information;
    Step S300: after the matching is completed, the positions of the different persons are located according to the second audio information and the local person image information, and each person's position data is sent to the audio and video control system;
    Step S400: after said audio and video control system has received the position data of all persons, it monitors whether the second audio information of the person corresponding to each piece of position data is updated; when said audio and video control system detects a person whose second audio information has been updated, it sends that person's position data and globally enlarges the local person image information corresponding to that person to obtain the audio and video monitoring information of that person.
  2. The audio and video control method with a wireless microphone intelligent tracking function according to claim 1, characterized in that: said audio control system includes a debugging mode and a conference mode; said debugging mode is used to collect the first audio information and the global person image information, and said conference mode is used to collect the second audio information and the local person image information;
    said debugging mode is used when the wireless microphone is put in place: said wireless microphone is connected to the power supply of the audio and video conference host; said wireless microphone is provided with a power button and, when powered on, pans and tilts up, down, left and right without a fixed target to capture the global person image information and the local person image information; when said audio control system identifies a specific person, the camera on said wireless microphone aims at the participant, obtains the person's position address and sends the position address to the audio and video control system; the positioning is repeated, and said wireless microphone records each person's address in the audio and video control system to complete the initial positioning of the system;
    said conference mode is used when a participant speaks: based on the position information already confirmed in the audio and video system, the camera is turned toward the participant corresponding to the position information, and the participant's second audio information and local person image information are amplified.
  3. The audio and video control method with a wireless microphone intelligent tracking function according to claim 1, characterized in that the division into the first audio information and the second audio information includes the following process:
    Step S110: the audio and video control system obtains the audio information of the audio collection stage, converts said audio information into digital signals, obtains the total time interval t0 between adjacent digital signals and the total information length p0 of the digital signals, and calculates the overall sound fluctuation frequency index of the digital signals, w = p0/t0;
    Step S120: based on the sound fluctuation frequency index of step S110, the audio and video control system traverses the digital signals from the first one of the sound collection stage, obtains the sound fluctuation frequency of the first digital signal and its adjacent digital signal, and takes the difference between the first digital signal's sound fluctuation frequency and the overall sound fluctuation frequency index to obtain the frequency fluctuation difference;
    Step S130: the sound fluctuation frequencies of adjacent digital signals in the sound collection stage are obtained in sequence and the turning digital signal is marked; said turning digital signal is the digital signal at which the quotient of the frequency fluctuation difference of the preceding adjacent digital signal and that of the following adjacent digital signal is negative, and the frequency fluctuation differences of all digital signals after the turning digital signal have the same sign as the frequency fluctuation difference corresponding to said turning digital signal;
    Step S140: based on the judgment rule of step S130, the audio and video control system divides the audio information before the turning digital signal into the first audio information and the audio information after the turning digital signal into the second audio information.
  4. The audio and video control method with a wireless microphone intelligent tracking function according to claim 2, characterized in that said step S200 includes the following process:
    Step S210: the digital signals corresponding to the first audio attributes are distinguished by frequency similarity, and the audio attributes of digital signals whose frequency similarity is greater than 95% are grouped into one class, recorded as the jth first audio attribute, j = {1, 2, ..., k}, j indexing the classes of first audio attributes; the decibel feature of each audio occurrence is recorded, s being any nonzero natural number that counts the appearances of the jth distinguished first audio attribute during the sound collection stage, giving the decibel feature of the sth occurrence of the jth first audio attribute;
    Step S220: the different first audio attributes and their corresponding decibel features are recorded as a set A, and for the jth first audio attribute in set A the average decibel difference ratio of the decibel features over time is calculated, each term being the difference between two adjacent decibel features of the jth first audio attribute and n, at least 1, being the number of decibel feature differences; the overall deviation index Q of the different first audio attributes in set A is calculated;
    Step S230: the global images of the different persons in the global person image information are classified to obtain the jth class of global person image; the person proportions of the different classes of global person images are recorded as a set B, and for the jth class of global person image in set B the average person-image proportion difference over time is calculated, each term being the person-image proportion difference between two adjacent images of the jth class and m, at least 1, being the number of person-image proportion differences; the overall deviation index Z of the different classes of global person images in set B is calculated;
    Step S240: the deviation index similarity is calculated from the overall deviation index Q of step S220 and the overall deviation index Z of step S230; if said deviation index similarity is greater than the similarity threshold, the change in a person's decibel level is related to the movement of that person's position in the global image, and the average decibel difference ratios in set A and the average person-image proportion differences in set B are matched one to one at a similarity greater than 99%, obtaining the audio attribute corresponding to each person in the global person image information.
  5. The audio and video control method with a wireless microphone intelligent tracking function according to claim 3, characterized in that said step S300 includes the following process:
    based on the data matched one to one in step S240, the person who speaks first in the second audio information is taken as the starting person, the person proportions of all person images in the local person image information are obtained, said person proportions are sorted from large to small, and the position coordinates corresponding to the image with the smallest person proportion are set as the starting coordinates;
    when any monitored person makes a sound, sector adaptation is performed on the starting coordinates, based on the relation between that speaker's person proportion in the local person image and the starting person's person proportion, to obtain the position coordinates of the second person; said sector adaptation means that, with the starting coordinates as the center of a sector, the relation between the speaker's person proportion and the starting person's person proportion is converted into numerical data, and an estimate in the same direction is made with that data as the radius.
  6. The audio and video control method with a wireless microphone intelligent tracking function according to claim 1, characterized in that: said audio and video control method includes an audio and video control system, and said audio and video control system includes a spatial information acquisition module, a spatial information analysis module, a position data acquisition module and a monitoring data amplification module;
    said spatial information acquisition module is used to obtain the data information of the space where the wireless microphone is located and transmit said data information to said spatial information analysis module; said spatial information analysis module is used to analyze the data information from said spatial information acquisition module; said position data acquisition module is used to determine the position information of the persons in the space from the analyzed data information; said monitoring data amplification module is used, when the spatial information acquisition module adds new data information, to globally enlarge the persons in the new data information to obtain the audio and video monitoring information of the corresponding persons.
  7. The audio and video control method with a wireless microphone intelligent tracking function according to claim 5, characterized in that: said spatial information acquisition module includes an audio information acquisition module and a video information acquisition module; said audio information acquisition module acquires the audio information, said audio information including the first audio information and the second audio information; said video information acquisition module acquires the video information and includes a global person image information acquisition module and a local person image information acquisition module;
    said audio information acquisition module includes a digital signal conversion module, a sound fluctuation frequency index calculation module, a turning digital signal marking module and an audio information division module;
    said digital signal conversion module is used to convert the audio information into digital signals, and said sound fluctuation frequency index calculation module obtains the total time interval t0 between adjacent digital signals and the total information length p0 of the digital signals and calculates the overall sound fluctuation frequency index w = p0/t0;
    said turning digital signal marking module traverses the sound fluctuation frequencies of the first digital signal and its adjacent digital signals, and takes the difference between the first digital signal's sound fluctuation frequency and the overall sound fluctuation frequency index to obtain the frequency fluctuation difference; it obtains the sound fluctuation frequencies of adjacent digital signals in the sound collection stage in sequence and marks the turning digital signal, said turning digital signal being the digital signal at which the quotient of the frequency fluctuation difference of the preceding adjacent digital signal and that of the following adjacent digital signal is negative, the frequency fluctuation differences of all digital signals after the turning digital signal having the same sign as the frequency fluctuation difference corresponding to said turning digital signal;
    said audio information division module is used to divide the audio information before the turning digital signal into the first audio information and the audio information after the turning digital signal into the second audio information.
  8. The audio and video control method with a wireless microphone intelligent tracking function according to claim 6, characterized in that: said spatial information analysis module includes an audio information analysis module, a video information analysis module and a person matching module; said audio information analysis module includes an audio attribute classification module, an average decibel difference ratio calculation module and an audio attribute deviation index calculation module; said video information analysis module includes a global person image classification module, a person-image proportion difference calculation module and a global person image deviation index calculation module; said person matching module includes a deviation index similarity calculation module and a person audio attribute correspondence module;
    said audio attribute classification module classifies the different audio attributes; said average decibel difference ratio calculation module is used to record the decibel features of each audio occurrence, record the different first audio attributes and their corresponding decibel features as a set A, and calculate the average decibel difference ratio over time of the decibel features corresponding to the jth first audio attribute in set A; said audio attribute deviation index calculation module is used to calculate the overall deviation index of the different first audio attributes in set A;
    said global person image classification module is used to classify the person images in the global person image information acquisition module and record the person-image proportions corresponding to the different person images as a set B; said person-image proportion difference calculation module is used to calculate the average person-image proportion differences over time of the different classes of global person images in set B, and said global person image deviation index calculation module is used to calculate the overall deviation index of the different classes of global person images;
    said deviation index similarity calculation module is used to compare the numerical similarity between said global person image deviation index calculation module and said audio attribute deviation index calculation module; when the similarity is greater than the similarity threshold, the change in a person's decibel level is related to the movement of that person's position in the global image; said person audio attribute correspondence module matches the average decibel difference ratios in set A and the average person-image proportion differences in set B one to one at a similarity greater than 99%, obtaining the audio attribute corresponding to each person in the global person image information.
  9. The audio and video control method with a wireless microphone intelligent tracking function according to claim 6, characterized in that: said position data acquisition module includes a person-image proportion sorting module, a starting coordinate setting module and a sector adaptation module;
    said person-image proportion sorting module is used to take the person who speaks first in the second audio information as the starting person, obtain the person proportions of all person images in the local person image information, and sort said person proportions from large to small; said starting coordinate setting module sets the position coordinates corresponding to the image with the smallest person proportion as the starting coordinates;
    said sector adaptation module is used, when any monitored person makes a sound, to perform sector adaptation on the starting coordinates based on the relation between that speaker's person proportion in the local person image and the starting person's person proportion, obtaining the position coordinates of the second person.
PCT/CN2023/099068 2022-10-27 2023-06-08 Audio and video control method with wireless microphone intelligent tracking function WO2024087641A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211324931.8A 2022-10-27 2022-10-27 Audio and video control method with wireless microphone intelligent tracking function
CN202211324931.8 2022-10-27

Publications (1)

Publication Number Publication Date
WO2024087641A1 true WO2024087641A1 (zh) 2024-05-02

Family

ID=85099306

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/099068 WO2024087641A1 (zh) 2022-10-27 2023-06-08 Audio and video control method with wireless microphone intelligent tracking function

Country Status (2)

Country Link
CN (1) CN115695708A (zh)
WO (1) WO2024087641A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115695708A (zh) 2022-10-27 2023-02-03 深圳奥尼电子股份有限公司 Audio and video control method with wireless microphone intelligent tracking function

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003189273A * 2001-12-20 2003-07-04 Sharp Corp Speaker identification device and video conference system provided with the speaker identification device
CN111343411A * 2020-03-20 2020-06-26 Intelligent remote video conference system
CN112073613A * 2020-09-10 2020-12-11 Method for photographing conference portraits, interactive tablet, computer device, and storage medium
CN112511757A * 2021-02-05 2021-03-16 Video conference implementation method and system based on a mobile robot
CN112633219A * 2020-12-30 2021-04-09 Conference speaker tracking method and apparatus, computer device, and storage medium
CN114298170A * 2021-12-08 2022-04-08 Multimodal conference data structuring method and apparatus, and computer device
CN115695708A * 2022-10-27 2023-02-03 深圳奥尼电子股份有限公司 Audio and video control method with wireless microphone intelligent tracking function


Also Published As

Publication number Publication date
CN115695708A (zh) 2023-02-03

Similar Documents

Publication Publication Date Title
US8204759B2 (en) Social analysis in multi-participant meetings
JP4669041B2 (ja) Wearable terminal
US6687671B2 (en) Method and apparatus for automatic collection and summarization of meeting information
US6885989B2 (en) Method and system for collaborative speech recognition for small-area network
CN102843543B (zh) Video conference reminder method and apparatus, and video conference system
WO2024087641A1 (zh) Audio and video control method with wireless microphone intelligent tracking function
US8655654B2 (en) Generating representations of group interactions
WO2020073633A1 (zh) Conference loudspeaker and conference recording method, device, system, and computer storage medium
WO2022062471A1 (zh) Audio data processing method, device, and system
TW201333932A (zh) Recording system and method, sound input device, and voice recording device and method
US20200258525A1 (en) Systems and methods for an intelligent virtual assistant for meetings
EP3005690B1 (en) Method and system for associating an external device to a video conference session
JP2023033634A (ja) Server device, conference support method, and program
CN110232553A (zh) Conference support system and computer-readable recording medium
JP7464107B2 (ja) Server device, conference support system, conference support method, and program
WO2022160749A1 (zh) Role separation method for a speech processing apparatus and speech processing apparatus thereof
KR101981049B1 (ko) System and method for generating meeting minutes via multi-connection
CN111028837B (zh) Voice conversation method, voice recognition system, and computer storage medium
CN114764690A (zh) Intelligent meeting-minutes method, apparatus, and system
CN114531425A (zh) Processing method and processing apparatus
TWM608957U (zh) Intelligent conference room system with automatic transcription of speeches
TWI764020B (zh) Video conference system and method thereof
CN113517002A (zh) Information processing method, apparatus and system, conference terminal, and server
JPS62209985A (ja) Video conference device
CN114826804B (zh) Machine-learning-based remote conference quality monitoring method and system