WO2021027424A1 - Control method for image collection and collection terminal - Google Patents

Control method for image collection and collection terminal

Info

Publication number
WO2021027424A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
speaker
camera
image
collected
Prior art date
Application number
PCT/CN2020/099455
Other languages
English (en)
French (fr)
Inventor
王光强
林宏伟
薛新丽
王之奎
贾其燕
Original Assignee
聚好看科技股份有限公司
Application filed by 聚好看科技股份有限公司
Publication of WO2021027424A1

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18: Position-fixing by co-ordinating two or more direction or position line determinations, or two or more distance determinations, using ultrasonic, sonic, or infrasonic waves
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/67: Focus control based on electronic image sensor signals
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/695: Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems

Definitions

  • This application relates to the field of multimedia technology, and in particular to a method for controlling image collection and a collection terminal.
  • the display device displays images in real time to show the status of multiple parties in the conference.
  • the image displayed by the display device is the image collected by the camera.
  • In related art, the image collected by the camera is restricted by the camera's deployment position, and the camera is not adjustable. Participants in the camera's blind area therefore do not appear in the collected image. Furthermore, if the speaker is located in the camera's blind area, the image there cannot be collected, so the picture displayed by the display device does not include the speaker's portrait and the other participants cannot see the speaker.
  • In one aspect, this application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position so that, after the adjustment, the speaker corresponding to the audio is located in the center of the camera's shooting picture, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image collection through the adjusted camera to obtain an image of the speaker corresponding to the audio.
  • In another aspect, the present application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the focal length of the camera in the collection terminal according to the located position so that the speaker corresponding to the audio is at the focal position of the camera; and performing image collection through the adjusted camera to obtain an image of the speaker corresponding to the audio.
  • In another aspect, the present application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position so that the speaker corresponding to the audio is located in the image collected by the camera and at the focal position of the camera, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image collection through the adjusted camera to obtain an image of the speaker corresponding to the audio.
  • In another aspect, this application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position so that the speaker corresponding to the audio is located in the image collected by the camera and at the focal position of the camera, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; performing image collection through the adjusted camera; performing speaker recognition in the image collected by the camera to locate the portrait of the speaker in the image; cropping the image according to the located portrait to obtain the image of the speaker corresponding to the audio; and outputting the image of the speaker corresponding to the audio on the display.
  • In another aspect, the present application provides a control device for image collection, applied to a collection terminal. The device includes: a voiceprint recognition module for performing voiceprint recognition on the collected audio and determining through the voiceprint recognition whether the speaker has changed; a positioning module for locating, if the voiceprint recognition module determines that the speaker has changed, the position in space of the speaker corresponding to the audio according to the collected audio; a control module for adjusting the camera in the collection terminal according to the located position, so that after the adjustment the speaker corresponding to the audio is located in the center of the camera's shooting picture, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and an image acquisition module for performing image collection through the adjusted camera to obtain an image of the speaker corresponding to the audio.
  • In another aspect, this application provides a collection terminal, including a processor and a memory storing computer-readable instructions that, when executed by the processor, implement the above method.
  • Fig. 1 is a block diagram showing a terminal according to an exemplary embodiment
  • Fig. 2 is a flow chart showing a method for controlling image acquisition according to an exemplary embodiment
  • FIG. 3 is a flowchart of step 310 in some embodiments in the embodiment corresponding to FIG. 2;
  • FIG. 4 is a flowchart of step 330 in some embodiments in the embodiment corresponding to FIG. 2;
  • FIG. 5 is a flowchart of step 350 in some embodiments in the embodiment corresponding to FIG. 2;
  • FIG. 6 is a flowchart of step 370 in some embodiments in the embodiment corresponding to FIG. 2;
  • FIG. 7 is a flowchart of step 371 in some embodiments in the embodiment corresponding to FIG. 6;
  • Fig. 8 is a flowchart of a method for controlling image capture according to some embodiments.
  • Fig. 9 is a block diagram showing an image acquisition control device according to an exemplary embodiment.
  • Fig. 1 is a block diagram showing a terminal 200 according to an exemplary embodiment.
  • the terminal 200 can be used as a fixed terminal for image collection according to the method of the present application.
  • the terminal 200 is, for example, a television, a desktop computer, etc. that integrate a camera and a sound collection module.
  • the terminal 200 may include one or more of the following components: a processing component 202, a memory 204, a power supply component 206, a multimedia component 208, a sound collection component 210, a camera 214, and a communication component 216.
  • the processing component 202 generally controls the overall operations of the terminal 200, such as operations associated with display, image capture, data communication, camera rotation, and recording operations.
  • the processing component 202 may include one or more processors 218 to execute instructions to complete all or part of the steps of the following method.
  • the processing component 202 may include one or more modules to facilitate the interaction between the processing component 202 and other components.
  • the processing component 202 may include a multimedia module to facilitate the interaction between the multimedia component 208 and the processing component 202.
  • the memory 204 is configured to store various types of data to support operations in the terminal 200. Examples of these data include instructions for any application or method operating on the terminal 200.
  • The memory 204 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
  • the memory 204 also stores one or more modules, and the one or more modules are configured to be executed by the one or more processors 218 to complete all or part of the steps in any of the following method embodiments.
  • the power supply component 206 provides power to various components of the terminal 200.
  • the power supply component 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 200.
  • the multimedia component 208 includes a screen that provides an output interface between the terminal 200 and the user.
  • the screen may include a liquid crystal display (Liquid Crystal Display, LCD for short) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the screen may also include an organic electroluminescence display (Organic Light Emitting Display, OLED for short). Among them, the image collected by the camera can be displayed on the screen.
  • the sound collection component 210 is configured to perform audio collection, where the sound collection component 210 may include several sound collection modules, such as a microphone (Microphone, MIC for short), through which the sound collection component 210 performs audio collection.
  • the camera 214 is used for image collection to obtain an image.
  • the terminal 200 includes at least one camera capable of controlled rotation. Therefore, after determining the change of the speaker, the camera can be rotated according to the position of the speaker to collect the image of the speaker.
  • the communication component 216 is configured to facilitate wired or wireless communication between the terminal 200 and other devices.
  • The terminal 200 can access a wireless network based on a communication standard, such as Wi-Fi (Wireless Fidelity).
  • the communication component 216 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 216 further includes a near field communication (Near Field Communication, NFC for short) module to facilitate short-range communication.
  • The NFC module can be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth technology, and other technologies.
  • The terminal 200 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field-programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components to perform the following methods.
  • Fig. 2 is a flow chart showing a method for controlling image acquisition according to an exemplary embodiment. This image collection control method is applied to a collection terminal, such as the terminal 200 shown in FIG. 1. As shown in Figure 2, the method may include the following steps:
  • Step 310 Perform voiceprint recognition on the collected audio, and determine whether the speaker changes through the voiceprint recognition.
  • the collection terminal includes a sound collection module, which performs audio collection through the sound collection module, such as a microphone.
  • the sound collection module can be integrated inside the collection terminal, or deployed outside the collection terminal, for example, connected to the collection terminal through an external interface.
  • the sound collection module of the collection terminal continuously collects signals. It is understandable that because people do not speak continuously, the signals collected by the sound collection module include audio signals and non-audio signals.
  • the audio referred to in this application comes from the audio signal collected by the audio collection module, for example, a segment of the audio signal, or the entire segment of audio signal between two adjacent non-audio signals.
  • endpoint detection is used to determine the audio signal and the non-audio signal in the signal collected by the sound collection module.
  • In some embodiments, the collected signal is segmented, and image collection is controlled according to the disclosed method for the audio obtained by the segmentation.
  • The segmentation is performed, for example, on the basis of the audio signals and silent signals determined by endpoint detection: the audio signal between two adjacent silent signals is taken as one segment of audio.
  • the collected signal may also be segmented according to the set collection period, so that the audio signal segment obtained by the segmentation is regarded as a piece of audio.
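The segmentation described above can be sketched with a short-term-energy endpoint detector, one common realization of endpoint detection. This is an illustrative sketch only; the frame length and energy threshold are assumed values, not parameters from the application.

```python
import numpy as np

def segment_audio(signal, frame_len=400, energy_thresh=1e-3):
    """Split a 1-D signal into voiced segments using short-term energy.

    A frame whose mean energy falls below `energy_thresh` is treated as
    silence; each run of voiced frames between silences is one segment
    of audio, returned as (start, end) sample indices.
    """
    n_frames = len(signal) // frame_len
    voiced = [
        np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2) > energy_thresh
        for i in range(n_frames)
    ]
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_len           # a voiced run begins
        elif not v and start is not None:
            segments.append((start, i * frame_len))  # run ended at silence
            start = None
    if start is not None:                   # signal ended while voiced
        segments.append((start, n_frames * frame_len))
    return segments
```

Each returned segment would then be handed to the voiceprint-recognition step as one piece of audio.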
  • In step 310, in order to reduce the amount of calculation, voiceprint recognition is performed only on an audio signal segment that immediately follows a silent signal. In other words, if the signal segment immediately preceding the audio is still an audio signal, step 310 is not performed, and the speaker corresponding to the audio is assumed to be the speaker of the preceding adjacent audio signal segment.
  • Because each person's vocal organs, such as the vocal cords, mouth, and nasal cavity, differ in shape and in the capacity and frequency with which they produce sound, the sound produced by each person carries its own characteristics: a personal, unique voiceprint.
  • A person's voiceprint is characterized by voiceprint features.
  • the voiceprint feature is obtained by feature extraction based on the collected audio.
  • Voiceprint features include, for example, Mel Frequency Cepstral Coefficients (MFCC), short-term energy, short-term average amplitude, short-term average zero-crossing rate, formants, and Linear Prediction Cepstral Coefficients (LPCC).
  • the voiceprint features extracted from the audio for voiceprint recognition may be one or more types, which are not specifically limited here.
  • The voiceprint recognition performed is to identify whether the voiceprint features of the currently collected audio are consistent with those of the last collected audio. If they are inconsistent, the speaker corresponding to the currently collected audio differs from the speaker corresponding to the last collected audio, that is, the speaker has changed; conversely, if they are consistent, the two audios correspond to the same speaker, that is, the speaker has not changed.
  • Step 330 If the speaker changes, locate the position in the space of the speaker corresponding to the audio according to the collected audio.
  • the positioning performed is to determine the position of the speaker corresponding to the audio in space by using the sound source positioning technology according to the time when the audio is collected.
  • Strictly speaking, the position of the speaker in space is a spatial area. For convenience, a certain point within the spatial area occupied by the speaker, for example a point in the area occupied by the head, is used to indicate the speaker's position in space.
  • the sound source localization technology uses the time delay of the audio collected by multiple sound collection modules to determine the position of the speaker corresponding to the audio.
  • the collection terminal includes at least two sound collection modules.
  • The time at which each sound collection module collects the audio is stored in the collection terminal, so that the delay between any two sound collection modules collecting the audio can be calculated from these times, and the position of the speaker can then be located.
  • Step 350 Adjust the camera in the collection terminal according to the located position. After the adjustment, the speaker corresponding to the audio is located in the center of the shooting screen of the camera.
  • the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera.
  • the position and distance of the speaker corresponding to the audio relative to the camera can be determined according to the located position.
  • the camera is adjusted for the purpose of collecting clear and easily recognizable images of the speaker.
  • The adjustment can be to adjust the shooting angle of the camera so that the camera is aimed at the speaker corresponding to the audio; it can also be to adjust the focal length of the camera to ensure the proportion of the speaker's portrait in the collected image, so that the viewer can accurately identify the speaker through the image; the shooting angle and the focal length can also both be adjusted. Which adjustments are needed is determined according to the actual situation, that is, according to the determined distance and orientation.
  • If, according to the orientation of the speaker corresponding to the audio relative to the camera, it is determined that the speaker is not in the picture under the camera's current shooting angle, or deviates substantially from it, the camera is rotated according to the determined position so as to adjust the shooting angle, ensuring that the camera is aimed at the speaker after the adjustment. Conversely, if the determined orientation shows that the speaker is already in the center of the picture under the current shooting angle, the shooting angle is not adjusted.
  • The image at the focal position is relatively clear, while the image away from the focal position may be blurred. Therefore, in order to obtain a clear image of the speaker, the focal length of the camera in the collection terminal is adjusted according to the located position, so that the speaker's position is at or near the focal position.
  • Step 370 Perform image collection through the adjusted camera to obtain an image of the speaker corresponding to the audio.
  • the speaker corresponding to the audio is located in the center of the camera shooting screen, so that the image of the speaker corresponding to the audio can be correspondingly collected.
  • the image of the speaker may be a full-body image, an upper body image, etc., of the speaker, which is not specifically limited here.
  • the captured image of the speaker is an image whose main body is the speaker corresponding to the audio.
  • the image of the speaker collected in this application is used for display in the collection terminal, so that the image of the speaker is displayed while the speaker is speaking.
  • the collection terminal can be displayed through its own display screen or through an external display device, which is not specifically limited here.
  • In the method of this application, when it is determined from the audio that the speaker has changed, the speaker is located according to the audio, and the camera is adjusted according to the located position so as to collect the image of the speaker. This realizes audio-based tracking and positioning of the speaker, with the speaker's image collected according to the speaker's location. It is therefore ensured that the picture displayed on the collection terminal is the collected image of the speaker, which effectively solves the problem in the related art that the speaker's portrait is absent from the displayed picture.
  • the image of the speaker before displaying, is enlarged according to the scale of the display screen of the collection terminal, so as to ensure that the obtained image of the speaker fits the display screen and the display effect is ensured.
  • the display is controlled to display images captured by the camera.
  • the display is controlled to display the cropped image of the speaker.
  • Following step 310, if it is determined that the speaker has not changed, the shooting angle of the camera is kept unchanged, so that images of the speaker can be continuously collected and displayed.
  • Following step 310, if it is determined that the speaker has not changed, the image displayed on the collection terminal is not replaced. In other words, if the last collected audio and the currently collected audio come from the same speaker, the displayed image is kept unchanged.
  • Following step 310, if it is determined that the speaker has not changed, it is further determined from the audio whether the position of the speaker corresponding to the audio has changed. If the position has changed, the camera is adjusted according to the speaker's position, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera according to the distance between the speaker and the camera. This ensures that the speaker remains in the center of the camera's shooting picture, so that a clear image of the speaker is collected and an observer can easily recognize the speaker from the collected image.
  • The method of this application can be applied to a multi-party video conference: according to the audio collected in the conference, the image of the speaker is collected by the method of this application and displayed on the screen, and the speaker's image is simultaneously displayed on the display screens of the other conference parties, so that participants in the multi-party video conference can determine who is speaking from the displayed image.
  • step 310 includes:
  • Step 311 Extract voiceprint features from the audio.
  • the extracted voiceprint feature can be one or more of Mel frequency cepstrum coefficient, short-term energy, short-term average amplitude, short-term average zero-crossing rate, formant, and linear prediction cepstrum coefficient .
  • the extracted voiceprint features can ensure the accuracy of voiceprint recognition, and the extracted voiceprint features are not specifically limited here.
  • Step 313 Calculate the voiceprint similarity of the extracted voiceprint feature with respect to the voiceprint feature corresponding to the last collected audio.
  • the voiceprint similarity is used to characterize the similarity of the voiceprint feature of the currently collected audio with respect to the corresponding voiceprint feature of the last collected audio.
  • In some embodiments, a voiceprint vector is constructed from the voiceprint features extracted for the collected audio, and the voiceprint similarity is calculated between the voiceprint vector of the current audio and that of the last collected audio, for example using the Euclidean distance, cosine distance, or Mahalanobis distance of the two voiceprint vectors as the voiceprint similarity.
  • Step 315 Determine whether the speaker changes according to the voiceprint similarity.
  • If the calculated voiceprint similarity indicates that the two voiceprint features are similar, it is determined that the speaker has not changed; conversely, if it indicates that the two voiceprint features are not similar, it is determined that the speaker has changed.
  • In order to determine from the voiceprint similarity whether the speaker has changed, a similarity range can be preset. If the voiceprint similarity falls within the range, the two voiceprint features are similar and the speaker has not changed; otherwise, a speaker change is determined.
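A minimal sketch of the similarity check, assuming cosine similarity over a single voiceprint vector per audio segment (for example a mean MFCC vector); the function name and the similarity range are illustrative assumptions, not details from the application.

```python
import numpy as np

def speaker_changed(current_vec, previous_vec, sim_range=(0.8, 1.0)):
    """Decide whether the speaker changed by comparing two voiceprint
    vectors with cosine similarity.

    `sim_range` is the preset similarity range: a similarity inside it
    means the two voiceprints belong to the same speaker. The bounds
    are illustrative and would be tuned in practice.
    """
    cos_sim = np.dot(current_vec, previous_vec) / (
        np.linalg.norm(current_vec) * np.linalg.norm(previous_vec)
    )
    # The speaker is considered changed when similarity leaves the range.
    return not (sim_range[0] <= cos_sim <= sim_range[1])
```

Euclidean or Mahalanobis distance could be substituted in the same structure with a correspondingly inverted range test.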
  • the collection terminal includes one reference sound collection module and at least three non-reference sound collection modules.
  • step 330 includes:
  • Step 331 According to the time of audio collected by the reference sound collection module and the non-reference sound collection module, respectively, the time delay of each non-reference sound collection module with respect to the reference sound collection module is calculated.
  • Each sound collection module stores the time at which it collects the audio. The time delay of each non-reference sound collection module relative to the reference sound collection module is therefore calculated from the times at which the reference module and each non-reference module collected the audio.
  • Step 333 Calculate according to the arrangement position and time delay of the reference sound collection module and the non-reference sound collection module to obtain the position coordinates of the speaker corresponding to the audio.
  • The position of the reference sound collection module is used as the origin to construct a coordinate system, so that, from the arrangement positions of the reference and non-reference sound collection modules, the coordinates of each non-reference sound collection module in the constructed coordinate system can be obtained.
  • From the time delays, the difference between the distance from the speaker corresponding to the audio to each non-reference sound collection module and the distance from the speaker to the reference sound collection module can be calculated.
  • From these quantities a matrix equation AX = B is constructed, where matrix A is an n×4 matrix, n being the number of non-reference sound collection modules, and the i-th row of A is [x_i, y_i, z_i, d_i];
  • x_i, y_i, and z_i are the x, y, and z coordinates of the i-th non-reference sound collection module;
  • d_i is the difference, for the audio, between the speaker's distance to the i-th non-reference sound collection module and the speaker's distance to the reference sound collection module;
  • X = [x, y, z, R]^T, where (x, y, z) is the position of the speaker corresponding to the audio and R is the speaker's distance to the reference sound collection module at the origin;
  • B is an n×1 vector whose i-th element is (x_i^2 + y_i^2 + z_i^2 - d_i^2)/2;
  • Solving the above matrix equation yields the position coordinates (x, y, z) of the speaker corresponding to the audio.
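A least-squares solution of this kind of linearized TDOA system can be sketched as follows; function and variable names are illustrative, and the speed of sound converts the measured delays into the distance differences d_i.

```python
import numpy as np

def locate_speaker(mic_positions, delays, c=343.0):
    """Least-squares TDOA localization of the speaker.

    mic_positions: (n, 3) coordinates of the non-reference sound
    collection modules, in a frame whose origin is the reference module.
    delays: (n,) arrival-time delays of each non-reference module
    relative to the reference module, in seconds.
    c: speed of sound in m/s.
    Returns the estimated speaker position (x, y, z).
    """
    mic_positions = np.asarray(mic_positions, dtype=float)
    d = c * np.asarray(delays, dtype=float)            # distance differences d_i
    A = np.column_stack([mic_positions, d])            # rows [x_i, y_i, z_i, d_i]
    b = (np.sum(mic_positions**2, axis=1) - d**2) / 2  # b_i = (|m_i|^2 - d_i^2)/2
    X, *_ = np.linalg.lstsq(A, b, rcond=None)          # X = [x, y, z, R]
    return X[:3]
```

With at least four non-reference modules (four equations for the four unknowns x, y, z, R) the system is generally solvable; more modules over-determine it and the least-squares fit averages out timing noise.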
  • step 350 includes:
  • Step 351 Determine the distance and orientation of the speaker corresponding to the audio relative to the camera according to the located position.
  • Step 353 Adjust the focal length of the camera according to the determined distance, and adjust the shooting angle of the camera according to the determined orientation.
  • the adjustment of the shooting angle is to control the rotation of the camera according to the determined orientation, so that the rotated camera is aligned with the speaker corresponding to the audio.
  • In some embodiments, the focal length mapped to the determined distance is obtained from a configuration file, and the focal length of the camera is adjusted to the obtained value.
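A hypothetical sketch of steps 351 and 353, with the camera placed at the coordinate origin. The distance-to-focal-length table here stands in for the configuration file mentioned above; all names and values are illustrative assumptions.

```python
import math

# Hypothetical stand-in for the configuration file: (max distance in
# meters, focal length in mm) buckets, checked in order.
FOCAL_TABLE = [(2.0, 24), (5.0, 50), (10.0, 85)]

def camera_adjustment(x, y, z):
    """Derive distance, pan/tilt shooting angles, and a focal length
    from the located speaker position (x, y, z)."""
    distance = math.sqrt(x * x + y * y + z * z)
    pan = math.degrees(math.atan2(x, z))    # horizontal shooting angle
    tilt = math.degrees(math.atan2(y, z))   # vertical shooting angle
    focal = FOCAL_TABLE[-1][1]              # default to the longest focal
    for max_dist, f in FOCAL_TABLE:
        if distance <= max_dist:
            focal = f
            break
    return pan, tilt, focal
```

The pan and tilt values would drive the camera's rotation (step 353's shooting-angle adjustment), while the looked-up focal length drives the zoom.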
  • step 370 includes:
  • Step 371 Perform speaker identification according to the image collected by the adjusted camera, and locate the portrait of the speaker in the image.
  • the collected image may include multiple people.
  • The lips move while a person speaks. The speaker can therefore be recognized through the lip movements of each person in the collected images: for example, the lip pixels of a person are extracted from continuously collected images, and whether the person's lips are moving is judged by comparing the lip pixels extracted from consecutive images. If the lips are moving, the person to whom the lip pixels belong is determined to be the speaker's portrait; conversely, if the lips are not moving, the portrait to which the lip pixels belong is determined not to be the speaker's portrait.
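A simplified sketch of the lip-movement test, assuming lip-region patches have already been extracted for one person from consecutive frames (how the lip region is located, for example with a face-landmark detector, is out of scope; the motion threshold is an illustrative value).

```python
import numpy as np

def lips_moving(lip_patches, motion_thresh=10.0):
    """Judge whether a person's lips are moving across consecutive frames.

    lip_patches: equally sized grayscale arrays holding the lip pixels
    of the same person extracted from consecutive images.
    """
    if len(lip_patches) < 2:
        return False
    # Mean absolute pixel change between each pair of consecutive patches.
    diffs = [
        np.mean(np.abs(a.astype(float) - b.astype(float)))
        for a, b in zip(lip_patches, lip_patches[1:])
    ]
    return bool(max(diffs) > motion_thresh)
```

Running this per detected person and picking the one whose lips move would identify the speaker's portrait in the multi-person image.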
  • an action agreement may be made in advance, for example, an appointment may be made for the speaker to raise his hand when speaking, or an appointment for the speaker to stand and speak, so that the agreed action is recognized in the collected image , Such as raising hands, standing, and determining the portrait of the person in the image as the speaker's portrait.
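The lip-movement comparison described above can be illustrated with a toy sketch. It assumes the lip bounding box comes from a separate face-landmark step (outside the scope of this sketch), and the pixel-difference threshold is an arbitrary choice, not a value from the patent.

```python
import numpy as np

def lips_moving(frames, lip_box, threshold=10.0):
    """Decide whether a person's lips move across consecutive frames.

    frames    : list of grayscale images (2-D numpy arrays).
    lip_box   : (top, bottom, left, right) bounding box of the lip
                region, assumed to come from a landmark detector.
    threshold : mean absolute pixel difference above which the lips
                are considered to be moving (illustrative value).
    """
    t, b, l, r = lip_box
    crops = [f[t:b, l:r].astype(float) for f in frames]
    # Compare the lip region between each pair of consecutive frames.
    diffs = [np.mean(np.abs(c2 - c1)) for c1, c2 in zip(crops, crops[1:])]
    return max(diffs) > threshold
```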
  • Step 373 Clip the image according to the positioned portrait to obtain an image of the speaker.
  • An image with the speaker as the main subject, that is, the image of the speaker, is obtained by cropping it out of the image that includes multiple portraits.
  • The obtained speaker image includes at least the face of the speaker.
  • By locating and cropping the speaker's portrait, the obtained image is guaranteed to be centered on the speaker, which speeds up identifying the speaker from the displayed image.
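A minimal cropping sketch follows, assuming the located portrait is given as a bounding box from the speaker-identification step; the margin parameter is an addition of this sketch, widening the crop so the face is not cut at the edge.

```python
import numpy as np

def crop_speaker(image, bbox, margin=0.2):
    """Crop a speaker-centred image out of a multi-person frame.

    image  : 2-D (or H x W x C) numpy array holding the frame.
    bbox   : (top, bottom, left, right) of the located portrait.
    margin : fraction of the box size added on each side, clamped to
             the image borders (an illustrative choice).
    """
    t, b, l, r = bbox
    mh, mw = int((b - t) * margin), int((r - l) * margin)
    h, w = image.shape[:2]
    return image[max(0, t - mh):min(h, b + mh),
                 max(0, l - mw):min(w, r + mw)]
```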
  • step 371 includes:
  • Step 410 According to the image collected by the adjusted camera, pixel points of the designated organ are extracted for each person in the collected image.
  • As described above, speaker recognition can be based on the lip movements of each person in the image or on an agreed action; in either case the action is performed by a body organ, such as the lips or a hand.
  • The organ that performs the action used for speaker recognition is the designated organ. For example, if the speaker is recognized by lip movement, the lips are the designated organ; if a gesture is used, the hand is the designated organ.
  • To perform speaker recognition in the collected images, the designated organ is first located in the image, and its pixels are extracted accordingly.
  • Step 430 Perform action recognition according to the extracted pixels, and determine the action represented by the extracted pixels.
  • the shape of the designated organ can be reconstructed through the extracted pixels, so as to determine the action represented by the pixel according to the reconstructed shape.
  • Step 450 Determine the portrait whose pixels represent an action matching the predetermined action as the portrait of the speaker.
  • The predetermined action is, for example, an action agreed for speaker recognition, such as raising a hand, standing, or moving the lips, and is not specifically limited here.
  • If the action represented by the pixels matches the predetermined action, the portrait containing those pixels is determined to be the speaker's portrait.
  • In some embodiments, the method further includes: detecting whether audio has still not been collected after a set period of time. If so, the camera is controlled to rotate to a preset shooting angle, and the image collected at that shooting angle is displayed on the collection terminal. If not, that is, if audio is collected within the set period, the method returns to the step of performing voiceprint recognition on the collected audio (step 310).
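The silence-handling branch can be sketched as a small decision function; the string labels and the 30 s default are illustrative choices, not values fixed by the patent.

```python
def next_action(audio_collected, silence_seconds, timeout_s=30.0):
    """Decide what the collection terminal does next.

    With fresh audio, proceed to voiceprint recognition; after a long
    enough silence, rotate the camera to its preset shooting angle;
    otherwise keep timing and wait.
    """
    if audio_collected:
        return "voiceprint_recognition"
    if silence_seconds >= timeout_s:
        return "rotate_to_preset"
    return "keep_timing"
```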
  • Fig. 8 is a flowchart of an image capture control method according to some embodiments.
  • the collection terminal is a television including a camera and a sound collection module. As shown in Fig. 8, it includes the following steps:
  • Step 510 Speaker recognition: the speaker's portrait is recognized from the image collected by the camera; the recognition can be based on lip movement or on an agreed action.
  • Step 520 Speaker image cropping: after the speaker's portrait is identified in the image, the collected image is cropped to obtain the image of the speaker, which is then displayed on the TV.
  • Step 530 whether to continue to collect audio: real-time detection of the audio collection state (for example, detection every second), if the audio continues to be collected, go to step 540; if no audio is collected, go to step 560.
  • Step 540 Whether the speaker has changed: perform voiceprint recognition on the collected audio to determine whether the speaker has changed. If the speaker has changed, go to step 550; if not, no processing is performed, that is, the image currently displayed on the TV continues to be displayed.
  • Step 550 Adjust the camera according to the position of the speaker: Determine the position of the speaker according to the time of the collected audio, and accordingly adjust the camera according to the position of the speaker.
  • the adjustment performed is, for example, adjusting the shooting angle of the camera according to the angle of the speaker relative to the camera, or adjusting the focal length of the camera according to the distance of the speaker relative to the camera, or adjusting both the shooting angle and the focal length. Then, perform image collection through the adjusted camera, and go to step 510.
  • Step 560 Whether the set time is exceeded: start timing when it is detected that audio is no longer being collected. If audio is still not collected after the set time (for example, 30 s), go to step 570; if the set time has not yet elapsed, continue timing.
  • Step 570 Control the camera to rotate to a preset shooting angle: perform image collection at the preset shooting angle, and display the collected image on the TV. While displaying the image, perform speaker recognition based on the collected image, that is, go to step 510.
  • Fig. 9 is a block diagram showing an image acquisition control device according to an exemplary embodiment.
  • the device can be used in the terminal 200 shown in Fig. 1 to perform all or part of the steps in any method embodiment.
  • the device includes, but is not limited to: a voiceprint recognition module 610, a positioning module 630, an adjustment module 650, and an image acquisition module 670, wherein:
  • the voiceprint recognition module 610 is configured to perform voiceprint recognition on the collected audio, and determine whether the speaker changes through the voiceprint recognition.
  • the positioning module 630 is configured to locate the position in the space of the speaker corresponding to the audio according to the collected audio if the voiceprint recognition module determines that the speaker changes.
  • the adjustment module 650 is used to adjust the camera in the collection terminal according to the positioned position. After adjustment, the speaker corresponding to the audio is located in the center of the shooting screen of the camera. The adjustment includes adjusting the shooting angle of the camera and/or adjusting the camera focal length.
  • the image acquisition module 670 is used for image acquisition through the adjusted camera to obtain an image of the speaker corresponding to the audio.
  • modules can be implemented by hardware, software, or a combination of both.
  • these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits.
  • these modules may be implemented as one or more computer programs executed on one or more processors, for example, the programs stored in the memory 204 executed by the processor 218 in FIG. 1.
  • the voiceprint recognition module 610 includes a feature extraction unit for extracting voiceprint features from audio.
  • the calculation unit is used to calculate the voiceprint similarity of the extracted voiceprint feature with respect to the voiceprint feature corresponding to the last collected audio.
  • the determining unit is used to determine whether the speaker changes according to the similarity of the voiceprint.
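The three units above can be sketched together as a single hedged function: given two voiceprint feature vectors (for example per-utterance MFCC averages, extracted elsewhere), compute their cosine similarity and report a speaker change when the similarity falls below a floor. The 0.8 floor is an assumed value, not one specified by the patent.

```python
import math

def speaker_changed(prev_feature, cur_feature, similarity_floor=0.8):
    """Compare the current voiceprint feature vector against the one
    from the previously collected audio via cosine similarity."""
    dot = sum(a * b for a, b in zip(prev_feature, cur_feature))
    norm = (math.sqrt(sum(a * a for a in prev_feature)) *
            math.sqrt(sum(b * b for b in cur_feature)))
    similarity = dot / norm if norm else 0.0
    # Below the floor, the voiceprints are considered dissimilar,
    # i.e. the speaker has changed.
    return similarity < similarity_floor
```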
  • the collection terminal includes one reference sound collection module and at least three non-reference sound collection modules
  • the positioning module 630 includes:
  • the time delay calculation unit is configured to calculate the audio time delay of each non-reference sound collection module relative to the reference sound collection module according to the time when the audio is collected by the reference sound collection module and the non-reference sound collection module.
  • the coordinate calculation unit is used to calculate according to the arrangement position and time delay of the reference sound collection module and the non-reference sound collection module to obtain the position coordinates of the speaker corresponding to the audio.
  • the adjustment module 650 includes an angle and orientation determining unit for determining the distance and orientation of the speaker corresponding to the audio relative to the camera according to the located position.
  • the adjustment unit is used to adjust the focal length of the camera according to the determined distance, and adjust the shooting angle of the camera according to the determined orientation.
  • the image acquisition module 670 includes: a portrait positioning unit, configured to perform speaker identification according to the image collected by the adjusted camera, and locate the portrait of the speaker in the image.
  • the cropping unit is used to crop the image according to the positioned portrait to obtain the image of the speaker.
  • the portrait positioning unit includes: a pixel extraction unit for extracting pixels for a designated organ for each portrait in the captured image according to the image captured by the adjusted camera.
  • the action recognition unit is used to perform action recognition according to the extracted pixels and determine the action represented by the extracted pixels.
  • the portrait determination unit is used to determine the portrait of the pixel where the represented action matches the predetermined action as the portrait of the speaker.
  • the device further includes: a display replacement module for replacing the image displayed by the collection terminal with the image of the speaker.
  • the device further includes: a detection module for detecting whether audio has still not been collected after a set period of time.
  • the rotation adjustment module is used to control the camera to rotate to a preset shooting angle if the detection module detects that no audio has been collected after the set period of time. If the detection module detects that audio has been collected within that period, control passes to the voiceprint recognition module 610.
  • the present application also provides a collection terminal.
  • the collection terminal may be the terminal 200 shown in FIG. 1 and executes all or part of the steps in any of the above method embodiments.
  • the collection terminal includes:
  • a processor and a memory, and computer-readable instructions are stored on the memory, and the computer-readable instructions implement the method in any of the above method embodiments when executed by the processor.
  • a computer-readable non-volatile storage medium is also provided, on which computer-readable instructions are stored.
  • when the computer-readable instructions are executed by a processor, the method in any of the above method embodiments is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Studio Devices (AREA)

Abstract

This application discloses an image capture control method applied to a collection terminal, including: performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position so that, after adjustment, the speaker corresponding to the audio is located in the center of the camera's shooting frame, the adjustment including adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image capture with the adjusted camera to obtain an image of the speaker corresponding to the audio.

Description

图像采集的控制方法及采集终端
本申请要求在2019年08月13日提交中国专利局、申请号为201910746092.0、发明名称为“图像采集的控制方法、装置及采集终端”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及多媒体技术领域,特别涉及一种图像采集的控制方法及采集终端。
背景技术
相关技术中,随着互联网技术和通信技术的发展,多方视频会议在工作中的应用越来越广泛。
在多方视频会议中,显示设备实时进行图像显示,展示会议多方的状态。其中,显示设备所显示的图像为摄像头所采集的图像。
对于摄像头而言,摄像头所采集的图像受摄像头部署位置的限制且摄像头不可调节,从而,位于摄像头拍摄盲区的参会人员不会出现在摄像头所采集的图像中。进而,如果发言人位于摄像头的拍摄盲区,由于不能采集到拍摄盲区中的图像,从而显示设备所显示的画面中不包括发言人的人像,导致其他参会人员不能看到发言人的图像。
由上可知,如何进行图像采集以保证采集到发言人的图像的问题亟待解决。
发明内容
第一方面,本申请提供了一种图像采集的控制方法,应用于采集终端,所述方法包括:对采集的音频进行声纹识别,通过所述声纹识别确定发言人是否变化;若发言人变化,则根据所采集的音频定位所述音频所对应发言人在空间中的位置;根据所定位到的位置,对所述采集终端中的摄像头进行调整,调整后,所述音频所对应发言人位于所述摄像头的拍摄画面中央,所述调整包括调整所述摄像头的拍摄角度和/或调整所述摄像头的焦距;通过调整后的摄像头进行图像采集获得所述音频所对应发言人的图像。
第二方面,本申请提供了一种图像采集的控制方法,应用于采集终端,所述方法包括:对采集的音频进行声纹识别,通过所述声纹识别确定发言人是否变化;若发言人变化,则根据所采集的音频定位所述音频所对应发言人在空间中的位置;根据所定位到的位置,对所述采集终端中的摄像头的焦距进行调整,以使所述音频所对应发言人位于所述摄像头的焦距位置;通过调整后的摄像头进行图像采集获得所述音频所对应发言人的图像。
第三方面,本申请提供了一种图像采集的控制方法,应用于采集终端,所述方法包括:对采集的音频进行声纹识别,通过所述声纹识别确定发言人是否变化;若发言人变化,则根据所采集的音频定位所述音频所对应发言人在空间中的位置;根据所定位到的位置,对所述采集终端中的摄像头进行调整,以使所述音频所对应发言人位于所述摄像头采集的画面中,且位于所述摄像头的焦距位置,所述调整包括调整所述摄像头的拍摄角度和/或调整所述摄像头的焦距;通过调整后的摄像头进行图像采集获得所述音频所对应发言人的图像。
第四方面,本申请提供了一种图像采集的控制方法,应用于采集终端,所述方法包括: 对采集的音频进行声纹识别,通过所述声纹识别确定发言人是否变化;若发言人变化,则根据所采集的音频定位所述音频所对应发言人在空间中的位置;根据所定位到的位置,对所述采集终端中的摄像头进行调整,以使所述音频所对应发言人位于所述摄像头采集的画面中,且位于所述摄像头的焦距位置,所述调整包括调整所述摄像头的拍摄角度和/或调整所述摄像头的焦距;通过调整后的摄像头进行图像采集;在摄像头采集到的图像中进行发言人识别,以在所述图像中定位所述发言人的人像;根据所定位到的人像对所述图像进行剪裁,获得所述音频所对应发言人的图像;在显示器输出所述音频所对应发言人的图像。
第五方面,本申请提供了一种图像采集的控制装置,应用于采集终端,所述装置包括:声纹识别模块,用于对采集的音频进行声纹识别,通过所述声纹识别确定发言人是否变化;定位模块,用于若声纹识别模块判断发言人变化,则根据所采集的音频定位所述音频所对应发言人在空间中的位置;控制模块,用于根据所定位到的位置,对所述采集终端中的摄像头进行调整,调整后,所述音频所对应发言人位于所述摄像头的拍摄画面中央,所述调整包括调整所述摄像头的拍摄角度和/或调整所述摄像头的焦距;图像采集模块,用于通过调整后的摄像头进行图像采集获得所述音频所对应发言人的图像。
第六方面,本申请提供了一种采集终端,包括:处理器;及存储器,所述存储器上存储有计算机可读指令,所述计算机可读指令被所述处理器执行时实现如上所述的方法。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并于说明书一起用于描述本申请的实施方式。
图1是根据一示例性实施例示出的一种终端的框图;
图2是根据一示例性实施例示出的一种图像采集的控制方法的流程图;
图3是图2对应实施例中步骤310在一些实施例中的流程图;
图4是图2对应实施例中步骤330在一些实施例中的流程图;
图5是图2对应实施例中步骤350在一些实施例中的流程图;
图6是图2对应实施例中步骤370在一些实施例中的流程图;
图7是图6对应实施例中步骤371在一些实施例中的流程图;
图8是根据一些实施例示出的图像采集的控制方法的流程图;
图9是根据一示例性实施例示出的一种图像采集的控制装置的框图。
通过上述附图,已示出本申请明确的实施例,后文中将有更详细的描述,这些附图和文字描述并不是为了通过任何方式限制本申请的实施方式,而是通过参考特定实施例为本领域技术人员说明本申请的实施方式。
具体实施方式
这里将详细地对示例性实施例执行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权 利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
图1是根据一示例性实施例示出的一种终端200的框图。终端200可以作为固定终端用于按照本申请的方法进行图像采集,终端200例如集成摄像头和声音采集模块的电视机、台式电脑等。
参照图1,终端200可以包括以下一个或多个组件:处理组件202,存储器204,电源组件206,多媒体组件208,声音采集组件210,摄像头214以及通信组件216。
处理组件202通常控制终端200的整体操作,诸如与显示,图像采集,数据通信,摄像头旋转以及记录操作相关联的操作等。处理组件202可以包括一个或多个处理器218来执行指令,以完成下述的方法的全部或部分步骤。此外,处理组件202可以包括一个或多个模块,便于处理组件202和其他组件之间的交互。例如,处理组件202可以包括多媒体模块,以方便多媒体组件208和处理组件202之间的交互。
存储器204被配置为存储各种类型的数据以支持在终端200的操作。这些数据的示例包括用于在终端200上操作的任何应用程序或方法的指令。存储器204可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(Static Random Access Memory,简称SRAM),电可擦除可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,简称EEPROM),可擦除可编程只读存储器(Erasable Programmable Read Only Memory,简称EPROM),可编程只读存储器(Programmable Red-Only Memory,简称PROM),只读存储器(Read-Only Memory,简称ROM),磁存储器,快闪存储器,磁盘或光盘。存储器204中还存储有一个或多个模块,该一个或多个模块被配置成由该一个或多个处理器218执行,以完成下述任一方法实施例中的全部或者部分步骤。
电源组件206为终端200的各种组件提供电力。电源组件206可以包括电源管理系统,一个或多个电源,及其他与为终端200生成、管理和分配电力相关联的组件。
多媒体组件208包括在所述终端200和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(Liquid Crystal Display,简称LCD)和触摸面板。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。屏幕还可以包括有机电致发光显示器(Organic Light Emitting Display,简称OLED)。其中,通过摄像头所采集的图像可以通过屏幕进行显示。
声音采集组件210被配置为进行音频采集,其中声音采集组件210可以包括若干个声音采集模块,声音采集模块例如麦克风(Microphone,简称MIC),通过声音采集组件210进行音频采集。
摄像头214用于进行图像采集,从而获得图像。在本申请的方案中,终端200中至少包括一可受控旋转的摄像头。从而,在确定发言人变化后,可以根据发言人的位置控制摄像头旋转,以采集发言人的图像。
通信组件216被配置为便于终端200和其他设备之间有线或无线方式的通信。终端200 可以接入基于通信标准的无线网络,如WiFi(WIreless-Fidelity,无线保真)。在一个示例性实施例中,通信组件216经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件216还包括近场通信(Near Field Communication,简称NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(Radio Frequency Identification,简称RFID)技术,红外数据协会(Infrared Data Association,简称IrDA)技术,超宽带(Ultra Wideband,简称UWB)技术,蓝牙技术和其他技术来实现。
在示例性实施例中,终端200可以被一个或多个应用专用集成电路(Application Specific Integrated Circuit,简称ASIC)、数字信号处理器、数字信号处理设备、可编程逻辑器件、现场可编程门阵列、控制器、微控制器、微处理器或其他电子元件实现,用于执行下述方法。
图2是根据一示例性实施例示出的一种图像采集的控制方法的流程图。该图像采集的控制方法,应用于采集终端,采集终端例如图1所示的终端200。如图2所示,该方法,可以包括以下步骤:
步骤310,对采集的音频进行声纹识别,通过声纹识别确定发言人是否变化。
采集终端包括声音采集模块,通过声音采集模块进行音频采集,该声音采集模块例如麦克风。在一些实施例中,声音采集模块可以集成在采集终端内部,也可以部署与采集终端外部,例如通过外接接口与采集终端相连。
采集终端的声音采集模块持续进行信号采集,可以理解的是,由于人员并不是连续不断地讲话,从而,声音采集模块所采集的信号包括有音信号和无音信号。本申请所指的音频来自于声音采集模块所采集的有音信号,例如有音信号中的一段信号,或者两相邻的无音信号之间的整段有音信号。
在一些实施例中,通过端点检测来确定声音采集模块所采集信号中的有音信号和无音信号。
为按照本申请的方法采集发言人的图像,在步骤310之前,对所采集的信号进行分段,对分段获得的音频按照被公开的方法进行图像采集控制。所进行的分段,例如根据端点检测确定有音信号和无音信号的基础上,将两相邻无音信号之间的有音信号作为一段音频。
在另一些实施例中,还可以按照所设定的采集周期来对所采集的信号进行分段,从而,将分段所获得的有音信号段作为一段音频。
在一些实施例中,为降低运算量,仅对无音信号所相邻的下一有音信号段进行声纹识别,换言之,若音频所相邻的上一信号段仍为有音信号,则不执行步骤310,从而默认该音频所对应的发言人仍为所相邻上一有音信号段所对应发言人。
由于每个人的声音器官,例如声带、口腔、鼻腔等,在发音时呈现千姿百态,以及发音容量、发音频率的不尽相同,因而导致每个人的声音器官发出的声音必然有各自的特点,形成每个人独特的声纹。
人的声纹通过声纹特征来表征。声纹特征是根据所采集的音频进行特征提取获得。声纹特征例如梅尔频率倒谱系数(Mel Frequency Cepstral Coefficents,MFCC)、短时能量、 短时平均幅度、短时平均过零率、共振峰、线性预测倒谱系数(LPCC)。
在一些实施例中,为进行声纹识别从音频中所提取的声纹特征可以是一种或者多种,在此不进行具体限定。
所进行的声纹识别即识别当前所采集的音频的声纹特征与上一所采集音频的声纹特征是否一致,如果不一致,则表明当前所采集的音频所对应发言人与上一所采集音频所对应发言人不一致,即发言人发生变化;反之,如果一致,则表明当前所采集的音频所对应发言人与上一所采集音频所对应发言人一致,即发言人未变化。
步骤330,若发言人变化,则根据所采集的音频定位音频所对应发言人在空间中的位置。
所进行的定位,即根据采集到该音频的时间,利用声源定位技术确定该音频所对应发言人在空间中的位置。
可以理解的是,由于发言人具有一定的体积,发言人在空间中的位置实际上为一空间区域。为了便于进行计算,将发言人所占据空间区域中的某一区域(例如头部所占据的区域),或者某一点用来表示发言人在空间中的位置。
其中,声源定位技术是利用多个声音采集模块采集到音频的时延来确定音频所对应发言人的位置。
至此,可以理解的是,采集终端包括至少两个声音采集模块。在采集终端中存储了各声音采集模块采集到该音频的时间,从而,可以根据各声音采集模块采集到音频的时间对应计算到任两个声音采集模块采集到该音频的时延,进而实现发言人位置的定位。
步骤350,根据所定位到的位置,对采集终端中的摄像头进行调整,调整后,音频所对应发言人位于摄像头的拍摄画面中央,调整包括调整摄像头的拍摄角度和/或调整摄像头的焦距。
在一些实施例中,根据所定位到的位置,即可确定音频所对应发言人相对于摄像头的方位和距离。
对于图像采集而言,特别是以发言人为目标的图像采集而言,以采集到发言人的清晰且便于辨识的图像为目的进行摄像头的调整。
从而所进行的调整可以是调整摄像头的拍摄角度,使得调整后摄像头对准音频所对应发言人;也可以是调整摄像头的焦距,从而保证发言人的人像在所采集的图像中比例,保证观看人员可以通过图像准确辨识发言人;还可以是同时调整摄像头的拍摄角度和焦距,具体根据实际情况确定,即根据所确定的距离和方位判断是否需要进行拍摄角度和焦距的调整。
在一些实施例中,当根据音频所对应发言人相对于摄像头的方位判断发言人未在摄像头当前拍摄角度下的画面中,或者发言人偏离摄像头当前拍摄角度较大,则根据所确定的方位控制摄像头旋转,即调整摄像的拍摄角度,从而保证调整后,摄像头对准发言人。反之,若根据所确定的方位判断发言人位于摄像头当前拍摄角度下的拍摄画面的中央,则不进行拍摄角度调整。
在一些实施例中,当根据音频所对应发言人相对于摄像头的距离判断发言人距离摄像 头较远时,从而使得在当前焦距下所采集的图像中人像在图像中所占据的比例较小,则调整摄像头的焦距,以保证所采集图像中发言人的人像在图像中的比例满足设定的要求;反之,如果判断在当前焦距下所采集的图像中人像在图像中所占据的比例满足要求,则不进行焦距调整。
在一些实施例中,因为焦距位置处的图像是较为清晰的,非焦距位置的图像可能会出现模糊,因此为获取到发言人的清晰图像,根据所定位到的位置,对所述采集终端中的摄像头进行焦距调整,以使得显示器调整为和定位到的位置相适应的焦距,此时发言人的位置位于焦距位置处或附近。
步骤370,通过调整后的摄像头进行图像采集获得音频所对应发言人的图像。
如上,调整摄像头后,音频所对应发言人位于摄像头拍摄画面的中央,从而,即可对应采集获得音频所对应发言人的图像。
其中,发言人的图像可以是发言人的全身图像、上半身图像等,在此不进行具体限定。
在一些实施例中,所采集发言人的图像为以音频所对应发言人为主体的图像。
其中,本申请所采集发言人的图像用于在采集终端中进行显示,从而在发言人发言的同时,显示发言人的图像。其中采集终端可以通过自身的显示屏幕进行显示,也可以通过外接的显示设备进行显示,在此不进行具体限定。
在一些实施例中,步骤370之后,该方法还包括:
将采集终端所显示的图像替换为发言人的图像。
在本申请的技术方案中,根据音频判断发言人变化时,根据音频进行发言人定位,并按照所定位到发言人的位置调整摄像头,从而采集到发言人的图像。实现了根据音频进行发言人跟踪定位,并根据发言人的位置采集发言人的图像。从而,保证在采集终端所显示的画面为所采集发言人的图像,可以有效解决相关技术中所显示画面中不存在发言人的人像的问题。
在一些实施例中,在进行显示之前,根据采集终端的显示屏幕的比例大小对发言人的图像进行放大,从而保证所获得的发言人的图像适配于显示屏幕,保证显示效果。
在一些实施例中,控制显示器显示摄像头采集到的图像。
在一些实施例中,控制显示器显示裁切后的发言人的图像。
在一些实施例中,在步骤310之后,若确定发言人未变化,则维持摄像头的拍摄角度不变,从而可以继续采集该发言人的图像并显示。
在另一些实施例中,在步骤310之后,若确定发言人未变化时,不替换采集终端所显示的图像,换言之,若所采集上一音频和本次所采集音频的发言人为同一人,则维持所显示的图像不变。
在另一些实施例中,在步骤310之后,若确定发言人未变化,则根据该音频判断音频所对应发言人的位置是否发生变化,若发言人位置未变化,则根据发言人的位置进行调整摄像头,其中,对摄像头所进行的调整包括调整摄像头的拍摄角度,和/或,根据发言人与摄像头之间的距离调整摄像头的焦距。从而,保证发言人位于摄像头的拍摄画面的中央,从而采集到清晰的发言人的图像,便于观看人员通过所采集发言人的图像辨识发言人。
本申请的方法可以应用到多方视频会议中,从而根据在多方视频会议中所采集到的音频对应按照本申请的方法采集发言人的图像,以在屏幕中显示发言人的图像,并将该发言人的图像同步显示在其它会议方的显示屏幕中,从而使得多方视频会议中的参会人员可以根据所显示的图像确定发言人。
在一些实施例中,如图3所示,步骤310,包括:
步骤311,从音频中提取声纹特征。
如上所描述,所提取的声纹特征可以是梅尔频率倒谱系数、短时能量、短时平均幅度、短时平均过零率、共振峰、线性预测倒谱系数中的一种或者多种,所提取的声纹特征可以保证声纹识别的准确度即可,在此不对所提取的声纹特征进行具体限定。
步骤313,计算所提取声纹特征相对于上一所采集音频所对应声纹特征的声纹相似度。
声纹相似度用于表征当前所采集音频的声纹特征相对于上一所采集音频所对应声纹特征的相似性。
在一些实施例中,为进行声纹相似度的计算,根据为所采集音频提取的声纹特征构建该音频的声纹向量,从而通过当前音频的声纹向量与上一所采集音频的声纹向量进行声纹相似度计算,例如将两声纹向量的欧式距离、余弦距离、马氏距离等作为声纹相似度。
步骤315,根据声纹相似度确定发言人是否变化。
当所计算得到的声纹相似度表征两声纹特征相似时,则确定发言人未变化;反之,若所计算得到的声纹相似度表征两声纹特征不相似时,则确定发言人变化。
在一些实施例中,为根据声纹相似度确定发言人是否变化,可以预先设定相似度范围,若声纹相似度位于该相似度范围内,则表示该声纹相似度所对应两声纹特征相似。
从而,通过确定所计算得到的声纹相似度是否位于所设定的相似度范围即可确定发言人是否变化,即若声纹相似度位于相似度范围内,则确定发言人未变化;反之,若声纹相似度超出相似度范围,则确定发言人变化。
在一些实施例中,采集终端包括一个参考声音采集模块和至少三个非参考声音采集模块,如图4所示,步骤330,包括:
步骤331,根据参考声音采集模块和非参考声音采集模块所分别采集到音频的时间,计算得到每一非参考声音采集模块相对于参考声音采集模块采集到音频的时延。
在本实施例中,各声音采集模块在采集音频的同时,对应存储了采集到音频的时间,从而,根据参考声音采集模块和各非参考声音采集模块所分别采集到该音频的时间对应计算得到每一非参考声音采集模块相对于参考声音采集模块采集到该音频的时延。
步骤333,根据参考声音采集模块、非参考声音采集模块的布置位置和时延进行计算,获得音频所对应发言人的位置坐标。
其中,参考声音采集模块的位置作为参考原点,并构建坐标系,从而根据参考声音采集模块、各非参考声音采集模块的布置位置即可获得各非参考声音采集模块相对于在所构建坐标系中的坐标。
而根据每一非参考声音采集模块相对于参考声音采集模块采集到该音频的时延即可计算得到音频所对应发言人与非参考声音采集模块和与参考声音采集模块的距离差。
通过各非参考声音采集模块的坐标和所计算得到的距离差构建如下的矩阵方程:
AX=B
其中,矩阵A为n×4的矩阵,n为非参考声音采集模块的数量,矩阵A中的第i行元素为[x i,y i,z i,d i],x i为第i个非参考声音采集模块的x轴坐标,y i为第i个非参考声音采集模块的y轴坐标,z i为第i个非参考声音采集模块的z轴坐标,d i为音频所对应发言人与第i个非参考声音采集模块和与参考声音采集模块的距离差;X=[x,y,z,R] T;矩阵B为n×4的矩阵,矩阵B中的第i行元素为
b_i = (x_i^2 + y_i^2 + z_i^2 − d_i^2)/2
对上述矩阵方程进行求解,即可计算得到音频所对应发言人的位置坐标(x,y,z)。
在一些实施例中,如图5所示,步骤350,包括:
步骤351,根据所定位到的位置,确定音频所对应发言人相对于摄像头的距离和方位。
步骤353,根据所确定的距离调整摄像头的焦距,以及根据所确定的方位调整摄像头的拍摄角度。
其中,所进行拍摄角度的调整即根据所确定的方位控制摄像头旋转,从而使旋转后的摄像对准音频所对应发言人。
为进行焦距调整,可以根据配置文件进行。在配置文件中对距离与焦距进行了映射,从而,在确定音频所对应发言人与摄像头的距离后,从配置文件中获取该距离所映射的焦距,从而,将摄像头的焦距调整为所获取的焦距。
在一些实施例中,如图6所示,步骤370,包括:
步骤371,根据调整后的摄像头所采集的图像,进行发言人识别,在图像中定位发言人的人像。
在一应用场景中,若摄像头距离发言人的距离较远,且在采集终端所在的空间中容纳的人员较多,即使音频所对应发言人位于摄像头拍摄画面的中央,而在旋转后的摄像头的拍摄角度下,所采集到的图像中可能包括多个人员。
在此应用场景下,为了准确地获得音频所对应发言人的图像,进行发言人识别,确定音频所对应发言人的人像在所采集图像中的位置。
对于人员而言,发言的同时唇部对应进行动作。从而所进行的发言人识别可以通过所采集图像中各人员的唇部动作进行识别。例如从连续采集的图像中提取人员的唇部像素,通过比对从连续图像中所提取的唇部像素判断人员的唇部是否动作,如果动作,则确定该唇部像素所在人像为发言人的人像;反之,若唇部未动,则确定该唇部像素所在人像不是发言人的人像。
在其他实施例中,为进行发言人识别,可以预先进行动作约定,例如约定发言人在发言时进行举手示意、约定发言人站立发言,从而,在所采集的图像中通过识别所约定的动作,例如举手动作、站立,并将图像中呈现该动作状态的人像确定为发言人的人像。
步骤373,根据所定位到的人像对图像进行剪裁,获得发言人的图像。
至此,则从包括多个人像的图像中剪裁获得以发言人为主体的图像,即发言人的图像。其中所获得的发言人图像至少包括发言人的面部图像。
在一些参会人员较多的会议场景中,由于显示设备中所显示的是全景画面,从而所显示画面中的人像较多,导致参会的其他方并不能快速地从所显示的画面中定位到当前发言人的人像。
在本实施例的方案,通过进行发言人人像定位,并进行剪裁,从而保证所获得发言人的图像是以发言人为主体,提高人员从发言人的图像中识别发言人的速度。
在一些实施例中,如图7所示,步骤371,包括:
步骤410,根据调整后的摄像头所采集的图像,为所采集图像中的每一人像对指定器官进行像素点提取。
如上所描述,所进行的发言人识别可以是基于图像中各人员的唇部动作或者约定的动作来识别,而不管是唇部或者所约定的动作均是由器官来实现的,例如嘴唇、手等。
用于发言人识别的动作的执行器官即为指定器官,举例来说,若通过唇部动作来进行发言人识别,则嘴唇为指定器官,若手势来进行发言人识别,则手为指定器官。
从而,在所采集图像中进行发言人识别,先在图像中进行指定器官定位,定对应提取指定器官的像素点。
步骤430,根据所提取的像素点进行动作识别,确定所提取像素点所表征的动作。
通过所提取的像素点即可重构指定器官的形状,从而对应根据所重构的形状确定像素点所表征的动作。
步骤450,将所表征动作与预定动作相符的像素点所在人像确定为发言人的人像。
预定动作例如所约定用于进行发言人识别的动作,例如举手、站立、嘴唇动等,在此不进行具体限定。
从而,如果所像素点所表征的动作与预定动作相符,则确定该像素点所在人像为发言人的人像。
在一些实施例中,该方法还包括:检测在间隔设定时间段后是否仍未采集到音频。若为是,则控制摄像头旋转至预设拍摄角度。若为否,则执行对采集的音频进行声纹识别的步骤。在间隔设定时间段后,如果仍未采集到音频,则控制将摄像头旋转至预设拍摄角度。进一步的,在采集终端中显示在该拍摄角度下所采集到的图像。反之,在间隔设定时间段后,如果采集到音频,则转至执行步骤310。
图8是根据一些实施例示出的图像采集控制方法的流程图,在本实施例中,采集终端为包括摄像头和声音采集模块的电视机,如图8所示,包括如下步骤:
步骤510,发言人识别:根据摄像头所采集的图像识别发言人的人像,所进行的发言人识别可以通过嘴唇动或者约定的动作来进行识别。
步骤520,发言人图像剪裁:在图像中识别到发言人的人像后,对所采集的图像进行裁剪,获得发言人的图像,以在电视机上显示所获得的发言人的图像。
步骤530,是否继续采集到音频:实时进行音频采集状态的检测(例如每秒进行检测),如果继续采集到音频,则转至步骤540;若为未采集到音频,则转至步骤560。
步骤540,发言人是否变化:通过所采集到的音频进行声纹识别,以确定发言人是否变化;若发言人变化,则转至步骤550;若发言人未变化,则不做作处理,即继续显示电视机当前所显示的图像。
步骤550,根据发言人的位置调整摄像头:根据所采集到音频的时间确定发言人的位置,从而对应地根据发言人的位置调整摄像头。所进行的调整例如根据发言人相对于摄像头的角度调整摄像头的拍摄角度,又例如根据发言人相对于摄像头的距离调整摄像头的焦距,或者拍摄角度和焦距均调整。然后通过调整后的摄像头进行图像采集,并转至步骤510。
步骤560,是否超过设定时间:在检测未继续采集到音频时开始计时,如果在超过设定时间(例如30s)仍然未采集到音频,则转至步骤570;如果未采集到音频的时间未超过设定时间,则继续进行计时。
步骤570,控制摄像头旋转至预设拍摄角度:在预设拍摄角度下进行图像采集,并在电视机上显示所采集的图像。在显示图像的同时,根据所采集的图像进行发言人识别,即转至步骤510。
下述为本申请装置实施例,可以用于执行本申请上述终端200执行的图像采集的控制方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请图像采集的控制方法实施例。
图9是根据一示例性实施例示出的一种图像采集的控制装置的框图,该装置可以用于图1所示的终端200中,执行任一方法实施例中的全部或者部分步骤。如图9所示,该装置包括但不限于:声纹识别模块610、定位模块630、调整模块650以及图像采集模块670,其中:
声纹识别模块610,用于对采集的音频进行声纹识别,通过声纹识别确定发言人是否变化。
定位模块630,用于若声纹识别模块判断发言人变化,则根据所采集的音频定位音频所对应发言人在空间中的位置。
调整模块650,用于根据所定位到的位置,对采集终端中的摄像头进行调整,调整后,音频所对应发言人位于摄像头的拍摄画面中央,调整包括调整摄像头的拍摄角度和/或调整摄像头的焦距。
图像采集模块670,用于通过调整后的摄像头进行图像采集获得音频所对应发言人的图像。
上述装置中各个模块的功能和作用的实现过程具体详见上述图像采集的控制方法中对应步骤的实现过程,在此不再赘述。
可以理解,这些模块可以通过硬件、软件、或二者结合来实现。当以硬件方式实现时,这些模块可以实施为一个或多个硬件模块,例如一个或多个专用集成电路。当以软件方式 实现时,这些模块可以实施为在一个或多个处理器上执行的一个或多个计算机程序,例如图1的处理器218所执行的存储在存储器204中的程序。
在一些实施例中,声纹识别模块610,包括:特征提取单元,用于从音频中提取声纹特征。计算单元,用于计算所提取声纹特征相对于上一所采集音频所对应声纹特征的声纹相似度。确定单元,用于根据声纹相似度确定发言人是否变化。
在一些实施例中,采集终端包括一个参考声音采集模块和至少三个非参考声音采集模块,定位模块630,包括:
时延计算单元,用于根据参考声音采集模块和非参考声音采集模块所分别采集到音频的时间,计算得到每一非参考声音采集模块相对于参考声音采集模块采集到音频的时延。
坐标计算单元,用于根据参考声音采集模块、非参考声音采集模块的布置位置和时延进行计算,获得音频所对应发言人的位置坐标。
在一些实施例中,调整模块650,包括:角度和方位确定单元,用于根据所定位到的位置,确定音频所对应发言人相对于摄像头的距离和方位。调整单元,用于根据所确定的距离调整摄像头的焦距,以及根据所确定的方位调整摄像头的拍摄角度。
在一些实施例中,图像采集模块670,包括:人像定位单元,用于根据调整后的摄像头所采集的图像,进行发言人识别,在图像中定位发言人的人像。剪裁单元,用于根据所定位到的人像对图像进行剪裁,获得发言人的图像。
在一些实施例中,人像定位单元,包括:像素点提取单元,用于根据调整后的摄像头所采集的图像,为所采集图像中的每一人像对指定器官进行像素点提取。动作识别单元,用于根据所提取的像素点进行动作识别,确定所提取像素点所表征的动作。人像确定单元,用于将所表征动作与预定动作相符的像素点所在人像确定为发言人的人像。
在一些实施例中,该装置还包括:显示替换模块,用于将采集终端所显示的图像替换为发言人的图像。
在一些实施例中,该装置还包括:检测模块,用于检测在间隔设定时间段后是否仍未采集到音频。旋转调整模块,用于若检测模块检测在间隔设定时间段后未采集到音频,则控制摄像头旋转至预设拍摄角度。若检测模块检测在间隔设定时间段后采集到音频,则转至声纹识别模块610。
上述装置中各个模块/单元的功能和作用的实现过程具体详见上述图像法采集的控制方法中对应步骤的实现过程,在此不再赘述。
可选的,本申请还提供一种采集终端,该采集终端可以是图1所示的终端200,执行以上任一方法实施例中的全部或者部分步骤。采集终端包括:
处理器;及存储器,存储器上存储有计算机可读指令,计算机可读指令被处理器执行时实现以上任一方法实施例中的方法。
该实施例中的装置的处理器执行操作的具体方式已经在有关该图像采集的控制方法的实施例中执行了详细描述,此处将不做详细阐述说明。
在示例性实施例中,还提供了一种计算机可读非易失性存储介质,其上存储有计算机可读指令,计算机可读指令被处理器执行时,实现以上任一方法实施例中的方法。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围执行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (12)

  1. An image capture control method, applied to a collection terminal, wherein the method comprises:
    performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether a speaker has changed;
    if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio;
    adjusting a camera in the collection terminal according to the located position, so that after the adjustment the speaker corresponding to the audio is located in the center of the shooting frame of the camera, the adjustment comprising adjusting the shooting angle of the camera and/or adjusting the focal length of the camera;
    performing image capture with the adjusted camera to obtain an image of the speaker corresponding to the audio.
  2. The method according to claim 1, wherein performing voiceprint recognition on the audio and determining through the voiceprint recognition whether the speaker has changed comprises:
    extracting a voiceprint feature from the audio;
    calculating the voiceprint similarity of the extracted voiceprint feature with respect to the voiceprint feature corresponding to the previously collected audio;
    determining according to the voiceprint similarity whether the speaker has changed.
  3. The method according to claim 1, wherein the collection terminal comprises one reference sound collection module and at least three non-reference sound collection modules, and locating the position in space of the speaker corresponding to the audio according to the collected audio comprises:
    calculating, according to the times at which the reference sound collection module and the non-reference sound collection modules respectively collected the audio, the time delay with which each non-reference sound collection module collected the audio relative to the reference sound collection module;
    calculating according to the arrangement positions of the reference sound collection module and the non-reference sound collection modules and the time delays, to obtain the position coordinates of the speaker corresponding to the audio.
  4. The method according to claim 1, wherein adjusting the camera in the collection terminal according to the located position comprises:
    determining, according to the located position, the distance and orientation of the speaker corresponding to the audio relative to the camera;
    adjusting the focal length of the camera according to the determined distance, and adjusting the shooting angle of the camera according to the determined orientation.
  5. The method according to claim 1, wherein performing image capture with the adjusted camera to obtain the image of the speaker corresponding to the audio comprises:
    performing speaker recognition according to the image collected by the adjusted camera, and locating the portrait of the speaker in the image;
    cropping the image according to the located portrait to obtain the image of the speaker.
  6. The method according to claim 5, wherein performing speaker recognition according to the image collected by the adjusted camera and locating the portrait of the speaker in the image comprises:
    extracting, according to the image collected by the adjusted camera, pixels of a designated organ for each portrait in the collected image;
    performing action recognition according to the extracted pixels, and determining the action represented by the extracted pixels;
    determining the portrait whose pixels represent an action matching a predetermined action as the portrait of the speaker.
  7. The method according to claim 1, wherein after performing image capture with the adjusted camera to obtain the image of the speaker corresponding to the audio, the method further comprises:
    replacing the image displayed by the collection terminal with the image of the speaker.
  8. The method according to claim 1, wherein the method further comprises:
    detecting whether audio has still not been collected after a set period of time;
    if so, controlling the camera to rotate to a preset shooting angle;
    if not, performing the step of performing voiceprint recognition on the collected audio.
  9. An image capture control method, applied to a collection terminal, wherein the method comprises:
    performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether a speaker has changed;
    if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio;
    adjusting the focal length of the camera in the collection terminal according to the located position, so that the speaker corresponding to the audio is located at the focal position of the camera;
    performing image capture with the adjusted camera to obtain an image of the speaker corresponding to the audio.
  10. An image capture control method, applied to a collection terminal, wherein the method comprises:
    performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether a speaker has changed;
    if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio;
    adjusting the camera in the collection terminal according to the located position, so that the speaker corresponding to the audio is located in the frame collected by the camera and at the focal position of the camera, the adjustment comprising adjusting the shooting angle of the camera and/or adjusting the focal length of the camera;
    performing image capture with the adjusted camera to obtain an image of the speaker corresponding to the audio.
  11. An image capture control method, applied to a collection terminal, wherein the method comprises:
    performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether a speaker has changed;
    if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio;
    adjusting the camera in the collection terminal according to the located position, so that the speaker corresponding to the audio is located in the frame collected by the camera and at the focal position of the camera, the adjustment comprising adjusting the shooting angle of the camera and/or adjusting the focal length of the camera;
    performing image capture with the adjusted camera;
    performing speaker recognition in the image collected by the camera, so as to locate the portrait of the speaker in the image;
    cropping the image according to the located portrait to obtain the image of the speaker corresponding to the audio;
    outputting the image of the speaker corresponding to the audio on a display.
  12. A collection terminal, comprising:
    a processor; and
    a memory, the memory storing computer-readable instructions which, when executed by the processor, implement the method according to any one of claims 1 to 11.
PCT/CN2020/099455 2019-08-13 2020-06-30 图像采集的控制方法及采集终端 WO2021027424A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910746092.0A CN110505399A (zh) 2019-08-13 2019-08-13 图像采集的控制方法、装置及采集终端
CN201910746092.0 2019-08-13

Publications (1)

Publication Number Publication Date
WO2021027424A1 true WO2021027424A1 (zh) 2021-02-18

Family

ID=68587511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099455 WO2021027424A1 (zh) 2019-08-13 2020-06-30 图像采集的控制方法及采集终端

Country Status (2)

Country Link
CN (1) CN110505399A (zh)
WO (1) WO2021027424A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113682319A (zh) * 2021-08-05 2021-11-23 地平线(上海)人工智能技术有限公司 摄像头调整方法及装置、电子设备和存储介质
CN114554095A (zh) * 2022-02-25 2022-05-27 深圳锐取信息技术股份有限公司 一种4k摄像机的目标对象确定方法以及相关装置

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110505399A (zh) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 图像采集的控制方法、装置及采集终端
CN113556499B (zh) * 2020-04-07 2023-05-09 上海汽车集团股份有限公司 一种车载视频通话方法及车载系统
CN111586341A (zh) * 2020-05-20 2020-08-25 深圳随锐云网科技有限公司 一种视频会议拍摄装置拍摄方法和画面显示方法
CN111901524B (zh) * 2020-07-22 2022-04-26 维沃移动通信有限公司 对焦方法、装置和电子设备
CN112073639A (zh) * 2020-09-11 2020-12-11 Oppo(重庆)智能科技有限公司 拍摄控制方法及装置、计算机可读介质和电子设备
CN112312042A (zh) * 2020-10-30 2021-02-02 维沃移动通信有限公司 显示控制方法、装置、电子设备及存储介质
CN112541402A (zh) * 2020-11-20 2021-03-23 北京搜狗科技发展有限公司 一种数据处理方法、装置和电子设备
TWI798867B (zh) * 2021-06-27 2023-04-11 瑞昱半導體股份有限公司 視訊處理方法與相關的系統晶片
CN113542604A (zh) * 2021-07-12 2021-10-22 口碑(上海)信息技术有限公司 视频对焦方法及装置
CN113824916A (zh) * 2021-08-19 2021-12-21 深圳壹秘科技有限公司 图像显示方法、装置、设备及存储介质
CN115242971A (zh) * 2022-06-21 2022-10-25 海南视联通信技术有限公司 摄像头控制方法、装置、终端设备和存储介质
CN117640877B (zh) * 2024-01-24 2024-03-29 浙江华创视讯科技有限公司 线上会议的画面重构方法及电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107144820A (zh) * 2017-06-21 2017-09-08 歌尔股份有限公司 声源定位方法及装置
US20190215636A1 (en) * 2017-05-24 2019-07-11 Glen A. Norris User Experience Localizing Binaural Sound During a Telephone Call
CN110082723A (zh) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 一种声源定位方法、装置、设备及存储介质
CN110505399A (zh) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 图像采集的控制方法、装置及采集终端

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100505837C (zh) * 2007-05-10 2009-06-24 华为技术有限公司 一种控制图像采集装置进行目标定位的系统及方法
CN104902203A (zh) * 2015-05-19 2015-09-09 广东欧珀移动通信有限公司 一种基于旋转摄像头的视频录制方法及终端
CN104991573A (zh) * 2015-06-25 2015-10-21 北京品创汇通科技有限公司 一种基于声源阵列的定位跟踪方法及其装置
CN107247923A (zh) * 2017-05-18 2017-10-13 珠海格力电器股份有限公司 一种指令识别方法、装置、存储设备、移动终端及电器
CN109754811B (zh) * 2018-12-10 2023-06-02 平安科技(深圳)有限公司 基于生物特征的声源追踪方法、装置、设备及存储介质
CN109783642A (zh) * 2019-01-09 2019-05-21 上海极链网络科技有限公司 多人会议场景的结构化内容处理方法、装置、设备及介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190215636A1 (en) * 2017-05-24 2019-07-11 Glen A. Norris User Experience Localizing Binaural Sound During a Telephone Call
CN107144820A (zh) * 2017-06-21 2017-09-08 歌尔股份有限公司 声源定位方法及装置
CN110082723A (zh) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 一种声源定位方法、装置、设备及存储介质
CN110505399A (zh) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 图像采集的控制方法、装置及采集终端

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113682319A (zh) * 2021-08-05 2021-11-23 地平线(上海)人工智能技术有限公司 摄像头调整方法及装置、电子设备和存储介质
CN113682319B (zh) * 2021-08-05 2023-08-01 地平线(上海)人工智能技术有限公司 摄像头调整方法及装置、电子设备和存储介质
CN114554095A (zh) * 2022-02-25 2022-05-27 深圳锐取信息技术股份有限公司 一种4k摄像机的目标对象确定方法以及相关装置
CN114554095B (zh) * 2022-02-25 2024-04-16 深圳锐取信息技术股份有限公司 一种4k摄像机的目标对象确定方法以及相关装置

Also Published As

Publication number Publication date
CN110505399A (zh) 2019-11-26

Similar Documents

Publication Publication Date Title
WO2021027424A1 (zh) 图像采集的控制方法及采集终端
US10270972B2 (en) Portable video communication system
US10083710B2 (en) Voice control system, voice control method, and computer readable medium
EP3855731B1 (en) Context based target framing in a teleconferencing environment
US10136043B2 (en) Speech and computer vision-based control
US10681308B2 (en) Electronic apparatus and method for controlling thereof
US20150146078A1 (en) Shift camera focus based on speaker position
WO2020119032A1 (zh) 基于生物特征的声源追踪方法、装置、设备及存储介质
WO2019206186A1 (zh) 唇语识别方法及其装置、增强现实设备以及存储介质
US11308692B2 (en) Method and device for processing image, and storage medium
US10250803B2 (en) Video generating system and method thereof
US11477393B2 (en) Detecting and tracking a subject of interest in a teleconference
JP2014165565A (ja) テレビ会議装置およびシステムおよび方法
AU2013222959B2 (en) Method and apparatus for processing information of image including a face
CN114513622A (zh) 说话人检测方法、设备、存储介质及程序产品
US11778130B1 (en) Reversible digital mirror
US11496675B2 (en) Region of interest based adjustment of camera parameters in a teleconferencing environment
KR20160017499A (ko) 피사체의 소리를 수신하는 방법 및 이를 구현하는 전자장치
CN115766927B (zh) 测谎方法、装置、移动终端及存储介质
TWI799048B (zh) 環景影像會議系統及方法
CN114339554B (zh) 发声装置及其控制方法
CN115766927A (zh) 测谎方法、装置、移动终端及存储介质
CN116055858A (zh) 一种控制方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20853339

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20853339

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29-11-22)

122 Ep: pct application non-entry in european phase

Ref document number: 20853339

Country of ref document: EP

Kind code of ref document: A1