WO2021027424A1 - Control method for image collection and collection terminal - Google Patents
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/67—Focus control based on electronic image sensor signals
- H04N23/695—Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
Definitions
- This application relates to the field of multimedia technology, and in particular to a method for controlling image collection and a collection terminal.
- the display device displays images in real time to show the status of multiple parties in the conference.
- the image displayed by the display device is the image collected by the camera.
- the image collected by the camera is restricted by the deployment position of the camera, and the camera is not adjustable. Therefore, participants in the camera's blind area do not appear in the collected image. Furthermore, if the speaker is located in the camera's blind spot, the picture displayed by the display device does not include the speaker's portrait, because the image in the blind spot cannot be collected, so the other participants cannot see the speaker.
- this application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position, so that after the adjustment the speaker corresponding to the audio is located in the center of the camera's shooting picture, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image collection with the adjusted camera to obtain the image of the speaker corresponding to the audio.
- the present application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the focal length of the camera in the collection terminal according to the located position, so that the speaker corresponding to the audio is at the focal position of the camera; and performing image collection with the adjusted camera to obtain the image of the speaker corresponding to the audio.
- the present application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position, so that the speaker corresponding to the audio is located in the image collected by the camera and at the focal position of the camera, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image collection with the adjusted camera to obtain the image of the speaker corresponding to the audio.
- this application provides a method for controlling image collection, applied to a collection terminal. The method includes: performing voiceprint recognition on the collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating the position in space of the speaker corresponding to the audio according to the collected audio; adjusting the camera in the collection terminal according to the located position, so that the speaker corresponding to the audio is located in the image collected by the camera and at the focal position of the camera, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; performing image collection with the adjusted camera; performing speaker recognition in the image collected by the camera to locate the portrait of the speaker in the image; cropping the image according to the located portrait to obtain the image of the speaker corresponding to the audio; and outputting the image of the speaker corresponding to the audio on a display.
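The claim-style flow above (voiceprint check, localization, camera adjustment, capture, display) can be sketched as a minimal control loop. Every helper callable named here is a hypothetical placeholder standing in for the corresponding module, not an API defined by the application:

```python
# Hypothetical sketch of the described control flow; all helpers are
# placeholders supplied by the caller, not APIs from the application.

def control_loop(collect_audio, speaker_changed, locate_speaker,
                 adjust_camera, capture_image, display):
    """Run one image-collection cycle per collected audio segment."""
    last_voiceprint = None
    for audio in collect_audio():
        changed, last_voiceprint = speaker_changed(audio, last_voiceprint)
        if not changed:
            continue  # speaker unchanged: keep the current picture
        position = locate_speaker(audio)   # sound-source localization
        adjust_camera(position)            # pan/tilt and/or focal length
        display(capture_image())           # show the speaker's image
```

A caller would wire in real audio collection, voiceprint comparison, localization, and camera control behind these parameters.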
- the present application provides a control device for image collection, applied to a collection terminal. The device includes: a voiceprint recognition module, configured to perform voiceprint recognition on the collected audio and determine through the voiceprint recognition whether the speaker has changed; a positioning module, configured to locate, if the voiceprint recognition module determines that the speaker has changed, the position in space of the speaker corresponding to the audio according to the collected audio; a control module, configured to adjust the camera in the collection terminal according to the located position, so that after the adjustment the speaker corresponding to the audio is located in the center of the camera's shooting picture, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and an image acquisition module, configured to perform image collection with the adjusted camera to obtain the image of the speaker corresponding to the audio.
- this application provides a collection terminal, including: a processor; and a memory storing computer-readable instructions which, when executed by the processor, implement the above method.
- Fig. 1 is a block diagram of a terminal according to an exemplary embodiment;
- Fig. 2 is a flowchart of a method for controlling image collection according to an exemplary embodiment;
- Fig. 3 is a flowchart of step 310 in some embodiments of the embodiment corresponding to Fig. 2;
- Fig. 4 is a flowchart of step 330 in some embodiments of the embodiment corresponding to Fig. 2;
- Fig. 5 is a flowchart of step 350 in some embodiments of the embodiment corresponding to Fig. 2;
- Fig. 6 is a flowchart of step 370 in some embodiments of the embodiment corresponding to Fig. 2;
- Fig. 7 is a flowchart of step 371 in some embodiments of the embodiment corresponding to Fig. 6;
- Fig. 8 is a flowchart of a method for controlling image collection according to some embodiments;
- Fig. 9 is a block diagram of an image collection control device according to an exemplary embodiment.
- Fig. 1 is a block diagram showing a terminal 200 according to an exemplary embodiment.
- the terminal 200 can be used as a fixed terminal for image collection according to the method of the present application.
- the terminal 200 is, for example, a television, a desktop computer, etc. that integrate a camera and a sound collection module.
- the terminal 200 may include one or more of the following components: a processing component 202, a memory 204, a power supply component 206, a multimedia component 208, a sound collection component 210, a camera 214, and a communication component 216.
- the processing component 202 generally controls the overall operations of the terminal 200, such as operations associated with display, image capture, data communication, camera rotation, and recording operations.
- the processing component 202 may include one or more processors 218 to execute instructions to complete all or part of the steps of the following method.
- the processing component 202 may include one or more modules to facilitate the interaction between the processing component 202 and other components.
- the processing component 202 may include a multimedia module to facilitate the interaction between the multimedia component 208 and the processing component 202.
- the memory 204 is configured to store various types of data to support operations in the terminal 200. Examples of these data include instructions for any application or method operating on the terminal 200.
- the memory 204 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, a magnetic disk, or an optical disk.
- the memory 204 also stores one or more modules, and the one or more modules are configured to be executed by the one or more processors 218 to complete all or part of the steps in any of the following method embodiments.
- the power supply component 206 provides power to various components of the terminal 200.
- the power supply component 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 200.
- the multimedia component 208 includes a screen that provides an output interface between the terminal 200 and the user.
- the screen may include a liquid crystal display (Liquid Crystal Display, LCD for short) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
- the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
- the screen may also include an organic electroluminescence display (Organic Light Emitting Display, OLED for short). Among them, the image collected by the camera can be displayed on the screen.
- the sound collection component 210 is configured to perform audio collection, where the sound collection component 210 may include several sound collection modules, such as a microphone (Microphone, MIC for short), through which the sound collection component 210 performs audio collection.
- the camera 214 is used for image collection to obtain an image.
- the terminal 200 includes at least one camera capable of controlled rotation. Therefore, after determining the change of the speaker, the camera can be rotated according to the position of the speaker to collect the image of the speaker.
- the communication component 216 is configured to facilitate wired or wireless communication between the terminal 200 and other devices.
- the terminal 200 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity).
- the communication component 216 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
- the communication component 216 further includes a near field communication (Near Field Communication, NFC for short) module to facilitate short-range communication.
- the NFC module can be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth technology, and other technologies.
- the terminal 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components to perform the following methods.
- Fig. 2 is a flow chart showing a method for controlling image acquisition according to an exemplary embodiment. This image collection control method is applied to a collection terminal, such as the terminal 200 shown in FIG. 1. As shown in Figure 2, the method may include the following steps:
- Step 310 Perform voiceprint recognition on the collected audio, and determine whether the speaker changes through the voiceprint recognition.
- the collection terminal includes a sound collection module, which performs audio collection through the sound collection module, such as a microphone.
- the sound collection module can be integrated inside the collection terminal, or deployed outside the collection terminal, for example, connected to the collection terminal through an external interface.
- the sound collection module of the collection terminal continuously collects signals. It is understandable that because people do not speak continuously, the signals collected by the sound collection module include audio signals and non-audio signals.
- the audio referred to in this application comes from the audio signal collected by the audio collection module, for example, a segment of the audio signal, or the entire segment of audio signal between two adjacent non-audio signals.
- endpoint detection is used to determine the audio signal and the non-audio signal in the signal collected by the sound collection module.
- the collected signal is segmented, and image collection is controlled according to the disclosed method for each segment of audio obtained by the segmentation.
- the segmentation is performed, for example, on the basis of the audio signals and silent signals determined by endpoint detection, with the audio signal between two adjacent silent signals taken as one segment of audio.
- the collected signal may also be segmented according to the set collection period, so that the audio signal segment obtained by the segmentation is regarded as a piece of audio.
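The application does not fix a particular endpoint-detection algorithm; a minimal sketch, assuming a short-time-energy criterion, classifies fixed-size frames as audio or non-audio and merges runs of audio frames into segments. Frame size and threshold are illustrative assumptions:

```python
# Minimal endpoint-detection sketch: classify fixed-size frames by
# short-time energy, then merge consecutive audio frames into segments.
# frame_len and threshold are illustrative, not values from the patent.

def segment_audio(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of audio segments."""
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i                  # a segment begins here
        elif start is not None:
            segments.append((start, i))    # segment ends at a silent frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

Each returned (start, end) pair would then be treated as one piece of audio for voiceprint recognition.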
- in Step 310, in order to reduce the amount of calculation, voiceprint recognition is performed only on an audio signal segment that immediately follows a silent signal. In other words, if the signal segment preceding the audio is still an audio signal, Step 310 is not performed, and the speaker corresponding to the audio is assumed to be the speaker corresponding to the preceding adjacent audio signal segment.
- because each person's vocal organs, such as the vocal cords, mouth, and nasal cavity, differ in shape, and the capacity and frequency ranges produced during pronunciation are not the same, the sound produced by each person's vocal organs carries characteristics unique to that person: a personal voiceprint.
- a human voiceprint is represented by voiceprint features.
- the voiceprint feature is obtained by feature extraction based on the collected audio.
- voiceprint features include, for example, Mel Frequency Cepstral Coefficients (MFCC), short-term energy, short-term average amplitude, short-term average zero-crossing rate, formants, and Linear Prediction Cepstral Coefficients (LPCC).
- the voiceprint features extracted from the audio for voiceprint recognition may be one or more types, which are not specifically limited here.
- the voiceprint recognition checks whether the voiceprint features of the currently collected audio are consistent with the voiceprint features of the last collected audio. If they are inconsistent, the speaker corresponding to the currently collected audio differs from the speaker corresponding to the last collected audio, that is, the speaker has changed; conversely, if they are consistent, the two audios correspond to the same speaker, that is, the speaker has not changed.
- Step 330 If the speaker changes, locate the position in the space of the speaker corresponding to the audio according to the collected audio.
- the positioning performed is to determine the position of the speaker corresponding to the audio in space by using the sound source positioning technology according to the time when the audio is collected.
- the position of the speaker in space is actually a spatial area, for example the area occupied by the head; a certain point in the spatial area occupied by the speaker is used to indicate the speaker's position in space.
- the sound source localization technology uses the time delay of the audio collected by multiple sound collection modules to determine the position of the speaker corresponding to the audio.
- the collection terminal includes at least two sound collection modules.
- the time at which each sound collection module collects the audio is stored in the collection terminal, so that the time delay between any two sound collection modules can be calculated from these times, and the position of the speaker can then be located.
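The application derives delays from stored collection times; as an illustrative alternative working on the raw signals, the delay between two modules can be estimated by cross-correlation. This sketch assumes discrete signals sampled at a common rate:

```python
import numpy as np

# Sketch of estimating the delay (in samples) of one microphone's signal
# relative to a reference microphone via cross-correlation. This is one
# common way to obtain the time delay used by sound source localization;
# the application itself computes delays from stored collection times.

def estimate_delay(reference, other):
    """Return the lag (in samples) at which `other` best matches `reference`."""
    corr = np.correlate(other, reference, mode="full")
    # In "full" mode, lags run from -(len(reference) - 1) upward.
    return int(np.argmax(corr)) - (len(reference) - 1)
```

Dividing the lag by the sampling rate gives the time delay in seconds, which multiplied by the speed of sound gives a range difference.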
- Step 350 Adjust the camera in the collection terminal according to the located position. After the adjustment, the speaker corresponding to the audio is located in the center of the shooting screen of the camera.
- the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera.
- the position and distance of the speaker corresponding to the audio relative to the camera can be determined according to the located position.
- the camera is adjusted for the purpose of collecting clear and easily recognizable images of the speaker.
- the adjustment can be to adjust the shooting angle of the camera so that the camera is aimed at the speaker corresponding to the audio; it can also be to adjust the focal length of the camera, to ensure the proportion of the speaker's portrait in the collected image and to ensure that a viewer can accurately identify the speaker through the image; the shooting angle and the focal length can also both be adjusted. Which adjustments are needed is determined according to the actual situation, that is, according to the determined distance and orientation.
- when, according to the position of the speaker corresponding to the audio relative to the camera, it is determined that the speaker is not in the picture under the camera's current shooting angle, or deviates greatly from the center of that picture, the camera is rotated according to the determined position; that is, the shooting angle is adjusted so that the camera is aimed at the speaker after the adjustment. Conversely, if the determined orientation shows that the speaker is already in the center of the picture under the current shooting angle, the shooting angle is not adjusted.
- because the image at the focal position is relatively clear while the image at a non-focal position may be blurred, in order to obtain a clear image of the speaker, the collection terminal adjusts the camera's focal length according to the located position so that the focal length is compatible with that position; the speaker is then at or near the focal position.
- Step 370 Perform image collection through the adjusted camera to obtain an image of the speaker corresponding to the audio.
- the speaker corresponding to the audio is located in the center of the camera shooting screen, so that the image of the speaker corresponding to the audio can be correspondingly collected.
- the image of the speaker may be a full-body image, an upper body image, etc., of the speaker, which is not specifically limited here.
- the captured image of the speaker is an image whose main body is the speaker corresponding to the audio.
- the image of the speaker collected in this application is used for display in the collection terminal, so that the image of the speaker is displayed while the speaker is speaking.
- the collection terminal can be displayed through its own display screen or through an external display device, which is not specifically limited here.
- in the method above, when it is determined from the audio that the speaker has changed, the speaker is located according to the audio, and the camera is adjusted according to the located position so as to collect the speaker's image. This realizes audio-based tracking and positioning of the speaker and collects the speaker's image according to the speaker's location, ensuring that the picture displayed on the collection terminal is the collected image of the speaker, which effectively solves the problem in the related art of the speaker's portrait being absent from the displayed picture.
- before being displayed, the image of the speaker is enlarged according to the scale of the collection terminal's display screen, so that the obtained image of the speaker fits the display screen and the display effect is ensured.
- the display is controlled to display images captured by the camera.
- the display is controlled to display the cropped image of the speaker.
- in step 310, if it is determined that the speaker has not changed, the shooting angle of the camera is kept unchanged, so that images of the speaker can be continuously collected and displayed.
- in step 310, if it is determined that the speaker has not changed, the image displayed on the collection terminal is not replaced; in other words, if the last collected audio and the currently collected audio correspond to the same speaker, the displayed image is kept unchanged.
- in step 310, if it is determined that the speaker has not changed, it is further determined from the audio whether the position of the speaker corresponding to the audio has changed, and if the position of the speaker has changed, the camera is adjusted according to the speaker's position, where the adjustment includes adjusting the shooting angle of the camera and/or adjusting the focal length of the camera according to the distance between the speaker and the camera. This ensures that the speaker remains in the center of the camera's shooting picture, so that a clear image of the speaker is collected and an observer can easily recognize the speaker in the collected image.
- the method of this application can be applied to a multi-party video conference: according to the audio collected in the conference, the image of the speaker is collected according to the method of this application and displayed on the screen, and the speaker's image is simultaneously displayed on the display screens of the other conference parties, so that participants in the multi-party video conference can identify the speaker from the displayed image.
- step 310 includes:
- Step 311 Extract voiceprint features from the audio.
- the extracted voiceprint feature can be one or more of Mel frequency cepstrum coefficient, short-term energy, short-term average amplitude, short-term average zero-crossing rate, formant, and linear prediction cepstrum coefficient .
- the extracted voiceprint features can ensure the accuracy of voiceprint recognition, and the extracted voiceprint features are not specifically limited here.
- Step 313 Calculate the voiceprint similarity of the extracted voiceprint feature with respect to the voiceprint feature corresponding to the last collected audio.
- the voiceprint similarity is used to characterize the similarity of the voiceprint feature of the currently collected audio with respect to the corresponding voiceprint feature of the last collected audio.
- a voiceprint vector for the audio is constructed from the voiceprint features extracted from the collected audio, and the voiceprint similarity is calculated between the voiceprint vector of the current audio and the voiceprint vector of the last collected audio, for example using the Euclidean distance, cosine distance, or Mahalanobis distance between the two voiceprint vectors as the voiceprint similarity.
- Step 315 Determine whether the speaker changes according to the voiceprint similarity.
- if the calculated voiceprint similarity indicates that the two voiceprint features are similar, it is determined that the speaker has not changed; conversely, if it indicates that the two voiceprint features are not similar, it is determined that the speaker has changed.
- in order to determine from the voiceprint similarity whether the speaker has changed, a similarity range can be preset: if the voiceprint similarity falls within the similarity range, the two voiceprint features corresponding to that similarity are considered similar; otherwise, a speaker change is determined.
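Steps 313 and 315 can be sketched with cosine similarity between two voiceprint vectors and a preset threshold. The 0.8 threshold is an illustrative assumption, not a value from the application:

```python
import math

# Sketch of the speaker-change decision: compare the current audio's
# voiceprint vector with the previous one by cosine similarity and
# declare a change when the similarity falls below a preset threshold.
# The threshold value is an illustrative assumption.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def speaker_changed(current, previous, threshold=0.8):
    """True when the voiceprints are dissimilar, i.e. the speaker changed."""
    return cosine_similarity(current, previous) < threshold
```

The Euclidean or Mahalanobis distances mentioned above could be substituted for cosine similarity with an appropriately inverted threshold test.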
- the collection terminal includes one reference sound collection module and at least three non-reference sound collection modules.
- step 330 includes:
- Step 331 According to the times at which the audio is collected by the reference sound collection module and by each non-reference sound collection module, calculate the time delay of each non-reference sound collection module with respect to the reference sound collection module.
- each sound collection module stores the time at which it collected the audio. The time delay of each non-reference sound collection module relative to the reference sound collection module is therefore calculated from the times at which the reference module and each non-reference module collected the audio.
- Step 333 Calculate, according to the arrangement positions of the reference and non-reference sound collection modules and the time delays, the position coordinates of the speaker corresponding to the audio.
- the position of the reference sound collection module is used as the reference origin, and a coordinate system is constructed, so that the coordinates of each non-reference sound collection module in the constructed coordinate system can be obtained from the arrangement positions of the reference and non-reference sound collection modules.
- from each time delay and the speed of sound, the difference between the speaker's distance to a non-reference sound collection module and the speaker's distance to the reference sound collection module can be calculated.
- in one formulation, the unknowns are written as X = [x, y, z, R]^T, where (x, y, z) are the speaker's position coordinates and R is the distance from the speaker to the reference sound collection module, and the positioning problem takes the form of a matrix equation A·X = B.
- matrix A is an n×4 matrix, where n is the number of non-reference sound collection modules; the i-th row of matrix A is [x_i, y_i, z_i, d_i], where x_i, y_i, and z_i are the coordinates of the i-th non-reference sound collection module in the constructed coordinate system, and d_i is the difference between the speaker's distance to that module and the speaker's distance to the reference sound collection module.
- matrix B is an n×1 column vector whose i-th element is (x_i² + y_i² + z_i² − d_i²)/2.
- Solving the above matrix equation, for example in the least-squares sense, yields the position coordinates (x, y, z) of the speaker corresponding to the audio.
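A least-squares sketch of this kind of localization: with the reference sound collection module at the coordinate origin, each non-reference module at (x_i, y_i, z_i) with range difference d_i contributes one row [x_i, y_i, z_i, d_i], and the right-hand side element is (x_i² + y_i² + z_i² − d_i²)/2, a common linearization of the range-difference equations. The microphone layout in the test is illustrative:

```python
import numpy as np

# Least-squares sketch of solving A·X = B for the speaker position.
# The reference sound collection module sits at the origin; X = [x, y, z, R].

def locate_speaker(mic_positions, range_diffs):
    """mic_positions: (n, 3) non-reference microphone coordinates;
    range_diffs: (n,) distance differences d_i relative to the reference.
    Returns the estimated speaker coordinates (x, y, z)."""
    mics = np.asarray(mic_positions, dtype=float)
    d = np.asarray(range_diffs, dtype=float)
    A = np.column_stack([mics, d])                  # rows [x_i, y_i, z_i, d_i]
    B = (np.sum(mics ** 2, axis=1) - d ** 2) / 2.0  # (x_i²+y_i²+z_i²-d_i²)/2
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return X[:3]                                    # drop R, keep (x, y, z)
```

With four unknowns, at least four well-placed non-reference modules give a determined system; more modules overdetermine it and the least-squares solution averages out measurement noise.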
- step 350 includes:
- Step 351 Determine the distance and orientation of the speaker corresponding to the audio relative to the camera according to the located position.
- Step 353 Adjust the focal length of the camera according to the determined distance, and adjust the shooting angle of the camera according to the determined orientation.
- The shooting angle is adjusted by controlling the camera to rotate according to the determined orientation, so that the rotated camera is aimed at the speaker corresponding to the audio.
- The focal length mapped to the determined distance is obtained from a configuration file, and the focal length of the camera is then adjusted to the obtained value.
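A hedged sketch of this step: the distance and rotation angles can be derived from the located coordinates with basic trigonometry, and the focal length looked up from a hypothetical distance-to-focal-length table standing in for the configuration file. The axis convention and table values below are assumptions, not taken from the source.

```python
import math

def camera_adjustment(x, y, z):
    """Derive subject distance and pan/tilt angles from the located position,
    taking the camera at the origin with the y-axis pointing forward
    (the axis convention is an assumption, not from the source)."""
    distance = math.sqrt(x * x + y * y + z * z)
    pan = math.degrees(math.atan2(x, y))                   # horizontal rotation
    tilt = math.degrees(math.atan2(z, math.hypot(x, y)))   # vertical rotation
    return distance, pan, tilt

# Hypothetical (distance m, focal length mm) pairs standing in for the
# configuration file mentioned in the text.
FOCAL_TABLE = [(1.0, 4.0), (2.0, 6.0), (4.0, 9.0), (8.0, 14.0)]

def focal_length_for(distance):
    """Return the focal length mapped to the nearest configured distance."""
    return min(FOCAL_TABLE, key=lambda entry: abs(entry[0] - distance))[1]
```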
- step 370 includes:
- Step 371 Perform speaker identification according to the image collected by the adjusted camera, and locate the portrait of the speaker in the image.
- the collected image may include multiple people.
- Since the lips move while a person speaks, the speaker can be recognized from the lip actions of each person in the collected images. For example, the lip pixels of a person are extracted from continuously collected images, and whether the person's lips are moving is judged by comparing the lip pixels extracted from consecutive images. If the lips are moving, the portrait containing those lip pixels is determined to be the speaker's portrait; conversely, if the lips are not moving, the portrait containing those lip pixels is determined not to be the speaker's portrait.
- Alternatively, an action convention may be agreed in advance, for example that the speaker raises a hand or stands while speaking, so that the agreed action, such as raising a hand or standing, is recognized in the collected image, and the portrait of the person performing it is determined to be the speaker's portrait.
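The lip-motion check described above can be sketched as simple frame differencing, assuming equally sized grayscale crops of the same person's lip area have already been extracted from consecutive frames; the threshold value is illustrative, not taken from the source.

```python
import numpy as np

def lips_are_moving(lip_regions, threshold=12.0):
    """Decide whether a person's lips moved across consecutively captured
    frames. `lip_regions` is a sequence of equally sized grayscale crops of
    the same person's lip area, one per frame."""
    if len(lip_regions) < 2:
        return False
    diffs = [
        np.mean(np.abs(curr.astype(float) - prev.astype(float)))
        for prev, curr in zip(lip_regions, lip_regions[1:])
    ]
    # Lips are judged to be moving when the frame-to-frame change of the
    # lip pixels exceeds the (illustrative) threshold.
    return max(diffs) > threshold
```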
- Step 373 Clip the image according to the positioned portrait to obtain an image of the speaker.
- That is, the image with the speaker as the main subject, i.e. the speaker's image, is obtained by cropping it from the image containing multiple portraits.
- the obtained speaker image includes at least the face image of the speaker.
- The speaker's portrait is located and cropped so that the obtained image is centered on the speaker, which helps viewers identify the speaker from the image more quickly.
- step 371 includes:
- Step 410 According to the image collected by the adjusted camera, pixel points of the designated organ are extracted for each person in the collected image.
- Speaker recognition may be based on the lip actions or the agreed actions of each person in the image; in either case, the action is performed by a body organ, such as the lips or the hands.
- The organ that performs the action used for speaker recognition is the designated organ. For example, if the speaker is recognized by lip motion, the designated organ is the lips; if a gesture is used for speaker recognition, the designated organ is the hand.
- When speaker recognition is performed on the collected images, the designated organ is first located in the image, and its pixel points are extracted accordingly.
- Step 430 Perform action recognition according to the extracted pixels, and determine the action represented by the extracted pixels.
- the shape of the designated organ can be reconstructed through the extracted pixels, so as to determine the action represented by the pixel according to the reconstructed shape.
- Step 450 Determine the portrait whose represented action matches the predetermined action as the portrait of the speaker.
- the predetermined actions are, for example, actions agreed to be used for speaker recognition, such as raising hands, standing, moving lips, etc., which are not specifically limited here.
- If the action represented by the pixel points matches the predetermined action, the portrait containing those pixel points is determined to be the speaker's portrait.
- The method further includes detecting whether audio has still not been collected after a set period of time. If no audio has been collected, the camera is controlled to rotate to a preset shooting angle, and the image collected at that angle is displayed on the collection terminal. If audio is collected within the period, the step of performing voiceprint recognition on the collected audio is executed, that is, the method proceeds to step 310.
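The timeout behaviour described above can be sketched as a small watchdog; the 30-second interval and the injectable clock are illustrative assumptions for testability, not details from the source.

```python
import time

IDLE_TIMEOUT = 30.0  # seconds; stands in for the "set period of time"

class IdleWatchdog:
    """Track whether audio has been absent long enough to reset the camera."""

    def __init__(self, timeout=IDLE_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.silence_started = None  # None while audio is being collected

    def audio_collected(self):
        """Call whenever audio arrives; cancels any running silence timer."""
        self.silence_started = None

    def audio_missing(self):
        """Call on each detection tick when no audio was collected.

        Returns True once the silence has lasted at least `timeout`,
        signalling that the camera should rotate to its preset angle."""
        now = self.clock()
        if self.silence_started is None:
            self.silence_started = now
        return (now - self.silence_started) >= self.timeout
```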
- Fig. 8 is a flowchart of an image capture control method according to some embodiments.
- the collection terminal is a television including a camera and a sound collection module. As shown in Fig. 8, it includes the following steps:
- Step 510 Speaker recognition: the speaker's portrait is recognized from the image collected by the camera, for example by detecting moving lips or an agreed action.
- Step 520 Clipping of the speaker's image: After identifying the portrait of the speaker in the image, crop the collected image to obtain the image of the speaker, so as to display the obtained image of the speaker on the TV.
- Step 530 whether to continue to collect audio: real-time detection of the audio collection state (for example, detection every second), if the audio continues to be collected, go to step 540; if no audio is collected, go to step 560.
- Step 540 whether the speaker has changed: perform voiceprint recognition on the collected audio to determine whether the speaker has changed; if the speaker has changed, go to step 550; if the speaker has not changed, no processing is performed, that is, the image currently displayed on the TV continues to be shown.
- Step 550 Adjust the camera according to the position of the speaker: Determine the position of the speaker according to the time of the collected audio, and accordingly adjust the camera according to the position of the speaker.
- the adjustment performed is, for example, adjusting the shooting angle of the camera according to the angle of the speaker relative to the camera, or adjusting the focal length of the camera according to the distance of the speaker relative to the camera, or adjusting both the shooting angle and the focal length. Then, perform image collection through the adjusted camera, and go to step 510.
- Step 560 whether the set time is exceeded: timing starts when it is detected that audio is no longer being collected. If audio has still not been collected after the set time (for example, 30 s), go to step 570; if the set time has not yet elapsed, timing continues.
- Step 570 Control the camera to rotate to a preset shooting angle: perform image collection at the preset shooting angle, and display the collected image on the TV. While displaying the image, perform speaker recognition based on the collected image, that is, go to step 510.
- Fig. 9 is a block diagram showing an image acquisition control device according to an exemplary embodiment.
- the device can be used in the terminal 200 shown in Fig. 1 to perform all or part of the steps in any method embodiment.
- the device includes, but is not limited to: a voiceprint recognition module 610, a positioning module 630, an adjustment module 650, and an image acquisition module 670, wherein:
- the voiceprint recognition module 610 is configured to perform voiceprint recognition on the collected audio, and determine whether the speaker changes through the voiceprint recognition.
- the positioning module 630 is configured to locate the position in the space of the speaker corresponding to the audio according to the collected audio if the voiceprint recognition module determines that the speaker changes.
- the adjustment module 650 is used to adjust the camera in the collection terminal according to the positioned position. After adjustment, the speaker corresponding to the audio is located in the center of the shooting screen of the camera. The adjustment includes adjusting the shooting angle of the camera and/or adjusting the camera focal length.
- the image acquisition module 670 is used for image acquisition through the adjusted camera to obtain an image of the speaker corresponding to the audio.
- modules can be implemented by hardware, software, or a combination of both.
- these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits.
- these modules may be implemented as one or more computer programs executed on one or more processors, for example, the programs stored in the memory 204 executed by the processor 218 in FIG. 1.
- the voiceprint recognition module 610 includes a feature extraction unit for extracting voiceprint features from audio.
- the calculation unit is used to calculate the voiceprint similarity of the extracted voiceprint feature with respect to the voiceprint feature corresponding to the last collected audio.
- the determining unit is used to determine whether the speaker changes according to the similarity of the voiceprint.
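The comparison performed by these units can be sketched as cosine similarity over extracted voiceprint feature vectors; the similarity metric and the 0.75 threshold are illustrative assumptions, since the source fixes neither.

```python
import numpy as np

def speaker_changed(current_feature, previous_feature, threshold=0.75):
    """Compare the voiceprint feature of the current audio against the one
    from the previously collected audio via cosine similarity. The metric
    and threshold are illustrative, not taken from the source."""
    a = np.asarray(current_feature, dtype=float)
    b = np.asarray(previous_feature, dtype=float)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Low similarity means the two audio segments are unlikely to come from
    # the same speaker, i.e. the speaker is judged to have changed.
    return similarity < threshold
```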
- the collection terminal includes one reference sound collection module and at least three non-reference sound collection modules
- the positioning module 630 includes:
- the time delay calculation unit is configured to calculate the audio time delay of each non-reference sound collection module relative to the reference sound collection module according to the time when the audio is collected by the reference sound collection module and the non-reference sound collection module.
- the coordinate calculation unit is used to calculate according to the arrangement position and time delay of the reference sound collection module and the non-reference sound collection module to obtain the position coordinates of the speaker corresponding to the audio.
- The adjustment module 650 includes a distance and orientation determining unit for determining the distance and orientation of the speaker corresponding to the audio relative to the camera according to the located position.
- the adjustment unit is used to adjust the focal length of the camera according to the determined distance, and adjust the shooting angle of the camera according to the determined orientation.
- the image acquisition module 670 includes: a portrait positioning unit, configured to perform speaker identification according to the image collected by the adjusted camera, and locate the portrait of the speaker in the image.
- the cropping unit is used to crop the image according to the positioned portrait to obtain the image of the speaker.
- the portrait positioning unit includes: a pixel extraction unit for extracting pixels for a designated organ for each portrait in the captured image according to the image captured by the adjusted camera.
- the action recognition unit is used to perform action recognition according to the extracted pixels and determine the action represented by the extracted pixels.
- the portrait determination unit is used to determine the portrait of the pixel where the represented action matches the predetermined action as the portrait of the speaker.
- the device further includes: a display replacement module for replacing the image displayed by the collection terminal with the image of the speaker.
- the device further includes: a detection module for detecting whether audio is still not collected after the interval is set for a period of time.
- The rotation adjustment module is used to control the camera to rotate to a preset shooting angle if the detection module detects that no audio has been collected after the set period of time; if the detection module detects that audio has been collected within the set period, control passes to the voiceprint recognition module 610.
- the present application also provides a collection terminal.
- the collection terminal may be the terminal 200 shown in FIG. 1 and executes all or part of the steps in any of the above method embodiments.
- the collection terminal includes:
- a processor and a memory, and computer-readable instructions are stored on the memory, and the computer-readable instructions implement the method in any of the above method embodiments when executed by the processor.
- A computer-readable non-volatile storage medium is also provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the method in any of the above method embodiments is implemented.
Claims (12)
- A control method for image acquisition, applied to an acquisition terminal, characterized in that the method comprises: performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating, according to the collected audio, the position in space of the speaker corresponding to the audio; adjusting a camera in the acquisition terminal according to the located position, wherein after the adjustment the speaker corresponding to the audio is located at the center of the shooting frame of the camera, and the adjustment comprises adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image acquisition through the adjusted camera to obtain an image of the speaker corresponding to the audio.
- The method according to claim 1, characterized in that performing voiceprint recognition on the audio and determining through the voiceprint recognition whether the speaker has changed comprises: extracting a voiceprint feature from the audio; calculating the voiceprint similarity of the extracted voiceprint feature relative to the voiceprint feature corresponding to the previously collected audio; and determining whether the speaker has changed according to the voiceprint similarity.
- The method according to claim 1, characterized in that the acquisition terminal comprises one reference sound collection module and at least three non-reference sound collection modules, and locating, according to the collected audio, the position in space of the speaker corresponding to the audio comprises: calculating, according to the times at which the reference sound collection module and the non-reference sound collection modules respectively collected the audio, the time delay with which each non-reference sound collection module collected the audio relative to the reference sound collection module; and calculating, according to the arrangement positions of the reference sound collection module and the non-reference sound collection modules and the time delays, the position coordinates of the speaker corresponding to the audio.
- The method according to claim 1, characterized in that adjusting the camera in the acquisition terminal according to the located position comprises: determining, according to the located position, the distance and orientation of the speaker corresponding to the audio relative to the camera; and adjusting the focal length of the camera according to the determined distance, and adjusting the shooting angle of the camera according to the determined orientation.
- The method according to claim 1, characterized in that performing image acquisition through the adjusted camera to obtain the image of the speaker corresponding to the audio comprises: performing speaker recognition according to the image collected by the adjusted camera, and locating the portrait of the speaker in the image; and cropping the image according to the located portrait to obtain the image of the speaker.
- The method according to claim 5, characterized in that performing speaker recognition according to the image collected by the adjusted camera and locating the portrait of the speaker in the image comprises: extracting, according to the image collected by the adjusted camera, pixel points of a designated organ for each portrait in the collected image; performing action recognition according to the extracted pixel points, and determining the action represented by the extracted pixel points; and determining the portrait whose represented action matches a predetermined action as the portrait of the speaker.
- The method according to claim 1, characterized in that after performing image acquisition through the adjusted camera to obtain the image of the speaker corresponding to the audio, the method further comprises: replacing the image displayed by the acquisition terminal with the image of the speaker.
- The method according to claim 1, characterized in that the method further comprises: detecting whether audio has still not been collected after a set period of time; if so, controlling the camera to rotate to a preset shooting angle; and if not, performing the step of performing voiceprint recognition on the collected audio.
- A control method for image acquisition, applied to an acquisition terminal, characterized in that the method comprises: performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating, according to the collected audio, the position in space of the speaker corresponding to the audio; adjusting the focal length of a camera in the acquisition terminal according to the located position, so that the speaker corresponding to the audio is located at the focal position of the camera; and performing image acquisition through the adjusted camera to obtain an image of the speaker corresponding to the audio.
- A control method for image acquisition, applied to an acquisition terminal, characterized in that the method comprises: performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating, according to the collected audio, the position in space of the speaker corresponding to the audio; adjusting a camera in the acquisition terminal according to the located position, so that the speaker corresponding to the audio is located within the frame captured by the camera and at the focal position of the camera, the adjustment comprising adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; and performing image acquisition through the adjusted camera to obtain an image of the speaker corresponding to the audio.
- A control method for image acquisition, applied to an acquisition terminal, characterized in that the method comprises: performing voiceprint recognition on collected audio, and determining through the voiceprint recognition whether the speaker has changed; if the speaker has changed, locating, according to the collected audio, the position in space of the speaker corresponding to the audio; adjusting a camera in the acquisition terminal according to the located position, so that the speaker corresponding to the audio is located within the frame captured by the camera and at the focal position of the camera, the adjustment comprising adjusting the shooting angle of the camera and/or adjusting the focal length of the camera; performing image acquisition through the adjusted camera, and performing speaker recognition in the image collected by the camera so as to locate the portrait of the speaker in the image; cropping the image according to the located portrait to obtain the image of the speaker corresponding to the audio; and outputting the image of the speaker corresponding to the audio on a display.
- An acquisition terminal, characterized by comprising: a processor; and a memory having computer-readable instructions stored thereon, which, when executed by the processor, implement the method according to any one of claims 1 to 11.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910746092.0A CN110505399A (zh) | 2019-08-13 | 2019-08-13 | Control method and device for image acquisition, and acquisition terminal |
CN201910746092.0 | 2019-08-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021027424A1 (zh) | 2021-02-18 |
Family
ID=68587511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/099455 WO2021027424A1 (zh) | 2019-08-13 | 2020-06-30 | Control method for image acquisition and acquisition terminal |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110505399A (zh) |
WO (1) | WO2021027424A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113682319A (zh) * | 2021-08-05 | 2021-11-23 | 地平线(上海)人工智能技术有限公司 | Camera adjustment method and device, electronic device, and storage medium |
CN114554095A (zh) * | 2022-02-25 | 2022-05-27 | 深圳锐取信息技术股份有限公司 | Target object determination method for a 4K camera and related device |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110505399A (zh) * | 2019-08-13 | 2019-11-26 | 聚好看科技股份有限公司 | Control method and device for image acquisition, and acquisition terminal |
CN113556499B (zh) * | 2020-04-07 | 2023-05-09 | 上海汽车集团股份有限公司 | In-vehicle video call method and in-vehicle system |
CN111586341A (zh) * | 2020-05-20 | 2020-08-25 | 深圳随锐云网科技有限公司 | Shooting method for a video-conference shooting device and picture display method |
CN111901524B (zh) * | 2020-07-22 | 2022-04-26 | 维沃移动通信有限公司 | Focusing method, device, and electronic apparatus |
CN112073639A (zh) * | 2020-09-11 | 2020-12-11 | Oppo(重庆)智能科技有限公司 | Shooting control method and device, computer-readable medium, and electronic apparatus |
CN112312042A (zh) * | 2020-10-30 | 2021-02-02 | 维沃移动通信有限公司 | Display control method and device, electronic apparatus, and storage medium |
CN112541402A (zh) * | 2020-11-20 | 2021-03-23 | 北京搜狗科技发展有限公司 | Data processing method, device, and electronic apparatus |
TWI798867B (zh) * | 2021-06-27 | 2023-04-11 | 瑞昱半導體股份有限公司 | Video processing method and related system-on-chip |
CN113542604A (zh) * | 2021-07-12 | 2021-10-22 | 口碑(上海)信息技术有限公司 | Video focusing method and device |
CN113824916A (zh) * | 2021-08-19 | 2021-12-21 | 深圳壹秘科技有限公司 | Image display method, device, equipment, and storage medium |
CN115242971A (zh) * | 2022-06-21 | 2022-10-25 | 海南视联通信技术有限公司 | Camera control method, device, terminal equipment, and storage medium |
CN117640877B (zh) * | 2024-01-24 | 2024-03-29 | 浙江华创视讯科技有限公司 | Picture reconstruction method for online meetings and electronic device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107144820A (zh) * | 2017-06-21 | 2017-09-08 | 歌尔股份有限公司 | Sound source localization method and device |
US20190215636A1 (en) * | 2017-05-24 | 2019-07-11 | Glen A. Norris | User Experience Localizing Binaural Sound During a Telephone Call |
CN110082723A (zh) * | 2019-05-16 | 2019-08-02 | 浙江大华技术股份有限公司 | Sound source localization method, device, equipment, and storage medium |
CN110505399A (zh) * | 2019-08-13 | 2019-11-26 | 聚好看科技股份有限公司 | Control method and device for image acquisition, and acquisition terminal |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100505837C (zh) * | 2007-05-10 | 2009-06-24 | 华为技术有限公司 | System and method for controlling an image acquisition device to perform target localization |
CN104902203A (zh) * | 2015-05-19 | 2015-09-09 | 广东欧珀移动通信有限公司 | Video recording method based on a rotating camera, and terminal |
CN104991573A (zh) * | 2015-06-25 | 2015-10-21 | 北京品创汇通科技有限公司 | Localization and tracking method based on a sound source array, and device therefor |
CN107247923A (zh) * | 2017-05-18 | 2017-10-13 | 珠海格力电器股份有限公司 | Instruction recognition method, device, storage equipment, mobile terminal, and electrical appliance |
CN109754811B (zh) * | 2018-12-10 | 2023-06-02 | 平安科技(深圳)有限公司 | Biometric-feature-based sound source tracking method, device, equipment, and storage medium |
CN109783642A (zh) * | 2019-01-09 | 2019-05-21 | 上海极链网络科技有限公司 | Structured content processing method, device, equipment, and medium for multi-person conference scenes |
- 2019-08-13: CN application CN201910746092.0A filed (publication CN110505399A, status: pending)
- 2020-06-30: PCT application PCT/CN2020/099455 filed (publication WO2021027424A1, status: application filing)
Also Published As
Publication number | Publication date |
---|---|
CN110505399A (zh) | 2019-11-26 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20853339; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20853339; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29-11-22) |