WO2021217897A1 - Positioning method, terminal device and conference system - Google Patents

Positioning method, terminal device and conference system

Info

Publication number
WO2021217897A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
image
scene
audio signal
currently located
Prior art date
Application number
PCT/CN2020/102299
Other languages
English (en)
French (fr)
Inventor
林瑞成
霍澄平
Original Assignee
深圳市鸿合创新信息技术有限责任公司
Priority date
Filing date
Publication date
Application filed by 深圳市鸿合创新信息技术有限责任公司
Publication of WO2021217897A1

Links

Images

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/61: Control of cameras or camera modules based on recognised objects
    • H04N23/611: Control based on recognised objects that include parts of the human body
    • H04N23/67: Focus control based on electronic image sensor signals
    • H04N23/695: Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects

Definitions

  • This application belongs to the field of data processing technology, and in particular relates to a positioning method, terminal equipment and conference system.
  • During meetings, and especially video conferences, the camera needs to be pointed at the speaker to collect the speaker's image information.
  • At present this is usually done manually: the operator must first learn the speaker's position and then repeatedly adjust the lens using the camera's zoom and movement functions until it finally lands on the speaker. This process is cumbersome and time-consuming, cannot locate the speaker quickly, and results in a poor user experience.
  • In view of this, the present application provides a positioning method, a terminal device and a conference system to solve the slow positioning of existing approaches.
  • A first aspect of the present application provides a positioning method, which includes: acquiring an audio signal; acquiring, according to the audio signal, a target face image of the user who emitted the audio signal; acquiring an image of the scene where the user is currently located; searching that image for the target face image and acquiring the position of the target face image in it; and acquiring, according to that position, the position of the user in the current scene.
  • A second aspect of the present application provides a positioning device, which includes:
  • an audio signal acquisition module, configured to acquire an audio signal;
  • a target face image acquisition module, configured to acquire, according to the audio signal, the target face image of the user who emitted the audio signal;
  • a scene image acquisition module, configured to acquire an image of the scene where the user is currently located;
  • a first position acquisition module, configured to search the image of the scene where the user is currently located for the target face image and acquire the position of the target face image in that image; and
  • a second position acquisition module, configured to acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
  • A third aspect of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the positioning method provided in the first aspect of the embodiments of this application are implemented.
  • A fourth aspect of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the positioning method provided in the first aspect of the embodiments of this application.
  • A fifth aspect of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the positioning method provided in the first aspect of the embodiments of this application.
  • A sixth aspect of the present application provides a conference system, which includes an audio collection device, an image collection device, and the terminal device provided in the third aspect, where both the audio collection device and the image collection device are electrically connected to the terminal device.
  • According to the aspects of the present application, the target face image of the user who emitted an audio signal can be obtained from the acquired audio signal; an image containing the scene where the user is currently located is then obtained; and finally the user's position in the current scene is derived from the position of the target face image in that scene image.
  • The positioning method provided in this application integrates audio recognition and face recognition: it processes and analyzes the acquired audio and image signals and locates the user automatically, so the user can be identified as soon as those signals are acquired.
  • Compared with existing manual positioning, the aspects of the present application have at least the following beneficial effects: the positioning process is relatively simple, recognition is fast, the target is located and recognized quickly, and the user experience is improved.
  • Fig. 1 is a schematic flowchart of an embodiment of the positioning method provided according to the present application.
  • Fig. 2 is a schematic flowchart of another embodiment of the positioning method provided according to the present application.
  • Fig. 3 is a schematic structural diagram of an embodiment of the positioning device provided according to the present application.
  • Fig. 4 is a schematic structural diagram of an embodiment of the terminal device provided according to the present application.
  • Fig. 5 is a schematic structural diagram of an embodiment of the conference system provided according to the present application.
  • Fig. 1 shows a schematic flowchart of an implementation manner of the positioning method provided according to the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
  • the positioning method 100 includes:
  • Step S101: Obtain an audio signal;
  • Step S102: Acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
  • Step S103: Obtain an image of the scene where the user is currently located;
  • Step S104: Search the image of the scene where the user is currently located for the target face image, and obtain the position of the target face image in that image; and
  • Step S105: Acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
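  • Taken together, steps S101 to S105 form a single control flow. The sketch below is a minimal illustration of that flow in Python; every helper function here is a hypothetical placeholder (stubbed out), not anything named by this application:

```python
# A minimal sketch of the S101-S105 control flow. All helpers are
# hypothetical placeholders; a real system would back them with a
# microphone driver, a camera driver, and recognition models.

def capture_audio(): ...                  # S101: obtain an audio signal
def face_for_voiceprint(audio): ...       # S102: target face image via voiceprint
def capture_scene(): ...                  # S103: image of the current scene
def locate_face(scene, target_face): ...  # S104: position of the face in the image
def image_to_world(pos_in_image): ...     # S105: convert to a position in the scene

def locate_speaker():
    audio = capture_audio()
    target_face = face_for_voiceprint(audio)
    if target_face is None:
        return None      # unregistered voiceprint: re-acquire audio, as described below
    scene = capture_scene()
    pos_in_image = locate_face(scene, target_face)
    if pos_in_image is None:
        return None      # face not found in the scene: re-acquire audio
    return image_to_world(pos_in_image)
```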
  • the positioning method of this application can be applied to a variety of different scenarios.
  • the positioning method of the present application can be applied in the daily teaching process in the classroom to locate the user's position, such as the position of the teacher giving a lecture or the student answering a question.
  • This method can also be applied to conference halls or conference rooms to locate the user's location, such as the location of the speaker.
  • In this embodiment, a conference room is taken as the example application scenario of the positioning method.
  • In some embodiments, step S101 specifically includes collecting the audio signal through an audio collection device.
  • There are many suitable examples of audio collection devices, which in practice go by many names: microphone, mic (MIC), pickup, and so on.
  • In some embodiments, step S102 specifically includes: acquiring, according to the acquired audio signal, the voiceprint of the user who emitted it; and then acquiring, according to that voiceprint, the target face image of the user who emitted the audio signal.
  • In specific embodiments, a correspondence between different voiceprints and different face images is established in advance, so that the target face image of the user who emitted the audio signal can be obtained from that user's voiceprint.
  • For example, the pre-established correspondence may contain at least two pairs of distinct voiceprints and face images, though the exact number of pairs can be set as needed.
  • As a specific example, even if the correspondence contains only a single voiceprint-face pair, the subsequent recognition process can still be carried out.
  • In some embodiments, establishing the correspondence between different voiceprints and different face images may include: registering each user; recording the user's face image and voiceprint information; and establishing the correspondence between that user's voiceprint and face image. Recording a user's face image and voiceprint can be done simply by having the user face the camera, be photographed, and speak, so the enrollment process is very simple. After any user's audio signal is received, the voiceprint of that signal is extracted and paired with the user's collected face image, thereby establishing the correspondence between the user's voiceprint and face image.
  • To improve recognition accuracy, the recorded audio can be representative sentences or words, i.e., expressions likely to occur; different scenes have different representative phrases. In a meeting scenario, for example, representative phrases include "Hello everyone" and "Let me say a few words." In this way, the correspondence between different voiceprints and different face images is finally established.
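  • As one possible illustration of this registration step, the sketch below keeps the correspondence as an in-memory list of (voiceprint embedding, face image) pairs. The embedding function is a placeholder assumption; the application does not prescribe any particular voiceprint model:

```python
import numpy as np

def extract_voiceprint(audio_samples: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would run a speaker-embedding model here.
    # A stable-within-process pseudo-embedding keeps the sketch runnable.
    rng = np.random.default_rng(abs(hash(audio_samples.tobytes())) % 2**32)
    return rng.standard_normal(128)

# The pre-established correspondence between voiceprints and face images.
voiceprint_to_face: list[tuple[np.ndarray, np.ndarray]] = []

def register_user(enrollment_audio: np.ndarray, face_image: np.ndarray) -> None:
    # Pair the voiceprint extracted from the enrollment utterance (e.g.
    # "Hello everyone") with the face image captured at registration time.
    voiceprint_to_face.append((extract_voiceprint(enrollment_audio), face_image))
```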
  • In some embodiments, in step S102, after the audio signal is obtained, a voiceprint recognition algorithm identifies the voiceprint of the signal, i.e., the voiceprint of the user who emitted it. Then, from the established correspondence between different voiceprints and different face images, the face image corresponding to that voiceprint is obtained; this face image is the target face image of the user who emitted the audio signal.
  • To improve recognition accuracy, the audio signal emitted by the user can match the sentences or words recorded when the correspondence was established; in a meeting scenario, for example, the user may say "Hello everyone" or "Let me say a few words."
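  • Continuing the sketch above, looking up the target face image from a new audio signal could then be a nearest-voiceprint search; the cosine-similarity metric and the 0.75 threshold are illustrative assumptions, not values from the source:

```python
def find_target_face(audio_samples: np.ndarray, threshold: float = 0.75):
    # Compare the query voiceprint against every enrolled voiceprint and
    # keep the best match whose similarity exceeds the threshold.
    query = extract_voiceprint(audio_samples)
    best_face, best_sim = None, threshold
    for enrolled, face in voiceprint_to_face:
        sim = float(np.dot(query, enrolled) /
                    (np.linalg.norm(query) * np.linalg.norm(enrolled)))
        if sim > best_sim:
            best_face, best_sim = face, sim
    return best_face  # None means: re-acquire the audio signal, as described below
```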
  • In some embodiments, step S102 includes: if no face image corresponding to the voiceprint of the user who emitted the audio signal is obtained from the correspondence between different voiceprints and different face images, re-acquiring the audio signal until the correspondence yields the face image corresponding to the voiceprint of the user who emitted the re-acquired signal.
  • In other words, if the voiceprint of one of the user's audio signals has no matching face image in the correspondence, the audio signal can be re-acquired so that the user's position in the current scene can still be located accurately; if the user is still speaking, the re-acquired signal is simply another audio signal from the same user.
  • Then, the voiceprint of the user who emitted the re-acquired audio signal is obtained, and the correspondence between different voiceprints and different face images is checked for a face image matching that voiceprint, i.e., the target face image. If the target face image exists in the correspondence, steps S104 and S105 are executed; if it does not, the audio signal is acquired yet again, until the correspondence yields the face image corresponding to the voiceprint of the user who emitted the re-acquired signal.
  • In the embodiments of the present application, to improve the accuracy of obtaining the target face image, when no face image matching the user's voiceprint has been found, the user's audio signal can be acquired in a loop until the target face image is obtained, so that the user can then be located.
  • To improve data-processing efficiency, if no face image matching the voiceprint of the user who emitted the audio signal is found in the correspondence, the user is considered unregistered; the signal may then be ignored, or the next user's audio signal may be acquired after a preset period of time.
  • In some embodiments, in step S103, an image of the scene where the user is currently located can be acquired through an image acquisition device such as a camera. Since the user's position is not yet known, the image acquired at this point is an image of the whole scene, i.e., an image containing everyone in the conference room, which may be a panoramic picture of the room.
  • The placement of the image acquisition device (e.g., the camera) is not limited, but it must be positioned where it can effectively capture the faces of everyone in the conference room, for example fixed on the wall in the upper-left or upper-right corner of the room.
  • In some embodiments, step S104 may include: acquiring all face images contained in the image of the scene where the user is currently located; and, upon determining that the target face image is among them, acquiring the position of the target face image in that image.
  • A specific example of acquiring all the face images contained in the scene image is: running a face recognition algorithm on the image of the scene where the user is currently located and recognizing all the face images it contains.
  • In some embodiments, step S104 includes: if the target face image is not among all the detected face images, re-acquiring the audio signal until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired signal is located.
  • That is, to locate the user's position in the current scene accurately, if the target face image is absent from all acquired face images, the audio signal can be re-acquired; if the user is still speaking, the re-acquired signal is another audio signal from the same user.
  • Then, the user's target face image is acquired from the re-acquired audio signal, the image of the scene where the user is currently located is re-acquired, all face images contained in it are extracted, and it is checked whether the target face image is among them. If the user's target face image is among all the face images in the scene image, step S105 is executed.
  • Otherwise, the audio signal is re-acquired yet again, until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired signal is located.
  • Through such repeated loops, until the target face image is found in the scene image, accurate positioning of the user in the current scene can be achieved.
  • A specific example of checking whether the target face image is among all the face images is: comparing the target face image with each of the face images.
  • For example, the comparison may be: obtaining a matching value between the target face image and each of the face images, and comparing each matching value with a preset matching threshold. If any matching value is greater than the preset threshold, the target face image is determined to be present among the face images; if no matching value exceeds the threshold, it is determined to be absent.
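  • A sketch of this thresholded comparison is below; match_score() is a stand-in for whatever face-comparison metric produces the matching value (the source does not name one), and the 0.6 threshold is an assumed value:

```python
import numpy as np

def match_score(face_a: np.ndarray, face_b: np.ndarray) -> float:
    # Stand-in metric: normalized correlation of two equally sized face crops.
    a = (face_a - face_a.mean()) / (face_a.std() + 1e-8)
    b = (face_b - face_b.mean()) / (face_b.std() + 1e-8)
    return float((a * b).mean())

def find_target_in_scene(target_face, detected_faces, threshold=0.6):
    # detected_faces: (face_crop, (x, y)) pairs from a face detector run on
    # the scene image. Return the image position of the best match, or None
    # if no matching value exceeds the preset threshold.
    best_pos, best_score = None, threshold
    for crop, position in detected_faces:
        score = match_score(target_face, crop)
        if score > best_score:
            best_pos, best_score = position, score
    return best_pos  # None: re-acquire the audio signal, as described above
```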
  • Step S105: Acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
  • In some embodiments, S105 may include: obtaining the coordinates of the position of the target face image in the image of the scene where the user is currently located; and performing coordinate conversion on those coordinates to obtain the user's coordinates, i.e., the user's position, in the current scene.
  • The coordinates of the target face image's position in the scene image may be called the image coordinates of the target face (image coordinates are two-dimensional and usually include an X coordinate and a Y coordinate).
  • The coordinate conversion above may specifically mean taking a preset distance between the user and the camera (i.e., the camera that captured the scene image) as the Z coordinate accompanying the image coordinates, which yields the three-dimensional coordinates of the target face in the camera coordinate system.
  • Using the transformation matrix between the camera coordinate system and the world coordinate system, the camera coordinates can be converted into world coordinates, giving the three-dimensional coordinates of the target face in the world coordinate system.
  • The three-dimensional coordinates of the target face in the world coordinate system are the user's position in the current scene.
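  • This conversion can be written down as a standard pinhole back-projection. In the sketch below, the intrinsics, the preset user-to-camera distance, and the camera-to-world rotation R and translation t are all assumed calibration values for illustration:

```python
import numpy as np

fx, fy, cx, cy = 800.0, 800.0, 640.0, 360.0  # assumed camera intrinsics (pixels)
R = np.eye(3)                                 # assumed camera-to-world rotation
t = np.zeros(3)                               # assumed camera position in the world

def image_to_world(u: float, v: float, z: float = 3.0) -> np.ndarray:
    # Back-project the face's image coordinates (u, v) using the preset
    # user-to-camera distance z as the depth, giving camera coordinates,
    # then transform into the world coordinate system.
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
    return R @ p_cam + t

# e.g. image_to_world(700.0, 400.0) -> the user's 3D position in the scene
```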
  • If the target face image is not among all the face images, there are two ways to proceed: the first is to acquire the audio signal again and execute steps S102, S103, S104 and S105 once more; the second is to issue a prompt asking the user to record the target face image and the voiceprint of the corresponding audio signal, adding them to the correspondence between different voiceprints and different face images described above.
  • The positioning method provided by this application integrates audio recognition and face recognition: it processes and analyzes the acquired audio and image signals and detects the user automatically. Compared with manual positioning, the user can be identified as soon as the audio and image signals are acquired; the process is relatively simple, recognition is fast, the target is located and recognized quickly, and the user experience is improved.
  • Fig. 2 shows a schematic flowchart of another implementation manner of the positioning method provided according to the present application. For ease of description, only the parts related to the embodiment of the present application are shown.
  • the positioning method 200 includes:
  • Step S201: Obtain an audio signal;
  • Step S202: Acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
  • Step S203: Obtain an image of the scene where the user is currently located;
  • Step S204: Search the image of the scene where the user is currently located for the target face image, and obtain the position of the target face image in that image;
  • Step S205: Acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene; and
  • Step S206: According to the user's position in the current scene, output a control instruction to the image acquisition device, where the control instruction instructs the image acquisition device to acquire an image of the user's location; the image of the user's location includes the user's face image, and its size is greater than or equal to a preset size.
  • Step S201 of the positioning method 200 may be implemented by any embodiment of step S101 of the positioning method 100 above; step S202 by any embodiment of step S102; step S203 by any embodiment of step S103; step S204 by any embodiment of step S104; and step S205 by any embodiment of step S105. These are not repeated here.
  • In some embodiments, the image capture device may be a camera. After the user's position in the current scene is identified, a control instruction is output to the camera according to that position.
  • The control instruction directs the camera's action, for example instructing it to zoom and center the shot on the user, acquiring an image of the user's location that includes the user's face image; the size of that image is greater than or equal to a preset size, where the preset size is the size after the image of the user's location is enlarged. The acquired image can therefore be understood as a close-up of the user, an enlarged image of the user's location; since it includes the user's face image, enlarging it also enlarges the user's face.
  • If the camera can move, for example if its position and shooting angle are adjustable via a movable mount, the control instruction directs the camera to zoom and move, shortening the distance between the camera and the user, and an image of the user's location is acquired.
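  • A sketch of such a control instruction is below. The command fields, the angle computation, and the transport are assumptions for illustration; real PTZ cameras expose their own protocols (e.g., VISCA or ONVIF) with different message formats:

```python
import math
from dataclasses import dataclass

@dataclass
class PTZCommand:
    pan_deg: float   # horizontal angle toward the user
    tilt_deg: float  # vertical angle toward the user
    zoom: float      # zoom factor chosen so the face meets the preset size

def command_for(user_pos, camera_pos=(0.0, 0.0, 0.0), zoom=3.0) -> PTZCommand:
    # Point the optical axis at the user's world position and zoom in.
    dx, dy, dz = (u - c for u, c in zip(user_pos, camera_pos))
    pan = math.degrees(math.atan2(dx, dz))
    tilt = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return PTZCommand(pan, tilt, zoom)

# e.g. command_for((1.2, 0.3, 4.0)) yields roughly PTZCommand(16.7, 4.1, 3.0)
```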
  • Moreover, two cameras can be provided: the first captures the image of the scene where the user is currently located, containing everyone in the conference room, i.e., it implements step S203; the second can move and captures the image of the user's location, i.e., it implements step S206.
  • The positioning method provided in this application greatly simplifies and speeds up the operation of enlarging and displaying the speaker's picture in a video-conference scenario, so that participants can concentrate on the conference itself instead of fussing over the equipment. It saves participants' valuable time and energy and has significant social and economic benefits.
  • FIG. 3 is a schematic structural diagram of an embodiment of a positioning device provided according to the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
  • the positioning device 300 includes:
  • the audio signal acquisition module 301 is used to acquire audio signals
  • the target facial image acquisition module 302 is configured to acquire the target facial image of the user who sent the audio signal according to the audio signal;
  • the scene image acquisition module 303 is configured to acquire an image of the scene where the user is currently located;
  • the first position obtaining module 304 is configured to search for the target face image from the image of the scene where the user is currently located, and obtain the position of the target face image in the image of the scene where the user is currently located;
  • the second position obtaining module 305 is configured to obtain the position of the user in the scene where the user is currently located according to the position of the target face image in the image of the scene where the user is currently located.
  • the target facial image acquisition module 302 includes:
  • a voiceprint acquiring unit configured to acquire, according to the audio signal, the voiceprint of the user who sent the audio signal
  • the target facial image acquiring unit is configured to acquire the target facial image of the user who sent the audio signal according to the voiceprint of the user who sent the audio signal.
  • the positioning device 300 further includes:
  • a correspondence establishment module, configured to establish the correspondence between different voiceprints and different face images.
  • Correspondingly, the target face image acquisition unit is specifically configured to: obtain, from the correspondence between different voiceprints and different face images, the face image corresponding to the voiceprint of the user who emitted the audio signal, that face image being the target face image of the user who emitted the audio signal.
  • Optionally, the target face image acquisition unit is configured to: if no face image corresponding to the voiceprint of the user who emitted the audio signal is obtained from the correspondence, re-acquire the audio signal until the correspondence yields the face image corresponding to the voiceprint of the user who emitted the re-acquired audio signal.
  • the first location obtaining module 304 includes:
  • an all-face-images acquisition unit, configured to acquire all face images contained in the image of the scene where the user is currently located;
  • a position acquisition unit, configured to, upon determining that the target face image is among all the face images, acquire the position of the target face image in the image of the scene where the user is currently located.
  • Optionally, the position acquisition unit is configured to: if the target face image is not among all the face images, re-acquire the audio signal until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired audio signal is located.
  • the second position obtaining module 305 is specifically configured to: obtain the coordinates of the position of the target face image in the image of the scene where the user is currently located, and perform coordinate conversion on those coordinates to obtain the user's position in the current scene.
  • the positioning device 300 further includes:
  • the control instruction output module is configured to output a control instruction to an image acquisition device according to the position of the user in the current scene.
  • the control instruction is used to instruct the image acquisition device to acquire an image of the location of the user.
  • the image of the location of the user includes the face image of the user, and the size of the image of the location of the user is greater than or equal to a preset size.
  • Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional modules is only an example; in practical applications, the above functions can be allocated to different functional modules as required, i.e., the internal structure of the positioning device 300 can be divided into different functional modules to complete all or part of the functions described above.
  • The functional modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units.
  • the specific names of the functional modules are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the present application. For the specific working process of each functional module in the above, reference may be made to the corresponding process in the foregoing positioning method embodiment, which will not be repeated here.
  • Fig. 4 is a schematic structural diagram of an embodiment of a terminal device provided according to the present application.
  • the terminal device 400 includes a processor 402, a memory 401, and a computer program 403 that is stored in the memory 401 and can run on the processor 402.
  • the number of processors 402 is at least one, and FIG. 4 takes one as an example.
  • the processor 402 executes the computer program 403, the implementation steps of the above positioning method are implemented, that is, the steps shown in FIG. 1 or FIG. 2.
  • the computer program 403 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 401 and executed by the processor 402 to complete the application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 403 in the terminal device 400.
  • The terminal device 400 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a master controller; a device with image acquisition and data processing functions such as a camera or a mobile phone; or a touch display device.
  • the terminal device 400 may include, but is not limited to, a processor and a memory.
  • FIG. 4 is only an example of the terminal device 400 and does not constitute a limitation on it; the device may include more or fewer components than shown, or combine certain components, or use different components.
  • the terminal device 400 may also include input and output devices, network access devices, buses, and the like.
  • the processor 402 may be a CPU (Central Processing Unit, central processing unit), other general-purpose processors, DSP (Digital Signal Processor, digital signal processor), ASIC (Application Specific Integrated Circuit, application specific integrated circuit), FPGA ( Field-Programmable Gate Array) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 401 may be an internal storage unit of the terminal device 400, such as a hard disk or a memory.
  • The memory 401 may also be an external storage device of the terminal device 400, such as a plug-in hard disk, an SMC (Smart Media Card), an SD (Secure Digital) card, or a flash card. Further, the memory 401 may include both an internal storage unit of the terminal device 400 and an external storage device.
  • the memory 401 is used to store an operating system, an application program, a boot loader, data, and other programs, for example, the program code of the computer program 403 and the like.
  • the memory 401 can also be used to temporarily store data that has been output or will be output.
  • This application also provides a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the positioning method provided in the first aspect of this application and the steps in any embodiment of the positioning method.
  • Fig. 5 is a schematic structural diagram of an implementation manner of a conference system provided according to the present application.
  • the conference system includes: an audio collection device 501, an image collection device 502, and a terminal device 503. Both the audio collection device 501 and the image collection device 502 are electrically connected to the terminal device 503.
  • the audio collection device 501 is used to collect audio signals.
  • There are many suitable examples of the audio collection device 501, which in practice goes by many names: microphone, mic (MIC), pickup, and so on.
  • The image collection device 502 is used to collect image information; examples include video cameras, webcams, and other devices capable of capturing images. As described above, the image collection device 502 may be one camera or may include two cameras.
  • The terminal device 503 performs data processing according to the audio signal collected by the audio collection device 501 and the image information collected by the image collection device 502. The positioning method corresponding to the software program in the terminal device 503 has been described in detail in the terminal device and positioning method embodiments above and is not repeated in this embodiment.
  • The present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the positioning method provided in the first aspect of the present application and of any of its embodiments.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the above positioning method embodiments by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the positioning method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form.
  • The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, computer memory, ROM (Read-Only Memory), RAM (Random Access Memory), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a floppy disk or a CD-ROM.
  • In some jurisdictions, under legislation and patent practice, computer-readable media refers to non-transitory computer-readable storage media and therefore excludes electrical carrier signals and telecommunication signals.
  • the disclosed device/terminal device and method may be implemented in other ways.
  • the device/terminal device embodiments described above are merely illustrative.
  • the division into modules or units is only a logical functional division; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A positioning method (100, 200), a terminal device (400, 503) and a conference system, applicable to the field of data processing technology. The positioning method (100, 200) includes: acquiring an audio signal (S101, S201); acquiring, according to the audio signal, a target face image of the user who emitted the audio signal (S102, S202); acquiring an image of the scene where the user is currently located (S103, S203); searching the image of the scene where the user is currently located for the target face image, and acquiring the position of the target face image in that image (S104, S204); and acquiring, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene (S105, S205). The positioning method (100, 200) integrates audio recognition and face recognition: it processes and analyzes the acquired audio and image signals, so the user can be identified as soon as those signals are acquired. The process is relatively simple and recognition is fast, achieving rapid positioning and recognition of the target and improving the user experience.

Description

Positioning method, terminal device and conference system
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202010347463.0, entitled "Positioning method, terminal device and conference system" and filed on April 28, 2020, the entire contents of which are incorporated herein by reference.
Technical field
This application belongs to the field of data processing technology, and in particular relates to a positioning method, a terminal device and a conference system.
Background
During meetings, especially video conferences, the camera needs to be pointed at the speaker to capture the speaker's image information. At present this is usually done manually: the operator must first learn the speaker's position and then repeatedly adjust the lens using the camera's zoom and movement functions until it finally lands on the speaker. This process is cumbersome and time-consuming, cannot locate the speaker quickly, and gives a poor user experience.
Summary
In view of this, the present application provides a positioning method, a terminal device and a conference system to solve the slow positioning of existing approaches.
A first aspect of the present application provides a positioning method, including:
acquiring an audio signal;
acquiring, according to the audio signal, a target face image of the user who emitted the audio signal;
acquiring an image of the scene where the user is currently located;
searching the image of the scene where the user is currently located for the target face image, and acquiring the position of the target face image in that image; and
acquiring, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
A second aspect of the present application provides a positioning apparatus, including:
an audio signal acquisition module, configured to acquire an audio signal;
a target face image acquisition module, configured to acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
a scene image acquisition module, configured to acquire an image of the scene where the user is currently located;
a first position acquisition module, configured to search the image of the scene where the user is currently located for the target face image and acquire the position of the target face image in that image; and
a second position acquisition module, configured to acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
A third aspect of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the positioning method provided in the first aspect of the embodiments of this application are implemented.
A fourth aspect of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the positioning method provided in the first aspect of the embodiments of this application.
A fifth aspect of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the positioning method provided in the first aspect of the embodiments of this application.
A sixth aspect of the present application provides a conference system, including:
an audio collection device;
an image collection device; and
the terminal device provided in the third aspect of the embodiments of this application,
where both the audio collection device and the image collection device are electrically connected to the terminal device.
According to the aspects of the present application, the target face image of the user who emitted an audio signal can be obtained from the acquired audio signal; an image containing the scene where the user is currently located is then obtained; and finally the user's position in the current scene is derived from the position of the target face image in that scene image. The positioning method provided in this application integrates audio recognition and face recognition: it processes and analyzes the acquired audio and image signals and locates the user automatically, so the user can be identified as soon as those signals are acquired. Compared with existing manual positioning, the aspects of the present application have at least the following beneficial effects: the positioning process is relatively simple, recognition is fast, the target is located and recognized quickly, and the user experience is improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of the positioning method provided according to this application;
Fig. 2 is a schematic flowchart of another embodiment of the positioning method provided according to this application;
Fig. 3 is a schematic structural diagram of an embodiment of the positioning apparatus provided according to this application;
Fig. 4 is a schematic structural diagram of an embodiment of the terminal device provided according to this application;
Fig. 5 is a schematic structural diagram of an embodiment of the conference system provided according to this application.
Detailed description
In the following description, specific details such as particular system structures and technologies are set forth for illustration rather than limitation, to provide a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary detail does not obscure this description.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the term "and/or" used in this specification and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
It should also be understood that the terms used in this specification are only for the purpose of describing particular embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the order in which the steps are written in the embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure or characteristic described in connection with that embodiment is included in one or more embodiments of this application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments" and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "comprising", "including", "having" and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
To illustrate the technical solutions described in this application, specific embodiments are described below.
Fig. 1 shows a schematic flowchart of one implementation of the positioning method provided according to this application. For ease of description, only the parts related to the embodiments of this application are shown.
As shown in Fig. 1, the positioning method 100 includes:
Step S101: acquire an audio signal;
Step S102: acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
Step S103: acquire an image of the scene where the user is currently located;
Step S104: search the image of the scene where the user is currently located for the target face image, and acquire the position of the target face image in that image; and
Step S105: acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
The positioning method of this application can be applied in a variety of scenarios. As an example, it can be used in daily classroom teaching to locate a user's position, such as that of the teacher lecturing or a student answering a question. It can also be applied in a conference hall or conference room to locate a user such as the speaker. In this embodiment, a conference room is taken as the example scenario.
In some embodiments, step S101 specifically includes collecting the audio signal through an audio collection device. There are many suitable examples of audio collection devices, which in practice go by many names: microphone, mic (MIC), pickup, and so on.
In some embodiments, step S102 specifically includes: acquiring, according to the acquired audio signal, the voiceprint of the user who emitted it; and then acquiring, according to that voiceprint, the target face image of the user who emitted the audio signal.
In specific embodiments, a correspondence between different voiceprints and different face images is established in advance, so that the target face image of the user who emitted the audio signal can be obtained from that user's voiceprint. For example, the pre-established correspondence may contain at least two pairs of distinct voiceprints and face images, though the exact number of pairs can be set as needed. As a specific example, even if the correspondence contains only a single voiceprint-face pair, the subsequent recognition process can still be carried out.
In some embodiments, establishing the correspondence between different voiceprints and different face images may include: registering each user; recording the user's face image and voiceprint information; and establishing the correspondence between that user's voiceprint and face image. Recording a user's face image and voiceprint can be done simply by having the user face the camera, be photographed, and speak, so the enrollment process is very simple. After any user's audio signal is received, the voiceprint of that signal is extracted and paired with the user's collected face image, thereby establishing the correspondence between the user's voiceprint and face image. To improve recognition accuracy, the recorded audio can be representative sentences or words, i.e., expressions likely to occur; different scenes have different representative phrases. In a meeting scenario, for example, representative phrases include "Hello everyone" and "Let me say a few words." In this way, the correspondence between different voiceprints and different face images is finally established.
In some embodiments, in step S102, after the audio signal is obtained, a voiceprint recognition algorithm identifies its voiceprint, i.e., the voiceprint of the user who emitted it. Then, from the established correspondence between different voiceprints and different face images, the face image corresponding to that voiceprint is obtained; this face image is the target face image of the user who emitted the audio signal. To improve recognition accuracy, the audio signal emitted by the user can match the sentences or words recorded when the correspondence was established; in a meeting scenario, for example, the user may say "Hello everyone" or "Let me say a few words."
In some embodiments, step S102 includes: if no face image corresponding to the voiceprint of the user who emitted the audio signal is found in the correspondence between different voiceprints and different face images, re-acquiring the audio signal until the correspondence yields the face image corresponding to the voiceprint of the user who emitted the re-acquired signal.
In some embodiments, if the voiceprint of one of the user's audio signals has no matching face image in the correspondence, the audio signal can be re-acquired so that the user's position in the current scene can still be located accurately; if the user is still speaking, the re-acquired signal is simply another audio signal from the same user.
Then, the voiceprint of the user who emitted the re-acquired audio signal is obtained, and the correspondence between different voiceprints and different face images is checked for a face image matching that voiceprint, i.e., the target face image. If the target face image exists in the correspondence, steps S104 and S105 are executed; if it does not, the audio signal is acquired yet again, until the correspondence yields the face image corresponding to the voiceprint of the user who emitted the re-acquired signal.
In the embodiments of this application, to improve the accuracy of obtaining the target face image, when no face image matching the user's voiceprint has been found, the user's audio signal can be acquired in a loop until the target face image is obtained, so that the user can then be located.
In the embodiments of this application, to improve data-processing efficiency, if no face image matching the voiceprint of the user who emitted the audio signal is found in the correspondence, the user is considered unregistered; the signal may then be ignored, or the next user's audio signal may be acquired after a preset period of time.
In some embodiments, in step S103, an image of the scene where the user is currently located can be acquired through an image acquisition device such as a camera. Since the user's position is not yet known, the image acquired at this point is an image of the whole scene, i.e., an image containing everyone in the conference room, which may be a panoramic picture of the room. The placement of the image acquisition device (e.g., the camera) is not limited, but it must be positioned where it can effectively capture the faces of everyone in the conference room, for example fixed on the wall in the upper-left or upper-right corner of the room.
In some embodiments, step S104 may include: acquiring all face images contained in the image of the scene where the user is currently located; and, upon determining that the target face image is among them, acquiring the position of the target face image in that image. In some embodiments, a specific example of acquiring all the face images contained in the scene image is: running a face recognition algorithm on the image of the scene where the user is currently located and recognizing all the face images it contains.
In some embodiments, step S104 includes: if the target face image is not among all the detected face images, re-acquiring the audio signal until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired signal is located.
In some embodiments, to locate the user's position in the current scene accurately, if the target face image is absent from all acquired face images, the audio signal can be re-acquired; if the user is still speaking, the re-acquired signal is another audio signal from the same user.
Then, the user's target face image is acquired from the re-acquired audio signal, the image of the scene where the user is currently located is re-acquired, all face images contained in it are extracted, and it is checked whether the target face image is among them. If the user's target face image is among all the face images in the scene image, step S105 is executed. Otherwise, the audio signal is re-acquired yet again, until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired signal is located.
In the embodiments of this application, through such repeated loops, until the target face image is found in the image of the scene where the user who emitted the re-acquired audio signal is located, accurate positioning of the user in the current scene can be achieved.
In some embodiments, a specific example of checking whether the target face image is among all the face images is: comparing the target face image with each of the face images. For example, the comparison may be: obtaining a matching value between the target face image and each of the face images, and comparing each matching value with a preset matching threshold. If any matching value is greater than the preset threshold, the target face image is determined to be present among the face images; if no matching value exceeds the threshold, it is determined to be absent.
Step S105: acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
In some embodiments of this application, S105 may include: obtaining the coordinates of the position of the target face image in the image of the scene where the user is currently located; and performing coordinate conversion on those coordinates to obtain the user's coordinates, i.e., the user's position, in the current scene.
In some examples, the coordinates of the target face image's position in the scene image may be called the image coordinates of the target face (image coordinates are two-dimensional, usually an X coordinate and a Y coordinate). The coordinate conversion above may specifically mean taking a preset distance between the user and the camera (i.e., the camera that captured the scene image) as the Z coordinate accompanying the image coordinates, which yields the three-dimensional coordinates of the target face in the camera coordinate system. Using the transformation matrix between the camera coordinate system and the world coordinate system, the camera coordinates can be converted into world coordinates, giving the three-dimensional coordinates of the target face in the world coordinate system. Those three-dimensional world coordinates are the user's position in the current scene.
If the target face image is not among all the face images, there are two ways to proceed: the first is to acquire the audio signal again and execute steps S102, S103, S104 and S105 once more; the second is to issue a prompt asking the user to record the target face image and the voiceprint of the corresponding audio signal, adding them to the correspondence between different voiceprints and different face images described above.
The positioning method provided by this application integrates audio recognition and face recognition: it processes and analyzes the acquired audio and image signals and detects the user automatically. Compared with manual positioning, the user can be identified as soon as the audio and image signals are acquired; the process is relatively simple, recognition is fast, the target is located and recognized quickly, and the user experience is improved.
To illustrate the technical solutions described in this application, specific embodiments are described below.
Fig. 2 shows a schematic flowchart of another implementation of the positioning method provided according to this application. For ease of description, only the parts related to the embodiments of this application are shown.
As shown in Fig. 2, the positioning method 200 includes:
Step S201: acquire an audio signal;
Step S202: acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
Step S203: acquire an image of the scene where the user is currently located;
Step S204: search the image of the scene where the user is currently located for the target face image, and acquire the position of the target face image in that image;
Step S205: acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene; and
Step S206: output, according to the user's position in the current scene, a control instruction to the image acquisition device, where the control instruction instructs the image acquisition device to acquire an image of the user's location; the image of the user's location includes the user's face image, and its size is greater than or equal to a preset size.
Step S201 of the positioning method 200 may be implemented by any embodiment of step S101 of the positioning method 100 above; step S202 by any embodiment of step S102; step S203 by any embodiment of step S103; step S204 by any embodiment of step S104; and step S205 by any embodiment of step S105. These are not repeated here.
In some embodiments, the image acquisition device may be a camera. After the user's position in the current scene is identified, a control instruction is output to the camera according to that position. The control instruction directs the camera's action, for example instructing it to zoom and center the shot on the user, acquiring an image of the user's location that includes the user's face image; the size of that image is greater than or equal to a preset size, where the preset size is the size after the image of the user's location is enlarged. The acquired image can therefore be understood as a close-up of the user, an enlarged image of the user's location; since it includes the user's face image, enlarging it also enlarges the user's face.
If the camera can move, for example if its position and shooting angle are adjustable via a movable mount, the control instruction directs the camera to zoom and move, shortening the distance between the camera and the user, and an image of the user's location is acquired.
Moreover, two cameras can be provided: the first captures the image of the scene where the user is currently located, containing everyone in the conference room, i.e., it implements step S203; the second can move and captures the image of the user's location, i.e., it implements step S206.
The positioning method provided in this application greatly simplifies and speeds up the operation of enlarging and displaying the speaker's picture in a video-conference scenario, so that participants can concentrate on the conference itself instead of fussing over the equipment. It saves participants' valuable time and energy and has significant social and economic benefits.
Corresponding to the positioning method described in the method embodiments above, Fig. 3 is a schematic structural diagram of an embodiment of the positioning apparatus provided according to this application. For ease of description, only the parts related to the embodiments of this application are shown.
Referring to Fig. 3, the positioning apparatus 300 includes:
an audio signal acquisition module 301, configured to acquire an audio signal;
a target face image acquisition module 302, configured to acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
a scene image acquisition module 303, configured to acquire an image of the scene where the user is currently located;
a first position acquisition module 304, configured to search the image of the scene where the user is currently located for the target face image and acquire the position of the target face image in that image; and
a second position acquisition module 305, configured to acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
Optionally, the target face image acquisition module 302 includes:
a voiceprint acquisition unit, configured to acquire, according to the audio signal, the voiceprint of the user who emitted the audio signal; and
a target face image acquisition unit, configured to acquire, according to the voiceprint of the user who emitted the audio signal, the target face image of that user.
Optionally, the positioning apparatus 300 further includes:
a correspondence establishment module, configured to establish the correspondence between different voiceprints and different face images.
Correspondingly, the target face image acquisition unit is specifically configured to:
obtain, from the correspondence between different voiceprints and different face images, the face image corresponding to the voiceprint of the user who emitted the audio signal, that face image being the target face image of the user who emitted the audio signal.
Optionally, the target face image acquisition unit is configured to:
if no face image corresponding to the voiceprint of the user who emitted the audio signal is obtained from the correspondence between different voiceprints and different face images, re-acquire the audio signal until the correspondence yields the face image corresponding to the voiceprint of the user who emitted the re-acquired audio signal.
Optionally, the first position acquisition module 304 includes:
an all-face-images acquisition unit, configured to acquire all face images contained in the image of the scene where the user is currently located; and
a position acquisition unit, configured to, upon determining that the target face image is among all the face images, acquire the position of the target face image in the image of the scene where the user is currently located.
Optionally, the position acquisition unit is configured to:
if the target face image is not among all the face images, re-acquire the audio signal until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired audio signal is located.
Optionally, the second position acquisition module 305 is specifically configured to:
obtain the coordinates of the position of the target face image in the image of the scene where the user is currently located; and
perform coordinate conversion on those coordinates to obtain the user's position in the current scene.
Optionally, the positioning apparatus 300 further includes:
a control instruction output module, configured to output, according to the user's position in the current scene, a control instruction to an image acquisition device, where the control instruction instructs the image acquisition device to acquire an image of the user's location; the image of the user's location includes the user's face image, and its size is greater than or equal to a preset size.
It should be noted that, since the information exchange and execution processes between the above apparatus/modules are based on the same concept as the positioning method embodiments of this application, their specific functions and technical effects can be found in the positioning method embodiments and are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the above functional modules is only an example; in practical applications, the above functions can be allocated to different functional modules as required, i.e., the internal structure of the positioning apparatus 300 can be divided into different functional modules to complete all or part of the functions described above. The functional modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional modules are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of this application. For the specific working process of each functional module above, reference may be made to the corresponding process in the foregoing positioning method embodiments, which is not repeated here.
Fig. 4 is a schematic structural diagram of an embodiment of the terminal device provided according to this application. As shown in Fig. 4, the terminal device 400 includes a processor 402, a memory 401, and a computer program 403 stored in the memory 401 and runnable on the processor 402. There is at least one processor 402; Fig. 4 takes one as an example. When the processor 402 executes the computer program 403, the steps of the above positioning method, i.e., the steps shown in Fig. 1 or Fig. 2, are implemented.
For the specific implementation of the terminal device 400, reference may be made to the positioning method embodiments above.
Exemplarily, the computer program 403 may be divided into one or more modules/units, which are stored in the memory 401 and executed by the processor 402 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and these segments are used to describe the execution process of the computer program 403 in the terminal device 400.
The terminal device 400 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a master controller; a device with image acquisition and data processing functions such as a camera or a mobile phone; or a touch display device. The terminal device 400 may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that Fig. 4 is only an example of the terminal device 400 and does not constitute a limitation on it; the device may include more or fewer components than shown, or combine certain components, or use different components. For example, the terminal device 400 may also include input and output devices, network access devices, buses, and the like.
The processor 402 may be a CPU (Central Processing Unit), another general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 401 may be an internal storage unit of the terminal device 400, such as a hard disk or memory. The memory 401 may also be an external storage device of the terminal device 400, such as a plug-in hard disk, an SMC (Smart Media Card), an SD (Secure Digital) card, or a flash card provided on the terminal device 400. Further, the memory 401 may include both an internal storage unit of the terminal device 400 and an external storage device. The memory 401 is used to store the operating system, application programs, a boot loader, data and other programs, for example the program code of the computer program 403; it can also be used to temporarily store data that has been output or is about to be output.
This application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute the steps of the positioning method provided in the first aspect of this application and of any of its embodiments.
Fig. 5 is a schematic structural diagram of an embodiment of the conference system provided according to this application. As shown in Fig. 5, the conference system includes an audio collection device 501, an image collection device 502, and a terminal device 503; both the audio collection device 501 and the image collection device 502 are electrically connected to the terminal device 503.
The audio collection device 501 is used to collect audio signals. There are many suitable examples of the audio collection device 501, which in practice goes by many names: microphone, mic (MIC), pickup, and so on.
The image collection device 502 is used to collect image information; examples include video cameras, webcams, and other devices capable of capturing images. As described above, the image collection device 502 may be one camera or may include two cameras.
The terminal device 503 performs data processing according to the audio signal collected by the audio collection device 501 and the image information collected by the image collection device 502. The positioning method corresponding to the software program in the terminal device 503 has been described in detail in the terminal device and positioning method embodiments above and is not described again in this embodiment.
This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the positioning method provided in the first aspect of this application and of any of its embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the above positioning method embodiments by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the positioning method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, computer memory, ROM (Read-Only Memory), RAM (Random Access Memory), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disc. In some jurisdictions, under legislation and patent practice, computer-readable media refers to non-transitory computer-readable storage media and therefore excludes electrical carrier signals and telecommunication signals.
In the above embodiments, each embodiment has its own emphasis; for parts not detailed or recorded in a given embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professionals may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of this application.
In the embodiments provided by this application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative; the division into modules or units is only a logical functional division, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of their technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included in the protection scope of this application.

Claims (14)

  1. A positioning method, comprising:
    acquiring an audio signal;
    acquiring, according to the audio signal, a target face image of the user who emitted the audio signal;
    acquiring an image of the scene where the user is currently located;
    searching the image of the scene where the user is currently located for the target face image, and acquiring the position of the target face image in the image of the scene where the user is currently located; and
    acquiring, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
  2. The positioning method according to claim 1, wherein acquiring, according to the audio signal, the target face image of the user who emitted the audio signal comprises:
    acquiring, according to the audio signal, the voiceprint of the user who emitted the audio signal; and
    acquiring, according to the voiceprint of the user who emitted the audio signal, the target face image of the user who emitted the audio signal.
  3. The positioning method according to claim 2, wherein, before acquiring the audio signal, the positioning method further comprises:
    establishing a correspondence between different voiceprints and different face images;
    correspondingly, acquiring, according to the voiceprint of the user who emitted the audio signal, the target face image of the user who emitted the audio signal comprises:
    obtaining, from the correspondence between different voiceprints and different face images, the face image corresponding to the voiceprint of the user who emitted the audio signal, the face image being the target face image of the user who emitted the audio signal.
  4. The positioning method according to claim 3, wherein obtaining, from the correspondence between different voiceprints and different face images, the face image corresponding to the voiceprint of the user who emitted the audio signal comprises:
    if no face image corresponding to the voiceprint of the user who emitted the audio signal is obtained from the correspondence between different voiceprints and different face images, re-acquiring the audio signal until the face image corresponding to the voiceprint of the user who emitted the re-acquired audio signal is obtained from the correspondence.
  5. The positioning method according to claim 1, wherein searching the image of the scene where the user is currently located for the target face image and acquiring the position of the target face image in the image of the scene where the user is currently located comprises:
    acquiring all face images contained in the image of the scene where the user is currently located; and
    upon determining that the target face image is among all the face images, acquiring the position of the target face image in the image of the scene where the user is currently located.
  6. The positioning method according to claim 5, wherein determining that the target face image is among all the face images comprises:
    if the target face image is not among all the face images, re-acquiring the audio signal until the user's target face image is determined to be among all the face images contained in the image of the scene where the user who emitted the re-acquired audio signal is located.
  7. The positioning method according to claim 5, wherein acquiring, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene comprises:
    obtaining the coordinates of the position of the target face image in the image of the scene where the user is currently located; and
    performing coordinate conversion on the coordinates of the position of the target face image in the image of the scene where the user is currently located, to obtain the position of the user in the current scene.
  8. The positioning method according to any one of claims 1 to 7, wherein, after acquiring the position of the user in the current scene, the positioning method further comprises:
    outputting, according to the position of the user in the current scene, a control instruction to an image acquisition device, the control instruction instructing the image acquisition device to acquire an image of the user's location, the image of the user's location including the user's face image, and the size of the image of the user's location being greater than or equal to a preset size.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein, when the processor executes the computer program, the steps of the positioning method according to any one of claims 1 to 8 are implemented.
  10. A conference system, comprising:
    an audio collection device;
    an image collection device; and
    the terminal device according to claim 9,
    wherein both the audio collection device and the image collection device are electrically connected to the terminal device.
  11. A positioning apparatus, comprising:
    an audio signal acquisition module, configured to acquire an audio signal;
    a target face image acquisition module, configured to acquire, according to the audio signal, a target face image of the user who emitted the audio signal;
    a scene image acquisition module, configured to acquire an image of the scene where the user is currently located;
    a first position acquisition module, configured to search the image of the scene where the user is currently located for the target face image and acquire the position of the target face image in the image of the scene where the user is currently located; and
    a second position acquisition module, configured to acquire, according to the position of the target face image in the image of the scene where the user is currently located, the position of the user in the current scene.
  12. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein, when the processor executes the computer program, the steps of the positioning method according to any one of claims 1 to 8 are implemented.
  13. A computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the steps of the positioning method according to any one of claims 1 to 8 are implemented.
  14. A computer program product which, when run on a terminal device, causes the terminal device to execute the steps of the positioning method according to any one of claims 1 to 8.
PCT/CN2020/102299 2020-04-28 2020-07-16 Positioning method, terminal device and conference system WO2021217897A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010347463.0 2020-04-28
CN202010347463.0A CN111614928B (zh) 2020-04-28 2020-04-28 Positioning method, terminal device and conference system

Publications (1)

Publication Number Publication Date
WO2021217897A1 true WO2021217897A1 (zh) 2021-11-04

Family

ID=72205597

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/102299 WO2021217897A1 (zh) 2020-04-28 2020-07-16 定位方法、终端设备及会议系统

Country Status (2)

Country Link
CN (1) CN111614928B (zh)
WO (1) WO2021217897A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113630556A (zh) * 2021-09-26 2021-11-09 北京市商汤科技开发有限公司 Focusing method and apparatus, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005151471A (ja) * 2003-11-19 2005-06-09 Sony Corp Sound collection and video imaging apparatus, and imaging condition determination method
US20130162752A1 (en) * 2011-12-22 2013-06-27 Advanced Micro Devices, Inc. Audio and Video Teleconferencing Using Voiceprints and Face Prints
CN106972990A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Smart home device based on voiceprint recognition
CN109318243A (zh) * 2018-12-11 2019-02-12 珠海市微半导体有限公司 Sound source tracking system and method for a vision robot, and cleaning robot
CN109783642A (zh) * 2019-01-09 2019-05-21 上海极链网络科技有限公司 Structured content processing method, apparatus, device and medium for multi-person conference scenes
CN110148418A (zh) * 2019-06-14 2019-08-20 安徽咪鼠科技有限公司 Scene recording and analysis system, method and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8842161B2 (en) * 2010-05-18 2014-09-23 Polycom, Inc. Videoconferencing system having adjunct camera for auto-framing and tracking
CN106960455A (zh) * 2017-03-17 2017-07-18 宇龙计算机通信科技(深圳)有限公司 Directional sound pickup method and terminal
CN110443371B (zh) * 2019-06-25 2023-07-25 深圳欧克曼技术有限公司 Artificial intelligence device and method
CN110503045A (zh) * 2019-08-26 2019-11-26 北京华捷艾米科技有限公司 Face positioning method and apparatus
CN110716180B (zh) * 2019-10-17 2022-03-15 北京华捷艾米科技有限公司 Audio positioning method and apparatus based on face detection


Also Published As

Publication number Publication date
CN111614928A (zh) 2020-09-01
CN111614928B (zh) 2021-09-28

Similar Documents

Publication Publication Date Title
CN108933915B (zh) Video conference apparatus and video conference management method
US10241990B2 (en) Gesture based annotations
WO2019184650A1 (zh) Subtitle generation method and terminal
WO2020000912A1 (zh) Behavior detection method and apparatus, electronic device, and storage medium
CN105303161A (zh) Method and apparatus for photographing multiple people
WO2020119032A1 (zh) Biometrics-based sound source tracking method, apparatus, device, and storage medium
CN105631403A (zh) Face recognition method and apparatus
US10015445B1 (en) Room conferencing system with heat map annotation of documents
CN111432115A (zh) Face tracking method based on sound-assisted positioning, terminal, and storage device
CN110196914B (zh) Method and apparatus for entering face information into a database
WO2023155532A1 (zh) Pose detection method and apparatus, electronic device, and storage medium
CN114445562A (zh) Three-dimensional reconstruction method and apparatus, electronic device, and storage medium
WO2021120190A1 (zh) Data processing method and apparatus, electronic device, and storage medium
WO2021082045A1 (zh) Smile detection method and apparatus, computer device, and storage medium
CN110673811B (zh) Panoramic picture display method and apparatus based on sound information positioning, and storage medium
CN111818385B (zh) Video processing method, video processing apparatus, and terminal device
CN112689221A (zh) Recording method, recording apparatus, electronic device, and computer-readable storage medium
WO2023173616A1 (zh) Crowd counting method and apparatus, electronic device, and storage medium
WO2021217897A1 (zh) Positioning method, terminal device, and conference system
CN114257757B (zh) Automatic video cropping and switching method and system, video player, and storage medium
CN114520888A (zh) Image capture system
CN110717452B (zh) Image recognition method and apparatus, terminal, and computer-readable storage medium
WO2018133321A1 (zh) Method and apparatus for generating shot information
CN112866617A (zh) Video conference device and video conference method
CN111918127B (zh) Video editing method and apparatus, computer-readable storage medium, and camera

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20932946

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20932946

Country of ref document: EP

Kind code of ref document: A1
