CN111614928B - Positioning method, terminal device and conference system - Google Patents

Positioning method, terminal device and conference system

Info

Publication number
CN111614928B
Authority
CN
China
Prior art keywords
user
image
audio signal
face image
acquiring
Prior art date
Legal status
Active
Application number
CN202010347463.0A
Other languages
Chinese (zh)
Other versions
CN111614928A (en)
Inventor
林瑞成
霍澄平
Current Assignee
Shenzhen Honghe Innovation Information Technology Co Ltd
Original Assignee
Shenzhen Honghe Innovation Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Honghe Innovation Information Technology Co Ltd filed Critical Shenzhen Honghe Innovation Information Technology Co Ltd
Priority to CN202010347463.0A priority Critical patent/CN111614928B/en
Priority to PCT/CN2020/102299 priority patent/WO2021217897A1/en
Publication of CN111614928A publication Critical patent/CN111614928A/en
Application granted granted Critical
Publication of CN111614928B publication Critical patent/CN111614928B/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/61: Control of cameras or camera modules based on recognised objects
    • H04N23/611: Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/67: Focus control based on electronic image sensor signals
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/695: Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects

Abstract

The present application relates to the technical field of data processing and provides a positioning method, a terminal device, and a conference system. The positioning method comprises the following steps: acquiring an audio signal; acquiring, according to the audio signal, a target face image of the user who emitted the audio signal; acquiring an image of the scene where the user is currently located; searching for the target face image in that image and acquiring the position of the target face image within it; and acquiring the position of the user in the current scene according to that position. The positioning method combines audio recognition and face recognition technologies: the acquired audio and image signals are processed and analyzed, and the user is positioned automatically. Compared with manual positioning, the user can be identified as soon as the audio and image signals are acquired; the process is simple, the target is quickly located and identified, and the user experience is improved.

Description

Positioning method, terminal device and conference system
Technical Field
The present application belongs to the technical field of data processing, and in particular, to a positioning method, a terminal device, and a conference system.
Background
During a meeting, and in particular a video conference, a camera must be aimed at the speaker to collect the speaker's image information. At present this is usually done manually: an operator must first learn where the speaker is, then repeatedly adjust the lens using the camera's zoom and pan functions until it finally points at the speaker. This operation is cumbersome and time-consuming, cannot achieve fast positioning, and gives a poor user experience.
Summary
In view of this, the present application provides a positioning method, a terminal device, and a conference system to address the slow positioning of existing methods.
A first aspect of an embodiment of the present application provides a positioning method, where the positioning method includes:
acquiring an audio signal;
acquiring a target face image of a user sending the audio signal according to the audio signal;
acquiring an image of a scene where the user is currently located;
searching for the target face image in the image of the scene where the user is currently located, and acquiring the position of the target face image in the image of the scene where the user is currently located;
and acquiring the position of the user in the current scene according to the position of the target face image in the image of the scene where the user is currently located.
A second aspect of embodiments of the present application provides a positioning apparatus, including:
the audio signal acquisition module is used for acquiring an audio signal;
the target face image acquisition module is used for acquiring a target face image of a user sending the audio signal according to the audio signal;
the scene image acquisition module is used for acquiring an image of a scene where the user is currently located;
the first position acquisition module is used for searching for the target face image in the image of the scene where the user is currently located and acquiring the position of the target face image in the image of the scene where the user is currently located;
and the second position acquisition module is used for acquiring the position of the user in the current scene according to the position of the target face image in the image of the scene where the user is currently located.
A third aspect of embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the positioning method provided in the first aspect of embodiments of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the positioning method provided in the first aspect of embodiments of the present application.
A fifth aspect of embodiments of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute a positioning method provided in the first aspect of embodiments of the present application.
A sixth aspect of embodiments of the present application provides a conference system, which includes:
an audio acquisition device;
an image acquisition device; and
a terminal device as provided in the third aspect of the embodiment of the present application described above;
the audio acquisition equipment and the image acquisition equipment are electrically connected with the terminal equipment.
Compared with the prior art, the embodiments of the present application have the following advantages: a target face image of the user who emitted an audio signal is obtained according to the acquired audio signal; an image containing the scene where the user is currently located is then obtained; and finally the position of the user in the current scene is obtained according to the position of the target face image in that image. The positioning method combines audio recognition and face recognition technologies: the acquired audio and image signals are processed and analyzed, and the user is positioned automatically. Compared with manual positioning, the user can be identified as soon as the audio and image signals are acquired; the process is simple, the target is quickly located and identified, and the user experience is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a first implementation process of a positioning method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a second implementation process of a positioning method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a positioning device according to a second embodiment of the present application;
fig. 4 is a schematic structural diagram of a terminal device according to a third embodiment of the present application;
fig. 5 is a schematic structural diagram of a conference system provided in the fourth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the order in which the steps are written in this embodiment does not indicate their execution order; the execution order of the processes is determined by their functions and internal logic, and should not constitute any limitation on the implementation of this embodiment.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
In order to explain the technical means described in the present application, the following description will be given by way of specific embodiments.
Referring to fig. 1, it is a flowchart of a first implementation procedure of a positioning method provided in an embodiment of the present application, and for convenience of description, only a part related to the embodiment of the present application is shown.
The positioning method comprises the following steps:
step S101: an audio signal is acquired.
The scene in which the positioning method is applied is not fixed. It may be applied in a classroom, during daily teaching, to locate a user such as a teacher giving a lecture or a student answering a question; it may also be applied in a conference hall or conference room to locate a user such as the speaker. In this embodiment, a conference room is taken as the example scene.
The audio signal is captured by an audio capture device. Such devices come in many forms and go by many names in practice: microphone, MIC, pickup, and so on.
Step S102: and acquiring a target face image of the user sending the audio signal according to the audio signal.
One implementation is given below: a voiceprint of the user who emitted the audio signal is acquired according to the acquired audio signal; the target face image of that user is then acquired according to the voiceprint.
Specifically, in order to obtain the target face image of the user who emitted the audio signal, correspondences between different voiceprints and different face images need to be established in advance. In general, the correspondences should contain at least two pairs of voiceprints and face images, with the specific number set according to actual needs; as a special implementation, a single voiceprint/face-image pair in the correspondences is already enough for the subsequent recognition process.
The correspondences may be established as follows. Each user first registers and then enters a face image and voiceprint information; the entry process is simple, requiring only that the user face a camera, be photographed, and speak. When an audio signal from any user is received, its voiceprint is extracted and associated with that user's collected face image, i.e., a correspondence between the user's voiceprint and face image is established. To improve recognition accuracy, the recorded audio may consist of representative sentences or words that are likely to recur, and different scenes have different representative phrases; in a meeting scene, for example, representative phrases include "hello everyone" and "let me say a few words". In this way, correspondences between different voiceprints and different face images are established.
During recognition, once the audio signal is acquired, its voiceprint, i.e., the voiceprint of the user who emitted the audio signal, is obtained through a voiceprint recognition algorithm. The face image corresponding to that voiceprint is then looked up in the established correspondences between different voiceprints and different face images; this face image is the target face image of the user who emitted the audio signal. If no corresponding face image is found in the correspondences, step S101 may be executed again. To improve recognition accuracy, the audio signal emitted by the user may match a sentence or word previously entered when the correspondences were established; in a meeting scene, for example, it may be "hello everyone" or "let me say a few words".
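Purely as an illustration, the enrollment-and-lookup flow described above can be sketched as follows. The embedding vectors, the cosine-similarity score, and the 0.75 acceptance threshold are assumptions made for the sketch; the application does not prescribe any particular voiceprint algorithm.

    # A sketch of the voiceprint-to-face correspondence described above.
    # Extraction of the voiceprint vector itself is assumed to be done
    # elsewhere by a voiceprint recognition algorithm.
    import numpy as np

    class VoiceprintRegistry:
        """Correspondence between enrolled voiceprints and face images."""

        def __init__(self, threshold: float = 0.75):
            self.threshold = threshold   # preset acceptance threshold (assumed)
            self.entries = []            # list of (voiceprint vector, face image)

        def enroll(self, voiceprint: np.ndarray, face_image) -> None:
            """Registration: store one user's voiceprint with their face image."""
            v = voiceprint / np.linalg.norm(voiceprint)
            self.entries.append((v, face_image))

        def lookup(self, voiceprint: np.ndarray):
            """Return the face image whose enrolled voiceprint best matches the
            query, or None so that step S101 can be executed again."""
            v = voiceprint / np.linalg.norm(voiceprint)
            best_score, best_face = -1.0, None
            for enrolled, face in self.entries:
                score = float(np.dot(v, enrolled))   # cosine similarity
                if score > best_score:
                    best_score, best_face = score, face
            return best_face if best_score >= self.threshold else None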
Step S103: and acquiring an image of the scene where the user is located currently.
The image of the scene where the user is currently located is acquired through an image capture device such as a camera. Since the user's position is not yet known at this point, the acquired image covers the whole scene, that is, it includes everyone in the conference room, and may be a panoramic image of the room. The camera's placement is not restricted, provided it can effectively capture the faces of everyone in the conference room, for example fixed on the wall at the upper-left or upper-right corner of the room.
Step S104: searching for the target face image in the image of the scene where the user is currently located, and acquiring the position of the target face image in the image of the scene where the user is currently located.
One implementation is given below: acquiring all face images contained in the image of the scene where the user is currently located; detecting whether the target face image exists among them; if it does, acquiring the position of the target face image in the image; if it does not, returning to step S101.
All face images contained in the image of the scene where the user is currently located are obtained by applying a face recognition algorithm to that image and collecting every face it recognizes.
Whether the target face image exists among all the detected face images is determined by comparing it with each of them. The comparison may proceed as follows: a matching value between the target face image and each detected face image is obtained, and each matching value is compared with a preset matching threshold. If any matching value is greater than the preset matching threshold, the target face image is deemed to exist among the detected face images; if none is, it is deemed absent.
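A minimal sketch of this threshold comparison follows; the face_similarity scoring function and the 0.6 threshold are placeholders, since the application only requires some matching value compared against a preset matching threshold.

    def find_target_face(target_face, detected_faces, face_similarity,
                         match_threshold: float = 0.6):
        """Return the index of the detected face whose matching value with the
        target face exceeds the preset threshold, or None if none does."""
        best_idx, best_score = None, match_threshold
        for idx, face in enumerate(detected_faces):
            score = face_similarity(target_face, face)  # matching value
            if score > best_score:
                best_idx, best_score = idx, score
        return best_idx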
Step S105: and acquiring the position of the user in the current scene according to the position of the target face image in the image of the current scene of the user.
If the target face image exists among all the detected face images, the user's position in the current scene is obtained from the position of the target face image in the image: the coordinates of the target face image within the image of the scene are acquired and then converted into the user's coordinates in the scene, i.e., the position of the user in the current scene. The coordinate conversion may work as follows: a preset distance between the user and the camera (the camera that captured the image of the scene) is taken as the Z coordinate of the image coordinates, which yields the three-dimensional coordinates of the target face in the camera coordinate system; applying the transformation matrix between the camera coordinate system and the world coordinate system then yields the three-dimensional coordinates of the target face in the world coordinate system, which are the position of the user in the scene where the user is currently located.
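For illustration, the conversion can be sketched with the standard pinhole camera model. The intrinsic matrix K, the camera-to-world rotation R and translation t, and the 3 m preset depth below are assumptions for the sketch; the description only specifies that a preset user-to-camera distance serves as the Z coordinate and that a camera-to-world transformation matrix is applied.

    import numpy as np

    def pixel_to_world(u: float, v: float, z: float,
                       K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
        """Back-project pixel (u, v) at preset depth z into world coordinates."""
        # Pixel -> camera coordinates: unproject through the intrinsics and
        # scale the normalized ray by the preset user-to-camera distance z.
        p_cam = z * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
        # Camera -> world coordinates via the extrinsic transformation matrix.
        return R @ p_cam + t

    # Hypothetical 1920x1080 camera with a 1000 px focal length, user 3 m away:
    K = np.array([[1000.0, 0.0, 960.0],
                  [0.0, 1000.0, 540.0],
                  [0.0, 0.0, 1.0]])
    user_pos = pixel_to_world(1200.0, 480.0, 3.0, K, np.eye(3), np.zeros(3))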
If the target face image is not found among the detected face images, two handling modes are possible. The first is to acquire the audio signal again and repeat the implementation of steps S102, S103, S104, and S105. The second is to issue a prompt asking the user to enter a target face image and the voiceprint of the corresponding audio signal, which are then added to the correspondences between different voiceprints and different face images.
The positioning method provided by the present application combines audio recognition and face recognition technologies: the acquired audio and image signals are processed and analyzed, and the user is positioned automatically. Compared with manual positioning, the user can be identified as soon as the audio and image signals are acquired; the process is simple and recognition is fast, the target is quickly located and identified, and the user experience is improved.
In order to explain the technical means described in the present application, the following description will be given by way of specific embodiments.
Referring to fig. 2, it is a flowchart of a second implementation procedure of the positioning method provided in the first embodiment of the present application, and for convenience of description, only the parts related to the embodiment of the present application are shown.
The positioning method comprises the following steps:
step S201: an audio signal is acquired.
The implementation process of step S201 is the same as that of step S101, and is not described again.
Step S202: and acquiring a target face image of the user sending the audio signal according to the audio signal.
The implementation process of step S202 is the same as that of step S102, and is not described again.
Step S203: and acquiring an image of the scene where the user is located currently.
The implementation process of step S203 is the same as that of step S103, and is not described again.
Step S204: searching for the target face image in the image of the scene where the user is currently located, and acquiring the position of the target face image in the image of the scene where the user is currently located.
The implementation process of step S204 is the same as that of step S104, and is not described again.
Step S205: and acquiring the position of the user in the current scene according to the position of the target face image in the image of the current scene of the user.
The implementation process of step S205 is the same as that of step S105, and is not described again.
Step S206: outputting a control instruction to an image capture device according to the position of the user in the current scene, wherein the control instruction instructs the image capture device to capture an image of the location of the user; the image of the location of the user contains the face image of the user, and its size is greater than or equal to a preset size.
After the position of the user in the current scene is identified, a control instruction is output to the camera according to that position. The control instruction directs the camera to act, for example to zoom, so that the shot is framed on the user and an image of the location of the user, containing the user's face image, is captured. The size of this image is greater than or equal to a preset size, the preset size being the size after the image of the location of the user is enlarged. The captured image can therefore be understood as a close-up of the user: an enlarged picture of where the user is, and because it contains the user's face image, the face image is enlarged along with it.
If the camera can move, for example if its position and shooting angle can be adjusted by a movable support, the control instruction directs the camera to zoom and to move, shortening the distance between the camera and the user before capturing the image of the location of the user.
Moreover, two cameras may be provided: the first shoots the image of the scene where the user is currently located, covering everyone in the conference room, and implements step S203; the second is movable and captures the image of the location of the user, implementing step S206. A sketch of deriving such a control instruction is given below.
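The following sketch derives pan, tilt, and zoom values from the user's world position, assuming the close-up camera sits at cam_pos and accepts such a command; the application leaves the actual camera control protocol open, so all names and units here are illustrative.

    import math

    def control_instruction(user_pos, cam_pos, face_height_px: float,
                            preset_size_px: float = 200.0) -> dict:
        """Aim the close-up camera at the user and zoom until the face meets
        the preset size (the 200 px preset is an assumption)."""
        dx = user_pos[0] - cam_pos[0]
        dy = user_pos[1] - cam_pos[1]
        dz = user_pos[2] - cam_pos[2]
        pan = math.degrees(math.atan2(dx, dz))                   # horizontal angle
        tilt = math.degrees(math.atan2(dy, math.hypot(dx, dz)))  # vertical angle
        zoom = max(1.0, preset_size_px / face_height_px)         # enlarge the face
        return {"pan": pan, "tilt": tilt, "zoom": zoom}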
In a video conference scene, the positioning method provided by the present application greatly simplifies and speeds up enlarging and displaying the speaker's picture, so that participants can concentrate on the conference itself instead of the conference equipment. It saves participants' valuable time and energy and brings notable social and economic benefits.
Fig. 3 shows a block diagram of a positioning apparatus provided in the second embodiment of the present application, which corresponds to the positioning method described in the foregoing positioning method embodiment, and only shows portions related to the second embodiment of the present application for convenience of description.
Referring to fig. 3, the positioning device 300 includes:
an audio signal acquisition module 301, configured to acquire an audio signal;
a target face image obtaining module 302, configured to obtain, according to the audio signal, a target face image of a user who sends the audio signal;
a scene image obtaining module 303, configured to obtain an image of a scene where the user is currently located;
a first position obtaining module 304, configured to search the target face image from the image of the current scene of the user, and obtain a position of the target face image in the image of the current scene of the user;
a second position obtaining module 305, configured to obtain a position of the user in the current scene according to a position of the target face image in the image of the current scene of the user.
Optionally, the target face image obtaining module 302 includes:
the voiceprint acquisition unit is used for acquiring the voiceprint of the user sending the audio signal according to the audio signal;
and the target face image acquisition unit is used for acquiring the target face image of the user sending the audio signal according to the voiceprint of the user sending the audio signal.
Optionally, the positioning apparatus 300 further includes:
the corresponding relation establishing module is used for establishing corresponding relations between different voiceprints and different face images;
correspondingly, the target face image acquisition unit is specifically configured to:
and acquiring a face image corresponding to the voiceprint of the user sending the audio signal from the corresponding relation between the different voiceprints and the different face images, wherein the face image is a target face image of the user sending the audio signal.
Optionally, the target face image obtaining unit is further configured to:
if the face image corresponding to the voiceprint of the user who emitted the audio signal is not obtained from the correspondences between the different voiceprints and the different face images, returning execution to the audio signal acquisition module 301.
Optionally, the first position obtaining module 304 includes:
the all-face image acquisition unit is used for acquiring all face images contained in the image of the scene where the user is located currently;
a detection unit configured to detect whether the target face image exists in all the face images;
and the position acquisition unit is used for acquiring the position of the target face image in the image of the scene where the user is located currently if the target face image exists in all the face images.
Optionally, the first position obtaining module 304 further includes:
a return execution unit, configured to return execution to the audio signal acquisition module 301 if the target face image does not exist in all the face images.
Optionally, the second position obtaining module 305 is specifically configured to:
acquiring coordinates of the position of the target face image in the image of the scene where the user is located currently;
and performing coordinate conversion on the coordinates of the position of the target face image in the image of the current scene of the user to acquire the position of the user in the current scene.
Optionally, the positioning apparatus 300 further comprises:
and the control instruction output module is used for outputting a control instruction to the image acquisition equipment according to the position of the user in the current scene, wherein the control instruction is used for instructing the image acquisition equipment to acquire the image of the position of the user, the image of the position of the user comprises the face image of the user, and the size of the image of the position of the user is larger than or equal to a preset size.
It should be noted that, because the contents of information interaction, execution process, and the like between the above-mentioned devices/modules are based on the same concept as that of the embodiment of the positioning method of the present application, specific functions and technical effects thereof may be referred to specifically in the embodiment of the positioning method, and are not described herein again.
It will be clear to those skilled in the art that the above division into functional modules is merely illustrative, given for convenience and simplicity of description. In practical applications, the functions may be distributed among different functional modules as needed; that is, the internal structure of the positioning apparatus 300 may be divided into different functional modules to perform all or part of the functions described above. The functional modules in the embodiments may be integrated into one processing unit, may each exist alone physically, or two or more may be integrated into one unit; the integrated unit may be implemented in hardware or as a software functional unit. In addition, the specific names of the functional modules serve only to distinguish them from one another and do not limit the protection scope of the application. For the specific working process of each functional module, reference may be made to the corresponding process in the foregoing positioning method embodiment, which is not repeated here.
Fig. 4 is a schematic structural diagram of a terminal device according to a third embodiment of the present application. As shown in fig. 4, the terminal device 400 includes: a processor 402, a memory 401, and a computer program 403 stored in the memory 401 and executable on the processor 402. The number of the processors 402 is at least one, and fig. 4 takes one as an example. The processor 402 executes the computer program 403 to implement the implementation steps of the positioning method described above, i.e. the steps shown in fig. 1 or fig. 2.
The specific implementation process of the terminal device 400 can refer to the above positioning method embodiment.
Illustratively, the computer program 403 may be partitioned into one or more modules/units that are stored in the memory 401 and executed by the processor 402 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 403 in the terminal device 400.
The terminal device 400 may be a desktop computer, a notebook, a palmtop computer, a main control device, or another computing device; it may also be a camera, a mobile phone, or another device with image capture and data processing functions, or a touch display device. The terminal device 400 may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 4 is only an example of the terminal device 400 and does not limit it; it may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device 400 may also include input-output devices, network access devices, buses, and the like.
The Processor 402 may be a CPU (Central Processing Unit), other general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 401 may be an internal storage unit of the terminal device 400, such as a hard disk or an internal memory. The memory 401 may also be an external storage device of the terminal device 400, such as a plug-in hard disk, an SMC (Smart Media Card), an SD (Secure Digital) card, or a flash card provided on the terminal device 400. Further, the memory 401 may include both an internal storage unit and an external storage device of the terminal device 400. The memory 401 stores an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program 403, and may also be used to temporarily store data that has been or is to be output.
Fig. 5 is a schematic structural diagram of a conference system provided in the fourth embodiment of the present application. As shown in fig. 5, the conference system includes: an audio capture device 501, an image capture device 502, and a terminal device 503. The audio capture device 501 and the image capture device 502 are both electrically connected to a terminal device 503.
The audio capture device 501 is used for capturing audio signals. Such devices come in many types and go by many names in practice: microphone, MIC, pickup, and so on.
The image capture device 502 is used for capturing image information; it may be a camera, a video camera, or another apparatus capable of capturing images. As described above, the image capture device 502 may be a single camera or may include two cameras.
The terminal device 503 performs data processing according to the audio signal acquired by the audio acquisition device 501 and the image information acquired by the image acquisition device 502, and the positioning method corresponding to the software program in the terminal device 503 is described in detail in the above terminal device embodiment and positioning method embodiment, which is not described in detail in this embodiment.
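Purely as an illustration of how the pieces of fig. 5 could be wired together, the loop below reuses the sketches given earlier (VoiceprintRegistry, find_target_face, pixel_to_world, control_instruction). Every device interface shown (record_audio, capture_frame, detect_faces, send_ptz, and the face fields) is an assumption, since the embodiments leave the hardware APIs open.

    def conference_loop(registry, audio_dev, wide_cam, closeup_cam,
                        extract_voiceprint, detect_faces, face_similarity):
        while True:
            audio = audio_dev.record_audio()                       # step S201
            target_face = registry.lookup(extract_voiceprint(audio))  # step S202
            if target_face is None:
                continue                                           # re-acquire audio
            frame = wide_cam.capture_frame()                       # step S203
            faces = detect_faces(frame)                            # step S204
            idx = find_target_face(target_face,
                                   [f.image for f in faces], face_similarity)
            if idx is None:
                continue                                           # re-acquire audio
            u, v = faces[idx].center                               # position in image
            user_pos = pixel_to_world(u, v, 3.0, wide_cam.K,
                                      wide_cam.R, wide_cam.t)      # step S205
            closeup_cam.send_ptz(control_instruction(
                user_pos, closeup_cam.pos, faces[idx].height_px))  # step S206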
Embodiments of the present application further provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program may implement the steps in the above embodiments of the positioning method.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the embodiments of the positioning method implemented in the present application may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the embodiments of the positioning method may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, ROM (Read-Only Memory), RAM (Random Access Memory), electrical carrier wave signal, telecommunication signal, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (9)

1. A positioning method, characterized in that the positioning method comprises:
acquiring an audio signal in real time;
acquiring a voiceprint of a user sending the audio signal according to the audio signal;
acquiring a target face image of the user sending the audio signal according to the voiceprint of the user sending the audio signal;
acquiring an image of a scene where the user is currently located;
searching for the target face image in the image of the scene where the user is currently located, and acquiring the position of the target face image in the image of the scene where the user is currently located;
and acquiring the position of the user in the current scene according to the position of the target face image in the image of the scene where the user is currently located and a distance value between the user and the camera that captures the image of the scene where the user is currently located.
2. The positioning method according to claim 1, characterized in that before acquiring the audio signal, the positioning method further comprises:
establishing corresponding relations between different voiceprints and different face images;
correspondingly, the obtaining a target face image of the user sending the audio signal according to the voiceprint of the user sending the audio signal includes:
and acquiring a face image corresponding to the voiceprint of the user sending the audio signal from the corresponding relation between the different voiceprints and the different face images, wherein the face image is a target face image of the user sending the audio signal.
3. The positioning method according to claim 2, further comprising:
and if the face image corresponding to the voiceprint of the user who emitted the audio signal is not obtained from the correspondences between the different voiceprints and the different face images, returning to the step of acquiring the audio signal.
4. The method according to claim 1, wherein the searching for the target face image in the image of the scene where the user is currently located and acquiring the position of the target face image in the image of the scene where the user is currently located comprises:
acquiring all face images contained in the image of the scene where the user is currently located;
detecting whether the target face image exists in all the face images;
and if the target face image exists in all the face images, acquiring the position of the target face image in the image of the scene where the user is currently located.
5. The positioning method of claim 4, further comprising:
and if the target face image does not exist in all the face images, returning to the step of acquiring the audio signal.
6. The method according to claim 4, wherein the acquiring the position of the user in the current scene according to the position of the target face image in the image of the scene where the user is currently located comprises:
acquiring coordinates of the position of the target face image in the image of the scene where the user is currently located;
and performing coordinate conversion on the coordinates of the position of the target face image in the image of the scene where the user is currently located to acquire the position of the user in the current scene.
7. The positioning method according to any one of claims 1 to 6, wherein after obtaining the position of the user in the current scene, the positioning method further comprises:
and outputting a control instruction to image acquisition equipment according to the position of the user in the current scene, wherein the control instruction is used for instructing the image acquisition equipment to acquire the image of the position of the user, the image of the position of the user comprises the face image of the user, and the size of the image of the position of the user is larger than or equal to a preset size.
8. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the positioning method according to any of claims 1 to 7 when executing the computer program.
9. A conferencing system, the conferencing system comprising:
an audio acquisition device;
an image acquisition device; and
the terminal device of claim 8;
the audio acquisition equipment and the image acquisition equipment are electrically connected with the terminal equipment.
CN202010347463.0A, filed 2020-04-28 (priority 2020-04-28): Positioning method, terminal device and conference system. Status: Active. Granted as CN111614928B.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010347463.0A CN111614928B (en) 2020-04-28 2020-04-28 Positioning method, terminal device and conference system
PCT/CN2020/102299 WO2021217897A1 (en) 2020-04-28 2020-07-16 Positioning method, terminal device and conference system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010347463.0A CN111614928B (en) 2020-04-28 2020-04-28 Positioning method, terminal device and conference system

Publications (2)

Publication Number Publication Date
CN111614928A CN111614928A (en) 2020-09-01
CN111614928B true CN111614928B (en) 2021-09-28

Family

ID=72205597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010347463.0A Active CN111614928B (en) 2020-04-28 2020-04-28 Positioning method, terminal device and conference system

Country Status (2)

Country Link
CN (1) CN111614928B (en)
WO (1) WO2021217897A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113630556A (en) * 2021-09-26 2021-11-09 北京市商汤科技开发有限公司 Focusing method, focusing device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106972990A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Intelligent home device based on Application on Voiceprint Recognition
CN109783642A (en) * 2019-01-09 2019-05-21 上海极链网络科技有限公司 Structured content processing method, device, equipment and the medium of multi-person conference scene
CN110503045A (en) * 2019-08-26 2019-11-26 北京华捷艾米科技有限公司 A kind of Face detection method and device
CN110716180A (en) * 2019-10-17 2020-01-21 北京华捷艾米科技有限公司 Audio positioning method and device based on face detection

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4479227B2 (en) * 2003-11-19 2010-06-09 ソニー株式会社 Audio pickup / video imaging apparatus and imaging condition determination method
US8842161B2 (en) * 2010-05-18 2014-09-23 Polycom, Inc. Videoconferencing system having adjunct camera for auto-framing and tracking
US20130162752A1 (en) * 2011-12-22 2013-06-27 Advanced Micro Devices, Inc. Audio and Video Teleconferencing Using Voiceprints and Face Prints
CN106960455A (en) * 2017-03-17 2017-07-18 宇龙计算机通信科技(深圳)有限公司 Orient transaudient method and terminal
CN109318243B (en) * 2018-12-11 2023-07-07 珠海一微半导体股份有限公司 Sound source tracking system and method of vision robot and cleaning robot
CN110148418A (en) * 2019-06-14 2019-08-20 安徽咪鼠科技有限公司 A kind of scene record analysis system, method and device thereof
CN110443371B (en) * 2019-06-25 2023-07-25 深圳欧克曼技术有限公司 Artificial intelligence device and method

Also Published As

Publication number Publication date
WO2021217897A1 (en) 2021-11-04
CN111614928A (en) 2020-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant