WO2020253616A1 - Audio collection device positioning method and apparatus, and speaker recognition method and system - Google Patents

Audio collection device positioning method and apparatus, and speaker recognition method and system

Info

Publication number
WO2020253616A1
Authority
WO
WIPO (PCT)
Prior art keywords
coordinate data
image
detected
collection device
audio collection
Prior art date
Application number
PCT/CN2020/095640
Other languages
English (en)
French (fr)
Inventor
揭泽群
葛政
刘威
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to EP20825730.3A (published as EP3985610A4)
Publication of WO2020253616A1
Priority to US17/377,316 (published as US11915447B2)
Priority to US18/410,404 (published as US20240153137A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G06T 7/40 Analysis of texture
    • G06T 7/41 Analysis of texture based on statistical description of texture
    • G06T 7/44 Analysis of texture based on statistical description of texture using image operators, e.g. filters, edge density metrics or local histograms
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 7/85 Stereo camera calibration
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/23 Updating
    • G06F 16/29 Geographical information databases
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Definitions

  • This application relates to the field of image processing technology, and in particular to an audio collection device positioning method and apparatus, an electronic device, a speaker recognition method and system, and a computer-readable storage medium.
  • Speaker recognition technology is widely used in many fields of daily life. When performing speaker recognition, it is often necessary to accurately locate the microphone device.
  • Microphone detection is usually performed with a deep learning-based method: the image to be detected is input to a deep learning model, and the model's output is used as the final detection result.
  • An embodiment of this application provides a method for locating an audio collection device, executed by an electronic device. The method includes: acquiring an image to be detected; identifying the audio collection device in the image to be detected to obtain first coordinate data of the audio collection device; and determining displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determining the coordinates of the audio collection device in the image to be detected according to the displacement data.
  • An embodiment of this application provides an audio collection device positioning apparatus, including: an image acquisition module configured to acquire an image to be detected; an image recognition module configured to identify the audio collection device in the image to be detected to obtain first coordinate data of the audio collection device; and a coordinate calculation module configured to determine displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determine the coordinates of the audio collection device in the image to be detected according to the displacement data.
  • An embodiment of this application provides an electronic device, including: one or more processors; and a storage device configured to store one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the audio collection device positioning method or the speaker recognition method provided in the embodiments of this application.
  • An embodiment of this application provides a speaker recognition method, executed by an electronic device. The method includes: acquiring an image to be detected through a camera device; performing face recognition processing on the image to be detected to obtain at least one set of face coordinates; identifying the audio collection device in the image to be detected to obtain the coordinates of the audio collection device; and determining the distance between the coordinates of the audio collection device and each set of face coordinates, and determining the object corresponding to the face coordinates with the smallest distance as the speaker.
  • An embodiment of this application provides a speaker recognition system, including:
  • a camera device configured to obtain the image to be detected; and
  • an electronic device connected to the camera device, where the electronic device includes a storage device and a processor; the storage device is configured to store one or more programs, and when the one or more programs are executed by the processor, the processor is caused to implement the speaker recognition method described in the embodiments of the present application, so as to process the image to be detected and obtain the speaker.
  • An embodiment of the present application provides a computer-readable storage medium storing one or more programs; when the one or more programs are executed by a processor, the processor implements the audio collection device positioning method or the speaker recognition method provided in the embodiments of this application.
  • FIG. 1A shows an optional architectural schematic diagram of an exemplary system to which the technical solutions of the embodiments of the present application can be applied;
  • FIG. 1B shows an optional architectural schematic diagram of an exemplary system to which the technical solutions of the embodiments of the present application can be applied;
  • FIG. 2 schematically shows an optional flowchart of an audio collection device positioning method provided by an embodiment of the present application
  • FIG. 3 schematically shows an optional schematic diagram of a process for recognizing an image to be detected according to an embodiment of the present application
  • FIG. 4 schematically shows a schematic diagram of an optional detection result of an image to be detected provided in an embodiment of the present application
  • FIG. 5 schematically shows a schematic diagram of an optional detection result obtained after edge detection of an image to be detected provided in an embodiment of the present application
  • FIG. 6A schematically shows an optional flowchart of an audio collection device positioning method provided by an embodiment of the present application
  • FIG. 6B schematically shows an optional flowchart of an audio collection device positioning method provided by an embodiment of the present application
  • FIG. 6C schematically shows an optional schematic diagram of a historical coordinate database and a mobile cache coordinate database provided by an embodiment of the present application
  • FIG. 7 schematically shows an optional flowchart of a speaker recognition method provided by an embodiment of the present application.
  • FIG. 8 schematically shows a schematic diagram of an optional detection result that includes a face recognition result and an audio collection device recognition result provided by an embodiment of the present application
  • FIG. 9 schematically shows an optional schematic diagram of the architecture of the audio collection device positioning apparatus provided by an embodiment of the present application.
  • FIG. 10 schematically shows an optional structural diagram of the speaker recognition system provided by an embodiment of the present application.
  • FIG. 11 shows an optional structural schematic diagram of a computer system suitable for implementing the electronic equipment of the embodiments of the present application.
  • To quickly and accurately identify the coordinates of an audio collection device, and to accurately identify the speaker, this application provides an audio collection device positioning method and apparatus, an electronic device, a speaker recognition method and system, and a computer-readable storage medium. The details are described below.
  • FIG. 1A shows an optional schematic diagram of the architecture of an exemplary system to which the technical solutions of the embodiments of the present application can be applied.
  • the system architecture 100 may include terminal devices (such as a smart phone 101, a tablet computer 102, and a portable computer 103 as shown in FIG. 1A), a network 104, and a server 105.
  • the network 104 is used as a medium for providing a communication link between the terminal device and the server 105.
  • the network 104 may include various connection types, such as wired communication links, wireless communication links, and so on.
  • the numbers of terminal devices, networks, and servers in FIG. 1A are merely illustrative. According to implementation needs, there can be any number of terminal devices, networks and servers.
  • The server 105 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the types of terminal devices are not limited to the smart phones, tablet computers, and portable computers shown in FIG. 1A, and may also be desktop computers, cameras, smart speakers, smart watches, etc., for example.
  • The user can use the terminal device 101 (or the terminal device 102 or 103) to obtain the image to be detected and send it to the server 105. After receiving the image to be detected, the server 105 can perform image recognition on it, identify the audio collection device in the image, and obtain the first coordinate data of the audio collection device; it then calculates displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and calculates the coordinates of the audio collection device in the image to be detected according to the displacement data.
  • On the one hand, the embodiments of this application can use image recognition technology to accurately determine the unique audio collection device in the image to be detected, avoiding errors caused by multiple targets in the image; on the other hand, they can combine historical coordinate data to judge the correctness of the first coordinate data and optimize the coordinate data, further improving the accuracy of the final coordinates of the audio collection device.
  • In a scenario where multiple people are talking, the terminal device 101 (or the terminal device 102 or 103) can be used to obtain the image to be detected.
  • The facial targets are, for example, the facial targets 106, 107, and 108 shown in FIG. 1B, and the audio collection device is, for example, the microphone 109 shown in FIG. 1B.
  • the distance between the coordinates of the audio collection device and the face coordinates of each face target is determined, and the face target corresponding to the face coordinates with the smallest distance is determined as the speaker.
  • the face target 108 closest to the microphone 109 is determined as the speaker.
  • After the speaker is recognized, automatic lens following can be executed to enlarge the speaker in the lens and display it in the graphical interface of the terminal device; for example, the face target 108 is displayed in the graphical interface. In addition, after the speaker is identified, the microphone array can be automatically adjusted to point towards the speaker, so as to collect clearer speech.
  • the audio collection device positioning method provided in the embodiment of the present application can be executed by the server 105, and accordingly, the audio collection device positioning device can be set in the server 105.
  • the terminal device may also have a similar function to the server, so as to execute the audio collection device positioning method provided in the embodiments of the present application.
  • the speaker recognition method provided in the embodiments of the present application has the same principle.
  • FIG. 2 schematically shows an optional flowchart of a method for locating an audio collection device for microphone recognition provided by an embodiment of the present application.
  • The positioning method may be performed by a server (for example, the server 105 shown in FIG. 1A or FIG. 1B), may be executed by a terminal device (for example, the terminal device 101, 102, or 103 shown in FIG. 1A or FIG. 1B), or may be executed by the terminal device and the server together.
  • the method for locating the audio collection device at least includes steps S210 to S240, which are described in detail as follows:
  • step S210 an image to be detected is acquired.
  • the image to be detected may be obtained through the terminal device 101 (or the terminal device 102 or 103).
  • The terminal device 101 may be a device such as a camera or a video recorder, or may take pictures or videos through a built-in photographing unit or an externally connected photographing device to obtain each frame of the image to be detected. The terminal device 101 can also be connected to a data network to obtain video data or image data by browsing and downloading network resources or local database resources, and then obtain each frame of the image to be detected. The specific method of obtaining the image to be detected is not limited in the embodiments of this application.
  • step S220 the audio collection device in the image to be detected is identified, and the first coordinate data of the audio collection device is obtained.
  • The above-mentioned audio collection device may be a microphone, or a device with audio collection and amplification functions such as a mobile phone, in each frame of the image to be detected. The embodiments of the present application take a microphone as an example of the audio collection device.
  • After the image to be detected is obtained, features can be extracted from it by a pre-trained target recognition model; the audio collection device is then identified and located according to the extracted feature information, and its coordinate data is obtained as the first coordinate data.
  • the image to be detected may be input to the target recognition model to recognize the image to be detected.
  • The target recognition model may be a machine learning model, for example, a Convolutional Neural Network (CNN) model, a Recurrent Neural Network (RNN) model, or a Faster Region-based Convolutional Neural Network (Faster R-CNN) model, which is not specifically limited in the embodiments of this application.
  • FIG. 3 shows a schematic diagram of the process of using the target recognition model to recognize the image to be detected.
  • Taking a target recognition model based on Faster R-CNN as an example, as shown in FIG. 3:
  • In step S301, convolution processing is performed on the image to be detected to obtain a feature map. The image input to the model can be a picture of any size, and the convolution layers are used to extract the feature map of the image to be detected.
  • In step S302, each feature point in the feature map is classified to determine candidate regions. The Region Proposal Network (RPN) is used to generate region proposals: a certain number of anchor boxes are generated on the feature map obtained above, each anchor is judged to belong to the foreground or the background, and the anchor boxes are corrected to obtain relatively accurate candidate regions.
  • In step S303, the candidate regions are pooled to obtain candidate feature maps. The candidate regions generated by the RPN and the feature map are used to obtain candidate feature maps of a fixed size.
  • In step S304, the candidate feature maps are passed through fully connected layers to obtain the first coordinate data of the audio collection device. The fixed-size candidate feature maps formed by the pooling layer are fully connected, specific categories are classified, and a regression operation is performed at the same time, thereby obtaining the first coordinate data of the audio collection device in the image to be detected.
  • For each frame of the image to be detected, the target recognition model may return the coordinate data of one microphone (the microphone coordinate data is the coordinate data corresponding to the recognition target) or of multiple microphones; the microphone coordinate data may be the center coordinates corresponding to the center position of the recognized microphone target. A hedged sketch of this detection step is given below.
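The sketch assumes a torchvision Faster R-CNN and an illustrative score threshold; the framework, the two-class labeling, and the fine-tuning are assumptions, not details prescribed by the description:

```python
import torch
import torchvision

# Illustrative stand-in for the target recognition model: a torchvision
# Faster R-CNN with 2 classes (background + "microphone"). Weights are
# random here; in practice the model would be fine-tuned on microphone data.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.eval()

def detect_microphones(image, score_threshold=0.5):
    """Return candidate (x1, y1, x2, y2) boxes and their center coordinates.

    `image` is a CxHxW float tensor in [0, 1]; the score threshold is an
    assumed value, not one given in the description.
    """
    with torch.no_grad():
        output = model([image])[0]
    keep = output["scores"] > score_threshold
    boxes = output["boxes"][keep].tolist()
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    return boxes, centers
```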
  • For example, a frame of the image to be detected as shown in FIG. 4 includes two recognition results: recognition frame 410 and recognition frame 420; that is, two microphone targets are identified in this frame. The microphone target corresponding to recognition frame 410 is correct, while the microphone target corresponding to recognition frame 420 is a false detection: a human arm is mistakenly recognized as a microphone target.
  • In this case, edge detection can be performed on the image to be detected to identify the bracket device used to support the microphone target, thereby eliminating the false detections. For example, the bracket device may be a microphone stand.
  • After edge detection, the detection result shown in FIG. 5 can be obtained: only one recognition frame 510 remains, and the coordinate data corresponding to recognition frame 510 is the first coordinate data.
  • the embodiment of the application provides a flowchart as shown in FIG. 6A.
  • Taking the audio collection device as a microphone as an example, edge detection is performed on the image to be detected to extract the edge information in the image. Since the microphone holder forms an edge, it will be detected. It is then determined whether the edge information of a microphone holder is detected under each of the identified microphone targets, so that among the multiple microphone targets, the one with a microphone holder set below it is confirmed as the only correct microphone.
  • the embodiment of the present application also provides a flowchart as shown in FIG. 6B.
  • The microphone coordinate data in the image to be detected is detected by the microphone detector, and the edge extractor extracts edge features from the image to be detected; here the microphone detector is the target recognition model, the process of extracting edge features is the edge detection, and the edge features are the edge information. If the microphone coordinate data of only one microphone target is obtained, that microphone coordinate data is used as the first coordinate data; if the microphone coordinate data of multiple microphone targets is obtained, the microphone target with a microphone holder below it is determined, through the obtained edge features, to be the only correct microphone, as sketched below.
  • step S230 the displacement data is determined according to the first coordinate data and the historical coordinate data of the audio collecting device, and the coordinates of the audio collecting device are determined according to the displacement data.
  • A historical coordinate database and a mobile cache coordinate database can be pre-configured. The historical coordinate database is configured to store the coordinates of the audio collection device in each historical frame of the image to be detected, that is, all location records of the microphone before an "actual move" occurs; the mobile cache coordinate database is configured to store the mobile cache coordinate data corresponding to the recognition target in the current image to be detected, that is, the location records of a microphone that "may have moved".
  • In this way, the first coordinate data can be accurately corrected.
  • the displacement data can be calculated based on the first coordinate data and the historical coordinate data of the audio collection device.
  • the weighted average processing is performed on each historical coordinate data in the historical coordinate database to obtain the first historical coordinate data.
  • The coordinate differences between the first coordinate data and the first historical coordinate data on the horizontal axis and the vertical axis, calculated in the same coordinate system, constitute the displacement data; a small sketch of this computation follows.
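In the sketch, uniform weights stand in for the unspecified weighting scheme, and boxes are assumed to be in the (x1, y1, x2, y2) form used in FIG. 6B:

```python
import numpy as np

def weighted_mean_box(history, weights=None):
    """First historical coordinate data: a weighted average over the stored
    (x1, y1, x2, y2) boxes; uniform weights are an assumed default."""
    history = np.asarray(history, dtype=float)          # shape (n, 4)
    weights = np.ones(len(history)) if weights is None else np.asarray(weights, float)
    return history.T @ weights / weights.sum()          # shape (4,)

def displacement(first_coords, first_historical):
    """Per-axis absolute coordinate differences between the current first
    coordinate data and the averaged historical coordinate data."""
    return np.abs(np.asarray(first_coords, float) - np.asarray(first_historical, float))
```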
  • the displacement data when the current image to be detected is the nth frame of image to be detected, after the displacement data is obtained, the displacement data can be compared with the preset first threshold, as shown in FIG. 6A.
  • When the displacement data is less than the first threshold, the microphone position in the nth frame of the image to be detected is close to the historical position. In this case, the first coordinate data of the audio collection device in the image to be detected is saved to the historical coordinate database as new historical coordinate data. Weighted average processing is then performed on each historical coordinate data in the updated historical coordinate database to obtain second coordinate data, and the second coordinate data is used as the precise coordinates of the audio collection device in the nth frame of the image to be detected.
  • For example, when the coordinate differences between the first coordinate data and the first historical coordinate data on the horizontal axis and the vertical axis are both less than 50 pixels, it is considered that the microphone has not moved; as shown in FIG. 6B, the microphone position has not been greatly disturbed, and the first coordinate data (x1, y1, x2, y2) is saved to the historical coordinate database (the historical position pipeline), thereby updating the historical coordinate database. Here x represents the value on the horizontal axis, y represents the value on the vertical axis, and (x1, y1, x2, y2) is the form of the corresponding recognition frame.
  • Weighted average processing is performed on each historical coordinate data in the updated historical coordinate database, and the processing result is used as the precise coordinates of the audio collection device in the nth frame of the image to be detected, shown in FIG. 6B as (x1_, y1_, x2_, y2_). At the same time, the mobile cache coordinate database can be cleared and its mobile cache coordinate data deleted.
  • The displacement data is also compared with the preset first threshold in the opposite case. As shown in FIG. 6A, when the displacement data corresponding to the nth frame of the image to be detected is greater than or equal to the first threshold, the position of the audio collection device in the nth frame deviates greatly from the historical position, and the audio collection device in the nth frame "may have moved" (the "position has a major disturbance" in FIG. 6B). At this time, the first coordinate data of the nth frame of the image to be detected is saved to the mobile cache coordinate database (the mobile cache pipeline in FIG. 6B) as mobile cache coordinate data.
  • At this time, the first historical coordinate data can be used as the precise coordinates of the audio collection device in the nth frame of the image to be detected, to ensure the continuity of target recognition across frames; these precise coordinates are shown as (x1_, y1_, x2_, y2_).
  • Here, the displacement data being greater than or equal to the first threshold can mean that the coordinate difference between the first coordinate data and the first historical coordinate data on the horizontal axis is greater than or equal to 50 pixels, that the coordinate difference on the vertical axis is greater than or equal to 50 pixels, or that the coordinate differences on both axes are greater than or equal to 50 pixels.
  • Next, the (n+1)th frame of the image to be detected can be obtained and the first coordinate data of the audio collection device in it determined. The first coordinate data is compared with the mobile cache coordinate data, and according to the comparison result it is judged whether a further comparison with the first historical coordinate data is necessary, thereby judging the correctness of the mobile cache coordinate data and determining the precise coordinates of the audio collection device in the (n+1)th frame of the image to be detected.
  • The first coordinate data in the (n+1)th frame of the image to be detected is compared with the mobile cache coordinate data: for example, the position deviation data between the first coordinate data in the (n+1)th frame and the mobile cache coordinate data is calculated, and the position deviation data is compared with a second threshold.
  • When the position deviation data is less than the second threshold, the historical coordinate database is cleared, and the mobile cache coordinate data in the mobile cache coordinate database and the first coordinate data in the (n+1)th frame of the image to be detected are saved to the historical coordinate database as its historical coordinate data.
  • Weighted average processing is performed on each historical coordinate data in the updated historical coordinate database to obtain third coordinate data, and the third coordinate data is used as the precise coordinates of the audio collection device in the (n+1)th frame of the image to be detected.
  • The position deviation data being less than the preset second threshold, for example the coordinate differences between the first coordinate data in the (n+1)th frame and the mobile cache coordinate data on the horizontal axis and the vertical axis both being less than 50 pixels in the same coordinate system, indicates that the position of the audio collection device in the (n+1)th frame is close to its position in the nth frame, and it is considered that an "actual move" occurred in the nth frame.
  • Alternatively, the first coordinate data in the (n+1)th frame of the image to be detected is compared with the mobile cache coordinate data: the position deviation data between them is calculated and compared with the second threshold.
  • When the position deviation data is greater than or equal to the second threshold, the first coordinate data in the (n+1)th frame of the image to be detected is compared with the first historical coordinate data to calculate the displacement data.
  • When this displacement data is less than the first threshold, the first coordinate data in the (n+1)th frame of the image to be detected is saved to the historical coordinate database as new historical coordinate data.
  • For example, when the coordinate difference between the first coordinate data in the (n+1)th frame and the mobile cache coordinate data on the horizontal axis is greater than or equal to 50 pixels, or the coordinate difference on the vertical axis is greater than or equal to 50 pixels, or both are, the displacement data between the first coordinate data in the (n+1)th frame and the first historical coordinate data is calculated.
  • When this displacement data is less than the preset first threshold, the first coordinate data in the (n+1)th frame is close to the historical coordinate data in the historical coordinate database, and the nth frame of the image to be detected was merely disturbed. The first coordinate data in the (n+1)th frame can therefore be saved to the historical coordinate database, the historical coordinate database updated, and the mobile cache coordinate database cleared. Weighted average processing is then performed on each historical coordinate data in the updated historical coordinate database, and the processing result is used as the precise coordinates of the audio collection device in the (n+1)th frame of the image to be detected.
  • In other words, the first coordinate data corresponding to the nth frame is first compared with the first historical coordinate data; when the resulting displacement data is greater than or equal to the first threshold, the first coordinate data of the (n+1)th frame is compared with the first historical coordinate data. If the displacement data obtained there is less than the first threshold, the nth frame was merely disturbed: the first coordinate data of the (n+1)th frame is saved to the historical coordinate database, each historical coordinate data in the updated database is weighted and averaged, and the processing result is used as the precise coordinates of the audio collection device in the (n+1)th frame.
  • If the first coordinate data of the (n+1)th frame is compared with the first historical coordinate data and the resulting displacement data is greater than or equal to the first threshold, the position deviation data between the first coordinate data of the (n+1)th frame and the mobile cache coordinate data is determined.
  • When that position deviation data is less than the second threshold, the historical coordinate database is cleared, and the mobile cache coordinate data and the first coordinate data in the (n+1)th frame are saved to the historical coordinate database as its historical coordinate data. Weighted average processing is then performed on each historical coordinate data in the updated database, and the processing result is used as the coordinates of the audio collection device in the (n+1)th frame. The overall update logic is sketched below.
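The sketch below is a hedged reconstruction of the two pipelines working together: the 50-pixel thresholds echo the examples in the description, while the pipeline capacities, the per-axis comparison, and the uniform (unweighted) average are illustrative choices:

```python
from collections import deque

import numpy as np

FIRST_THRESHOLD = 50   # pixels; example value from the description
SECOND_THRESHOLD = 50  # pixels; example value from the description

class MicrophoneTracker:
    """Illustrative historical-position pipeline / mobile-cache pipeline tracker."""

    def __init__(self):
        self.history = deque(maxlen=300)  # historical coordinate database
        self.cache = deque(maxlen=2)      # mobile cache coordinate database

    def _history_mean(self):
        # A plain mean stands in for the weighted average in the description.
        return np.mean(np.asarray(self.history, dtype=float), axis=0)

    def update(self, box):
        """Feed one frame's first coordinate data; returns the precise coordinates."""
        box = np.asarray(box, dtype=float)
        if not self.history:
            self.history.append(box)
            return tuple(box)
        hist_mean = self._history_mean()
        if self.cache and np.all(np.abs(box - self.cache[-1]) < SECOND_THRESHOLD):
            # "Actual move" confirmed: restart the history from the cached boxes.
            pending = list(self.cache) + [box]
            self.history.clear()
            self.history.extend(pending)
            self.cache.clear()
            return tuple(self._history_mean())
        if np.all(np.abs(box - hist_mean) < FIRST_THRESHOLD):
            # Close to the historical position: any pending "possible move"
            # was only a disturbance, so commit this frame to the history.
            self.history.append(box)
            self.cache.clear()
            return tuple(self._history_mean())
        # Large deviation: record a "possible move" and keep reporting the
        # historical position to preserve continuity across frames.
        self.cache.append(box)
        return tuple(hist_mean)
```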
  • the data capacity in the historical coordinate database and the mobile cache coordinate database can also be configured.
  • For example, the historical coordinate database (the historical position pipeline) can be configured to store up to 300 or 500 historical coordinate data, such as the historical coordinate data (x1_0, y1_0, x2_0, y2_0) in the historical position pipeline shown in FIG. 6C.
  • When the database is full, the one or more oldest historical coordinate data can be deleted, such as the first 10, 50, or 100 entries, or the historical coordinate database can be cleared periodically, so that new coordinate data can continue to be stored normally.
  • The mobile cache coordinate database (the mobile cache pipeline), as shown in FIG. 6C, can be configured to store at most 2 or 3 mobile cache coordinate data. This allows disturbed frames to be processed quickly and accurately, and avoids large errors in the subsequently calculated coordinates of the audio collection device caused by accumulated mobile cache coordinate data.
  • FIG. 6C shows the mobile cache coordinate data (x1_0, y1_0, x2_0, y2_0) in the mobile cache pipeline.
  • When the audio collection device is not recognized in k consecutive frames of the image to be detected and k is less than a third threshold, each historical coordinate data in the historical coordinate database may be subjected to weighted average processing to obtain fifth coordinate data, and the fifth coordinate data is used as the precise coordinates of the audio collection device in those frames, where k is a positive integer. For example, if the audio collection device is not recognized in three consecutive frames, the fifth coordinate data obtained by the weighted average is used as the precise coordinates of the audio collection device in those three frames.
  • When the audio collection device is not recognized in j consecutive frames and j is greater than or equal to the third threshold, alarm information indicating that there is no audio collection device in the image to be detected is generated, where j is a positive integer. For example, if the audio collection device is not recognized for five or six consecutive frames, it is considered that there is no audio collection device in the image to be detected; alarm information can be issued to remind the user, and target positioning for the image to be detected is temporarily terminated. A sketch of this fallback follows.
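It assumes the tracker sketched above and an illustrative third threshold of 5 frames, which the description leaves open:

```python
def coordinates_when_missing(tracker, missed_frames, third_threshold=5):
    """Reuse the historical average for short detection gaps; raise an alarm
    once the gap reaches the third threshold."""
    if missed_frames < third_threshold:
        return tuple(tracker._history_mean())
    raise RuntimeError(
        "Alarm: no audio collection device in the images to be detected")
```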
  • the audio collection device positioning method in the embodiments of the present application can be applied to products such as speaker recognition, for real-time detection and positioning of microphone devices or other audio collection devices in the environment. It can also be applied to other scenarios that need to identify specific targets.
  • In this way, the first coordinate data can be judged and confirmed in combination with the historical coordinate data of the audio collection device, thereby improving the accuracy and precision of the obtained coordinates and avoiding missed detections, false detections, or position drift.
  • Fig. 7 schematically shows an optional flowchart of the speaker recognition method provided by the embodiment of the present application.
  • The recognition method may be executed by a server, such as the server shown in FIG. 1A or FIG. 1B; or it may be executed by a terminal device, such as the terminal device shown in FIG. 1A or FIG. 1B; or it may be executed by the terminal device and the server together.
  • the speaker recognition method includes at least step S710 to step S740, which are described in detail as follows:
  • step S710 the image to be detected is acquired by the imaging device.
  • the image to be detected can be acquired by a camera device.
  • the imaging device may be a video camera, a digital camera, or a monitor, or it may be a built-in camera unit of the terminal device or a camera unit external to the terminal device.
  • a camera device is used to take photos or videos of an environment containing an audio collection device (such as a microphone), and then obtain each frame of the image to be detected.
  • step S720 face recognition processing is performed on the image to be detected to obtain at least one face coordinate.
  • A face recognition model can be used to perform face recognition on the image to be detected to obtain one or more face targets, and the coordinates of the center point of each face target are used as that target's face coordinates. For example, in the frame of the image to be detected corresponding to the scene shown in FIG. 8, after face recognition it is determined that the current frame contains 3 face targets (target 711, target 712, and target 713), and the coordinates of the center point of each face target are used as that target's face coordinates in the current frame.
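A hedged sketch of this step, with OpenCV's bundled Haar cascade standing in for the unspecified face recognition model; any detector that returns face boxes would serve:

```python
import cv2

# Illustrative detector choice; not the model the description prescribes.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_center_coordinates(gray_frame):
    """Return the center-point coordinates of each detected face target."""
    faces = face_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    return [(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in faces]
```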
  • step S730 the audio collection device in the image to be detected is identified, and the coordinates of the audio collection device are obtained.
  • The audio collection device positioning method described above can be used to identify the audio collection device (such as a microphone) in the image to be detected, so as to obtain the precise coordinates of the audio collection device in the current frame. For example, the scene shown in FIG. 8 includes a microphone device 721, and the coordinates of the center point of the microphone device 721 may be used as the precise coordinates of the microphone.
  • the detailed process of recognizing the image to be detected to obtain the precise coordinates of the audio capture device in the image to be detected has been described in detail in the above-mentioned embodiment, and will not be repeated here.
  • step S740 the distance between the coordinates of the audio collection device and the coordinates of each face is determined, and the object corresponding to the face coordinates with the smallest distance is determined as the speaker.
  • The distance between the coordinates of the audio collection device (such as a microphone) and each set of face coordinates can be calculated, and the object corresponding to the face coordinates with the smallest distance is determined as the speaker, as sketched below.
  • the distance between the microphone device 721 and the face target 713 is the smallest, and the object corresponding to the face target 713 can be determined as the speaker.
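A minimal sketch of this selection step; Euclidean distance is an illustrative choice, since the description only requires picking the face coordinates with the smallest distance:

```python
import numpy as np

def identify_speaker(mic_center, face_centers):
    """Return the index of the face whose center is closest to the microphone."""
    distances = [np.hypot(fx - mic_center[0], fy - mic_center[1])
                 for fx, fy in face_centers]
    return int(np.argmin(distances))
```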
  • In this way, the microphone and the faces in the image to be detected can be identified and positioned, and the positional relationship between the microphone and each face determined, which can effectively assist in locating the speaker from the visual perspective.
  • The following describes the audio collection device positioning apparatus provided in the embodiments of the present application, which can be configured to execute the audio collection device positioning method described above. For details not disclosed in the apparatus embodiments, please refer to the above description of the audio collection device positioning method.
  • FIG. 9 schematically shows an optional architecture diagram of the audio collection device positioning apparatus provided in an embodiment of the present application.
  • the audio collection device positioning device 900 includes: an image acquisition module 901, an image recognition module 902, and a coordinate calculation module 903.
  • the image acquisition module 901 is configured to acquire the image to be detected
  • the image recognition module 902 is configured to identify the audio acquisition device in the image to be detected to obtain the first coordinate data of the audio acquisition device
  • The coordinate calculation module 903 is configured to determine displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determine the coordinates of the audio collection device according to the displacement data.
  • In an embodiment, the image recognition module 902 is configured to: recognize the image to be detected to obtain recognition targets that conform to the audio collection device; and, when one recognition target is recognized, determine the coordinate data corresponding to that recognition target as the first coordinate data of the audio collection device.
  • In an embodiment, the image recognition module 902 is configured to: when multiple recognition targets are recognized, perform edge detection on the image to be detected to determine, among the multiple recognition targets, the recognition target set on the bracket device; and determine the coordinate data corresponding to the recognition target on the bracket device as the first coordinate data of the audio collection device.
  • In an embodiment, the coordinate calculation module 903 is configured to: perform weighted average processing on each historical coordinate data of the audio collection device in the preset historical coordinate database to obtain first historical coordinate data; and compare the first coordinate data with the first historical coordinate data to obtain the displacement data.
  • In an embodiment, the coordinate calculation module 903 is configured to: when the displacement data is less than the first threshold, save the first coordinate data of the audio collection device to the historical coordinate database as historical coordinate data; and perform weighted average processing on each historical coordinate data in the historical coordinate database to obtain second coordinate data, and determine the second coordinate data as the coordinates of the audio collection device.
  • In an embodiment, the image to be detected is the nth frame of the image to be detected, and the coordinate calculation module 903 is configured to determine, when the displacement data is greater than or equal to the first threshold, the first historical coordinate data as the coordinates of the audio collection device in the nth frame of the image to be detected.
  • In an embodiment, the coordinate calculation module 903 is configured to: save the first coordinate data of the nth frame of the image to be detected to a preset mobile cache coordinate database as mobile cache coordinate data; compare the first coordinate data in the (n+1)th frame of the image to be detected with the mobile cache coordinate data, or compare the first coordinate data in the (n+1)th frame with the first historical coordinate data; and determine the coordinates of the audio collection device in the (n+1)th frame according to the comparison result.
  • In an embodiment, the coordinate calculation module 903 is configured to: determine the position deviation data between the first coordinate data in the (n+1)th frame of the image to be detected and the mobile cache coordinate data; when the position deviation data is less than the second threshold, clear the historical coordinate database, and save the mobile cache coordinate data and the first coordinate data in the (n+1)th frame to the historical coordinate database as its historical coordinate data; and perform weighted average processing on each historical coordinate data in the historical coordinate database to obtain third coordinate data, and determine the third coordinate data as the coordinates of the audio collection device in the (n+1)th frame.
  • In an embodiment, the coordinate calculation module 903 is configured to: determine the position deviation data between the first coordinate data in the (n+1)th frame of the image to be detected and the mobile cache coordinate data; when the position deviation data is greater than or equal to the second threshold, compare the first coordinate data in the (n+1)th frame with the first historical coordinate data to obtain the displacement data; when the displacement data corresponding to the (n+1)th frame is less than the first threshold, save the first coordinate data in the (n+1)th frame to the historical coordinate database as historical coordinate data; perform weighted average processing on each historical coordinate data in the historical coordinate database to obtain fourth coordinate data, and determine the fourth coordinate data as the coordinates of the audio collection device in the (n+1)th frame; and clear the mobile cache coordinate database.
  • In an embodiment, the coordinate calculation module 903 is configured to: compare the first coordinate data in the (n+1)th frame of the image to be detected with the first historical coordinate data to obtain the displacement data; when the displacement data corresponding to the (n+1)th frame is greater than or equal to the first threshold, determine the position deviation data between the first coordinate data in the (n+1)th frame and the mobile cache coordinate data; when the position deviation data is less than the second threshold, clear the historical coordinate database, and save the mobile cache coordinate data and the first coordinate data in the (n+1)th frame to the historical coordinate database as its historical coordinate data; and perform weighted average processing on each historical coordinate data in the historical coordinate database to obtain third coordinate data, and determine the third coordinate data as the coordinates of the audio collection device in the (n+1)th frame.
  • In an embodiment, the coordinate calculation module 903 is configured to: compare the first coordinate data in the (n+1)th frame of the image to be detected with the first historical coordinate data to obtain the displacement data; when the displacement data corresponding to the (n+1)th frame is less than the first threshold, save the first coordinate data in the (n+1)th frame to the historical coordinate database as historical coordinate data; perform weighted average processing on each historical coordinate data in the historical coordinate database to obtain fourth coordinate data, and determine the fourth coordinate data as the coordinates of the audio collection device in the (n+1)th frame; and clear the mobile cache coordinate database.
  • In an embodiment, the audio collection device positioning apparatus 900 further includes a first no-target processing module, configured to, when the audio collection device is not recognized in k consecutive frames of the image to be detected and k is less than the third threshold, perform weighted average processing on each historical coordinate data of the audio collection device in the preset historical coordinate database to obtain fifth coordinate data, and determine the fifth coordinate data as the coordinates of the audio collection device in the image to be detected, where k is a positive integer.
  • In an embodiment, the audio collection device positioning apparatus 900 further includes a second no-target processing module, configured to, when the audio collection device is not recognized in j consecutive frames of the image to be detected and j is greater than or equal to the third threshold, generate alarm information indicating that there is no audio collection device in the image to be detected, where j is a positive integer.
  • In an embodiment, the image recognition module 902 is configured to: perform convolution processing on the image to be detected to obtain a feature map; classify each feature point in the feature map to determine candidate regions; perform pooling processing on the candidate regions to obtain candidate feature maps; and fully connect the candidate feature maps to obtain the first coordinate data of the audio collection device.
  • FIG. 10 schematically shows an optional architecture diagram of the speaker recognition system provided by an embodiment of the present application.
  • the identification system provided by the embodiment of the present application includes: a camera device 1001 and an electronic device 1002.
  • the camera device 1001 is configured to obtain the image to be detected;
  • The electronic device 1002 is connected to the camera device 1001 and includes a storage device and a processor, where the storage device is configured to store one or more programs; when the one or more programs are executed by the processor, the processor is caused to implement the speaker recognition method provided in the embodiments of the present application, so as to process the image to be detected and obtain the speaker.
  • FIG. 11 shows an optional structural schematic diagram of a computer system suitable for implementing the electronic equipment of the embodiments of the present application.
  • The computer system 1100 includes a central processing unit (CPU) 1101, which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from the storage part 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for system operation.
  • the CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other through a bus 1104.
  • An input/output (Input/Output, I/O) interface 1105 is also connected to the bus 1104.
  • The following components are connected to the I/O interface 1105: an input part 1106 including a keyboard, a mouse, etc.; an output part 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage part 1108 including a hard disk, etc.; and a communication part 1109 including a network interface card such as a LAN (Local Area Network) card, a modem, etc.
  • the communication section 1109 performs communication processing via a network such as the Internet.
  • the driver 1110 is also connected to the I/O interface 1105 as needed.
  • a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1110 as needed, so that the computer program read from it is installed into the storage portion 1108 as needed.
  • In particular, the process described above with reference to the flowcharts can be implemented as a computer software program.
  • the embodiments of the present application include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program includes program code configured to execute the method shown in the flowchart.
  • The computer program may be downloaded and installed from the network through the communication part 1109, and/or installed from the removable medium 1111. When the computer program is executed by the central processing unit (CPU) 1101, the various functions defined in the system of the embodiments of the present application are executed.
  • the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein.
  • This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or part of code, and the above-mentioned module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram or flowchart, and combinations of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present application can be implemented in software or hardware, and the described units can also be provided in a processor. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves.
  • the embodiments of the present application also provide a computer-readable medium.
  • the computer-readable medium may be included in the electronic device described in the above embodiments, or it may exist separately without being assembled into the electronic device.
  • the foregoing computer-readable medium carries one or more programs, and when the foregoing one or more programs are executed by an electronic device, the electronic device is caused to implement the method in the foregoing embodiment.
  • although several modules or units of the device for action execution are mentioned in the detailed description above, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • the exemplary embodiments described herein can be implemented by software, or by software combined with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) execute the method according to the embodiments of the present application.
  • the first coordinate data of the audio collection device is obtained by identifying the audio collection device in the image to be detected; displacement data is determined according to the first coordinate data and historical coordinate data of the audio collection device; and the coordinates of the audio collection device are then determined according to the displacement data.
  • the correctness of the first coordinate data can be judged in combination with the historical coordinate data, and the coordinate data can be optimized, which improves the accuracy of the obtained coordinates and makes the method applicable to various application scenarios of audio collection device positioning.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Remote Sensing (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

This application provides an audio collection device positioning method, an audio collection device positioning apparatus and an electronic device, a speaker recognition method and system, and a computer-readable storage medium. The method includes: acquiring an image to be detected; identifying an audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device; determining displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determining the coordinates of the audio collection device in the image to be detected according to the displacement data. This application can judge the correctness of the first coordinate data in combination with historical coordinate data and optimize the coordinate data, further improving the accuracy of the obtained coordinates of the audio collection device.

Description

AUDIO COLLECTION DEVICE POSITIONING METHOD AND APPARATUS, AND SPEAKER RECOGNITION METHOD AND SYSTEM
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to Chinese Patent Application No. 201910523416.4, filed on June 17, 2019, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of image processing technologies, and in particular to an audio collection device positioning method, an audio collection device positioning apparatus and an electronic device, a speaker recognition method and a speaker recognition system, and a computer-readable storage medium.
BACKGROUND
With the rapid development of audio processing technology, speaker recognition is widely used in many areas of daily life. Speaker recognition often requires accurate positioning of the microphone device.
In the solutions provided by the related art, microphone detection is usually performed by a deep-learning-based method: the image to be detected is input into a deep learning model, and the output of the deep learning model is taken as the final detection result.
SUMMARY
An embodiment of this application provides an audio collection device positioning method, executed by an electronic device, the audio collection device positioning method including:
acquiring an image to be detected;
identifying an audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device;
determining displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determining the coordinates of the audio collection device in the image to be detected according to the displacement data.
An embodiment of this application provides an audio collection device positioning apparatus, including: an image acquisition module, configured to acquire an image to be detected; an image recognition module, configured to identify an audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device; and a coordinate calculation module, configured to determine displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determine the coordinates of the audio collection device in the image to be detected according to the displacement data.
An embodiment of this application provides an electronic device, including: one or more processors; and a storage apparatus, configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the audio collection device positioning method provided in the embodiments of this application, or the speaker recognition method provided in the embodiments of this application.
An embodiment of this application provides a speaker recognition method, executed by an electronic device, the speaker recognition method including:
acquiring an image to be detected through a camera device;
performing face recognition processing on the image to be detected, to obtain at least one face coordinate;
identifying an audio collection device in the image to be detected, to obtain the coordinates of the audio collection device;
determining the distance between the coordinates of the audio collection device and each face coordinate, and determining the object corresponding to the face coordinate with the smallest distance as the speaker.
An embodiment of this application provides a speaker recognition system, including:
a camera device, configured to acquire an image to be detected;
an electronic device, connected to the camera device, the electronic device including a storage apparatus and a processor, the storage apparatus being configured to store one or more programs which, when executed by the processor, cause the processor to implement the speaker recognition method described in the embodiments of this application, so as to process the image to be detected to obtain the speaker.
An embodiment of this application provides a computer-readable storage medium storing one or more programs which, when executed by a processor, cause the processor to implement the audio collection device positioning method provided in the embodiments of this application, or the speaker recognition method provided in the embodiments of this application.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of this application and, together with the specification, serve to explain the principles of the embodiments of this application. Apparently, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art may derive other drawings from them without creative effort. In the drawings:
FIG. 1A is an optional schematic architectural diagram of an exemplary system to which the technical solutions of the embodiments of this application can be applied;
FIG. 1B is an optional schematic architectural diagram of an exemplary system to which the technical solutions of the embodiments of this application can be applied;
FIG. 2 is an optional schematic flowchart of an audio collection device positioning method provided by an embodiment of this application;
FIG. 3 is an optional schematic flowchart of recognizing an image to be detected, provided by an embodiment of this application;
FIG. 4 is an optional schematic diagram of a detection result of an image to be detected, provided by an embodiment of this application;
FIG. 5 is an optional schematic diagram of a detection result obtained after performing edge detection on an image to be detected, provided by an embodiment of this application;
FIG. 6A is an optional schematic flowchart of an audio collection device positioning method provided by an embodiment of this application;
FIG. 6B is an optional schematic flowchart of an audio collection device positioning method provided by an embodiment of this application;
FIG. 6C is an optional schematic diagram of a history coordinate database and a moving cache coordinate database provided by an embodiment of this application;
FIG. 7 is an optional schematic flowchart of a speaker recognition method provided by an embodiment of this application;
FIG. 8 is an optional schematic diagram of a detection result containing a face recognition result and an audio collection device recognition result, provided by an embodiment of this application;
FIG. 9 is an optional schematic architectural diagram of an audio collection device positioning apparatus provided by an embodiment of this application;
FIG. 10 is an optional schematic architectural diagram of a speaker recognition system provided by an embodiment of this application;
FIG. 11 is an optional schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of this application.
DETAILED DESCRIPTION
Example implementations will now be described more fully with reference to the accompanying drawings. However, the example implementations can be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these implementations are provided so that the embodiments of this application will be more thorough and complete, and the concepts of the example implementations will be fully conveyed to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of this application. However, those skilled in the art will recognize that the technical solutions of the embodiments of this application can be practiced without one or more of the specific details, or with other methods, components, apparatuses, steps, and so on. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the embodiments of this application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the drawings are merely illustrative: they need not include all contents and operations/steps, nor be executed in the described order. For example, some operations/steps may be decomposed, and some may be combined or partially combined, so the actual execution order may change according to the actual situation.
In the following description, "multiple" means "at least two". Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for describing the embodiments of this application and are not intended to limit this application.
In recent years, artificial intelligence technology has developed rapidly; from traditional machine learning to today's deep learning, AI technology is widely applied in many fields. Likewise, deep learning is widely applied in the field of microphone detection: the image to be detected is input into a deep learning model, and the output of the deep learning model is taken directly as the final detection result. However, the solutions provided by the related art have at least the following two defects: (1) for the deep learning model to achieve good performance, a large number of annotated samples are needed to train it, and both sample collection and annotation require a large investment of labor and time; (2) the accuracy of microphone detection is low — especially for small-target positioning tasks or complex environments, missed detections, detection position offsets, and false detections easily occur, and it is difficult to determine why the deep learning model exhibits such anomalies.
In view of the problems in the related art, this application provides an audio collection device positioning method, an audio collection device positioning apparatus and an electronic device, a speaker recognition method and a speaker recognition system, and a computer-readable storage medium, which can quickly and accurately identify the coordinates of an audio collection device and accurately recognize the speaker. Detailed descriptions follow.
FIG. 1A shows an optional schematic architectural diagram of an exemplary system to which the technical solutions of the embodiments of this application can be applied.
As shown in FIG. 1A, the system architecture 100 may include terminal devices (such as the smartphone 101, tablet computer 102, and portable computer 103 shown in FIG. 1A), a network 104, and a server 105. The network 104 is a medium providing a communication link between the terminal devices and the server 105, and may include various connection types, such as wired and wireless communication links.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1A are merely illustrative; there may be any number of terminal devices, networks, and servers according to implementation needs. In some embodiments, the server 105 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The types of terminal devices are not limited to the smartphone, tablet computer, and portable computer shown in FIG. 1A; they may also be, for example, desktop computers, cameras, smart speakers, and smart watches.
In some embodiments, a user may use the terminal device 101 (or the terminal device 102 or 103) to acquire an image to be detected and send it to the server 105. After receiving the image to be detected sent by the terminal device 101, the server 105 can perform image recognition on it, identify the audio collection device in the image to be detected, and obtain first coordinate data of the audio collection device; it then calculates displacement data according to the first coordinate data and historical coordinate data of the audio collection device, so as to calculate the coordinates of the audio collection device in the image to be detected according to the displacement data. On one hand, the embodiments of this application can use image recognition technology to accurately determine the unique audio collection device in the image to be detected, avoiding errors where multiple targets appear in the image; on the other hand, the correctness of the first coordinate data can be judged in combination with historical coordinate data, and the coordinate data can be optimized, further improving the accuracy of the finally determined coordinates of the audio collection device.
For ease of understanding, the application of the system architecture 100 is illustrated with an actual scenario. As shown in FIG. 1B, in a scenario where multiple people are speaking, the terminal device 101 (or the terminal device 102 or 103) may be used to acquire the image to be detected. For the image to be detected, the face targets in it (such as the face targets 106, 107, and 108 shown in FIG. 1B) and the audio collection device (such as the microphone 109 shown in FIG. 1B) are identified. Then, the distance between the coordinates of the audio collection device and the face coordinates of each face target is determined, and the face target corresponding to the face coordinate with the smallest distance is determined as the speaker. In FIG. 1B, the face target 108 closest to the microphone 109 is determined as the speaker. After the speaker is determined, further operations can be performed according to the actual application scenario. For example, in a multi-person video conference, the camera can automatically follow the identified speaker so that the image in the shot is enlarged and displayed in the graphical interface of the terminal device, for example displaying the face target 108 in the graphical interface; in addition, after the speaker is identified, the position of the microphone array can be automatically adjusted so that the microphone array faces the speaker, to capture clearer speech.
It should be noted that the audio collection device positioning method provided by the embodiments of this application may be executed by the server 105, and accordingly the audio collection device positioning apparatus may be provided in the server 105. In some embodiments, however, a terminal device may also have functions similar to the server and thus execute the audio collection device positioning method provided by the embodiments of this application. The same applies to the speaker recognition method provided by the embodiments of this application.
FIG. 2 schematically shows an optional flowchart of the audio collection device positioning method used for microphone recognition, provided by an embodiment of this application. The positioning method may be executed by a server (such as the server 105 shown in FIG. 1A or FIG. 1B), by a terminal device (such as the terminal device 101, 102, or 103 shown in FIG. 1A or FIG. 1B), or jointly by a terminal device and a server. Referring to FIG. 2, the audio collection device positioning method includes at least steps S210 to S240, described in detail below:
In step S210, an image to be detected is acquired.
In some embodiments, the image to be detected can be acquired through the terminal device 101 (or the terminal device 102 or 103). Specifically, the terminal device 101 may be a camera, a video recorder, or a similar device, or photos or videos may be taken through a shooting unit built into the terminal device 101 or a shooting apparatus externally connected to it, to acquire each frame of the image to be detected; alternatively, the terminal device 101 may be connected to a data network, and video or image data may be acquired by browsing or downloading network resources or local database resources, to obtain each frame of the image to be detected. The embodiments of this application do not limit the specific manner of acquiring the image to be detected.
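For illustration only (not part of the original disclosure), the following is a minimal sketch of this frame-acquisition step, assuming OpenCV reads frames from a camera or a video file; the device index 0 is an illustrative choice.

```python
# A minimal sketch of step S210: yield each frame as an image to be detected.
import cv2

def frames(source=0):
    """Yield frames from a camera index or a video file path."""
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            yield frame
    finally:
        cap.release()
```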
In step S220, the audio collection device in the image to be detected is identified, to obtain first coordinate data of the audio collection device.
In some embodiments, the audio collection device may be a microphone, a mobile phone, or another device with audio collection and audio amplification functions in each frame of the image to be detected; the embodiments of this application take the audio collection device being a microphone as an example. In the embodiments of this application, after the image to be detected is obtained, feature extraction may be performed on it through a pre-trained target recognition model; the audio collection device is then identified and located according to the extracted feature information, and its coordinate data is obtained as the first coordinate data. In the embodiments of this application, the image to be detected may be input into the target recognition model for recognition; the target recognition model may be a machine learning model, for example a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, or a Faster Region-Convolutional Neural Network (Faster RCNN) model, which the embodiments of this application do not specifically limit.
In some embodiments, FIG. 3 shows a schematic flowchart of recognizing the image to be detected using the target recognition model, taking a Faster RCNN-based target recognition model as an example. As shown in FIG. 3: In step S301, convolution processing is performed on the image to be detected to obtain a feature map; the input image may be a picture of any size, and convolutional layers extract its feature map. In step S302, the feature points in the feature map are classified to determine candidate regions; a Region Proposal Network (RPN) is mainly used to generate region proposals: a certain number of anchor boxes can be generated on the feature map obtained above, the anchors are judged as foreground or background, and the anchor boxes are corrected, thereby obtaining relatively accurate candidate regions. In step S303, pooling is performed on the candidate regions to obtain candidate feature maps; candidate feature maps of fixed size are obtained using the candidate regions generated by the RPN and the feature map. In step S304, full-connection processing is performed on the candidate feature maps to obtain the first coordinate data of the audio collection device; a fully connected operation is performed on the fixed-size candidate feature maps formed by the pooling layer, classification into specific categories is performed, and a regression operation is performed at the same time, thereby obtaining the first coordinate data of the audio collection device in the image to be detected.
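As an illustration of this detection step only, the following sketch uses torchvision's pretrained Faster R-CNN as a stand-in for the patent's trained microphone detector; MICROPHONE_LABEL and SCORE_THRESHOLD are illustrative assumptions, not values from the patent.

```python
# A hedged sketch of step S220: Faster R-CNN candidate detection.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

MICROPHONE_LABEL = 1      # hypothetical class id for "microphone"
SCORE_THRESHOLD = 0.5     # hypothetical confidence cut-off

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_microphones(image):
    """Return (x1, y1, x2, y2) boxes whose class and score qualify."""
    with torch.no_grad():
        output = model([to_tensor(image)])[0]
    boxes = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if label.item() == MICROPHONE_LABEL and score.item() >= SCORE_THRESHOLD:
            boxes.append(tuple(box.tolist()))
    return boxes
```

In practice the detector would be fine-tuned on annotated microphone images; the sketch only shows the shape of the inference step, which may return one box or several.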
In some embodiments, after the image to be detected is input into the target recognition model for audio collection device recognition, the target recognition model may return one piece of microphone coordinate data (microphone coordinate data being the coordinate data corresponding to a recognition target) or multiple pieces of microphone coordinate data; the microphone coordinate data may be the center coordinates corresponding to the center position of the identified microphone target (a microphone target corresponding to a recognition target). For example, the frame of the image to be detected shown in FIG. 4 contains two recognition results: recognition box 410 and recognition box 420. Two microphone targets are identified in this frame; the microphone target corresponding to recognition box 410 is correct, while the one corresponding to recognition box 420 is a false detection — a person's arm is mistakenly identified as a microphone target.
In the embodiments of this application, edge detection may be performed on the image to be detected to identify the stand apparatus supporting the microphone target in the image, thereby eliminating false detections; the stand apparatus may be a microphone stand. For example, after edge detection is performed on the image to be detected shown in FIG. 4, the detection result shown in FIG. 5 can be obtained; FIG. 5 contains only one recognition box 510, and the coordinate data corresponding to recognition box 510 is the first coordinate data.
An embodiment of this application provides the flowchart shown in FIG. 6A. Taking the audio collection device being a microphone as an example, when multiple microphone targets matching a microphone are identified in the image to be detected, edge detection is performed on the image to be detected to extract its edge information; since the microphone stand is an edge, it will be detected. It is then judged whether the edge information of a microphone stand is detected below each of the identified microphone targets, so that among the multiple microphone targets, the one with a microphone stand below it is confirmed as the unique correct microphone.
An embodiment of this application also provides the flowchart shown in FIG. 6B, again taking the audio collection device being a microphone as an example. After the image to be detected is obtained, the microphone coordinate data in it is detected by a microphone detector while edge features are extracted from it by an edge extractor; the microphone detector is the target recognition model, the process of extracting edge features is edge detection, and the edge features are the edge information. If microphone coordinate data of only one microphone target is obtained, that microphone coordinate data is taken as the first coordinate data; if microphone coordinate data of multiple microphone targets is obtained, the microphone target with a microphone stand below it is determined as the unique correct microphone based on the obtained edge features.
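The following is a hedged sketch of this stand-based disambiguation, assuming OpenCV is used as the edge extractor; the band geometry and the edge-density threshold are illustrative assumptions, not values from the patent.

```python
# A sketch of the disambiguation step: keep the candidate box with a
# microphone stand (an edge-dense band) directly below it.
import cv2

EDGE_DENSITY_THRESHOLD = 0.05  # hypothetical fraction of edge pixels

def has_stand_below(gray_image, box, band_height=60):
    """Check whether a band just below the box contains enough edge pixels."""
    x1, y1, x2, y2 = [int(v) for v in box]
    h, w = gray_image.shape
    band = gray_image[min(y2, h - 1):min(y2 + band_height, h), x1:x2]
    if band.size == 0:
        return False
    edges = cv2.Canny(band, 50, 150)   # edge information of the band
    return (edges > 0).mean() >= EDGE_DENSITY_THRESHOLD

def pick_unique_microphone(gray_image, boxes):
    """Among candidate boxes, keep the one supported by a stand below it."""
    supported = [b for b in boxes if has_stand_below(gray_image, b)]
    return supported[0] if len(supported) == 1 else None
```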
In step S230, displacement data is determined according to the first coordinate data and historical coordinate data of the audio collection device, and the coordinates of the audio collection device are determined according to the displacement data.
In some embodiments, a history coordinate database and a moving cache coordinate database may be pre-configured. The history coordinate database may be configured to store the coordinates of the audio collection device in historical frames of the image to be detected, that is, all position records of the microphone before an "actual move" occurs; the moving cache coordinate database may be configured to store the moving cache coordinate data corresponding to the recognition target in the current image to be detected, that is, position records where the microphone has "possibly moved".
In some embodiments, after the unique first coordinate data of the audio collection device is obtained, the first coordinate data can be precisely corrected.
For example, if the current image to be detected is the n-th frame, the precise coordinates of the audio collection device in the preceding frames m, m+1, ..., n-1 — a total of n-m frames — can be stored in the history coordinate database as historical coordinate data, where n and m are positive integers and n > m. In the embodiments of this application, the displacement data can be calculated from the first coordinate data and the historical coordinate data of the audio collection device. First, weighted average processing is performed on the historical coordinate data in the history coordinate database, to obtain first historical coordinate data. Then, the first coordinate data of the audio collection device in the image to be detected is compared with the first historical coordinate data, to obtain the displacement data; for example, the differences between the first coordinate data and the first historical coordinate data on the horizontal and vertical axes can be calculated in the same coordinate system, to obtain the displacement data.
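The following minimal helpers sketch this computation; the recency-weighting scheme is an assumption for illustration (the description only specifies "weighted average processing"). The later pipeline sketch reuses these two helpers.

```python
# Weighted average of the history coordinate database and per-axis displacement.
import numpy as np

def weighted_average(history):
    """history: sequence of (x1, y1, x2, y2) records, oldest first."""
    coords = np.asarray(history, dtype=float)
    weights = np.arange(1, len(coords) + 1, dtype=float)  # newer frames weigh more
    return tuple(np.average(coords, axis=0, weights=weights))

def displacement(first_coord, first_history_coord):
    """Per-axis absolute differences between detection and history average."""
    return np.abs(np.asarray(first_coord) - np.asarray(first_history_coord))
```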
In some embodiments, when the current image to be detected is the n-th frame, after the displacement data is obtained, it can be compared with a preset first threshold, as shown in FIG. 6A. First, when the displacement data is less than the first threshold, the first coordinate data of the audio collection device in the image to be detected is saved into the history coordinate database, as new historical coordinate data in the history coordinate database. Then, weighted average processing is performed on the historical coordinate data in the updated history coordinate database, to obtain second coordinate data, and the second coordinate data is taken as the precise coordinates of the audio collection device in the n-th frame of the image to be detected. Meanwhile, the moving cache coordinate database can also be cleared.
For example, when the displacement data is less than the first threshold, the microphone position in the n-th frame is close to its historical position; for example, if the coordinate differences between the first coordinate data and the first historical coordinate data on both the horizontal and vertical axes are less than 50 pixels, the microphone is considered not to have moved. As shown in FIG. 6B, the microphone position is considered to have undergone no large disturbance, and the first coordinate data (x1, y1, x2, y2) is saved into the history coordinate database (the history position pipe), thereby updating the history coordinate database, where x denotes a horizontal-axis value, y denotes a vertical-axis value, and (x1, y1, x2, y2) corresponds to the form of the recognition box. Weighted average processing is then performed on the historical coordinate data in the updated history coordinate database, and the result is taken as the precise coordinates of the audio collection device in the n-th frame, shown as (x1_, y1_, x2_, y2_) in FIG. 6B. Meanwhile, the moving cache database can be cleared, deleting the moving cache coordinate data in it.
In some embodiments, if the current image to be detected is the n-th frame, the displacement data is compared with the preset first threshold, as shown in FIG. 6A. When the displacement data corresponding to the n-th frame is greater than or equal to the first threshold, the position of the audio collection device in the n-th frame deviates considerably from its historical position, and the audio collection device in the n-th frame has "possibly moved" (the "large position disturbance" in FIG. 6B). In this case the first coordinate data of the n-th frame is saved into the moving cache coordinate database (the moving cache pipe in FIG. 6B), as moving cache coordinate data in the moving cache coordinate database. Meanwhile, the first historical coordinate data can be configured as the precise coordinates of the audio collection device in the n-th frame, to ensure the continuity of target recognition across the images to be detected, shown as the precise coordinates (x1_, y1_, x2_, y2_) in FIG. 6B. Here, the displacement data being greater than or equal to the first threshold may mean that the coordinate difference between the first coordinate data and the first historical coordinate data on the horizontal axis is greater than or equal to 50 pixels, that the difference on the vertical axis is greater than or equal to 50 pixels, or that the differences on both axes are greater than or equal to 50 pixels.
Then, the (n+1)-th frame of the image to be detected can be acquired and the first coordinate data of the audio collection device in it determined; this first coordinate data is compared with the moving cache coordinate data, and according to the comparison result it is judged whether a further comparison with the first historical coordinate data is needed — thereby judging the correctness of the moving cache coordinate data and determining the precise coordinates of the audio collection device in the (n+1)-th frame.
In some embodiments, the first coordinate data in the (n+1)-th frame is compared with the moving cache coordinate data; for example, position deviation data between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data is calculated and compared with a second threshold. When the position deviation data is less than the second threshold, the history coordinate database is cleared, and the moving cache coordinate data in the moving cache coordinate database together with the first coordinate data in the (n+1)-th frame are saved into the history coordinate database, as historical coordinate data in the history coordinate database. Then, weighted average processing is performed on the historical coordinate data in the updated history coordinate database, to obtain third coordinate data, and the third coordinate data is taken as the precise coordinates of the audio collection device in the (n+1)-th frame.
When the position deviation data is less than the preset second threshold — for example, in the same coordinate system, the coordinate differences between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data on both the horizontal and vertical axes are less than 50 pixels — the position of the audio collection device in the (n+1)-th frame is close to its position in the n-th frame, so an "actual move" is considered to have occurred at frame n. The history coordinate database can then be cleared, and the first coordinate data of the (n+1)-th frame and the moving cache coordinate data are saved into the history coordinate database, completing the update of the history coordinate database; at this point the history coordinate database stores the first coordinate data of the audio collection device in the n-th and (n+1)-th frames. Weighted average processing can then be performed on the historical coordinate data in the updated history coordinate database, and the result taken as the precise coordinates of the audio collection device in the (n+1)-th frame.
In some embodiments, the first coordinate data in the (n+1)-th frame is compared with the moving cache coordinate data; for example, position deviation data between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data is calculated and compared with the second threshold. When the position deviation data is greater than or equal to the second threshold, the first coordinate data in the (n+1)-th frame is compared with the first historical coordinate data to calculate displacement data. Then, when this displacement data is less than the first threshold, the first coordinate data in the (n+1)-th frame is saved into the history coordinate database, as new historical coordinate data in the history coordinate database. Weighted average processing is performed on the historical coordinate data in the updated history coordinate database, to obtain fourth coordinate data, and the fourth coordinate data is taken as the precise coordinates of the audio collection device in the (n+1)-th frame. Meanwhile, the moving cache coordinate database is cleared.
When the position deviation data is greater than or equal to the preset second threshold — for example, in the same coordinate system, the coordinate difference between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data is greater than or equal to 50 pixels on the horizontal axis, or greater than or equal to 50 pixels on the vertical axis, or greater than or equal to 50 pixels on both axes — the displacement data between the first coordinate data in the (n+1)-th frame and the first historical coordinate data is calculated. If this displacement data is less than the preset first threshold, the first coordinate data in the (n+1)-th frame is close to the historical coordinate data in the history coordinate database, and frame n was a disturbance; the first coordinate data of the (n+1)-th frame can then be saved into the history coordinate database, updating the history coordinate database, and the moving cache coordinate database cleared. Weighted average processing is then performed on the historical coordinate data in the updated history coordinate database, and the result taken as the precise coordinates of the audio collection device in the (n+1)-th frame.
In some embodiments, when the first coordinate data corresponding to the n-th frame is compared with the first historical coordinate data and the resulting displacement data is greater than or equal to the first threshold, the first coordinate data of the (n+1)-th frame can also be compared with the first historical coordinate data. If the resulting displacement data is less than the first threshold, frame n was a disturbance; the first coordinate data of the (n+1)-th frame is saved into the history coordinate database, weighted average processing is performed on the historical coordinate data in the updated history coordinate database, and the result is taken as the precise coordinates of the audio collection device in the (n+1)-th frame.
Conversely, if the displacement data obtained by comparing the first coordinate data of the (n+1)-th frame with the first historical coordinate data is greater than or equal to the first threshold, the position deviation data between the first coordinate data of the (n+1)-th frame and the moving cache coordinate data is determined. When the position deviation data is less than the second threshold, the history coordinate database is cleared, and the moving cache coordinate data and the first coordinate data in the (n+1)-th frame are saved into the history coordinate database, as historical coordinate data in the history coordinate database. Weighted average processing is then performed on the historical coordinate data in the updated history coordinate database, and the result is taken as the coordinates of the audio collection device in the (n+1)-th frame.
In some embodiments, the data capacities of the history coordinate database and the moving cache coordinate database can also be configured. For example, as shown in FIG. 6C, the history coordinate database (the history position pipe) may be configured to store at most 300 or 500 pieces of historical coordinate data, such as the (x1_0, y1_0, x2_0, y2_0) in the history position pipe shown in FIG. 6C. When the number of historical coordinate data in the history coordinate database reaches the upper limit and new coordinate data needs to be saved, the earliest one or more historical coordinate data can be deleted — for example the first 10, 50, or 100 — or the history coordinate database can be cleared on a fixed time period, so that coordinate data can continue to be stored in the history coordinate database normally. The moving cache coordinate database (the moving cache pipe), as shown in FIG. 6C, can be configured to store at most 2 or 3 pieces of moving cache coordinate data, so that each disturbed frame can be processed quickly and accurately, avoiding the accumulation of moving cache coordinate data that would cause large errors in the calculation of the audio collection device's coordinates in subsequent frames. For example, FIG. 6C shows the moving cache coordinate data (x1_0, y1_0, x2_0, y2_0) in the moving cache pipe. A sketch that strings the whole update procedure together follows below.
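The following is a hedged sketch of the history / moving-cache update logic described above, reusing the weighted_average and displacement helpers from the earlier sketch. The 50-pixel thresholds and the pipe capacities follow the examples in this description; the deque-based pipes are an illustrative assumption, not the patent's implementation.

```python
# A sketch of the per-frame coordinate update with the two "pipes".
from collections import deque

FIRST_THRESHOLD = 50    # pixels, per the example above
SECOND_THRESHOLD = 50   # pixels, per the example above

history = deque(maxlen=500)      # history position pipe
moving_cache = deque(maxlen=3)   # moving cache pipe

def update(first_coord):
    """Return the precise coordinates for the current frame."""
    if not history:                      # bootstrap: trust the first detection
        history.append(first_coord)
        return first_coord
    avg = weighted_average(history)      # first historical coordinate data
    if moving_cache:
        cached = moving_cache[-1]
        if all(d < SECOND_THRESHOLD for d in displacement(first_coord, cached)):
            # "actual move" confirmed: rebuild history from the cached position
            history.clear()
            history.extend([cached, first_coord])
            moving_cache.clear()
            return weighted_average(history)
    if all(d < FIRST_THRESHOLD for d in displacement(first_coord, avg)):
        # no large disturbance: fold the detection into history
        history.append(first_coord)
        moving_cache.clear()
        return weighted_average(history)
    # possible move: cache the detection and keep reporting the history average
    moving_cache.append(first_coord)
    return avg
```

Reporting the history average while a move is only cached is what preserves continuity of the reported position until the move is either confirmed or dismissed.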
In some embodiments, when no audio collection device is identified in the current frame of the image to be detected, in order to ensure the continuity of target recognition, weighted average processing can be performed on the historical coordinate data in the history coordinate database, to obtain fifth coordinate data, and the fifth coordinate data is taken as the precise coordinates of the audio collection device in the image to be detected.
In some embodiments, when no audio collection device is identified in k consecutive frames of the image to be detected and k is less than a third threshold, weighted average processing is performed on the historical coordinate data in the history coordinate database, to obtain fifth coordinate data, and the fifth coordinate data is taken as the precise coordinates of the audio collection device in the image to be detected, where k is a positive integer. For example, if no audio collection device is identified in two or three consecutive frames, weighted average processing is performed on the historical coordinate data in the history coordinate database to obtain fifth coordinate data, and the fifth coordinate data is taken as the precise coordinates of the audio collection device in those consecutive frames.
In some embodiments, when no audio collection device is identified in j consecutive frames of the image to be detected and j is greater than or equal to the third threshold, alarm information indicating that no audio collection device exists in the image to be detected is generated, where j is a positive integer. For example, if no audio collection device is identified in five or six consecutive frames, it is considered that no audio collection device exists in the image to be detected; alarm information can be issued to prompt the user, and target positioning of the image to be detected is temporarily ended.
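The following sketch extends the update() sketch above (history, weighted_average, and update are reused from it) with the missed-detection fallback just described; the third threshold of 5 frames is an illustrative assumption drawn from the "five or six consecutive frames" example.

```python
# A sketch of the missed-detection fallback and alarm.
THIRD_THRESHOLD = 5
missed_frames = 0

def on_frame(first_coord):
    """first_coord is None when no audio collection device was detected."""
    global missed_frames
    if first_coord is None:
        missed_frames += 1
        if missed_frames >= THIRD_THRESHOLD or not history:
            return None                        # raise the no-device alarm
        return weighted_average(history)       # fifth coordinate data
    missed_frames = 0
    return update(first_coord)
```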
The audio collection device positioning method in the embodiments of this application can be applied in products such as speaker recognition, for real-time detection and positioning of microphone devices or other audio collection devices in the environment. It can also be applied in other scenarios where a specific target needs to be identified.
According to the audio collection device positioning method in the embodiments of this application, after a unique piece of first coordinate data of the audio collection device is determined using image recognition technology, the first coordinate data can be judged and confirmed in combination with the historical coordinate data of the audio collection device, thereby improving the precision and accuracy of the obtained coordinates of the audio collection device and avoiding missed detections, false detections, or position offsets.
FIG. 7 schematically shows an optional flowchart of the speaker recognition method provided by an embodiment of this application. The recognition method may be executed by a server, such as the server shown in FIG. 1A or FIG. 1B; by a terminal device, such as the terminal device shown in FIG. 1A or FIG. 1B; or jointly by a terminal device and a server. Referring to FIG. 7, the speaker recognition method includes at least steps S710 to S740, described in detail below:
In step S710, an image to be detected is acquired through a camera device.
In some embodiments, the image to be detected can be acquired through a camera device. For example, the camera device may be a video camera, a digital camera, a monitor, or a similar device, or a shooting unit built into a terminal device or externally connected to it. The camera device photographs or records the environment containing the audio collection device (for example, a microphone), thereby acquiring each frame of the image to be detected.
In step S720, face recognition processing is performed on the image to be detected, to obtain at least one face coordinate.
In some embodiments, after the image to be detected is obtained, face recognition can be performed on it using a face recognition model, obtaining one or more face targets in the image to be detected, and the coordinates of the center point of each face target are taken as its face coordinates. For example, in the frame of the image to be detected corresponding to the scenario shown in FIG. 8, after face recognition it is determined that the current frame contains three face targets (target 711, target 712, and target 713), and the center-point coordinates of each face target are taken as its face coordinates in the current frame.
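For illustration only, the following sketch uses OpenCV's bundled Haar cascade as a stand-in for the face recognition model; any face detector that returns boxes would do.

```python
# A hedged sketch of step S720: face centers from a Haar-cascade detector.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_centers(image):
    """Return the center point of each detected face box."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in boxes]
```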
In step S730, the audio collection device in the image to be detected is identified, to obtain the coordinates of the audio collection device.
In some embodiments, while face recognition is performed on the image to be detected, the above audio collection device positioning method can also be used to identify the audio collection device (such as a microphone) in the image to be detected, thereby obtaining the precise coordinates of the audio collection device in the current frame. For example, the scenario shown in FIG. 8 contains a microphone device 721, and the center-point coordinates of the microphone device 721 can be taken as the precise coordinates of the microphone. The detailed process of recognizing the image to be detected to obtain the precise coordinates of the audio collection device has been described in the above embodiments and is not repeated here.
In step S740, the distance between the coordinates of the audio collection device and each face coordinate is determined, and the object corresponding to the face coordinate with the smallest distance is determined as the speaker.
In some embodiments, after the precise coordinates of the microphone device in the current frame and the face coordinates are obtained, the distance between the coordinates of the audio collection device (such as a microphone) and each face coordinate can be calculated, and the object corresponding to the face coordinate with the smallest distance is determined as the speaker. For example, in the scenario shown in FIG. 8, after calculation the distance between the microphone device 721 and the face target 713 is the smallest, so the object corresponding to the face target 713 can be determined as the speaker.
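A minimal sketch of step S740 follows; Euclidean distance is an illustrative assumption, since the description only requires a distance between the two coordinates.

```python
# Pick the face whose center is closest to the microphone center.
import math

def identify_speaker(mic_center, face_centers):
    """mic_center: (x, y); face_centers: dict mapping face id -> (x, y)."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return min(face_centers, key=lambda fid: dist(mic_center, face_centers[fid]))

# Usage: in the FIG. 8 scenario, identify_speaker(mic, {"711": c1, "712": c2,
# "713": c3}) would return "713", the face closest to the microphone.
```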
According to the speaker recognition method in the embodiments of this application, the microphone and the faces in the image to be detected can be identified and located, and the positional relationship between the microphone and each face in the image determined, which can effectively assist in locating the speaker from a visual perspective.
The following introduces the audio collection device positioning apparatus provided by the embodiments of this application, which can be configured to execute the audio collection device positioning method provided by the embodiments of this application. For details not disclosed in the audio collection device positioning apparatus, refer to the above description of the audio collection device positioning method provided by the embodiments of this application.
FIG. 9 schematically shows an optional architectural diagram of the audio collection device positioning apparatus provided by an embodiment of this application.
Referring to FIG. 9, the audio collection device positioning apparatus 900 includes: an image acquisition module 901, an image recognition module 902, and a coordinate calculation module 903. The image acquisition module 901 is configured to acquire an image to be detected; the image recognition module 902 is configured to identify the audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device; the coordinate calculation module 903 is configured to determine displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determine the coordinates of the audio collection device according to the displacement data.
In some embodiments, the image recognition module 902 is configured to: recognize the image to be detected, to obtain recognition targets matching an audio collection device; and when one recognition target is obtained, determine the coordinate data corresponding to the recognition target as the first coordinate data of the audio collection device.
In some embodiments, the image recognition module 902 is configured to: when multiple recognition targets are obtained, perform edge detection on the image to be detected, to determine, among the multiple recognition targets, the recognition target disposed on a stand apparatus; and determine the coordinate data corresponding to the recognition target disposed on the stand apparatus as the first coordinate data of the audio collection device.
In some embodiments, the coordinate calculation module 903 is configured to: perform weighted average processing on the historical coordinate data of the audio collection device in a preset history coordinate database, to obtain first historical coordinate data; and compare the first coordinate data of the audio collection device with the first historical coordinate data, to obtain displacement data.
In some embodiments, the coordinate calculation module 903 is configured to: when the displacement data is less than a first threshold, save the first coordinate data of the audio collection device into the history coordinate database, as historical coordinate data in the history coordinate database; and perform weighted average processing on the historical coordinate data in the history coordinate database, to obtain second coordinate data, determining the second coordinate data as the coordinates of the audio collection device.
In some embodiments, the image to be detected is the n-th frame of the image to be detected; the coordinate calculation module 903 is configured to: when the displacement data is greater than or equal to the first threshold, determine the first historical coordinate data as the coordinates of the audio collection device in the n-th frame of the image to be detected, where n is a positive integer.
In some embodiments, the coordinate calculation module 903 is configured to: save the first coordinate data of the n-th frame into a preset moving cache coordinate database, as moving cache coordinate data; compare the first coordinate data in the (n+1)-th frame with the moving cache coordinate data, or compare the first coordinate data in the (n+1)-th frame with the first historical coordinate data, and determine the coordinates of the audio collection device in the (n+1)-th frame according to the comparison result.
In some embodiments, the coordinate calculation module 903 is configured to: determine position deviation data between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data; when the position deviation data is less than a second threshold, clear the history coordinate database, and save the moving cache coordinate data and the first coordinate data in the (n+1)-th frame into the history coordinate database, as historical coordinate data in the history coordinate database; and perform weighted average processing on the historical coordinate data in the history coordinate database, to obtain third coordinate data, determining the third coordinate data as the coordinates of the audio collection device in the (n+1)-th frame.
In some embodiments, the coordinate calculation module 903 is configured to: determine position deviation data between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data; when the position deviation data is greater than or equal to the second threshold, compare the first coordinate data in the (n+1)-th frame with the first historical coordinate data, to obtain displacement data; when the displacement data corresponding to the (n+1)-th frame is less than the first threshold, save the first coordinate data in the (n+1)-th frame into the history coordinate database, as historical coordinate data in the history coordinate database; perform weighted average processing on the historical coordinate data in the history coordinate database, to obtain fourth coordinate data, determining the fourth coordinate data as the coordinates of the audio collection device in the (n+1)-th frame; and clear the moving cache coordinate database.
In some embodiments, the coordinate calculation module 903 is configured to: compare the first coordinate data in the (n+1)-th frame with the first historical coordinate data, to obtain displacement data; when the displacement data corresponding to the (n+1)-th frame is greater than or equal to the first threshold, determine position deviation data between the first coordinate data in the (n+1)-th frame and the moving cache coordinate data; when the position deviation data is less than the second threshold, clear the history coordinate database, and save the moving cache coordinate data and the first coordinate data in the (n+1)-th frame into the history coordinate database, as historical coordinate data in the history coordinate database; and perform weighted average processing on the historical coordinate data in the history coordinate database, to obtain third coordinate data, determining the third coordinate data as the coordinates of the audio collection device in the (n+1)-th frame.
In some embodiments, the coordinate calculation module 903 is configured to: compare the first coordinate data in the (n+1)-th frame with the first historical coordinate data, to obtain displacement data; when the displacement data corresponding to the (n+1)-th frame is less than the first threshold, save the first coordinate data in the (n+1)-th frame into the history coordinate database, as historical coordinate data in the history coordinate database; perform weighted average processing on the historical coordinate data in the history coordinate database, to obtain fourth coordinate data, determining the fourth coordinate data as the coordinates of the audio collection device in the (n+1)-th frame; and clear the moving cache coordinate database.
In some embodiments, the audio collection device positioning apparatus 900 further includes: a first no-target processing module, configured to, when no audio collection device is identified in k consecutive frames of the image to be detected and k is less than a third threshold, perform weighted average processing on the historical coordinate data of the audio collection device in the preset history coordinate database, to obtain fifth coordinate data, and determine the fifth coordinate data as the coordinates of the audio collection device in the image to be detected, where k is a positive integer.
In some embodiments, the audio collection device positioning apparatus 900 further includes: a second no-target processing module, configured to, when no audio collection device is identified in j consecutive frames of the image to be detected and j is greater than or equal to the third threshold, generate alarm information indicating that no audio collection device exists in the image to be detected, where j is a positive integer.
In some embodiments, the image recognition module 902 is configured to: perform convolution processing on the image to be detected, to obtain a feature map; classify the feature points in the feature map, to determine candidate regions; perform pooling processing on the candidate regions, to obtain candidate feature maps; and perform full-connection processing on the candidate feature maps, to obtain the first coordinate data of the audio collection device.
FIG. 10 schematically shows an optional architectural diagram of the speaker recognition system provided by an embodiment of this application. Referring to FIG. 10, the recognition system provided by the embodiments of this application includes: a camera device 1001 and an electronic device 1002.
The camera device 1001 is configured to acquire an image to be detected; the electronic device 1002 is connected to the camera device and includes a storage apparatus and a processor, the storage apparatus being configured to store one or more programs which, when executed by the processor, cause the processor to implement the speaker recognition method provided by the embodiments of this application, so as to process the image to be detected to obtain the speaker.
FIG. 11 shows an optional schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of this application.
It should be noted that the computer system 1100 of the electronic device shown in FIG. 11 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in FIG. 11, the computer system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage part 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for system operation. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input part 1106 including a keyboard, a mouse, etc.; an output part 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and speakers; a storage part 1108 including a hard disk, etc.; and a communication part 1109 including a network interface card such as a LAN (Local Area Network) card, a modem, etc. The communication part 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it can be installed into the storage part 1108 as needed.
In particular, according to the embodiments of this application, the process described below with reference to the flowchart can be implemented as a computer software program. For example, the embodiments of this application include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code configured to execute the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1109, and/or installed from the removable medium 1111. When the computer program is executed by the central processing unit (CPU) 1101, the various functions defined in the system of the embodiments of this application are executed.
It should be noted that the computer-readable medium shown in the embodiments of this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of this application, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the embodiments of this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or part of code, which contains one or more executable instructions for realizing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of this application may be implemented in software or in hardware, and the described units may also be provided in a processor. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves.
As another aspect, the embodiments of this application also provide a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods in the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the detailed description above, this division is not mandatory. In fact, according to the implementations of this application, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units.
Through the description of the above implementations, those skilled in the art will readily understand that the example implementations described here may be implemented by software, or by software combined with necessary hardware. Therefore, the technical solutions according to the implementations of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the implementations of this application.
Those skilled in the art will readily think of other implementations of this application after considering the specification and practicing the application disclosed here. This application is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in this application.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
Industrial Applicability
In the embodiments of this application, first coordinate data of an audio collection device is obtained by identifying the audio collection device in an image to be detected; displacement data is determined according to the first coordinate data and historical coordinate data of the audio collection device; and the coordinates of the audio collection device are then determined according to the displacement data. The correctness of the first coordinate data can be judged in combination with historical coordinate data, and the coordinate data optimized, improving the accuracy of the obtained coordinates; this is applicable to various application scenarios of audio collection device positioning.

Claims (20)

  1. An audio collection device positioning method, executed by an electronic device, the audio collection device positioning method comprising:
    acquiring an image to be detected;
    identifying an audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device;
    determining displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determining coordinates of the audio collection device according to the displacement data.
  2. The audio collection device positioning method according to claim 1, wherein the identifying an audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device, comprises:
    recognizing the image to be detected, to obtain a recognition target matching an audio collection device;
    when one recognition target is obtained through recognition, determining coordinate data corresponding to the recognition target as the first coordinate data of the audio collection device.
  3. The audio collection device positioning method according to claim 2, further comprising:
    when multiple recognition targets are obtained through recognition, performing edge detection on the image to be detected, to determine, among the multiple recognition targets, a recognition target disposed on a stand apparatus;
    determining coordinate data corresponding to the recognition target disposed on the stand apparatus as the first coordinate data of the audio collection device.
  4. The audio collection device positioning method according to claim 1, wherein the determining displacement data according to the first coordinate data and historical coordinate data of the audio collection device comprises:
    performing weighted average processing on each piece of historical coordinate data of the audio collection device in a preset history coordinate database, to obtain first historical coordinate data;
    comparing the first coordinate data of the audio collection device with the first historical coordinate data, to obtain displacement data.
  5. The audio collection device positioning method according to claim 4, wherein the determining coordinates of the audio collection device according to the displacement data comprises:
    when the displacement data is less than a first threshold, saving the first coordinate data of the audio collection device into the history coordinate database, as historical coordinate data in the history coordinate database;
    performing weighted average processing on each piece of historical coordinate data in the history coordinate database, to obtain second coordinate data, and
    determining the second coordinate data as the coordinates of the audio collection device.
  6. The audio collection device positioning method according to claim 4, wherein the image to be detected is an n-th frame of image to be detected;
    the determining coordinates of the audio collection device according to the displacement data comprises:
    when the displacement data is greater than or equal to a first threshold, determining the first historical coordinate data as the coordinates of the audio collection device in the n-th frame of image to be detected;
    wherein n is a positive integer.
  7. The audio collection device positioning method according to claim 6, further comprising:
    saving the first coordinate data of the n-th frame of image to be detected into a preset moving cache coordinate database, as moving cache coordinate data;
    comparing first coordinate data in an (n+1)-th frame of image to be detected with the moving cache coordinate data, or comparing the first coordinate data in the (n+1)-th frame of image to be detected with the first historical coordinate data, and
    determining the coordinates of the audio collection device in the (n+1)-th frame of image to be detected according to a comparison result.
  8. The audio collection device positioning method according to claim 7, wherein
    the comparing first coordinate data in an (n+1)-th frame of image to be detected with the moving cache coordinate data comprises:
    determining position deviation data between the first coordinate data in the (n+1)-th frame of image to be detected and the moving cache coordinate data;
    the determining the coordinates of the audio collection device in the (n+1)-th frame of image to be detected according to a comparison result comprises:
    when the position deviation data is less than a second threshold, clearing the history coordinate database, and
    saving the moving cache coordinate data and the first coordinate data in the (n+1)-th frame of image to be detected into the history coordinate database, as historical coordinate data in the history coordinate database;
    performing weighted average processing on each piece of historical coordinate data in the history coordinate database, to obtain third coordinate data, and
    determining the third coordinate data as the coordinates of the audio collection device in the (n+1)-th frame of image to be detected.
  9. The audio collection device positioning method according to claim 7, wherein
    the comparing first coordinate data in an (n+1)-th frame of image to be detected with the moving cache coordinate data comprises:
    determining position deviation data between the first coordinate data in the (n+1)-th frame of image to be detected and the moving cache coordinate data;
    the determining the coordinates of the audio collection device in the (n+1)-th frame of image to be detected according to a comparison result comprises:
    when the position deviation data is greater than or equal to a second threshold, comparing the first coordinate data in the (n+1)-th frame of image to be detected with the first historical coordinate data, to obtain displacement data;
    when the displacement data corresponding to the (n+1)-th frame of image to be detected is less than the first threshold, saving the first coordinate data in the (n+1)-th frame of image to be detected into the history coordinate database, as historical coordinate data in the history coordinate database;
    performing weighted average processing on each piece of historical coordinate data in the history coordinate database, to obtain fourth coordinate data, and
    determining the fourth coordinate data as the coordinates of the audio collection device in the (n+1)-th frame of image to be detected.
  10. The audio collection device positioning method according to claim 9, further comprising:
    clearing the moving cache coordinate database.
  11. The audio collection device positioning method according to claim 7, wherein
    the comparing the first coordinate data in the (n+1)-th frame of image to be detected with the first historical coordinate data comprises:
    comparing the first coordinate data in the (n+1)-th frame of image to be detected with the first historical coordinate data, to obtain displacement data;
    the determining the coordinates of the audio collection device in the (n+1)-th frame of image to be detected according to a comparison result comprises:
    when the displacement data corresponding to the (n+1)-th frame of image to be detected is greater than or equal to the first threshold, determining position deviation data between the first coordinate data in the (n+1)-th frame of image to be detected and the moving cache coordinate data;
    when the position deviation data is less than a second threshold, clearing the history coordinate database, and
    saving the moving cache coordinate data and the first coordinate data in the (n+1)-th frame of image to be detected into the history coordinate database, as historical coordinate data in the history coordinate database;
    performing weighted average processing on each piece of historical coordinate data in the history coordinate database, to obtain third coordinate data, and
    determining the third coordinate data as the coordinates of the audio collection device in the (n+1)-th frame of image to be detected.
  12. The audio collection device positioning method according to claim 7, wherein
    the comparing the first coordinate data in the (n+1)-th frame of image to be detected with the first historical coordinate data comprises:
    comparing the first coordinate data in the (n+1)-th frame of image to be detected with the first historical coordinate data, to obtain displacement data;
    the determining the coordinates of the audio collection device in the (n+1)-th frame of image to be detected according to a comparison result comprises:
    when the displacement data corresponding to the (n+1)-th frame of image to be detected is less than the first threshold, saving the first coordinate data in the (n+1)-th frame of image to be detected into the history coordinate database, as historical coordinate data in the history coordinate database;
    performing weighted average processing on each piece of historical coordinate data in the history coordinate database, to obtain fourth coordinate data, and
    determining the fourth coordinate data as the coordinates of the audio collection device in the (n+1)-th frame of image to be detected.
  13. The audio collection device positioning method according to claim 12, further comprising:
    clearing the moving cache coordinate database.
  14. The audio collection device positioning method according to claim 1, further comprising:
    when the audio collection device is not identified in k consecutive frames of the image to be detected and k is less than a third threshold, performing weighted average processing on each piece of historical coordinate data of the audio collection device in a preset history coordinate database, to obtain fifth coordinate data, and
    determining the fifth coordinate data as the coordinates of the audio collection device in the image to be detected; wherein k is a positive integer.
  15. The audio collection device positioning method according to claim 1, further comprising:
    when the audio collection device is not identified in j consecutive frames of the image to be detected and j is greater than or equal to a third threshold, generating alarm information indicating that the audio collection device does not exist in the image to be detected; wherein j is a positive integer.
  16. An audio collection device positioning apparatus, comprising:
    an image acquisition module, configured to acquire an image to be detected;
    an image recognition module, configured to identify an audio collection device in the image to be detected, to obtain first coordinate data of the audio collection device;
    a coordinate calculation module, configured to determine displacement data according to the first coordinate data and historical coordinate data of the audio collection device, and determine coordinates of the audio collection device in the image to be detected according to the displacement data.
  17. An electronic device, comprising:
    one or more processors;
    a storage apparatus, configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the audio collection device positioning method according to any one of claims 1 to 15.
  18. A speaker recognition method based on the audio collection device positioning method according to any one of claims 1 to 15, executed by an electronic device, the speaker recognition method comprising:
    acquiring an image to be detected through a camera device;
    performing face recognition processing on the image to be detected, to obtain at least one face coordinate;
    identifying an audio collection device in the image to be detected, to obtain coordinates of the audio collection device;
    determining a distance between the coordinates of the audio collection device and each face coordinate, and
    determining an object corresponding to the face coordinate with the smallest distance as the speaker.
  19. A speaker recognition system, comprising:
    a camera device, configured to acquire an image to be detected;
    an electronic device, connected to the camera device, the electronic device comprising a storage apparatus and a processor, wherein the storage apparatus is configured to store one or more programs which, when executed by the processor, cause the processor to implement the speaker recognition method according to claim 18, so as to process the image to be detected to obtain the speaker.
  20. A computer-readable storage medium storing one or more programs which, when executed by a processor, cause the processor to implement the audio collection device positioning method according to any one of claims 1 to 15, or the speaker recognition method according to claim 18.
PCT/CN2020/095640 2019-06-17 2020-06-11 Audio collection device positioning method and apparatus, and speaker recognition method and system WO2020253616A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP20825730.3A EP3985610A4 (en) 2019-06-17 2020-06-11 METHOD AND APPARATUS FOR POSITIONING AUDIO COLLECTION DEVICE, AND METHOD AND SPEAKER RECOGNITION SYSTEM
US17/377,316 US11915447B2 (en) 2019-06-17 2021-07-15 Audio acquisition device positioning method and apparatus, and speaker recognition method and system
US18/410,404 US20240153137A1 (en) 2019-06-17 2024-01-11 Microphone detection using historical and cache coordinate databases

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910523416.4 2019-06-17
CN201910523416.4A CN110335313B (zh) 2019-06-17 2019-06-17 Audio collection device positioning method and apparatus, and speaker recognition method and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/377,316 Continuation US11915447B2 (en) 2019-06-17 2021-07-15 Audio acquisition device positioning method and apparatus, and speaker recognition method and system

Publications (1)

Publication Number Publication Date
WO2020253616A1 true WO2020253616A1 (zh) 2020-12-24

Family

ID=68142083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/095640 WO2020253616A1 (zh) 2019-06-17 2020-06-11 Audio collection device positioning method and apparatus, and speaker recognition method and system

Country Status (4)

Country Link
US (2) US11915447B2 (zh)
EP (1) EP3985610A4 (zh)
CN (2) CN110660102B (zh)
WO (1) WO2020253616A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110660102B (zh) 2019-06-17 2020-10-27 腾讯科技(深圳)有限公司 Speaker recognition method, apparatus, and system based on artificial intelligence
CN112420057B (zh) * 2020-10-26 2022-05-03 四川长虹电器股份有限公司 Voiceprint recognition method, apparatus, device, and storage medium based on distance coding
CN112487978B (zh) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and apparatus for locating a speaker in a video, and computer storage medium
CN113487609B (zh) * 2021-09-06 2021-12-07 北京字节跳动网络技术有限公司 Tissue cavity positioning method and apparatus, readable medium, and electronic device
CN115272302B (zh) * 2022-09-23 2023-03-14 杭州申昊科技股份有限公司 Method for detecting parts in an image, and part detection device and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230358A (zh) * 2017-10-27 2018-06-29 北京市商汤科技开发有限公司 Target tracking and neural network training methods, apparatuses, storage medium, and electronic device
CN108876858A (zh) * 2018-07-06 2018-11-23 北京字节跳动网络技术有限公司 Method and apparatus for processing images
CN109559347A (zh) * 2018-11-28 2019-04-02 中南大学 Object recognition method, apparatus, system, and storage medium
US20190108647A1 (en) * 2017-10-10 2019-04-11 The Boeing Company Systems and methods for 3d cluster recognition for relative tracking
JP2019066293A (ja) 2017-09-29 2019-04-25 株式会社Nttドコモ Video display system
CN110335313A (zh) * 2019-06-17 2019-10-15 腾讯科技(深圳)有限公司 Audio collection device positioning method and apparatus, and speaker recognition method and system

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5148669B2 (ja) * 2000-05-26 2013-02-20 本田技研工業株式会社 Position detection apparatus, position detection method, and position detection program
US6801850B1 (en) * 2000-10-30 2004-10-05 University Of Illionis - Chicago Method and system for tracking moving objects
US20030154084A1 (en) * 2002-02-14 2003-08-14 Koninklijke Philips Electronics N.V. Method and system for person identification using video-speech matching
JP2007147762A (ja) * 2005-11-24 2007-06-14 Fuji Xerox Co Ltd Speaker prediction apparatus and speaker prediction method
US8170280B2 (en) * 2007-12-03 2012-05-01 Digital Smiths, Inc. Integrated systems and methods for video-based object modeling, recognition, and tracking
TWI507047B (zh) * 2010-08-24 2015-11-01 Hon Hai Prec Ind Co Ltd 麥克風控制系統及方法
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
JP5685177B2 (ja) * 2011-12-12 2015-03-18 本田技研工業株式会社 Information transmission system
CN103377561B (zh) * 2013-07-30 2015-06-17 甘永伦 Vehicle positioning system, method, and apparatus
CN104703090B (zh) * 2013-12-05 2018-03-20 北京东方正龙数字技术有限公司 Sound pickup device with automatic adjustment based on face recognition, and automatic adjustment method
US9729865B1 (en) * 2014-06-18 2017-08-08 Amazon Technologies, Inc. Object detection and tracking
CN105812721A (zh) * 2014-12-30 2016-07-27 浙江大华技术股份有限公司 Tracking and monitoring method and tracking and monitoring device
CN107534725B (zh) * 2015-05-19 2020-06-16 华为技术有限公司 Voice signal processing method and apparatus
CN106292732A (zh) * 2015-06-10 2017-01-04 上海元趣信息技术有限公司 Intelligent robot rotation method based on sound source localization and face detection
US10178301B1 (en) * 2015-06-25 2019-01-08 Amazon Technologies, Inc. User identification based on voice and face
JP6614611B2 (ja) * 2016-02-29 2019-12-04 Kddi株式会社 Apparatus, program, and method for tracking an object taking inter-image similarity into account
CN107820037B (zh) * 2016-09-14 2021-03-26 中兴通讯股份有限公司 Audio signal and image processing method, apparatus, and system
CN107167140B (zh) * 2017-05-26 2019-11-08 江苏大学 Method for suppressing accumulated error in UAV visual positioning
US10395385B2 (en) * 2017-06-27 2019-08-27 Qualcomm Incorporated Using object re-identification in video surveillance
CN107809596A (zh) * 2017-11-15 2018-03-16 重庆科技学院 Video conference tracking system and method based on a microphone array
CN108229308A (zh) * 2017-11-23 2018-06-29 北京市商汤科技开发有限公司 Target object recognition method, apparatus, storage medium, and electronic device
CN108152788A (zh) * 2017-12-22 2018-06-12 西安Tcl软件开发有限公司 Sound source tracking method, sound source tracking device, and computer-readable storage medium
US11615623B2 (en) * 2018-02-19 2023-03-28 Nortek Security & Control Llc Object detection in edge devices for barrier operation and parcel delivery
CN108460787B (zh) * 2018-03-06 2020-11-27 北京市商汤科技开发有限公司 Target tracking method and apparatus, electronic device, program, and storage medium
CN108482421B (zh) * 2018-03-21 2020-04-28 南京城铁信息技术有限公司 Rail displacement and creep detection system for continuous welded track
US11651589B2 (en) * 2018-05-07 2023-05-16 Google Llc Real time object detection and tracking
CN108734733B (zh) * 2018-05-17 2022-04-26 东南大学 Speaker localization and recognition method based on a microphone array and a binocular camera
US10902263B1 (en) * 2018-06-26 2021-01-26 Amazon Technologies, Inc. Image processing system for object identification
US11010905B2 (en) * 2018-09-07 2021-05-18 Apple Inc. Efficient object detection and tracking
CN108831483A (zh) * 2018-09-07 2018-11-16 马鞍山问鼎网络科技有限公司 Artificial intelligence speech recognition system
KR101995294B1 (ko) * 2018-12-24 2019-07-03 (주)제이엘케이인스펙션 Image analysis apparatus and method
CN109887525B (zh) 2019-01-04 2023-04-07 平安科技(深圳)有限公司 Intelligent customer service method, apparatus, and computer-readable storage medium
US11948312B2 (en) * 2019-04-17 2024-04-02 Nec Corporation Object detection/tracking device, method, and program recording medium
TWI711007B (zh) * 2019-05-02 2020-11-21 緯創資通股份有限公司 調整感興趣區域的方法與其運算裝置
US11128793B2 (en) * 2019-05-03 2021-09-21 Cisco Technology, Inc. Speaker tracking in auditoriums

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019066293A (ja) * 2017-09-29 2019-04-25 株式会社Nttドコモ Video display system
US20190108647A1 (en) * 2017-10-10 2019-04-11 The Boeing Company Systems and methods for 3d cluster recognition for relative tracking
CN108230358A (zh) * 2017-10-27 2018-06-29 北京市商汤科技开发有限公司 Target tracking and neural network training methods, apparatuses, storage medium, and electronic device
CN108876858A (zh) * 2018-07-06 2018-11-23 北京字节跳动网络技术有限公司 Method and apparatus for processing images
CN109559347A (zh) * 2018-11-28 2019-04-02 中南大学 Object recognition method, apparatus, system, and storage medium
CN110335313A (zh) * 2019-06-17 2019-10-15 腾讯科技(深圳)有限公司 Audio collection device positioning method and apparatus, and speaker recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3985610A4 *

Also Published As

Publication number Publication date
CN110660102B (zh) 2020-10-27
US20240153137A1 (en) 2024-05-09
US20210343042A1 (en) 2021-11-04
CN110660102A (zh) 2020-01-07
EP3985610A1 (en) 2022-04-20
EP3985610A4 (en) 2022-07-27
CN110335313B (zh) 2022-12-09
US11915447B2 (en) 2024-02-27
CN110335313A (zh) 2019-10-15

Similar Documents

Publication Publication Date Title
WO2020253616A1 (zh) Audio collection device positioning method and apparatus, and speaker recognition method and system
WO2020056903A1 (zh) Method and apparatus for generating information
CN111292420B (zh) Method and apparatus for constructing a map
CN109670444B (zh) Posture detection model generation, and posture detection method, apparatus, device, and medium
CN111222509B (zh) Target detection method and apparatus, and electronic device
CN111783626A (zh) Image recognition method and apparatus, electronic device, and storage medium
CN110781823A (zh) Screen recording detection method and apparatus, readable medium, and electronic device
US20230336878A1 (en) Photographing mode determination method and apparatus, and electronic device and storage medium
CN111368657A (zh) Cattle face recognition method and apparatus
WO2019214019A1 (zh) Online teaching method and apparatus based on convolutional neural network
CN114898154A (zh) Incremental target detection method, apparatus, device, and medium
WO2024099068A1 (zh) Image-based speed determination method, apparatus, device, and storage medium
CN111310595A (zh) Method and apparatus for generating information
WO2022194130A1 (zh) Character position correction method and apparatus, electronic device, and storage medium
CN115100536B (zh) Building recognition method and apparatus, electronic device, and computer-readable medium
WO2023020268A1 (zh) Gesture recognition method, apparatus, device, and medium
CN110781809A (zh) Recognition method and apparatus based on registered feature updating, and electronic device
CN113762017B (zh) Action recognition method, apparatus, device, and storage medium
CN114489903A (zh) Interface element positioning method and apparatus, storage medium, and electronic device
CN113033552B (zh) Text recognition method and apparatus, and electronic device
CN115393755A (zh) Visual target tracking method, apparatus, device, and storage medium
CN111680754B (zh) Image classification method and apparatus, electronic device, and computer-readable storage medium
CN111401182B (zh) Image detection method and apparatus for feeding pens
CN110263743B (zh) Method and apparatus for recognizing images
CN114120423A (zh) Face image detection method and apparatus, electronic device, and computer-readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20825730

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020825730

Country of ref document: EP

Effective date: 20220117