CN110767226A - Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal


Info

Publication number
CN110767226A
Authority
CN
China
Prior art keywords
sound source
voice
voiceprint
voice signal
speaker
Prior art date
Legal status
Granted
Application number
CN201911048283.6A
Other languages
Chinese (zh)
Other versions
CN110767226B (en)
Inventor
周辉
高鑫
任亚敏
邓朋朋
王之帅
王宇飞
Current Assignee
Shanxi Jiansheng Technology Co Ltd
Original Assignee
Shanxi Jiansheng Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanxi Jiansheng Technology Co Ltd
Priority to CN201911048283.6A
Publication of CN110767226A
Application granted
Publication of CN110767226B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S 5/22 - Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G06V 40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G10L 17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Abstract

The invention discloses a sound source positioning method and device with high accuracy, comprising the following steps: collecting a sound signal; judging whether a voice signal exists in the sound signal; extracting all voice signals and acquiring the sound source position of each voice signal; performing voiceprint recognition on each voice signal one by one; judging whether the identified voiceprint features are stored in a voiceprint database; acquiring image information of the sound source position where the voice signal corresponding to the voiceprint features is located; performing model training by a machine self-learning method, determining the speaker corresponding to the voiceprint features and the speaker's identity information, and storing the corresponding voiceprint features and identity information in the voiceprint database; and displaying the sound source position information of the voice signal corresponding to the voiceprint features together with the identity information of the corresponding speaker. The invention can accurately locate the position of the speaker and match the speaker's identity with the content of the speech; the method is suitable for the field of voice recognition.

Description

Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal
Technical Field
The invention relates to the technical field of voice recognition, in particular to a sound source positioning method and device with high accuracy, a voice recognition method and system with high pertinence, a storage device, and a terminal.
Background
As one of the most important technologies in the field of information technology, speech recognition has been widely applied in fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. When speech recognition is performed, a microphone-array-based algorithm is usually used for sound source localization.
Generally, sound source localization algorithms based on a microphone array fall into three categories: those based on beamforming, those based on high-resolution spectral estimation, and those based on the time difference of arrival (TDOA). (1) Beamforming based on maximum output power. The technical idea is to weight and sum the signals collected by each array element to form a beam, steer the beam by searching over possible sound source positions, and adjust the weights so that the output signal power of the microphone array is maximized. The method can be used in either the time domain or the frequency domain; a time shift in the time domain is equivalent to a phase delay in the frequency domain. In frequency-domain processing, a matrix containing the auto-spectra and cross-spectra, called the Cross-Spectral Matrix (CSM), is first constructed. At each frequency of interest, processing the array signal gives the energy level at each spatial scanning grid point or each direction of arrival (DOA), so the array output represents a set of summed responses associated with the distribution of sound sources. The method is suitable for large microphone arrays and adapts well to different test environments. (2) Methods based on high-resolution spectral estimation include the auto-regressive (AR) model, minimum-variance (MV) spectral estimation, and eigenvalue decomposition methods (e.g., the MUSIC algorithm), all of which compute the correlation matrix of the spatial spectrum from the signals acquired by the microphone array. In theory these methods can estimate the direction of the sound source effectively; in practice, obtaining the desired precision requires a large amount of computation and rests on many assumptions. When the array is large, spectral estimation becomes computationally expensive and sensitive to environmental noise, which easily leads to inaccurate localization, so this approach is rarely used in modern large sound source localization systems. (3) TDOA-based sound source localization generally proceeds in two steps: first, the time difference of arrival is estimated to obtain the delays (TDOA) between array elements of the microphone array; then the position of the sound source is determined from the obtained time differences combined with the known spatial positions of the microphones. The amount of computation is generally smaller than that of the first two methods, which makes it more suitable for real-time processing, but its localization accuracy and anti-interference capability are weaker; it is suited to the near field, a single sound source, and non-repetitive signals such as speech. The Kinect microphone array of the Microsoft Xbox 360 (a one-dimensional array of four unequally spaced microphones) is a typical application of the TDOA algorithm.
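As a concrete illustration of category (3), and not part of the patent itself, the following is a minimal NumPy sketch of TDOA estimation between two microphones using the widely used GCC-PHAT cross-correlation; the function names, the sampling-rate argument, and the far-field arcsine DOA formula are illustrative assumptions rather than anything specified by this disclosure.

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, fs, max_tau=None):
    """Estimate the time difference of arrival (TDOA) between two
    microphone signals with the GCC-PHAT weighted cross-correlation."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay_samples = np.argmax(np.abs(cc)) - max_shift
    return delay_samples / fs               # TDOA in seconds

def doa_from_tdoa(tau, mic_distance, c=343.0):
    """Far-field direction of arrival (degrees) for a two-element array,
    from the TDOA, the speed of sound and the known microphone spacing."""
    return np.degrees(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0)))
```

For an array with more than two elements, the pairwise delays would be combined with the known microphone geometry, as the paragraph above describes.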
At present, these traditional sound source localization algorithms depend heavily on the number of microphones in the array and perform poorly at localizing sound signals with a low signal-to-noise ratio. In scenes such as multi-person conferences and cocktail parties, traditional sound source localization is almost unable to work, let alone accurately determine the speaker's position, the speaker's identity, and the content of the speech.
Disclosure of Invention
Aiming at the defects in the related art, the technical problem to be solved by the invention is as follows: to provide a sound source localization method, a sound source localization apparatus, a speech recognition method, a speech recognition system, a storage device, and a terminal that can accurately locate the position of a speaker and match the speaker's identity with the content of the utterance.
In order to solve the above technical problem, the present invention provides a sound source localization method with high accuracy, which includes the following steps: s101, collecting sound signals in an environment in real time; s102, judging whether a voice signal exists in the collected sound signal, if so, executing a step S103, otherwise, returning to the step S101; s103, extracting all voice signals and acquiring sound source position information of each voice signal; s104, performing voiceprint recognition on each extracted voice signal one by one; s105, judging whether the currently identified voiceprint features are stored in a voiceprint database, if so, executing a step S108, otherwise, executing a step S106; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics; s106, acquiring image information of a sound source position where the voice signal corresponding to the voiceprint feature is located; s107, performing model training by using a machine self-learning method, determining a speaker corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database; and S108, displaying the sound source position information of the voice signal corresponding to the voiceprint characteristic and the identity information of the corresponding speaker.
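To make the S101-S108 flow easier to follow, the following is a minimal, hedged Python sketch of the control loop; every callable field is a hypothetical stand-in for the corresponding module (sound collection, speech detection, localization, voiceprint recognition, the voiceprint database, image capture, self-learning enrollment, and display), not an API defined by the patent.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LocalizationPipeline:
    """Hedged sketch of the S101-S108 control flow; each callable is a
    hypothetical placeholder for the corresponding module."""
    collect_sound: Callable[[], object]                  # S101
    contains_speech: Callable[[object], bool]            # S102
    extract_speech: Callable[[object], list]             # S103 -> [(signal, position)]
    recognize_voiceprint: Callable[[object], object]     # S104
    lookup_speaker: Callable[[object], Optional[str]]    # S105 (voiceprint database)
    capture_image: Callable[[object], object]            # S106
    enroll_speaker: Callable[[object, object], str]      # S107 (self-learning + store)
    display: Callable[[object, str], None]               # S108

    def step(self) -> None:
        audio = self.collect_sound()                           # S101
        if not self.contains_speech(audio):                    # S102: no speech, drop
            return
        for signal, position in self.extract_speech(audio):    # S103
            voiceprint = self.recognize_voiceprint(signal)     # S104
            identity = self.lookup_speaker(voiceprint)         # S105
            if identity is None:                               # not enrolled yet
                image = self.capture_image(position)           # S106
                identity = self.enroll_speaker(voiceprint, image)  # S107
            self.display(position, identity)                   # S108
```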
Preferably, after the step S103 is completed, the steps S103-1 to S103-3 are executed first, and then the step S104 is executed; s103-1, collecting image information of a sound source position where each voice signal is located; s103-2, judging whether people exist at the corresponding sound source positions one by one, if so, executing the step S103-3, otherwise, returning to the step S101; s103-3, judging that the voice signal corresponding to the sound source position with the existence of people is an effective voice signal, and then executing the step S104; the voice signal described in step S104 is an effective voice signal.
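A minimal sketch of the S103-1 to S103-3 filtering described above, assuming hypothetical `capture_image` and `person_present` helpers (camera capture and a person detector); only speech whose sound source position shows a person is kept as an effective voice signal.

```python
def filter_valid_speech(speech_with_positions, capture_image, person_present):
    """Sketch of S103-1..S103-3: keep only the speech signals whose sound
    source position shows a person in the captured image."""
    valid = []
    for signal, position in speech_with_positions:
        image = capture_image(position)        # S103-1: image at the source position
        if person_present(image):              # S103-2: is anyone there?
            valid.append((signal, position))   # S103-3: effective voice signal
    return valid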
Preferably, the image information in step S106 includes: face feature information and lip movement feature information of the person; after the step S106 is executed, the steps S106-1 to S106-4 are executed, and then the step S107 is executed; s106-1, carrying out face recognition and lip movement recognition according to the acquired image information, and determining the current number of the sounders at the sound source position; s106-2, judging whether the number of the current sounders is only 1, if so, executing a step S107, otherwise, executing a step S106-3; s106-3, continuously collecting the sound signals at the sound source position; s106-4, judging whether other voice signals exist at the sound source position, if so, returning to the step S104, otherwise, repeatedly executing the step S106-2.
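The S106-1 to S106-4 loop above can be sketched as follows; `capture_image`, `count_speakers` (face plus lip-movement recognition), `collect_sound`, and `contains_other_speech` are hypothetical placeholders, and the returned strings merely indicate which step the flow would jump to.

```python
def wait_for_single_speaker(position, capture_image, count_speakers,
                            collect_sound, contains_other_speech):
    """Sketch of S106-1..S106-4: proceed to model training only when exactly
    one person at the sound source position is speaking."""
    while True:
        image = capture_image(position)
        if count_speakers(image) == 1:          # S106-1 / S106-2: one speaker
            return "train"                      # go on to S107
        audio = collect_sound(position)         # S106-3: keep listening
        if contains_other_speech(audio):        # S106-4: new speech appeared
            return "re-recognize"               # go back to S104
```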
The present invention also provides a sound source localization apparatus with high accuracy, including: the first sound collection unit: the system is used for acquiring sound signals in the environment in real time; a first judgment unit: the voice recognition device is used for judging whether a voice signal exists in the collected voice signals; the voice extraction and positioning unit: the voice recognition system is used for extracting all voice signals and acquiring the position information of a sound source where each voice signal is located when the voice signals exist in the acquired voice signals; a voiceprint recognition unit: the voice print recognition module is used for carrying out voice print recognition on each extracted voice signal one by one; a second judgment unit: the voice print database is used for judging whether the currently identified voice print characteristics are stored in the voice print database; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics; an image acquisition unit: the voice recognition method comprises the steps that when the currently recognized voiceprint features are not stored in a voiceprint database, image information of a sound source position where a voice signal corresponding to the voiceprint features is located is obtained; a machine learning unit: the system is used for carrying out model training by utilizing a machine self-learning method, determining a speaker corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database; a display unit: and the voice recognition module is used for displaying the sound source position information of the voice signal corresponding to the voiceprint characteristics and the identity information of the corresponding speaker when the currently recognized voiceprint characteristics are stored in the voiceprint database.
Preferably, the method further comprises the following steps: an image acquisition unit: the voice recognition system is used for acquiring image information of a sound source position where each voice signal is located after all voice signals are extracted and sound source position information where each voice signal is located is acquired; a third judging unit: the system is used for judging whether people exist at the corresponding sound source positions one by one; an effective speech determination unit: and when a person exists at the corresponding sound source position, judging the voice signal corresponding to the sound source position where the person exists as an effective voice signal, and then carrying out voiceprint recognition on all the effective voice signals one by one.
Preferably, the image information in the image acquiring unit includes: face feature information and lip movement feature information of the person; the sound source localization apparatus with high accuracy includes: the sound production people counting unit: the voice recognition system is used for carrying out face recognition and lip movement recognition according to the acquired image information after acquiring the image information of the sound source position where the voice signal corresponding to the voiceprint feature is located when the currently recognized voiceprint feature is not stored in the voiceprint database, and determining the current number of the voice producing people at the sound source position; a fourth judging unit: the voice print recognition system is used for judging whether the number of the current voice speakers is only 1, if so, performing model training by using a machine self-learning method, determining the voice speakers corresponding to the voice print characteristics and identity information thereof, and storing the corresponding voice print characteristics and the identity information of the voice speakers in a voice print database; a second sound collection unit: when the number of the current speakers is not only 1, continuously acquiring the sound signals at the sound source position; a fifth judging unit: and the voice recognition module is used for judging whether other voice signals exist at the position of the sound source, if so, performing voiceprint recognition on each extracted voice signal one by one, and otherwise, repeatedly judging whether the number of the current uttered people is only 1.
The invention also provides a high-pertinence speech recognition method, which comprises the following steps: s10, determining the sound source position information of the voice signal and the identity information of the corresponding speaker by adopting a sound source positioning method; s20, responding to the voice recognition command, converting the voice content corresponding to the appointed speaker into character content, and displaying; the sound source positioning method is the sound source positioning method with high accuracy.
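A minimal sketch of how S10 and S20 might fit together once localization has tagged each speech signal with a speaker identity; `speech_to_text` stands for an unspecified ASR engine, and the tuple layout of `located_speech` is an illustrative assumption.

```python
def recognize_specified_speaker(request_identity, located_speech, speech_to_text):
    """Sketch of S10/S20: convert only the specified speaker's speech to
    text for display; `located_speech` is a list of
    (identity, position, signal) tuples produced by the localization step."""
    lines = []
    for identity, position, signal in located_speech:
        if identity == request_identity:                 # S20: respond to the command
            lines.append(f"{identity} @ {position}: {speech_to_text(signal)}")
    return "\n".join(lines)
```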
The invention also provides a speech recognition system with high pertinence, comprising: sound source localization apparatus: the method comprises the steps of determining sound source position information of a voice signal and identity information of a corresponding speaker by adopting a sound source positioning method; a voice conversion module: the voice recognition device is used for responding to a voice recognition command, converting voice content corresponding to a specified speaker into character content and displaying the character content; the sound source positioning device is the sound source positioning device with high accuracy.
The invention also provides a storage device having stored therein a plurality of instructions adapted to be loaded by a processor and to perform a speech recognition method as described above with high pertinence.
The present invention also provides a terminal, including: a processor adapted to implement instructions; and a storage device adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the speech recognition method with high pertinence as described above.
The invention has the beneficial technical effects that:
1. the method comprises the steps that firstly, various sound signals in the environment are collected in real time through a first sound collection unit, then whether the collected sound signals contain voice signals or not is judged, if not, the sound signals are not wanted target signals and are not processed; if the voice signals exist, on one hand, the sound source positions of the voice signals are obtained, and on the other hand, the voice print recognition is carried out on the voice signals one by one. Since the vocal print features of any two speakers are different due to the difference in size and shape of the vocal organs (e.g., tongue, teeth, larynx, lung, nasal cavity, etc.) used by each speaker when speaking, each of the voice signals in the present invention uniquely corresponds to a set of vocal print features as long as they are from different speakers. After obtaining a voiceprint feature corresponding to a voice signal, firstly judging whether the voiceprint feature is stored in a voiceprint database, if so, indicating that the voiceprint feature is collected in advance and stored in the voiceprint database, and the identity of a speaker with the voiceprint feature is also confirmed, and then directly displaying the current position information of the corresponding speaker and the identity information of the speaker through a display unit; if not, the position and the identity of the speaker can be confirmed and matched only in a field acquisition mode, firstly, image information of a sound source position where the voice signal is located is obtained through an image obtaining unit, then, model training is carried out on the speaker by using a machine self-learning method, finally, the speaker corresponding to the voiceprint feature and the identity information of the speaker are determined, the corresponding voiceprint feature and the identity information of the speaker are stored in a voiceprint database, and then, the position information of the corresponding speaker at present and the identity information of the speaker are displayed through a display unit. Thus, it is possible to accurately know who is speaking and where at present. If people want to know what content is spoken, the voice conversion module can convert the voice content corresponding to the specified speaker into text content and display the text content to the requester only by sending a request to the voice conversion module. The invention adds voiceprint recognition and image recognition on the basis of the traditional sound source positioning, can accurately position the position of the speaker, match the identity of the speaker and the speaking content, has small dependence degree on the number of microphones, can accurately position the environmental sound source with low signal-to-noise ratio, and can completely adapt to the environments with complex sound sources, such as multi-person conferences, cocktail meetings and the like.
2. In the invention, after the voice signals exist in the collected voice signals, the image information of the sound source position of each voice signal can be collected through the image collecting unit, then whether people exist at the sound source positions is judged, if people exist at the sound source positions, the voice signals are subjected to subsequent processing, thus the interference of sound sources (such as sound production of a radio, sound production of a television and the like) which do not come from a human body is eliminated, and the accuracy and the recognition efficiency are improved.
3. In the invention, when the position and the identity of a speaker need to be acquired on site, the image information acquired on site by the image acquisition unit comprises face feature information and lip movement feature information of the speaker, so that face recognition and lip movement recognition can be carried out according to the acquired face feature information and lip movement feature information to determine the current number of speakers at the sound source position; it is then judged whether the current number of speakers is only 1, and if it is 1, machine self-learning is carried out directly. If there are several speakers, it is first judged whether other voice signals exist at the sound source position; if such signals exist, voiceprint recognition is carried out on them directly; if not, the number of speakers is judged cyclically, and machine self-learning is carried out only once the number of speakers is 1. In this way the identity and the position of the current speaker can finally be determined uniquely, avoiding the interference caused by several people speaking at the same time, in which case it would be impossible to tell to whom the currently collected voiceprint belongs.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a schematic flow chart of a sound source positioning method with high accuracy according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a sound source positioning device with high accuracy according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for speech recognition with high pertinence according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition system with high pertinence according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of a sound source positioning method with high accuracy according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of a sound source positioning device with high accuracy according to a second embodiment of the present invention;
fig. 7 is a schematic flowchart of a sound source positioning method with high accuracy according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of a sound source positioning device with high accuracy according to a third embodiment of the present invention;
fig. 9 is a schematic flowchart of a sound source positioning method with high accuracy according to a fourth embodiment of the present invention;
fig. 10 is a schematic structural diagram of a sound source positioning device with high accuracy according to a fourth embodiment of the present invention;
In the figure: 101 is a first sound collection unit, 102 is a first judgment unit, 103 is a speech extraction and positioning unit, 104 is a voiceprint recognition unit, 105 is a second judgment unit, 106 is an image acquisition unit, 107 is a machine learning unit, 108 is a display unit, 109 is an identity marking unit, 103-1 is an image collection unit, 103-2 is a third judgment unit, 103-3 is an effective speech determination unit, 106-1 is a sound production people counting unit, 106-2 is a fourth judgment unit, 106-3 is a second sound collection unit, 106-4 is a fifth judgment unit, 10 is a sound source positioning device, and 20 is a voice conversion module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Next, the present invention is described in detail with reference to the schematic drawings, and when the embodiments of the present invention are described in detail, the schematic drawings are only examples for convenience of description, and should not limit the scope of the present invention.
An embodiment of a sound source positioning method, a sound source positioning apparatus, a speech recognition method, a speech recognition system, a storage device, and a terminal with high accuracy is described in detail below with reference to the accompanying drawings.
Example one
Fig. 1 is a schematic flowchart of a sound source positioning method with high accuracy according to an embodiment of the present invention, and as shown in fig. 1, the sound source positioning method with high accuracy may include the following steps:
s101, sound signals in the environment are collected in real time.
S102, judging whether the collected sound signals contain voice signals or not, if yes, executing a step S103, and if not, returning to the step S101.
S103, extracting all voice signals and acquiring the position information of the sound source of each voice signal.
And S104, performing voiceprint recognition on each extracted voice signal one by one.
S105, judging whether the currently identified voiceprint features are stored in a voiceprint database, if so, executing a step S108, otherwise, executing a step S106; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics.
And S106, acquiring image information of the sound source position where the voice signal corresponding to the voiceprint feature is located.
And S107, performing model training by using a machine self-learning method, determining the speaker corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database.
And S108, displaying the sound source position information of the voice signal corresponding to the voiceprint characteristic and the identity information of the corresponding speaker.
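Steps S105 and S107 above revolve around the voiceprint database. The following is a minimal, runnable sketch, not taken from the patent, of one way such a database could store voiceprint embeddings and match a newly recognized voiceprint by cosine similarity; the embedding representation and the similarity threshold are assumptions for illustration only.

```python
import numpy as np

class VoiceprintDatabase:
    """Hypothetical sketch of the voiceprint database used in S105/S107:
    each speaker identity maps to a voiceprint embedding, and a new
    voiceprint is matched by cosine similarity against the stored entries."""

    def __init__(self, threshold=0.75):
        self.entries = {}           # identity -> embedding vector
        self.threshold = threshold  # assumed similarity threshold

    def store(self, identity, voiceprint):
        """S107: store a confirmed voiceprint with its speaker identity."""
        self.entries[identity] = np.asarray(voiceprint, dtype=float)

    def lookup(self, voiceprint):
        """S105: return the enrolled identity whose voiceprint is most
        similar, or None if no entry exceeds the threshold."""
        v = np.asarray(voiceprint, dtype=float)
        best_id, best_sim = None, self.threshold
        for identity, ref in self.entries.items():
            sim = float(v @ ref / (np.linalg.norm(v) * np.linalg.norm(ref) + 1e-12))
            if sim > best_sim:
                best_id, best_sim = identity, sim
        return best_id
```

A lookup that returns None would correspond to a voiceprint not yet in the database, triggering the image acquisition and self-learning enrollment of steps S106 and S107.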
Accordingly, this embodiment further provides a sound source positioning device with high accuracy, fig. 2 is a schematic structural diagram of the sound source positioning device with high accuracy according to an embodiment of the present invention, and as shown in fig. 2, the sound source positioning device with high accuracy may include:
the first sound collection unit 101: for collecting sound signals in real time in an environment.
The first judgment unit 102: and the method is used for judging whether the voice signal exists in the collected sound signal.
Speech extraction and localization unit 103: the method is used for extracting all voice signals and acquiring the position information of a sound source where each voice signal is located when the voice signals exist in the collected voice signals.
Voiceprint recognition unit 104: for performing voiceprint recognition on each extracted voice signal one by one.
Second determination unit 105: the voice print database is used for judging whether the currently identified voice print characteristics are stored in the voice print database; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics.
The image acquisition unit 106: and the voice recognition module is used for acquiring the image information of the sound source position where the voice signal corresponding to the voiceprint feature is located when the currently recognized voiceprint feature is not stored in the voiceprint database.
The machine learning unit 107: the method is used for carrying out model training by utilizing a machine self-learning method, determining the speaker corresponding to the voiceprint characteristics and the identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database.
The display unit 108: and the voice recognition module is used for displaying the sound source position information of the voice signal corresponding to the voiceprint characteristics and the identity information of the corresponding speaker when the currently recognized voiceprint characteristics are stored in the voiceprint database.
In addition, the present embodiment further provides a speech recognition method with high pertinence, fig. 3 is a schematic flow chart of the speech recognition method with high pertinence according to an embodiment of the present invention, and as shown in fig. 3, the speech recognition method with high pertinence may include the following steps:
and S10, determining the position information of the sound source where the voice signal is located and the identity information of the corresponding speaker by adopting a sound source positioning method.
And S20, responding to the voice recognition command, converting the voice content corresponding to the appointed speaker into character content, and displaying the character content.
The sound source positioning method is the sound source positioning method with high accuracy.
Correspondingly, the present embodiment further provides a speech recognition system with high pertinence, fig. 4 is a schematic structural diagram of a speech recognition system with high pertinence provided in the first embodiment of the present invention, and as shown in fig. 4, the speech recognition system with high pertinence may include:
sound source localization apparatus 10: the method is used for determining the sound source position information of the voice signal and the identity information of the corresponding speaker by adopting a sound source positioning method.
The voice conversion module 20: and the voice recognition device is used for responding to the voice recognition command, converting the voice content corresponding to the appointed speaker into the character content and displaying the character content.
The sound source localization apparatus 10 is the sound source localization apparatus with high accuracy described above.
In the embodiment, first, the first sound collection unit 101 collects various sound signals in an environment in real time, and then judges whether the collected sound signals contain voice signals, if not, the sound signals are not intended target signals, and no processing is performed; if the voice signals exist, on one hand, the sound source positions of the voice signals are obtained, and on the other hand, the voice print recognition is carried out on the voice signals one by one. Since the vocal print features of any two speakers are different due to the difference in size and shape of the vocal organs (e.g., tongue, teeth, larynx, lung, nasal cavity, etc.) used by each speaker when speaking, each of the voice signals in the present invention uniquely corresponds to a set of vocal print features as long as they are from different speakers. After obtaining a voiceprint feature corresponding to a voice signal, firstly judging whether the voiceprint feature is stored in a voiceprint database, if so, indicating that the voiceprint feature is collected in advance and stored in the voiceprint database, and the identity of a speaker with the voiceprint feature is also confirmed, and then directly displaying the current position information of the corresponding speaker and the identity information of the speaker through the display unit 108; if not, the position and the identity of the speaker can only be confirmed and matched in a field acquisition mode, firstly, the image information of the sound source position where the voice signal is located is acquired through the image acquisition unit 106, then, model training is carried out on the speaker by using a machine self-learning method, finally, the speaker corresponding to the voiceprint feature and the identity information of the speaker are determined, the corresponding voiceprint feature and the identity information of the speaker are stored in a voiceprint database, and then, the position information of the corresponding speaker at present and the identity information of the speaker are displayed through the display unit 108. Thus, it is possible to accurately know who is speaking and where at present. If a user wants to know who says what content, the user only needs to send a request to the voice conversion module 20, and the voice conversion module 20 can convert the voice content corresponding to the specified speaker into text content and display the text content to the requester. The invention adds voiceprint recognition and image recognition on the basis of the traditional sound source positioning, can accurately position the position of the speaker, match the identity of the speaker and the speaking content, has small dependence degree on the number of microphones, can accurately position the environmental sound source with low signal-to-noise ratio, and can completely adapt to the environments with complex sound sources, such as multi-person conferences, cocktail meetings and the like.
In specific implementation, the display of the position information of the speaker in the invention can be displayed in a map form, the core geographic position of the map can be the position of a user wearing the device/apparatus, and the positions of all speakers are marked one by one around the position of the user on the map for real-time display. Specifically, the identity information of the speaker in the invention can be the head portrait of the speaker or the name of the speaker. Before the equipment/device is used, voiceprint information and identity information of a person who pronounces the voice can be collected in advance, so that the voiceprint information and the identity information can be directly matched in the using process, and a large amount of time for field collection and matching is saved.
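A small sketch of the map-style display described above, assuming each speaker has already been localized as a range and bearing relative to the wearer; the field names and coordinate convention are assumptions for illustration only.

```python
import math

def speaker_map_markers(user_position, speakers):
    """Place the wearer at the centre of the map and each speaker around it
    from an estimated range and bearing; `speakers` is assumed to be a list
    of dicts with 'name', 'range_m' and 'bearing_deg' keys."""
    ux, uy = user_position
    markers = [{"name": "me", "x": ux, "y": uy}]
    for s in speakers:
        theta = math.radians(s["bearing_deg"])
        markers.append({
            "name": s["name"],                        # speaker identity (name or avatar)
            "x": ux + s["range_m"] * math.sin(theta),
            "y": uy + s["range_m"] * math.cos(theta),
        })
    return markers
```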
Example two
Fig. 5 is a schematic flowchart of a sound source positioning method with high accuracy according to a second embodiment of the present invention, and as shown in fig. 5, after the step S103 is completed, the present invention may first perform steps S103-1 to S103-3, and then perform step S104.
S103-1, collecting image information of the sound source position where each voice signal is located.
S103-2, judging whether people exist at the corresponding sound source positions one by one, if so, executing the step S103-3, otherwise, returning to the step S101.
S103-3, judging that the voice signal corresponding to the sound source position with the existence of people is an effective voice signal, and then executing the step S104;
the voice signal described in step S104 is an effective voice signal.
Accordingly, this embodiment further provides a sound source positioning device with high accuracy, fig. 6 is a schematic structural diagram of a sound source positioning device with high accuracy according to a second embodiment of the present invention, as shown in fig. 6, the sound source positioning device with high accuracy may further include:
image acquisition unit 103-1: the method comprises the steps of extracting all voice signals and acquiring sound source position information of each voice signal, and then acquiring image information of the sound source position of each voice signal.
Third judging unit 103-2: and the method is used for judging whether people exist at the corresponding sound source positions one by one.
Valid speech determination unit 103-3: and when a person exists at the corresponding sound source position, judging the voice signal corresponding to the sound source position where the person exists as an effective voice signal, and then carrying out voiceprint recognition on all the effective voice signals one by one.
In this embodiment, after it is determined that there is a voice signal in the collected voice signal, the image collecting unit 103-1 may collect image information of a sound source position where each voice signal is located, and then determine whether there is a person at the sound source position, and if there is a person at the sound source position, perform subsequent processing on the voice signals, so as to eliminate interference of sound sources (such as a sound of a radio, a sound of a television, and the like) that do not come from a human body, and improve accuracy and recognition efficiency.
EXAMPLE III
Fig. 7 is a schematic flowchart of a sound source localization method with high accuracy according to a third embodiment of the present invention, and as shown in fig. 7, the image information in step S106 may include: face feature information and lip movement feature information of the person.
After the step S106 is completed, the steps S106-1 to S106-4 are executed, and then the step S107 is executed:
s106-1, carrying out face recognition and lip movement recognition according to the acquired image information, and determining the current number of the voice-producing people at the position of the sound source.
S106-2, judging whether the number of the current sounders is only 1, if so, executing the step S107, otherwise, executing the step S106-3.
And S106-3, continuing to collect the sound signal at the sound source position.
S106-4, judging whether other voice signals exist at the sound source position, if so, returning to the step S104, otherwise, repeatedly executing the step S106-2.
Accordingly, this embodiment further provides a sound source positioning device with high accuracy, fig. 8 is a schematic structural diagram of a sound source positioning device with high accuracy according to a third embodiment of the present invention, as shown in fig. 8, on the basis of the first embodiment, the image information in the image acquiring unit 106 may include: face feature information and lip movement feature information of the person; the sound source localization apparatus having high accuracy may include:
the uttered people counting unit 106-1: and the voice recognition system is used for acquiring image information of a sound source position where a voice signal corresponding to the voiceprint features is located when the currently recognized voiceprint features are not stored in the voiceprint database, then carrying out face recognition and lip movement recognition according to the acquired image information, and determining the number of the current voice-producing people at the sound source position.
Fourth judging unit 106-2: and the method is used for judging whether the number of the current voice speakers is only 1, if so, performing model training by using a machine self-learning method, determining the voice speakers corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the voice speakers in a voiceprint database.
Second sound collection unit 106-3: used for continuously collecting the sound signals at the sound source position when the number of the current speakers is not only 1.
Fifth judging unit 106-4: and the voice recognition module is used for judging whether other voice signals exist at the position of the sound source, if so, performing voiceprint recognition on each extracted voice signal one by one, and otherwise, repeatedly judging whether the number of the current uttered people is only 1.
In this embodiment, when the position and identity of the speaker need to be acquired on site, the image information obtained on site by the image acquisition unit 106 includes the facial feature information and lip movement feature information of the speaker, so that face recognition and lip movement recognition can be performed according to the obtained facial feature information and lip movement feature information to determine the number of current speakers at the sound source position; it is then judged whether the number of current speakers is only 1, and if it is 1, machine self-learning is performed directly. If there are several speakers, it is first judged whether other voice signals exist at the sound source position; if such signals exist, voiceprint recognition is carried out on them directly; if not, the number of speakers is judged cyclically, and machine self-learning is performed only once the number of speakers is 1. In this way the identity and the position of the current speaker can finally be determined uniquely, avoiding the interference caused by several people speaking at the same time, in which case it would be impossible to tell to whom the currently collected voiceprint belongs.
Example four
Fig. 9 is a schematic flowchart of a sound source positioning method with high accuracy according to a fourth embodiment of the present invention, as shown in fig. 9, on the basis of the first embodiment, the sound source positioning method with high accuracy may further include the following steps:
and S109, responding to the voice producer identity marking command, marking the identity information of the designated voice producer, and displaying.
Accordingly, this embodiment also provides a sound source positioning device with high accuracy, and fig. 10 is a schematic structural diagram of a sound source positioning device with high accuracy according to a fourth embodiment of the present invention, as shown in fig. 10, on the basis of the first embodiment,
the sound source localization apparatus having high accuracy may further include:
identity marking unit 109: and the voice generator is used for responding to the voice generator identity marking command, marking the identity information of the specified voice generator and displaying the identity information.
After the position and the identity of the speaker are determined, the identity of the speaker can be marked according to actual requirements for convenience of identification.
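A minimal sketch of the S109 identity-marking step, assuming the display keeps a list of entry dicts; the entry structure and field names are illustrative assumptions rather than anything defined by the patent.

```python
def mark_speaker_identity(display_entries, target_identity, label):
    """Sketch of S109: in response to a marking command, attach a custom
    label to the specified speaker's display entry."""
    for entry in display_entries:
        if entry["identity"] == target_identity:
            entry["label"] = label            # e.g. "chairperson", "interpreter"
    return display_entries
```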
Accordingly, the present invention also provides a storage device having stored therein a plurality of instructions adapted to be loaded by a processor and to perform a speech recognition method with high pertinence as described above.
The storage device may be a computer-readable storage medium, and may include: ROM, RAM, magnetic or optical disks, and the like.
Correspondingly, the invention also provides a terminal, which can comprise:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions adapted to be loaded by a processor and to perform a speech recognition method with high pertinence as described above.
The terminal can be any device capable of sound source localization and speech recognition, and may be implemented in software and/or hardware as various wearable terminal devices, such as glasses or bracelets. The processor can be a multi-core processor with a GPU or an NPU, for example a Qualcomm, Rockchip, or MediaTek processor.
In specific implementation, the first sound collection unit 101 and the second sound collection unit 106-3 may be linear arrays, circular arrays, heart-shaped arrays, spiral arrays, irregular arrays, and the like based on electret microphones, or linear arrays, circular arrays, heart-shaped arrays, spiral arrays, irregular arrays, and the like based on MEMS microphones. The image acquisition unit 106 and the image acquisition unit 103-1 may be a monocular camera, a binocular camera, a multi-view camera, a depth camera, etc. The display unit 108 may be a display screen such as an OLED or an LCD, may adopt a prism module, a free-form surface, an optical waveguide, or an LED lamp or other display modules.
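The hardware options listed above could be captured in a simple configuration object; the following sketch is illustrative only, and the class and field names are assumptions rather than anything defined by the patent.

```python
from dataclasses import dataclass
from enum import Enum

class MicArrayLayout(Enum):        # electret or MEMS microphones, per the text
    LINEAR = "linear"
    CIRCULAR = "circular"
    HEART_SHAPED = "heart-shaped"
    SPIRAL = "spiral"
    IRREGULAR = "irregular"

class Camera(Enum):
    MONOCULAR = "monocular"
    BINOCULAR = "binocular"
    MULTI_VIEW = "multi-view"
    DEPTH = "depth"

class Display(Enum):
    OLED = "oled"
    LCD = "lcd"
    OPTICAL_WAVEGUIDE = "optical waveguide"
    LED = "led"

@dataclass
class DeviceConfig:
    """Illustrative hardware configuration for one build of the device."""
    mic_array: MicArrayLayout = MicArrayLayout.CIRCULAR
    camera: Camera = Camera.DEPTH
    display: Display = Display.OLED
```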
Compared with traditional sound source localization, the invention has greater advantages in noisy environments where multiple people speak simultaneously or where the speaker is far from the listener, and is particularly beneficial for people with hearing impairment. Based on machine vision and voiceprint recognition technology, and combining the audio data and image data collected on site, the invention performs sound source localization, voiceprint recognition, face recognition and lip movement recognition, determines whether a person is present at the corresponding sound source position and how many people are speaking there, and further accurately obtains the position information of multiple sound sources and the identity information of the speakers. The invention solves the problems that traditional sound source localization algorithms locate inaccurately and cannot capture the desired human voice in application environments such as multi-person conferences and cocktail parties; by combining machine vision and voiceprint recognition, it achieves accurate sound source localization in complex application environments, and has outstanding substantive features and notable progress.
In the description of the present invention, it is understood that the relevant features of the above method, apparatus and system are mutually referenced. In addition, "first", "second", and the like in the above-described embodiments are for distinguishing between the embodiments or for descriptive purposes and do not represent advantages or disadvantages of the embodiments, nor are they to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and other divisions may be realized in practice, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A sound source localization method with high accuracy, characterized by: the method comprises the following steps:
s101, collecting sound signals in an environment in real time;
s102, judging whether a voice signal exists in the collected sound signal, if so, executing a step S103, otherwise, returning to the step S101;
s103, extracting all voice signals and acquiring sound source position information of each voice signal;
s104, performing voiceprint recognition on each extracted voice signal one by one;
s105, judging whether the currently identified voiceprint features are stored in a voiceprint database, if so, executing a step S108, otherwise, executing a step S106; the voiceprint database stores a plurality of groups of voiceprint characteristics and the identity information of the speaker uniquely corresponding to each group of voiceprint characteristics;
s106, acquiring image information of a sound source position where the voice signal corresponding to the voiceprint feature is located;
s107, performing model training by using a machine self-learning method, determining a speaker corresponding to the voiceprint characteristics and identity information thereof, and storing the corresponding voiceprint characteristics and the identity information of the speaker in a voiceprint database;
and S108, displaying the sound source position information of the voice signal corresponding to the voiceprint characteristic and the identity information of the corresponding speaker.
2. The sound source localization method with high accuracy according to claim 1, characterized in that after step S103 is completed, steps S103-1 to S103-3 are executed, and then step S104 is executed:
S103-1, collecting image information of the sound source position where each voice signal is located;
S103-2, judging, one by one, whether a person exists at each corresponding sound source position, if so, executing step S103-3, otherwise, returning to step S101;
S103-3, determining that the voice signal corresponding to a sound source position where a person exists is a valid voice signal, and then executing step S104;
wherein the voice signal described in step S104 is the valid voice signal.
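As an illustration of steps S103-1 to S103-3, the short sketch below keeps only those voice signals whose sound source position shows a person in the captured image. The OpenCV HOG person detector is merely one plausible detector chosen for this sketch; the claim does not prescribe a particular detection algorithm.

    # Assumed realization of the presence check in S103-2 / S103-3.
    import cv2

    _hog = cv2.HOGDescriptor()
    _hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    def person_present(image) -> bool:
        """S103-2: True if at least one person is detected in the image of the source position."""
        rects, _weights = _hog.detectMultiScale(image, winStride=(8, 8))
        return len(rects) > 0

    def filter_valid_signals(candidates):
        """S103-3: candidates is a list of (voice_signal, source_position, image) triples;
        only signals whose source position contains a person are kept as valid."""
        return [(signal, position) for signal, position, image in candidates
                if person_present(image)]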
3. The sound source localization method with high accuracy according to claim 1, characterized in that the image information in step S106 includes face feature information and lip movement feature information of the person;
after step S106 is executed, steps S106-1 to S106-4 are executed, and then step S107 is executed:
S106-1, performing face recognition and lip movement recognition on the acquired image information, and determining the number of persons currently speaking at the sound source position;
S106-2, judging whether the number of persons currently speaking is exactly 1, if so, executing step S107, otherwise, executing step S106-3;
S106-3, continuing to collect the sound signals at the sound source position;
S106-4, judging whether another voice signal exists at the sound source position, if so, returning to step S104, otherwise, repeatedly executing step S106-2.
4. A sound source localization apparatus with high accuracy, characterized in that the apparatus comprises:
a first sound collection unit (101), configured to collect sound signals in the environment in real time;
a first judging unit (102), configured to judge whether a voice signal exists in the collected sound signals;
a speech extraction and localization unit (103), configured to, when voice signals exist in the collected sound signals, extract all the voice signals and acquire the sound source position information of each voice signal;
a voiceprint recognition unit (104), configured to perform voiceprint recognition on each extracted voice signal one by one;
a second judging unit (105), configured to judge whether the currently recognized voiceprint features are stored in a voiceprint database, wherein the voiceprint database stores a plurality of groups of voiceprint features and the identity information of the speaker uniquely corresponding to each group of voiceprint features;
an image acquisition unit (106), configured to acquire, when the currently recognized voiceprint features are not stored in the voiceprint database, image information of the sound source position where the voice signal corresponding to the voiceprint features is located;
a machine learning unit (107), configured to perform model training by using a machine self-learning method, determine the speaker corresponding to the voiceprint features and the identity information thereof, and store the corresponding voiceprint features and the identity information of the speaker in the voiceprint database;
a display unit (108), configured to display, when the currently recognized voiceprint features are stored in the voiceprint database, the sound source position information of the voice signal corresponding to the voiceprint features and the identity information of the corresponding speaker.
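For the localization role of the speech extraction and localization unit (103), one commonly used technique is time-difference-of-arrival estimation between microphone pairs; the claim itself does not fix any particular algorithm. The sketch below implements GCC-PHAT for a single microphone pair, and the microphone spacing and sample rate are assumed values.

    # GCC-PHAT sketch (assumed, two-microphone case) for estimating a source bearing.
    import numpy as np

    SPEED_OF_SOUND = 343.0    # m/s
    MIC_SPACING = 0.10        # m, assumed spacing between the two microphones
    SAMPLE_RATE = 16000       # Hz, assumed

    def gcc_phat(sig, ref, fs=SAMPLE_RATE, max_tau=MIC_SPACING / SPEED_OF_SOUND):
        """Time difference of arrival (seconds) of `sig` relative to `ref`."""
        n = len(sig) + len(ref)
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        cross = SIG * np.conj(REF)
        cc = np.fft.irfft(cross / (np.abs(cross) + 1e-15), n=n)   # PHAT weighting
        max_shift = int(fs * max_tau)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = int(np.argmax(np.abs(cc))) - max_shift
        return shift / float(fs)

    def bearing_degrees(sig, ref):
        """Arrival angle relative to the microphone axis implied by the estimated delay."""
        tau = gcc_phat(np.asarray(sig, dtype=float), np.asarray(ref, dtype=float))
        cos_theta = np.clip(tau * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
        return float(np.degrees(np.arccos(cos_theta)))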
5. The sound source localization apparatus with high accuracy according to claim 4, characterized by further comprising:
an image acquisition unit (103-1), configured to collect, after all voice signals have been extracted and the sound source position information of each voice signal has been acquired, image information of the sound source position where each voice signal is located;
a third judging unit (103-2), configured to judge, one by one, whether a person exists at each corresponding sound source position;
a valid speech determination unit (103-3), configured to, when a person exists at the corresponding sound source position, determine the voice signal corresponding to that sound source position to be a valid voice signal, so that voiceprint recognition is subsequently performed on all valid voice signals one by one.
6. The sound source localization apparatus with high accuracy according to claim 4, characterized in that the image information in the image acquisition unit (106) includes face feature information and lip movement feature information of the person, and the apparatus further comprises:
a speaker counting unit (106-1), configured to, after the image information of the sound source position where the voice signal corresponding to the voiceprint features is located has been acquired because the currently recognized voiceprint features are not stored in the voiceprint database, perform face recognition and lip movement recognition on the acquired image information and determine the number of persons currently speaking at the sound source position;
a fourth judging unit (106-2), configured to judge whether the number of persons currently speaking is exactly 1, and if so, perform model training by using a machine self-learning method, determine the speaker corresponding to the voiceprint features and the identity information thereof, and store the corresponding voiceprint features and the identity information of the speaker in the voiceprint database;
a second sound collection unit (106-3), configured to continue collecting the sound signals at the sound source position when the number of persons currently speaking is not 1;
a fifth judging unit (106-4), configured to judge whether another voice signal exists at the sound source position, and if so, perform voiceprint recognition on each extracted voice signal one by one, otherwise repeatedly judge whether the number of persons currently speaking is exactly 1.
7. A speech recognition method with high pertinence, characterized in that the method comprises the following steps:
determining the sound source position information of a voice signal and the identity information of the corresponding speaker by adopting a sound source localization method;
responding to a voice recognition command, converting the voice content corresponding to the designated speaker into text content, and displaying the text content;
wherein the sound source localization method is the sound source localization method with high accuracy according to any one of claims 1 to 3.
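Claim 7 combines the localization result with selective transcription: only the speech of the speaker named in the recognition command is converted into text. The sketch below assumes the localization stage has already attributed each voice segment to a speaker; the transcribe callable is an injected placeholder for any speech-to-text engine, and no particular ASR interface is implied.

    # Assumed sketch of the claim-7 flow: transcribe only the designated speaker's segments.
    from typing import Callable, Iterable, List, Tuple

    # (speaker identity, sound source position, raw audio of one voice segment)
    Segment = Tuple[str, tuple, bytes]

    def recognize_designated_speaker(segments: Iterable[Segment],
                                     designated_speaker: str,
                                     transcribe: Callable[[bytes], str]) -> List[str]:
        """Return the text content of every segment attributed to the designated speaker."""
        return [transcribe(audio)
                for identity, _position, audio in segments
                if identity == designated_speaker]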
8. A speech recognition system with high pertinence, characterized in that the system comprises:
a sound source localization apparatus (10), configured to determine the sound source position information of a voice signal and the identity information of the corresponding speaker by adopting a sound source localization method;
a speech conversion module (20), configured to respond to a voice recognition command, convert the voice content corresponding to a designated speaker into text content, and display the text content;
wherein the sound source localization apparatus (10) is the sound source localization apparatus with high accuracy according to any one of claims 4 to 6.
9. A storage device having a plurality of instructions stored therein, characterized in that the instructions are adapted to be loaded by a processor so as to execute the sound source localization method with high accuracy according to any one of claims 1 to 3.
10. A terminal, characterized in that the terminal comprises:
a processor adapted to implement instructions; and
a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor so as to execute the sound source localization method with high accuracy according to any one of claims 1 to 3.
CN201911048283.6A 2019-10-30 2019-10-30 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal Active CN110767226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911048283.6A CN110767226B (en) 2019-10-30 2019-10-30 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal

Publications (2)

Publication Number Publication Date
CN110767226A (en) 2020-02-07
CN110767226B CN110767226B (en) 2022-08-16

Family ID=69334510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911048283.6A Active CN110767226B (en) 2019-10-30 2019-10-30 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal

Country Status (1)

Country Link
CN (1) CN110767226B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124209A1 (en) * 2011-11-11 2013-05-16 Sony Corporation Information processing apparatus, information processing method, and program
CN102968991A (en) * 2012-11-29 2013-03-13 华为技术有限公司 Method, device and system for sorting voice conference minutes
CN104269172A (en) * 2014-07-31 2015-01-07 广东美的制冷设备有限公司 Voice control method and system based on video positioning
WO2018210219A1 (en) * 2017-05-18 2018-11-22 刘国华 Device-facing human-computer interaction method and system
CN108305615A (en) * 2017-10-23 2018-07-20 腾讯科技(深圳)有限公司 A kind of object identifying method and its equipment, storage medium, terminal
CN108540660A (en) * 2018-03-30 2018-09-14 广东欧珀移动通信有限公司 Audio signal processing method and device, readable storage medium storing program for executing, terminal
CN108920640A (en) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 Context acquisition methods and equipment based on interactive voice
CN109754811A (en) * 2018-12-10 2019-05-14 平安科技(深圳)有限公司 Sound-source follow-up method, apparatus, equipment and storage medium based on biological characteristic
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110196914A (en) * 2019-07-29 2019-09-03 上海肇观电子科技有限公司 A kind of method and apparatus by face information input database

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476126A (en) * 2020-03-27 2020-07-31 海信集团有限公司 Indoor positioning method and system and intelligent equipment
CN111476126B (en) * 2020-03-27 2024-02-23 海信集团有限公司 Indoor positioning method, system and intelligent device
CN111816174A (en) * 2020-06-24 2020-10-23 北京小米松果电子有限公司 Speech recognition method, device and computer readable storage medium
CN111836062A (en) * 2020-06-30 2020-10-27 北京小米松果电子有限公司 Video playing method and device and computer readable storage medium
CN111787609A (en) * 2020-07-09 2020-10-16 北京中超伟业信息安全技术股份有限公司 Personnel positioning system and method based on human body voiceprint characteristics and microphone base station
CN111895991A (en) * 2020-08-03 2020-11-06 杭州十域科技有限公司 Indoor positioning navigation method combined with voice recognition
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112584225A (en) * 2020-12-03 2021-03-30 维沃移动通信有限公司 Video recording processing method, video playing control method and electronic equipment
CN112738499A (en) * 2020-12-25 2021-04-30 京东方科技集团股份有限公司 Information display method and device based on AR, AR equipment, electronic equipment and medium
US11830154B2 (en) 2020-12-25 2023-11-28 Beijing Boe Optoelectronics Technology Co., Ltd. AR-based information displaying method and device, AR apparatus, electronic device and medium
WO2022142610A1 (en) * 2020-12-28 2022-07-07 深圳壹账通智能科技有限公司 Speech recording method and apparatus, computer device, and readable storage medium
CN113406567A (en) * 2021-06-25 2021-09-17 安徽淘云科技股份有限公司 Sound source positioning method, device, equipment and storage medium
CN115240698A (en) * 2021-06-30 2022-10-25 达闼机器人股份有限公司 Model training method, voice detection positioning method, electronic device and storage medium
CN113576527A (en) * 2021-08-27 2021-11-02 复旦大学 Method for judging ultrasonic input by using voice control
CN113611308A (en) * 2021-09-08 2021-11-05 杭州海康威视数字技术股份有限公司 Voice recognition method, device, system, server and storage medium
CN114863943B (en) * 2022-07-04 2022-11-04 杭州兆华电子股份有限公司 Self-adaptive positioning method and device for environmental noise source based on beam forming
CN114863943A (en) * 2022-07-04 2022-08-05 杭州兆华电子股份有限公司 Self-adaptive positioning method and device for environmental noise source based on beam forming
CN116384879A (en) * 2023-04-07 2023-07-04 豪越科技有限公司 Intelligent management system for rapid warehouse-in and warehouse-out of fire-fighting equipment
CN116384879B (en) * 2023-04-07 2023-11-21 豪越科技有限公司 Intelligent management system for rapid warehouse-in and warehouse-out of fire-fighting equipment
CN116299179A (en) * 2023-05-22 2023-06-23 北京边锋信息技术有限公司 Sound source positioning method, sound source positioning device and readable storage medium
CN116299179B (en) * 2023-05-22 2023-09-12 北京边锋信息技术有限公司 Sound source positioning method, sound source positioning device and readable storage medium

Also Published As

Publication number Publication date
CN110767226B (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN110767226B (en) Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal
US11601775B2 (en) Method for generating a customized/personalized head related transfer function
EP2847763B1 (en) Audio user interaction recognition and context refinement
CN108476358B (en) Method for generating customized/personalized head-related transfer function
Aarabi et al. Robust sound localization using multi-source audiovisual information fusion
CN104254818B (en) Audio user interaction identification and application programming interfaces
CN102447697B (en) Method and system of semi-private communication in open environments
CN110875060A (en) Voice signal processing method, device, system, equipment and storage medium
CN106603878A (en) Voice positioning method, device and system
CN107346661B (en) Microphone array-based remote iris tracking and collecting method
US20120162259A1 (en) Sound information display device, sound information display method, and program
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
JP5123595B2 (en) Near-field sound source separation program, computer-readable recording medium recording this program, and near-field sound source separation method
CN108877787A (en) Audio recognition method, device, server and storage medium
CN113099031B (en) Sound recording method and related equipment
JP2000308198A (en) Hearing and
CN110443371A (en) A kind of artificial intelligence device and method
CN111048067A (en) Microphone response method and device
Hao et al. Spectral flux-based convolutional neural network architecture for speech source localization and its real-time implementation
CN112423191A (en) Video call device and audio gain method
Cho et al. Sound source localization for robot auditory systems
CN108038291B (en) Personalized head-related transfer function generation system and method based on human body parameter adaptation algorithm
CN111492668B (en) Method and system for locating the origin of an audio signal within a defined space
CN112363112A (en) Sound source positioning method and device based on linear microphone array
Pasha et al. Distributed microphone arrays, emerging speech and audio signal processing platforms: A review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant