CN115810344A - Voice matching method, device, equipment and storage medium

Voice matching method, device, equipment and storage medium

Info

Publication number: CN115810344A
Application number: CN202211371312.4A
Authority: CN (China)
Prior art keywords: voice, target, voice data, matching, data
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 王丹, 崔洋洋, 杨登舟
Assignees (current and original): Shenzhen Micro & Nano Integrated Circuits and Systems Research Institute; Haining Micro Nano Sensing Computing Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Priority date: 2022-11-03 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2022-11-03
Publication date: 2023-03-17
Application filed by Shenzhen Micro & Nano Integrated Circuits and Systems Research Institute and Haining Micro Nano Sensing Computing Technology Co., Ltd.

Abstract

The application provides a voice matching method, a device, equipment, and a storage medium. The method includes: acquiring target voice data and the sound source position coordinates of the target voice data; calculating the target distance between the sound source position and the voice receiving position according to the sound source position coordinates; determining a voice matching range in a preset voice database according to the target distance, where the voice data contained in the voice matching range form a first voice data set, the voice data contained in the voice database form a second voice data set, and the first voice data set is a subset of the second voice data set; and performing voice matching between the target voice data and the voice data in the first voice data set to obtain the voice matching result corresponding to the target voice data. Because the method calculates the target distance between the sound source position and the voice receiving position and uses that distance as a reference basis when voice matching is performed, the accuracy of voice matching can be greatly improved.

Description

Voice matching method, device, equipment and storage medium
Technical Field
The present application relates to the field of voice matching technologies, and in particular, to a voice matching method, apparatus, device, and storage medium.
Background
Voice matching refers to the matching of voice audio. With the development of voice recognition technology, voice matching is widely applied in the detection field, for example vehicle whistle detection and equipment fault detection, both of which require a voice matching system. However, existing voice matching systems have no sound source localization function and cannot acquire the specific direction of a sound source, so their matching accuracy is low and their matching effect is poor.
Disclosure of Invention
In view of this, embodiments of the present application provide a voice matching method, apparatus, device and storage medium, which aim to solve the technical problems of low accuracy and poor matching effect of voice matching in the prior art.
A first aspect of an embodiment of the present application provides a speech matching method, including: acquiring target voice data and a sound source position coordinate of the target voice data; calculating a target distance between the sound source position and the voice receiving position according to the sound source position coordinates; determining a voice matching range in a preset voice database according to the target distance, wherein voice data contained in the voice matching range form a first voice data set, voice data contained in the voice database form a second voice data set, and the first voice data set is a subset of the second voice data set; and performing voice matching processing on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the step of obtaining the target speech data includes: carrying out sound source sensing on a target scene to obtain a plurality of voice signals, wherein one voice signal corresponds to one sensing point; respectively carrying out sound intensity detection on the plurality of voice signals to obtain a sound intensity value corresponding to each voice signal; and determining a target sensing point according to the sound intensity value corresponding to each voice signal, and determining the voice signal corresponding to the target sensing point as target voice data, wherein the target sensing point is a sensing point corresponding to the voice signal with the maximum sound intensity value in the plurality of voice signals.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the step of obtaining the sound source position coordinates of the target speech data includes: performing feature extraction processing on the target voice data to obtain voice features corresponding to the target voice data; establishing a coordinate system according to the voice characteristics to obtain an initial coordinate system; and positioning the target voice data according to the initial coordinate system to obtain the sound source position coordinates of the target voice data.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, after the step of obtaining the sound source position coordinates of the target speech data, the method further includes: carrying out segmentation processing on the target voice data to obtain a plurality of voice segments; carrying out feature extraction processing on the voice segments to obtain a plurality of voice segment features, wherein the voice segments correspond to the voice segment features one by one; optimizing the initial coordinate system according to the voice segment features to obtain an optimized coordinate system; and adjusting the sound source position coordinates according to the optimized coordinate system.
With reference to the second possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, before the step of performing positioning processing on the target speech data according to the initial coordinate system to obtain the sound source position coordinates of the target speech data, the method further includes: performing environmental noise intensity detection processing on the target voice data, extracting the environmental noise features in the target voice data, and deleting the environmental noise features.
With reference to the first aspect, in a fifth possible implementation manner of the first aspect, the step of performing voice matching processing on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data includes: carrying out segmentation processing on the target voice data to obtain a plurality of voice segments; performing data comparison processing on the plurality of voice segments to obtain a data comparison result, wherein the data comparison processing includes voice feature comparison, voice parameter comparison, voice duration comparison and voice occupation size comparison; carrying out paragraph statistical processing on the plurality of voice segments according to the data comparison result to obtain a paragraph statistical result; performing voice integration processing on the plurality of voice segments according to the paragraph statistical result to obtain integrated voice data, wherein the voice integration processing includes same-feature voice integration, same-parameter voice integration, same-duration voice integration and same-occupation-size voice integration; and performing voice matching processing on the integrated voice data and the voice data in the first voice data set to obtain the voice matching result corresponding to the target voice data.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, before the step of performing segmentation processing on the target speech data to obtain a plurality of speech segments, the method further includes: preprocessing the target voice data, wherein the preprocessing comprises: fuzzy section removing processing, voice filtering processing and noisy voice processing.
A second aspect of an embodiment of the present application provides a speech matching apparatus, including: the acquisition module is used for acquiring target voice data and the sound source position coordinates of the target voice data; the calculation module is used for calculating the target distance between the sound source position and the voice receiving position according to the sound source position coordinates; a determining module, configured to determine a voice matching range in a preset voice database according to the target distance, where voice data included in the voice matching range forms a first voice data set, voice data included in the voice database forms a second voice data set, and the first voice data set is a subset of the second voice data set; and the matching module is used for performing voice matching processing on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
A third aspect of embodiments of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the voice matching method provided in the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the voice matching method provided by the first aspect.
The voice matching method, the voice matching device, the electronic equipment and the storage medium have the following beneficial effects:
The method acquires target voice data and the sound source position coordinates of the target voice data; calculates the target distance between the sound source position and the voice receiving position according to the sound source position coordinates; determines a voice matching range in a preset voice database according to the target distance, where the voice data contained in the voice matching range form a first voice data set, the voice data contained in the voice database form a second voice data set, and the first voice data set is a subset of the second voice data set; and performs voice matching between the target voice data and the voice data in the first voice data set to obtain the voice matching result corresponding to the target voice data. Because the target distance between the sound source position and the voice receiving position is calculated and used as a reference basis when voice matching is performed, the accuracy of voice matching can be greatly improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flowchart illustrating an implementation of a voice matching method according to an embodiment of the present application;
fig. 2 is a flowchart of a method for obtaining target voice data in the voice matching method according to the embodiment of the present application;
fig. 3 is a flowchart of a method for obtaining a sound source position coordinate of target voice data in the voice matching method according to the embodiment of the present application;
fig. 4 is a flowchart of a method for adjusting and optimizing a sound source position coordinate in the voice matching method according to an embodiment of the present application;
fig. 5 is a flowchart of a method for performing voice matching on target voice data in the voice matching method according to the embodiment of the present application;
fig. 6 is a block diagram of the basic structure of a speech matching apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of the basic structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, fig. 1 is a flowchart illustrating an implementation of a voice matching method according to an embodiment of the present application. The details are as follows:
s11: acquiring target voice data and a sound source position coordinate of the target voice data;
s12: calculating a target distance between the sound source position and the voice receiving position according to the sound source position coordinates;
s13: determining a voice matching range in a preset voice database according to the target distance, wherein voice data contained in the voice matching range form a first voice data set, voice data contained in the voice database form a second voice data set, and the first voice data set is a subset of the second voice data set;
s14: and performing voice matching processing on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
In this embodiment, the voice matching method is applied to a voice matching system having a sound source localization function. In the voice matching system, a voice collection module performs voice collection to obtain the target voice data to be matched. After the target voice data is obtained, a sound source localization module localizes the sound source of the target voice data to obtain its sound source position coordinates. For example, sound source localization may be performed by one of the following methods: steered-beamforming localization based on maximum output power, localization based on high-resolution spectral estimation, localization based on time-difference-of-arrival estimation, or a machine-learning-based method.

After the sound source position coordinates are obtained, the position coordinates of the voice receiving position in the coordinate system generated along with the sound source position coordinates are obtained, and the target distance between the sound source position and the voice receiving position is calculated from these two sets of coordinates. It can be understood that the target distance is the straight-line distance, in that coordinate system, between the point represented by the sound source position coordinates and the point representing the voice receiving position.

A voice database is preset in the voice matching system; it stores a large amount of voice data, classified along different dimensions such as type and distance. In this embodiment, after the target distance between the sound source position and the voice receiving position of the target voice data is calculated, the target distance may be used as a reference for voice matching: a voice matching range is determined in the voice database, that is, the voice data meeting the target-distance requirement are screened out of the voice database to form a first voice data set, and the target voice data is then matched against the voice data in the first voice data set. Obtaining the voice matching result on the basis of the first voice data set avoids the degradation of the matching effect that the variety and sheer volume of the stored voice data would otherwise cause, and thus greatly improves the accuracy of voice matching. It should be noted that, in this embodiment, all the voice data contained in the voice database are taken to form a second voice data set, so the first voice data set is a subset of the second voice data set.
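The embodiment gives no formulas or code for steps S12 and S13. The following Python sketch shows one plausible reading of them, assuming the target distance is the Euclidean distance between the two coordinate points and the matching range consists of database entries whose stored distance label lies within a tolerance of that distance; the record layout (a `distance` field) and the tolerance value are illustrative assumptions, not from the patent.

```python
import math

def target_distance(source_xyz, receiver_xyz):
    """S12: straight-line (Euclidean) distance between sound source and receiver."""
    return math.dist(source_xyz, receiver_xyz)

def first_voice_data_set(voice_database, distance, tolerance=5.0):
    """S13: screen the database down to entries whose annotated recording
    distance lies within `tolerance` of the measured target distance."""
    return [rec for rec in voice_database
            if abs(rec["distance"] - distance) <= tolerance]

# Usage: the receiver sits at the origin of the coordinate system.
db = [{"id": 1, "type": "whistle", "distance": 3.0},
      {"id": 2, "type": "fault", "distance": 40.0}]
d = target_distance((2.0, 1.0, 0.5), (0.0, 0.0, 0.0))
subset = first_voice_data_set(db, d)   # the first voice data set, a subset of db
```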
In some embodiments of the present application, please refer to fig. 2, and fig. 2 is a flowchart illustrating a method for obtaining target voice data in a voice matching method according to an embodiment of the present application. The details are as follows:
s21: carrying out sound source sensing on a target scene to obtain a plurality of voice signals, wherein one voice signal corresponds to one sensing point;
s22: respectively carrying out sound intensity detection on the plurality of voice signals to obtain a sound intensity value corresponding to each voice signal;
s23: and determining a target sensing point according to the sound intensity value corresponding to each voice signal, and determining the voice signal corresponding to the target sensing point as target voice data, wherein the target sensing point is a sensing point corresponding to the voice signal with the maximum sound intensity value in the plurality of voice signals.
In this embodiment, the target voice data may be obtained through a sound source sensing module and a sound intensity detection module in the voice matching system. The target scene is a real scene in which sound detection is required, such as a road section where vehicle whistle detection is needed, or an equipment installation area where equipment needs to be monitored. In this embodiment, a spherical microphone array may be installed in the target scene as the sound source sensing module for performing sound source sensing on the target scene.

For example, the process of acquiring the target voice data may be as follows. The spherical microphone array performs sound source sensing on the target scene and obtains a plurality of voice signals, where one voice signal corresponds to one sensing point. Because the voice signals obtained from the array are digital signals, a frequency-domain conversion module may be provided to convert them from digital signals into frequency-domain signals so that they can be used in the subsequent steps. After the conversion, a sound intensity sensor serving as the sound intensity detection module detects the sound intensity of the voice signals one by one, yielding a sound intensity value for each voice signal. Finally, the sound intensity values are compared with one another: the voice signal with the maximum sound intensity value is identified, the sensing point corresponding to it is determined as the target sensing point, and that voice signal is determined as the target voice data.
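A minimal sketch of S21 to S23, assuming each sensing point delivers its signal as a NumPy array. The embodiment uses a dedicated sound intensity sensor; here intensity is approximated by mean signal power, and all names are illustrative.

```python
import numpy as np

def select_target_voice(signals):
    """Return (target sensing point index, target voice data): the channel
    whose signal has the largest mean power among all sensing points."""
    intensities = [float(np.mean(np.square(s.astype(np.float64)))) for s in signals]
    target_idx = int(np.argmax(intensities))   # sensing point with maximum intensity
    return target_idx, signals[target_idx]

# Usage with three simulated sensing points; the loudest channel wins.
rng = np.random.default_rng(0)
channels = [0.1 * rng.standard_normal(16000),
            1.0 * rng.standard_normal(16000),
            0.5 * rng.standard_normal(16000)]
idx, target = select_target_voice(channels)   # idx == 1
```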
In some embodiments of the present application, please refer to fig. 3, and fig. 3 is a flowchart illustrating a method for obtaining a sound source position coordinate of target voice data in a voice matching method according to an embodiment of the present application. The details are as follows:
s31: carrying out feature extraction processing on the target voice data to obtain voice features corresponding to the target voice data;
s32: establishing a coordinate system according to the voice characteristics to obtain an initial coordinate system;
s33: and positioning the target voice data according to the initial coordinate system to obtain the sound source position coordinates of the target voice data.
In this embodiment, after the target voice data is obtained, a coordinate establishing module in the voice matching system may establish coordinates to obtain the sound source position coordinates of the target voice data. For example, the process may be as follows. Feature extraction is performed on the target voice data to obtain its corresponding voice features; it can be understood that, in this embodiment, the voice features may be frequency features, time features, and the like of the target voice data. After the voice features are obtained, the initial positions of the coordinate axes are determined according to the voice features, the coordinate axes of a coordinate system are established, and the axes are then extended to a plane through parallel relations, thereby obtaining an initial coordinate system. It can be understood that the initial coordinate system may be a two-dimensional planar coordinate system or a three-dimensional spatial coordinate system. After the initial coordinate system is obtained, the target voice data can be localized according to the position, in the initial coordinate system, of the target sensing point that represents where the target voice data was captured, yielding the sound source position coordinates of the target voice data.
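The embodiment does not fix a concrete localization algorithm for this step. For illustration only, the sketch below implements GCC-PHAT time-delay estimation, a common building block of the time-difference-of-arrival option listed earlier; it is not claimed to be the patent's method, and the function signature is an assumption.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref` via GCC-PHAT.
    Given delays from several microphone pairs of known geometry, the source
    coordinates can then be solved for in the array's coordinate system."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)   # PHAT weighting
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)
```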
In some embodiments of the present application, please refer to fig. 4, and fig. 4 is a flowchart illustrating a method for adjusting and optimizing a sound source position coordinate in a voice matching method according to an embodiment of the present application. The details are as follows:
s41: carrying out segmentation processing on the target voice data to obtain a plurality of voice segments;
s42: carrying out feature extraction processing on the voice segments to obtain a plurality of voice segment features, wherein the voice segments correspond to the voice segment features one by one;
s43: optimizing the initial coordinate system according to the characteristics of the voice segments to obtain an optimized coordinate system;
s44: and adjusting the position coordinates of the sound source according to the optimized coordinate system.
In this embodiment, the sound source position coordinates obtained by the coordinate establishing module may be adjusted and optimized by a coordinate comprehensive adjustment module in the voice matching system. Specifically, the target voice data is segmented into a plurality of voice segments. Feature extraction is then performed on these segments to obtain a plurality of voice segment features, where the segments correspond one-to-one to the features. According to each voice segment feature, the initial position of a coordinate axis is determined, yielding a plurality of coordinate axes; the coordinate comprehensive adjustment module determines an optimal position for each of these axes and optimizes the initial coordinate system based on those optimal positions, obtaining an optimized coordinate system. According to the relative positional relation between the optimized coordinate system and the initial coordinate system, the sound source position coordinates in the initial coordinate system can be mapped into the optimized coordinate system, thereby adjusting and optimizing the sound source position coordinates.
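The embodiment leaves the segment length and the per-segment features open. The sketch below covers only S41 and S42: fixed-length segmentation plus two simple per-segment features (log energy and zero-crossing rate). Both choices are illustrative assumptions, and the axis-optimization step itself is not reproduced here.

```python
import numpy as np

def segment(speech, fs, seg_seconds=1.0):
    """S41: cut the speech into consecutive fixed-length segments."""
    n = int(fs * seg_seconds)
    return [speech[i:i + n] for i in range(0, len(speech) - n + 1, n)]

def segment_features(seg):
    """S42: one feature vector per segment (log energy, zero-crossing rate)."""
    seg = seg.astype(np.float64)
    log_energy = float(np.log(np.sum(seg ** 2) + 1e-12))
    zcr = float(np.mean(np.abs(np.diff(np.sign(seg))) > 0))
    return np.array([log_energy, zcr])

# Usage: five seconds of audio at 16 kHz yields five one-second segments.
speech = np.random.default_rng(0).standard_normal(16000 * 5)
features = [segment_features(s) for s in segment(speech, 16000)]
```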
In some embodiments of the present application, before coordinates are established by the coordinate establishing module in the speech matching system, the target speech data may be further processed by a data processing module in the system to remove blank signal frames from the target speech data. Specifically, after the voice matching system obtains the target voice data through the sound intensity detection module, the target voice data can be transmitted to the data processing module through a data transmission module. The data processing module performs signal-frame detection on the target voice data: for each signal frame it detects one or more of the signal strength, short-time energy, and zero-crossing rate, judges from the parameter values whether the frame is a blank signal frame, and, if so, deletes the blank signal frame from the target voice data. For example, suppose a signal strength threshold for judging blank frames is set. The signal strength of each signal frame in the target voice data is detected and compared with the threshold; if the signal strength is smaller than the threshold, the frame is judged to be a blank signal frame and is deleted from the target voice data. Removing the blank signal frames avoids their influence on the sound source localization of the target voice data and improves the accuracy of the localization.
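A hedged sketch of the blank-frame removal just described, using short-time energy as the single criterion; the frame length and the energy threshold are assumptions for illustration.

```python
import numpy as np

def remove_blank_frames(speech, frame_len=512, energy_threshold=1e-4):
    """Drop frames whose short-time energy falls below the threshold."""
    frames = [speech[i:i + frame_len]
              for i in range(0, len(speech) - frame_len + 1, frame_len)]
    kept = [f for f in frames
            if float(np.mean(f.astype(np.float64) ** 2)) >= energy_threshold]
    return np.concatenate(kept) if kept else np.array([], dtype=speech.dtype)
```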
In some embodiments of the present application, before coordinates are established by the coordinate establishing module in the voice matching system, an environmental noise intensity detection module in the system may perform environmental noise intensity detection on the target voice data, extract the environmental noise features in the target voice data, and delete them. Specifically, a voice conduction module in the voice matching system connects the environmental noise intensity detection module with the data processing module and enables bidirectional interaction between the two: the data processing module transmits the target voice data through the conduction module to the environmental noise intensity detection module, which extracts the environmental noise features from the target voice data, deletes them, and returns the cleaned target voice data to the data processing module through the conduction module.

For example, in the environmental noise intensity detection module, the detection process may run as follows. Noise information is extracted from the target voice data to obtain the environmental noise features: the target voice data is framed, the non-speech segments are identified and treated as pure-noise segments, and the noise information extracted from these segments, for example a noise spectrum obtained by Fourier transform, constitutes the environmental noise features. After the environmental noise features are extracted, they are removed from the target voice data by spectral subtraction, Wiener filtering, or similar techniques.

Alternatively, the process may use a noise library: collected noise features, including blank-noise features, are stored in the library in vector form. The target voice data is framed and data features are extracted frame by frame; the features of each frame are compared with the noise features stored in the library to judge whether the vectors are similar, and if they are, the frame's features are judged to be noise features and the frame is deleted directly, so that the environmental noise features are removed from the target voice data. In this embodiment, deleting the environmental noise features from the target voice data avoids their influence on the sound source localization and improves the accuracy of the localization.
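Spectral subtraction is named above as one removal technique; the following is a minimal sketch of it, assuming the noise magnitude spectrum `noise_mag` was already estimated from the pure-noise frames as described. The frame interface and the flooring factor are illustrative assumptions.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from one speech frame,
    keeping the original phase and flooring to avoid negative magnitudes."""
    spec = np.fft.rfft(frame)
    mag = np.maximum(np.abs(spec) - noise_mag, floor * noise_mag)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(frame))
```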
In some embodiments of the present application, please refer to fig. 5, and fig. 5 is a flowchart illustrating a method for performing voice matching on target voice data in a voice matching method according to an embodiment of the present application. The details are as follows:
s51: carrying out segmentation processing on the target voice data to obtain a plurality of voice segments;
s52: performing data comparison processing on the plurality of voice segments to obtain a data comparison result, wherein the data comparison processing comprises voice feature comparison, voice parameter comparison, voice duration comparison and voice occupation size comparison;
s53: according to the data comparison result, carrying out paragraph statistical processing on the plurality of voice segments to obtain a paragraph statistical result;
s54: according to the paragraph statistical result, performing voice integration processing on the plurality of voice segments to obtain integrated voice data, wherein the voice integration processing includes same-feature voice integration, same-parameter voice integration, same-duration voice integration and same-occupation-size voice integration;
s55: and performing voice matching processing on the integrated voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
In this embodiment, the voice matching system is provided with a voice comparison module, a voice integration module, and a voice matching module, which together implement the voice matching of the target voice data. In this embodiment, the process of matching the obtained target voice data may be as follows.

First, the target voice data is segmented and the resulting segments are numbered, yielding a plurality of voice segments with number marks. The numbered segments are then compared pairwise to obtain the data comparison results. The comparison includes, but is not limited to, at least one of the following dimensions: voice feature comparison, voice parameter comparison, voice duration comparison, and voice occupation size comparison (the size occupied by each segment). Comparison results along multiple dimensions can thus be obtained.

Taking voice feature comparison as an example: suppose segmenting the target voice data yields 5 voice segments numbered 1 to 5, and feature extraction is performed on each of them to obtain their voice features. The features of segment 1 are compared one by one with those of segments 2, 3, 4, and 5, judging in each case whether the two segments have the same voice features; this gives 4 comparison results. Comparing segment 2 with segments 3, 4, and 5 in the same way gives 3 results, segment 3 with segments 4 and 5 gives 2, and segment 4 with segment 5 gives 1. In total, 10 comparison results are obtained in the feature-comparison dimension, one per unordered pair of segments.
The other data comparison dimensions are handled in the same way and are not repeated here. After all data comparison results are obtained, paragraph statistical processing can be performed on the voice segments according to those results to obtain the paragraph statistical result. Specifically, the statistics may be gathered as follows: for each voice segment, record the number of segments that share the same data with it in each comparison dimension, together with the number marks of those segments. From the paragraph statistical result it can thus be known which numbered segments have the same voice features, which have the same voice parameters, which have the same voice duration, and which occupy the same size.

Further, according to the paragraph statistical result, voice integration processing is performed on the voice segments to obtain integrated voice data; the integration includes same-feature voice integration, same-parameter voice integration, same-duration voice integration, and same-occupation-size voice integration. For example, if the three segments numbered 1, 3, and 5 have the same voice features, integrating these three segments yields one piece of integrated voice data. It can be understood that the integration may simply splice the segments in numbering order.

After the integrated voice data is obtained, it is matched against the voice data in the first voice data set to judge whether it matches any entry; the voice type of the target voice data is then determined according to the voice type of the matched entry in the first voice data set, which gives the voice matching result corresponding to the target voice data. It can be understood that, when several pieces of integrated voice data are obtained, matching each of them against the first voice data set may yield several candidate voice types, and the final voice type, taken as the voice matching result of the target voice data, can be determined from these candidates by way of probability calculation.
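A sketch of the pairwise comparison in S52 along the feature dimension, assuming each numbered segment is represented by a feature vector and that 'same voice features' means cosine similarity above a threshold; the threshold and representation are assumptions. With 5 segments this produces C(5,2) = 10 results, matching the worked example above, and integration then splices same-feature segments in numbering order.

```python
from itertools import combinations
import numpy as np

def compare_pairs(features, threshold=0.95):
    """Pairwise 'same feature' judgments keyed by 1-based segment numbers."""
    results = {}
    for i, j in combinations(range(len(features)), 2):
        a, b = features[i], features[j]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        results[(i + 1, j + 1)] = sim >= threshold
    return results

def integrate_same_feature(segments, results):
    """Splice all segments judged pairwise-same, in numbering order."""
    same = sorted({k for pair, ok in results.items() if ok for k in pair})
    return np.concatenate([segments[k - 1] for k in same]) if same else None
```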
In some embodiments of the present application, the voice matching system is further provided with a speech processing module. Before voice matching is performed on the target voice data, the target voice data may also be preprocessed by this speech processing module; this optimizes the target voice data, raises the overall quality of the speech signal, greatly improves the efficiency and effect of the subsequent voice matching, avoids matching errors, and improves the overall application effect of voice matching. Specifically, the preprocessing of the target voice data may include fuzzy-segment removal, voice filtering, and noisy-speech processing. In this embodiment, the noisy-speech processing further includes speech analysis, speech form determination, noise identification, noise elimination, and noise learning; the noise learning learns the noisy-speech handling procedure through a deep learning algorithm, which can greatly improve the efficiency and effect of subsequent noisy-speech processing, achieves a good application effect, and greatly raises the fault tolerance of the procedure.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In some embodiments of the present application, please refer to fig. 6; fig. 6 is a block diagram of the basic structure of a speech matching apparatus according to an embodiment of the present application. The apparatus in this embodiment comprises units for performing the steps of the method embodiments described above; refer to those embodiments for details. For convenience of explanation, only the portions related to the present embodiment are shown. As shown in fig. 6, the voice matching apparatus includes: an acquisition module 61, a calculation module 62, a determination module 63, and a matching module 64. The acquisition module 61 is configured to acquire target voice data and the sound source position coordinates of the target voice data. The calculation module 62 is configured to calculate the target distance between the sound source position and the voice receiving position according to the sound source position coordinates. The determination module 63 is configured to determine a voice matching range in a preset voice database according to the target distance, where the voice data contained in the voice matching range form a first voice data set, the voice data contained in the voice database form a second voice data set, and the first voice data set is a subset of the second voice data set. The matching module 64 is configured to perform voice matching processing on the target voice data and the voice data in the first voice data set to obtain the voice matching result corresponding to the target voice data.
It should be understood that the modules of the voice matching apparatus correspond one-to-one to the steps of the voice matching method described above, and the details are not repeated here.
In some embodiments of the present application, please refer to fig. 7, and fig. 7 is a basic structural block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic apparatus 7 of this embodiment includes: a processor 71, a memory 72 and a computer program 73, such as a program for a speech matching method, stored in said memory 72 and executable on said processor 71. The processor 71 implements the steps in the various embodiments of the speech matching method described above when executing the computer program 73. Alternatively, the processor 71 implements the functions of the modules in the embodiment corresponding to the voice matching apparatus when executing the computer program 73. Please refer to the related description in the embodiments, which is not repeated herein.
Illustratively, the computer program 73 may be divided into one or more modules (units) that are stored in the memory 72 and executed by the processor 71 to accomplish the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 73 in the electronic device 7. For example, the computer program 73 may be divided into an acquisition module, a calculation module, a determination module, and a matching module, each module having the specific functions as described above.
The electronic device may include, but is not limited to, a processor 71 and a memory 72. It will be appreciated by those skilled in the art that fig. 7 is merely an example of the electronic device 7 and does not constitute a limitation of it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device may also include input and output devices, network access devices, buses, and the like.
The Processor 71 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 72 may be an internal storage unit of the electronic device 7, such as a hard disk or a memory of the electronic device 7. The memory 72 may also be an external storage device of the electronic device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 7. Further, the memory 72 may also include both an internal storage unit and an external storage device of the electronic device 7. The memory 72 is used for storing the computer program and other programs and data required by the electronic device. The memory 72 may also be used to temporarily store data that has been output or is to be output.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments. In this embodiment, the computer-readable storage medium may be nonvolatile or volatile.
The embodiments of the present application further provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in the above method embodiments.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of speech matching, comprising:
acquiring target voice data and a sound source position coordinate of the target voice data;
calculating a target distance between the sound source position and the voice receiving position according to the sound source position coordinates;
determining a voice matching range in a preset voice database according to the target distance, wherein voice data contained in the voice matching range form a first voice data set, voice data contained in the voice database form a second voice data set, and the first voice data set is a subset of the second voice data set;
and performing voice matching processing on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
2. The speech matching method of claim 1, wherein the step of obtaining target speech data comprises:
carrying out sound source sensing on a target scene to obtain a plurality of voice signals, wherein one voice signal corresponds to one sensing point;
respectively carrying out sound intensity detection on the plurality of voice signals to obtain a sound intensity value corresponding to each voice signal;
and determining a target sensing point according to the sound intensity value corresponding to each voice signal, and determining the voice signal corresponding to the target sensing point as target voice data, wherein the target sensing point is a sensing point corresponding to the voice signal with the maximum sound intensity value in the plurality of voice signals.
3. The voice matching method according to claim 2, wherein the step of acquiring the sound source position coordinates of the target voice data includes:
carrying out feature extraction processing on the target voice data to obtain voice features corresponding to the target voice data;
establishing a coordinate system according to the voice characteristics to obtain an initial coordinate system;
and positioning the target voice data according to the initial coordinate system to obtain the sound source position coordinates of the target voice data.
4. The voice matching method according to claim 3, characterized in that after the step of obtaining the sound source position coordinates of the target voice data, further comprising:
carrying out segmentation processing on the target voice data to obtain a plurality of voice segments;
carrying out feature extraction processing on the voice segments to obtain a plurality of voice segment features, wherein the voice segments correspond to the voice segment features one by one;
optimizing the initial coordinate system according to the characteristics of the voice segments to obtain an optimized coordinate system;
and adjusting the position coordinates of the sound source according to the optimized coordinate system.
5. The speech matching method according to claim 3, wherein before the step of performing localization processing on the target speech data according to the initial coordinate system to obtain the sound source position coordinates of the target speech data, the method further comprises:
and carrying out environmental noise intensity detection processing on the target voice data, extracting environmental noise features in the target voice data, and deleting the environmental noise features.
6. The method according to claim 1, wherein the step of performing a voice matching process on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data comprises:
carrying out segmentation processing on the target voice data to obtain a plurality of voice segments;
performing data comparison processing on the plurality of voice segments to obtain a data comparison result, wherein the data comparison processing comprises voice feature comparison, voice parameter comparison, voice duration comparison and voice occupation size comparison;
according to the data comparison result, carrying out paragraph statistical processing on the plurality of voice segments to obtain a paragraph statistical result;
according to the paragraph statistical result, performing voice integration processing on the plurality of voice segments to obtain integrated voice data, wherein the voice integration processing comprises same-feature voice integration, same-parameter voice integration, same-duration voice integration and same-occupation-size voice integration;
and performing voice matching processing on the integrated voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
7. The method according to claim 6, wherein the step of segmenting the target speech data to obtain a plurality of speech segments is preceded by the step of:
preprocessing the target voice data, wherein the preprocessing comprises: fuzzy section removing processing, voice filtering processing and noisy voice processing.
8. A speech matching apparatus, comprising:
the acquisition module is used for acquiring target voice data and the sound source position coordinates of the target voice data;
the calculation module is used for calculating the target distance between the sound source position and the voice receiving position according to the sound source position coordinates;
a determining module, configured to determine a voice matching range in a preset voice database according to the target distance, where voice data included in the voice matching range forms a first voice data set, voice data included in the voice database forms a second voice data set, and the first voice data set is a subset of the second voice data set;
and the matching module is used for performing voice matching processing on the target voice data and the voice data in the first voice data set to obtain a voice matching result corresponding to the target voice data.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202211371312.4A (priority date 2022-11-03, filing date 2022-11-03): Voice matching method, device, equipment and storage medium. Published as CN115810344A (Pending).

Priority Applications (1)

  • CN202211371312.4A (priority date 2022-11-03, filing date 2022-11-03): Voice matching method, device, equipment and storage medium

Applications Claiming Priority (1)

  • CN202211371312.4A (priority date 2022-11-03, filing date 2022-11-03): Voice matching method, device, equipment and storage medium

Publications (1)

  • CN115810344A, published 2023-03-17

Family

  • ID=85483112

Family Applications (1)

  • CN202211371312.4A (Pending): Voice matching method, device, equipment and storage medium

Country Status (1)

  • CN: CN115810344A (en)


Legal Events

  • PB01: Publication
  • SE01: Entry into force of request for substantive examination