CN116982111A - Audio characteristic compensation method, audio identification method and related products - Google Patents
- Publication number
- CN116982111A (application number CN202180095675.7A)
- Authority
- CN
- China
- Prior art keywords
- frequency domain data
- frequency
- air density
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
Abstract
An audio feature compensation method and apparatus, an audio recognition method and apparatus, an electronic device, and a computer-readable medium. The audio feature compensation method includes: acquiring first audio data (601); and performing feature compensation on the first audio data according to a first air density and a first reference air density to obtain second audio data, where the first air density is the air density of the environment in which the first audio data is acquired (602). The method and apparatus help improve the extraction accuracy of audio features.
Description
The application relates to the technical field of artificial intelligence, in particular to an audio feature compensation method, an audio recognition method and related products.
Audio signals carry rich speech and semantic information; by extracting and recognizing this information, many intelligent services such as voice assistants and voiceprint recognition can be realized. A voiceprint is a biometric attribute of the human body that is easy to extract and simple to use, so voiceprint recognition is increasingly applied to identity authentication in banking or database systems, black/white lists in real-time communication, meeting records in conference systems, control authentication in smart homes, and the like.
When processing audio signals, it is essential to extract spectral features from the spectrum of the audio signal. However, the spectrum of an audio signal is affected by environmental factors, so the extracted audio features have low accuracy, and the recognition accuracy in many application scenarios is correspondingly low. How to improve the extraction accuracy of audio features is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides an audio feature compensation method, an audio recognition method and related products, which eliminate the influence of environmental factors on audio features and improve the extraction precision of the audio features through the compensation of the audio features.
In a first aspect, the present application provides an audio feature compensation method, including: acquiring first audio data; and performing characteristic compensation on the first audio data according to the first air density and the first reference air density to obtain second audio data, wherein the first air density is the air density in the environment where the first audio data are acquired.
The first reference air density is the air density of the environment in which the audio data samples in the training samples are collected; that is, the first reference air density is related to the time and location at which the audio data samples are collected. For example, if an audio data sample is collected at time T2 and location W2, the first reference air density is the air density at location W2 at time T2. The training samples may be used to train a first audio recognition model, and the first audio recognition model may be used to perform audio recognition on the second audio data.
It can be seen that, in this embodiment of the present application, the acquired first audio data is feature-compensated according to the first air density and the first reference air density, so that audio data collected under any air density can be unified to the first reference air density; in other words, audio data collected under any air density becomes equivalent to audio data collected under the first reference air density. This eliminates the influence of environmental factors such as air density on the audio features of the audio data and improves the extraction accuracy of the audio features.
With reference to the first aspect, in some possible implementations, compensating the first audio data according to the first air density and the first reference air density to obtain second audio data includes: performing frequency domain transformation on the first audio data to obtain first frequency domain data; performing characteristic compensation on the first frequency domain data according to the first air density and the first reference air density to obtain second frequency domain data; and performing frequency domain inverse transformation on the second frequency domain data to obtain second audio data.
It can be seen that in the embodiment of the application, the audio features can be compensated by frequency domain transformation and frequency domain inverse transformation, so that the influence of air density on the audio features is eliminated, and the extraction accuracy of the audio features is improved.
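As a concrete illustration of this transform, compensate, inverse-transform flow, the following Python sketch shows one minimal way the pipeline could be organized. It is not the patent's implementation: the use of an FFT/IFFT pair, single-frame processing, and the helper name compensate_spectrum (sketched in the sections below) are assumptions.

```python
import numpy as np

def compensate_audio(frame, rho_env, rho_ref):
    """Hedged sketch: frequency domain transform -> density-based feature
    compensation -> frequency domain inverse transform."""
    # first frequency domain data: forward transform of the collected frame
    first_fd = np.fft.rfft(frame)
    # second frequency domain data: compensation from rho_env to rho_ref
    # (compensate_spectrum is a hypothetical helper sketched further below)
    second_fd = compensate_spectrum(first_fd, rho_env, rho_ref)
    # second audio data: back to the time domain
    return np.fft.irfft(second_fd, n=len(frame))
```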
With reference to the first aspect, in some possible embodiments, performing feature compensation on the first frequency domain data according to the first air density and the first reference air density to obtain second frequency domain data includes: under the condition that the first air density is larger than the first reference air density, performing first sampling operation on the first frequency domain data according to the first air density and the first reference air density to obtain sampled first frequency domain data; and performing first frequency domain shaping operation on the first frequency domain data and the sampled first frequency domain data to obtain second frequency domain data, wherein the number of frequency points of the second frequency domain data is the same as that of the first frequency domain data, and the frequency interval between adjacent frequency points in the second frequency domain data is the same as that between adjacent frequency points in the first frequency domain data.
With reference to the first aspect, in some possible embodiments, in a case that the first air density is smaller than the first reference air density, according to the first air density and the first reference air density, the first frequency domain data is spread in a high frequency direction to obtain third frequency domain data; performing first sampling operation on the third frequency domain data according to the first air density and the first reference air density to obtain sampled third frequency domain data; and performing first frequency domain shaping operation on the third frequency domain data and the sampled third frequency domain data to obtain second frequency domain data, wherein the number of frequency points of the second frequency domain data is the same as that of the sampled third frequency domain data, and the frequency interval between adjacent frequency points in the second frequency domain data is the same as that between adjacent frequency points in the third frequency domain data.
It can be seen that in the above two embodiments, the first air density and the first reference air density are compared to obtain the difference therebetween, and the audio feature is compensated according to the actual influence of the difference on the audio feature, so that the first audio data can be accurately compensated from the first air density to the first reference air density, and the influence of the air density is eliminated.
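A hedged sketch of the two branches described above follows. The constant K, the helper names, and the use of the density-to-sound-speed conversion of formula (1) (introduced later in the description) are assumptions made only for illustration.

```python
import numpy as np

K_BULK = 1.42e5  # bulk-modulus-like constant (Pa); treated as fixed here for illustration

def sound_speed(rho):
    """Formula (1) of the description: c = sqrt(K / rho)."""
    return np.sqrt(K_BULK / rho)

def compensate_spectrum(first_fd, rho_env, rho_ref):
    """One possible reading of the branch selection described above:
    - ambient density > reference density (sound slower): sample, then shape;
    - ambient density < reference density (sound faster): spread toward high
      frequency first, then sample and shape.
    spread_spectrum / first_sampling / frequency_domain_shaping are hypothetical
    helpers sketched in the following sections."""
    c_env, c_ref = sound_speed(rho_env), sound_speed(rho_ref)
    if rho_env > rho_ref:            # formants sit lower than at the reference
        sampled = first_sampling(first_fd, c_env, c_ref)
        return frequency_domain_shaping(first_fd, sampled)
    if rho_env < rho_ref:            # formants sit higher than at the reference
        third_fd = spread_spectrum(first_fd, c_env, c_ref)
        sampled = first_sampling(third_fd, c_env, c_ref)
        return frequency_domain_shaping(third_fd, sampled)
    return first_fd                  # equal densities: nothing to compensate
```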
With reference to the first aspect, in some possible implementations, spreading the first frequency domain data according to the first air density and the first reference air density to obtain third frequency domain data includes: according to the first sound velocity and the first reference sound velocity, the first frequency domain data are spread in the high frequency direction to obtain third frequency domain data, wherein the ratio between the number of frequency points of the third frequency domain data and the number of frequency points of the first frequency domain data is the ratio between the first sound velocity and the first reference sound velocity, the first sound velocity is determined according to the first air density, and the first reference sound velocity is determined according to the first reference air density.
It can be seen that, in the embodiment of the present application, when the first air density is smaller than the first reference air density, that is, when the first sound velocity is greater than the first reference sound velocity, the expansion is performed first in the high-frequency direction, so as to obtain high-frequency information, thereby providing high-frequency information for subsequent sampling operation and shaping operation.
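A minimal sketch of the spreading step, assuming the third frequency domain data simply appends extra high-frequency bins so that the bin count grows by the factor of the first sound velocity to the first reference sound velocity; how those extra bins are filled (zero-padding here) is an assumption, not something fixed by the text above.

```python
import numpy as np

def spread_spectrum(first_fd, c_env, c_ref):
    """Spread the first frequency domain data toward the high-frequency direction:
    the third frequency domain data has round(N * c_env / c_ref) bins (c_env > c_ref
    in this branch).  Zero-padding is used as a placeholder for the new bins."""
    n = len(first_fd)
    n_spread = int(round(n * c_env / c_ref))
    third_fd = np.zeros(n_spread, dtype=first_fd.dtype)
    third_fd[:n] = first_fd
    return third_fd
```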
With reference to the first aspect, in some possible implementations, the first sampling operation includes: sampling frequency domain data A according to the first sound velocity and the first reference sound velocity to obtain sampled frequency domain data A; wherein the first sound velocity is determined from the first air density, and the first reference sound velocity is determined from the first reference air density; the ratio of the number of frequency points of the sampled frequency domain data A to the number of frequency points of the frequency domain data A is the ratio of the first reference sound velocity to the first sound velocity; when the frequency domain data A is the first frequency domain data, the sampled frequency domain data A is the sampled first frequency domain data, and when the frequency domain data A is the third frequency domain data, the sampled frequency domain data A is the sampled third frequency domain data.
It can be seen that, in this embodiment of the present application, the frequency domain data A is sampled to obtain the sampled frequency domain data A, so that the shifted formants of the first frequency domain data can be pulled back to their normal positions using the sampled frequency domain data A, thereby eliminating the influence of air density on the audio features.
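The first sampling operation can be read as resampling the spectrum so that the bin count changes by the ratio of the first reference sound velocity to the first sound velocity, which is what moves the shifted formants back toward their reference positions. The sketch below uses linear interpolation of the real and imaginary parts; the actual interpolator is not specified above, so this is only an assumption.

```python
import numpy as np

def first_sampling(fd_a, c_env, c_ref):
    """Resample frequency domain data A so that the number of output bins is
    round(N * c_ref / c_env), per the ratio stated above."""
    n = len(fd_a)
    n_out = int(round(n * c_ref / c_env))
    positions = np.linspace(0, n - 1, n_out)       # where to read the original bins
    grid = np.arange(n)
    real = np.interp(positions, grid, np.real(fd_a))
    imag = np.interp(positions, grid, np.imag(fd_a))
    return real + 1j * imag
```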
With reference to the first aspect, in some possible implementations, the first frequency domain shaping operation includes: performing digitizing (binarization) on frequency domain data B to obtain digitized frequency domain data B, wherein if the value corresponding to frequency point A in the frequency domain data B is not 0, the value corresponding to frequency point A in the digitized frequency domain data B is 1, and if the value corresponding to frequency point A in the frequency domain data B is 0, the value corresponding to frequency point A in the digitized frequency domain data B is 0, frequency point A being any frequency point in the frequency domain data B; and performing a mathematical operation on the digitized frequency domain data B and frequency domain data C, in order of frequency points from low frequency to high frequency, to obtain the second frequency domain data; when the frequency domain data B is the first frequency domain data, the frequency domain data C is the sampled first frequency domain data; when the frequency domain data B is the third frequency domain data, the frequency domain data C is the sampled third frequency domain data.
With reference to the first aspect, in some possible embodiments, performing the mathematical operation on the digitized frequency domain data B and the frequency domain data C in order of frequency points from low frequency to high frequency to obtain the second frequency domain data includes: performing the mathematical operation on the digitized frequency domain data B and the frequency domain data C in order of frequency points from low frequency to high frequency to obtain fourth frequency domain data; and performing energy shaping on the fourth frequency domain data to obtain the second frequency domain data, wherein the total energy corresponding to the second frequency domain data is the same as the total energy corresponding to the first frequency domain data.
It can be seen that, in this embodiment of the present application, energy shaping makes the total energy of the second frequency domain data the same as that of the first frequency domain data, so that the feature-compensated second audio data only changes the waveform while remaining consistent in energy, without changing the intrinsic characteristics of the first audio data, and the accuracy of audio recognition using the second audio data is therefore higher.
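The exact "mathematical operation" applied bin by bin is not spelled out above, so the sketch below makes one plausible assumption: the digitized (0/1) version of data B is used as a mask that is multiplied, from low frequency to high frequency, with data C, and the result is then energy-shaped so its total energy matches the uncompensated spectrum.

```python
import numpy as np

def frequency_domain_shaping(fd_b, fd_c):
    """Hedged sketch of the first frequency domain shaping operation."""
    n = min(len(fd_b), len(fd_c))             # both branches end with the original bin count
    mask = (fd_b[:n] != 0).astype(float)      # digitizing: non-zero bins -> 1, zero bins -> 0
    fourth_fd = mask * fd_c[:n]               # assumed low-to-high bin-wise operation
    # energy shaping: keep the total energy equal to that of the input spectrum
    e_in = np.sum(np.abs(fd_b) ** 2)
    e_out = np.sum(np.abs(fourth_fd) ** 2)
    return fourth_fd * np.sqrt(e_in / e_out) if e_out > 0 else fourth_fd
```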
With reference to the first aspect, in some possible embodiments, the method further includes: and carrying out audio recognition on the second audio data to obtain an audio recognition result corresponding to the first audio data.
It can be seen that in the embodiment of the application, since the second audio data is the audio data with the first reference air density, the influence of the difference of the air density on the audio features is eliminated, so that the extraction of the audio features is not influenced by the air density, and the accuracy of audio identification on the first audio data is improved.
With reference to the first aspect, in some possible implementations, performing audio recognition on the second audio data to obtain an audio recognition result corresponding to the first audio data includes: and inputting the second audio data into a first audio recognition model which is subjected to training to carry out audio recognition, and obtaining an audio recognition result corresponding to the first audio data, wherein the first reference air density is the air density when the audio data sample in the training sample is acquired, and the training sample is used for training the first audio recognition model.
It can be seen that in the embodiment of the present application, since the first audio recognition model is trained using the audio data samples at the first reference air density, the first audio recognition model better remembers the audio features at the first reference air density. Therefore, the acquired first audio data is compensated to the first reference air density in the application stage, the second audio data is obtained, and the second audio data is used for audio recognition, so that the audio recognition accuracy can be improved.
In a second aspect, an embodiment of the present application provides an audio recognition method, including: acquiring first audio data; the first audio data is input into a second audio recognition model which completes training to carry out audio recognition, and an audio recognition result corresponding to the first audio data is obtained, wherein the second audio recognition model is determined according to a training sample set, the training sample set comprises a plurality of original audio data samples and a plurality of expanded audio data samples, the plurality of expanded audio data samples are obtained by respectively carrying out feature compensation on each original audio data sample in the plurality of original audio data samples according to a second air density and a plurality of second reference air densities, and the second air density is the air density in the environment where each original audio data sample is collected.
It can be seen that in the embodiment of the application, the second audio recognition model is trained by using the audio data samples under each air density, so that the trained second audio recognition model can memorize the audio characteristics of the user under each air density, the robustness of the second audio recognition model is higher, and the audio recognition precision is improved. The second audio recognition model can be used for directly carrying out audio recognition on the collected first audio data without audio feature compensation, so that the audio recognition accuracy is improved, and meanwhile, the audio recognition efficiency is improved.
With reference to the second aspect, in some possible embodiments, before acquiring the first audio data, the method further includes: acquiring a plurality of original audio data samples; according to the second air density and the second reference air densities, respectively performing feature compensation on each original audio data sample in the original audio data samples to obtain a plurality of expanded audio data samples corresponding to each original audio data sample; constructing a training sample set according to a plurality of extended audio data samples corresponding to each original audio data sample and a plurality of original audio data samples; and determining a second audio recognition model which completes training according to the training sample set.
It can be seen that in the embodiment of the present application, the original audio data samples are first expanded, so that expanded audio data samples under each air density can be obtained, and the second audio recognition model is trained by using such audio data samples, so that the robustness of the second audio recognition model is higher, and the subsequent audio recognition accuracy is improved.
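A minimal sketch of this expansion step, reusing the hypothetical compensate_audio helper from the first aspect; per-frame windowing and any bookkeeping of sample labels are omitted.

```python
def expand_training_set(original_samples, rho_collect, reference_densities):
    """Each original sample (collected at air density rho_collect) is feature-
    compensated to every second reference air density, and the expanded copies
    are kept alongside the originals to form the training sample set."""
    training_set = list(original_samples)
    for sample in original_samples:
        for rho_ref in reference_densities:
            training_set.append(compensate_audio(sample, rho_collect, rho_ref))
    return training_set
```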
With reference to the second aspect, in some possible embodiments, performing feature compensation on each of the plurality of original audio data samples according to the second air density and the plurality of second reference air densities to obtain a plurality of extended audio data samples corresponding to each original audio data sample includes: performing frequency domain transformation on each original audio data sample to obtain fifth frequency domain data corresponding to each original audio data sample; for each original audio data sample, performing feature compensation on the fifth frequency domain data according to each second reference air density in the plurality of second reference air densities and the second air density to obtain sixth frequency domain data corresponding to each second reference air density, and obtaining, from the sixth frequency domain data corresponding to each second reference air density, a plurality of pieces of sixth frequency domain data corresponding to each original audio data sample, where the plurality of pieces of sixth frequency domain data are in one-to-one correspondence with the plurality of second reference air densities; and performing frequency domain inverse transformation on each piece of sixth frequency domain data in the plurality of pieces of sixth frequency domain data corresponding to each original audio data sample to obtain the plurality of extended audio data samples corresponding to each original audio data sample.
With reference to the second aspect, in some possible embodiments, performing feature compensation on the fifth frequency domain data according to each of the plurality of second reference air densities and the second air density to obtain sixth frequency domain data corresponding to each second reference air density includes: when the second air density is greater than the second reference air density A, performing a second sampling operation on the fifth frequency domain data according to the second air density and the second reference air density A to obtain sampled fifth frequency domain data; and performing a second frequency domain shaping operation on the fifth frequency domain data and the sampled fifth frequency domain data to obtain the sixth frequency domain data corresponding to the second reference air density A, wherein the number of frequency points of the sixth frequency domain data corresponding to the second reference air density A is the same as that of the fifth frequency domain data, and the frequency interval between adjacent frequency points in the sixth frequency domain data corresponding to the second reference air density A is the same as that between adjacent frequency points in the fifth frequency domain data;
wherein the second reference air density a is any one of a plurality of second reference air densities.
With reference to the second aspect, in some possible embodiments, in a case where the second air density is smaller than the second reference air density a, according to the second air density and the second reference air density a, the fifth frequency domain data is spread in a high frequency direction, so as to obtain seventh frequency domain data; performing second sampling operation on the seventh frequency domain data according to the second air density and the second reference air density A to obtain sampled seventh frequency domain data; performing second frequency domain shaping operation on the seventh frequency domain data and the sampled seventh frequency domain data to obtain sixth frequency domain data corresponding to the second reference air density A; the second reference air density a is any one of a plurality of second reference air densities, wherein the number of frequency points of the sixth frequency domain data corresponding to the second reference air density a is the same as the number of frequency points of the sampled seventh frequency domain data, and the frequency interval between adjacent frequency points in the sixth frequency domain data corresponding to the second reference air density a is the same as the frequency interval between adjacent frequency points in the seventh frequency domain data.
With reference to the second aspect, in some possible embodiments, spreading the fifth frequency domain data in the high frequency direction according to the second air density and the second reference air density A to obtain seventh frequency domain data includes: spreading the fifth frequency domain data in the high frequency direction according to the second sound velocity and the second reference sound velocity A to obtain the seventh frequency domain data, wherein the ratio of the number of frequency points of the seventh frequency domain data to the number of frequency points of the fifth frequency domain data is the ratio of the second sound velocity to the second reference sound velocity A, the second sound velocity is determined from the second air density, and the second reference sound velocity A is determined from the second reference air density A.
With reference to the second aspect, in some possible embodiments, the second sampling operation includes: sampling frequency domain data D according to the second sound velocity and the second reference sound velocity A to obtain sampled frequency domain data D; wherein the second sound velocity is determined from the second air density, and the second reference sound velocity A is determined from the second reference air density A; the ratio of the number of frequency points of the sampled frequency domain data D to the number of frequency points of the frequency domain data D is the ratio of the second reference sound velocity A to the second sound velocity; when the frequency domain data D is the fifth frequency domain data, the sampled frequency domain data D is the sampled fifth frequency domain data, and when the frequency domain data D is the seventh frequency domain data, the sampled frequency domain data D is the sampled seventh frequency domain data.
With reference to the second aspect, in some possible embodiments, the second frequency domain shaping operation includes: performing digitizing (binarization) on frequency domain data E to obtain digitized frequency domain data E, wherein if the value corresponding to frequency point B in the frequency domain data E is not 0, the value corresponding to frequency point B in the digitized frequency domain data E is 1, and if the value corresponding to frequency point B in the frequency domain data E is 0, the value corresponding to frequency point B in the digitized frequency domain data E is 0, frequency point B being any frequency point in the frequency domain data E; and performing a mathematical operation on the digitized frequency domain data E and frequency domain data F, in order of frequency points from low frequency to high frequency, to obtain the sixth frequency domain data corresponding to the second reference air density A; wherein, when the frequency domain data E is the fifth frequency domain data, the frequency domain data F is the sampled fifth frequency domain data, and when the frequency domain data E is the seventh frequency domain data, the frequency domain data F is the sampled seventh frequency domain data.
With reference to the second aspect, in some possible embodiments, performing the mathematical operation on the digitized frequency domain data E and the frequency domain data F in order of frequency points from low frequency to high frequency to obtain the sixth frequency domain data corresponding to the second reference air density A includes: performing the mathematical operation on the digitized frequency domain data E and the frequency domain data F in order of frequency points from low frequency to high frequency to obtain eighth frequency domain data; and performing energy shaping on the eighth frequency domain data to obtain the sixth frequency domain data corresponding to the second reference air density A, wherein the total energy corresponding to the sixth frequency domain data is the same as the total energy corresponding to the fifth frequency domain data.
In a third aspect, an embodiment of the present application provides an audio feature compensation apparatus. The advantages may be seen from the description of the first aspect and will not be repeated here. The audio feature compensation means has the function of implementing the behavior in the method example of the first aspect described above. The functions may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. In one possible design, the audio feature compensation apparatus includes: an acquisition unit configured to acquire first audio data; and the processing unit is used for performing characteristic compensation on the first audio data according to the first air density and the first reference air density to obtain second audio data, wherein the first air density is the air density when the first audio data are acquired.
In a fourth aspect, an embodiment of the present application provides an audio recognition apparatus. The advantages may be seen from the description of the second aspect and will not be repeated here. The audio recognition device has the function of implementing the behavior in the method example of the second aspect described above. The functions may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. In one possible design, the audio recognition device includes: an acquisition unit configured to acquire first audio data; the processing unit is used for inputting the first audio data into a second audio recognition model which is subjected to training to carry out audio recognition, and obtaining an audio recognition result corresponding to the first audio data, wherein the second audio recognition model is obtained by training a training sample set, the training sample set comprises a plurality of original audio data samples and a plurality of expanded audio data samples, the plurality of expanded audio data samples are obtained by respectively carrying out feature compensation on each original audio data sample in the plurality of original audio data samples according to a second air density and a plurality of second reference air densities, and the second air density is the air density when each original audio data sample is acquired.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory for storing a program; a processor for executing a memory-stored program, the processor being for executing the method of the first or second aspect described above when the memory-stored program is executed.
In a sixth aspect, embodiments of the present application provide a computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first or second aspects described above.
In a seventh aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method as in the first or second aspects described above.
In an eighth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a data interface, and the processor reads an instruction stored on a memory through the data interface, and performs the method in the first aspect or the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to perform the method in the first aspect or the second aspect.
FIG. 1 is a schematic diagram of a sounding model according to an embodiment of the present application;
FIG. 2 is a schematic diagram showing the variation of formants after inhaling helium according to an embodiment of the present application;
FIG. 3 is a schematic diagram of voiceprint recognition according to an embodiment of the present application;
FIG. 4 is a diagram of a system architecture according to an embodiment of the present application;
fig. 5 is a schematic diagram of a chip hardware architecture according to an embodiment of the present application;
fig. 6 is a flowchart of an audio feature compensation method according to an embodiment of the present application;
fig. 7 is a schematic diagram of first frequency domain data according to an embodiment of the present application;
fig. 8 is a schematic diagram of spreading first frequency domain data according to an embodiment of the present application;
FIG. 9 is a schematic diagram of performing a first sampling operation on first frequency domain data according to an embodiment of the present application;
FIG. 10 is a schematic diagram of performing a first sampling operation on third frequency domain data according to an embodiment of the present application;
FIG. 11 is a schematic diagram of digitizing first frequency domain data according to an embodiment of the present application;
FIG. 12 is a schematic diagram of digitizing third frequency domain data according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a mathematical operation performed on first frequency domain data and sampled first frequency domain data according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a mathematical operation performed on third frequency domain data and sampled third frequency domain data according to an embodiment of the present application;
FIG. 15 is a graph of a spectrum before and after audio feature compensation when a first sound velocity is less than a first reference sound velocity;
FIG. 16 is a graph of a spectrum before and after audio feature compensation when a first sound velocity is greater than a first reference sound velocity;
FIG. 17 is a flowchart of an audio recognition model training method according to an embodiment of the present application;
fig. 18 is a flowchart of an audio recognition method according to an embodiment of the present application;
FIG. 19 is a schematic diagram of another audio recognition method according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of an audio feature compensation apparatus according to an embodiment of the present application;
fig. 21 is a schematic structural diagram of an audio recognition device according to an embodiment of the present application;
fig. 22 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
In order to facilitate understanding of the present application, explanation will be first made of the present application concerning the related art.
The human vocal model can be simplified as the vocal cords exciting the vocal tract, and the sound production process is shown in fig. 1: the vocal cords A generate vocal cord signals with different fundamental frequencies and harmonics, and the vocal tract system B is composed of the mouth, throat, nasal cavity, and so on. The vocal tract system B can therefore be regarded as a filter that, during filtering, suppresses some frequencies of the vocal cord signal and enhances others (i.e., produces resonance). The vocal cord signal, after being filtered by the vocal tract system B, yields the output signal C, namely the sound produced by the human body.
However, when the same speaker speaks in different environments, the sound changes and the formant positions of the output signal C differ. Specifically, if the vocal tract system is abstracted as a tube, the formant positions are integer multiples of f, where f = c/(4L), L is the length of the tube, and c is the speed of sound. Because a speaker's vocal tract system generally does not change, the tube length L generally does not change, so the main factor influencing the formant positions is the speed of sound, which in turn is related to the air density; when the speed of sound changes, the formant positions also change. The speed of sound can be expressed by formula (1):
c is the speed of sound, K is the bulk modulus, ρ is the gas density.
Therefore, when the air density increases, the speed of sound decreases and the formant positions move toward lower frequencies; when the air density decreases, the speed of sound increases and the formant positions move toward higher frequencies.
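A rough worked example of formula (1) and f = c/(4L) follows, using illustrative numbers that are not taken from the patent (the bulk modulus is treated as fixed, which is only approximately true when switching from air to helium):

```python
import numpy as np

K = 1.42e5   # Pa, approximate adiabatic bulk modulus of air; reused for helium only for illustration
L = 0.17     # m, assumed vocal-tract length
for name, rho in [("air", 1.29), ("helium", 0.18)]:   # kg/m^3, approximate densities
    c = np.sqrt(K / rho)          # formula (1)
    f1 = c / (4 * L)              # lowest formant of the simplified tube model
    print(f"{name}: c is about {c:.0f} m/s, first formant is about {f1:.0f} Hz")
# Lower gas density -> higher sound speed -> formants shifted toward higher frequencies.
```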
For example, as shown in fig. 2, the left graph shows that before inhaling helium the formants appear at f1 and f2, while after inhaling helium the formants appear at f0; it can be seen that inhaling helium lowers the air density, causing the formant positions to shift significantly toward higher frequencies.
Therefore, even when the same person speaks, the formant positions shift in different environments, so the characteristics of the person's speech also change, and the extracted audio features ultimately have low accuracy. How to make the extracted audio features highly accurate and unaffected by environmental factors is therefore an urgent problem to be solved.
The technical scheme of the application will be described below with reference to the accompanying drawings.
The audio feature compensation method and/or the audio recognition method provided by the embodiments of the present application can be applied to voice assistants and other scenarios that require audio recognition, such as identity recognition (one type of audio recognition). The voice assistant scenario and the identity recognition scenario are described separately below.
Voice assistant scenarios:
For the audio feature compensation method provided by the present application: when a user invokes or wakes up a voice assistant, the device first collects the speech the user utters to invoke or wake up the voice assistant; for example, when invoking or waking up Siri, the voice assistant of an Apple phone, the collected speech may be "Hi, Siri". The device then performs feature compensation on the collected speech, compensating its audio features to the normal environment (for example, the environment in which the voice template was recorded), thereby eliminating the influence of environmental factors on the audio features. Because the audio features of the collected speech are not affected by environmental factors, the device can accurately recognize the semantics of the collected speech in any environment, so the user can quickly and accurately invoke or wake up the device's voice assistant in any environment.
For the audio recognition method provided by the present application: in any environment, when a user invokes or wakes up a voice assistant, the device first collects the speech the user utters to invoke or wake up the voice assistant; for example, when invoking or waking up Siri, the voice assistant of an Apple phone, the collected speech may be "Hi, Siri". The device then calls a trained model to recognize the collected speech, where the trained model has been obtained through training on rich samples. The device can therefore accurately recognize the semantics of the user's speech regardless of the environment the user is in, so the user can quickly and accurately invoke or wake up the device's voice assistant in any environment.
Identity recognition scenario:
the identity recognition scenario can be understood as a voiceprint recognition scenario; at present, identity recognition is performed mainly by calling a neural network. Specifically, as shown in fig. 3, the present method of identity recognition based on audio data mainly includes the following steps: 1) Data training: train a speaker model for voiceprint recognition; the speaker model needs to be trained on speech from different speaking environments and different individuals, and the larger the amount of data, the more stable the speaker model's recognition effect and the better its robustness. 2) Data registration: collect a segment of speech in real time and register it into the speaker model so that the person's voiceprint features are "remembered" by the speaker model, that is, the model stores the person's voiceprint features, generating a speaker model library. 3) Data identification: retrieve the registered speaker model from the speaker model library and call it to perform voiceprint recognition on speech collected in real time, that is, compare the speech with each registered candidate and apply a decision threshold; if the speech matches the stored voiceprint features of a registered candidate, the user corresponding to the speech and the candidate corresponding to the registered voiceprint features are judged to be the same person, completing identity recognition.
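As an illustration of step 3, the following sketch compares a probe voiceprint embedding against registered candidates with a decision threshold; the embedding extractor, the cosine scoring, and the threshold value are assumptions for illustration, not details given above.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify(probe_embedding, registered, threshold=0.7):
    """registered: dict mapping candidate name -> enrolled voiceprint embedding.
    Returns the best-matching candidate if the score clears the threshold, else None."""
    best_name, best_score = None, -1.0
    for name, emb in registered.items():
        score = cosine_score(probe_embedding, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```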
For the audio feature compensation method provided by the present application: the audio data of the user whose identity is to be verified is obtained; feature compensation is then performed on the audio data, compensating its features to a preset environment (for example, the environment used for data registration) to obtain compensated audio data; finally, the compensated audio data is used to verify the identity of the user. In this way, no matter what environment the user is in, the collected audio data can be compensated to the preset environment, and the user's identity can be accurately recognized in any environment.
For the audio recognition method provided by the present application: during data registration, the collected data can be expanded to obtain the user's audio features in each environment, so that the speaker model can remember the user's audio features under each environment. During subsequent identity recognition, the user's identity information can be accurately recognized no matter what environment the user is in.
The method and apparatus provided in the embodiments of the present application may also be used to extend a training database, where, as shown in fig. 4, the I/O interface 112 of the execution device 110 may send all the extended audio data samples processed by the execution device 110 and the original audio data samples as training data to the database 130, so that the training data maintained by the database 130 is richer, and thus richer training data is provided for the training work of the training device 120.
The method provided by the application is described below from a model training side and a model application side:
the training method of the first audio recognition model provided by the embodiment of the application relates to audio data processing, and can be particularly applied to data processing methods such as data training, machine learning, deep learning and the like, and the training data such as the original audio data sample in the application is subjected to symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like to finally obtain the first audio recognition model for completing training; the audio feature compensation method provided by the embodiment of the application can be applied to the trained first audio recognition model, namely, the compensated audio data, namely, the second audio data obtained after the feature compensation of the first audio data, is input into the trained first audio recognition model to obtain an audio recognition result;
the training method of the second audio recognition model provided by the embodiment of the application relates to the processing of audio data, and can be particularly applied to data processing methods such as data training, machine learning, deep learning and the like; in addition, the audio recognition method provided by the embodiment of the application can use the trained second audio recognition model to input directly acquired audio data, namely the first audio data in the application, into the trained second audio recognition model to obtain an audio recognition result.
Because the embodiments of the present application relate to a large number of applications of neural networks, for convenience of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit may be:
h_{W,b}(x) = f(W^T x) = f( Σ_{s=1}^{n} W_s · x_s + b )
where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network and convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, that is, the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be an area composed of several neural units.
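A small sketch of the single neural unit described above (the sigmoid choice is just the example activation mentioned in the text):

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of the inputs x_s plus bias b, passed through a sigmoid activation."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))
```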
(2) Loss function
In training a deep neural network, because the output of the network is expected to be as close as possible to the value that is actually desired, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the network is then updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
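As one concrete example of such a loss function (the text above does not commit to a particular one), the mean squared error between the prediction and the target can be used:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error: smaller values mean the prediction is closer to the target."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))
```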
(3) Back propagation algorithm
The convolutional neural network can use the back propagation (BP) algorithm to correct the parameters of the initial model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is passed forward until an error loss is produced at the output, and the parameters of the initial model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward propagation process dominated by the error loss, and aims to obtain the parameters of the optimal model, such as the weight matrices.
Referring to fig. 4, fig. 4 is a system architecture 100 according to an embodiment of the present application. As shown in the system architecture 100 of fig. 4, the data acquisition device 160 is configured to acquire training data and store the training data in the database 130, where the training data includes: an original audio data sample; training device 120 trains audio recognition model 101 based on training data maintained in database 130. The audio recognition model 101 can be used to implement audio recognition. For example, when implementing the audio feature compensation method provided by the embodiment of the present application, the audio recognition model 101 may be a first audio recognition model that completes training, and then the second audio data obtained by feature compensation may be input to the audio recognition model 101 to obtain an audio recognition result of the first audio data; when the audio recognition method provided by the embodiment of the application is implemented, the audio recognition model 101 may be a second audio recognition model that completes training, and then the collected first audio data may be directly input to the audio recognition model 101, so as to obtain an audio recognition result of the first audio data. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places to perform the training of the audio recognition model 101.
The audio recognition model 101 obtained by training according to the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 4, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, or may be a server or cloud terminal. In fig. 4, the execution device 110 is configured with an I/O interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include in an embodiment of the present application: first audio data collected during application and training data, such as raw audio data samples, collected during training.
The preprocessing module 113 is used for preprocessing input data received by the I/O interface 112. For example, in the case where the audio recognition model 101 is trained using training data under a single environment, such as an original audio data sample, that is, the audio recognition model 101 is a first audio recognition model mentioned later, the preprocessing module 113 may perform feature compensation on the acquired first audio data when performing audio recognition using the audio recognition model 101, and then input the compensated second audio data to the audio recognition model 101.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the audio recognition result obtained as described above, to the client device 140, thereby providing the processing result to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding audio recognition model 101 for different targets or different tasks, where the corresponding audio recognition model 101 may be used to implement the above-mentioned audio recognition task, thereby providing the user with a desired result.
In the case shown in FIG. 4, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 4 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in fig. 4 is not limited in any way, for example, in fig. 4, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed inside the execution device 110.
As shown in fig. 4, the audio recognition model 101 is obtained by training with the training device 120. In the embodiment of the present application, the audio recognition model 101 may be the first audio recognition model or the second audio recognition model; specifically, the audio recognition model provided by the embodiment of the present application may include a convolutional neural network. That is, both the first audio recognition model and the second audio recognition model provided in the embodiment of the present application may be convolutional neural networks.
The following describes a chip hardware structure provided by the embodiment of the application.
Fig. 5 is a chip hardware structure provided in an embodiment of the application, where the chip includes a neural network processor 50. The chip may be provided in an execution device 110 as shown in fig. 4 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 4 for completing the training work of the training device 120 and outputting the audio recognition model 101. The audio feature compensation method and the audio recognition method in the embodiment of the application can be realized in the chip shown in fig. 5.
The neural network processor (Neural Network Processing Unit, NPU) 50 is mounted as a coprocessor on a host CPU (Host CPU), which allocates tasks. The core part of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to extract data from a memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) inside. In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 503 takes the data corresponding to the weight matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit 503 then takes the data of the input matrix A from the input memory 501, performs a matrix operation with the weight matrix B, and stores the partial or final result of the matrix in the accumulator 508 (accumulator).
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculations of non-convolutional/non-FC layers in a neural network, such as Pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), and the like.
In some implementations, the vector computation unit 507 can store the vector of processed outputs to the unified buffer 506. For example, the vector calculation unit 507 may apply a nonlinear function to an output of the operation circuit 503, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 507 generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 503, for example for use in subsequent layers in a neural network.
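Functionally, the interplay between the arithmetic circuit and the vector calculation unit described above can be emulated in software, for example as in the following sketch. This is a conceptual illustration only; the function name, the choice of ReLU as the nonlinear function and the use of NumPy are assumptions and do not reflect the actual systolic-array hardware.

```python
import numpy as np

def npu_layer(a: np.ndarray, b: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Conceptual emulation of one fully connected layer on the NPU:
    the arithmetic circuit performs the matrix operation and accumulation,
    and the vector calculation unit applies vector addition plus a
    nonlinear function to produce activation values."""
    accumulated = a @ b                     # matrix operation A x B, result kept in the accumulator
    pre_activation = accumulated + bias     # vector addition in the vector calculation unit
    return np.maximum(pre_activation, 0.0)  # nonlinear function (ReLU assumed for illustration)
```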
The unified memory 506 is used for storing input data and output data.
Input data in the external memory is transferred to the input memory 501 and/or the unified memory 506 through the direct memory access controller 505 (Direct Memory Access Controller, DMAC), weight data in the external memory is stored into the weight memory 502, and data in the unified memory 506 is stored into the external memory.
A bus interface unit (Bus Interface Unit, BIU) 510 is used to implement interaction between the host CPU, the DMAC and the instruction fetch memory 509 over the bus;
an instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 for storing instructions used by the controller 504;
The controller 504 is configured to invoke an instruction cached in the instruction fetch memory 509, so as to control a working process of the operation accelerator.
Typically, the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are On-Chip memories, and the external memory is a memory external to the NPU 50, which may be a double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM for short), a high bandwidth memory (High Bandwidth Memory, HBM), or another readable and writable memory.
Referring to fig. 6, fig. 6 is a flowchart of an audio feature compensation method according to an embodiment of the application. The method may be specifically performed by the execution device 110 as shown in fig. 4. The method comprises the following steps:
601: first audio data is acquired.
The first audio data is obtained by collecting audio from a user who needs audio identification. The first audio data may be collected in real time, or may be collected in advance and read from the stored audio data when it needs to be used. The manner in which the first audio data is acquired is not limited in the present application.
602: and performing characteristic compensation on the first audio data according to the first air density and the first reference air density to obtain second audio data, wherein the first air density is the air density in the environment where the first audio data are acquired.
The first air density is the air density in the environment where the first audio data is collected, that is, the first air density is related to the collection time and the collection place where the first audio data is collected. For example, when the first audio data is collected at the time T1 and the location W1, the first air density is the air density of the location W1 at the time T1.
For example, when the first audio data is collected, a geographic position of the collection device when the first audio data is collected may be obtained, and a first air density when the first audio data is collected is determined according to the geographic position. Of course, the first air density may also be obtained by other manners, for example, the first air density in the environment where the first audio data is collected may be obtained by a density monitor.
The first reference air density is the air density in the environment where the audio data samples in the training samples are collected. Likewise, the first reference air density is related to the collection time and collection location of the audio data samples; for example, if an audio data sample is collected at time T2 and location W2, then the first reference air density is the air density of the location W2 at the time T2. Illustratively, when the audio data samples are acquired at a standard air density, the first reference air density is the standard air density; when the audio data samples are collected in a laboratory environment, the first reference air density is the air density at which the audio data samples are collected in the laboratory environment; and when the audio data samples are collected in a special environment, for example an environment with helium added, the first reference air density is the air density at which the audio data samples are collected in that special environment. Therefore, the audio data samples can be collected at various air densities, that is, the value of the first reference air density is not limited in the present application.
In addition, the training samples are used for training the first audio recognition model, so that the trained first audio recognition model can be used for audio recognition of the second audio data; the audio recognition process is described later and is not detailed here.
Therefore, the first audio data can be subjected to characteristic compensation according to the difference between the first air density and the first reference air density, so as to obtain the second audio data. That is, the influence of the difference of the air density on the audio data acquisition is eliminated, and the first audio data is converted to be equivalent to the audio data acquired under the first reference air density.
It can be seen that, in the embodiment of the present application, the first audio data of the user collected under the first air density is subjected to feature compensation, that is, the audio features of the first audio data are compensated to the first reference air density, so that the audio features of the first audio data collected under the first air density are aligned with the first reference air density, and the second audio data are obtained. Therefore, the audio characteristics of the audio data collected under any air density can be unified to the first reference air density, namely, the audio data collected under any air density is equivalent to the audio data collected under the first reference air density, so that the influence of environmental factors on the audio characteristics can be eliminated, and the recognition accuracy is higher and the influence of the environmental factors is avoided when the compensated second audio data is used for audio recognition.
The following describes a method for implementing feature compensation provided by the embodiment of the present application.
For example, the first audio data S(n) is subjected to frequency domain transformation to obtain first frequency domain data; for example, Fourier transformation is performed on the first audio data to obtain the first frequency domain data S1(w). As shown in fig. 7, the first frequency domain data S1(w) may be a spectrogram of the first audio data S(n), where the spectrogram includes N frequency points. Illustratively, the spectrum includes, but is not limited to, an energy spectrum, a power spectrum or an amplitude spectrum, and the present application is described by taking the amplitude spectrum as the example of the frequency domain data. Then, the first frequency domain data is subjected to feature compensation according to the first air density and the first reference air density to obtain second frequency domain data; and the second frequency domain data is subjected to frequency domain inverse transformation to obtain the second audio data, for example, inverse Fourier transformation is performed on the second frequency domain data to obtain the second audio data.
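As a minimal sketch of the frequency domain transformation and inverse transformation described above (written in Python/NumPy; the sampling rate, the test signal and all names are illustrative assumptions, not part of the embodiment):

```python
import numpy as np

def to_frequency_domain(audio: np.ndarray) -> np.ndarray:
    # Fourier transformation of the time-domain audio data; the magnitude of
    # each bin plays the role of the amplitude spectrum S1(w) with N frequency points.
    return np.fft.rfft(audio)

def to_time_domain(spectrum: np.ndarray, n_samples: int) -> np.ndarray:
    # Inverse Fourier transformation back to time-domain audio data.
    return np.fft.irfft(spectrum, n=n_samples)

# Illustrative usage: a 1-second 440 Hz tone sampled at 16 kHz stands in for the first audio data.
fs = 16000
t = np.arange(fs) / fs
first_audio = np.sin(2 * np.pi * 440 * t)
s1_w = np.abs(to_frequency_domain(first_audio))   # first frequency domain data (amplitude spectrum)
```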
Optionally, in the case where the first air density is greater than the first reference air density, a first sampling operation is first performed on the first frequency domain data according to the first air density and the first reference air density to obtain sampled first frequency domain data, and a first frequency domain shaping operation is performed on the first frequency domain data and the sampled first frequency domain data to obtain the second frequency domain data. Optionally, in the case where the first air density is smaller than the first reference air density, the first frequency domain data is first spread in the high frequency direction according to the first air density and the first reference air density to obtain third frequency domain data; the first sampling operation is then performed on the third frequency domain data to obtain sampled third frequency domain data, and the first frequency domain shaping operation is performed on the third frequency domain data and the sampled third frequency domain data to obtain the second frequency domain data. It should be appreciated that, in the case where the first air density is equal to the first reference air density, the first audio data may be used directly for audio recognition without feature compensation.
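The branch logic of the preceding paragraph can be summarised as the following sketch; the helper callables stand for the spreading, first sampling and first frequency domain shaping operations detailed below, and all names are illustrative assumptions:

```python
import numpy as np

def compensate_spectrum(s1_w, rho_1, rho_0, spread_fn, sample_fn, shape_fn):
    """Feature compensation of the first frequency domain data s1_w collected
    at first air density rho_1, towards the first reference air density rho_0."""
    if np.isclose(rho_1, rho_0):
        return s1_w                              # equal densities: no compensation needed
    if rho_1 > rho_0:
        sampled = sample_fn(s1_w, rho_1, rho_0)  # first sampling operation
        return shape_fn(s1_w, sampled)           # first frequency domain shaping -> second frequency domain data
    s3_w = spread_fn(s1_w, rho_1, rho_0)         # spread towards high frequency -> third frequency domain data
    sampled = sample_fn(s3_w, rho_1, rho_0)      # first sampling operation on the third frequency domain data
    return shape_fn(s3_w, sampled)               # first frequency domain shaping -> second frequency domain data
```

Concrete sketches of spread_fn, sample_fn and shape_fn are given after the corresponding paragraphs below.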
Illustratively, a first sound velocity is determined according to the above formula (1) and the first air density, and a first reference sound velocity is determined according to the first reference air density; then, the first frequency domain data is spread in the high frequency direction according to the first sound velocity and the first reference sound velocity to obtain the third frequency domain data, where the ratio between the number of frequency points in the third frequency domain data and the number of frequency points in the first frequency domain data is the ratio between the first sound velocity and the first reference sound velocity.
Illustratively, the ratio may be represented by formula (3):

$$\frac{M}{N} = \frac{C_1}{C_0} \qquad (3)$$

where M is the number of frequency points in the third frequency domain data, N is the number of frequency points in the first frequency domain data, C1 is the first sound velocity, and C0 is the first reference sound velocity.
For example, as shown in fig. 8, the first frequency domain data S1 (w) is spread in the high frequency direction to obtain the third frequency domain data S3 (w), that is, the number of frequency points of the first frequency domain data S1 (w) is spread from N to M to obtain the high frequency information, and further obtain the third frequency domain data with the number of frequency points of M.
By way of example, spreading the frequency domain data may be achieved by linear extrapolation (Linear Extrapolation, LE), effective high frequency bandwidth extension (EHBE), hybrid signal extrapolation (Hybrid Signal Extrapolation, HSE) or nonlinear prediction. The present application describes the spread-spectrum implementation process by taking LE as an example; LE mainly uses the approximately linearly decreasing log-magnitude spectral envelope of the audio signal to realize the spreading.
Illustratively, the spectral envelope of the first frequency domain data S1(w) is first acquired; this part of the spectrum may be regarded as the spectral envelope of the low-frequency part. Then, the spectral envelope of the first frequency domain data S1(w) is transformed to the logarithmic domain and fitted to a straight line in the logarithmic domain by linear least squares, yielding the slope of the fitted line. Finally, the low-frequency spectrum information, namely the spectral envelope corresponding to the low-frequency part, is copied to obtain high-frequency information, the envelope of the high-frequency information is attenuated using the slope of the fitted line to obtain the spectral envelope of the high-frequency part in the logarithmic domain, and the whole logarithmic-domain spectral envelope (including the spectral envelopes of the low-frequency part and the high-frequency part) is transformed back to the same coordinate system as the first frequency domain data to obtain the third frequency domain data.
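A rough sketch of the LE-based spreading just described, under simplifying assumptions: the magnitudes themselves stand in for the spectral envelope (no smoothing), the copied low-frequency information is attenuated by a constant log-domain offset derived from the fitted slope, and the target length follows formula (4). These simplifications and all names belong to the sketch, not to the embodiment.

```python
import numpy as np

def spread_le(s1_w: np.ndarray, rho_1: float, rho_0: float) -> np.ndarray:
    """Spread the N-point amplitude spectrum towards high frequency to M points,
    with M/N = sqrt(rho_0 / rho_1), by linear extrapolation in the log domain."""
    n = len(s1_w)
    m = int(round(n * np.sqrt(rho_0 / rho_1)))            # only meaningful when rho_1 < rho_0
    log_env = np.log(np.maximum(s1_w, 1e-12))             # envelope in the logarithmic domain
    slope, _ = np.polyfit(np.arange(n), log_env, 1)       # linear least-squares fit
    copied = np.resize(log_env, m - n)                    # copy low-frequency spectrum information
    attenuated = copied + slope * n                       # envelope attenuation using the fitted slope
    return np.exp(np.concatenate([log_env, attenuated]))  # back to the coordinate system of s1_w
```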
It should be appreciated that, since the sound velocity is determined by the air density, in practical applications the sound velocity need not be determined, and the spreading may be performed directly according to the air density. For example, spreading the first frequency domain data according to the first air density and the first reference air density essentially means spreading the first frequency domain data in the high frequency direction according to the ratio between the first air density and the first reference air density, where the ratio between the number of frequency points of the third frequency domain data and the number of frequency points of the first frequency domain data is the arithmetic square root of the ratio between the first reference air density and the first air density. Therefore, the ratio between the number of frequency points of the third frequency domain data and the number of frequency points of the first frequency domain data can also be expressed by formula (4):

$$\frac{M}{N} = \sqrt{\frac{\rho_0}{\rho_1}} \qquad (4)$$

where M is the number of frequency points in the third frequency domain data, N is the number of frequency points in the first frequency domain data, ρ0 is the first reference air density, and ρ1 is the first air density.
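As an illustrative numeric check (the bin count and density values below are assumed for the example and are not taken from the embodiment), formula (4) spreads a 512-point spectrum collected in thinner air towards the high frequencies:

$$M = N\sqrt{\frac{\rho_0}{\rho_1}} = 512 \times \sqrt{\frac{1.29}{1.00}} \approx 581.5$$

so the third frequency domain data would contain about 582 frequency points after rounding.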
It should be understood that the ratio between the number of frequency points mentioned above and referred to later is only for defining the shape of the transformed (e.g. sampled and spread) frequency domain data, and in practical application, the shape of the transformed frequency domain data may be defined by other frequency domain parameters, which is not described in detail in the present application.
The first sampling operation of the present application is described below.
Illustratively, a first sound speed is determined from the first air density, and a first reference sound speed is determined from the first reference air density; then, sampling the frequency domain data A according to the first sound velocity and the first reference sound velocity to obtain sampled frequency domain data A, wherein the ratio between the number of frequency points of the sampled frequency domain data A and the number of frequency points of the frequency domain data A is the ratio between the first reference sound velocity and the first sound velocity. Illustratively, the ratio between the number of frequency points of the sampled frequency domain data a and the number of frequency points of the frequency domain data a may be represented by formula (5):
$$\frac{M}{N} = \frac{C_0}{C_1} \qquad (5)$$

where M is the number of frequency points in the sampled frequency domain data A, N is the number of frequency points in the frequency domain data A, C0 is the first reference sound velocity, and C1 is the first sound velocity.
Therefore, in the case where the frequency domain data a is the first frequency domain data, that is, in the case where the first frequency domain data is sampled, as shown in fig. 9, the first frequency domain data S1 (w) is subjected to the first sampling operation, so as to obtain the sampled first frequency domain data S1 (w), that is, the number of frequency points is changed from N frequency points to M frequency points. In the case where the frequency domain data a is the third frequency domain data S3 (w), as shown in fig. 10, the first sampling operation is performed on the third frequency domain data S3 (w) to obtain sampled third frequency domain data S3 (w), that is, the number of frequency points is changed from M to N.
Similar to the spreading above, the first sampling operation may also be performed directly according to the first air density and the first reference air density, in which case the ratio between the number of frequency points of the sampled frequency domain data A and the number of frequency points of the frequency domain data A may be represented by formula (6):

$$\frac{M}{N} = \sqrt{\frac{\rho_1}{\rho_0}} \qquad (6)$$

where M is the number of frequency points in the sampled frequency domain data A, N is the number of frequency points in the frequency domain data A, ρ0 is the first reference air density, and ρ1 is the first air density.
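A sketch of the first sampling operation under formula (6); the use of linear interpolation to place the M resampled frequency points is an assumption, since the embodiment only fixes the ratio between the numbers of frequency points:

```python
import numpy as np

def first_sampling(spec: np.ndarray, rho_1: float, rho_0: float) -> np.ndarray:
    """Resample the frequency domain data so that the number of frequency
    points follows formula (6): M / N = sqrt(rho_1 / rho_0)."""
    n = len(spec)
    m = int(round(n * np.sqrt(rho_1 / rho_0)))
    old_grid = np.linspace(0.0, 1.0, n)   # original bin positions (normalized)
    new_grid = np.linspace(0.0, 1.0, m)   # resampled bin positions
    return np.interp(new_grid, old_grid, spec)
```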
The first frequency domain shaping operation of the present application is described below.
First, the frequency domain data B is digitized to obtain digitized frequency domain data B. For example, if the value of a frequency point a in the frequency domain data B is not 0, the value of the frequency point a in the digitized frequency domain data B is 1; if the value of the frequency point a in the frequency domain data B is 0, the value of the frequency point a in the digitized frequency domain data B is 0. The value of a frequency point a in the frequency domain data B is essentially the value of the ordinate corresponding to the frequency point a: for example, when the frequency domain data B is an amplitude spectrogram, the value of the frequency point a is the amplitude corresponding to the frequency point a; when the frequency domain data B is an energy spectrum, the value of the frequency point a is the energy corresponding to the frequency point a; and so on.
The frequency domain data B may be first frequency domain data or third frequency domain data, for example. As shown in fig. 11, when the frequency domain data is the first frequency domain data S1 (w), the frequency domain data B is digitized, and the digitized frequency domain data B, that is, S'1 (w) is obtained; as shown in fig. 12, when the frequency domain data is the third frequency domain data S3 (w), the frequency domain data B is digitized, and the digitized frequency domain data B, that is, S'3 (w) is obtained.
Further, after the frequency domain data B is digitized, the digitized frequency domain data B and the frequency domain data C are subjected to mathematical operation processing in order of frequency from small to large to obtain the second frequency domain data; that is, the digitized frequency domain data B is multiplied by the value of the frequency point at the corresponding position in the frequency domain data C to obtain the second frequency domain data.
Specifically, if the frequency domain data B is the first frequency domain data, the frequency domain data C is the sampled first frequency domain data. Since the digitized frequency domain data B includes N frequency points and the sampled first frequency domain data includes M frequency points (where M is greater than N), the product processing cannot be performed directly on the digitized frequency domain data B and the sampled first frequency domain data. Therefore, as shown in fig. 13, N frequency points (i.e. the first N frequency points) are first cut out from the sampled first frequency domain data in order of frequency, and the product processing is then performed between these N frequency points and the N frequency points in the digitized first frequency domain data, i.e. the values of the corresponding frequency points are multiplied, to obtain the second frequency domain data. Therefore, the number of frequency points of the second frequency domain data is the same as that of the digitized first frequency domain data and as the number of frequency points cut out from the sampled first frequency domain data; the frequencies of the frequency points in the second frequency domain data correspond one-to-one (are identical) to those of the frequency points in the digitized first frequency domain data, and the value of each frequency point is the product of the values of the corresponding frequency points. In other words, the frequency interval between adjacent frequency points in the second frequency domain data is the same as that between adjacent frequency points in the digitized first frequency domain data, i.e. the sampling intervals of the two frequency domain data are the same. For example, the value of the first frequency point in the digitized first frequency domain data and the value of the first frequency point in the sampled first frequency domain data are multiplied, the resulting value is used as the value of the first frequency point in the second frequency domain data, and the frequency of the first frequency point in the second frequency domain data is the same as the frequency of the first frequency point in the digitized first frequency domain data.
Specifically, if the frequency domain data B is the third frequency domain data, the frequency domain data C is the sampled third frequency domain data. Since the digitized frequency domain data B includes M frequency points and the sampled third frequency domain data includes N frequency points, the product processing cannot be performed directly on the digitized third frequency domain data and the sampled third frequency domain data. Therefore, as shown in fig. 14, N frequency points (i.e. the first N frequency points) are first cut out from the digitized third frequency domain data in order of frequency, and the product processing is then performed between these N frequency points and the N frequency points in the sampled third frequency domain data, i.e. the values of the corresponding frequency points are multiplied, to obtain the second frequency domain data. Therefore, the number of frequency points of the second frequency domain data is the same as that of the sampled third frequency domain data and as the number of frequency points cut out from the digitized third frequency domain data; the frequencies of the frequency points in the second frequency domain data correspond one-to-one (are identical) to those of the cut-out frequency points in the digitized third frequency domain data, and the value of each frequency point is the product of the values of the corresponding frequency points. In other words, the frequency interval between adjacent frequency points in the second frequency domain data is the same as that between adjacent frequency points in the digitized third frequency domain data, i.e. the sampling intervals of the two frequency domain data are the same. For example, the value of the first frequency point in the digitized third frequency domain data and the value of the first frequency point in the sampled third frequency domain data are multiplied, the resulting value is used as the value of the first frequency point in the second frequency domain data, and the frequency of the first frequency point in the second frequency domain data is the same as the frequency of the first frequency point in the digitized third frequency domain data.
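A compact sketch of the first frequency domain shaping operation covering both cases above: the spectrum playing the role of frequency domain data B is digitized into a 0/1 mask, the longer of the two spectra is cut to the common number of frequency points, and the values are multiplied point by point from low to high frequency (names are illustrative assumptions):

```python
import numpy as np

def frequency_domain_shaping(spec_b: np.ndarray, spec_c: np.ndarray) -> np.ndarray:
    """spec_b: frequency domain data B (first or third frequency domain data);
    spec_c: frequency domain data C (the corresponding sampled data)."""
    mask_b = np.where(spec_b != 0, 1.0, 0.0)   # digitized frequency domain data B
    n = min(len(mask_b), len(spec_c))          # keep only the first (lowest-frequency) points
    return mask_b[:n] * spec_c[:n]             # product of corresponding frequency points
```

The energy shaping described below may then be applied to the result.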
It can be seen that, in the case where the first air density is greater than the first reference air density, as shown in fig. 15, the solid line represents the spectrogram of the first audio data collected by the user at the first air density, and the dotted line represents the spectrogram of the audio data that would be collected by the same user at the first reference air density, which is equivalent to the spectrogram of the second audio data obtained after feature compensation, as indicated by formula (1). The formants of the first frequency domain data are therefore shifted in the low frequency direction relative to the frequency domain data acquired at the first reference air density. Because the first audio recognition model is obtained by training with audio data samples collected at the first reference air density, it can accurately recognize audio data collected at the first reference air density; feature compensation of the first audio data into the second audio data is therefore necessary. As can be seen from fig. 13, the N frequency points cut out from the sampled first frequency domain data and the digitized first frequency domain data are subjected to mathematical operation processing; since the sampling interval of the digitized first frequency domain data is greater than that of the sampled first frequency domain data, after the operation the N frequency points cut out from the sampled first frequency domain data are moved in the high frequency direction, so that the formants are shifted in the high frequency direction, that is, the formants of the first frequency domain data are pulled back to their normal positions, where the normal positions are the positions of the formants of audio data collected at the first reference air density. The first audio data is thereby transformed to be equivalent to audio data collected at the first reference air density, so that the environmental factors are unified to the same environment and the accuracy of subsequent audio recognition is improved.
It can be seen that, in the case where the first air density is smaller than the first reference air density, as shown in fig. 16, the solid line represents the spectrogram of the first audio data collected by the user at the first air density, and the dotted line represents the spectrogram of the audio data that would be collected by the same user at the first reference air density, which is equivalent to the spectrogram of the second audio data obtained after feature compensation, as indicated by formula (1). The formants of the first frequency domain data are therefore shifted in the high frequency direction relative to the frequency domain data acquired at the first reference air density. Because the first audio recognition model is obtained by training with audio data samples collected at the first reference air density, it can accurately recognize audio data collected at the first reference air density, and feature compensation of the first audio data into the second audio data is therefore required. As can be seen from fig. 14, the N frequency points cut out from the digitized third frequency domain data and the sampled third frequency domain data are subjected to mathematical operation processing; since the sampling interval of the digitized third frequency domain data is smaller than that of the sampled third frequency domain data, after the operation the N frequency points in the sampled third frequency domain data are moved in the low frequency direction, so that the formants are shifted in the low frequency direction, that is, the formants of the first frequency domain data are pulled back to their normal positions, where the normal positions are the positions of the formants of audio data collected at the first reference air density. The first audio data is thereby transformed to be equivalent to audio data collected at the first reference air density, so that the environmental factors are unified to the same environment and the accuracy of subsequent audio recognition is improved.
In one embodiment of the present application, the frequency domain data B and the frequency domain data C after the digitizing process may be subjected to mathematical operation according to the order from the smaller frequency to the larger frequency, so as to obtain fourth frequency domain data; and then, carrying out energy shaping on the fourth frequency domain data to obtain the second frequency domain data, wherein the energy sum corresponding to the second frequency domain data is the same as the energy sum corresponding to the first frequency domain data.
Illustratively, the energy sum of the first frequency domain data, i.e. the square sum of the amplitudes of the individual frequency points, is determined from the first frequency domain data; determining an energy sum of the fourth frequency domain data according to the fourth frequency domain data; determining a ratio between an energy sum of the first frequency domain data and an energy sum of the fourth frequency domain data; then, the amplitude of each frequency point in the fourth frequency domain data is multiplied by the arithmetic square root of the ratio to obtain the second frequency domain data.
It can be seen that the mathematical operation processing intercepts only part of the frequency points, which may cause the energy sum of the frequency domain data obtained by that processing to differ from the energy sum of the first frequency domain data. Energy shaping therefore makes the energy sum of the second frequency domain data equal to that of the first frequency domain data, so that the audio features of the first audio data are retained as far as possible while the first frequency domain data is transformed to the first reference air density, thereby further improving the accuracy of subsequent audio recognition.
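A minimal sketch of the energy shaping step, assuming the spectra are amplitude spectra so that the energy sum is the sum of squared amplitudes (names are illustrative):

```python
import numpy as np

def energy_shaping(spec_4: np.ndarray, spec_1: np.ndarray) -> np.ndarray:
    """Scale the fourth frequency domain data so that its energy sum equals
    the energy sum of the first frequency domain data."""
    e1 = float(np.sum(spec_1 ** 2))    # energy sum of the first frequency domain data
    e4 = float(np.sum(spec_4 ** 2))    # energy sum of the fourth frequency domain data
    if e4 == 0.0:
        return spec_4                  # avoid division by zero for an all-zero spectrum
    return spec_4 * np.sqrt(e1 / e4)   # multiply amplitudes by the arithmetic square root of the ratio
```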
In another embodiment of the present application, after the second audio data is obtained, audio recognition may be performed on the second audio data to obtain the audio recognition result corresponding to the first audio data. For example, the second audio data may be input to the trained first audio recognition model to obtain the audio recognition result. The first audio recognition model is obtained by training with audio data samples collected at the first reference air density; a method for performing audio recognition on audio data is described later and is not detailed here. The first audio recognition model can be obtained through supervised training on the collected audio data samples, which is likewise not described further.
It should be understood that the audio feature compensation method according to the embodiment of the present application may be performed by the execution device 110 shown in fig. 4, the first audio data may be input data given by the client device 140 shown in fig. 4, the preprocessing module 113 in the execution device 110 may be used to perform the audio feature compensation method, and the calculation module 111 in the execution device 110 may be used to perform the subsequent audio recognition method.
Alternatively, the above-mentioned audio feature compensation method may be processed by a CPU, or may be processed by both the CPU and the GPU, or may not use the GPU, and other suitable processors for neural network computation may be used, which is not limited by the present application.
Referring to fig. 17, fig. 17 is a flowchart of an audio recognition model training method according to an embodiment of the present application. The repetition of the present embodiment with the embodiment shown in fig. 6 is not repeated here. The method comprises the following steps:
1701: a plurality of original audio data samples is acquired.
The plurality of original audio data samples are audio data of a plurality of speakers. The plurality of original audio data samples may correspond one-to-one to the plurality of speakers, or may not (for example, one speaker may correspond to two original audio data samples); in the present application, the one-to-one correspondence is taken as an example for description, and the identities of the plurality of speakers may be understood as the labels of the plurality of original audio data samples. The plurality of original audio data samples may be acquired in the same environment or in a plurality of environments.
1702: and respectively carrying out feature compensation on each original audio data sample in the plurality of original audio data samples according to the second air density and the plurality of second reference air densities to obtain a plurality of extended audio data samples corresponding to each original audio data sample.
The second air density is the air density in the environment where each of the plurality of original audio data samples is collected, i.e. the second air density is related to the time and place where the original audio data sample is collected. Similar to the audio feature compensation method shown in fig. 6, in this embodiment, a plurality of second reference air densities may be used as standard densities, and the original audio data samples of each speaker collected under the second air densities may be respectively compensated to the plurality of second reference air densities, so as to obtain audio data samples of each speaker under the plurality of second reference air densities, and thus a plurality of expanded audio data samples may be obtained.
Illustratively, frequency domain transformation is performed on each original audio data sample to obtain fifth frequency domain data corresponding to that original audio data sample. For each original audio data sample, feature compensation is performed on the fifth frequency domain data according to each of the plurality of second reference air densities and the second air density, to obtain sixth frequency domain data corresponding to each second reference air density; after feature compensation has been performed for all of the plurality of second reference air densities, a plurality of sixth frequency domain data corresponding to each original audio data sample are obtained. In other words, from each original audio data sample collected at the second air density, a plurality of sixth frequency domain data of that sample at the plurality of second reference air densities are expanded, where each second reference air density corresponds to one sixth frequency domain data. Then, frequency domain inverse transformation is performed on each of the plurality of sixth frequency domain data corresponding to each original audio data sample, to obtain the plurality of extended audio data samples corresponding to that original audio data sample. The frequency domain transformation and frequency domain inverse transformation here are similar to those in the audio feature compensation method above and are not described again.
Specifically, in the case where the second air density is greater than a second reference air density A, a second sampling operation is performed on the fifth frequency domain data according to the second air density and the second reference air density A to obtain sampled fifth frequency domain data, and a second frequency domain shaping operation is performed on the fifth frequency domain data and the sampled fifth frequency domain data to obtain the sixth frequency domain data corresponding to the second reference air density A. Alternatively, in the case where the second air density is smaller than the second reference air density A, the fifth frequency domain data is spread in the high frequency direction according to the second air density and the second reference air density A to obtain seventh frequency domain data, where the process of spreading the fifth frequency domain data in the high frequency direction may refer to the process of spreading the first frequency domain data in the high frequency direction and is not described again; then, the second sampling operation is performed on the seventh frequency domain data according to the second air density and the second reference air density A to obtain sampled seventh frequency domain data, and the second frequency domain shaping operation is performed on the seventh frequency domain data and the sampled seventh frequency domain data to obtain the sixth frequency domain data corresponding to the second reference air density A. The second reference air density A is any one of the plurality of second reference air densities.
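The expansion of the original audio data samples to the plurality of second reference air densities can be sketched as follows, reusing a compensation routine like the one in the audio feature compensation method above (the data layout, the function names and the dictionary keyed by speaker identity are illustrative assumptions):

```python
def expand_training_samples(original_samples, rho_2, reference_densities, compensate_fn):
    """original_samples: mapping speaker_id -> original audio data sample collected at rho_2.
    reference_densities: the plurality of second reference air densities.
    compensate_fn(audio, rho_src, rho_ref): feature compensation of one sample
    from air density rho_src to reference air density rho_ref."""
    expanded = {}
    for speaker_id, audio in original_samples.items():
        # One extended audio data sample per second reference air density,
        # all carrying the same speaker label as the original sample.
        expanded[speaker_id] = [compensate_fn(audio, rho_2, rho_ref)
                                for rho_ref in reference_densities]
    return expanded
```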
It should be appreciated that the second sampling operation is similar to the first sampling operation, i.e. the second air density may be regarded as the first air density in the first sampling operation and the second reference air density A may be regarded as the first reference air density in the first sampling operation; the second sampling operation mainly includes the following:
For example, the frequency domain data D is sampled according to the second sound velocity and the second reference sound velocity A, where the second sound velocity is determined according to the second air density and the second reference sound velocity A is determined according to the second reference air density A; the process of sampling the frequency domain data D may refer to the process of sampling the frequency domain data A in the first sampling operation. Accordingly, the ratio between the number of frequency points of the sampled frequency domain data D and the number of frequency points of the frequency domain data D is the ratio between the second reference sound velocity A and the second sound velocity. When the frequency domain data D is the fifth frequency domain data, the sampled frequency domain data D is the sampled fifth frequency domain data; when the frequency domain data D is the seventh frequency domain data, the sampled frequency domain data D is the sampled seventh frequency domain data.
It should be appreciated that, similar to the first frequency domain shaping operation, the second air density may be considered as the first air density in the first frequency domain shaping operation and the second reference air density a may be considered as the first reference air density in the first frequency domain shaping operation, then the second frequency domain shaping operation mainly comprises the following:
First, the frequency domain data E is digitized to obtain digitized frequency domain data E. For example, if the value of a frequency point B in the frequency domain data E is not 0, the value of the frequency point B in the digitized frequency domain data E is 1; if the value of the frequency point B in the frequency domain data E is 0, the value of the frequency point B in the digitized frequency domain data E is 0, where the frequency point B is any frequency point in the frequency domain data E. Then, the digitized frequency domain data E and the frequency domain data F are subjected to mathematical operation processing in order of frequency from small to large to obtain the sixth frequency domain data corresponding to the second reference air density A; this mathematical operation processing is the same as that performed on the digitized frequency domain data B and the frequency domain data C in the first frequency domain shaping operation and is not described again. When the frequency domain data E is the fifth frequency domain data, the frequency domain data F is the sampled fifth frequency domain data, the number of frequency points of the sixth frequency domain data corresponding to the second reference air density A is the same as that of the fifth frequency domain data, and the frequency interval between adjacent frequency points in that sixth frequency domain data is the same as that between adjacent frequency points in the fifth frequency domain data. When the frequency domain data E is the seventh frequency domain data, the frequency domain data F is the sampled seventh frequency domain data, the number of frequency points of the sixth frequency domain data corresponding to the second reference air density A is the same as that of the sampled seventh frequency domain data, and the frequency interval between adjacent frequency points in that sixth frequency domain data is the same as that between adjacent frequency points in the seventh frequency domain data.
Similarly, after the mathematical operation processing is performed on the frequency domain data E and the frequency domain data F after the digitizing processing in order of the frequencies from the smaller frequency point to the larger frequency point, energy shaping may be performed. Illustratively, performing mathematical operation on the frequency domain data E and the frequency domain data F after the digitizing process to obtain eighth frequency domain data; and carrying out energy shaping on the eighth frequency domain data to obtain sixth frequency domain data corresponding to the second reference air density A, wherein the energy sum corresponding to the sixth frequency domain data is the same as the energy sum corresponding to the fifth frequency domain data. The process of energy shaping the eighth frequency domain data can be referred to as energy shaping the fourth frequency domain data, which will not be described.
1703: and constructing a training sample set according to the plurality of extended audio data samples corresponding to each original audio data sample and the plurality of original audio data samples.
Illustratively, a plurality of extended audio data samples corresponding to each original audio data sample are combined with the plurality of original audio data samples to obtain the training sample set.
It should be appreciated that, if the aim is only to construct a rich training sample set for practical applications, that is, to obtain audio data samples of each speaker at various air densities, step 1704 need not be performed.
1704: and determining a second audio recognition model which completes training according to the training sample set.
The training sample set is used for model training to obtain a second audio recognition model which is completed by training, namely, the training sample set is used for performing supervised model training by using the extended audio data sample and the original audio data sample to obtain the second audio recognition model which is completed by training.
The original audio data samples and the extended audio data samples may be the training data maintained in the database 130 as shown in fig. 4. Optionally, the training of the second audio recognition model may be performed in the training device 120, or may be performed in advance by other functional modules before the training device 120.
Optionally, the above audio recognition model training method may be processed by a CPU, or may be processed by both the CPU and the GPU, or may not use the GPU, and other processors suitable for neural network computation may be used, which is not limited in the present application.
It can be seen that in the embodiment of the present application, the original audio data samples of each speaker are first expanded to obtain a plurality of expanded audio data samples at a plurality of second reference air densities, so that each speaker has an audio data sample under each environment. Therefore, the second audio recognition model obtained through training can memorize the audio characteristics of each user under each air density by using the abundant audio data samples, and has higher robustness, so that no matter what environment the user is in subsequently, the identity information of the user can be accurately recognized through the second audio recognition model, the interference of environmental factors can be avoided, and the accuracy of audio recognition is improved.
Referring to fig. 18, fig. 18 is a flowchart of an audio recognition method according to an embodiment of the present application. The repetition of the present embodiment with the embodiment shown in fig. 6 and 17 is not described here. The method includes, but is not limited to, the steps of:
1801: first audio data is acquired.
1802: the first audio data is input into a second audio recognition model which is trained to carry out audio recognition, and an audio recognition result corresponding to the first audio data is obtained, wherein the second audio recognition model is determined according to a training sample set, the training sample set comprises a plurality of original audio data samples and a plurality of expanded audio data samples, the plurality of expanded audio data samples are obtained by respectively carrying out feature compensation on each original audio data sample in the plurality of original audio data samples according to a second air density and a plurality of second reference air densities, and the second air density is the air density of the environment where each original audio data sample is collected.
It can be seen that in the embodiment of the application, the second audio recognition model with higher robustness is used for audio recognition, so that the collected first audio data can be directly recognized without audio feature compensation, interference caused by environmental factors can be avoided, and the accuracy of audio recognition is improved.
Another audio recognition method according to an embodiment of the present application is described below with reference to fig. 19. The audio recognition method mainly includes the following steps:
step 1): the acquired speech signal is pre-emphasized; in the present application, the speech signal is the second audio data in the embodiment corresponding to fig. 6 or the first audio data in the embodiment corresponding to fig. 18. Pre-emphasis aims to eliminate the influence of mouth and nose radiation during pronunciation, boosting the high-frequency part of the speech through a high-pass filter;
step 2): and framing and windowing the pre-emphasized voice signal. Since the speech signal is stationary for a short time, the speech signal is divided into short periods of time by frame windowing, each short period of time being referred to as an audio frame. Meanwhile, in order to avoid the loss of dynamic information of the voice signal, a section of overlapping area is needed between adjacent frames;
step 3): performing Fast Fourier Transform (FFT) on each audio frame, namely transforming the time domain signal subjected to framing and windowing into a frequency domain through the FFT to obtain a frequency spectrum characteristic X (k) of each audio frame;
step 4): filtering the spectral feature X (k) of each audio frame by a mel filter bank to obtain energy for each audio frame;
Step 5): the energy of each audio frame is obtained and transformed to a logarithmic domain, so that a Mel frequency logarithmic energy spectrum S (m) of the voice signal is obtained;
step 6): performing Discrete Cosine Transform (DCT) on the S (m) to obtain a Mel Frequency Cepstrum Coefficient (MFCC);
step 7): the MFCC is treated as an audio feature of the speech signal, and audio recognition is performed based on the MFCC; a code sketch summarizing these steps follows.
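The steps 1) to 7) above can be sketched as follows; the frame length, hop size, FFT size, filter-bank construction and coefficient count are all illustrative assumptions, and the mel filter bank is built in a conventional but simplified way:

```python
import numpy as np

def mfcc(signal: np.ndarray, fs: int = 16000, frame_len: int = 400, hop: int = 160,
         nfft: int = 512, n_mels: int = 26, n_ceps: int = 13) -> np.ndarray:
    """Simplified MFCC extraction following steps 1) to 7)."""
    # 1) pre-emphasis: boost the high-frequency part with a first-order high-pass filter.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing and windowing; adjacent frames overlap by frame_len - hop samples.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3) FFT of every audio frame -> spectral feature X(k).
    power = np.abs(np.fft.rfft(frames, n=nfft)) ** 2
    # 4) mel filter bank (triangular filters) applied to each frame.
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)
    hz_pts = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((nfft + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = power @ fbank.T
    # 5) transform the per-band energies to the logarithmic domain -> S(m).
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 6) DCT of S(m) -> Mel Frequency Cepstrum Coefficients (MFCC).
    band = np.arange(n_mels)
    dct_basis = np.cos(np.pi * (band[:, None] + 0.5) * np.arange(n_ceps)[None, :] / n_mels)
    # 7) the resulting coefficients are used as the audio feature for recognition.
    return log_energies @ dct_basis
```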
Referring to fig. 20, fig. 20 is a schematic structural diagram of an audio feature compensation device according to an embodiment of the application. As shown in fig. 20, the audio feature compensation apparatus 2000 includes an acquisition unit 2001 and a processing unit 2002;
an acquisition unit 2001 for acquiring first audio data;
the processing unit 2002 is configured to perform feature compensation on the first audio data according to a first air density and a first reference air density, so as to obtain second audio data, where the first air density is an air density in an environment where the first audio data is collected.
For a more detailed description of the above-described acquisition unit 2001 and processing unit 2002, reference may be made to the relevant description in the above method embodiments, which is not repeated here.
Referring to fig. 21, fig. 21 is a schematic structural diagram of an audio recognition device according to an embodiment of the application. As shown in fig. 21, the audio recognition apparatus 2100 includes an acquisition unit 2101 and a processing unit 2102;
An acquisition unit 2101 for acquiring first audio data;
the processing unit 2102 is configured to input the first audio data into a second audio recognition model after training to perform audio recognition, and obtain an audio recognition result corresponding to the first audio data, where the second audio recognition model is obtained by training a training sample set, the training sample set includes a plurality of original audio data samples and a plurality of extended audio data samples, the plurality of extended audio data samples are obtained by performing feature compensation on each of the plurality of original audio data samples according to a second air density and a plurality of second reference air densities, and the second air density is an air density in an environment where each of the plurality of original audio data samples is collected.
For a more detailed description of the above-described acquisition unit 2101 and processing unit 2102, reference may be made to the relevant description in the above method embodiments, which is not repeated here.
Fig. 22 is a schematic hardware structure of an electronic device according to an embodiment of the present application. The electronic apparatus 2200 shown in fig. 22 (the electronic apparatus 2200 may be a computer device in particular) includes a memory 2201, a processor 2202, a communication interface 2203, and a bus 2204. The memory 2201, the processor 2202, and the communication interface 2203 are communicatively connected to each other via a bus 2204.
The Memory 2201 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access Memory (Random Access Memory, RAM). The memory 2201 may store a program that, when executed by the processor 2202, the processor 2202 and the communication interface 2203 are configured to perform respective steps in an audio feature compensation method or an audio recognition model training method or an audio recognition method of an embodiment of the present application.
The processor 2202 may employ a general-purpose central processing unit (Central Processing Unit, CPU), microprocessor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to perform the functions required by the units in the audio feature compensation apparatus or audio recognition apparatus of the present application or to perform the audio feature compensation method or audio recognition model training method or audio recognition method of the present application.
The processor 2202 may also be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the audio feature compensation method, the audio recognition model training method or the audio recognition method of the present application may be completed by integrated logic circuits of hardware in the processor 2202 or by instructions in the form of software. The processor 2202 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory or a register. The storage medium is located in the memory 2201, and the processor 2202 reads the information in the memory 2201 and, in combination with its hardware, performs the functions required to be performed by the units included in the audio feature compensation apparatus or the audio recognition apparatus according to the embodiment of the present application, or performs the steps of the audio feature compensation method, the audio recognition model training method or the audio recognition method according to the embodiment of the present application.
The communication interface 2203 enables communication between the electronic device 2200 and other equipment or a communication network using a transceiver device such as, but not limited to, a transceiver. For example, the first audio data may be acquired through the communication interface 2203.
The bus 2204 may include a path for transferring information between the components of the electronic device 2200 (e.g., the memory 2201, the processor 2202, and the communication interface 2203).
It is to be understood that the acquisition unit 2001 in the audio feature compensation apparatus 2000 or the acquisition unit 2101 in the audio recognition apparatus 2100 corresponds to the communication interface 2203 in the electronic apparatus 2200, and the processing unit 2002 in the audio feature compensation apparatus 2000 or the processing unit 2102 in the audio recognition apparatus 2100 may correspond to the processor 2202.
It should be noted that while the electronic device 2200 shown in fig. 22 shows only a memory, a processor, and a communication interface, those skilled in the art will appreciate that in a particular implementation, the electronic device 2200 also includes other components necessary to achieve proper operation. Also, as will be appreciated by those skilled in the art, the electronic device 2200 may also include hardware components that perform other additional functions, as desired. Furthermore, it will be appreciated by those skilled in the art that the electronic device 2200 may also include only the necessary components to implement embodiments of the present application, and not necessarily all of the components shown in FIG. 22.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific description of the present application, and the protection scope of the present application is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive of shall fall within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (23)
- 1. A method of audio feature compensation, comprising:
  acquiring first audio data; and
  performing feature compensation on the first audio data according to a first air density and a first reference air density to obtain second audio data, wherein the first air density is the air density in the environment where the first audio data are collected.
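As an illustration (not part of the claims): the later claims determine sound speeds from the claimed air densities without fixing a formula. A minimal sketch of one such determination is given below, assuming the ideal-gas relation c = sqrt(gamma * P / rho); the pressure and ratio of specific heats used here are assumed values.

```python
import math

def sound_speed_from_air_density(rho, pressure=101325.0, gamma=1.4):
    """Illustrative sound speed (m/s) for air of density rho (kg/m^3).

    Assumes the ideal-gas relation c = sqrt(gamma * P / rho); the claims
    only state that a sound speed is determined from an air density.
    """
    return math.sqrt(gamma * pressure / rho)

# Example: reference air vs. a denser environment (lower sound speed).
c_ref = sound_speed_from_air_density(1.225)   # about 340 m/s
c_env = sound_speed_from_air_density(1.400)   # about 318 m/s
```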
- 2. The method of claim 1, wherein performing feature compensation on the first audio data according to the first air density and the first reference air density to obtain second audio data comprises:
  performing frequency domain transformation on the first audio data to obtain first frequency domain data;
  performing feature compensation on the first frequency domain data according to the first air density and the first reference air density to obtain second frequency domain data; and
  performing frequency domain inverse transformation on the second frequency domain data to obtain the second audio data.
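As an illustration (not part of the claims): a minimal sketch of the transform, compensate, inverse-transform pipeline of claim 2, assuming a real FFT as the frequency domain transformation (the claim does not name a specific transform) and a placeholder compensate_freq callable standing in for the feature compensation of the later claims.

```python
import numpy as np

def compensate_audio(first_audio, compensate_freq):
    """Sketch of claim 2: time domain -> frequency domain -> compensate -> time domain.

    compensate_freq stands in for the feature compensation of claims 3-8;
    the real FFT is an assumed choice of frequency domain transformation.
    """
    first_freq = np.fft.rfft(first_audio)              # first frequency domain data
    second_freq = compensate_freq(first_freq)          # second frequency domain data
    second_audio = np.fft.irfft(second_freq, n=len(first_audio))
    return second_audio
```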
- 3. The method of claim 2, wherein performing feature compensation on the first frequency domain data based on the first air density and the first reference air density to obtain second frequency domain data comprises:
  under the condition that the first air density is larger than the first reference air density, performing a first sampling operation on the first frequency domain data according to the first air density and the first reference air density to obtain sampled first frequency domain data; and
  performing a first frequency domain shaping operation on the first frequency domain data and the sampled first frequency domain data to obtain the second frequency domain data, wherein the number of frequency points of the second frequency domain data is the same as that of the first frequency domain data, and the frequency interval between adjacent frequency points in the second frequency domain data is the same as that between adjacent frequency points in the first frequency domain data.
- 4. The method according to claim 2 or 3, wherein performing feature compensation on the first frequency domain data based on the first air density and the first reference air density to obtain second frequency domain data comprises:
  under the condition that the first air density is smaller than the first reference air density, spreading the first frequency domain data in a high-frequency direction according to the first air density and the first reference air density to obtain third frequency domain data;
  performing the first sampling operation on the third frequency domain data according to the first air density and the first reference air density to obtain sampled third frequency domain data; and
  performing the first frequency domain shaping operation on the third frequency domain data and the sampled third frequency domain data to obtain the second frequency domain data, wherein the number of frequency points of the second frequency domain data is the same as that of the sampled third frequency domain data, and the frequency interval between adjacent frequency points in the second frequency domain data is the same as that between adjacent frequency points in the third frequency domain data.
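As an illustration (not part of the claims): claims 3 and 4 select between two processing branches according to how the first air density compares with the first reference air density. The sketch below shows only that control flow; the helper functions spread_high_freq, first_sampling, and first_shaping are the illustrative sketches given after claims 5, 6, and 7 below, and the ideal-gas sound-speed relation is again an assumption.

```python
import math

def compensate_freq_data(first_freq, rho_env, rho_ref, pressure=101325.0, gamma=1.4):
    """Branch structure of claims 3 and 4 (helpers sketched after claims 5-7)."""
    c_env = math.sqrt(gamma * pressure / rho_env)   # first sound speed (assumed relation)
    c_ref = math.sqrt(gamma * pressure / rho_ref)   # first reference sound speed
    if rho_env > rho_ref:
        # Claim 3: air denser than the reference.
        sampled = first_sampling(first_freq, c_env, c_ref)
        return first_shaping(first_freq, sampled)
    if rho_env < rho_ref:
        # Claim 4: air thinner than the reference; spread toward high frequencies first.
        third_freq = spread_high_freq(first_freq, c_env, c_ref)
        sampled = first_sampling(third_freq, c_env, c_ref)
        return first_shaping(third_freq, sampled)
    return first_freq  # equal densities: no compensation needed
```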
- 5. The method of claim 4, wherein spreading the first frequency domain data in a high-frequency direction based on the first air density and the first reference air density to obtain third frequency domain data comprises:
  spreading the first frequency domain data in the high-frequency direction according to a first sound speed and a first reference sound speed to obtain the third frequency domain data, wherein the ratio between the number of frequency points of the third frequency domain data and the number of frequency points of the first frequency domain data is the ratio between the first sound speed and the first reference sound speed, the first sound speed is determined according to the first air density, and the first reference sound speed is determined according to the first reference air density.
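As an illustration (not part of the claims): a hedged sketch of the high-frequency spreading of claim 5. The claim fixes only the ratio of frequency point counts; padding the new high-frequency bins with zeros is an assumed choice, and the description also mentions extrapolation as an alternative.

```python
import numpy as np

def spread_high_freq(freq_data, c_env, c_ref):
    """Sketch of claim 5: spread the spectrum toward the high-frequency direction.

    The output length is round(N * c_env / c_ref), so the ratio of frequency
    point counts equals the first sound speed over the first reference sound
    speed. Zero-padding the added bins is an assumption; extrapolating the
    existing spectrum is another possibility.
    """
    freq_data = np.asarray(freq_data)
    n_out = int(round(len(freq_data) * c_env / c_ref))
    out = np.zeros(n_out, dtype=freq_data.dtype)
    n_copy = min(len(freq_data), n_out)
    out[:n_copy] = freq_data[:n_copy]
    return out
```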
- 6. The method of any one of claims 3-5, wherein the first sampling operation comprises:
  sampling frequency domain data A according to the first sound speed and the first reference sound speed to obtain sampled frequency domain data A, wherein the first sound speed is determined according to the first air density, and the first reference sound speed is determined according to the first reference air density;
  the ratio between the number of frequency points of the sampled frequency domain data A and the number of frequency points of the frequency domain data A is the ratio between the first reference sound speed and the first sound speed; and
  when the frequency domain data A is the first frequency domain data, the sampled frequency domain data A is the sampled first frequency domain data, and when the frequency domain data A is the third frequency domain data, the sampled frequency domain data A is the sampled third frequency domain data.
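As an illustration (not part of the claims): a hedged sketch of the first sampling operation of claim 6. The claim fixes only the ratio of frequency point counts; linear interpolation of the real and imaginary parts of the bins is an assumed way to resample the spectrum.

```python
import numpy as np

def first_sampling(freq_data, c_env, c_ref):
    """Sketch of claim 6: resample frequency domain data A to round(N * c_ref / c_env) points.

    Linear interpolation is an assumed realization; the claim only constrains
    the ratio between the sampled and original numbers of frequency points.
    """
    freq_data = np.asarray(freq_data, dtype=complex)
    n_in = len(freq_data)
    n_out = int(round(n_in * c_ref / c_env))
    x_in = np.arange(n_in)
    x_out = np.linspace(0.0, n_in - 1, n_out)
    return np.interp(x_out, x_in, freq_data.real) + 1j * np.interp(x_out, x_in, freq_data.imag)
```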
- 7. The method of any one of claims 3-6, wherein the first frequency domain shaping operation comprises:
  performing digitization processing on frequency domain data B to obtain digitized frequency domain data B, wherein if the value corresponding to a frequency point A in the frequency domain data B is not 0, the value corresponding to the frequency point A in the digitized frequency domain data B is 1, and if the value corresponding to the frequency point A in the frequency domain data B is 0, the value corresponding to the frequency point A in the digitized frequency domain data B is 0, the frequency point A being any frequency point in the frequency domain data B; and
  performing mathematical operation processing on the digitized frequency domain data B and frequency domain data C in order of frequency points from low frequency to high frequency to obtain the second frequency domain data;
  wherein, when the frequency domain data B is the first frequency domain data, the frequency domain data C is the sampled first frequency domain data; and when the frequency domain data B is the third frequency domain data, the frequency domain data C is the sampled third frequency domain data.
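As an illustration (not part of the claims): a hedged sketch of the first frequency domain shaping operation of claim 7. The digitization step follows the claim directly; the claim does not specify the subsequent mathematical operation, so elementwise multiplication of the 0/1 mask with frequency domain data C, bin by bin from low to high frequency, is one plausible reading only.

```python
import numpy as np

def first_shaping(freq_b, freq_c):
    """Sketch of claim 7: digitize frequency domain data B, then combine with C.

    Nonzero bins of B map to 1 and zero bins map to 0 (the claimed
    digitization). Combining the mask with C by elementwise multiplication
    over the shorter of the two spectra is an assumed interpretation of the
    unspecified mathematical operation processing.
    """
    freq_b = np.asarray(freq_b)
    freq_c = np.asarray(freq_c)
    mask = (freq_b != 0).astype(float)     # digitized frequency domain data B
    n = min(len(mask), len(freq_c))
    return mask[:n] * freq_c[:n]           # low-to-high frequency order
```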
- 8. The method of claim 7, wherein performing mathematical operation processing on the digitized frequency domain data B and the frequency domain data C in order of frequency points from low frequency to high frequency to obtain the second frequency domain data comprises:
  performing mathematical operation processing on the digitized frequency domain data B and the frequency domain data C in order of frequency points from low frequency to high frequency to obtain fourth frequency domain data; and
  performing energy shaping on the fourth frequency domain data to obtain the second frequency domain data, wherein the energy sum corresponding to the second frequency domain data is the same as the energy sum corresponding to the first frequency domain data.
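As an illustration (not part of the claims): a hedged sketch of the energy shaping of claim 8, assuming "energy sum" means the sum of squared bin magnitudes; the fourth frequency domain data is rescaled so that its total energy matches that of the first frequency domain data.

```python
import numpy as np

def energy_shaping(fourth_freq, first_freq):
    """Sketch of claim 8: rescale so the energy sum matches the first frequency domain data.

    Energy is taken as the sum of squared magnitudes, which is an assumed
    reading of "energy sum" in the claim.
    """
    fourth_freq = np.asarray(fourth_freq)
    e_target = float(np.sum(np.abs(np.asarray(first_freq)) ** 2))
    e_fourth = float(np.sum(np.abs(fourth_freq) ** 2))
    if e_fourth == 0.0:
        return fourth_freq
    return fourth_freq * np.sqrt(e_target / e_fourth)
```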
- 9. The method according to any one of claims 1-8, further comprising:
  performing audio recognition on the second audio data to obtain an audio recognition result corresponding to the first audio data.
- 10. The method of claim 9, wherein performing audio recognition on the second audio data to obtain an audio recognition result corresponding to the first audio data comprises:
  inputting the second audio data into a first audio recognition model that has completed training to perform audio recognition and obtain the audio recognition result corresponding to the first audio data, wherein the first reference air density is the air density at which the audio data samples in a training sample are collected, and the training sample is used for training the first audio recognition model.
- 11. An audio recognition method, comprising:
  acquiring first audio data; and
  inputting the first audio data into a second audio recognition model that has completed training to perform audio recognition and obtain an audio recognition result corresponding to the first audio data, wherein the second audio recognition model is determined according to a training sample set, the training sample set comprises a plurality of original audio data samples and a plurality of extended audio data samples, the plurality of extended audio data samples are obtained by respectively performing feature compensation on each original audio data sample in the plurality of original audio data samples according to a second air density and a plurality of second reference air densities, and the second air density is the air density in the environment where each original audio data sample is collected.
- 12. The method of claim 11, wherein before the acquiring of the first audio data, the method further comprises:
  acquiring the plurality of original audio data samples;
  respectively performing feature compensation on each original audio data sample in the plurality of original audio data samples according to the second air density and the plurality of second reference air densities to obtain a plurality of extended audio data samples corresponding to each original audio data sample;
  constructing the training sample set according to the plurality of extended audio data samples corresponding to each original audio data sample and the plurality of original audio data samples; and
  determining the trained second audio recognition model according to the training sample set.
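As an illustration (not part of the claims): a hedged sketch of the training set construction of claim 12. Each original sample is feature-compensated once per second reference air density, and the extended samples are pooled with the originals; the compensate callable is a placeholder for the feature compensation of claims 13-19 (or the sketch after claim 2), and model training itself is not shown.

```python
def build_training_set(original_samples, rho_env, reference_densities, compensate):
    """Sketch of claim 12: pool original samples with their extended copies.

    compensate(sample, rho_env, rho_ref) is a placeholder for the feature
    compensation of claims 13-19; rho_env plays the role of the second air
    density and reference_densities the plurality of second reference air
    densities.
    """
    training_set = list(original_samples)                # original audio data samples
    for sample in original_samples:
        for rho_ref in reference_densities:              # one extended sample per reference density
            training_set.append(compensate(sample, rho_env, rho_ref))
    return training_set
```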
- 13. The method of claim 12, wherein respectively performing feature compensation on each original audio data sample in the plurality of original audio data samples according to the second air density and the plurality of second reference air densities to obtain a plurality of extended audio data samples corresponding to each original audio data sample comprises:
  performing frequency domain transformation on each original audio data sample to obtain fifth frequency domain data corresponding to each original audio data sample;
  for each original audio data sample, performing feature compensation on the fifth frequency domain data according to each second reference air density in the plurality of second reference air densities and the second air density to obtain sixth frequency domain data corresponding to each second reference air density, and obtaining a plurality of sixth frequency domain data corresponding to each original audio data sample according to the sixth frequency domain data corresponding to each second reference air density, wherein the plurality of sixth frequency domain data are in one-to-one correspondence with the plurality of second reference air densities; and
  respectively performing frequency domain inverse transformation on each sixth frequency domain data in the plurality of sixth frequency domain data corresponding to each original audio data sample to obtain the plurality of extended audio data samples corresponding to each original audio data sample.
- 14. The method of claim 13, wherein performing feature compensation on the fifth frequency domain data according to each second reference air density in the plurality of second reference air densities and the second air density to obtain sixth frequency domain data corresponding to each second reference air density comprises:
  under the condition that the second air density is larger than a second reference air density A, performing a second sampling operation on the fifth frequency domain data according to the second air density and the second reference air density A to obtain sampled fifth frequency domain data; and
  performing a second frequency domain shaping operation on the fifth frequency domain data and the sampled fifth frequency domain data to obtain sixth frequency domain data corresponding to the second reference air density A, wherein the number of frequency points of the sixth frequency domain data corresponding to the second reference air density A is the same as that of the fifth frequency domain data, and the frequency interval between adjacent frequency points in the sixth frequency domain data corresponding to the second reference air density A is the same as that between adjacent frequency points in the fifth frequency domain data;
  wherein the second reference air density A is any one of the plurality of second reference air densities.
- 15. The method according to claim 13 or 14, wherein performing feature compensation on the fifth frequency domain data according to each second reference air density in the plurality of second reference air densities and the second air density to obtain sixth frequency domain data corresponding to each second reference air density comprises:
  under the condition that the second air density is smaller than the second reference air density A, spreading the fifth frequency domain data in the high-frequency direction according to the second air density and the second reference air density A to obtain seventh frequency domain data;
  performing the second sampling operation on the seventh frequency domain data according to the second air density and the second reference air density A to obtain sampled seventh frequency domain data; and
  performing the second frequency domain shaping operation on the seventh frequency domain data and the sampled seventh frequency domain data to obtain the sixth frequency domain data corresponding to the second reference air density A, wherein the number of frequency points of the sixth frequency domain data corresponding to the second reference air density A is the same as that of the sampled seventh frequency domain data, and the frequency interval between adjacent frequency points in the sixth frequency domain data corresponding to the second reference air density A is the same as that between adjacent frequency points in the seventh frequency domain data;
  wherein the second reference air density A is any one of the plurality of second reference air densities.
- 16. The method of claim 15, wherein spreading the fifth frequency domain data in the high-frequency direction according to the second air density and the second reference air density A to obtain seventh frequency domain data comprises:
  spreading the fifth frequency domain data in the high-frequency direction according to a second sound speed and a second reference sound speed A to obtain the seventh frequency domain data, wherein the ratio between the number of frequency points of the seventh frequency domain data and the number of frequency points of the fifth frequency domain data is the ratio between the second reference sound speed A and the second sound speed, the second sound speed is determined according to the second air density, and the second reference sound speed A is determined according to the second reference air density A.
- 17. The method according to any one of claims 14-16, wherein the second sampling operation comprises:
  sampling frequency domain data D according to the second sound speed and the second reference sound speed A to obtain sampled frequency domain data D, wherein the second sound speed is determined according to the second air density, and the second reference sound speed A is determined according to the second reference air density A;
  the ratio between the number of frequency points of the sampled frequency domain data D and the number of frequency points of the frequency domain data D is the ratio between the second sound speed and the second reference sound speed A; and
  when the frequency domain data D is the fifth frequency domain data, the sampled frequency domain data D is the sampled fifth frequency domain data, and when the frequency domain data D is the seventh frequency domain data, the sampled frequency domain data D is the sampled seventh frequency domain data.
- 18. The method according to any one of claims 14-17, wherein the second frequency domain shaping operation comprises:
  performing digitization processing on frequency domain data E to obtain digitized frequency domain data E, wherein if the value corresponding to a frequency point B in the frequency domain data E is not 0, the value corresponding to the frequency point B in the digitized frequency domain data E is 1, and if the value corresponding to the frequency point B in the frequency domain data E is 0, the value corresponding to the frequency point B in the digitized frequency domain data E is 0, the frequency point B being any frequency point in the frequency domain data E; and
  performing mathematical operation processing on the digitized frequency domain data E and frequency domain data F in order of frequency points from low frequency to high frequency to obtain sixth frequency domain data corresponding to the second reference air density A;
  wherein, when the frequency domain data E is the fifth frequency domain data, the frequency domain data F is the sampled fifth frequency domain data; and when the frequency domain data E is the seventh frequency domain data, the frequency domain data F is the sampled seventh frequency domain data.
- 19. The method of claim 18, wherein performing mathematical operation processing on the digitized frequency domain data E and the frequency domain data F in order of frequency points from low frequency to high frequency to obtain sixth frequency domain data corresponding to the second reference air density A comprises:
  performing mathematical operation processing on the digitized frequency domain data E and the frequency domain data F in order of frequency points from low frequency to high frequency to obtain eighth frequency domain data; and
  performing energy shaping on the eighth frequency domain data to obtain the sixth frequency domain data corresponding to the second reference air density A, wherein the energy sum corresponding to the sixth frequency domain data is the same as the energy sum corresponding to the fifth frequency domain data.
- 20. An audio feature compensation apparatus, comprising means for performing the method of any one of claims 1-10.
- 21. An audio recognition device, comprising means for performing the method of any one of claims 11-19.
- 22. An electronic device, comprising:
  a memory for storing a program; and
  a processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method of any one of claims 1-19.
- 23. A computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the method of any one of claims 1-19.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/084787 WO2022205249A1 (en) | 2021-03-31 | 2021-03-31 | Audio feature compensation method, audio recognition method, and related product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116982111A true CN116982111A (en) | 2023-10-31 |
Family
ID=83457763
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202180095675.7A Pending CN116982111A (en) | 2021-03-31 | 2021-03-31 | Audio characteristic compensation method, audio identification method and related products |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116982111A (en) |
WO (1) | WO2022205249A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117238299B (en) * | 2023-11-14 | 2024-01-30 | 国网山东省电力公司电力科学研究院 | Method, system, medium and equipment for optimizing bird voice recognition model of power transmission line |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100694879B1 (en) * | 2006-11-23 | 2007-03-14 | 부산대학교 산학협력단 | Noise compensation method using simultaneous estimation of eigen environment and bias compensation vector |
CN107527624B (en) * | 2017-07-17 | 2021-03-09 | 北京捷通华声科技股份有限公司 | Voiceprint recognition method and device |
CN109302660B (en) * | 2017-07-24 | 2020-04-14 | 华为技术有限公司 | Audio signal compensation method, device and system |
CN108257606A (en) * | 2018-01-15 | 2018-07-06 | 江南大学 | A kind of robust speech personal identification method based on the combination of self-adaptive parallel model |
CN111261183B (en) * | 2018-12-03 | 2022-11-22 | 珠海格力电器股份有限公司 | Method and device for denoising voice |
CN111489763B (en) * | 2020-04-13 | 2023-06-20 | 武汉大学 | GMM model-based speaker recognition self-adaption method in complex environment |
- 2021
- 2021-03-31 CN CN202180095675.7A patent/CN116982111A/en active Pending
- 2021-03-31 WO PCT/CN2021/084787 patent/WO2022205249A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2022205249A1 (en) | 2022-10-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
KR102235568B1 (en) | Environment sound recognition method based on convolutional neural networks, and system thereof | |
US20210193149A1 (en) | Method, apparatus and device for voiceprint recognition, and medium | |
TW201935464A (en) | Method and device for voiceprint recognition based on memorability bottleneck features | |
CN113516990B (en) | Voice enhancement method, neural network training method and related equipment | |
CN109817222B (en) | Age identification method and device and terminal equipment | |
WO2022141868A1 (en) | Method and apparatus for extracting speech features, terminal, and storage medium | |
CN112289338B (en) | Signal processing method and device, computer equipment and readable storage medium | |
CN112837670B (en) | Speech synthesis method and device and electronic equipment | |
CN110738980A (en) | Singing voice synthesis model training method and system and singing voice synthesis method | |
CN112037800A (en) | Voiceprint nuclear model training method and device, medium and electronic equipment | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
Zhao et al. | A survey on automatic emotion recognition using audio big data and deep learning architectures | |
CN111354374A (en) | Voice processing method, model training method and electronic equipment | |
CN114627889A (en) | Multi-sound-source sound signal processing method and device, storage medium and electronic equipment | |
CN117542373A (en) | Non-air conduction voice recovery system and method | |
CN116982111A (en) | Audio characteristic compensation method, audio identification method and related products | |
CN112397090B (en) | Real-time sound classification method and system based on FPGA | |
KR102220964B1 (en) | Method and device for audio recognition | |
CN116913258A (en) | Speech signal recognition method, device, electronic equipment and computer readable medium | |
CN113488069B (en) | Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network | |
CN114783455A (en) | Method, apparatus, electronic device and computer readable medium for voice noise reduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |