CN113160797B - Audio feature processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number: CN113160797B (application CN202110447185.0A)
Authority: CN (China)
Prior art keywords: audio, feature, target, sub, feature data
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other versions: CN113160797A (Chinese)
Inventors: 岑吴镕, 李骊
Assignee (current and original): Beijing HJIMI Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Filing history: application filed by Beijing HJIMI Technology Co Ltd; priority to CN202110447185.0A; publication of application CN113160797A; application granted and published as CN113160797B

Classifications

    • G (Physics) › G10 (Musical instruments; acoustics) › G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 25/27: Speech or voice analysis techniques (not restricted to a single one of groups G10L 15/00 to G10L 21/00) characterised by the analysis technique
    • G10L 25/45: Speech or voice analysis techniques (not restricted to a single one of groups G10L 15/00 to G10L 21/00) characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides an audio feature processing method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature is composed of sub-feature data of multiple dimensions; determining target sub-feature data of the audio feature from the sub-feature data of each dimension; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. By applying the audio feature processing method provided by the invention, the sub-feature data of some feature dimensions of an existing audio feature can be enhanced to produce a new target audio feature, which greatly reduces the time spent expanding the audio features, saves computing resources, and improves the efficiency of audio feature expansion.

Description

Audio feature processing method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of audio data processing technologies, and in particular, to an audio feature processing method and apparatus, a storage medium, and an electronic device.
Background
With the development of science and technology, speech recognition models are widely applied across industries and play a very important role in many scenarios. A speech recognition model requires a large number of audio features for training; when the number of audio features is insufficient, the trained model may perform poorly.
At present, the number of audio features is typically increased by applying speed perturbation, volume perturbation, and added noise to the original audio. However, this approach increases the amount of audio to be processed, which in turn increases the time needed for feature extraction and consumes substantial computing resources.
Disclosure of Invention
The invention aims to provide an audio feature processing method which can effectively reduce the time required to expand audio features.
The invention also provides an audio feature processing device, to ensure that the above method can be implemented and applied in practice.
An audio feature processing method, comprising:
acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature is composed of sub-feature data of multiple dimensions;
determining target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature.
In the above method, optionally, obtaining the audio feature of the target audio frame of the audio to be processed includes:
framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;
determining a target audio frame of the audio to be processed in the audio frames;
and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.
In the above method, optionally, the extracting the features of the target audio frame to obtain the audio features of the target audio frame includes:
pre-emphasis processing is carried out on the target audio frame, and a first audio frame is obtained;
adding a hamming window to the first audio frame to obtain a second audio frame;
performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
obtaining a mel frequency spectrum corresponding to the target audio frame based on the frequency domain data;
obtaining the triangular filters corresponding to each set feature dimension according to the mel spectrum;
and inputting the energy corresponding to the frequency domain data into each triangular filter to obtain the audio feature of the target audio frame.
In the above method, optionally, the enhancing the target sub-feature data to obtain enhanced sub-feature data includes:
determining the enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
The method, optionally, further comprises:
and training a preset speech recognition model by applying the target audio feature.
An audio feature processing apparatus comprising:
an acquisition unit, which is used for acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature is composed of sub-feature data of multiple dimensions;
a determining unit, configured to determine target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
a first execution unit, used for enhancing the target sub-feature data to obtain enhanced sub-feature data;
and a second execution unit, used for replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain the target audio feature.
The above device, optionally, the acquiring unit includes:
the frame dividing subunit is used for dividing frames of the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;
a first determining subunit, configured to determine a target audio frame of the audio to be processed from the audio frames;
and the characteristic extraction subunit is used for carrying out characteristic extraction on the target audio frame to obtain the audio characteristics of the target audio frame.
The above apparatus, optionally, the feature extraction subunit includes:
the pre-emphasis processing subunit is used for carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame;
a first execution subunit, configured to add a hamming window to the first audio frame, and obtain a second audio frame;
the second execution subunit is used for performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain triangular filters corresponding to each set feature dimension according to the mel spectrum;
and a fifth execution subunit, configured to input energy corresponding to the frequency domain data to each triangular filter, and obtain an audio feature of the target audio.
The above apparatus, optionally, the first execution unit includes:
the second determining subunit is used for determining the enhancement multiple corresponding to the target sub-feature data;
and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhancement sub-feature data.
The above device, optionally, further comprises: the model training unit, which is used for training a preset speech recognition model by applying the target audio feature.
A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform an audio feature processing method as described above.
An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform an audio feature processing method as described above.
Compared with the prior art, the invention has the following advantages:
the invention provides an audio feature processing method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring audio characteristics of a target audio frame of audio to be processed, wherein the audio characteristics consist of sub-characteristic data of multiple dimensions; determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature. By applying the audio feature processing method provided by the invention, the sub-feature data of part of feature dimensions of the audio features can be enhanced, so that new target audio feature data can be obtained, the time spent for expanding the audio features can be greatly reduced, the computing resources can be saved, and the expansion efficiency of the audio features can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for processing audio features according to the present invention;
FIG. 2 is a flow chart of a process for obtaining audio characteristics of a target audio frame of audio to be processed according to the present invention;
FIG. 3 is a flow chart of a process for obtaining audio characteristics of a target audio frame in accordance with the present invention;
fig. 4 is a schematic structural diagram of an audio feature processing device according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides an audio feature processing method which can be applied to electronic equipment, wherein a flow chart of the method is shown in fig. 1, and the method specifically comprises the following steps:
s101: and acquiring the audio characteristics of a target audio frame of the audio to be processed, wherein the audio characteristics consist of sub-characteristic data of multiple dimensions.
In the method provided by the embodiment of the invention, the target audio frame can be the audio frame to be processed currently.
Wherein the number of target audio frames may be one or more.
Specifically, the audio feature may be an FBank feature: a feature vector composed of sub-feature data of multiple dimensions, where the number of dimensions of the audio feature may be, for example, 71 or 72.
Optionally, one possible way to obtain the audio features of the target audio frame of the audio to be processed is to perform feature extraction on the target audio frame, so as to obtain the audio features of the target audio frame.
S102: and determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature.
In the method provided by the embodiment of the invention, the target sub-feature data can be randomly determined sub-feature data or sub-feature data with specified dimension.
Wherein the number of target sub-feature data in the audio feature may be one or more.
S103: and enhancing the target sub-feature data to obtain enhanced sub-feature data.
In the method provided by the embodiment of the invention, the target sub-feature data can be enhanced according to a preset enhancement mode, and the enhanced sub-feature data of the target sub-feature data can be obtained.
And under the condition that the number of the target sub-feature data is a plurality of, enhancing each target sub-feature data to obtain enhanced sub-feature data of each target sub-feature data.
S104: and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature.
In the method provided by the embodiment of the invention, the target audio feature comprises all sub-feature data of the audio feature other than the target sub-feature data, together with the enhanced sub-feature data.
The embodiment of the invention provides an audio feature processing method, which comprises the following steps: acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature is composed of sub-feature data of multiple dimensions; determining target sub-feature data of the audio feature from the sub-feature data of each dimension; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. By applying the audio feature processing method provided by the invention, the sub-feature data of some feature dimensions of an existing audio feature can be enhanced to produce a new target audio feature, which greatly reduces the time spent expanding the audio features, saves computing resources, and improves the efficiency of audio feature expansion.
In the method provided by the embodiment of the present invention, based on the implementation process described above, specifically, the process of obtaining the audio feature of the target audio frame of the audio to be processed, as shown in fig. 2, specifically includes:
s201: and framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed.
In the method provided by the embodiment of the invention, the number of sampling points may be the number of sampling points that make up one audio frame, and the moving step length may be a preset number of sampling points, where the preset number may be smaller than the number of sampling points per frame.
The number of sampling points may be any number, for example, 500 or 512.
Alternatively, the length of the movement step may be 160 points.
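As an illustrative (non-limiting) Python sketch, the framing step can be written as follows; the function name is assumed, and the 512-sampling-point frame length and 160-point moving step are the example values mentioned in this description:

```python
def frame_audio(samples, frame_len=512, hop=160):
    """Split raw samples into overlapping audio frames: each frame holds
    frame_len sampling points, and consecutive frames start hop points
    apart (the moving step length)."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```

Assuming, for example, a 16 kHz sampling rate, 512 points correspond to 32 ms frames and a 160-point step to a 10 ms hop, so neighbouring frames overlap substantially.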
S202: and determining a target audio frame of the audio to be processed in the audio frames.
In the method provided by the embodiment of the invention, a plurality of audio frames which are sequentially continuous in the audio to be processed can be determined as target audio frames, and the current audio frame to be processed in the audio to be processed can also be taken as the target audio frame.
S203: and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.
In the method provided by the embodiment of the invention, the audio features can be any type of audio features, and the feature extraction can be performed on the target audio frame through a preset feature extraction mode to obtain the audio features of the target audio frame corresponding to the feature extraction mode.
In the method provided by the embodiment of the present invention, based on the implementation process described above, specifically, the process of extracting the features of the target audio frame to obtain the audio features of the target audio frame, as shown in fig. 3, specifically includes:
s301: and carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame.
And pre-emphasis processing is carried out on the target audio frame through a preset pre-emphasis formula, so that a first audio frame is obtained.
Alternatively, the pre-emphasis formula may be: Y[t+1] = X[t+1] − α·X[t]
where X[t] denotes the value of the sampling point at time t, Y denotes the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient, whose range may be 0.95 to 1; the first sampling point of the target audio frame may be left unchanged.
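The pre-emphasis formula above can be sketched in Python as follows (illustrative only; alpha = 0.97 is an assumed value within the 0.95 to 1 range stated in the text):

```python
def pre_emphasize(frame, alpha=0.97):
    """Apply Y[t] = X[t] - alpha * X[t-1] to each sampling point;
    the first sampling point of the frame is left unchanged."""
    out = [float(frame[0])]
    for t in range(1, len(frame)):
        out.append(frame[t] - alpha * frame[t - 1])
    return out
```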
S302: and adding a Hamming window to the first audio frame to obtain a second audio frame.
Wherein a hamming window may be added to the first audio frame by a hamming-window processing formula.
Alternatively, the hamming-window processing formula may be: Z[n] = Y[n] · h[n]
where Y denotes the sampling point before windowing, Z denotes the sampling point after windowing, and h denotes the window coefficient, given by:
h[n] = (1 − β) − β · cos(2πn / (N − 1))
where β = 0.46 may be set, N denotes the total number of points to be windowed, and n denotes a particular sampling point.
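The windowing step can be sketched in Python as follows (illustrative only; the window formula used is the standard Hamming window with beta = 0.46, assumed to match the formula pictured in the original):

```python
import math

def hamming_window(total_points, beta=0.46):
    """h[n] = (1 - beta) - beta * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    return [(1 - beta) - beta * math.cos(2 * math.pi * n / (total_points - 1))
            for n in range(total_points)]

def apply_hamming(frame):
    """Z[n] = Y[n] * h[n]: multiply each sampling point by its window coefficient."""
    return [y * h for y, h in zip(frame, hamming_window(len(frame)))]
```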
S303: and performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame.
S304: and obtaining the Mel frequency spectrum corresponding to the target audio frame based on the frequency domain data.
The frequency domain data can be converted through a mel spectrum conversion formula to obtain the mel spectrum corresponding to the target audio frame.
Alternatively, the mel spectrum conversion formula may be:
Mel(f) = 2595 · log10(1 + f / 700)
where Mel(f) is the mel value and f is the frequency of the frequency domain data.
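The mel conversion and its inverse (used when the filters are converted back to the frequency domain) can be sketched as follows (illustrative Python; the 2595 · log10 form is the standard HTK-style formula, assumed to match the pictured one):

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700): frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, from the mel scale back to frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```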
S305: and obtaining triangular filters corresponding to the set characteristic dimensions according to the Mel frequency spectrum.
The mel spectrum may be divided equally into a preset number of initial triangular filters, and each initial triangular filter may then be converted back to the frequency domain to obtain the triangular filter corresponding to each set feature dimension.
S306: and inputting the energy corresponding to the frequency domain data into each triangular filter to obtain the audio feature of the target audio frame.
In the method provided by the embodiment of the invention, the energy corresponding to the frequency domain data can be obtained by adding the square of the real part of the frequency domain data to the square of its imaginary part.
The energy corresponding to the frequency domain data can then be processed through each triangular filter to obtain the sub-feature data of each feature dimension; the sub-feature data of all dimensions together form the audio feature of the target audio frame.
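Steps S305 and S306 can be sketched as follows (illustrative Python; the triangular filter weights are passed in as plain lists rather than constructed here, since their construction follows the mel division described above):

```python
def spectrum_energy(fft_bins):
    """Energy of each frequency-domain bin: the square of the real part
    plus the square of the imaginary part, as described in the text."""
    return [c.real ** 2 + c.imag ** 2 for c in fft_bins]

def filterbank_features(energy, filters):
    """One sub-feature per triangular filter: the filter-weighted sum of
    the bin energies; len(filters) feature dimensions (e.g. 71) in total."""
    return [sum(e * w for e, w in zip(energy, f)) for f in filters]
```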
In the method provided by the embodiment of the present invention, based on the implementation process, specifically, the enhancing the target sub-feature data to obtain enhanced sub-feature data includes:
determining the enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
In the method provided by the embodiment of the invention, a preset enhancement multiple set can be determined, any multiple can be randomly selected from the enhancement multiple set, and the selected multiple is used as the enhancement multiple corresponding to the target sub-feature data.
The enhancement multiple may be multiplied with the target sub-feature data to enhance it, obtaining the enhanced sub-feature data of the target sub-feature data.
The enhancement multiple set may be set according to actual requirements, for example, [0.95,1.05] or [0.96,1.06].
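The enhancement step can be sketched as follows (illustrative Python; the function and variable names are assumed, and the default range follows the [0.95, 1.05] example above):

```python
import random

def enhance_feature(feature, multiple_range=(0.95, 1.05), rng=random):
    """Randomly choose one dimension of the feature vector, draw an
    enhancement multiple from multiple_range, and replace the chosen
    sub-feature data with its scaled value."""
    idx = rng.randrange(len(feature))
    multiple = rng.uniform(*multiple_range)
    target = list(feature)              # all other dimensions are kept as-is
    target[idx] = feature[idx] * multiple
    return target, idx, multiple
```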
By applying the method provided by the embodiment of the invention, the target sub-feature data can be effectively enhanced, so that the sub-feature data other than the target sub-feature data in the audio feature and the enhanced sub-feature data can be combined into the target audio feature.
In the method provided by the embodiment of the present invention, based on the implementation process, the method specifically further includes:
and training a preset speech recognition model by applying the target audio feature.
In the practical application of the audio feature processing method provided by the invention, the audio feature of the target audio frame may be an FBank feature, which is taken as an example in the following description:
First, the FBank features of the audio to be processed are extracted, specifically as follows:
step a1: and carrying out audio framing on the processed audio, wherein 512 sampling points are taken as one frame, and 160 points are moved each time to obtain each audio frame.
Step a2: each audio frame extracts FBank features:
(1) Audio pre-emphasis: Y[t+1] = X[t+1] − α·X[t]
where X[t] denotes the value of the sampling point at time t, Y denotes the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient with a range of 0.95 to 1; the first sampling point of the audio is left unchanged.
(2) A hamming window is added.
Adding the Hamming window prevents oscillation at the frame edges after the Fourier transform.
The specific formula is: Z[n] = Y[n] · h[n]
where Y denotes the sampling point before windowing, Z denotes the sampling point after windowing, and h denotes the window coefficient, given by:
h[n] = (1 − β) − β · cos(2πn / (N − 1))
where β = 0.46 may be set, N denotes the total number of points to be windowed, and n denotes a particular sampling point.
(3) And converting the audio frame subjected to pre-emphasis and Hamming window from a time domain to a frequency domain through fast Fourier transform to obtain frequency domain data.
(4) The frequency domain data is converted into a mel spectrum by the formula Mel(f) = 2595 · log10(1 + f / 700); the mel spectrum is divided equally into 71 triangular filters, and the triangular filters are converted back into the frequency domain.
(5) The energy corresponding to the frequency domain data is passed through the triangular filters to obtain a 71-dimensional feature vector.
Second, the FBank feature is enhanced as follows:
for the extracted 71-dimensional Fbank characteristic of each audio frame, randomly sampling the 71-dimensional vector (71 numbers) in each audio frame once to extract 1 number from the 71 numbers, carrying out size conversion of 0.95-1.05 times of the extracted numbers, and replacing the converted numbers with the original numbers to obtain the target audio characteristics of the audio frame.
For example, assuming that the enhancement multiple is 0.97 and the extracted number is 10, the transformed number is 9.7, and the transformed number 9.7 then replaces the original number 10.
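The worked example above can be reproduced directly (illustrative Python; a 3-dimensional toy vector stands in for the 71-dimensional FBank feature):

```python
feature = [8.0, 10.0, 12.0]   # toy stand-in for a 71-dimensional FBank vector
idx, multiple = 1, 0.97       # the dimension is chosen at random in practice
target_feature = list(feature)
target_feature[idx] = feature[idx] * multiple   # 10 * 0.97 -> 9.7 replaces 10
```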
Corresponding to the method shown in fig. 1, the embodiment of the present invention further provides an audio feature processing apparatus, which is used for implementing the method shown in fig. 1, where the audio feature processing apparatus provided in the embodiment of the present invention may be applied to an electronic device, and the structural schematic diagram of the audio feature processing apparatus is shown in fig. 4, and specifically includes:
an obtaining unit 401, configured to obtain an audio feature of a target audio frame of audio to be processed, where the audio feature is composed of sub-feature data of multiple dimensions;
a determining unit 402, configured to determine target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature;
a first execution unit 403, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;
and the second execution unit 404 is configured to replace the target sub-feature data in the audio feature with the enhancement sub-feature data, so as to obtain a target audio feature.
The embodiment of the invention provides an audio feature processing device, which can acquire an audio feature of a target audio frame of audio to be processed, wherein the audio feature is composed of sub-feature data of multiple dimensions; determine target sub-feature data of the audio feature from the sub-feature data of each dimension; enhance the target sub-feature data to obtain enhanced sub-feature data; and replace the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. The audio feature processing device provided by the invention can enhance the sub-feature data of some feature dimensions of an existing audio feature, thereby obtaining a new target audio feature, greatly reducing the time spent expanding the audio features, saving computing resources, and improving the efficiency of audio feature expansion.
In an embodiment of the present invention, based on the foregoing solution, optionally, the obtaining unit 401 includes:
the frame dividing subunit is used for dividing frames of the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;
a first determining subunit, configured to determine a target audio frame of the audio to be processed from the audio frames;
and the characteristic extraction subunit is used for carrying out characteristic extraction on the target audio frame to obtain the audio characteristics of the target audio frame.
In an embodiment of the present invention, based on the foregoing solution, optionally, the feature extraction subunit includes:
the pre-emphasis processing subunit is used for carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame;
a first execution subunit, configured to add a hamming window to the first audio frame, and obtain a second audio frame;
the second execution subunit is used for performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain triangular filters corresponding to each set feature dimension according to the mel spectrum;
and a fifth execution subunit, configured to input energy corresponding to the frequency domain data to each triangular filter, and obtain an audio feature of the target audio.
In an embodiment of the present invention, based on the foregoing solution, optionally, the first executing unit 403 includes:
the second determining subunit is used for determining the enhancement multiple corresponding to the target sub-feature data;
and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhancement sub-feature data.
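The patent leaves the choice of enhancement multiple open. One plausible reading, sketched below, samples the multiple from a small range so each pass yields a different target feature; the range is an assumption, not stated in the source:

```python
import numpy as np

def enhance(target_sub_feature, low=1.1, high=1.5, rng=None):
    """Determine an enhancement multiple, then apply it to the target sub-feature."""
    rng = rng or np.random.default_rng()
    multiple = rng.uniform(low, high)  # assumed sampling strategy
    return target_sub_feature * multiple

enhanced = enhance(2.0, rng=np.random.default_rng(7))
# enhanced lies between 2.0 * 1.1 and 2.0 * 1.5
```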
In an embodiment of the present invention, based on the above solution, optionally, the device further includes: a model training unit, configured to train a preset speech recognition model with the target audio features.
The specific principles and execution processes of the units and modules in the audio feature processing device disclosed in the above embodiment of the present invention are the same as those of the audio feature processing method disclosed in the above embodiment of the present invention; reference may be made to the corresponding parts of the audio feature processing method provided above, and details are not repeated here.
The embodiment of the invention further provides a storage medium comprising stored instructions, wherein when the instructions are executed, the device on which the storage medium resides is controlled to perform the above audio feature processing method.
The embodiment of the present invention further provides an electronic device, whose structural schematic diagram is shown in fig. 5. The device specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and configured to be executed by one or more processors 503 to perform the following operations:
acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data of multiple dimensions;
determining target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. Since the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above device is described as being divided into various units by function. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
The audio feature processing method provided by the present invention has been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention; the above embodiments are described only to help understand the method and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementations and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of processing audio features, comprising:
acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data of multiple dimensions;
determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature, wherein the target sub-feature data is randomly determined sub-feature data or sub-feature data of a designated dimension;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature.
2. The method of claim 1, wherein acquiring the audio feature of the target audio frame of the audio to be processed comprises:
framing the audio to be processed based on a set number of sampling points and a moving step length to obtain the audio frames of the audio to be processed;
determining a target audio frame of the audio to be processed from the audio frames;
and performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame.
3. The method of claim 2, wherein performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame comprises:
performing pre-emphasis processing on the target audio frame to obtain a first audio frame;
applying a Hamming window to the first audio frame to obtain a second audio frame;
performing a fast Fourier transform on the second audio frame to obtain frequency-domain data of the second audio frame;
obtaining a mel spectrum corresponding to the target audio frame based on the frequency-domain data;
obtaining triangular filters corresponding to each set feature dimension according to the mel spectrum;
and inputting the energy corresponding to the frequency-domain data to each triangular filter to obtain the audio feature of the target audio frame.
4. The method of claim 1, wherein enhancing the target sub-feature data to obtain enhanced sub-feature data comprises:
determining the enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
5. The method of claim 1, further comprising:
training a preset speech recognition model with the target audio feature.
6. An audio feature processing apparatus, comprising:
an acquisition unit, configured to acquire an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data of multiple dimensions;
a determining unit, configured to determine target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature, wherein the target sub-feature data is randomly determined sub-feature data or sub-feature data of a designated dimension;
a first execution unit, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;
and a second execution unit, configured to replace the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain the target audio feature.
7. The apparatus of claim 6, wherein the acquisition unit comprises:
a framing subunit, configured to frame the audio to be processed based on a set number of sampling points and a moving step length to obtain the audio frames of the audio to be processed;
a first determining subunit, configured to determine a target audio frame of the audio to be processed from the audio frames;
and a feature extraction subunit, configured to perform feature extraction on the target audio frame to obtain the audio feature of the target audio frame.
8. The apparatus of claim 7, wherein the feature extraction subunit comprises:
a pre-emphasis subunit, configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
a first execution subunit, configured to apply a Hamming window to the first audio frame to obtain a second audio frame;
a second execution subunit, configured to perform a fast Fourier transform on the second audio frame to obtain frequency-domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency-domain data;
a fourth execution subunit, configured to obtain the triangular filters corresponding to each set feature dimension according to the mel spectrum;
and a fifth execution subunit, configured to input the energy corresponding to the frequency-domain data to each triangular filter to obtain the audio feature of the target audio frame.
9. A storage medium comprising stored instructions, wherein when the instructions are executed, a device on which the storage medium resides is controlled to perform the audio feature processing method of any one of claims 1 to 5.
10. An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the audio feature processing method of any one of claims 1 to 5.
CN202110447185.0A 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment Active CN113160797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110447185.0A CN113160797B (en) 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN113160797A CN113160797A (en) 2021-07-23
CN113160797B true CN113160797B (en) 2023-06-02

Family

ID=76870199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110447185.0A Active CN113160797B (en) 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113160797B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930185B2 (en) * 2009-08-28 2015-01-06 International Business Machines Corporation Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
JP2013164572A (en) * 2012-01-10 2013-08-22 Toshiba Corp Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program
CN104240719B (en) * 2013-06-24 2018-01-12 浙江大华技术股份有限公司 The feature extracting method of audio, the sorting technique of audio and relevant apparatus
CN108922541B (en) * 2018-05-25 2023-06-02 南京邮电大学 Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
US10665222B2 (en) * 2018-06-28 2020-05-26 Intel Corporation Method and system of temporal-domain feature extraction for automatic speech recognition
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN111261189B (en) * 2020-04-02 2023-01-31 中国科学院上海微系统与信息技术研究所 Vehicle sound signal feature extraction method


Similar Documents

Publication Publication Date Title
US20220343898A1 (en) Speech recognition method and apparatus, and computer-readable storage medium
WO2020173133A1 (en) Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium
JP5897107B2 (en) Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN105976812A (en) Voice identification method and equipment thereof
CN107527620A (en) Electronic installation, the method for authentication and computer-readable recording medium
CN104021789A (en) Self-adaption endpoint detection method using short-time time-frequency value
CN111261183A (en) Method and device for denoising voice
CN109036437A (en) Accents recognition method, apparatus, computer installation and computer readable storage medium
WO2023283823A1 (en) Speech adversarial sample testing method and apparatus, device, and computer-readable storage medium
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN109658943B (en) Audio noise detection method and device, storage medium and mobile terminal
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN111754983A (en) Voice denoising method and device, electronic equipment and storage medium
CN113571078A (en) Noise suppression method, device, medium, and electronic apparatus
CN112820314A (en) Intelligent voice control large screen display method, system and related components thereof
Yuan et al. Speech separation using convolutional neural network and attention mechanism
Shen et al. Rars: Recognition of audio recording source based on residual neural network
CN113160797B (en) Audio feature processing method and device, storage medium and electronic equipment
CN113327594A (en) Speech recognition model training method, device, equipment and storage medium
CN111883147B (en) Audio data processing method, device, computer equipment and storage medium
CN110415722B (en) Speech signal processing method, storage medium, computer program, and electronic device
CN116913304A (en) Real-time voice stream noise reduction method and device, computer equipment and storage medium
CN111402918A (en) Audio processing method, device, equipment and storage medium
CN115035887A (en) Voice signal processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant