CN113160797B

CN113160797B - Audio feature processing method and device, storage medium and electronic equipment

Info

Publication number: CN113160797B
Application number: CN202110447185.0A
Authority: CN
Inventors: 岑吴镕; 李骊
Original assignee: Beijing HJIMI Technology Co Ltd
Current assignee: Beijing HJIMI Technology Co Ltd
Priority date: 2021-04-25
Filing date: 2021-04-25
Publication date: 2023-06-02
Anticipated expiration: 2041-04-25
Also published as: CN113160797A

Abstract

The invention provides an audio feature processing method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring audio characteristics of a target audio frame of audio to be processed, wherein the audio characteristics consist of sub-characteristic data of multiple dimensions; determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature. By applying the audio feature processing method provided by the invention, the sub-feature data of part of feature dimensions of the audio features can be enhanced, so that new target audio feature data can be obtained, the time spent for expanding the audio features can be greatly reduced, the computing resources can be saved, and the expansion efficiency of the audio features can be improved.

Description

Audio feature processing method and device, storage medium and electronic equipment

Technical Field

The present invention relates to the field of audio data processing technologies, and in particular, to an audio feature processing method and apparatus, a storage medium, and an electronic device.

Background

With the development of science and technology, a speech recognition model is widely applied to various industries, and plays a very important role in various scenes. Speech recognition models require a large number of audio features to train, and in the case of an insufficient number of audio features, the training of the model may be poor.

At present, in order to increase the number of audio features, the number of audio features is generally increased by increasing speed disturbance, volume disturbance and noise on the original audio, however, the number of audio features is increased by adopting the existing method, which results in increasing the number of audio to be processed, further increasing the extraction time of the audio features, and consuming large computing resources.

Disclosure of Invention

The invention aims to provide an audio feature processing method which can effectively reduce the time for expanding audio features.

The invention also provides an audio feature processing device which is used for ensuring the realization and application of the method in practice.

An audio feature processing method, comprising:

acquiring audio characteristics of a target audio frame of audio to be processed, wherein the audio characteristics consist of sub-characteristic data of multiple dimensions;

determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature;

enhancing the target sub-feature data to obtain enhanced sub-feature data;

and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature.

The method, optionally, the obtaining the audio feature of the target audio frame of the audio to be processed includes:

framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;

determining a target audio frame of the audio to be processed in the audio frames;

and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.

In the above method, optionally, the extracting the features of the target audio frame to obtain the audio features of the target audio frame includes:

pre-emphasis processing is carried out on the target audio frame, and a first audio frame is obtained;

adding a hamming window to the first audio frame to obtain a second audio frame;

performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;

obtaining a mel frequency spectrum corresponding to the target audio frame based on the frequency domain data;

acquiring triangular filters corresponding to each set characteristic dimension according to the Mel frequency spectrum;

and inputting the energy corresponding to the frequency domain data to each triangular filter to obtain the audio characteristics of the target audio.

In the above method, optionally, the enhancing the target sub-feature data to obtain enhanced sub-feature data includes:

determining the enhancement times corresponding to the target sub-feature data;

and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.

The method, optionally, further comprises:

and training a preset voice recognition model by applying the target audio data.

An audio feature processing apparatus comprising:

the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring audio characteristics of a target audio frame of audio to be processed, and the audio characteristics consist of sub-characteristic data of multiple dimensions;

a determining unit, configured to determine target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature;

the first execution unit is used for enhancing the target sub-feature data to obtain enhanced sub-feature data;

and the second execution unit is used for replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain the target audio feature.

The above device, optionally, the acquiring unit includes:

the frame dividing subunit is used for dividing frames of the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;

a first determining subunit, configured to determine a target audio frame of the audio to be processed from the audio frames;

and the characteristic extraction subunit is used for carrying out characteristic extraction on the target audio frame to obtain the audio characteristics of the target audio frame.

The above apparatus, optionally, the feature extraction subunit includes:

the pre-emphasis processing subunit is used for carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame;

a first execution subunit, configured to add a hamming window to the first audio frame, and obtain a second audio frame;

the second execution subunit is used for performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;

a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;

a fourth execution subunit, configured to obtain triangular filters corresponding to each set feature dimension according to the mel spectrum;

and a fifth execution subunit, configured to input energy corresponding to the frequency domain data to each triangular filter, and obtain an audio feature of the target audio.

The above apparatus, optionally, the first execution unit includes:

the second determining subunit is used for determining the enhancement times corresponding to the target sub-feature data;

and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhancement sub-feature data.

The above device, optionally, further comprises: the model training unit is used for training a preset voice recognition model by applying the target audio data.

A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform an audio feature processing method as described above.

An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform an audio feature processing method as described above.

Compared with the prior art, the invention has the following advantages:

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for processing audio features according to the present invention;

FIG. 2 is a flow chart of a process for obtaining audio characteristics of a target audio frame of audio to be processed according to the present invention;

FIG. 3 is a flow chart of a process for obtaining audio characteristics of a target audio frame in accordance with the present invention;

fig. 4 is a schematic structural diagram of an audio feature processing device according to the present invention;

fig. 5 is a schematic structural diagram of an electronic device according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The embodiment of the invention provides an audio feature processing method which can be applied to electronic equipment, wherein a flow chart of the method is shown in fig. 1, and the method specifically comprises the following steps:

s101: and acquiring the audio characteristics of a target audio frame of the audio to be processed, wherein the audio characteristics consist of sub-characteristic data of multiple dimensions.

In the method provided by the embodiment of the invention, the target audio frame can be the audio frame to be processed currently.

Wherein the number of target audio frames may be one or more.

Specifically, the audio feature may be an Fbank feature, the audio feature is a feature vector composed of sub-feature data of multiple dimensions, and the number of dimensions of the audio feature may be 71 or 72.

Optionally, one possible way to obtain the audio features of the target audio frame of the audio to be processed is to perform feature extraction on the target audio frame, so as to obtain the audio features of the target audio frame.

S102: and determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature.

In the method provided by the embodiment of the invention, the target sub-feature data can be randomly determined sub-feature data or sub-feature data with specified dimension.

Wherein the number of target sub-feature data in the audio feature may be one or more.

S103: and enhancing the target sub-feature data to obtain enhanced sub-feature data.

In the method provided by the embodiment of the invention, the target sub-feature data can be enhanced according to a preset enhancement mode, and the enhanced sub-feature data of the target sub-feature data can be obtained.

And under the condition that the number of the target sub-feature data is a plurality of, enhancing each target sub-feature data to obtain enhanced sub-feature data of each target sub-feature data.

S104: and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature.

In the method provided by the embodiment of the invention, the target audio feature comprises all sub-feature data and the enhanced sub-feature data which are outside the target sub-feature data in the audio feature.

The embodiment of the invention provides an audio feature processing method, which comprises the following steps: acquiring audio characteristics of a target audio frame of audio to be processed, wherein the audio characteristics consist of sub-characteristic data of multiple dimensions; determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature. By applying the audio feature processing method provided by the invention, the sub-feature data of part of feature dimensions of the audio features can be enhanced, so that new target audio feature data can be obtained, the time spent for expanding the audio features can be greatly reduced, the computing resources can be saved, and the expansion efficiency of the audio features can be improved.

In the method provided by the embodiment of the present invention, based on the implementation process described above, specifically, the process of obtaining the audio feature of the target audio frame of the audio to be processed, as shown in fig. 2, specifically includes:

s201: and framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed.

In the method provided by the embodiment of the invention, the number of the sampling points can be the number of the sampling points forming one audio frame, the length of the moving step length can be the preset number of the sampling points, and the preset number can be smaller than the number of the sampling points.

The number of sampling points may be any number, for example, 500 or 512.

Alternatively, the length of the movement step may be 160 points.

S202: and determining a target audio frame of the audio to be processed in the audio frames.

In the method provided by the embodiment of the invention, a plurality of audio frames which are sequentially continuous in the audio to be processed can be determined as target audio frames, and the current audio frame to be processed in the audio to be processed can also be taken as the target audio frame.

S203: and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.

In the method provided by the embodiment of the invention, the audio features can be any type of audio features, and the feature extraction can be performed on the target audio frame through a preset feature extraction mode to obtain the audio features of the target audio frame corresponding to the feature extraction mode.

In the method provided by the embodiment of the present invention, based on the implementation process described above, specifically, the process of extracting the features of the target audio frame to obtain the audio features of the target audio frame, as shown in fig. 3, specifically includes:

s301: and carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame.

And pre-emphasis processing is carried out on the target audio frame through a preset pre-emphasis formula, so that a first audio frame is obtained.

Alternatively, the pre-emphasis formula may be: y is Y _t+1 ＝X _t+1 -α*X _t 。

Wherein X is _t The value of the sampling point at time t is represented by Y, the value of the sampling point after pre-emphasis is represented by Y, and α is a pre-emphasis coefficient, where the range of the pre-emphasis system may be 0.95 to 1, and the first sampling point of the target audio frame may be unchanged.

S302: and adding a Hamming window to the first audio frame to obtain a second audio frame.

Wherein a hamming window may be added to the first audio frame by a hamming window processing formula.

Alternatively, the hamming window processing formula may be: z is Z _n ＝Y _n *h _n 。

Wherein Y represents the sampling point before windowing, Z represents the sampling point after windowing, and h represents the windowing coefficient.

Wherein, the liquid crystal display device comprises a liquid crystal display device,

where β=0.46 can be set, n represents the total number of points to be windowed, and n represents a certain sampling point.

S303: and performing fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame.

S304: and obtaining the Mel frequency spectrum corresponding to the target audio frame based on the frequency domain data.

The frequency domain data can be calculated through a mel frequency spectrum conversion formula, so that a mel frequency spectrum corresponding to the target audio frame is obtained.

Alternatively, the mel-frequency spectrum conversion formula may be:

where Mel (f) is Mel spectrum and f is frequency domain data.

S305: and obtaining triangular filters corresponding to the set characteristic dimensions according to the Mel frequency spectrum.

Dividing the Mel spectrum into initial triangular filters with preset dimension number, and converting each initial triangular filter back to frequency domain to obtain triangular filter corresponding to each set characteristic dimension.

S306: and inputting the energy corresponding to the frequency domain data to each triangular filter to obtain the audio characteristics of the target audio.

In the method provided by the embodiment of the invention, the real part square of the frequency domain data can be added with the imaginary part square of the frequency domain data to obtain the energy corresponding to the frequency domain data.

The energy corresponding to the frequency domain data can be processed through each triangular filter to obtain sub-feature data of each feature dimension, and the sub-feature data of each dimension forms the audio feature of the target audio.

In the method provided by the embodiment of the present invention, based on the implementation process, specifically, the enhancing the target sub-feature data to obtain enhanced sub-feature data includes:

determining the enhancement times corresponding to the target sub-feature data;

In the method provided by the embodiment of the invention, a preset enhancement multiple set can be determined, any multiple is randomly selected in the enhancement multiple set, and the selected multiple is used as the enhancement multiple corresponding to the target sub-feature data.

The enhancement multiplier may be multiplied by the target sub-feature data to implement enhancement of the target sub-feature data, to obtain enhanced sub-feature data of the target sub-feature data.

The enhancement multiple set may be set according to actual requirements, for example, [0.95,1.05] or [0.96,1.06].

By applying the method provided by the embodiment of the invention, the target sub-feature data can be effectively enhanced, so that each sub-feature data and enhanced sub-feature data except the target sub-feature data in the audio feature can be combined into the target audio feature.

In the method provided by the embodiment of the present invention, based on the implementation process, the method specifically further includes:

In the audio feature processing method provided by the invention, in the actual application process, the audio features of the target audio frame can be Fbank features, and the Fbank features are taken as an example for explanation:

firstly, fbank characteristics of audio to be processed are extracted, and the Fbank characteristics are specifically as follows:

step a1: and carrying out audio framing on the processed audio, wherein 512 sampling points are taken as one frame, and 160 points are moved each time to obtain each audio frame.

Step a2: each audio frame extracts FBank features:

(1) The audio pre-emphasis is as follows: y is Y _t+1 ＝X _t+1 -α*X _t 。

Wherein X is _t The value of the sampling point at the time t is represented by Y, the value of the sampling point after pre-emphasis is represented by Y, alpha is a pre-emphasis coefficient, the range of the alpha is 0.95 to 1, and the first sampling point of the audio is unchanged.

(2) A hamming window is added.

The addition of the Hamming window can prevent the oscillation phenomenon at the rear edge of the Fourier transform.

The specific formula is as follows: z is Z _n ＝Y _n *h _n 。

beta=0.46 can be set, n represents the total number of points to be windowed, nIndicating a certain sampling point.

(3) And converting the audio frame subjected to pre-emphasis and Hamming window from a time domain to a frequency domain through fast Fourier transform to obtain frequency domain data.

(4) By the formula

The frequency domain data is converted into a mel spectrum, the mel spectrum is equally divided into 71 triangular filters, and the triangular filters are converted back into the frequency domain.

(5) The frequency domain corresponding energy is passed through the triangular filter to obtain a 71-dimensional eigenvector.

Second, the FBank feature is enhanced as follows:

for the extracted 71-dimensional Fbank characteristic of each audio frame, randomly sampling the 71-dimensional vector (71 numbers) in each audio frame once to extract 1 number from the 71 numbers, carrying out size conversion of 0.95-1.05 times of the extracted numbers, and replacing the converted numbers with the original numbers to obtain the target audio characteristics of the audio frame.

Assuming that the enhancement factor is 0.97 and the number extracted is 10, the transformed number is 9.7, and then the transformed number 9.7 is substituted for the original number 10.

Corresponding to the method shown in fig. 1, the embodiment of the present invention further provides an audio feature processing apparatus, which is used for implementing the method shown in fig. 1, where the audio feature processing apparatus provided in the embodiment of the present invention may be applied to an electronic device, and the structural schematic diagram of the audio feature processing apparatus is shown in fig. 4, and specifically includes:

an obtaining unit 401, configured to obtain an audio feature of a target audio frame of audio to be processed, where the audio feature is composed of sub-feature data of multiple dimensions;

a determining unit 402, configured to determine target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature;

a first execution unit 403, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;

and the second execution unit 404 is configured to replace the target sub-feature data in the audio feature with the enhancement sub-feature data, so as to obtain a target audio feature.

The embodiment of the invention provides an audio feature processing device which can acquire audio features of a target audio frame of audio to be processed, wherein the audio features consist of sub-feature data with multiple dimensions; determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhancement sub-feature data to obtain a target audio feature. The audio feature processing device provided by the invention can enhance the sub-feature data of part of feature dimensions of the audio features, thereby acquiring new target audio feature data, greatly reducing the time spent for expanding the audio features, saving the computing resources and improving the expansion efficiency of the audio features.

In an embodiment of the present invention, based on the foregoing solution, optionally, the obtaining unit 401 includes:

In an embodiment of the present invention, based on the foregoing solution, optionally, the feature extraction subunit includes:

In an embodiment of the present invention, based on the foregoing solution, optionally, the first executing unit 403 includes:

In an embodiment of the present invention, based on the above solution, optionally, the method further includes: the model training unit is used for training a preset voice recognition model by applying the target audio data.

The specific principle and execution process of each unit and module in the audio feature processing device disclosed in the above embodiment of the present invention are the same as those of the audio feature processing method disclosed in the above embodiment of the present invention, and may refer to corresponding parts in the audio feature processing method provided in the above embodiment of the present invention, which are not described herein again.

The embodiment of the invention also provides a storage medium, which comprises stored instructions, wherein the equipment where the storage medium is located is controlled to execute the audio feature processing method when the instructions run.

The embodiment of the present invention further provides an electronic device, whose structural schematic diagram is shown in fig. 5, specifically including a memory 501, and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501, and configured to be executed by the one or more processors 503, where the one or more instructions 502 perform the following operations:

enhancing the target sub-feature data to obtain enhanced sub-feature data;

It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference is made to the description of the method embodiments for relevant points.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention.

From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.

The foregoing has outlined a detailed description of a method for processing audio features, wherein specific examples are provided herein to illustrate the principles and embodiments of the present invention, and the above examples are provided to assist in understanding the method and core ideas of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims

1. A method of processing audio features, comprising:

determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature, wherein the target sub-feature data is randomly determined sub-feature data or sub-feature data of a designated dimension;

enhancing the target sub-feature data to obtain enhanced sub-feature data;

2. The method of claim 1, wherein the acquiring the audio characteristics of the target audio frame of the audio to be processed comprises:

3. The method of claim 2, wherein the extracting the features of the target audio frame to obtain the audio features of the target audio frame comprises:

4. The method of claim 1, wherein enhancing the target sub-feature data to obtain enhanced sub-feature data comprises:

determining the enhancement times corresponding to the target sub-feature data;

5. The method as recited in claim 1, further comprising:

6. An audio feature processing apparatus, comprising:

the determining unit is used for determining target sub-feature data of the audio feature from sub-feature data of each dimension of the audio feature, wherein the target sub-feature data is randomly determined sub-feature data or sub-feature data of a designated dimension;

7. The apparatus of claim 6, wherein the acquisition unit comprises:

8. The apparatus of claim 7, wherein the feature extraction subunit comprises:

9. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device in which the storage medium is located to perform the audio feature processing method of any one of claims 1 to 5.

10. An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the audio feature processing method of any of claims 1-5.