CN112420070A - Automatic labeling method and device, electronic equipment and computer readable storage medium - Google Patents
Automatic labeling method and device, electronic equipment and computer readable storage medium Download PDFInfo
- Publication number
- CN112420070A (application CN201910780661.3A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio information
- preset
- features
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000002372 labelling Methods 0.000 title claims abstract description 77
- 238000000034 method Methods 0.000 claims abstract description 23
- 230000003595 spectral effect Effects 0.000 claims description 13
- 238000004422 calculation algorithm Methods 0.000 claims description 10
- 238000000605 extraction Methods 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 7
- 238000001228 spectrum Methods 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims description 2
- 230000003287 optical effect Effects 0.000 description 3
- 230000004044 response Effects 0.000 description 3
- 238000005070 sampling Methods 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000006835 compression Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 238000005034 decoration Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000007476 Maximum Likelihood Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000003745 diagnosis Methods 0.000 description 1
- 210000005069 ears Anatomy 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000000802 evaporation-induced self-assembly Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The embodiment of the application provides an automatic labeling method, an automatic labeling device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: extracting preset audio features from the audio information to be labeled, and inputting the preset audio features into the trained labeling model to obtain the labeled audio information. The embodiment of the application realizes automatic labeling of audio information; compared with manual labeling of audio information, this saves human resources and time resources, and automatically labeling the audio information can effectively improve labeling efficiency.
Description
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an automatic labeling method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Audio information is a type of information that people encounter frequently, and labeling audio information is very important.
In general, the labeling of audio information needs to be completed manually by specially trained annotators, which consumes a large amount of human resources and a large amount of time; in addition, the efficiency of manually labeling audio information is low.
Disclosure of Invention
The application provides an automatic labeling method, an automatic labeling device, electronic equipment and a computer-readable storage medium, which can solve at least one of the above technical problems. The technical solution is as follows:
in a first aspect, an automatic labeling method is provided, and the method includes:
extracting preset audio features from audio information to be marked;
and inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
In a second aspect, there is provided an automatic labeling apparatus, the apparatus comprising:
the extraction module is used for extracting preset audio features from the audio information to be labeled;
and the input module is used for inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the operations corresponding to the automatic labeling method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the automatic labeling method of the first aspect.
The beneficial effect that technical scheme that this application provided brought is:
compared with the prior art, the automatic labeling method and device, the electronic equipment and the computer readable storage medium have the advantages that the preset audio features are extracted from the audio information to be labeled, the preset audio features are input into the labeled model after training, the labeled audio information is obtained, automatic labeling of the audio information is achieved, compared with manual labeling of the audio information, human resources and time resources are saved, and the efficiency of labeling can be effectively improved by automatically labeling the audio information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of an automatic labeling method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an automatic labeling apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides an automatic labeling method, as shown in fig. 1, the method includes:
step S101, extracting preset audio features from audio information to be labeled.
And S102, inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
For the embodiment of the application, the trained labeling model is used for labeling the preset audio features to obtain the labeled audio information.
Compared with the prior art, the automatic labeling method extracts preset audio features from the audio information to be labeled and inputs them into the trained labeling model to obtain the labeled audio information, thereby realizing automatic labeling of the audio information. Compared with manual labeling of audio information, this saves human resources and time resources, and automatically labeling the audio information can effectively improve labeling efficiency.
In a possible implementation manner of the embodiment of the present application, before the step S101, the method may further include: acquiring audio information to be marked; and transcoding the audio information to be marked according to a preset format.
Wherein, step S101 may specifically include: and extracting preset audio features from the audio information after transcoding processing.
For the embodiment of the present application, the format of the audio information to be labeled may include at least one of the following formats:
(1) CD format, which is a high-quality audio format with a sampling frequency of 44.1 kHz, a bit rate of about 1411 kbps, and 16-bit quantization.
(2) WAVE format, with a sampling frequency of 44.1 kHz, a bit rate of about 1411 kbps, and 16-bit quantization.
(3) AIFF format (Audio Interchange File Format).
(4) MPEG format (Moving Picture Experts Group).
(5) MP3 format (Moving Picture Experts Group Audio Layer III), wherein the MP3 format belongs to a lossy compression format.
(6) AAC format (Advanced Audio Coding), which is a lossy compression format.
For the embodiment of the present application, the format of the audio information to be labeled is not limited to the formats listed above and may also include other audio formats, for example, the AMR format, which is commonly used for storing mobile phone recordings.
For the embodiment of the present application, the format of the audio information after transcoding processing may specifically be: a sampling frequency of 16000 hertz (Hz), a quantization bit depth of 16 bits, and a single (mono) channel.
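For illustration only, the following is a minimal sketch of such a transcoding step, assuming the third-party pydub library (which wraps FFmpeg) is available; the file paths are hypothetical and not part of the patent.

```python
from pydub import AudioSegment

def transcode_to_preset_format(src_path: str, dst_path: str) -> str:
    """Transcode an arbitrary audio file to 16000 Hz, 16-bit, mono WAV."""
    audio = AudioSegment.from_file(src_path)      # decodes CD/WAVE/MP3/AAC/AMR, etc.
    audio = (audio.set_frame_rate(16000)          # sampling frequency: 16000 Hz
                  .set_sample_width(2)            # quantization: 16 bits (2 bytes)
                  .set_channels(1))                # channel: mono
    audio.export(dst_path, format="wav")
    return dst_path

# Example usage (hypothetical paths):
# transcode_to_preset_format("to_be_labeled.mp3", "to_be_labeled_16k_mono.wav")
```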
In another possible implementation manner of the embodiment of the application, acquiring the audio information to be labeled may specifically include at least one of the following: acquiring the audio information to be labeled input by a user; and acquiring the audio information to be labeled from a local storage.
For the embodiment of the application, the audio information to be marked input by the user can be obtained, and the audio information to be marked can also be obtained from a local storage.
In another possible implementation manner of the embodiment of the present application, the preset audio features may specifically include at least one of the following: a Mel-frequency cepstrum coefficient feature, a short-time energy feature, a short-time power feature, and a short-time zero-crossing rate feature.
For the embodiment of the application, the Mel Frequency Cepstrum Coefficient (MFCC) is extracted based on the auditory characteristics of the human ear and forms a nonlinear correspondence with frequency; the short-time energy feature can be used to distinguish unvoiced sounds from voiced sounds and can also be used to identify silent frames, where a frame whose short-time energy is smaller than a preset threshold can be considered a silent frame; the short-time power feature is the short-time energy of each frame of the audio information averaged over the frame length; and the Short-Time Zero-Crossing Rate (ST-ZCR) is a time-domain description of the signal frequency.
For the embodiment of the present application, the preset audio features may include any one or at least two of the above items, but are not limited to them, and may also include other audio features, for example, a spectral entropy feature, which is obtained by normalizing the magnitude spectrum of each frame of audio information to obtain a probability density for each frame and then computing the entropy of the audio information.
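As a hedged illustration of these frame-level features (a sketch under assumed frame parameters, not the patent's own implementation), the following NumPy code computes short-time energy, short-time power, short-time zero-crossing rate and spectral entropy per frame; the frame length of 400 samples and hop of 160 samples correspond to 25 ms / 10 ms at a 16 kHz sampling rate and are assumptions.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_time_features(frames: np.ndarray):
    energy = np.sum(frames ** 2, axis=1)                   # short-time energy per frame
    power = energy / frames.shape[1]                       # short-time power: energy averaged over frame length
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # short-time zero-crossing rate
    spectrum = np.abs(np.fft.rfft(frames, axis=1))         # magnitude spectrum of each frame
    prob = spectrum / (np.sum(spectrum, axis=1, keepdims=True) + 1e-12)  # normalized to a probability density
    spectral_entropy = -np.sum(prob * np.log2(prob + 1e-12), axis=1)     # spectral entropy per frame
    return energy, power, zcr, spectral_entropy
```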
Another possible implementation manner of the embodiment of the present application, the extracting the mel-frequency cepstrum coefficient feature from the audio information to be labeled specifically may include: extracting waveform data corresponding to each audio frame in the audio information to be marked; and determining the Mel frequency cepstrum coefficient characteristics corresponding to the audio frames based on the waveform data corresponding to the audio frames.
For the embodiment of the present application, an audio frame refers to a frame of audio information. In the embodiment of the application, the mel-frequency cepstrum coefficient characteristics corresponding to each frame of audio information are determined based on the waveform data corresponding to each frame of audio information.
For the present embodiment, Mel-frequency cepstral coefficient features are cepstral parameters extracted in the Mel (Mel) scale frequency domain, which describes the non-linear characteristics of human ear frequencies. Wherein the relationship between Mel-scale and frequency can be represented by the following formula:
Mel(f) = 2595 × log10(1 + f/700)
where Mel(f) characterizes the Mel scale and f characterizes the frequency in hertz.
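As a small illustration, the conversion can be written as follows; the constants 2595 and 700 are the commonly used values and are an assumption here, since the patent's original formula image is not reproduced in this text.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale: Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used later when placing triangular filter center frequencies."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)
```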
In another possible implementation manner of the embodiment of the present application, determining, based on waveform data corresponding to any audio frame, a mel-frequency cepstrum coefficient feature corresponding to any audio frame may specifically include: pre-emphasis processing is carried out on waveform data corresponding to any audio frame to obtain pre-emphasized waveform data; carrying out Hamming window adding processing on the waveform data subjected to the pre-emphasis processing to obtain waveform data subjected to the Hamming window adding processing; performing discrete Fourier transform on the waveform data subjected to Hamming window processing to obtain a frequency spectrum characteristic corresponding to any audio frame; calculating output energy respectively corresponding to the spectral characteristics corresponding to any audio frame after passing through each triangular Mel frequency filter bank; and performing discrete cosine transform calculation processing on the output energy which respectively corresponds to each triangular Mel frequency filter bank to obtain Mel frequency cepstrum coefficient characteristics corresponding to any audio frame.
For the embodiment of the application, pre-emphasis processing is performed on the waveform data corresponding to any audio frame, that is, the audio frame is passed through a high-pass filter to boost its high-frequency part, so that the frequency spectrum of the audio frame becomes flatter and is maintained over the whole band from low frequency to high frequency, while the high-frequency formants are highlighted. The high-pass filter can be represented by the following formula:
H(z) = 1 - μz^(-1)
where H(z) represents the transfer function of the high-pass filter applied to any audio frame, μ represents the pre-emphasis coefficient of the high-pass filter, and z is a complex variable.
The value of μ lies in the open interval (0.9, 1.0) and may be taken as 0.97.
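A minimal sketch of the pre-emphasis step in code, assuming μ = 0.97 as suggested above:

```python
import numpy as np

def pre_emphasis(frame: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    return np.append(frame[0], frame[1:] - mu * frame[:-1])
```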
For the embodiment of the application, the Hamming window adding processing is performed on the waveform data after the pre-emphasis processing, that is, the pre-emphasized waveform data corresponding to any audio frame is multiplied by a Hamming window, so as to increase the continuity between the left end and the right end of the audio frame. The Hamming window can be expressed by the following formula:
W(n, a) = (1 - a) - a × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where W(n, a) represents the Hamming window, n indexes the sample within any audio frame, a represents the coefficient of the Hamming window, and N represents the number of samples in the audio frame.
Wherein a may take the value of 0.46.
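A short sketch of the windowing step, assuming the generalized Hamming form given above with a = 0.46 (which coincides with numpy.hamming for that choice of coefficient):

```python
import numpy as np

def hamming_window(frame: np.ndarray, a: float = 0.46) -> np.ndarray:
    """Multiply one frame by the generalized Hamming window W(n, a) = (1 - a) - a*cos(2*pi*n/(N-1))."""
    n = np.arange(len(frame))
    w = (1.0 - a) - a * np.cos(2.0 * np.pi * n / (len(frame) - 1))
    return frame * w
```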
For the embodiment of the present application, the waveform data after Hamming window processing is a signal feature of any audio frame in the time domain. Since signal characteristics are usually difficult to observe in the time domain, the time-domain signal feature is often transformed into a frequency-domain signal feature, for example into an energy distribution over the frequency domain, for observation. In the embodiment of the present application, a discrete Fourier transform is performed on the waveform data after Hamming window processing to obtain the spectral feature corresponding to any audio frame, which can be characterized by the following formula:
X_a(k) = Σ_{n=0}^{N-1} x(n) × e^(-j2πkn/N), 0 ≤ k ≤ N - 1
where X_a(k) represents the spectral feature corresponding to any audio frame, the subscript a denotes the Hamming window coefficient, k indexes the k-th point of the Fourier transform, n indexes the time-domain sample within the frame, N represents the number of Fourier transform points, and x(n) represents the waveform data after Hamming window processing.
For the embodiment of the application, the output energy corresponding to the spectral feature of any audio frame after passing through each triangular Mel frequency filter bank is calculated. In the embodiment of the present application, the Mel frequency filters are triangular filters, and the number of filters in a triangular Mel frequency filter bank may be any value from 22 to 26. In the embodiment of the present application, the triangular Mel frequency filters mainly serve to smooth the frequency spectrum of any audio frame, eliminate the influence of harmonics, and highlight the formants of the audio information.
For the embodiments of the present application, the frequency response of the m-th triangular Mel frequency filter can be represented by the following piecewise formula:
H_m(k) = 0, for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1);
where H_m(k) represents the frequency response of the m-th triangular Mel frequency filter, k represents the frequency variable (spectral bin index), f(m-1), f(m) and f(m+1) respectively represent the center frequencies of the (m-1)-th, m-th and (m+1)-th triangular Mel frequency filters, and M represents the number of filters in the triangular Mel frequency filter bank.
For the embodiment of the present application, the output energy corresponding to any audio frame after passing through each triangular Mel frequency filter can be represented by the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² × H_m(k) ), 0 ≤ m < M
where s(m) represents the output energy of any audio frame after passing through the m-th triangular Mel frequency filter, m indexes the m-th triangular Mel frequency filter, Σ is a summation symbol, k represents the frequency variable, N represents the number of Fourier transform points, X_a(k) represents the spectral feature corresponding to the audio frame, H_m(k) represents the frequency response of the m-th triangular Mel frequency filter, and M represents the number of filters in the triangular Mel frequency filter bank.
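The following sketch (an illustration under assumptions, not the patent's reference implementation) constructs a triangular Mel frequency filter bank and computes the per-filter log output energies s(m); the number of filters (24), the FFT size (512) and the sampling rate (16000 Hz) are assumed values, and hz_to_mel / mel_to_hz are the helper functions sketched earlier.

```python
import numpy as np

def mel_filterbank(n_filters: int = 24, n_fft: int = 512, sr: int = 16000) -> np.ndarray:
    """Return an (n_filters, n_fft//2 + 1) matrix whose rows are the triangular filters H_m(k)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)   # f(m-1), f(m), f(m+1) as FFT bins
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                            # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                           # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def filter_energies(frame: np.ndarray, fbank: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """s(m): log of the frame's power spectrum weighted by each triangular filter."""
    power_spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return np.log(fbank @ power_spec + 1e-12)
```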
For the embodiment of the present application, Discrete Cosine Transform (DCT) calculation processing is performed on the output energy respectively corresponding to each triangular mel-frequency filter bank, so as to obtain mel-frequency cepstrum coefficient characteristics corresponding to any audio frame, where the mel-frequency cepstrum coefficient characteristics corresponding to any audio frame can be represented by the following formula:
the method comprises the following steps of C (N) representing a Mel frequency cepstrum coefficient characteristic corresponding to any audio frame, N representing the nth order of the Mel frequency cepstrum coefficient characteristic corresponding to any audio frame, Σ being a summation symbol, M representing the mth triangular Mel frequency filter bank, N representing the number of triangular Mel frequency filters in the mth triangular Mel frequency filter bank, s (M) representing the corresponding output energy of any audio frame after passing through each triangular Mel frequency filter bank, M representing the number of the triangular Mel frequency filter banks, and L representing the order of the Mel frequency cepstrum coefficient characteristic.
For the embodiment of the present application, the order of the mel-frequency cepstrum coefficient feature may be set to any one of 12 to 16, for example, the order of the mel-frequency cepstrum coefficient feature is set to 13.
For the present embodiment, the Mel-frequency cepstral coefficient feature consists of 13-dimensional static coefficients, including a 1-dimensional energy coefficient (F0) and 12-dimensional DCT coefficients. The 1-dimensional energy coefficient (F0) can be used to distinguish speech frames from non-speech frames; the 12-dimensional DCT coefficients are extracted as follows:
Because several adjacent frequencies produce similar effects on the human ear, a preset number of triangular Mel frequency filters is used to divide the frequency domain into a small number of sub-bands, and each sub-band outputs a sub-band energy characterizing that frequency range, yielding, for example, 24 sub-band energy features. After the DCT calculation, the DCT coefficients decrease in magnitude in sequence: the first 13 DCT coefficients are C0-C12, while the 14th and subsequent DCT coefficients are almost 0. C0 is discarded, and the 12-dimensional DCT coefficients C1-C12 are retained.
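As an illustrative sketch (not part of the patent), the final DCT step can be written as follows, using SciPy's DCT-II on the log filter-bank energies s(m); C0 is dropped and C1-C12 are kept, matching the description above.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_energies(log_energies: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Return the 12-dimensional DCT coefficients C1..C12 for one frame of log filter energies."""
    cepstrum = dct(log_energies, type=2, norm="ortho")  # C0, C1, ..., C(M-1)
    return cepstrum[1:n_coeffs]                         # drop C0, keep C1..C12
```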
In another possible implementation manner of the embodiment of the present application, step S102 may specifically include: and inputting the preset audio features into the trained marking model, and marking the start point time of each character and the end point time of each character in the audio information to be marked by using the trained marking model to obtain the marked audio information.
For the embodiment of the application, the audio information may be obtained by recording a user's humming or singing, and labeling the audio information means marking, in the audio information, the start time and the end time of each character that is hummed or sung.
In another possible implementation manner of the embodiment of the present application, before the step S102, the method may further include: obtaining a plurality of training samples; and training the preset model based on a plurality of training samples to obtain a trained labeling model.
Any training sample can include the labeled audio information for training and preset audio features corresponding to the audio information for training.
For the embodiment of the present application, the audio information used for training and the audio information to be labeled may be the same audio information, or may be two different audio information, which is not limited in the embodiment of the present application.
In another possible implementation of the embodiment of the application, the preset model includes a hidden markov model.
Training the preset model based on the plurality of training samples may specifically include: the hidden markov model is trained using a maximum expectation algorithm based on a plurality of training samples.
For the embodiment of the present application, a Hidden Markov Model (HMM) can be used in fields such as speech recognition, behavior recognition, character recognition and fault diagnosis. In the embodiment of the present application, the hidden Markov model is trained by using the Expectation-Maximization (EM) algorithm, also called the maximum expectation algorithm or the Dempster-Laird-Rubin algorithm, which is a type of optimization algorithm that performs Maximum Likelihood Estimation (MLE) through iteration.
For the embodiment of the present application, the preset model may include a hidden markov model, and may further include other models, which are not limited in the embodiment of the present application. In the embodiment of the present application, the preset model is trained, specifically, the hidden markov model may be trained by using a maximum expectation algorithm, the hidden markov model may be trained by using other algorithms, and the preset model may be trained by using other algorithms, which is not limited in the embodiment of the present application.
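As one possible sketch of such training (an assumption, not the patent's prescribed implementation), the third-party hmmlearn library can be used, whose fit() method runs the EM (Baum-Welch) procedure; the feature matrix, the per-recording lengths and the number of hidden states below are placeholders.

```python
import numpy as np
from hmmlearn import hmm

# X: frame-level feature matrix of shape (n_frames, n_features),
#    e.g. the MFCC / energy / zero-crossing-rate features described above.
# lengths: number of frames contributed by each training recording.
X = np.random.randn(1000, 13)          # placeholder training features
lengths = [400, 600]                   # placeholder per-sample frame counts

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(X, lengths)                  # EM (Baum-Welch) maximum-likelihood training

# After training, the most likely state sequence for new audio can be decoded,
# which is the basis for assigning start/end times to labels.
states = model.predict(X[:400])
```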
For the embodiment of the present application, the above embodiment may be executed by a terminal device, may also be executed by a server, may also be executed by a part of the terminal device, and a part of the server, which is not limited in the embodiment of the present application.
The above-described automatic labeling method is specifically set forth from the perspective of method steps, and the automatic labeling apparatus is specifically set forth below from the perspective of modules, units, or sub-units.
The embodiment of the present application provides an automatic labeling apparatus, as shown in fig. 2, the automatic labeling apparatus 20 may specifically include an extraction module 201 and an input module 202, wherein,
the extracting module 201 is configured to extract preset audio features from the audio information to be labeled.
The input module 202 is configured to input a preset audio feature into the trained labeling model, so as to obtain labeled audio information.
In one possible implementation manner of the embodiment of the present application, the automatic labeling apparatus 20 may further include a first obtaining module and a transcoding module, wherein,
the first acquisition module is used for acquiring the audio information to be marked.
And the transcoding module is used for transcoding the audio information to be marked according to a preset format.
The extracting module 201 may be specifically configured to extract a preset audio feature from the transcoded audio information.
In another possible implementation manner of the embodiment of the present application, the first obtaining module may specifically include at least one of a first obtaining unit and a second obtaining unit, wherein,
the first acquisition unit is used for acquiring the audio information to be labeled input by a user.
And the second acquisition unit is used for acquiring the audio information to be marked from the local storage.
In another possible implementation manner of the embodiment of the present application, the preset audio feature may specifically include at least one of the following:
mel-frequency cepstrum coefficient features; a short-time energy characteristic; a short-time power characteristic; short-time zero-crossing rate characteristics.
In another possible implementation manner of the embodiment of the present application, the extraction module 201 may specifically include an extraction unit and a determination unit, wherein,
and the extraction unit is used for extracting the waveform data corresponding to each audio frame in the audio information to be labeled.
And the determining unit is used for determining the Mel frequency cepstrum coefficient characteristics corresponding to the audio frames respectively based on the waveform data corresponding to the audio frames respectively.
In another possible implementation manner of the embodiment of the present application, the determining unit may specifically include a first processing subunit, a second processing subunit, a third processing subunit, a calculating subunit, and a fourth processing subunit, where,
and the first processing subunit is used for performing pre-emphasis processing on the waveform data corresponding to any audio frame to obtain the waveform data after the pre-emphasis processing.
And the second processing subunit is used for performing Hamming window adding processing on the waveform data subjected to the pre-emphasis processing to obtain the waveform data subjected to the Hamming window adding processing.
The third processing subunit is used for performing discrete Fourier transform on the waveform data subjected to the Hamming window processing to obtain a frequency spectrum characteristic corresponding to any audio frame;
and the calculating subunit is used for calculating the output energy respectively corresponding to the spectral characteristics corresponding to any audio frame after passing through each triangular Mel frequency filter bank.
And the fourth processing subunit is used for performing discrete cosine transform calculation processing on the output energy which respectively corresponds to each triangular Mel frequency filter bank to obtain Mel frequency cepstrum coefficient characteristics corresponding to any audio frame.
In another possible implementation manner of the embodiment of the application, the input module 202 may be specifically configured to input the preset audio features into the trained annotation model, and label, by using the trained annotation model, the start point time of each character and the end point time of each character in the audio information to be labeled to obtain the labeled audio information.
In another possible implementation manner of the embodiment of the present application, the automatic labeling device 20 may further include a second obtaining module and a training module, wherein,
and the second acquisition module is used for acquiring a plurality of training samples.
Any training sample can specifically include the labeled audio information for training and the preset audio features corresponding to the audio information for training.
And the training module is used for training the preset model based on a plurality of training samples to obtain a trained labeling model.
In another possible implementation manner of the embodiment of the present application, the preset model may specifically include a hidden markov model.
The training module may be specifically configured to train the hidden markov model using a maximum expectation algorithm based on a plurality of training samples.
For the embodiment of the present application, the first obtaining module and the second obtaining module may be the same obtaining module, or may be two different obtaining modules, the first obtaining unit and the second obtaining unit may be the same obtaining unit, or may be two different obtaining units, and the first processing subunit, the second processing subunit, the third processing subunit, and the fourth processing subunit may be the same processing subunit, or any two of them may be the same processing subunit, or any three of them may be the same processing subunit, or may be four different processing subunits, which is not limited in the embodiment of the present application.
The automatic labeling device provided in the embodiment of the present application can be used to perform operations corresponding to the automatic labeling method provided in the foregoing method embodiment, and the implementation principle is similar, which is not described herein again.
Compared with the prior art, the automatic labeling device has the advantages that the preset audio features are extracted from the audio information to be labeled, the preset audio features are input into the labeled model after training, the labeled audio information is obtained, automatic labeling of the audio information is achieved, compared with manual labeling of the audio information, human resources and time resources are saved, and the efficiency of labeling can be effectively improved by automatically labeling the audio information.
The automatic labeling apparatus is specifically described from the perspective of a module, a unit, or a subunit, and an electronic device is specifically described from the perspective of an entity apparatus, where the electronic device in this embodiment of the present application may be a terminal device, and may also be a server, and is not limited in this embodiment of the present application.
An embodiment of the present application provides an electronic device, and an electronic device 4000 shown in fig. 3 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in any of the foregoing method embodiments.
An embodiment of the present application provides an electronic device, where the electronic device includes: a memory and a processor; at least one program stored in the memory for execution by the processor, which when executed by the processor, implements: the preset audio features are extracted from the audio information to be labeled and input into the trained labeling model to obtain the labeled audio information, so that the automatic labeling of the audio information is realized, compared with the manual labeling of the audio information, the human resources and the time resources are saved, and the automatic labeling of the audio information can effectively improve the labeling efficiency.
The electronic device of the present application is described above from the perspective of a physical device, and the computer-readable storage medium of the present application is described below from the perspective of a storage medium.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, the method has the advantages that the preset audio features are extracted from the audio information to be labeled and input into the trained labeling model to obtain the labeled audio information, automatic labeling of the audio information is achieved, compared with manual labeling of the audio information, human resources and time resources are saved, and the efficiency of labeling can be effectively improved by automatically labeling the audio information.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that, for those skilled in the art, several modifications and improvements can be made without departing from the principle of the present application, and these modifications and improvements should also be regarded as falling within the protection scope of the present application.
Claims (20)
1. An automatic labeling method, comprising:
extracting preset audio features from audio information to be marked;
and inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
2. The method according to claim 1, wherein before the extracting of the preset audio features from the audio information to be labeled, the method further comprises:
acquiring the audio information to be marked;
transcoding the audio information to be marked according to a preset format;
wherein the extracting of the preset audio features from the audio information to be labeled comprises:
and extracting the preset audio features from the audio information after transcoding processing.
3. The method according to claim 2, wherein the obtaining the audio information to be labeled comprises at least one of:
acquiring audio information to be marked input by a user;
and acquiring the audio information to be marked from the local storage.
4. The method of claim 1, wherein the preset audio features comprise at least one of:
mel-frequency cepstrum coefficient features;
a short-time energy characteristic;
a short-time power characteristic;
short-time zero-crossing rate characteristics.
5. The method of claim 4, wherein extracting Mel frequency cepstral coefficient features from the audio information to be labeled comprises:
extracting waveform data corresponding to each audio frame in the audio information to be marked;
and determining the Mel frequency cepstrum coefficient characteristics corresponding to the audio frames respectively based on the waveform data corresponding to the audio frames respectively.
6. The method of claim 5, wherein determining the Mel frequency cepstral coefficient characteristic corresponding to any audio frame based on the waveform data corresponding to any audio frame comprises:
pre-emphasis processing is carried out on the waveform data corresponding to any audio frame to obtain pre-emphasized waveform data;
carrying out Hamming window adding processing on the waveform data subjected to the pre-emphasis processing to obtain waveform data subjected to the Hamming window adding processing;
performing discrete Fourier transform on the waveform data subjected to the Hamming window processing to obtain a frequency spectrum characteristic corresponding to any audio frame;
calculating output energy respectively corresponding to the spectral features corresponding to any audio frame after the spectral features pass through each triangular Mel frequency filter bank;
and performing discrete cosine transform calculation processing on the output energy which is respectively corresponding to each triangular Mel frequency filter bank to obtain Mel frequency cepstrum coefficient characteristics corresponding to any audio frame.
7. The method of claim 1, wherein the inputting the preset audio features into the trained labeling model to obtain labeled audio information comprises:
and inputting the preset audio features into a trained marking model, and marking the starting point time of each character and the ending point time of each character in the audio information to be marked by using the trained marking model to obtain the marked audio information.
8. The method of claim 1, wherein before the inputting of the preset audio features into the trained labeling model, the method further comprises:
obtaining a plurality of training samples, any training sample comprising: the marked audio information used for training and the preset audio features corresponding to the audio information used for training are obtained;
and training a preset model based on the plurality of training samples to obtain a trained labeling model.
9. The method of claim 8, wherein the pre-set model comprises a hidden markov model;
training a preset model based on the training samples comprises:
training the hidden Markov model based on the plurality of training samples and using a maximum expectation algorithm.
10. An automatic labeling apparatus, comprising:
the extraction module is used for extracting preset audio features from the audio information to be labeled;
and the input module is used for inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
11. The apparatus of claim 10, wherein the automatic labeling apparatus further comprises a first retrieving module and a transcoding module, wherein,
the first obtaining module is used for obtaining the audio information to be marked;
the transcoding module is used for transcoding the audio information to be marked according to a preset format;
the extraction module is specifically configured to extract the preset audio features from the transcoded audio information.
12. The apparatus of claim 11, wherein the first obtaining module comprises at least one of a first obtaining unit and a second obtaining unit, wherein,
the first acquisition unit is used for acquiring audio information to be marked input by a user;
and the second acquisition unit is used for acquiring the audio information to be marked from the local storage.
13. The apparatus of claim 10, wherein the preset audio features comprise at least one of:
mel-frequency cepstrum coefficient features;
a short-time energy characteristic;
a short-time power characteristic;
short-time zero-crossing rate characteristics.
14. The apparatus of claim 13, wherein the extraction module comprises an extraction unit and a determination unit, wherein,
the extracting unit is used for extracting waveform data corresponding to each audio frame in the audio information to be labeled;
the determining unit is configured to determine mel-frequency cepstrum coefficient features corresponding to the audio frames based on the waveform data corresponding to the audio frames.
15. The apparatus of claim 14, wherein the determination unit comprises a first processing subunit, a second processing subunit, a third processing subunit, a calculation subunit, and a fourth processing subunit, wherein,
the first processing subunit is configured to perform pre-emphasis processing on the waveform data corresponding to the any audio frame to obtain pre-emphasized waveform data;
the second processing subunit is configured to perform hamming window adding processing on the waveform data after the pre-emphasis processing, so as to obtain hamming window added waveform data;
the third processing subunit is configured to perform discrete fourier transform on the waveform data subjected to hamming window processing to obtain a spectral feature corresponding to any audio frame;
the calculating subunit is configured to calculate output energies corresponding to the spectral features corresponding to the any audio frame after passing through each triangular Mel frequency filter bank;
and the fourth processing subunit is configured to perform discrete cosine transform calculation processing on the output energies respectively corresponding to each triangular Mel frequency filter bank, so as to obtain Mel frequency cepstrum coefficient characteristics corresponding to any one of the audio frames.
16. The apparatus according to claim 10, wherein the input module is specifically configured to input the preset audio features into a trained labeling model, and label a start point time of each character and an end point time of each character in the audio information to be labeled by using the trained labeling model to obtain the labeled audio information.
17. The apparatus of claim 10, wherein the automatic labeling apparatus further comprises a second acquisition module and a training module, wherein,
the second obtaining module is configured to obtain a plurality of training samples, where any training sample includes: the marked audio information used for training and the preset audio features corresponding to the audio information used for training are obtained;
and the training module is used for training a preset model based on the plurality of training samples to obtain a trained labeling model.
18. The apparatus of claim 17, wherein the pre-set model comprises a hidden markov model;
the training module is specifically configured to train the hidden markov model based on the plurality of training samples and using a maximum expectation algorithm.
19. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the automatic labeling method according to any one of claims 1 to 9.
20. A computer-readable storage medium, on which a computer program is stored, the program, when being executed by a processor, implementing the automatic labeling method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910780661.3A CN112420070A (en) | 2019-08-22 | 2019-08-22 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910780661.3A CN112420070A (en) | 2019-08-22 | 2019-08-22 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112420070A true CN112420070A (en) | 2021-02-26 |
Family
ID=74780223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910780661.3A Pending CN112420070A (en) | 2019-08-22 | 2019-08-22 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112420070A (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101639934A (en) * | 2009-09-04 | 2010-02-03 | 西安电子科技大学 | SAR image denoising method based on contour wave domain block hidden Markov model |
CN101753941A (en) * | 2008-12-19 | 2010-06-23 | 康佳集团股份有限公司 | Method for realizing markup information in imaging device and imaging device |
JP2013057735A (en) * | 2011-09-07 | 2013-03-28 | National Institute Of Information & Communication Technology | Hidden markov model learning device for voice synthesis and voice synthesizer |
CN104795082A (en) * | 2015-03-26 | 2015-07-22 | 广州酷狗计算机科技有限公司 | Player and audio subtitle display method and device |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN108053836A (en) * | 2018-01-18 | 2018-05-18 | 成都嗨翻屋文化传播有限公司 | A kind of audio automation mask method based on deep learning |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN108205535A (en) * | 2016-12-16 | 2018-06-26 | 北京酷我科技有限公司 | The method and its system of Emotion tagging |
CN108257614A (en) * | 2016-12-29 | 2018-07-06 | 北京酷我科技有限公司 | The method and its system of audio data mark |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN109036381A (en) * | 2018-08-08 | 2018-12-18 | 平安科技(深圳)有限公司 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
CN109378016A (en) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of keyword identification mask method based on VAD |
CN109508402A (en) * | 2018-11-15 | 2019-03-22 | 上海指旺信息科技有限公司 | Violation term detection method and device |
- 2019-08-22: CN application CN201910780661.3A filed in China; published as CN112420070A (legal status: Pending)
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101753941A (en) * | 2008-12-19 | 2010-06-23 | 康佳集团股份有限公司 | Method for realizing markup information in imaging device and imaging device |
CN101639934A (en) * | 2009-09-04 | 2010-02-03 | 西安电子科技大学 | SAR image denoising method based on contour wave domain block hidden Markov model |
JP2013057735A (en) * | 2011-09-07 | 2013-03-28 | National Institute Of Information & Communication Technology | Hidden markov model learning device for voice synthesis and voice synthesizer |
CN104795082A (en) * | 2015-03-26 | 2015-07-22 | 广州酷狗计算机科技有限公司 | Player and audio subtitle display method and device |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN108205535A (en) * | 2016-12-16 | 2018-06-26 | 北京酷我科技有限公司 | The method and its system of Emotion tagging |
CN108257614A (en) * | 2016-12-29 | 2018-07-06 | 北京酷我科技有限公司 | The method and its system of audio data mark |
CN108053836A (en) * | 2018-01-18 | 2018-05-18 | 成都嗨翻屋文化传播有限公司 | A kind of audio automation mask method based on deep learning |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN109036381A (en) * | 2018-08-08 | 2018-12-18 | 平安科技(深圳)有限公司 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
CN109378016A (en) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of keyword identification mask method based on VAD |
CN109508402A (en) * | 2018-11-15 | 2019-03-22 | 上海指旺信息科技有限公司 | Violation term detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shrawankar et al. | Techniques for feature extraction in speech recognition system: A comparative study | |
Singh et al. | Multimedia analysis for disguised voice and classification efficiency | |
JP3277398B2 (en) | Voiced sound discrimination method | |
CN102968986B (en) | Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics | |
CN109256138B (en) | Identity verification method, terminal device and computer readable storage medium | |
CN109767756B (en) | Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient | |
WO2018223727A1 (en) | Voiceprint recognition method, apparatus and device, and medium | |
CN110459241B (en) | Method and system for extracting voice features | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium and terminal | |
CN111833843B (en) | Speech synthesis method and system | |
CN113327626A (en) | Voice noise reduction method, device, equipment and storage medium | |
Shanthi et al. | Review of feature extraction techniques in automatic speech recognition | |
CN109147796A (en) | Audio recognition method, device, computer equipment and computer readable storage medium | |
CN108682432B (en) | Speech emotion recognition device | |
Sapijaszko et al. | An overview of recent window based feature extraction algorithms for speaker recognition | |
Oura et al. | Deep neural network based real-time speech vocoder with periodic and aperiodic inputs | |
CN113421584A (en) | Audio noise reduction method and device, computer equipment and storage medium | |
Joy et al. | Deep scattering power spectrum features for robust speech recognition | |
Makhijani et al. | Speech enhancement using pitch detection approach for noisy environment | |
CN112420070A (en) | Automatic labeling method and device, electronic equipment and computer readable storage medium | |
CN112397087B (en) | Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal | |
CN111862931B (en) | Voice generation method and device | |
Tomchuk | Spectral masking in MFCC calculation for noisy speech | |
Singh et al. | A comparative study on feature extraction techniques for language identification | |
Jiang et al. | Acoustic feature comparison of MFCC and CZT-based cepstrum for speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |