CN112420070A - Automatic labeling method and device, electronic equipment and computer readable storage medium - Google Patents
Automatic labeling method and device, electronic equipment and computer readable storage medium Download PDFInfo
- Publication number
- CN112420070A (application CN201910780661.3A)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio information
- preset
- features
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000002372 labelling Methods 0.000 title claims abstract description 77
- 238000000034 method Methods 0.000 claims abstract description 23
- 230000003595 spectral effect Effects 0.000 claims description 13
- 238000004422 calculation algorithm Methods 0.000 claims description 10
- 238000000605 extraction Methods 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 7
- 238000001228 spectrum Methods 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims description 2
- 230000003287 optical effect Effects 0.000 description 3
- 230000004044 response Effects 0.000 description 3
- 238000005070 sampling Methods 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000006835 compression Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 238000005034 decoration Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000007476 Maximum Likelihood Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000003745 diagnosis Methods 0.000 description 1
- 210000005069 ears Anatomy 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000000802 evaporation-induced self-assembly Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The embodiment of the application provides an automatic labeling method, an automatic labeling device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: extracting preset audio features from the audio information to be labeled, and inputting the preset audio features into the trained labeling model to obtain the labeled audio information. The embodiment of the application realizes automatic labeling of audio information; compared with manual labeling of audio information, this saves human resources and time resources, and automatically labeling the audio information can effectively improve labeling efficiency.
Description
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an automatic labeling method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Audio information is a type of information that people encounter frequently, and labeling audio information is very important.
In general, the labeling of audio information needs to be completed manually by specially trained annotators, which consumes a large amount of human resources and a large amount of time; in addition, the efficiency of manually labeling audio information is low.
Disclosure of Invention
The application provides an automatic labeling method, an automatic labeling device, electronic equipment and a computer-readable storage medium, which can solve at least one of the above technical problems. The technical solution is as follows:
in a first aspect, an automatic labeling method is provided, and the method includes:
extracting preset audio features from audio information to be marked;
and inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
In a second aspect, there is provided an automatic labeling apparatus, the apparatus comprising:
the extraction module is used for extracting preset audio features from the audio information to be labeled;
and the input module is used for inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the operations corresponding to the automatic labeling method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the automatic labeling method of the first aspect.
The beneficial effect that technical scheme that this application provided brought is:
compared with the prior art, the automatic labeling method and device, the electronic equipment and the computer readable storage medium have the advantages that the preset audio features are extracted from the audio information to be labeled, the preset audio features are input into the labeled model after training, the labeled audio information is obtained, automatic labeling of the audio information is achieved, compared with manual labeling of the audio information, human resources and time resources are saved, and the efficiency of labeling can be effectively improved by automatically labeling the audio information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of an automatic labeling method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an automatic labeling apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides an automatic labeling method, as shown in fig. 1, the method includes:
step S101, extracting preset audio features from audio information to be labeled.
And S102, inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
For the embodiment of the application, the trained labeling model is used for labeling the preset audio features to obtain the labeled audio information.
Compared with the prior art, the automatic labeling method extracts preset audio features from the audio information to be labeled and inputs them into the trained labeling model to obtain the labeled audio information, thereby realizing automatic labeling of the audio information. Compared with manual labeling of audio information, this saves human resources and time resources, and automatically labeling the audio information can effectively improve labeling efficiency.
In a possible implementation manner of the embodiment of the present application, before the step S101, the method may further include: acquiring audio information to be marked; and transcoding the audio information to be marked according to a preset format.
Wherein, step S101 may specifically include: and extracting preset audio features from the audio information after transcoding processing.
For the embodiment of the present application, the format of the audio information to be labeled may include at least one of the following formats:
(1) CD format, which is a high-quality audio format with a sampling frequency of 44.1 kHz, a bit rate of about 1411 kbps, and 16-bit quantization.
(2) WAVE format, with a sampling frequency of 44.1 kHz, a bit rate of about 1411 kbps, and 16-bit quantization.
(3) AIFF format (Audio Interchange File Format).
(4) MPEG format (Moving Picture Experts Group).
(5) MP3 format (Moving Picture Experts Group Audio Layer III), wherein the MP3 format belongs to a lossy compression format.
(6) AAC format (Advanced Audio Coding), which is a lossy compression format.
For the embodiment of the present application, the format of the audio information to be labeled is not limited to the formats listed above and may also include other audio formats, for example, the AMR format, which is commonly used for storing mobile phone recordings.
For the embodiment of the present application, the format of the audio information after transcoding processing may specifically be: a sampling frequency of 16000 hertz (Hz), a quantization bit depth of 16 bits, and a single (mono) channel.
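For illustration only, the following is a minimal sketch of such a transcoding step, assuming the third-party pydub library (which wraps FFmpeg) is available; the file paths are hypothetical and not part of the patent.

```python
from pydub import AudioSegment

def transcode_to_preset_format(src_path: str, dst_path: str) -> str:
    """Transcode an arbitrary audio file to 16000 Hz, 16-bit, mono WAV."""
    audio = AudioSegment.from_file(src_path)      # decodes CD/WAVE/MP3/AAC/AMR, etc.
    audio = (audio.set_frame_rate(16000)          # sampling frequency: 16000 Hz
                  .set_sample_width(2)            # quantization: 16 bits (2 bytes)
                  .set_channels(1))                # channel: mono
    audio.export(dst_path, format="wav")
    return dst_path

# Example usage (hypothetical paths):
# transcode_to_preset_format("to_be_labeled.mp3", "to_be_labeled_16k_mono.wav")
```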
In another possible implementation manner of the embodiment of the application, acquiring the audio information to be labeled may specifically include at least one of the following: acquiring the audio information to be labeled input by a user; and acquiring the audio information to be labeled from a local storage.
For the embodiment of the application, the audio information to be marked input by the user can be obtained, and the audio information to be marked can also be obtained from a local storage.
In another possible implementation manner of the embodiment of the present application, the preset audio features may specifically include at least one of the following: a Mel-frequency cepstrum coefficient feature, a short-time energy feature, a short-time power feature, and a short-time zero-crossing rate feature.
For the embodiment of the application, the Mel Frequency Cepstrum Coefficient (MFCC) is extracted based on the auditory characteristics of the human ear and forms a nonlinear correspondence with frequency; the short-time energy feature can be used to distinguish unvoiced sounds from voiced sounds and can also be used to identify silent frames, where a frame whose short-time energy is smaller than a preset threshold can be considered a silent frame; the short-time power feature is the short-time energy of each frame of the audio information averaged over the frame length; and the Short-Time Zero-Crossing Rate (ST-ZCR) is a time-domain description of the signal frequency.
For the embodiment of the present application, the preset audio features may include any one or at least two of the above items, but are not limited to them, and may also include other audio features, for example, a spectral entropy feature, which is obtained by normalizing the magnitude spectrum of each frame of audio information to obtain a probability density for each frame and then computing the entropy of the audio information.
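As a hedged illustration of these frame-level features (a sketch under assumed frame parameters, not the patent's own implementation), the following NumPy code computes short-time energy, short-time power, short-time zero-crossing rate and spectral entropy per frame; the frame length of 400 samples and hop of 160 samples correspond to 25 ms / 10 ms at a 16 kHz sampling rate and are assumptions.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_time_features(frames: np.ndarray):
    energy = np.sum(frames ** 2, axis=1)                   # short-time energy per frame
    power = energy / frames.shape[1]                       # short-time power: energy averaged over frame length
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # short-time zero-crossing rate
    spectrum = np.abs(np.fft.rfft(frames, axis=1))         # magnitude spectrum of each frame
    prob = spectrum / (np.sum(spectrum, axis=1, keepdims=True) + 1e-12)  # normalized to a probability density
    spectral_entropy = -np.sum(prob * np.log2(prob + 1e-12), axis=1)     # spectral entropy per frame
    return energy, power, zcr, spectral_entropy
```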
Another possible implementation manner of the embodiment of the present application, the extracting the mel-frequency cepstrum coefficient feature from the audio information to be labeled specifically may include: extracting waveform data corresponding to each audio frame in the audio information to be marked; and determining the Mel frequency cepstrum coefficient characteristics corresponding to the audio frames based on the waveform data corresponding to the audio frames.
For the embodiment of the present application, an audio frame refers to a frame of audio information. In the embodiment of the application, the mel-frequency cepstrum coefficient characteristics corresponding to each frame of audio information are determined based on the waveform data corresponding to each frame of audio information.
For the present embodiment, Mel-frequency cepstral coefficient features are cepstral parameters extracted in the Mel (Mel) scale frequency domain, which describes the non-linear characteristics of human ear frequencies. Wherein the relationship between Mel-scale and frequency can be represented by the following formula:
Mel(f) = 2595 × log10(1 + f/700)
where Mel(f) characterizes the Mel scale and f characterizes the frequency in hertz.
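As a small illustration, the conversion can be written as follows; the constants 2595 and 700 are the commonly used values and are an assumption here, since the patent's original formula image is not reproduced in this text.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale: Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used later when placing triangular filter center frequencies."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)
```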
In another possible implementation manner of the embodiment of the present application, determining, based on waveform data corresponding to any audio frame, a mel-frequency cepstrum coefficient feature corresponding to any audio frame may specifically include: pre-emphasis processing is carried out on waveform data corresponding to any audio frame to obtain pre-emphasized waveform data; carrying out Hamming window adding processing on the waveform data subjected to the pre-emphasis processing to obtain waveform data subjected to the Hamming window adding processing; performing discrete Fourier transform on the waveform data subjected to Hamming window processing to obtain a frequency spectrum characteristic corresponding to any audio frame; calculating output energy respectively corresponding to the spectral characteristics corresponding to any audio frame after passing through each triangular Mel frequency filter bank; and performing discrete cosine transform calculation processing on the output energy which respectively corresponds to each triangular Mel frequency filter bank to obtain Mel frequency cepstrum coefficient characteristics corresponding to any audio frame.
For the embodiment of the application, pre-emphasis processing is performed on the waveform data corresponding to any audio frame, that is, the audio frame is passed through a high-pass filter to boost its high-frequency part, so that the frequency spectrum of the audio frame becomes flatter and is maintained over the whole band from low frequency to high frequency, while the high-frequency formants are highlighted. The high-pass filter can be represented by the following formula:
H(z) = 1 - μz^(-1)
where H(z) represents the transfer function of the high-pass filter applied to any audio frame, μ represents the pre-emphasis coefficient of the high-pass filter, and z is a complex variable.
The value of μ lies in the open interval (0.9, 1.0) and may be taken as 0.97.
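A minimal sketch of the pre-emphasis step in code, assuming μ = 0.97 as suggested above:

```python
import numpy as np

def pre_emphasis(frame: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    return np.append(frame[0], frame[1:] - mu * frame[:-1])
```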
For the embodiment of the application, the Hamming window adding processing is performed on the waveform data after the pre-emphasis processing, that is, the pre-emphasized waveform data corresponding to any audio frame is multiplied by a Hamming window, so as to increase the continuity between the left end and the right end of the audio frame. The Hamming window can be expressed by the following formula:
W(n, a) = (1 - a) - a × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
where W(n, a) represents the Hamming window, n indexes the sample within any audio frame, a represents the coefficient of the Hamming window, and N represents the number of samples in the audio frame.
Wherein a may take the value of 0.46.
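A short sketch of the windowing step, assuming the generalized Hamming form given above with a = 0.46 (which coincides with numpy.hamming for that choice of coefficient):

```python
import numpy as np

def hamming_window(frame: np.ndarray, a: float = 0.46) -> np.ndarray:
    """Multiply one frame by the generalized Hamming window W(n, a) = (1 - a) - a*cos(2*pi*n/(N-1))."""
    n = np.arange(len(frame))
    w = (1.0 - a) - a * np.cos(2.0 * np.pi * n / (len(frame) - 1))
    return frame * w
```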
For the embodiment of the present application, the waveform data after Hamming window processing is a signal feature of any audio frame in the time domain. Since signal characteristics are usually difficult to observe in the time domain, the time-domain signal feature is often transformed into a frequency-domain signal feature, for example into an energy distribution over the frequency domain, for observation. In the embodiment of the present application, a discrete Fourier transform is performed on the waveform data after Hamming window processing to obtain the spectral feature corresponding to any audio frame, which can be characterized by the following formula:
X_a(k) = Σ_{n=0}^{N-1} x(n) × e^(-j2πkn/N), 0 ≤ k ≤ N - 1
where X_a(k) represents the spectral feature corresponding to any audio frame, the subscript a denotes the Hamming window coefficient, k indexes the k-th point of the Fourier transform, n indexes the time-domain sample within the frame, N represents the number of Fourier transform points, and x(n) represents the waveform data after Hamming window processing.
For the embodiment of the application, the output energy corresponding to the spectral feature of any audio frame after passing through each triangular Mel frequency filter bank is calculated. In the embodiment of the present application, the Mel frequency filters are triangular filters, and the number of filters in a triangular Mel frequency filter bank may be any value from 22 to 26. In the embodiment of the present application, the triangular Mel frequency filters mainly serve to smooth the frequency spectrum of any audio frame, eliminate the influence of harmonics, and highlight the formants of the audio information.
For the embodiments of the present application, the frequency response of the m-th triangular Mel frequency filter can be represented by the following piecewise formula:
H_m(k) = 0, for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1);
where H_m(k) represents the frequency response of the m-th triangular Mel frequency filter, k represents the frequency variable (spectral bin index), f(m-1), f(m) and f(m+1) respectively represent the center frequencies of the (m-1)-th, m-th and (m+1)-th triangular Mel frequency filters, and M represents the number of filters in the triangular Mel frequency filter bank.
For the embodiment of the present application, the output energy corresponding to any audio frame after passing through each triangular Mel frequency filter can be represented by the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² × H_m(k) ), 0 ≤ m < M
where s(m) represents the output energy of any audio frame after passing through the m-th triangular Mel frequency filter, m indexes the m-th triangular Mel frequency filter, Σ is a summation symbol, k represents the frequency variable, N represents the number of Fourier transform points, X_a(k) represents the spectral feature corresponding to the audio frame, H_m(k) represents the frequency response of the m-th triangular Mel frequency filter, and M represents the number of filters in the triangular Mel frequency filter bank.
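The following sketch (an illustration under assumptions, not the patent's reference implementation) constructs a triangular Mel frequency filter bank and computes the per-filter log output energies s(m); the number of filters (24), the FFT size (512) and the sampling rate (16000 Hz) are assumed values, and hz_to_mel / mel_to_hz are the helper functions sketched earlier.

```python
import numpy as np

def mel_filterbank(n_filters: int = 24, n_fft: int = 512, sr: int = 16000) -> np.ndarray:
    """Return an (n_filters, n_fft//2 + 1) matrix whose rows are the triangular filters H_m(k)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)   # f(m-1), f(m), f(m+1) as FFT bins
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                            # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                           # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def filter_energies(frame: np.ndarray, fbank: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """s(m): log of the frame's power spectrum weighted by each triangular filter."""
    power_spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return np.log(fbank @ power_spec + 1e-12)
```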
For the embodiment of the present application, Discrete Cosine Transform (DCT) calculation processing is performed on the output energy respectively corresponding to each triangular mel-frequency filter bank, so as to obtain mel-frequency cepstrum coefficient characteristics corresponding to any audio frame, where the mel-frequency cepstrum coefficient characteristics corresponding to any audio frame can be represented by the following formula:
the method comprises the following steps of C (N) representing a Mel frequency cepstrum coefficient characteristic corresponding to any audio frame, N representing the nth order of the Mel frequency cepstrum coefficient characteristic corresponding to any audio frame, Σ being a summation symbol, M representing the mth triangular Mel frequency filter bank, N representing the number of triangular Mel frequency filters in the mth triangular Mel frequency filter bank, s (M) representing the corresponding output energy of any audio frame after passing through each triangular Mel frequency filter bank, M representing the number of the triangular Mel frequency filter banks, and L representing the order of the Mel frequency cepstrum coefficient characteristic.
For the embodiment of the present application, the order of the mel-frequency cepstrum coefficient feature may be set to any one of 12 to 16, for example, the order of the mel-frequency cepstrum coefficient feature is set to 13.
For the present embodiment, the Mel-frequency cepstral coefficient feature consists of 13-dimensional static coefficients, including a 1-dimensional energy coefficient (F0) and 12-dimensional DCT coefficients. The 1-dimensional energy coefficient (F0) can be used to distinguish speech frames from non-speech frames; the 12-dimensional DCT coefficients are extracted as follows:
Because several adjacent frequencies produce similar effects on the human ear, a preset number of triangular Mel frequency filters is used to divide the frequency domain into a small number of sub-bands, and each sub-band outputs a sub-band energy characterizing that frequency range, yielding, for example, 24 sub-band energy features. After the DCT calculation, the DCT coefficients decrease in magnitude in sequence: the first 13 DCT coefficients are C0-C12, while the 14th and subsequent DCT coefficients are almost 0. C0 is discarded, and the 12-dimensional DCT coefficients C1-C12 are retained.
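As an illustrative sketch (not part of the patent), the final DCT step can be written as follows, using SciPy's DCT-II on the log filter-bank energies s(m); C0 is dropped and C1-C12 are kept, matching the description above.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_energies(log_energies: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Return the 12-dimensional DCT coefficients C1..C12 for one frame of log filter energies."""
    cepstrum = dct(log_energies, type=2, norm="ortho")  # C0, C1, ..., C(M-1)
    return cepstrum[1:n_coeffs]                         # drop C0, keep C1..C12
```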
In another possible implementation manner of the embodiment of the present application, step S102 may specifically include: and inputting the preset audio features into the trained marking model, and marking the start point time of each character and the end point time of each character in the audio information to be marked by using the trained marking model to obtain the marked audio information.
For the embodiment of the application, the audio information may be obtained by recording a user's humming or singing, and labeling the audio information means marking, in the audio information, the start time and the end time of each character that is hummed or sung.
In another possible implementation manner of the embodiment of the present application, before the step S102, the method may further include: obtaining a plurality of training samples; and training the preset model based on a plurality of training samples to obtain a trained labeling model.
Any training sample can include the labeled audio information for training and preset audio features corresponding to the audio information for training.
For the embodiment of the present application, the audio information used for training and the audio information to be labeled may be the same audio information, or may be two different audio information, which is not limited in the embodiment of the present application.
In another possible implementation of the embodiment of the application, the preset model includes a hidden markov model.
Training the preset model based on the plurality of training samples may specifically include: the hidden markov model is trained using a maximum expectation algorithm based on a plurality of training samples.
For the embodiment of the present application, a Hidden Markov Model (HMM) can be used in fields such as speech recognition, behavior recognition, character recognition and fault diagnosis. In the embodiment of the present application, the hidden Markov model is trained by using the Expectation-Maximization (EM) algorithm, also called the maximum expectation algorithm or the Dempster-Laird-Rubin algorithm, which is a type of optimization algorithm that performs Maximum Likelihood Estimation (MLE) through iteration.
For the embodiment of the present application, the preset model may include a hidden markov model, and may further include other models, which are not limited in the embodiment of the present application. In the embodiment of the present application, the preset model is trained, specifically, the hidden markov model may be trained by using a maximum expectation algorithm, the hidden markov model may be trained by using other algorithms, and the preset model may be trained by using other algorithms, which is not limited in the embodiment of the present application.
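As one possible sketch of such training (an assumption, not the patent's prescribed implementation), the third-party hmmlearn library can be used, whose fit() method runs the EM (Baum-Welch) procedure; the feature matrix, the per-recording lengths and the number of hidden states below are placeholders.

```python
import numpy as np
from hmmlearn import hmm

# X: frame-level feature matrix of shape (n_frames, n_features),
#    e.g. the MFCC / energy / zero-crossing-rate features described above.
# lengths: number of frames contributed by each training recording.
X = np.random.randn(1000, 13)          # placeholder training features
lengths = [400, 600]                   # placeholder per-sample frame counts

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(X, lengths)                  # EM (Baum-Welch) maximum-likelihood training

# After training, the most likely state sequence for new audio can be decoded,
# which is the basis for assigning start/end times to labels.
states = model.predict(X[:400])
```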
For the embodiment of the present application, the above embodiment may be executed by a terminal device, may also be executed by a server, may also be executed by a part of the terminal device, and a part of the server, which is not limited in the embodiment of the present application.
The above-described automatic labeling method is specifically set forth from the perspective of method steps, and the automatic labeling apparatus is specifically set forth below from the perspective of modules, units, or sub-units.
The embodiment of the present application provides an automatic labeling apparatus, as shown in fig. 2, the automatic labeling apparatus 20 may specifically include an extraction module 201 and an input module 202, wherein,
the extracting module 201 is configured to extract preset audio features from the audio information to be labeled.
The input module 202 is configured to input a preset audio feature into the trained labeling model, so as to obtain labeled audio information.
In one possible implementation manner of the embodiment of the present application, the automatic labeling apparatus 20 may further include a first obtaining module and a transcoding module, wherein,
the first acquisition module is used for acquiring the audio information to be marked.
And the transcoding module is used for transcoding the audio information to be marked according to a preset format.
The extracting module 201 may be specifically configured to extract a preset audio feature from the transcoded audio information.
In another possible implementation manner of the embodiment of the present application, the first obtaining module may specifically include at least one of a first obtaining unit and a second obtaining unit, wherein,
the first acquisition unit is used for acquiring the audio information to be labeled input by a user.
And the second acquisition unit is used for acquiring the audio information to be marked from the local storage.
In another possible implementation manner of the embodiment of the present application, the preset audio feature may specifically include at least one of the following:
mel-frequency cepstrum coefficient features; a short-time energy characteristic; a short-time power characteristic; short-time zero-crossing rate characteristics.
In another possible implementation manner of the embodiment of the present application, the extraction module 201 may specifically include an extraction unit and a determination unit, wherein,
and the extraction unit is used for extracting the waveform data corresponding to each audio frame in the audio information to be labeled.
And the determining unit is used for determining the Mel frequency cepstrum coefficient characteristics corresponding to the audio frames respectively based on the waveform data corresponding to the audio frames respectively.
In another possible implementation manner of the embodiment of the present application, the determining unit may specifically include a first processing subunit, a second processing subunit, a third processing subunit, a calculating subunit, and a fourth processing subunit, where,
and the first processing subunit is used for performing pre-emphasis processing on the waveform data corresponding to any audio frame to obtain the waveform data after the pre-emphasis processing.
And the second processing subunit is used for performing Hamming window adding processing on the waveform data subjected to the pre-emphasis processing to obtain the waveform data subjected to the Hamming window adding processing.
The third processing subunit is used for performing discrete Fourier transform on the waveform data subjected to the Hamming window processing to obtain a frequency spectrum characteristic corresponding to any audio frame;
and the calculating subunit is used for calculating the output energy respectively corresponding to the spectral characteristics corresponding to any audio frame after passing through each triangular Mel frequency filter bank.
And the fourth processing subunit is used for performing discrete cosine transform calculation processing on the output energy which respectively corresponds to each triangular Mel frequency filter bank to obtain Mel frequency cepstrum coefficient characteristics corresponding to any audio frame.
In another possible implementation manner of the embodiment of the application, the input module 202 may be specifically configured to input the preset audio features into the trained annotation model, and label, by using the trained annotation model, the start point time of each character and the end point time of each character in the audio information to be labeled to obtain the labeled audio information.
In another possible implementation manner of the embodiment of the present application, the automatic labeling device 20 may further include a second obtaining module and a training module, wherein,
and the second acquisition module is used for acquiring a plurality of training samples.
Any training sample can specifically include the labeled audio information for training and the preset audio features corresponding to the audio information for training.
And the training module is used for training the preset model based on a plurality of training samples to obtain a trained labeling model.
In another possible implementation manner of the embodiment of the present application, the preset model may specifically include a hidden markov model.
The training module may be specifically configured to train the hidden markov model using a maximum expectation algorithm based on a plurality of training samples.
For the embodiment of the present application, the first obtaining module and the second obtaining module may be the same obtaining module, or may be two different obtaining modules, the first obtaining unit and the second obtaining unit may be the same obtaining unit, or may be two different obtaining units, and the first processing subunit, the second processing subunit, the third processing subunit, and the fourth processing subunit may be the same processing subunit, or any two of them may be the same processing subunit, or any three of them may be the same processing subunit, or may be four different processing subunits, which is not limited in the embodiment of the present application.
The automatic labeling device provided in the embodiment of the present application can be used to perform operations corresponding to the automatic labeling method provided in the foregoing method embodiment, and the implementation principle is similar, which is not described herein again.
Compared with the prior art, the automatic labeling device has the advantages that the preset audio features are extracted from the audio information to be labeled, the preset audio features are input into the labeled model after training, the labeled audio information is obtained, automatic labeling of the audio information is achieved, compared with manual labeling of the audio information, human resources and time resources are saved, and the efficiency of labeling can be effectively improved by automatically labeling the audio information.
The automatic labeling apparatus is specifically described from the perspective of a module, a unit, or a subunit, and an electronic device is specifically described from the perspective of an entity apparatus, where the electronic device in this embodiment of the present application may be a terminal device, and may also be a server, and is not limited in this embodiment of the present application.
An embodiment of the present application provides an electronic device, and an electronic device 4000 shown in fig. 3 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in any of the foregoing method embodiments.
An embodiment of the present application provides an electronic device, where the electronic device includes: a memory and a processor; at least one program stored in the memory for execution by the processor, which when executed by the processor, implements: the preset audio features are extracted from the audio information to be labeled and input into the trained labeling model to obtain the labeled audio information, so that the automatic labeling of the audio information is realized, compared with the manual labeling of the audio information, the human resources and the time resources are saved, and the automatic labeling of the audio information can effectively improve the labeling efficiency.
The electronic device of the present application is described above from the perspective of a physical device, and the computer-readable storage medium of the present application is described below from the perspective of a storage medium.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, the method has the advantages that the preset audio features are extracted from the audio information to be labeled and input into the trained labeling model to obtain the labeled audio information, automatic labeling of the audio information is achieved, compared with manual labeling of the audio information, human resources and time resources are saved, and the efficiency of labeling can be effectively improved by automatically labeling the audio information.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that, for those skilled in the art, several modifications and improvements can be made without departing from the principle of the present application, and these modifications and improvements should also be regarded as falling within the protection scope of the present application.
Claims (20)
1. An automatic labeling method, comprising:
extracting preset audio features from audio information to be marked;
and inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
2. The method according to claim 1, wherein before the extracting of the preset audio features from the audio information to be labeled, the method further comprises:
acquiring the audio information to be marked;
transcoding the audio information to be marked according to a preset format;
wherein the extracting of the preset audio features from the audio information to be labeled comprises:
and extracting the preset audio features from the audio information after transcoding processing.
3. The method according to claim 2, wherein the obtaining the audio information to be labeled comprises at least one of:
acquiring audio information to be marked input by a user;
and acquiring the audio information to be marked from the local storage.
4. The method of claim 1, wherein the preset audio features comprise at least one of:
mel-frequency cepstrum coefficient features;
a short-time energy characteristic;
a short-time power characteristic;
short-time zero-crossing rate characteristics.
5. The method of claim 4, wherein extracting Mel frequency cepstral coefficient features from the audio information to be labeled comprises:
extracting waveform data corresponding to each audio frame in the audio information to be marked;
and determining the Mel frequency cepstrum coefficient characteristics corresponding to the audio frames respectively based on the waveform data corresponding to the audio frames respectively.
6. The method of claim 5, wherein determining the Mel frequency cepstral coefficient characteristic corresponding to any audio frame based on the waveform data corresponding to any audio frame comprises:
pre-emphasis processing is carried out on the waveform data corresponding to any audio frame to obtain pre-emphasized waveform data;
carrying out Hamming window adding processing on the waveform data subjected to the pre-emphasis processing to obtain waveform data subjected to the Hamming window adding processing;
performing discrete Fourier transform on the waveform data subjected to the Hamming window processing to obtain a frequency spectrum characteristic corresponding to any audio frame;
calculating output energy respectively corresponding to the spectral features corresponding to any audio frame after the spectral features pass through each triangular Mel frequency filter bank;
and performing discrete cosine transform calculation processing on the output energy which is respectively corresponding to each triangular Mel frequency filter bank to obtain Mel frequency cepstrum coefficient characteristics corresponding to any audio frame.
7. The method of claim 1, wherein the inputting the preset audio features into the trained labeling model to obtain labeled audio information comprises:
and inputting the preset audio features into a trained marking model, and marking the starting point time of each character and the ending point time of each character in the audio information to be marked by using the trained marking model to obtain the marked audio information.
8. The method of claim 1, wherein before the inputting of the preset audio features into the trained labeling model, the method further comprises:
obtaining a plurality of training samples, any training sample comprising: the marked audio information used for training and the preset audio features corresponding to the audio information used for training are obtained;
and training a preset model based on the plurality of training samples to obtain a trained labeling model.
9. The method of claim 8, wherein the pre-set model comprises a hidden markov model;
training a preset model based on the training samples comprises:
training the hidden Markov model based on the plurality of training samples and using a maximum expectation algorithm.
10. An automatic labeling apparatus, comprising:
the extraction module is used for extracting preset audio features from the audio information to be labeled;
and the input module is used for inputting the preset audio features into the trained labeling model to obtain the labeled audio information.
11. The apparatus of claim 10, wherein the automatic labeling apparatus further comprises a first retrieving module and a transcoding module, wherein,
the first obtaining module is used for obtaining the audio information to be marked;
the transcoding module is used for transcoding the audio information to be marked according to a preset format;
the extraction module is specifically configured to extract the preset audio features from the transcoded audio information.
12. The apparatus of claim 11, wherein the first obtaining module comprises at least one of a first obtaining unit and a second obtaining unit, wherein,
the first acquisition unit is used for acquiring audio information to be marked input by a user;
and the second acquisition unit is used for acquiring the audio information to be marked from the local storage.
13. The apparatus of claim 10, wherein the preset audio features comprise at least one of:
mel-frequency cepstrum coefficient features;
a short-time energy characteristic;
a short-time power characteristic;
short-time zero-crossing rate characteristics.
14. The apparatus of claim 13, wherein the extraction module comprises an extraction unit and a determination unit, wherein,
the extracting unit is used for extracting waveform data corresponding to each audio frame in the audio information to be labeled;
the determining unit is configured to determine mel-frequency cepstrum coefficient features corresponding to the audio frames based on the waveform data corresponding to the audio frames.
15. The apparatus of claim 14, wherein the determination unit comprises a first processing subunit, a second processing subunit, a third processing subunit, a calculation subunit, and a fourth processing subunit, wherein,
the first processing subunit is configured to perform pre-emphasis processing on the waveform data corresponding to the any audio frame to obtain pre-emphasized waveform data;
the second processing subunit is configured to perform hamming window adding processing on the waveform data after the pre-emphasis processing, so as to obtain hamming window added waveform data;
the third processing subunit is configured to perform discrete fourier transform on the waveform data subjected to hamming window processing to obtain a spectral feature corresponding to any audio frame;
the calculating subunit is configured to calculate output energies corresponding to the spectral features corresponding to the any audio frame after passing through each triangular Mel frequency filter bank;
and the fourth processing subunit is configured to perform discrete cosine transform calculation processing on the output energies respectively corresponding to each triangular Mel frequency filter bank, so as to obtain Mel frequency cepstrum coefficient characteristics corresponding to any one of the audio frames.
16. The apparatus according to claim 10, wherein the input module is specifically configured to input the preset audio features into a trained labeling model, and label a start point time of each character and an end point time of each character in the audio information to be labeled by using the trained labeling model to obtain the labeled audio information.
17. The apparatus of claim 10, wherein the automatic labeling apparatus further comprises a second acquisition module and a training module, wherein,
the second obtaining module is configured to obtain a plurality of training samples, where any training sample includes: the marked audio information used for training and the preset audio features corresponding to the audio information used for training are obtained;
and the training module is used for training a preset model based on the plurality of training samples to obtain a trained labeling model.
18. The apparatus of claim 17, wherein the pre-set model comprises a hidden markov model;
the training module is specifically configured to train the hidden markov model based on the plurality of training samples and using a maximum expectation algorithm.
19. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the automatic labeling method according to any one of claims 1 to 9.
20. A computer-readable storage medium, on which a computer program is stored, the program, when being executed by a processor, implementing the automatic labeling method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910780661.3A CN112420070A (en) | 2019-08-22 | 2019-08-22 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910780661.3A CN112420070A (en) | 2019-08-22 | 2019-08-22 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112420070A true CN112420070A (en) | 2021-02-26 |
Family
ID=74780223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910780661.3A Pending CN112420070A (en) | 2019-08-22 | 2019-08-22 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112420070A (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101639934A (en) * | 2009-09-04 | 2010-02-03 | 西安电子科技大学 | SAR image denoising method based on contour wave domain block hidden Markov model |
CN101753941A (en) * | 2008-12-19 | 2010-06-23 | 康佳集团股份有限公司 | Method for realizing markup information in imaging device and imaging device |
JP2013057735A (en) * | 2011-09-07 | 2013-03-28 | National Institute Of Information & Communication Technology | Hidden markov model learning device for voice synthesis and voice synthesizer |
CN104795082A (en) * | 2015-03-26 | 2015-07-22 | 广州酷狗计算机科技有限公司 | Player and audio subtitle display method and device |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN108053836A (en) * | 2018-01-18 | 2018-05-18 | 成都嗨翻屋文化传播有限公司 | A kind of audio automation mask method based on deep learning |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN108205535A (en) * | 2016-12-16 | 2018-06-26 | 北京酷我科技有限公司 | The method and its system of Emotion tagging |
CN108257614A (en) * | 2016-12-29 | 2018-07-06 | 北京酷我科技有限公司 | The method and its system of audio data mark |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN109036381A (en) * | 2018-08-08 | 2018-12-18 | 平安科技(深圳)有限公司 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
CN109378016A (en) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of keyword identification mask method based on VAD |
CN109508402A (en) * | 2018-11-15 | 2019-03-22 | 上海指旺信息科技有限公司 | Violation term detection method and device |
- 2019-08-22: CN application CN201910780661.3A filed in China; published as CN112420070A (legal status: Pending)
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101753941A (en) * | 2008-12-19 | 2010-06-23 | 康佳集团股份有限公司 | Method for realizing markup information in imaging device and imaging device |
CN101639934A (en) * | 2009-09-04 | 2010-02-03 | 西安电子科技大学 | SAR image denoising method based on contour wave domain block hidden Markov model |
JP2013057735A (en) * | 2011-09-07 | 2013-03-28 | National Institute Of Information & Communication Technology | Hidden markov model learning device for voice synthesis and voice synthesizer |
CN104795082A (en) * | 2015-03-26 | 2015-07-22 | 广州酷狗计算机科技有限公司 | Player and audio subtitle display method and device |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
WO2018107810A1 (en) * | 2016-12-15 | 2018-06-21 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus, and electronic device and medium |
CN108205535A (en) * | 2016-12-16 | 2018-06-26 | 北京酷我科技有限公司 | The method and its system of Emotion tagging |
CN108257614A (en) * | 2016-12-29 | 2018-07-06 | 北京酷我科技有限公司 | The method and its system of audio data mark |
CN108053836A (en) * | 2018-01-18 | 2018-05-18 | 成都嗨翻屋文化传播有限公司 | A kind of audio automation mask method based on deep learning |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN109036381A (en) * | 2018-08-08 | 2018-12-18 | 平安科技(深圳)有限公司 | Method of speech processing and device, computer installation and readable storage medium storing program for executing |
CN109256138A (en) * | 2018-08-13 | 2019-01-22 | 平安科技(深圳)有限公司 | Auth method, terminal device and computer readable storage medium |
CN109378016A (en) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of keyword identification mask method based on VAD |
CN109508402A (en) * | 2018-11-15 | 2019-03-22 | 上海指旺信息科技有限公司 | Violation term detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shrawankar et al. | Techniques for feature extraction in speech recognition system: A comparative study | |
Singh et al. | Multimedia analysis for disguised voice and classification efficiency | |
JP3277398B2 (en) | Voiced sound discrimination method | |
CN102968986B (en) | Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics | |
CN109256138B (en) | Identity verification method, terminal device and computer readable storage medium | |
CN109767756B (en) | Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient | |
WO2018223727A1 (en) | Voiceprint recognition method, apparatus and device, and medium | |
CN110459241B (en) | Method and system for extracting voice features | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium and terminal | |
CN111833843B (en) | Speech synthesis method and system | |
CN113327626A (en) | Voice noise reduction method, device, equipment and storage medium | |
Shanthi et al. | Review of feature extraction techniques in automatic speech recognition | |
CN109147796A (en) | Audio recognition method, device, computer equipment and computer readable storage medium | |
CN108682432B (en) | Speech emotion recognition device | |
Sapijaszko et al. | An overview of recent window based feature extraction algorithms for speaker recognition | |
Oura et al. | Deep neural network based real-time speech vocoder with periodic and aperiodic inputs | |
CN113421584A (en) | Audio noise reduction method and device, computer equipment and storage medium | |
Joy et al. | Deep scattering power spectrum features for robust speech recognition | |
Makhijani et al. | Speech enhancement using pitch detection approach for noisy environment | |
CN112420070A (en) | Automatic labeling method and device, electronic equipment and computer readable storage medium | |
CN112397087B (en) | Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal | |
CN111862931B (en) | Voice generation method and device | |
Tomchuk | Spectral masking in MFCC calculation for noisy speech | |
Singh et al. | A comparative study on feature extraction techniques for language identification | |
Jiang et al. | Acoustic feature comparison of MFCC and CZT-based cepstrum for speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |