WO2020029404A1 - Speech processing method and apparatus, computer apparatus, and readable storage medium - Google Patents

Speech processing method and apparatus, computer apparatus, and readable storage medium

Info

Publication number
WO2020029404A1
WO2020029404A1 (PCT/CN2018/108190)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
sentence
feature parameters
voice signal
text
Prior art date
Application number
PCT/CN2018/108190
Other languages
English (en)
French (fr)
Inventor
王健宗
王珏
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020029404A1 publication Critical patent/WO2020029404A1/zh


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present application relates to the field of computer hearing technology, and in particular, to a speech processing method and device, a computer device, and a non-volatile readable storage medium.
  • In an intelligent conference system, speech recognition is a key technology that converts a speaker's speech signal into text information that a computer can recognize and output.
  • the existing intelligent conference system only implements speech-to-text conversion, and cannot further process the recognized text information.
  • Text information obtained by directly converting speech may contain useless information, such as sentences unrelated to the content of the conference.
  • a first aspect of the present application provides a speech processing method, which includes:
  • pre-processing a speech signal;
  • extracting feature parameters from the pre-processed speech signal;
  • decoding the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and
  • extracting a summary sentence from the text in sentence units through a hidden Markov model (HMM).
  • a second aspect of the present application provides a speech processing apparatus, where the apparatus includes:
  • a pre-processing unit configured to pre-process a speech signal;
  • a feature extraction unit configured to extract feature parameters from the pre-processed speech signal;
  • a decoding unit configured to decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and
  • an abstract extraction unit configured to extract a summary sentence from the text in sentence units through a hidden Markov model (HMM).
  • a third aspect of the present application provides a computer device including a processor, where the processor is configured to implement the voice processing method when executing computer-readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium on which computer-readable instructions are stored, and the computer-readable instructions implement the voice processing method when executed by a processor.
  • The present application pre-processes a speech signal; extracts feature parameters from the pre-processed speech signal; decodes the speech signal using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and extracts summary sentences from the text in sentence units through a hidden Markov model (HMM).
  • This application not only converts speech information into text, but also extracts abstract sentences in the text for output, eliminating useless information from speech recognition results, and obtaining better speech processing results.
  • FIG. 1 is a flowchart of a speech processing method according to an embodiment of the present application.
  • FIG. 2 is a structural diagram of a voice processing apparatus provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the speech processing method of the present application is applied in one or more computer devices.
  • the computer device is a device capable of automatically performing numerical calculations and / or information processing in accordance with instructions set or stored in advance.
  • The hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, or a voice control device.
  • FIG. 1 is a flowchart of a speech processing method provided in Embodiment 1 of the present application.
  • the speech processing method is applied to a computer device.
  • The speech processing method recognizes sentence-based text from a speech signal and extracts summary sentences from the sentence-based text.
  • the voice processing method specifically includes the following steps:
  • Step 101 Pre-process the voice signal.
  • the voice signal may be an analog voice signal or a digital voice signal. If the voice signal is an analog voice signal, the analog voice signal is subjected to analog-to-digital conversion to be converted into a digital voice signal.
  • In one embodiment of the present application, the speech processing method is applied in a smart conference system, and the speech signal is the speech of a speaker input into the smart conference system through a voice input device (for example, a microphone or a mobile phone microphone).
  • Preprocessing the voice signal may include pre-emphasizing the voice signal.
  • The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of glottal excitation and mouth-nose radiation, the energy of a speech signal drops significantly at the high-frequency end; usually the higher the frequency, the smaller the amplitude, and the power-spectrum amplitude falls by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis is performed on a speech signal, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized.
  • Pre-emphasis is generally implemented using a high-pass filter.
  • The transfer function of the high-pass filter can be:
  • H(z) = 1 - κz⁻¹, where 0.9 ≤ κ ≤ 1.0.
  • κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
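  • As an illustration of the pre-emphasis step described above, the following sketch applies the first-order high-pass filter H(z) = 1 - κz⁻¹ to a digital speech signal; the coefficient value 0.97, the function name, and the synthetic test signal are illustrative assumptions, not taken from the original.
```python
import numpy as np

def pre_emphasize(signal: np.ndarray, kappa: float = 0.97) -> np.ndarray:
    """Apply the first-order high-pass filter H(z) = 1 - kappa * z^-1.

    y[n] = x[n] - kappa * x[n-1]; the first sample is kept unchanged.
    """
    emphasized = np.empty_like(signal, dtype=float)
    emphasized[0] = signal[0]
    emphasized[1:] = signal[1:] - kappa * signal[:-1]
    return emphasized

if __name__ == "__main__":
    # Example: pre-emphasize one second of a synthetic 16 kHz signal.
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
    y = pre_emphasize(x, kappa=0.97)
    print(y.shape)
```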
  • Preprocessing the voice signal may further include windowing and framing the voice signal.
  • Speech signals are a kind of non-stationary time-varying signals, which are mainly divided into two categories, voiced and unvoiced.
  • The pitch period of voiced speech, the amplitude of the voiced/unvoiced signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be considered short-time stationary within 10 ms to 30 ms.
  • Accordingly, the speech signal can be divided into short segments (that is, short-time stationary signals) for processing. This process is called framing, and each resulting short segment of the speech signal is called a speech frame. Framing is achieved by windowing the speech signal.
  • To avoid excessive change between adjacent frames, consecutive frames overlap. In one embodiment, each speech frame is 25 milliseconds long with a 15 millisecond overlap between adjacent frames, that is, a new frame is taken every 10 milliseconds.
  • The commonly used window functions are the rectangular window, the Hamming window, and the Hanning window.
  • The rectangular window function is: w(n) = 1 for 0 ≤ n ≤ N-1, and 0 otherwise.
  • The Hamming window function is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and 0 otherwise.
  • The Hanning window function is: w(n) = 0.5·[1 - cos(2πn/(N-1))] for 0 ≤ n ≤ N-1, and 0 otherwise.
  • Here N is the number of sampling points contained in one speech frame.
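  • The framing and windowing described above can be sketched as follows, assuming a 25 ms frame length and 10 ms frame shift at a 16 kHz sampling rate; the function name and the choice of the Hamming window are illustrative.
```python
import numpy as np

def frame_signal(signal: np.ndarray, fs: int = 16000,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    Returns an array of shape (num_frames, frame_len).
    """
    frame_len = int(fs * frame_ms / 1000)   # N samples per frame
    shift = int(fs * shift_ms / 1000)       # hop between frame start points
    assert len(signal) >= frame_len, "signal shorter than one frame"
    num_frames = 1 + (len(signal) - frame_len) // shift
    # Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * shift: i * shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames
```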
  • Preprocessing the voice signal may further include detecting valid voice in the voice signal.
  • The purpose of detecting effective speech is to remove non-effective speech (that is, non-speech segments) from the speech signal and obtain the effective speech (that is, speech segments), so as to reduce the computation load of feature extraction, improve its accuracy, shorten the speech recognition time, and increase the recognition rate.
  • Effective speech detection can be performed based on the short-time energy and short-time zero-crossing rate of the speech signal.
  • Assuming the n-th speech frame is x_n(m), the short-time energy is: E_n = Σ_{m=0}^{N-1} x_n²(m).
  • The short-time zero-crossing rate is: Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[x] = 1 for x ≥ 0 and sgn[x] = -1 for x < 0.
  • a two-stage judgment method may be used to detect the start and end points of effective speech in the speech signal.
  • the two-stage judgment method is a well-known technology in the art, and will not be repeated here.
  • In another embodiment, valid speech in the speech signal may be detected by the following method, as sketched in the code after this list: (1) window and frame the speech signal to obtain speech frames x(n); for example, a Hamming window may be applied with a 20 ms frame length and a 10 ms frame shift (this step is omitted if the signal has already been windowed and framed during preprocessing); (2) perform a Discrete Fourier Transform (DFT) on each frame x(n) to obtain its spectrum X(k); (3) compute the cumulative energy of each frequency band from the spectrum, E(m) = Σ_{k=m1}^{m2} |X(k)|², where E(m) is the cumulative energy of the m-th band and (m1, m2) are the start and end frequency points of the m-th band; (4) take the logarithm of the cumulative energy of each band; (5) compare the log energy of each band with a preset threshold; speech whose band log energy exceeds the threshold is treated as valid speech.
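  • A minimal sketch of this band-energy detection, assuming frames produced as in the framing sketch above; the number of bands and the threshold value are illustrative assumptions, not values from the original.
```python
import numpy as np

def detect_valid_speech(frames: np.ndarray, num_bands: int = 8,
                        log_energy_threshold: float = 5.0) -> np.ndarray:
    """Flag frames whose per-band log energy exceeds a preset threshold.

    frames: windowed frames of shape (num_frames, N).
    Returns a boolean array: True where the frame is treated as valid speech.
    """
    spectra = np.fft.rfft(frames, axis=1)          # DFT of each frame
    energy = np.abs(spectra) ** 2                  # |X(k)|^2
    # Split the spectrum into equal-width bands and accumulate E(m).
    bands = np.array_split(energy, num_bands, axis=1)
    band_energy = np.stack([b.sum(axis=1) for b in bands], axis=1)
    log_band_energy = np.log(band_energy + 1e-12)  # avoid log(0)
    # A frame counts as valid speech if any band exceeds the threshold.
    return (log_band_energy > log_energy_threshold).any(axis=1)
```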
  • step 102 feature parameters are extracted from the pre-processed speech signal.
  • Feature parameter extraction is the analysis of speech signals to extract the sequence of acoustic parameters that reflect the essential features of speech.
  • The extracted feature parameters can include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and can also include transform-domain parameters such as Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), Mel Frequency Cepstrum Coefficients (MFCC), and Perceptual Linear Prediction (PLP) coefficients.
  • MFCC characteristic parameters of a speech signal may be extracted.
  • The steps for extracting MFCC feature parameters are as follows:
  • (1) A Discrete Fourier Transform (DFT, which may be implemented as a fast Fourier transform) is performed on each speech frame to obtain the spectrum of the frame.
  • (2) The squared magnitude of the spectrum is computed to obtain the discrete energy spectrum of the frame.
  • (3) The discrete energy spectrum of the frame is passed through a set of triangular filters uniformly distributed on the Mel frequency scale (a triangular filter bank) to obtain the output of each filter. The center frequencies of the filters are evenly spaced on the Mel scale, and the two base-point frequencies of each filter's triangle are equal to the center frequencies of the two adjacent filters. The center frequency of the m-th filter can be written as f(m) = (N/F_S)·B⁻¹(B(f_l) + m·(B(f_h) - B(f_l))/(M+1)), and the frequency response of the m-th filter is the triangle that rises linearly from 0 at f(m-1) to its peak at f(m) and falls linearly back to 0 at f(m+1), where f_h and f_l are the highest and lowest filter frequencies, N is the number of DFT points, F_S is the sampling frequency, M is the number of triangular filters, and B⁻¹(b) = 700·(e^{b/1125} - 1) is the inverse of the Mel-frequency mapping B(f).
  • (4) The logarithm of all filter outputs is taken to obtain the log power spectrum S(m) of the frame.
  • (5) A Discrete Cosine Transform (DCT) is performed on S(m) to obtain the initial MFCC feature parameters of the frame: C(n) = Σ_{m=0}^{M-1} S(m)·cos(πn(m + 1/2)/M).
  • the triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has better recognition performance in noisy environments.
  • the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
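  • The MFCC pipeline above (DFT, energy spectrum, Mel triangular filter bank, logarithm, DCT, plus first- and second-order differences) can be sketched roughly as below; the 26-filter bank size and other constants are assumptions for illustration, not the patent's exact configuration.
```python
import numpy as np
from scipy.fftpack import dct

def mel(f):      # Hz -> Mel
    return 1125.0 * np.log(1.0 + f / 700.0)

def inv_mel(b):  # Mel -> Hz, B^-1(b) = 700 * (exp(b / 1125) - 1)
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mfcc(frames: np.ndarray, fs: int = 16000,
         n_filters: int = 26, n_ceps: int = 13) -> np.ndarray:
    """Return (num_frames, 39) features: 13 MFCCs + deltas + delta-deltas."""
    n_fft = frames.shape[1]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # energy spectrum
    # Triangular filter bank with centers evenly spaced on the Mel scale.
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    log_power = np.log(spec @ fbank.T + 1e-12)                 # S(m)
    ceps = dct(log_power, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # First- and second-order differences over time (dynamic features).
    delta = np.gradient(ceps, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([ceps, delta, delta2])
```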
  • the extracted feature parameters may also be subjected to dimensionality reduction processing to obtain the dimensionality-reduced feature parameters.
  • For example, a segmented-mean data dimensionality reduction algorithm may be used to reduce the dimensionality of the feature parameters (such as the MFCC feature parameters), and the reduced feature parameters are then used in the subsequent steps.
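  • The "segmented mean" reduction is not spelled out in the text; the sketch below shows one common reading of it (piecewise aggregate averaging of the feature sequence over time), so treat the exact scheme and the segment count as assumptions.
```python
import numpy as np

def segmented_mean_reduce(features: np.ndarray, num_segments: int = 50) -> np.ndarray:
    """Reduce a (num_frames, dim) feature sequence to (num_segments, dim)
    by averaging the frames that fall inside each of num_segments equal
    time segments."""
    num_segments = min(num_segments, len(features))   # guard short inputs
    segments = np.array_split(features, num_segments, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])
```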
  • Step 103 Decode the speech signal using a pre-trained speech recognition model according to the feature parameters to obtain a text in a sentence unit.
  • the speech recognition model may include a dynamic time warping model, a hidden Markov model, an artificial neural network model, a support vector machine classification model, and the like.
  • the speech recognition model may also be a combination of two or more of the models.
  • the speech recognition model is a hidden Markov model (HMM).
  • The HMM-based recognizer includes an acoustic model and a language model.
  • Acoustic model: a hidden Markov model is used to model phonemes. In speech recognition, the recognition unit is not the word but the subword; subwords are the basic acoustic units of the acoustic model. In English a subword is a phoneme, and for a specific word the corresponding acoustic model is formed by concatenating multiple phonemes according to the grammar rules of a pronunciation dictionary. In Chinese, the subwords are the initials and the finals. Each subword can be modeled with an HMM that includes multiple states.
  • each phoneme can be modeled with an HMM containing up to 6 states, and each state can be fitted with a corresponding observation frame using a Gaussian mixture model (GMM), and the observation frames are combined into an observation sequence in time series.
  • Each acoustic model can generate observation sequences of different lengths, that is, a one-to-many mapping.
  • Language model: the language model is used to effectively combine grammatical and semantic knowledge during speech recognition, improve the recognition rate, and narrow the search space. Because it is difficult to determine word boundaries accurately and the acoustic model has limited ability to describe phonetic variability, many word sequences with similar probability scores are produced during recognition. Therefore, practical speech recognition systems usually use a language model P(W) to select the most likely word sequence from the many candidate results and compensate for the limitations of the acoustic model.
  • In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules, and then uses these rules to exclude acoustic recognition results that violate grammatical or semantic rules.
  • Statistical language models describe the dependencies between words by statistical probability, indirectly encoding grammatical or semantic rules.
  • Decoding searches the state network for an optimal path, that is, the path for which the probability of having produced the speech is the largest.
  • In this embodiment, a dynamic programming algorithm (the Viterbi algorithm) is used to find the globally optimal path.
  • Assume the feature parameters extracted from the speech signal form a feature vector Y. The decoding algorithm searches for the word sequence w_{1:L} = w_1, w_2, ..., w_L that is most likely to have generated Y, that is, it solves for the parameter w that maximizes the posterior probability P(w|Y): w_best = argmax{p(w|Y)}.
  • By Bayes' theorem, w_best = argmax{p(Y|w)·p(w)/p(Y)}, and because the observation probability P(Y) is a constant for a given observation sequence, this simplifies to w_best = argmax{p(Y|w)·p(w)}, where the prior probability P(w) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. From this computation, the parameter w that maximizes the posterior probability P(w|Y) is obtained.
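  • To make the Viterbi search concrete, here is a generic dynamic-programming sketch over a small discrete HMM; the toy transition and emission matrices the caller supplies are placeholders, not the patent's trained acoustic or language model.
```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """Find the most probable hidden-state path for an observation sequence.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities
    log_emit:  (S, V) log emission probabilities over V observation symbols
    observations: sequence of symbol indices
    """
    S = len(log_init)
    T = len(observations)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, observations[t]]
    # Backtrack the globally optimal path.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```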
  • step 104 a summary sentence is extracted from the text in the sentence unit through a Hidden Markov Model HMM.
  • After step 103, the speech signal has been decoded into text in sentence units; in a conventional speech recognition system, the recognition work would be complete at this point. The present method further extracts summary sentences from the recognized text in sentence units.
  • the purpose of extracting abstract sentences is to extract important information from the speech and eliminate useless information.
  • This method extracts summary sentences through an HMM. The double stochastic structure of the HMM can be described as follows: one stochastic process is the emission of the sentence sequence, which is observable; the other is whether each sentence should be classified as a summary sentence, which is hidden. Extracting summary sentences with the HMM can therefore be described as determining, given the sentence sequence O = {O_1, O_2, ..., O_n}, the most likely labeling of whether each sentence is a summary sentence.
  • The main steps, illustrated by the sketch after this list, are as follows:
  • (1) Obtain the observation sequence O = {O_1, O_2, ..., O_n} of the text in sentence units.
  • (2) Determine the HMM hidden states. Five hidden states may be set, for example "1"-conforming, "2"-fairly conforming, "3"-neutral, "4"-less conforming, and "5"-non-conforming, indicating in turn the degree to which a sentence conforms to a summary sentence.
  • (3) Estimate the HMM parameters: initial probability parameters are generated randomly and refined through iteration; the computation stops when a preset threshold is reached, yielding suitable HMM parameters.
  • (4) According to the trained HMM, the sentences are labeled by the Viterbi algorithm, and the degree to which each sentence conforms to a summary sentence is obtained.
  • (5) Sentences that satisfy a preset degree of conformity (for example, sentences that are at least fairly conforming) are extracted from the text in sentence units, yielding the summary sentences of the text.
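  • A minimal sketch of this idea, reusing the viterbi() function from the decoding sketch above: hidden states encode the degree of conformity, Viterbi labels the sentences, and sentences whose label meets the preset degree are kept. The sentence-to-symbol mapping and all probabilities are illustrative assumptions; the original does not specify the sentence features or trained parameters.
```python
import numpy as np

# Hidden states 0..4 correspond to the conformity labels
# "1"-conforming ... "5"-non-conforming described above.
STATES = ["conforming", "fairly conforming", "neutral",
          "less conforming", "non-conforming"]

def extract_summary(sentences, sentence_symbols,
                    log_init, log_trans, log_emit, max_kept_state=1):
    """Label each sentence with a conformity state via Viterbi (see the
    decoding sketch above) and keep sentences labelled at least
    'fairly conforming' (state index <= max_kept_state)."""
    labels = viterbi(log_init, log_trans, log_emit, sentence_symbols)
    return [s for s, lab in zip(sentences, labels) if lab <= max_kept_state]
```
  • In practice, sentence_symbols would come from quantizing per-sentence features, and the HMM parameters would be estimated iteratively as described in step (3); both are outside the scope of this sketch.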
  • The speech processing method of the first embodiment pre-processes the speech signal; extracts feature parameters from the pre-processed speech signal; decodes the speech signal with the pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and extracts summary sentences from the text in sentence units through an HMM.
  • the first embodiment not only converts voice information into text, but also extracts summary sentences in the text for output, and eliminates useless information from the speech recognition results to obtain better speech processing results.
  • In another embodiment, when the MFCC feature parameters are extracted, vocal tract length normalization (VTLN) can be performed to obtain MFCC feature parameters with a normalized vocal tract length.
  • The vocal tract can be represented as a cascaded acoustic-tube model.
  • Each acoustic tube can be regarded as a resonant cavity whose resonance frequencies depend on the length and shape of the tube. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length; for example, vocal tract length generally ranges from about 13 cm (adult female) to 18 cm (adult male), so speakers of different genders produce the same vowel with very different formant frequencies. VTLN eliminates the difference between male and female vocal tract lengths so that the recognition result is not disturbed by gender.
  • VTLN matches the formant frequencies of different speakers by warping and shifting the frequency axis.
  • a VTLN method based on bilinear transformation may be used.
  • The bilinear-transformation-based VTLN method does not fold the spectrum of the speech signal directly. Instead, it uses the mapping formula for the cutoff frequency of a bilinear-transform low-pass filter to compute a frequency warping factor that aligns the average third formants of different speakers.
  • According to the frequency warping factor, a bilinear transformation is used to adjust the positions (for example, the start, middle, and end points of each triangular filter) and the widths of the triangular filter bank, and the vocal-tract-normalized MFCC feature parameters are then computed with the adjusted filter bank.
  • For example, to compress the spectrum of the speech signal, the scale of the triangular filters is stretched, and the filter bank is expanded and shifted to the left; to stretch the spectrum, the scale of the triangular filters is compressed, and the filter bank is compressed and shifted to the right.
  • When the bilinear-transformation-based VTLN method is used to normalize the vocal tract of a specific group or a specific person, the triangular filter bank coefficients only need to be transformed once; the signal spectrum does not have to be folded every time feature parameters are extracted, which greatly reduces the amount of computation.
  • In addition, the bilinear-transformation-based VTLN method avoids a linear search over frequency factors, reducing computational complexity, and the bilinear transformation makes the warped frequency continuous without changing the bandwidth.
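  • A rough sketch of the warping idea: the edge frequencies of the triangular filter bank are remapped once with a first-order all-pass (bilinear) warping function instead of folding the spectrum for every utterance. The warping function form, the parameter names, and the sign convention are assumptions for illustration; the original does not give the explicit formula.
```python
import numpy as np

def bilinear_warp(freq_hz: np.ndarray, alpha: float, fs: int = 16000) -> np.ndarray:
    """Warp frequencies with a first-order all-pass (bilinear) transform.

    alpha controls the direction and amount of warping; it would be chosen
    per speaker, e.g. by aligning average third formants as described above.
    Its sign convention depends on whether the warp is applied to the
    filter bank or to the spectrum.
    """
    omega = 2.0 * np.pi * freq_hz / fs                      # normalized frequency
    warped = omega + 2.0 * np.arctan(alpha * np.sin(omega) /
                                     (1.0 - alpha * np.cos(omega)))
    return warped * fs / (2.0 * np.pi)

def warp_filter_bank_edges(edges_hz: np.ndarray, alpha: float, fs: int = 16000):
    """Apply the warp once to the start/centre/end frequencies of each
    triangular filter; MFCCs are then computed with the adjusted bank."""
    return bilinear_warp(edges_hz, alpha, fs)
```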
  • FIG. 2 is a structural diagram of a voice processing apparatus provided in Embodiment 2 of the present application.
  • the speech processing apparatus 10 may include a preprocessing unit 201, a feature extraction unit 202, a decoding unit 203, and a digest extraction unit 204.
  • the pre-processing unit 201 is configured to pre-process a voice signal.
  • the voice signal may be an analog voice signal or a digital voice signal. If the voice signal is an analog voice signal, the analog voice signal is subjected to analog-to-digital conversion to be converted into a digital voice signal.
  • In one embodiment of the present application, the speech processing method is applied in a smart conference system, and the speech signal is the speech of a speaker input into the smart conference system through a voice input device (for example, a microphone or a mobile phone microphone).
  • Preprocessing the voice signal may include pre-emphasizing the voice signal.
  • The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of glottal excitation and mouth-nose radiation, the energy of a speech signal drops significantly at the high-frequency end; usually the higher the frequency, the smaller the amplitude, and the power-spectrum amplitude falls by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis is performed on a speech signal, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized.
  • Pre-emphasis is generally implemented using a high-pass filter.
  • The transfer function of the high-pass filter can be:
  • H(z) = 1 - κz⁻¹, where 0.9 ≤ κ ≤ 1.0.
  • κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
  • Preprocessing the voice signal may further include windowing and framing the voice signal.
  • Speech signals are a kind of non-stationary time-varying signals, which are mainly divided into two categories, voiced and unvoiced.
  • the pitch period of the voiced voice, the amplitude of the voiced voice signal, and the channel parameters all change slowly with time, but usually it can be considered to have short-term smoothness within 10ms-30ms.
  • the voice signal can be divided into some short segments (that is, to obtain a short-term stationary signal) for processing. This process is called framing, and the obtained short segment voice signal is called a voice frame. Framing is achieved by windowing the speech signal.
  • each voice frame is 25 milliseconds, and there is a 15 millisecond overlap between two adjacent voice frames, that is, one voice frame is taken every 10 milliseconds.
  • The commonly used window functions are the rectangular window, the Hamming window, and the Hanning window.
  • The rectangular window function is: w(n) = 1 for 0 ≤ n ≤ N-1, and 0 otherwise.
  • The Hamming window function is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and 0 otherwise.
  • The Hanning window function is: w(n) = 0.5·[1 - cos(2πn/(N-1))] for 0 ≤ n ≤ N-1, and 0 otherwise.
  • Here N is the number of sampling points contained in one speech frame.
  • Preprocessing the voice signal may further include detecting valid voice in the voice signal.
  • The purpose of detecting effective speech is to remove non-effective speech (that is, non-speech segments) from the speech signal and obtain the effective speech (that is, speech segments), so as to reduce the computation load of feature extraction, improve its accuracy, shorten the speech recognition time, and increase the recognition rate.
  • Effective speech detection can be performed based on the short-time energy and short-time zero-crossing rate of the speech signal.
  • Assuming the n-th speech frame is x_n(m), the short-time energy is: E_n = Σ_{m=0}^{N-1} x_n²(m).
  • The short-time zero-crossing rate is: Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|, where sgn[x] = 1 for x ≥ 0 and sgn[x] = -1 for x < 0.
  • a two-stage judgment method may be used to detect the start and end points of effective speech in the speech signal.
  • the two-stage judgment method is a well-known technology in the art, and will not be repeated here.
  • In another embodiment, valid speech in the speech signal may be detected by the following method: (1) window and frame the speech signal to obtain speech frames x(n); for example, a Hamming window may be applied with a 20 ms frame length and a 10 ms frame shift (this step is omitted if the signal has already been windowed and framed during preprocessing); (2) perform a Discrete Fourier Transform (DFT) on each frame x(n) to obtain its spectrum X(k); (3) compute the cumulative energy of each frequency band from the spectrum, E(m) = Σ_{k=m1}^{m2} |X(k)|², where E(m) is the cumulative energy of the m-th band and (m1, m2) are the start and end frequency points of the m-th band; (4) take the logarithm of the cumulative energy of each band; (5) compare the log energy of each band with a preset threshold; speech whose band log energy exceeds the threshold is treated as valid speech.
  • the feature extraction unit 202 is configured to extract feature parameters from the pre-processed speech signal.
  • Feature parameter extraction is the analysis of speech signals to extract the sequence of acoustic parameters that reflect the essential features of speech.
  • The extracted feature parameters can include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and can also include transform-domain parameters such as Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), Mel Frequency Cepstrum Coefficients (MFCC), and Perceptual Linear Prediction (PLP) coefficients.
  • MFCC characteristic parameters of a speech signal may be extracted.
  • The steps for extracting MFCC feature parameters are as follows:
  • (1) A Discrete Fourier Transform (DFT, which may be implemented as a fast Fourier transform) is performed on each speech frame obtained by the pre-processing unit 201 to obtain the spectrum of the frame.
  • (2) The squared magnitude of the spectrum is computed to obtain the discrete energy spectrum of the frame.
  • (3) The discrete energy spectrum of the frame is passed through a set of triangular filters uniformly distributed on the Mel frequency scale (a triangular filter bank) to obtain the output of each filter. The center frequencies of the filters are evenly spaced on the Mel scale, and the two base-point frequencies of each filter's triangle are equal to the center frequencies of the two adjacent filters. The center frequency of the m-th filter can be written as f(m) = (N/F_S)·B⁻¹(B(f_l) + m·(B(f_h) - B(f_l))/(M+1)), and the frequency response of the m-th filter is the triangle that rises linearly from 0 at f(m-1) to its peak at f(m) and falls linearly back to 0 at f(m+1), where f_h and f_l are the highest and lowest filter frequencies, N is the number of DFT points, F_S is the sampling frequency, M is the number of triangular filters, and B⁻¹(b) = 700·(e^{b/1125} - 1) is the inverse of the Mel-frequency mapping B(f).
  • (4) The logarithm of all filter outputs is taken to obtain the log power spectrum S(m) of the frame.
  • (5) A Discrete Cosine Transform (DCT) is performed on S(m) to obtain the initial MFCC feature parameters of the frame: C(n) = Σ_{m=0}^{M-1} S(m)·cos(πn(m + 1/2)/M).
  • the triangular filter bank is introduced in MFCC, and the triangular filter is densely distributed in the low frequency band and sparsely distributed in the high frequency band, which conforms to the human ear hearing characteristics, and still has better recognition performance in noisy environments.
  • The initial MFCC feature parameters only reflect the static characteristics of the speech; the dynamic characteristics can be described by the differential spectrum of the static features. Combining static and dynamic features effectively improves the recognition performance of the system, and first-order and/or second-order differential MFCC feature parameters are usually used.
  • the extracted MFCC feature parameters are 39-dimensional feature vectors, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
  • the extracted feature parameters may also be subjected to dimensionality reduction processing to obtain the dimensionality-reduced feature parameters.
  • dimensionality reduction processing For example, a segmented mean data dimensionality reduction algorithm is used to perform dimensionality reduction processing on the feature parameters (such as MFCC feature parameters) to obtain the dimensionality-reduced feature parameters.
  • the reduced feature parameters will be used in subsequent steps.
  • a decoding unit 203 is configured to decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain a text in a sentence unit.
  • the speech recognition model may include a dynamic time warping model, a hidden Markov model, an artificial neural network model, a support vector machine classification model, and the like.
  • the speech recognition model may also be a combination of two or more of the models.
  • the speech recognition model is a hidden Markov model (HMM).
  • The HMM-based recognizer includes an acoustic model and a language model.
  • Acoustic model: a hidden Markov model is used to model phonemes. In speech recognition, the recognition unit is not the word but the subword; subwords are the basic acoustic units of the acoustic model. In English a subword is a phoneme, and for a specific word the corresponding acoustic model is formed by concatenating multiple phonemes according to the grammar rules of a pronunciation dictionary. In Chinese, the subwords are the initials and the finals. Each subword can be modeled with an HMM that includes multiple states.
  • each phoneme can be modeled with an HMM containing up to 6 states, and each state can be fitted with a corresponding observation frame using a Gaussian mixture model (GMM), and the observation frames are combined into an observation sequence in time series.
  • Each acoustic model can generate observation sequences of different lengths, that is, a one-to-many mapping.
  • Language model: the language model is used to effectively combine grammatical and semantic knowledge during speech recognition, improve the recognition rate, and narrow the search space. Because it is difficult to determine word boundaries accurately and the acoustic model has limited ability to describe phonetic variability, many word sequences with similar probability scores are produced during recognition. Therefore, practical speech recognition systems usually use a language model P(W) to select the most likely word sequence from the many candidate results and compensate for the limitations of the acoustic model.
  • In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules, and then uses these rules to exclude acoustic recognition results that violate grammatical or semantic rules.
  • Statistical language models describe the dependencies between words by statistical probability, indirectly encoding grammatical or semantic rules.
  • Decoding searches the state network for an optimal path, that is, the path for which the probability of having produced the speech is the largest.
  • In this embodiment, a dynamic programming algorithm (the Viterbi algorithm) is used to find the globally optimal path.
  • Assume the feature parameters extracted by the feature extraction unit 202 form a feature vector Y. The decoding algorithm searches for the word sequence w_{1:L} = w_1, w_2, ..., w_L that is most likely to have generated Y, that is, it solves for the parameter w that maximizes the posterior probability P(w|Y): w_best = argmax{p(w|Y)}.
  • By Bayes' theorem, w_best = argmax{p(Y|w)·p(w)/p(Y)}, and because the observation probability P(Y) is a constant for a given observation sequence, this simplifies to w_best = argmax{p(Y|w)·p(w)}, where the prior probability P(w) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. From this computation, the parameter w that maximizes the posterior probability P(w|Y) is obtained.
  • the abstract extraction unit 204 is configured to extract a summary sentence from the text in units of sentences.
  • The decoding unit 203 decodes the speech signal into text in sentence units; in a conventional speech recognition system, the recognition work would be complete at this point. In the present application, the abstract extraction unit 204 further extracts summary sentences from the recognized text in sentence units.
  • the purpose of extracting abstract sentences is to extract important information from the speech and eliminate useless information.
  • The abstract extraction unit 204 extracts summary sentences through an HMM, following the same procedure as step 104 of the first embodiment.
  • The main steps are as follows:
  • According to the trained HMM, the sentences are labeled by the Viterbi algorithm, and the degree to which each sentence conforms to a summary sentence is obtained.
  • Sentences that satisfy a preset degree of conformity are extracted from the text in sentence units, yielding the summary sentences of the text.
  • the speech processing device 10 of the second embodiment pre-processes the speech signal; extracts characteristic parameters from the pre-processed speech signal; and uses the pre-trained speech recognition model to decode the speech signal according to the characteristic parameters to obtain Sentence-based text; extracts summary sentences from sentence-based text through Hidden Markov Model HMM.
  • the second embodiment not only converts the voice information into text, but also extracts the abstract sentences in the text for output, and eliminates useless information from the speech recognition results to obtain better speech processing results.
  • In another embodiment, when extracting the MFCC feature parameters, the feature extraction unit 202 can perform vocal tract length normalization (VTLN) to obtain MFCC feature parameters with a normalized vocal tract length.
  • The vocal tract can be represented as a cascaded acoustic-tube model.
  • Each acoustic tube can be regarded as a resonant cavity whose resonance frequencies depend on the length and shape of the tube. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length; for example, vocal tract length generally ranges from about 13 cm (adult female) to 18 cm (adult male), so speakers of different genders produce the same vowel with very different formant frequencies. VTLN eliminates the difference between male and female vocal tract lengths so that the recognition result is not disturbed by gender.
  • VTLN matches the formant frequencies of different speakers by warping and shifting the frequency axis.
  • a VTLN method based on bilinear transformation may be used.
  • The bilinear-transformation-based VTLN method does not fold the spectrum of the speech signal directly. Instead, it uses the mapping formula for the cutoff frequency of a bilinear-transform low-pass filter to compute a frequency warping factor that aligns the average third formants of different speakers.
  • According to the frequency warping factor, a bilinear transformation is used to adjust the positions (for example, the start, middle, and end points of each triangular filter) and the widths of the triangular filter bank, and the vocal-tract-normalized MFCC feature parameters are then computed with the adjusted filter bank.
  • For example, to compress the spectrum of the speech signal, the scale of the triangular filters is stretched, and the filter bank is expanded and shifted to the left; to stretch the spectrum, the scale of the triangular filters is compressed, and the filter bank is compressed and shifted to the right.
  • When the bilinear-transformation-based VTLN method is used to normalize the vocal tract of a specific group or a specific person, the triangular filter bank coefficients only need to be transformed once; the signal spectrum does not have to be folded every time feature parameters are extracted, which greatly reduces the amount of computation.
  • In addition, the bilinear-transformation-based VTLN method avoids a linear search over frequency factors, reducing computational complexity, and the bilinear transformation makes the warped frequency continuous without changing the bandwidth.
  • This embodiment provides a non-volatile readable storage medium.
  • Computer-readable instructions are stored on the non-volatile readable storage medium. When the computer-readable instructions are executed by a processor, the steps of the foregoing speech processing method embodiment are implemented, for example, steps 101-104 shown in FIG. 1:
  • Step 101 pre-process the voice signal
  • Step 102 extract feature parameters from the pre-processed voice signal
  • Step 103 Decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain a text in a sentence unit;
  • step 104 a summary sentence is extracted from the text in units of sentences by a Hidden Markov Model HMM.
  • Alternatively, when the computer-readable instructions are executed by the processor, the functions of the units in the foregoing apparatus embodiment are implemented, for example, units 201-204 shown in FIG. 2:
  • a pre-processing unit 201 configured to pre-process a voice signal;
  • a feature extraction unit 202 configured to extract feature parameters from the pre-processed voice signal
  • a decoding unit 203 configured to decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain a text in a sentence unit;
  • the abstract extraction unit 204 is configured to extract a summary sentence from the text in units of sentences through a Hidden Markov Model HMM.
  • FIG. 3 is a schematic diagram of a computer device according to a third embodiment of the present application.
  • the computer device 1 includes a memory 20, a processor 30, and computer-readable instructions 40, such as a voice processing program, stored in the memory 20 and executable on the processor 30.
  • When the processor 30 executes the computer-readable instructions 40, the steps in the foregoing embodiment of the speech processing method are implemented, for example, steps 101-104 shown in FIG. 1:
  • Step 101 pre-process the voice signal
  • Step 102 extract feature parameters from the pre-processed voice signal
  • Step 103 Decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain a text in a sentence unit;
  • step 104 a summary sentence is extracted from the text in units of sentences by a Hidden Markov Model HMM.
  • Alternatively, when the processor 30 executes the computer-readable instructions 40, the functions of the units in the foregoing apparatus embodiment are implemented, for example, units 201-204 shown in FIG. 2:
  • a pre-processing unit 201 configured to pre-process a voice signal;
  • a feature extraction unit 202 configured to extract feature parameters from the pre-processed voice signal
  • a decoding unit 203 configured to decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain a text in a sentence unit;
  • the abstract extraction unit 204 is configured to extract a summary sentence from the text in units of sentences through a Hidden Markov Model HMM.
  • The computer-readable instructions 40 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 20 and executed by the processor 30 to complete the present application.
  • the one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 40 in the computer device 1.
  • the computer-readable instructions 40 may be divided into a pre-processing unit 201, a feature extraction unit 202, a decoding unit 203, and a digest extraction unit 204 in FIG. 2. For specific functions of each unit, refer to Embodiment 2.
  • the computer device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • The schematic diagram in FIG. 3 is only an example of the computer device 1 and does not constitute a limitation on the computer device 1; the computer device may include more or fewer components than shown, may combine some components, or may use different components. For example, the computer device 1 may further include an input/output device, a network access device, a bus, and the like.
  • The processor 30 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor.
  • The processor 30 is the control center of the computer device 1 and uses various interfaces and lines to connect the various parts of the entire computer device 1.
  • The memory 20 may be configured to store the computer-readable instructions 40 and/or modules/units, and the processor 30 implements the various functions of the computer device 1 by running or executing the computer-readable instructions and/or modules/units stored in the memory 20 and invoking the data stored in the memory 20.
  • The memory 20 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the computer device 1 (such as audio data or a phone book).
  • The memory 20 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • When the modules/units integrated in the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application can also be completed by computer-readable instructions instructing related hardware. The computer-readable instructions can be stored in a non-volatile readable storage medium, and when the computer-readable instructions are executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer-readable instruction code may be in a source code form, an object code form, an executable file, or some intermediate form.
  • The non-volatile readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like.
  • It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in each jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electric carrier signals and telecommunication signals.
  • each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist separately physically, or two or more units may be integrated in the same unit.
  • the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A speech processing method, the method including: pre-processing a speech signal; extracting feature parameters from the pre-processed speech signal; decoding the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and extracting summary sentences from the text in sentence units through a hidden Markov model (HMM). The present application further provides a speech processing apparatus, a computer apparatus, and a non-volatile readable storage medium. The present application can recognize speech and remove useless information from the speech recognition result.

Description

Speech processing method and apparatus, computer apparatus, and readable storage medium
This application claims priority to Chinese patent application No. 201810897646.2, filed with the Chinese Patent Office on August 8, 2018 and entitled "Speech processing method and apparatus, computer apparatus, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer audition technology, and in particular to a speech processing method and apparatus, a computer apparatus, and a non-volatile readable storage medium.
Background
In intelligent conference systems, speech recognition is a key technology that converts a speaker's speech signal into text information that a computer can recognize and output.
However, existing intelligent conference systems only convert speech to text and cannot further process the recognized text information. Text obtained by directly converting speech may contain useless information, such as sentences unrelated to the content of the conference.
Summary
In view of the above, it is necessary to provide a speech processing method and apparatus, a computer apparatus, and a non-volatile readable storage medium that can recognize speech and remove useless information from the speech recognition result.
A first aspect of the present application provides a speech processing method, the method including:
pre-processing a speech signal;
extracting feature parameters from the pre-processed speech signal;
decoding the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and
extracting summary sentences from the text in sentence units through a hidden Markov model (HMM).
A second aspect of the present application provides a speech processing apparatus, the apparatus including:
a pre-processing unit configured to pre-process a speech signal;
a feature extraction unit configured to extract feature parameters from the pre-processed speech signal;
a decoding unit configured to decode the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and
an abstract extraction unit configured to extract summary sentences from the text in sentence units through a hidden Markov model (HMM).
A third aspect of the present application provides a computer apparatus, the computer apparatus including a processor, where the processor implements the speech processing method when executing computer-readable instructions stored in a memory.
A fourth aspect of the present application provides a non-volatile readable storage medium storing computer-readable instructions, where the computer-readable instructions implement the speech processing method when executed by a processor.
The present application pre-processes a speech signal; extracts feature parameters from the pre-processed speech signal; decodes the speech signal by using a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and extracts summary sentences from the text in sentence units through a hidden Markov model (HMM). The present application not only converts speech information into text, but also extracts the summary sentences of the text for output, removing useless information from the speech recognition result and obtaining a better speech processing result.
Brief Description of the Drawings
FIG. 1 is a flowchart of a speech processing method according to an embodiment of the present application.
FIG. 2 is a structural diagram of a speech processing apparatus according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a computer apparatus according to an embodiment of the present application.
Detailed Description
To make the above objectives, features, and advantages of the present application clearer and easier to understand, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application.
Preferably, the speech processing method of the present application is applied in one or more computer apparatuses. The computer apparatus is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer apparatus may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer apparatus can perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice-controlled device, or the like.
Embodiment One
FIG. 1 is a flowchart of the speech processing method provided in Embodiment One of the present application. The speech processing method is applied to a computer apparatus. The speech processing method recognizes text in sentence units from a speech signal, and extracts summary sentences from the text in sentence units.
As shown in FIG. 1, the speech processing method specifically includes the following steps:
Step 101: pre-process the speech signal.
The speech signal may be an analog speech signal or a digital speech signal. If the speech signal is an analog speech signal, the analog speech signal is converted into a digital speech signal by analog-to-digital conversion.
The present application is used for continuous speech recognition, that is, processing a continuous audio stream. In one embodiment of the present application, the speech processing method is applied in a smart conference system, and the speech signal is the speech of a speaker input into the smart conference system through a voice input device (for example, a microphone or a mobile phone microphone).
Pre-processing the speech signal may include pre-emphasizing the speech signal.
The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of glottal excitation and mouth-nose radiation, the energy of a speech signal drops significantly at the high-frequency end; usually the higher the frequency, the smaller the amplitude, and the power-spectrum amplitude falls by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis is performed on the speech signal, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function can be:
H(z) = 1 - κz⁻¹, 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
Pre-processing the speech signal may further include windowing and framing the speech signal.
A speech signal is a non-stationary, time-varying signal that is mainly divided into two categories: voiced and unvoiced. The pitch period of voiced speech, the amplitude of the voiced/unvoiced signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be regarded as short-time stationary within 10 ms to 30 ms. In speech signal processing, the speech signal can therefore be divided into short segments (that is, short-time stationary signals) for processing; this process is called framing, and each resulting short segment of the speech signal is called a speech frame. Framing is achieved by windowing the speech signal. To avoid excessive change between adjacent frames, consecutive frames overlap. In one embodiment of the present application, each speech frame is 25 milliseconds long with a 15 millisecond overlap between adjacent frames, that is, a new frame is taken every 10 milliseconds.
The commonly used window functions are the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1 for 0 ≤ n ≤ N-1, and 0 otherwise.
The Hamming window function is:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and 0 otherwise.
The Hanning window function is:
w(n) = 0.5·[1 - cos(2πn/(N-1))] for 0 ≤ n ≤ N-1, and 0 otherwise.
Here N is the number of sampling points contained in one speech frame.
Pre-processing the speech signal may further include detecting valid speech in the speech signal.
The purpose of detecting valid speech is to remove non-valid speech (that is, non-speech segments) from the speech signal and obtain the valid speech (that is, speech segments), so as to reduce the computation load of feature extraction, improve its accuracy, shorten the speech recognition time, and increase the recognition rate. Valid speech detection can be performed based on the short-time energy and short-time zero-crossing rate of the speech signal.
In one embodiment, assuming the n-th speech frame of the speech signal is x_n(m), the short-time energy is:
E_n = Σ_{m=0}^{N-1} x_n²(m),
and the short-time zero-crossing rate is:
Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|,
where sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0 and sgn[x] = -1 for x < 0.
A two-stage (dual-threshold) judgment method may be used to detect the start and end points of valid speech in the speech signal. The two-stage judgment method is well known in the art and is not repeated here.
In another embodiment, valid speech in the speech signal may be detected by the following method:
(1) Window and frame the speech signal to obtain speech frames x(n). In a specific embodiment, a Hamming window may be applied with a frame length of 20 ms and a frame shift of 10 ms. If the speech signal has already been windowed and framed during pre-processing, this step is omitted.
(2) Perform a Discrete Fourier Transform (DFT) on each speech frame x(n) to obtain its spectrum:
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1.
(3) Calculate the cumulative energy of each frequency band from the spectrum of the speech frame x(n):
E(m) = Σ_{k=m_1}^{m_2} |X(k)|²,
where E(m) is the cumulative energy of the m-th frequency band and (m_1, m_2) are the start and end frequency points of the m-th band.
(4) Take the logarithm of the cumulative energy of each frequency band to obtain the log cumulative energy of each band.
(5) Compare the log cumulative energy of each band with a preset threshold to obtain the valid speech. If the log cumulative energy of a band is higher than the preset threshold, the speech corresponding to that band is valid speech.
Step 102: extract feature parameters from the pre-processed speech signal.
Feature parameter extraction analyzes the speech signal and extracts a sequence of acoustic parameters that reflect the essential characteristics of the speech.
The extracted feature parameters can include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and can also include transform-domain parameters such as Linear Prediction Coefficients (LPC), Linear Prediction Cepstrum Coefficients (LPCC), Mel Frequency Cepstrum Coefficients (MFCC), and Perceptual Linear Prediction (PLP) coefficients.
In one embodiment of the present application, MFCC feature parameters of the speech signal may be extracted. The steps for extracting MFCC feature parameters are as follows:
(1) Perform a Discrete Fourier Transform (DFT, which may be implemented as a fast Fourier transform) on each speech frame to obtain the spectrum of the frame.
(2) Compute the squared magnitude of the spectrum of the frame to obtain the discrete energy spectrum of the frame.
(3) Pass the discrete energy spectrum of the frame through a set of triangular filters uniformly distributed on the Mel frequency scale (a triangular filter bank) to obtain the output of each triangular filter. The center frequencies of the filters are evenly spaced on the Mel scale, and the two base-point frequencies of each filter's triangle are equal to the center frequencies of the two adjacent filters. The center frequency of the m-th triangular filter can be written as:
f(m) = (N/F_S)·B⁻¹(B(f_l) + m·(B(f_h) - B(f_l))/(M+1)),
and the frequency response of the m-th filter is the triangle that rises linearly from 0 at f(m-1) to its peak at f(m) and falls linearly back to 0 at f(m+1),
where f_h and f_l are the high and low frequencies of the triangular filter bank, N is the number of DFT points, F_S is the sampling frequency, M is the number of triangular filters, and B⁻¹(b) = 700·(e^{b/1125} - 1) is the inverse of the Mel-frequency mapping f_mel.
(4) Take the logarithm of the outputs of all triangular filters to obtain the log power spectrum S(m) of the frame.
(5) Perform a Discrete Cosine Transform (DCT) on S(m) to obtain the initial MFCC feature parameters of the frame. The discrete cosine transform is:
C(n) = Σ_{m=0}^{M-1} S(m)·cos(πn(m + 1/2)/M).
MFCC introduces a triangular filter bank whose filters are densely spaced in the low-frequency band and sparsely spaced in the high-frequency band, which matches human auditory characteristics and retains good recognition performance in noisy environments.
The steps for extracting MFCC feature parameters may further include:
(6) Extract dynamic differential MFCC feature parameters of the speech frame from the initial MFCC feature parameters. The initial MFCC feature parameters only reflect the static characteristics of the speech; the dynamic characteristics can be described by the differential spectrum of the static features. Combining static and dynamic features effectively improves the recognition performance of the system, and first-order and/or second-order differential MFCC feature parameters are usually used.
In a specific embodiment, the extracted MFCC feature parameters form a 39-dimensional feature vector, including 13-dimensional initial MFCC feature parameters, 13-dimensional first-order differential MFCC feature parameters, and 13-dimensional second-order differential MFCC feature parameters.
In one implementation of the present application, after the feature parameters are extracted from the pre-processed speech signal, dimensionality reduction may also be performed on the extracted feature parameters to obtain reduced feature parameters. For example, a segmented-mean data dimensionality reduction algorithm is used to reduce the dimensionality of the feature parameters (such as the MFCC feature parameters), and the reduced feature parameters are used in the subsequent steps.
Step 103: decode the speech signal using a pre-trained speech recognition model according to the feature parameters, to obtain text in sentence units.
The speech recognition model may include a dynamic time warping model, a hidden Markov model, an artificial neural network model, a support vector machine classification model, and the like. The speech recognition model may also be a combination of two or more of these models.
In one embodiment of the present application, the speech recognition model is a hidden Markov model (HMM). The HMM-based recognizer includes an acoustic model and a language model.
Acoustic model: a hidden Markov model is used to model phonemes. In speech recognition, the recognition unit is not the word but the subword; subwords are the basic acoustic units of the acoustic model. In English a subword is a phoneme, and for a specific word the corresponding acoustic model is formed by concatenating multiple phonemes according to the grammar rules of a pronunciation dictionary. In Chinese, the subwords are the initials and the finals. Each subword can be modeled with an HMM that includes multiple states. For example, each phoneme can be modeled with an HMM containing up to six states, each state can be fitted to the corresponding observation frames with a Gaussian mixture model (GMM), and the observation frames are combined into an observation sequence in time order. Each acoustic model can generate observation sequences of different lengths, that is, a one-to-many mapping.
Language model: the language model is used to effectively combine grammatical and semantic knowledge during speech recognition, improve the recognition rate, and narrow the search space. Because it is difficult to determine word boundaries accurately and the acoustic model has limited ability to describe phonetic variability, many word sequences with similar probability scores are produced during recognition. Therefore, practical speech recognition systems usually use a language model P(W) to select the most likely word sequence from the many candidate results and compensate for the limitations of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules, and then uses these rules to exclude acoustic recognition results that violate grammatical or semantic rules. Statistical language models describe the dependencies between words through statistical probabilities, indirectly encoding grammatical or semantic rules.
Decoding searches the state network for an optimal path, that is, the path for which the probability of having produced the speech is the largest. In this embodiment, a dynamic programming algorithm (the Viterbi algorithm) is used to find the globally optimal path.
Assume the feature parameters extracted from the speech signal form a feature vector Y. The decoding algorithm searches for the word sequence w_{1:L} = w_1, w_2, ..., w_L that is most likely to have generated Y.
The decoding algorithm solves for the parameter w that maximizes the posterior probability P(w|Y), that is:
w_best = argmax{p(w|Y)}.
By Bayes' theorem the above becomes:
w_best = argmax{p(Y|w)·p(w) / p(Y)}.
Because the observation probability P(Y) is a constant for a given observation sequence, this can be further simplified to:
w_best = argmax{p(Y|w)·p(w)},
where the prior probability P(W) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. From this computation, the parameter w that maximizes the posterior probability P(w|Y) is obtained.
Step 104: extract summary sentences from the text in sentence units through a hidden Markov model (HMM).
After step 103, the speech signal has been decoded into text in sentence units; in a conventional speech recognition system, the recognition work would be complete at this point. The present method further extracts summary sentences from the recognized text in sentence units.
The purpose of extracting summary sentences is to extract the important information from the speech and remove useless information.
This method extracts summary sentences through an HMM. Here, the double stochastic structure of the HMM can be described as follows: one stochastic process is the emission of the sentence sequence, which is observable; the other is whether each sentence should be classified as a summary sentence, which is not observable. The process of extracting summary sentences with the HMM can therefore be described as determining, given the sentence sequence O = {O_1, O_2, ..., O_n}, the maximum likelihood of whether each sentence is a summary sentence. The main steps are as follows:
(1) Obtain the observation state sequence O = {O_1, O_2, ..., O_n} of the text in sentence units.
(2) Determine the HMM hidden states. Five hidden states may be set, for example "1"-conforming, "2"-fairly conforming, "3"-neutral, "4"-less conforming, and "5"-non-conforming, indicating in turn the degree to which a sentence conforms to a summary sentence.
(3) Estimate the HMM parameters. Initial probability parameters are first generated randomly and then refined through continuous iteration; when a preset threshold is reached, the computation stops and suitable HMM parameters are obtained.
(4) According to the trained HMM, label the sentences with the Viterbi algorithm to obtain the degree to which each sentence conforms to a summary sentence.
(5) Extract sentences that satisfy a preset degree of conformity (for example, sentences that are at least fairly conforming) from the text in sentence units, obtaining the summary sentences of the text in sentence units.
The speech processing method of Embodiment One pre-processes the speech signal; extracts feature parameters from the pre-processed speech signal; decodes the speech signal with a pre-trained speech recognition model according to the feature parameters to obtain text in sentence units; and extracts summary sentences from the text in sentence units through a hidden Markov model (HMM). Embodiment One not only converts the speech information into text, but also extracts the summary sentences of the text for output, removing useless information from the speech recognition result and obtaining a better speech processing result.
In another embodiment, when the MFCC feature parameters are extracted, vocal tract length normalization (VTLN) can be performed to obtain MFCC feature parameters with a normalized vocal tract length.
The vocal tract can be represented as a cascaded acoustic-tube model, where each acoustic tube can be regarded as a resonant cavity whose resonance frequencies depend on the length and shape of the tube. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length; for example, vocal tract length generally ranges from about 13 cm (adult female) to 18 cm (adult male), so speakers of different genders produce the same vowel with very different formant frequencies. VTLN eliminates the difference between male and female vocal tract lengths so that the recognition result is not disturbed by gender.
VTLN matches the formant frequencies of different speakers by warping and shifting the frequency axis. In this embodiment, a VTLN method based on a bilinear transformation may be used. The bilinear-transformation-based VTLN method does not fold the spectrum of the speech signal directly; instead, it uses the mapping formula for the cutoff frequency of a bilinear-transform low-pass filter to compute a frequency warping factor that aligns the average third formants of different speakers. According to the frequency warping factor, a bilinear transformation is used to adjust the positions (for example, the start, middle, and end points of each triangular filter) and the widths of the triangular filter bank, and the vocal-tract-normalized MFCC feature parameters are then computed with the adjusted filter bank. For example, to compress the spectrum of the speech signal, the scale of the triangular filters is stretched, and the filter bank is expanded and shifted to the left; to stretch the spectrum, the scale of the triangular filters is compressed, and the filter bank is compressed and shifted to the right. When this bilinear-transformation-based VTLN method is used to normalize the vocal tract of a specific group or a specific person, the triangular filter bank coefficients only need to be transformed once; the signal spectrum does not have to be folded every time feature parameters are extracted, which greatly reduces the amount of computation. In addition, the bilinear-transformation-based VTLN method avoids a linear search over frequency factors, reducing computational complexity, and the bilinear transformation makes the warped frequency continuous without changing the bandwidth.
Embodiment Two
FIG. 2 is a structural diagram of the speech processing apparatus provided in Embodiment Two of the present application. As shown in FIG. 2, the speech processing apparatus 10 may include a pre-processing unit 201, a feature extraction unit 202, a decoding unit 203, and an abstract extraction unit 204.
The pre-processing unit 201 is configured to pre-process a speech signal.
The speech signal may be an analog speech signal or a digital speech signal. If the speech signal is an analog speech signal, the analog speech signal is converted into a digital speech signal by analog-to-digital conversion.
The present application is used for continuous speech recognition, that is, processing a continuous audio stream. In one embodiment of the present application, the speech processing method is applied in a smart conference system, and the speech signal is the speech of a speaker input into the smart conference system through a voice input device (for example, a microphone or a mobile phone microphone).
Pre-processing the speech signal may include pre-emphasizing the speech signal.
The purpose of pre-emphasis is to boost the high-frequency components of the speech and flatten the spectrum of the signal. Because of glottal excitation and mouth-nose radiation, the energy of a speech signal drops significantly at the high-frequency end; usually the higher the frequency, the smaller the amplitude, and the power-spectrum amplitude falls by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis is performed on the speech signal, the high-frequency part of the signal needs to be boosted, that is, the signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function can be:
H(z) = 1 - κz⁻¹, 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis coefficient, and a preferred value is between 0.94 and 0.97.
Pre-processing the speech signal may further include windowing and framing the speech signal.
A speech signal is a non-stationary, time-varying signal that is mainly divided into two categories: voiced and unvoiced. The pitch period of voiced speech, the amplitude of the voiced/unvoiced signal, and the vocal-tract parameters all change slowly with time, but the signal can usually be regarded as short-time stationary within 10 ms to 30 ms. In speech signal processing, the speech signal can therefore be divided into short segments (that is, short-time stationary signals) for processing; this process is called framing, and each resulting short segment of the speech signal is called a speech frame. Framing is achieved by windowing the speech signal. To avoid excessive change between adjacent frames, consecutive frames overlap. In one embodiment of the present application, each speech frame is 25 milliseconds long with a 15 millisecond overlap between adjacent frames, that is, a new frame is taken every 10 milliseconds.
The commonly used window functions are the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1 for 0 ≤ n ≤ N-1, and 0 otherwise.
The Hamming window function is:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and 0 otherwise.
The Hanning window function is:
w(n) = 0.5·[1 - cos(2πn/(N-1))] for 0 ≤ n ≤ N-1, and 0 otherwise.
Here N is the number of sampling points contained in one speech frame.
Pre-processing the speech signal may further include detecting valid speech in the speech signal.
The purpose of detecting valid speech is to remove non-valid speech (that is, non-speech segments) from the speech signal and obtain the valid speech (that is, speech segments), so as to reduce the computation load of feature extraction, improve its accuracy, shorten the speech recognition time, and increase the recognition rate. Valid speech detection can be performed based on the short-time energy and short-time zero-crossing rate of the speech signal.
In one embodiment, assuming the n-th speech frame of the speech signal is x_n(m), the short-time energy is:
E_n = Σ_{m=0}^{N-1} x_n²(m),
and the short-time zero-crossing rate is:
Z_n = (1/2)·Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|,
where sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0 and sgn[x] = -1 for x < 0.
A two-stage (dual-threshold) judgment method may be used to detect the start and end points of valid speech in the speech signal. The two-stage judgment method is well known in the art and is not repeated here.
In another embodiment, valid speech in the speech signal may be detected by the following method:
(1) Window and frame the speech signal to obtain speech frames x(n). In a specific embodiment, a Hamming window may be applied with a frame length of 20 ms and a frame shift of 10 ms. If the speech signal has already been windowed and framed during pre-processing, this step is omitted.
(2) Perform a Discrete Fourier Transform (DFT) on each speech frame x(n) to obtain its spectrum:
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πnk/N}, 0 ≤ k ≤ N-1.
(3) Calculate the cumulative energy of each frequency band from the spectrum of the speech frame x(n):
E(m) = Σ_{k=m_1}^{m_2} |X(k)|²,
where E(m) is the cumulative energy of the m-th frequency band and (m_1, m_2) are the start and end frequency points of the m-th band.
(4) Take the logarithm of the cumulative energy of each frequency band to obtain the log cumulative energy of each band.
(5) Compare the log cumulative energy of each band with a preset threshold to obtain the valid speech. If the log cumulative energy of a band is higher than the preset threshold, the speech corresponding to that band is valid speech.
特征提取单元202,用于对预处理后的语音信号提取特征参数。
特征参数提取是对语音信号进行分析,提取出反映语音本质特征的声学参数序列。
提取的特征参数可以包括短时平均能量、短时平均过零率、共振峰、基音周期等时域参数,还可以包括线性预测系数(Linear Prediction Coefficient,LPC)、线性预测倒谱系数(Linear Prediction Cepstrum Coefficient,LPCC)、梅尔倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)、感知线性预测(Perceptual Linear Predictive,PLP)等变换域参数。
在本申请的一个实施例中,可以提取语音信号的MFCC特征参数。提取MFCC特征参数的步骤如下:
(1)对预处理单元201得到的每一个语音帧进行离散傅里叶变换(Discrete Fourier Transform,DFT,可以是快速傅里叶变换),得到该语音帧的频谱。
(2)求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱。
(3)将该语音帧的离散能量谱通过一组Mel频率上均匀分布的三角滤波器(即三角滤波器组),得到各个三角滤波器的输出。该组三角滤波器的中心频率在Mel频率刻度上均匀排列,且每个三角滤波器的三角形两个底点的频率分别等于相邻的两个三角滤波器的中心频率。三角滤波器的中心频率为:
f(m)=(N/F_S)·B^{-1}(B(f_l)+m·(B(f_h)-B(f_l))/(M+1)),m=1,2,…,M;其中B(f)=1125·ln(1+f/700)。
三角滤波器的频率响应为:
H_m(k)=(k-f(m-1))/(f(m)-f(m-1)),f(m-1)≤k≤f(m);
H_m(k)=(f(m+1)-k)/(f(m+1)-f(m)),f(m)<k≤f(m+1);
H_m(k)=0,k<f(m-1)或k>f(m+1)
其中,f_h、f_l分别为三角滤波器组覆盖的最高频率和最低频率;N为傅里叶变换的点数;F_S为采样频率;M为三角滤波器的个数;B^{-1}(b)=700(e^{b/1125}-1)是Mel频率变换f_mel=B(f)的逆函数。
(4)对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱S(m)。
(5)对S(m)做离散余弦变换(Discrete Cosine Transform,DCT),得到该语音帧的初始MFCC特征参数。离散余弦变换为:
C(n)=Σ_{m=0}^{M-1} S(m)·cos(πn(m+1/2)/M),n=1,2,…,L;其中L为MFCC特征参数的阶数。
MFCC中引入了三角滤波器组,且三角滤波器在低频段分布较密,在高频段分布较疏,符合人耳听觉特性,在噪声环境下仍具有较好的识别性能。
提取MFCC特征参数的步骤还可以包括:
(6)提取语音帧的动态差分MFCC特征参数。初始MFCC特征参数只反映了语音参数的静态特性,语音的动态特性可通过静态特征的差分谱来描述,动静态结合可以有效提升系统的识别性能,通常使用一阶和/或二阶差分MFCC特征参数。
在一具体实施例中,提取的MFCC特征参数为39维的特征矢量,包括13维初始MFCC特征参数、13维一阶差分MFCC特征参数和13维二阶差分MFCC特征参数。
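结合上述步骤(1)至(6),MFCC特征提取的整体流程可以用如下Python代码草图概括(滤波器个数、FFT点数以及差分的具体实现方式等均为常见取值或简化处理,仅作示意):

import numpy as np

def mel(f):
    """Mel频率变换 B(f)=1125*ln(1+f/700)。"""
    return 1125.0 * np.log(1.0 + f / 700.0)

def inv_mel(b):
    """Mel频率逆变换 B^{-1}(b)=700*(e^{b/1125}-1)。"""
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mel_filterbank(num_filters, nfft, sample_rate, f_low=0.0, f_high=None):
    """构造在Mel刻度上均匀分布的三角滤波器组,返回形状(num_filters, nfft//2+1)。"""
    if f_high is None:
        f_high = sample_rate / 2.0
    mel_points = np.linspace(mel(f_low), mel(f_high), num_filters + 2)
    bins = np.floor((nfft + 1) * inv_mel(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                      # 上升沿
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                      # 下降沿
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def dct_matrix(num_ceps, num_filters):
    """按 C(n)=Σ_m S(m)*cos(pi*n*(m+0.5)/M) 构造离散余弦变换矩阵。"""
    n = np.arange(num_ceps).reshape(-1, 1)
    m = np.arange(num_filters).reshape(1, -1)
    return np.cos(np.pi * n * (m + 0.5) / num_filters)

def mfcc(frames, sample_rate=16000, nfft=512, num_filters=26, num_ceps=13):
    """由加窗后的语音帧计算13维初始MFCC,并附加一阶、二阶差分,共39维。"""
    spectrum = np.fft.rfft(frames, n=nfft, axis=1)          # (1) DFT
    power = np.abs(spectrum) ** 2                           # (2) 离散能量谱
    fbank = mel_filterbank(num_filters, nfft, sample_rate)
    filter_out = power @ fbank.T                            # (3) 三角滤波器组输出
    log_power = np.log(filter_out + 1e-12)                  # (4) 对数功率谱S(m)
    ceps = log_power @ dct_matrix(num_ceps, num_filters).T  # (5) DCT得到初始MFCC
    delta1 = np.gradient(ceps, axis=0)                      # (6) 一阶差分(简化实现)
    delta2 = np.gradient(delta1, axis=0)                    #     二阶差分
    return np.concatenate([ceps, delta1, delta2], axis=1)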
在本申请的一个实施例中,在对预处理后的语音信号提取特征参数之后,还可以对提取的特征参数进行降维处理,得到降维后的特征参数。例如,采用分段均值数据降维算法对所述特征参数(例如MFCC特征参数)进行降维处理,得到降维后的特征参数。降维后的特征参数将用于后续的步骤。
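对于分段均值降维,这里按"将整段特征序列在时间上等分为若干段、每段取均值"来理解该算法,给出一个示意性的Python代码草图(分段数num_segments为示例性假设):

import numpy as np

def segment_mean_reduce(features, num_segments=10):
    """把形状为(帧数, 维数)的特征序列沿时间轴等分为num_segments段,
    每段取均值,得到(num_segments, 维数)的降维结果。"""
    assert features.shape[0] >= num_segments, "帧数应不少于分段数"
    segments = np.array_split(features, num_segments, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])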
解码单元203,用于根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本。
所述语音识别模型可以包括动态时间规整模型、隐马尔科夫模型、人工神经网络模型、支持向量机分类模型等。所述语音识别模型也可以是两种或两种以上所述模型的组合。
在本申请的一个实施例中,所述语音识别模型为隐马尔科夫模型(HMM)。所述HMM包括声学模型和语言模型。
声学模型(Acoustic Model):采用隐马尔科夫模型对音素建模。在语音领域中,并不是以单词,而是以子词为识别单位,子词是声学模型基本的声学单元。在英语中子词为音素,对于某个特定的单词,相应的声学模型由多个音素通过查找发音字典、按照其中的规则拼接而成。在汉语中子词为声母和韵母。每个子词可以用包括多个状态的HMM建模。举例来说,每一个音素可以用包含最多6个状态的HMM建模,每个状态可以用高斯混合模型(GMM)拟合对应的观测帧,观测帧按时序组合成观测序列。而每个声学模型可以生成长短不一的观测序列,即一对多映射。
语言模型(Language Model):是为了在语音识别的过程中有效地结合语法和语义的知识,提高识别率,减少搜索的范围。由于很难准确确定词的边界,以及声学模型描述语音变异性的能力有限,识别时将产生很多概率得分相似的词的序列。因此在实用的语音识别系统中通常使用语言模型P(W)从诸多候选结果中选择最有可能的词序列来补充声学模型的不足。
在本实施例中,采用基于规则的语言模型。基于规则的语言模型可以总结出语法规则乃至语义规则,然后用这些规则排除声学识别中不合语法规则或语义规则的结果。统计语言模型通过统计概率描述词与词之间的依赖关系,间接地对语法或语义规则进行编码。
解码就是在状态网络中搜索一条最佳路径,语音对应这条路径的概率最大。在本实施例中,利用动态规划算法(即Viterbi算法)寻找全局最优路径。
假设特征提取单元202提取的特征参数为特征向量Y,通过解码算法寻找最有可能生成Y的词序列w_{1:L}=w_1,w_2,…,w_L。
解码算法是求解使得后验概率P(w|Y)最大所对应的参数w,即:
w_best=argmax_w{p(w|Y)}
由贝叶斯定理将上式转化为:
w_best=argmax_w{p(Y|w)·p(w)/p(Y)}
由于观测概率P(Y)在给定观测序列下是常数,上式可进一步简化为:
w_best=argmax_w{p(Y|w)·p(w)}
其中先验概率p(w)由语言模型决定,似然概率p(Y|w)由声学模型决定。通过以上计算即可得出使后验概率p(w|Y)最大所对应的词序列w。
摘要提取单元204,用于从所述以句子为单位的文本中提取摘要句。
解码单元203将语音信号解码为以句子为单位的文本,在常规的语音识别系统中,语音识别工作已经完成。在本申请中,摘要提取单元204从识别出来的以句子为单位的文本中提取出摘要句。
提取摘要句的目的是从语音中抽取重要的信息,剔除无用信息。
摘要提取单元204通过HMM模型提取摘要句。此时,HMM模型的双重随机关系可以描述为:一重随机关系是句子序列的释放(即观测的生成),是可观察的;另一重随机关系是各句子是否应被归为摘要句的性质,是不可观察的。因此,用HMM模型提取摘要句的过程可以描述为:给定句子序列O={O_1,O_2,…,O_n},求出各句子属于摘要句的最大可能性。主要步骤如下:
(1)获得以句子为单位的文本的观察状态序列O={O_1,O_2,…,O_n};
(2)确定HMM隐含状态。可以设置5个隐含状态。可以把隐含状态设置为“1”-符合,“2”-较符合,“3”-一般,“4”-较不符合,“5”-不符合,用来依次表示句子符合摘要句的程度。
(3)进行HMM参数估计。首先随机产生初始的概率参数,经过不断地迭代,当达到设定的阈值时,停止计算,得到适合的HMM参数。
(4)根据训练好的HMM,通过Viterbi算法对句子进行标记,得到各个句子符合摘要句的符合度。
(5)将满足预设符合度的句子(例如至少较符合的句子)从所述以句子为单位的文本中提取出来,得到所述以句子为单位的文本中的摘要句。
实施例二的语音处理装置10对语音信号进行预处理;对预处理后的语音信号提取特征参数;根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;通过隐马尔科夫模型HMM从以句子为单位的文本中提取摘要句。实施例二不仅将语音信息转化为文字,还提取文字中的摘要句进行输出,剔除了语音识别结果中的无用信息,获得了更好的语音处理结果。
在另一实施例中,特征提取单元202在提取MFCC特征参数时,可以进行声道长度归一化(Vocal Tract Length Normalization,VTLN),得到声道长度归一化的MFCC特征参数。
声道可以表示为级联声管模型,每个声管都可以看成是一个谐振腔,它们的共振频率取决于声管的长度和形状。因此,说话人之间的部分声学差异是由于说话人的声道长度不同。例如,声道长度的变化范围一般从13cm(成年女性)变化到18cm(成年男性),因此,不同性别的人说同一个元音的共振峰频率相差很大。VTLN就是为了消除男、女声道长度的差异,使语音识别的结果不受说话人性别差异的干扰。
VTLN可以通过弯折和平移频率坐标来使各说话人的共振峰频率相匹配。在本实施例中,可以采用基于双线性变换的VTLN方法。该基于双线性变换的VTLN方法并不直接对语音信号的频谱进行折叠,而是采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;根据所述频率弯折因子,采用双线性变换对三角滤波器组的位置(例如三角滤波器的起点、中间点和结束点)和宽度进行调整;根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。例如,若要对语音信号进行频谱压缩,则对三角滤波器的刻度进行拉伸,此时三角滤波器组向左扩展和移动。若要对语音信号进行频谱拉伸,则对三角滤波器的刻度进行压缩,此时三角滤波器组向右压缩和移动。采用该基于双线性变换的VTLN方法对特定人群或特定人进行声道归一化时,仅需要对三角滤波器组系数进行一次变换即可,无需每次在提取特征参数时都对信号频谱折叠,从而大大减小了计算量。并且,该基于双线性变换的VTLN方法避免了对频率因子线性搜索,减小了运算复杂度。同时,该基于双线性变换的VTLN方法利用双线性变换,使弯折的频率连续且无带宽改变。
实施例三
本实施例提供一种非易失性可读存储介质,该非易失性可读存储介质上存储有计算机可读指令,该计算机可读指令被处理器执行时实现上述语音处理方法实施例中的步骤,例如图1所示的步骤101-104:
步骤101,对语音信号进行预处理;
步骤102,对预处理后的语音信号提取特征参数;
步骤103,根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
步骤104,通过隐马尔科夫模型HMM从以句子为单位的文本中提取摘要句。
或者,该计算机可读指令被处理器执行时实现上述装置实施例中各模块/单元的功能,例如图2中的单元201-204:
预处理单元201,用于对语音信号进行预处理;
特征提取单元202,用于对预处理后的语音信号提取特征参数;
解码单元203,用于根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
摘要提取单元204,用于通过隐马尔科夫模型HMM从以句子为单位的文本中提取摘要句。
实施例四
图3为本申请实施例四提供的计算机装置的示意图。所述计算机装置1包括存储器20、处理器30以及存储在所述存储器20中并可在所述处理器30上运行的计算机可读指令40,例如语音处理程序。所述处理器30执行所述计算机可读指令40时实现上述语音处理方法实施例中的步骤,例如图1所示的步骤101-104:
步骤101,对语音信号进行预处理;
步骤102,对预处理后的语音信号提取特征参数;
步骤103,根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
步骤104,通过隐马尔科夫模型HMM从以句子为单位的文本中提取摘要句。
或者,所述处理器30执行所述计算机可读指令40时实现上述装置实施例中各模块/单元的功能,例如图2中的单元201-204:
预处理单元201,用于对语音信号进行预处理;
特征提取单元202,用于对预处理后的语音信号提取特征参数;
解码单元203,用于根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
摘要提取单元204,用于通过隐马尔科夫模型HMM从以句子为单位的文本中提取摘要句。
示例性的,所述计算机可读指令40可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器20中,并由所述处理器30执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机可读指令段,该指令段用于描述所述计算机可读指令40在所述计算机装置1中的执行过程。例如,所述计算机可读指令40可以被分割成图2中的预处理单元201、特征提取单元202、解码单元203、摘要提取单元204,各单元具体功能参见实施例二。
所述计算机装置1可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。本领域技术人员可以理解,所述示意图3仅仅是计算机装置1的示例,并不构成对计算机装置1的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如所述计算机装置1还可以包括输入输出设备、网络接入设备、总线等。
所称处理器30可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器30也可以是任何常规的处理器等,所述处理器30是所述计算机装置1的控制中心,利用各种接口和线路连接整个计算机装置1的各个部分。
所述存储器20可用于存储所述计算机可读指令40和/或模块/单元,所述处理器30通过运行或执行存储在所述存储器20内的计算机可读指令和/或模块/单元,以及调用存储在存储器20内的数据,实现所述计算机装置1的各种功能。所述存储器20可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据计算机装置1的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器20可以包括高速随机存取存储器,还可以包括非易失性存储器,例如硬盘、内存、插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)、至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。
所述计算机装置1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个非易失性可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,也可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性可读存储介质中,该计算机可读指令在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机可读指令代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述非易失性可读介质可以包括:能够携带所述计算机可读指令代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述非易失性可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,非易失性可读介质不包括电载波信号和电信信号。
在本申请所提供的几个实施例中,应该理解到,所揭露的计算机装置和方法,可以通过其它的方式实现。例如,以上所描述的计算机装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
另外,在本申请各个实施例中的各功能单元可以集成在相同处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在相同单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。计算机装置权利要求中陈述的多个单元或计算机装置也可以由同一个单元或计算机装置通过软件或者硬件来实现。第一,第二等词语用来表示名称,而并不表示任何特定的顺序。
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。

Claims (20)

  1. 一种语音处理方法,其特征在于,所述方法包括:
    对语音信号进行预处理;
    对预处理后的语音信号提取特征参数;
    根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
    通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句。
  2. 如权利要求1所述的方法,其特征在于,所述通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句,具体包括:
    获得所述以句子为单位的文本的观察状态序列O={O_1,O_2,…,O_n};
    确定HMM的隐含状态;
    进行HMM参数估计,得到训练好的HMM;
    根据所述训练好的HMM,通过Viterbi算法对所述句子进行标记,得到各个句子符合摘要句的符合度;
    将满足预设符合度的句子从所述以句子为单位的文本中提取出来,得到所述以句子为单位的文本中的摘要句。
  3. 如权利要求1所述的方法,其特征在于,所述对语音信号进行预处理包括检测所述语音信号中的有效语音,具体包括:
    对所述语音信号进行加窗分帧,得到所述语音信号的语音帧;
    对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;
    根据所述语音帧的频谱计算各个频带的累计能量;
    对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;
    将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。
  4. 如权利要求1所述的方法,其特征在于,所述特征参数包括初始梅尔倒谱系数MFCC特征参数、一阶差分MFCC特征参数和二阶差分MFCC特征参数。
  5. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    对所述特征参数进行降维处理,得到降维后的特征参数。
  6. 如权利要求1所述的方法,其特征在于,所述对预处理后的语音信号提取特征参数包括对预处理后的语音信号提取梅尔倒谱系数MFCC特征参数,具体包括:
    采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;
    根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;
    根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。
  7. 如权利要求1所述的方法,其特征在于,所述对预处理后的语音信号提取特征参数包括对预处理后的语音信号提取梅尔倒谱系数MFCC特征参数,具体包括:
    对每个语音帧进行离散傅里叶变换DFT,得到该语音帧的频谱;
    求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱;
    将该语音帧的离散能量谱通过Mel频率上均匀分布的三角滤波器组,得到各个三角滤波器的输出;
    对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱;
    对所述对数功率谱进行离散余弦变换DCT,得到该语音帧的初始MFCC特征参数。
  8. 一种语音处理装置,其特征在于,所述装置包括:
    预处理单元,用于对语音信号进行预处理;
    特征提取单元,用于对预处理后的语音信号提取特征参数;
    解码单元,用于根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
    摘要提取单元,用于通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句。
  9. 一种计算机装置,其特征在于,所述计算机装置包括处理器和存储器,所述存储器用于存储至少一个计算机可读指令,所述处理器用于执行所述至少一个计算机可读指令以实现以下步骤:
    对语音信号进行预处理;
    对预处理后的语音信号提取特征参数;
    根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
    通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句。
  10. 如权利要求9所述的计算机装置,其特征在于,所述通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句,具体包括:
    获得所述以句子为单位的文本的观察状态序列O={O_1,O_2,…,O_n};
    确定HMM的隐含状态;
    进行HMM参数估计,得到训练好的HMM;
    根据所述训练好的HMM,通过Viterbi算法对所述句子进行标记,得到各个句子符合摘要句的符合度;
    将满足预设符合度的句子从所述以句子为单位的文本中提取出来,得到所述以句子为单位的文本中的摘要句。
  11. 如权利要求9所述的计算机装置,其特征在于,所述对语音信号进行预处理包括检测所述语音信号中的有效语音,具体包括:
    对所述语音信号进行加窗分帧,得到所述语音信号的语音帧;
    对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;
    根据所述语音帧的频谱计算各个频带的累计能量;
    对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;
    将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。
  12. 如权利要求9所述的计算机装置,其特征在于,所述处理器还用于执行所述至少一个计算机可读指令以实现以下步骤:
    对所述特征参数进行降维处理,得到降维后的特征参数。
  13. 如权利要求9所述的计算机装置,其特征在于,所述对预处理后的语音信号提取特征参数包括对预处理后的语音信号提取梅尔倒谱系数MFCC特征参数,具体包括:
    采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;
    根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;
    根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。
  14. 如权利要求9所述的计算机装置,其特征在于,所述对预处理后的语音信号提取特征参数包括对预处理后的语音信号提取梅尔倒谱系数MFCC特征参数,具体包括:
    对每个语音帧进行离散傅里叶变换DFT,得到该语音帧的频谱;
    求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱;
    将该语音帧的离散能量谱通过Mel频率上均匀分布的三角滤波器组,得到各个三角滤波器的输出;
    对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱;
    对所述对数功率谱进行离散余弦变换DCT,得到该语音帧的初始MFCC特征参数。
  15. 一种非易失性可读存储介质,所述非易失性可读存储介质上存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现以下步骤:
    对语音信号进行预处理;
    对预处理后的语音信号提取特征参数;
    根据所述特征参数,利用预先训练好的语音识别模型对所述语音信号进行解码,得到以句子为单位的文本;
    通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句。
  16. 如权利要求15所述的存储介质,其特征在于,所述通过隐马尔科夫模型HMM从所述以句子为单位的文本中提取摘要句,具体包括:
    获得所述以句子为单位的文本的观察状态序列O={O_1,O_2,…,O_n};
    确定HMM的隐含状态;
    进行HMM参数估计,得到训练好的HMM;
    根据所述训练好的HMM,通过Viterbi算法对所述句子进行标记,得到各个句子符合摘要句的符合度;
    将满足预设符合度的句子从所述以句子为单位的文本中提取出来,得到所述以句子为单位的文本中的摘要句。
  17. 如权利要求15所述的存储介质,其特征在于,所述对语音信号进行预处理包括检测所述语音信号中的有效语音,具体包括:
    对所述语音信号进行加窗分帧,得到所述语音信号的语音帧;
    对所述语音帧进行离散傅里叶变换,得到所述语音帧的频谱;
    根据所述语音帧的频谱计算各个频带的累计能量;
    对所述各个频带的累计能量进行对数运算,得到所述各个频带的累计能量对数值;
    将所述各个频带的累计能量对数值与预设阈值进行比较,得到所述有效语音。
  18. 如权利要求15所述的存储介质,其特征在于,所述至少一个计算机可读指令被所述处理器执行时还实现以下步骤:
    对所述特征参数进行降维处理,得到降维后的特征参数。
  19. 如权利要求15所述的存储介质,其特征在于,所述对预处理后的语音信号提取特征参数包括对预处理后的语音信号提取梅尔倒谱系数MFCC特征参数,具体包括:
    采用双线性变换低通滤波器截止频率的映射公式,计算对齐不同说话人平均第三共振峰的频率弯折因子;
    根据所述频率弯折因子,采用双线性变换对MFCC特征参数提取所使用的三角滤波器组的位置和宽度进行调整;
    根据调整后的三角滤波器组计算声道归一化的MFCC特征参数。
  20. 如权利要求15所述的存储介质,其特征在于,所述对预处理后的语音信号提取特征参数包括对预处理后的语音信号提取梅尔倒谱系数MFCC特征参数,具体包括:
    对每个语音帧进行离散傅里叶变换DFT,得到该语音帧的频谱;
    求该语音帧的频谱幅度的平方,得到该语音帧的离散能量谱;
    将该语音帧的离散能量谱通过Mel频率上均匀分布的三角滤波器组,得到各个三角滤波器的输出;
    对所有三角滤波器的输出做对数运算,得到该语音帧的对数功率谱;
    对所述对数功率谱进行离散余弦变换DCT,得到该语音帧的初始MFCC特征参数。
PCT/CN2018/108190 2018-08-08 2018-09-28 语音处理方法及装置、计算机装置及可读存储介质 WO2020029404A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810897646.2 2018-08-08
CN201810897646.2A CN109036381A (zh) 2018-08-08 2018-08-08 语音处理方法及装置、计算机装置及可读存储介质

Publications (1)

Publication Number Publication Date
WO2020029404A1 true WO2020029404A1 (zh) 2020-02-13

Family

ID=64632382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108190 WO2020029404A1 (zh) 2018-08-08 2018-09-28 语音处理方法及装置、计算机装置及可读存储介质

Country Status (2)

Country Link
CN (1) CN109036381A (zh)
WO (1) WO2020029404A1 (zh)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872714A (zh) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 一种提高语音识别准确性的方法、电子设备及存储介质
CN109741761B (zh) * 2019-03-13 2020-09-25 百度在线网络技术(北京)有限公司 声音处理方法和装置
CN110300001B (zh) * 2019-05-21 2022-03-15 深圳壹账通智能科技有限公司 会议音频控制方法、系统、设备及计算机可读存储介质
CN112420070A (zh) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 自动标注方法、装置、电子设备及计算机可读存储介质
CN110738991A (zh) * 2019-10-11 2020-01-31 东南大学 基于柔性可穿戴传感器的语音识别设备
CN111128178A (zh) * 2019-12-31 2020-05-08 上海赫千电子科技有限公司 一种基于面部表情分析的语音识别方法
CN111509842A (zh) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 一种带切割机特征量识别的电缆防破坏预警装置
CN111509843A (zh) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 一种带机械破碎锤特征量识别的电缆防破坏预警装置
CN111509841A (zh) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 一种带挖掘机特征量识别的电缆防外力破坏预警装置
CN111933116B (zh) * 2020-06-22 2023-02-14 厦门快商通科技股份有限公司 语音识别模型训练方法、系统、移动终端及存储介质
CN111968622A (zh) * 2020-08-18 2020-11-20 广州市优普科技有限公司 一种基于注意力机制的语音识别方法、系统及装置
CN112201253B (zh) * 2020-11-09 2023-08-25 观华(广州)电子科技有限公司 文字标记方法、装置、电子设备及计算机可读存储介质
CN112562646A (zh) * 2020-12-09 2021-03-26 江苏科技大学 一种机器人语音识别方法
CN115063895A (zh) * 2022-06-10 2022-09-16 深圳市智远联科技有限公司 一种基于语音识别的售票方法及售票系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464898A (zh) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 一种提取文本主题词的方法
CN102436809A (zh) * 2011-10-21 2012-05-02 东南大学 英语口语机考系统中网络语音识别方法
CN103646094A (zh) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 实现视听类产品内容摘要自动提取生成的系统及方法
US20140180694A1 (en) * 2012-06-06 2014-06-26 Spansion Llc Phoneme Score Accelerator
CN108305632A (zh) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 一种会议的语音摘要形成方法及系统

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4256393B2 (ja) * 2001-08-08 2009-04-22 日本電信電話株式会社 音声処理方法及びそのプログラム
WO2004049188A1 (en) * 2002-11-28 2004-06-10 Agency For Science, Technology And Research Summarizing digital audio data
GB2409750B (en) * 2004-01-05 2006-03-15 Toshiba Res Europ Ltd Speech recognition system and technique
US9020816B2 (en) * 2008-08-14 2015-04-28 21Ct, Inc. Hidden markov model for speech processing with training method
JP5346327B2 (ja) * 2010-08-10 2013-11-20 日本電信電話株式会社 対話学習装置、要約装置、対話学習方法、要約方法、プログラム
CN103021408B (zh) * 2012-12-04 2014-10-22 中国科学院自动化研究所 一种发音稳定段辅助的语音识别优化解码方法及装置
US10140262B2 (en) * 2015-05-04 2018-11-27 King Fahd University Of Petroleum And Minerals Systems and associated methods for Arabic handwriting synthesis and dataset design
CN106446109A (zh) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 语音文件摘要的获取方法和装置
CN107403619B (zh) * 2017-06-30 2021-05-28 武汉泰迪智慧科技有限公司 一种应用于自行车环境的语音控制方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464898A (zh) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 一种提取文本主题词的方法
CN102436809A (zh) * 2011-10-21 2012-05-02 东南大学 英语口语机考系统中网络语音识别方法
US20140180694A1 (en) * 2012-06-06 2014-06-26 Spansion Llc Phoneme Score Accelerator
CN103646094A (zh) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 实现视听类产品内容摘要自动提取生成的系统及方法
CN108305632A (zh) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 一种会议的语音摘要形成方法及系统

Also Published As

Publication number Publication date
CN109036381A (zh) 2018-12-18

Similar Documents

Publication Publication Date Title
WO2020029404A1 (zh) 语音处理方法及装置、计算机装置及可读存储介质
McAuliffe et al. Montreal forced aligner: Trainable text-speech alignment using kaldi.
Arora et al. Automatic speech recognition: a review
Gerosa et al. Acoustic variability and automatic recognition of children’s speech
Bhangale et al. A review on speech processing using machine learning paradigm
CN109741732B (zh) 命名实体识别方法、命名实体识别装置、设备及介质
Shahnawazuddin et al. Pitch-Adaptive Front-End Features for Robust Children's ASR.
Bezoui et al. Feature extraction of some Quranic recitation using mel-frequency cepstral coeficients (MFCC)
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
Fukuda et al. Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition
Khelifa et al. Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system
Jothilakshmi et al. Large scale data enabled evolution of spoken language research and applications
Karpov An automatic multimodal speech recognition system with audio and video information
Vegesna et al. Application of emotion recognition and modification for emotional Telugu speech recognition
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
Kadyan et al. Developing children’s speech recognition system for low resource Punjabi language
Revathi et al. Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach
Hafen et al. Speech information retrieval: a review
Zolnay et al. Using multiple acoustic feature sets for speech recognition
Furui Selected topics from 40 years of research on speech and speaker recognition.
Thalengala et al. Study of sub-word acoustical models for Kannada isolated word recognition system
Tawaqal et al. Recognizing five major dialects in Indonesia based on MFCC and DRNN
Sayem Speech analysis for alphabets in Bangla language: automatic speech recognition
Nwe et al. Myanmar language speech recognition with hybrid artificial neural network and hidden Markov model
Laleye et al. Automatic text-independent syllable segmentation using singularity exponents and rényi entropy

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18929555

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18929555

Country of ref document: EP

Kind code of ref document: A1