CN109036381A - Speech processing method and device, computer device, and readable storage medium - Google Patents


Info

Publication number
CN109036381A
Authority
CN
China
Prior art keywords
sentence
voice signal
characteristic parameter
unit
text
Prior art date
Legal status
Pending
Application number
CN201810897646.2A
Other languages
Chinese (zh)
Inventor
王健宗
王珏
肖京
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810897646.2A
Priority to PCT/CN2018/108190 (WO2020029404A1)
Publication of CN109036381A


Classifications

    • G10L15/02 — Speech recognition; Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/142 — Speech recognition; Speech classification or search using statistical models; Hidden Markov Models [HMMs]
    • G10L15/26 — Speech recognition; Speech to text systems
    • G10L19/02 — Speech or audio signal analysis-synthesis for redundancy reduction; using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/18 — Speech or voice analysis techniques; the extracted parameters being spectral information of each sub-band
    • G10L25/24 — Speech or voice analysis techniques; the extracted parameters being the cepstrum
    • G10L25/51 — Speech or voice analysis techniques specially adapted for particular use; for comparison or discrimination

Abstract

A speech processing method, the method comprising: pre-processing a voice signal; extracting characteristic parameters from the pre-processed voice signal; decoding the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracting summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). The present invention also provides a speech processing device, a computer device, and a computer-readable storage medium. The present invention can recognize speech and remove useless information from the speech recognition result.

Description

Speech processing method and device, computer device, and readable storage medium
Technical field
The present invention relates to the field of computer audio technology, and in particular to a speech processing method and device, a computer device, and a computer-readable storage medium.
Background technique
In intelligent conference systems, speech recognition is a key technology: it converts a speaker's voice signal into text information that a computer can recognize and outputs it.
However, existing intelligent conference systems only convert speech to text and cannot further process the recognized text information. Text converted directly from speech may contain useless information, such as sentences unrelated to the conference content.
Summary of the invention
In view of the above, it is necessary to propose a speech processing method and device, a computer device, and a computer-readable storage medium that can recognize speech and remove useless information from the speech recognition result.
A first aspect of the application provides a speech processing method, the method comprising:
pre-processing a voice signal;
extracting characteristic parameters from the pre-processed voice signal;
decoding the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
extracting summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
In a possible implementation, extracting summary sentences from the sentence-unit text using the HMM specifically includes:
obtaining an observation sequence O = {O1, O2, ..., On} of the sentence-unit text;
determining the hidden states of the HMM;
estimating the HMM parameters to obtain a trained HMM;
labelling the sentences with the Viterbi algorithm according to the trained HMM to obtain, for each sentence, a degree of conformity with a summary sentence;
extracting the sentences that meet a preset degree of conformity from the sentence-unit text to obtain the summary sentences of the sentence-unit text.
In another possible implementation, pre-processing the voice signal includes detecting the active speech in the voice signal, which specifically includes:
windowing and framing the voice signal to obtain the speech frames of the voice signal;
performing a discrete Fourier transform on each speech frame to obtain its spectrum;
calculating the accumulated energy of each frequency band from the spectrum of the speech frame;
taking the logarithm of the accumulated energy of each frequency band to obtain its log accumulated energy;
comparing the log accumulated energy of each frequency band with a preset threshold to obtain the active speech.
In another possible implementation, the characteristic parameters include initial Mel-frequency cepstral coefficient (MFCC) features, first-order difference MFCC features, and second-order difference MFCC features.
In another possible implementation, the method further includes:
performing dimensionality reduction on the characteristic parameters to obtain reduced-dimension characteristic parameters.
In another possible implementation, extracting characteristic parameters from the pre-processed voice signal includes extracting MFCC features from the pre-processed voice signal, which specifically includes:
calculating a frequency warping factor that aligns the third formant of each speaker, using the cutoff-frequency mapping equation of a bilinear-transform low-pass filter;
adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used in MFCC feature extraction by means of the bilinear transform;
calculating vocal-tract-normalized MFCC features with the adjusted triangular filter bank.
In another possible implementation, extracting characteristic parameters from the pre-processed voice signal includes extracting MFCC features from the pre-processed voice signal, which specifically includes:
performing a discrete Fourier transform (DFT) on each speech frame to obtain the spectrum of the speech frame;
squaring the magnitude spectrum of the speech frame to obtain its power spectrum;
passing the power spectrum of the speech frame through a triangular filter bank uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter;
taking the logarithm of the outputs of all the triangular filters to obtain the log power spectrum of the speech frame;
applying a discrete cosine transform to the log power spectrum to obtain the initial MFCC features of the speech frame.
A second aspect of the application provides a speech processing device, the device comprising:
a pre-processing unit, configured to pre-process a voice signal;
a feature extraction unit, configured to extract characteristic parameters from the pre-processed voice signal;
a decoding unit, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
a summary extraction unit, configured to extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
A third aspect of the application provides a computer device, the computer device comprising a processor, where the processor implements the speech processing method when executing a computer program stored in a memory.
A fourth aspect of the application provides a computer-readable storage medium on which a computer program is stored, where the computer program implements the speech processing method when executed by a processor.
The present invention pre-processes a voice signal; extracts characteristic parameters from the pre-processed voice signal; decodes the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). The present invention not only converts voice information into text, but also extracts and outputs the summary sentences in the text, removing useless information from the speech recognition result and thus obtaining a better speech processing result.
Brief description of the drawings
Fig. 1 is a flowchart of the speech processing method provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of the speech processing device provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the computer device provided by an embodiment of the present invention.
Specific embodiments
To better understand the objects, features, and advantages of the present invention, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth to facilitate a full understanding of the present invention. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit the present invention.
Preferably, the speech processing method of the present invention is applied in one or more computer devices. A computer device is an apparatus capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or other computing equipment. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice-control device, or the like.
Embodiment one
Fig. 1 is a flowchart of the speech processing method provided by Embodiment One of the present invention. The speech processing method is applied to a computer device. The method recognizes text in units of sentences from a voice signal and then extracts summary sentences from the sentence-unit text.
As shown in Fig. 1, the speech processing method specifically includes the following steps:
Step 101: pre-process the voice signal.
The voice signal may be an analog voice signal or a digital voice signal. If the voice signal is an analog voice signal, it is converted into a digital voice signal by analog-to-digital conversion.
The present invention is intended for continuous speech recognition, i.e., processing a continuous audio stream. In one embodiment of the present invention, the speech processing method is applied in an intelligent conference system, and the voice signal is a speaker's voice signal input to the intelligent conference system through a voice input device (such as a microphone or a mobile phone microphone).
Pre-processing the voice signal may include pre-emphasizing the voice signal.
The purpose of pre-emphasis is to boost the high-frequency components of speech and flatten the signal spectrum. Because of glottal excitation and mouth-nose radiation, the energy of a voice signal drops off noticeably at high frequencies: the higher the frequency, the smaller the amplitude, with the power spectrum falling by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis, the high-frequency part of the voice signal needs to be boosted, i.e., the voice signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function may be:
H(z) = 1 − κz^(−1), 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis factor, preferably between 0.94 and 0.97.
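For illustration, a minimal pre-emphasis sketch in Python/NumPy (the function name and the default κ = 0.97 are assumptions for illustration, not the patent's reference implementation):
```python
import numpy as np

def pre_emphasis(signal: np.ndarray, k: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - k*z^-1 to boost high frequencies."""
    # y[n] = x[n] - k * x[n-1]; the first sample is passed through unchanged.
    return np.append(signal[0], signal[1:] - k * signal[:-1])
```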
Pre-processing the voice signal may also include windowing and framing the voice signal.
A voice signal is a non-stationary, time-varying signal, broadly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal, the vocal-tract parameters, and so on all vary slowly over time, and within a window of 10 ms to 30 ms the signal can be considered short-time stationary. In speech signal processing, the voice signal can therefore be divided into short segments (short-time stationary signals) for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is implemented by applying a window to the voice signal. To avoid excessive variation between two adjacent frames, the frames overlap one another. In one embodiment of the invention, each speech frame is 25 milliseconds long, with a 15-millisecond overlap between adjacent frames, i.e., a speech frame is taken every 10 milliseconds. A framing sketch follows the window definitions below.
Common window functions are the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1, 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hamming window function is:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hanning window function is:
w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
Here N is the number of sampling points contained in one speech frame.
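A short framing-and-windowing sketch under the parameters above (25 ms frames taken every 10 ms; the 16 kHz sampling rate and a signal at least one frame long are illustrative assumptions):
```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms -> 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 10 ms hop -> 15 ms overlap
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)                   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window                           # one windowed frame per row
```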
Pre-processing the voice signal may also include detecting the active speech in the voice signal.
The purpose of active speech detection is to reject non-speech segments from the voice signal and keep the speech segments, thereby reducing the computation of feature extraction, improving its accuracy, shortening the recognition time, and improving the recognition rate. Active speech detection can be performed using, for example, the short-time energy and short-time zero-crossing rate of the voice signal.
In one embodiment, let the n-th speech frame of the voice signal be x_n(m). Its short-time energy is:
E_n = Σ_{m=0}^{N−1} x_n(m)²,
and its short-time zero-crossing rate is:
Z_n = (1/2) Σ_{m=1}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0.
The beginning and end of the active speech in the voice signal can be detected with the double-threshold method, which is well known in the art and not described again here.
In another embodiment, the active speech in the voice signal can be detected by the following method (see the sketch after this list):
(1) Window and frame the voice signal to obtain the speech frames x(n). In one specific embodiment, a Hamming window may be applied with a frame length of 20 ms and a frame shift of 10 ms. If the voice signal has already been windowed and framed during pre-processing, this step is omitted.
(2) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT) on each speech frame x(n) to obtain its spectrum:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), 0 ≤ k < N.
(3) Calculate the accumulated energy of each frequency band from the spectrum X(k):
E(m) = Σ_{k=m1}^{m2} |X(k)|²,
where E(m) is the accumulated energy of the m-th frequency band and (m1, m2) are the start and end bins of the m-th band.
(4) Take the logarithm of the accumulated energy of each frequency band to obtain its log accumulated energy.
(5) Compare the log accumulated energy of each frequency band with a preset threshold to obtain the active speech: if the log accumulated energy of a band is above the preset threshold, the speech corresponding to that band is active speech.
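A hedged NumPy sketch of steps (1)-(5) above (the band boundaries and threshold value are illustrative assumptions):
```python
import numpy as np

def band_energy_vad(frames: np.ndarray, threshold_db: float = 30.0,
                    band: tuple = (4, 64)) -> np.ndarray:
    """Flag frames whose log band energy exceeds a preset threshold."""
    spectrum = np.fft.rfft(frames, axis=1)        # step (2): DFT of each frame
    power = np.abs(spectrum) ** 2
    m1, m2 = band                                 # step (3): accumulate bins m1..m2
    log_energy = 10 * np.log10(power[:, m1:m2].sum(axis=1) + 1e-10)  # step (4)
    return log_energy > threshold_db              # step (5): True = active speech
```
Combined with the framing sketch above, this yields a boolean mask selecting the active speech frames.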
Step 102: extract characteristic parameters from the pre-processed voice signal.
Feature extraction analyzes the voice signal and extracts a sequence of acoustic parameters that reflect the essential features of the speech.
The extracted characteristic parameters may include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and may also include transform-domain parameters such as linear prediction coefficients (Linear Prediction Coefficient, LPC), linear prediction cepstral coefficients (Linear Prediction Cepstrum Coefficient, LPCC), Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), and perceptual linear prediction (Perceptual Linear Predictive, PLP) coefficients.
In one embodiment of the present invention, the MFCC features of the voice signal are extracted. The steps of extracting MFCC features are as follows (a library-based sketch follows this list):
(1) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT, which may be implemented as a fast Fourier transform) on each speech frame to obtain the spectrum of the speech frame.
(2) Square the magnitude spectrum of the speech frame to obtain its power spectrum.
(3) Pass the power spectrum of the speech frame through a bank of triangular filters uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter. The center frequencies of the filter bank are evenly spaced on the Mel scale, and the two base points of each triangular filter lie at the center frequencies of the two adjacent filters. The center frequency of the m-th triangular filter is:
f(m) = (N/F_s)·B^(−1)(B(f_l) + m·(B(f_h) − B(f_l))/(M+1)),
and the frequency response of the m-th triangular filter is the piecewise-linear triangle:
H_m(k) = (k − f(m−1))/(f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); H_m(k) = (f(m+1) − k)/(f(m+1) − f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 otherwise,
where f_h and f_l are the highest and lowest frequencies of the filter bank, N is the number of Fourier transform points, F_s is the sampling frequency, M is the number of triangular filters, B(f) = 1125·ln(1 + f/700) is the Mel scale, and B^(−1)(b) = 700·(e^(b/1125) − 1) is its inverse.
(4) Take the logarithm of the outputs of all the triangular filters to obtain the log power spectrum S(m) of the speech frame.
(5) Apply a discrete cosine transform (Discrete Cosine Transform, DCT) to S(m) to obtain the initial MFCC features of the speech frame:
C(n) = Σ_{m=0}^{M−1} S(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L.
MFCC extraction introduces the triangular filter bank, whose filters are densely distributed in the low-frequency range and sparsely distributed in the high-frequency range; this matches human auditory characteristics and retains good recognition performance in noisy environments.
The steps of extracting MFCC features may further include:
(6) Extract dynamic difference MFCC features of the speech frame from its initial MFCC features. The initial MFCC features only reflect the static characteristics of the speech; the dynamic characteristics can be described by difference spectra of the static features, and combining static and dynamic features effectively improves the recognition performance of the system. First-order and/or second-order difference MFCC features are usually used.
In one embodiment, the extracted MFCC feature is a 39-dimensional feature vector, comprising 13 initial MFCC coefficients, 13 first-order difference MFCC coefficients, and 13 second-order difference MFCC coefficients.
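As an illustration, the 39-dimensional feature vector described above can be computed with librosa's standard MFCC pipeline (a sketch under an assumed input file and sampling rate, not the patent's implementation):
```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # steps (1)-(5): 13 static coefficients
d1 = librosa.feature.delta(mfcc, order=1)           # step (6): first-order differences
d2 = librosa.feature.delta(mfcc, order=2)           # step (6): second-order differences
features = np.concatenate([mfcc, d1, d2], axis=0)   # 39 x n_frames feature matrix
```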
In one implementation of the present invention, after the characteristic parameters are extracted from the pre-processed voice signal, dimensionality reduction may also be performed on the extracted characteristic parameters to obtain reduced-dimension characteristic parameters. For example, a segment-mean data reduction algorithm is applied to the characteristic parameters (such as the MFCC features) to obtain the reduced-dimension characteristic parameters, which are then used in the subsequent steps.
Step 103: decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences.
The speech recognition model may include a dynamic time warping model, a Hidden Markov Model, an artificial neural network model, a support vector machine classification model, and the like, or a combination of two or more of these models.
In one embodiment of the present invention, the speech recognition model is a Hidden Markov Model (HMM) comprising an acoustic model and a language model.
Acoustic model (Acoustic Model): phonemes are modeled with Hidden Markov Models. In speech recognition, the recognition unit is not the word but the sub-word, the basic acoustic unit of the acoustic model. In English the sub-words are phonemes; for a specific word, the corresponding acoustic model is assembled from several phonemes by looking up the pronunciation rules of a pronouncing dictionary. In Chinese the sub-words are initials and finals. Each sub-word can be modeled with an HMM of several states. For example, each phoneme can be modeled with an HMM of up to 6 states, each state fitting its observation frames with a Gaussian mixture model (GMM); the observation frames are combined in time order into an observation sequence. Each acoustic model can generate observation sequences of different lengths, i.e., a one-to-many mapping.
Language model (Language Model): during speech recognition, the language model effectively combines syntactic and semantic knowledge to improve the recognition rate and reduce the search space. Because word boundaries are hard to determine exactly and the acoustic model's ability to describe pronunciation variability is limited, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore use a language model P(W) to select the most likely word sequence from the many candidate results, compensating for the deficiencies of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules and uses them to exclude acoustically recognized results that violate those rules. A statistical language model instead describes the dependencies between words with statistical probabilities, encoding grammatical or semantic rules indirectly.
Decoding searches the state network for an optimal path such that the probability of the speech given this path is maximal. In this embodiment, the globally optimal path is found with a dynamic programming algorithm, the Viterbi algorithm.
Suppose the characteristic parameters extracted from the voice signal form the feature vector Y. The decoding algorithm finds the word sequence w_{1:L} = w_1, w_2, ..., w_L most likely to have generated Y.
That is, the decoding algorithm solves for the parameter w that maximizes the posterior probability P(w|Y):
w_best = argmax_w { P(w|Y) }.
By Bayes' theorem, this becomes:
P(w|Y) = p(Y|w)·P(w) / P(Y).
Since the observation probability P(Y) is constant for a given observation sequence, this further simplifies to:
w_best = argmax_w { p(Y|w)·P(w) },
where the prior probability P(w) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. The parameter w maximizing the posterior probability P(w|Y) is obtained from the above calculation.
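As a toy illustration of this search, a generic log-domain Viterbi over a discrete state network (a sketch only; a real decoder searches an HMM state graph composed from the acoustic and language models):
```python
import numpy as np

def viterbi(log_init: np.ndarray, log_trans: np.ndarray,
            log_emit: np.ndarray) -> list:
    """Return the state path maximizing p(Y|w)*P(w), computed in log space.

    log_init:  (S,)   log start probabilities (the language-model prior role)
    log_trans: (S, S) log transition probabilities
    log_emit:  (T, S) log emission scores of the T observations (acoustic role)
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]               # best score ending in each state
    back = np.zeros((T, S), dtype=int)           # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # score of (prev_state -> state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # trace the optimal path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```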
Step 104: extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
After step 103 the voice signal has been decoded into text in units of sentences; in a conventional speech recognition system the work would be finished here. This method further extracts summary sentences from the recognized sentence-unit text.
The purpose of extracting summary sentences is to extract the important information from the speech and reject the useless information.
This method extracts summary sentences with an HMM. Here the doubly stochastic structure of the HMM can be described as follows: one stochastic process is the emission of the sentence sequence, which is observable; the other is whether a sentence should be classified as a summary sentence, which is not observable. The process of extracting summary sentences with the HMM can thus be described as: given the sentence sequence O = {O1, O2, ..., On}, determine with maximum likelihood whether each sentence is a summary sentence. The main steps are as follows (see the sketch after this list):
(1) Obtain the observation sequence O = {O1, O2, ..., On} of the sentence-unit text.
(2) Determine the HMM hidden states. Five hidden states can be set: "1" — fits, "2" — mostly fits, "3" — neutral, "4" — mostly does not fit, "5" — does not fit, indicating in order the degree to which a sentence fits being a summary sentence.
(3) Estimate the HMM parameters. Initial probability parameters are generated at random and refined by iteration; when a set threshold is reached, the calculation stops and suitable HMM parameters are obtained.
(4) According to the trained HMM, label the sentences with the Viterbi algorithm to obtain, for each sentence, its degree of conformity with a summary sentence.
(5) Extract from the sentence-unit text the sentences meeting a preset degree of conformity (for example, at least "mostly fits") to obtain the summary sentences of the sentence-unit text.
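A hedged sketch of steps (1)-(5) using hmmlearn (the library choice, the sentence features, and the state-to-degree mapping are all illustrative assumptions; after unsupervised training the state indices must be inspected and mapped to the five degrees):
```python
import numpy as np
from hmmlearn import hmm

# Placeholder observation sequence: one feature vector per sentence
# (e.g., position in the text, sentence length, keyword counts).
sentence_feats = np.random.rand(40, 3)

model = hmm.GaussianHMM(n_components=5, n_iter=100)   # five hidden "degree" states
model.fit(sentence_feats)                             # step (3): iterative estimation
states = model.predict(sentence_feats)                # step (4): Viterbi labelling
summary_states = {0, 1}                               # assumed mapping: "fits"/"mostly fits"
summary_idx = [i for i, s in enumerate(states) if s in summary_states]  # step (5)
```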
The speech processing method of Embodiment One pre-processes a voice signal; extracts characteristic parameters from the pre-processed voice signal; decodes the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). Embodiment One not only converts voice information into text, but also extracts and outputs the summary sentences in the text, removing the useless information from the speech recognition result and obtaining a better speech processing result.
In another embodiment, vocal tract length normalization (Vocal Tract Length Normalization, VTLN) can be performed when extracting MFCC features to obtain vocal-tract-length-normalized MFCC features.
The vocal tract can be represented as a cascade of acoustic tubes, each of which can be regarded as a resonant cavity whose resonance frequency depends on the tube's length and shape. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length. For example, vocal tract length generally ranges from 13 cm (adult female) to 18 cm (adult male), so the same vowel formant frequency differs greatly between speakers. VTLN eliminates the vocal-tract-length difference between male and female speakers so that the recognition result is not affected by gender.
VTLN matches the formant frequencies of different speakers by warping and translating the frequency axis. In this embodiment, a bilinear-transform-based VTLN method can be used. Instead of folding the spectrum of the voice signal directly, this method uses the cutoff-frequency mapping equation of a bilinear-transform low-pass filter to calculate the frequency warping factor that aligns the third formant of each speaker; according to the frequency warping factor, the positions (e.g., the start point, middle point, and end point of each triangular filter) and widths of the triangular filter bank are adjusted with the bilinear transform; and the vocal-tract-normalized MFCC features are calculated with the adjusted filter bank. For example, to compress the spectrum of the voice signal, the scale of the triangular filters is stretched instead, extending and shifting the filter bank to the left; to stretch the spectrum, the scale of the triangular filters is compressed, compressing and shifting the filter bank to the right. When this bilinear-transform-based VTLN method is used to normalize the vocal tract for a specific population or person, only a linear transform of the filter-bank coefficients is needed, and the signal spectrum does not need to be folded each time features are extracted, greatly reducing the computation. Moreover, the method avoids a linear search over the frequency factor, reducing the computational complexity, and the bilinear transform keeps the warped frequency continuous without changing the bandwidth.
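A sketch of the all-pass bilinear frequency warping that such an adjustment could use (the formula is the standard bilinear warping map, given as an illustrative assumption rather than the patent's exact equations):
```python
import numpy as np

def bilinear_warp(freqs: np.ndarray, alpha: float, fs: int = 16000) -> np.ndarray:
    """Warp filter-bank edge frequencies with the all-pass bilinear transform.

    alpha is the per-speaker warping factor (e.g., estimated by aligning the
    third formant); alpha = 0 leaves the frequencies unchanged.
    """
    omega = 2 * np.pi * freqs / fs                    # normalized angular frequency
    warped = omega + 2 * np.arctan(
        alpha * np.sin(omega) / (1 - alpha * np.cos(omega)))
    return warped * fs / (2 * np.pi)                  # back to Hz; apply to filter edges
```
Because the map fixes 0 and F_s/2, the warped frequency stays continuous with unchanged overall bandwidth, matching the property noted above; applying it to the filter edge frequencies shifts and rescales the triangular filter bank without refolding the signal spectrum.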
Embodiment two
Fig. 2 is a structural diagram of the speech processing device provided by Embodiment Two of the present invention. As shown in Fig. 2, the speech processing device 10 may include: a pre-processing unit 201, a feature extraction unit 202, a decoding unit 203, and a summary extraction unit 204.
Pre-processing unit 201, configured to pre-process the voice signal.
The voice signal may be an analog voice signal or a digital voice signal. If the voice signal is an analog voice signal, it is converted into a digital voice signal by analog-to-digital conversion.
The present invention is intended for continuous speech recognition, i.e., processing a continuous audio stream. In one embodiment of the present invention, the speech processing device is applied in an intelligent conference system, and the voice signal is a speaker's voice signal input to the intelligent conference system through a voice input device (such as a microphone or a mobile phone microphone).
Pre-processing the voice signal may include pre-emphasizing the voice signal.
The purpose of pre-emphasis is to boost the high-frequency components of speech and flatten the signal spectrum. Because of glottal excitation and mouth-nose radiation, the energy of a voice signal drops off noticeably at high frequencies: the higher the frequency, the smaller the amplitude, with the power spectrum falling by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis, the high-frequency part of the voice signal needs to be boosted, i.e., the voice signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function may be:
H(z) = 1 − κz^(−1), 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis factor, preferably between 0.94 and 0.97.
Pre-processing the voice signal may also include windowing and framing the voice signal.
A voice signal is a non-stationary, time-varying signal, broadly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal, the vocal-tract parameters, and so on all vary slowly over time, and within a window of 10 ms to 30 ms the signal can be considered short-time stationary. In speech signal processing, the voice signal can be divided into short segments (short-time stationary signals) for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is implemented by applying a window to the voice signal. To avoid excessive variation between two adjacent frames, the frames overlap one another. In one embodiment of the invention, each speech frame is 25 milliseconds long, with a 15-millisecond overlap between adjacent frames, i.e., a speech frame is taken every 10 milliseconds.
Common window functions are the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1, 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hamming window function is:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hanning window function is:
w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
Here N is the number of sampling points contained in one speech frame.
Pre-processing the voice signal may also include detecting the active speech in the voice signal.
The purpose of active speech detection is to reject non-speech segments from the voice signal and keep the speech segments, thereby reducing the computation of feature extraction, improving its accuracy, shortening the recognition time, and improving the recognition rate. Active speech detection can be performed using, for example, the short-time energy and short-time zero-crossing rate of the voice signal.
In one embodiment, let the n-th speech frame of the voice signal be x_n(m). Its short-time energy is:
E_n = Σ_{m=0}^{N−1} x_n(m)²,
and its short-time zero-crossing rate is:
Z_n = (1/2) Σ_{m=1}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0.
The beginning and end of the active speech in the voice signal can be detected with the double-threshold method, which is well known in the art and not described again here.
In another embodiment, the active speech in the voice signal can be detected by the following method:
(1) Window and frame the voice signal to obtain the speech frames x(n). In one specific embodiment, a Hamming window may be applied with a frame length of 20 ms and a frame shift of 10 ms. If the voice signal has already been windowed and framed during pre-processing, this step is omitted.
(2) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT) on each speech frame x(n) to obtain its spectrum:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), 0 ≤ k < N.
(3) Calculate the accumulated energy of each frequency band from the spectrum X(k):
E(m) = Σ_{k=m1}^{m2} |X(k)|²,
where E(m) is the accumulated energy of the m-th frequency band and (m1, m2) are the start and end bins of the m-th band.
(4) Take the logarithm of the accumulated energy of each frequency band to obtain its log accumulated energy.
(5) Compare the log accumulated energy of each frequency band with a preset threshold to obtain the active speech: if the log accumulated energy of a band is above the preset threshold, the speech corresponding to that band is active speech.
Feature extraction unit 202, configured to extract characteristic parameters from the pre-processed voice signal.
Feature extraction analyzes the voice signal and extracts a sequence of acoustic parameters that reflect the essential features of the speech.
The extracted characteristic parameters may include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and may also include transform-domain parameters such as linear prediction coefficients (Linear Prediction Coefficient, LPC), linear prediction cepstral coefficients (Linear Prediction Cepstrum Coefficient, LPCC), Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), and perceptual linear prediction (Perceptual Linear Predictive, PLP) coefficients.
In one embodiment of the present invention, the MFCC features of the voice signal are extracted. The steps of extracting MFCC features are as follows:
(1) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT, which may be implemented as a fast Fourier transform) on each speech frame obtained by the pre-processing unit 201 to obtain the spectrum of the speech frame.
(2) Square the magnitude spectrum of the speech frame to obtain its power spectrum.
(3) Pass the power spectrum of the speech frame through a bank of triangular filters uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter. The center frequencies of the filter bank are evenly spaced on the Mel scale, and the two base points of each triangular filter lie at the center frequencies of the two adjacent filters. The center frequency of the m-th triangular filter is:
f(m) = (N/F_s)·B^(−1)(B(f_l) + m·(B(f_h) − B(f_l))/(M+1)),
and the frequency response of the m-th triangular filter is the piecewise-linear triangle:
H_m(k) = (k − f(m−1))/(f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); H_m(k) = (f(m+1) − k)/(f(m+1) − f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 otherwise,
where f_h and f_l are the highest and lowest frequencies of the filter bank, N is the number of Fourier transform points, F_s is the sampling frequency, M is the number of triangular filters, B(f) = 1125·ln(1 + f/700) is the Mel scale, and B^(−1)(b) = 700·(e^(b/1125) − 1) is its inverse.
(4) Take the logarithm of the outputs of all the triangular filters to obtain the log power spectrum S(m) of the speech frame.
(5) Apply a discrete cosine transform (Discrete Cosine Transform, DCT) to S(m) to obtain the initial MFCC features of the speech frame:
C(n) = Σ_{m=0}^{M−1} S(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L.
MFCC extraction introduces the triangular filter bank, whose filters are densely distributed in the low-frequency range and sparsely distributed in the high-frequency range; this matches human auditory characteristics and retains good recognition performance in noisy environments.
The steps of extracting MFCC features may further include:
(6) Extract the dynamic difference MFCC features of the speech frame. The initial MFCC features only reflect the static characteristics of the speech; the dynamic characteristics can be described by difference spectra of the static features, and combining static and dynamic features effectively improves the recognition performance of the system. First-order and/or second-order difference MFCC features are usually used.
In one embodiment, the extracted MFCC feature is a 39-dimensional feature vector, comprising 13 initial MFCC coefficients, 13 first-order difference MFCC coefficients, and 13 second-order difference MFCC coefficients.
In one implementation of the present invention, after the characteristic parameters are extracted from the pre-processed voice signal, dimensionality reduction may also be performed on the extracted characteristic parameters to obtain reduced-dimension characteristic parameters. For example, a segment-mean data reduction algorithm is applied to the characteristic parameters (such as the MFCC features) to obtain the reduced-dimension characteristic parameters, which are then used in the subsequent steps.
Decoding unit 203, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences.
The speech recognition model may include a dynamic time warping model, a Hidden Markov Model, an artificial neural network model, a support vector machine classification model, and the like, or a combination of two or more of these models.
In one embodiment of the present invention, the speech recognition model is a Hidden Markov Model (HMM) comprising an acoustic model and a language model.
Acoustic model (Acoustic Model): phonemes are modeled with Hidden Markov Models. In speech recognition, the recognition unit is not the word but the sub-word, the basic acoustic unit of the acoustic model. In English the sub-words are phonemes; for a specific word, the corresponding acoustic model is assembled from several phonemes by looking up the pronunciation rules of a pronouncing dictionary. In Chinese the sub-words are initials and finals. Each sub-word can be modeled with an HMM of several states. For example, each phoneme can be modeled with an HMM of up to 6 states, each state fitting its observation frames with a Gaussian mixture model (GMM); the observation frames are combined in time order into an observation sequence. Each acoustic model can generate observation sequences of different lengths, i.e., a one-to-many mapping.
Language model (Language Model): during speech recognition, the language model effectively combines syntactic and semantic knowledge to improve the recognition rate and reduce the search space. Because word boundaries are hard to determine exactly and the acoustic model's ability to describe pronunciation variability is limited, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore use a language model P(W) to select the most likely word sequence from the many candidate results, compensating for the deficiencies of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules and uses them to exclude acoustically recognized results that violate those rules. A statistical language model instead describes the dependencies between words with statistical probabilities, encoding grammatical or semantic rules indirectly.
Decoding searches the state network for an optimal path such that the probability of the speech given this path is maximal. In this embodiment, the globally optimal path is found with a dynamic programming algorithm, the Viterbi algorithm.
Suppose the characteristic parameters extracted by the feature extraction unit 202 form the feature vector Y. The decoding algorithm finds the word sequence w_{1:L} = w_1, w_2, ..., w_L most likely to have generated Y.
That is, the decoding algorithm solves for the parameter w that maximizes the posterior probability P(w|Y):
w_best = argmax_w { P(w|Y) }.
By Bayes' theorem, this becomes:
P(w|Y) = p(Y|w)·P(w) / P(Y).
Since the observation probability P(Y) is constant for a given observation sequence, this further simplifies to:
w_best = argmax_w { p(Y|w)·P(w) },
where the prior probability P(w) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. The parameter w maximizing the posterior probability P(w|Y) is obtained from the above calculation.
Summary extraction unit 204, configured to extract summary sentences from the sentence-unit text.
The decoding unit 203 decodes the voice signal into text in units of sentences; in a conventional speech recognition system the work would be finished here. In the present invention, the summary extraction unit 204 further extracts summary sentences from the recognized sentence-unit text.
The purpose of extracting summary sentences is to extract the important information from the speech and reject the useless information.
The summary extraction unit 204 extracts summary sentences with an HMM. Here the doubly stochastic structure of the HMM can be described as follows: one stochastic process is the emission of the sentence sequence, which is observable; the other is whether a sentence should be classified as a summary sentence, which is not observable. The process of extracting summary sentences with the HMM can thus be described as: given the sentence sequence O = {O1, O2, ..., On}, determine with maximum likelihood whether each sentence is a summary sentence. The main steps are as follows:
(1) Obtain the observation sequence O = {O1, O2, ..., On} of the sentence-unit text.
(2) Determine the HMM hidden states. Five hidden states can be set: "1" — fits, "2" — mostly fits, "3" — neutral, "4" — mostly does not fit, "5" — does not fit, indicating in order the degree to which a sentence fits being a summary sentence.
(3) Estimate the HMM parameters. Initial probability parameters are generated at random and refined by iteration; when a set threshold is reached, the calculation stops and suitable HMM parameters are obtained.
(4) According to the trained HMM, label the sentences with the Viterbi algorithm to obtain, for each sentence, its degree of conformity with a summary sentence.
(5) Extract from the sentence-unit text the sentences meeting a preset degree of conformity (for example, at least "mostly fits") to obtain the summary sentences of the sentence-unit text.
The speech processing device 10 of Embodiment Two pre-processes a voice signal; extracts characteristic parameters from the pre-processed voice signal; decodes the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). Embodiment Two not only converts voice information into text, but also extracts and outputs the summary sentences in the text, removing the useless information from the speech recognition result and obtaining a better speech processing result.
In another embodiment, the feature extraction unit 202 can perform vocal tract length normalization (Vocal Tract Length Normalization, VTLN) when extracting MFCC features to obtain vocal-tract-length-normalized MFCC features.
The vocal tract can be represented as a cascade of acoustic tubes, each of which can be regarded as a resonant cavity whose resonance frequency depends on the tube's length and shape. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length. For example, vocal tract length generally ranges from 13 cm (adult female) to 18 cm (adult male), so the same vowel formant frequency differs greatly between speakers. VTLN eliminates the vocal-tract-length difference between male and female speakers so that the recognition result is not affected by gender.
VTLN matches the formant frequencies of different speakers by warping and translating the frequency axis. In this embodiment, a bilinear-transform-based VTLN method can be used. Instead of folding the spectrum of the voice signal directly, this method uses the cutoff-frequency mapping equation of a bilinear-transform low-pass filter to calculate the frequency warping factor that aligns the third formant of each speaker; according to the frequency warping factor, the positions (e.g., the start point, middle point, and end point of each triangular filter) and widths of the triangular filter bank are adjusted with the bilinear transform; and the vocal-tract-normalized MFCC features are calculated with the adjusted filter bank. For example, to compress the spectrum of the voice signal, the scale of the triangular filters is stretched instead, extending and shifting the filter bank to the left; to stretch the spectrum, the scale of the triangular filters is compressed, compressing and shifting the filter bank to the right. When this bilinear-transform-based VTLN method is used to normalize the vocal tract for a specific population or person, only a linear transform of the filter-bank coefficients is needed, and the signal spectrum does not need to be folded each time features are extracted, greatly reducing the computation. Moreover, the method avoids a linear search over the frequency factor, reducing the computational complexity, and the bilinear transform keeps the warped frequency continuous without changing the bandwidth.
Embodiment three
This embodiment provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps in the above speech processing method embodiment, such as steps 101-104 shown in Fig. 1:
Step 101: pre-process the voice signal;
Step 102: extract characteristic parameters from the pre-processed voice signal;
Step 103: decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
Step 104: extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Alternatively, when executed by a processor, the computer program implements the functions of the modules/units in the above device embodiment, such as units 201-204 in Fig. 2:
pre-processing unit 201, configured to pre-process the voice signal;
feature extraction unit 202, configured to extract characteristic parameters from the pre-processed voice signal;
decoding unit 203, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
summary extraction unit 204, configured to extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Embodiment four
Fig. 3 is a schematic diagram of the computer device provided by Embodiment Four of the present invention. The computer device 1 includes a memory 20, a processor 30, and a computer program 40 (for example, a speech processing program) stored in the memory 20 and runnable on the processor 30. When executing the computer program 40, the processor 30 implements the steps in the above speech processing method embodiment, such as steps 101-104 shown in Fig. 1:
Step 101: pre-process the voice signal;
Step 102: extract characteristic parameters from the pre-processed voice signal;
Step 103: decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
Step 104: extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Alternatively, when executing the computer program 40, the processor 30 implements the functions of the modules/units in the above device embodiment, such as units 201-204 in Fig. 2:
pre-processing unit 201, configured to pre-process the voice signal;
feature extraction unit 202, configured to extract characteristic parameters from the pre-processed voice signal;
decoding unit 203, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
summary extraction unit 204, configured to extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Illustratively, the computer program 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the segments describing the execution of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the preprocessing unit 201, the feature extraction unit 202, the decoding unit 203 and the summary extraction unit 204 in Fig. 2; for the specific functions of each unit, see embodiment two.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. Those skilled in the art will understand that Fig. 3 is only an example of the computer device 1 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 1 may also include input/output devices, network access devices and buses.
The processor 30 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control centre of the computer device 1 and connects the various parts of the whole computer device 1 through various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer program and/or modules/units stored in the memory 20 and by calling data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (for example a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 1 (for example audio data or a phone book). In addition, the memory 20 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may also implement all or part of the processes in the above method embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, and so on. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is only schematic; the division of the units is only a logical functional division, and there may be other division manners in actual implementation.
In addition, the functional units in the embodiments of the present invention may be integrated in the same processing unit, may each exist alone physically, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whatever point of view, the embodiments are to be regarded as illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes falling within the meaning and scope of the equivalents of the claims are therefore intended to be embraced in the present invention. Any reference signs in the claims shall not be construed as limiting the claims involved. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a computer device claim may also be implemented by the same unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate, and not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from their spirit and scope.

Claims (10)

1. A speech processing method, characterized in that the method comprises:
preprocessing a voice signal;
extracting feature parameters from the preprocessed voice signal;
decoding the voice signal according to the feature parameters using a pre-trained speech recognition model to obtain text in units of sentences;
extracting summary sentences from the text in units of sentences by means of a hidden Markov model (HMM).
2. The method according to claim 1, characterized in that extracting summary sentences from the text in units of sentences by means of the hidden Markov model HMM specifically comprises:
obtaining an observation state sequence O = {O1, O2, ..., On} of the text in units of sentences;
determining the hidden states of the HMM;
carrying out HMM parameter estimation to obtain a trained HMM;
according to the trained HMM, labelling the sentences by the Viterbi algorithm to obtain, for each sentence, a degree of conformity with a summary sentence;
extracting the sentences whose degree of conformity meets a preset level from the text in units of sentences, to obtain the summary sentences of the text in units of sentences.
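To make claim 2 concrete, the following is a minimal log-domain Viterbi sketch. It assumes two hidden states ("non-summary" = 0, "summary" = 1) and sentences already discretized to observation symbol indices; both are assumptions, since the claim fixes neither the number of states nor the observation encoding. The claim's "degree of conformity" is approximated here by a hard Viterbi label; a soft score could instead come from forward-backward posteriors.

```python
import numpy as np

def viterbi(log_pi: np.ndarray, log_A: np.ndarray, log_B: np.ndarray, obs: list) -> list:
    """Most likely hidden-state path for a discrete-observation HMM.

    log_pi: (S,) log initial state probabilities
    log_A:  (S, S) log transition probabilities
    log_B:  (S, V) log emission probabilities over observation symbols
    obs:    one observation symbol index per sentence
    """
    S, T = log_pi.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)    # best log score ending in state s at time t
    back = np.zeros((T, S), dtype=int)  # best predecessor state per (t, s)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_A[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_B[s, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):       # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]                   # one label per sentence; 1 is read as "summary"
```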
3. The method according to claim 1, characterized in that preprocessing the voice signal includes detecting the effective speech in the voice signal, which specifically comprises:
performing windowed framing on the voice signal to obtain speech frames of the voice signal;
performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
calculating the cumulative energy of each frequency band according to the spectrum of the speech frame;
performing a logarithm operation on the cumulative energy of each frequency band to obtain the logarithmic cumulative energy of each frequency band;
comparing the logarithmic cumulative energy of each frequency band with a preset threshold to obtain the effective speech.
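A minimal numpy sketch of the detection steps in claim 3 follows. The frame length, hop, band count, threshold value and the "any band above threshold" decision rule are illustrative choices the claim leaves open; in practice the threshold would be tuned to the signal scale.

```python
import numpy as np

def detect_voiced_frames(signal, sample_rate, frame_len=400, hop=160,
                         n_bands=8, threshold_db=-40.0):
    """Energy-threshold effective-speech detection; returns one bool per frame."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    voiced = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window  # windowed framing
        spectrum = np.abs(np.fft.rfft(frame)) ** 2            # DFT, power spectrum
        bands = np.array_split(spectrum, n_bands)             # fixed frequency bands
        band_energy = np.array([b.sum() for b in bands])      # cumulative band energy
        log_energy = 10.0 * np.log10(band_energy + 1e-12)     # logarithm of energy
        voiced[i] = np.any(log_energy > threshold_db)         # compare with threshold
    return voiced
```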
4. The method according to claim 1, characterized in that the feature parameters include initial Mel-frequency cepstral coefficient (MFCC) feature parameters, first-order difference MFCC feature parameters and second-order difference MFCC feature parameters.
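Claim 4's first- and second-order difference parameters can be computed as sketched below. A two-frame central difference is used here; it is one common delta formula among several (a windowed regression is another), so the exact formula is an assumption.

```python
import numpy as np

def add_deltas(mfcc: np.ndarray) -> np.ndarray:
    """Append first- and second-order difference coefficients to an
    (n_frames, n_ceps) MFCC matrix."""
    padded = np.pad(mfcc, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0                  # first-order difference
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = (padded_d[2:] - padded_d[:-2]) / 2.0             # second-order difference
    return np.hstack([mfcc, delta, delta2])                   # (n_frames, 3 * n_ceps)
```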
5. The method according to claim 1, characterized in that the method further comprises:
performing dimension reduction on the feature parameters to obtain reduced-dimension feature parameters.
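Claim 5 does not name a dimension-reduction method, so the sketch below uses plain PCA as one common, hypothetical choice.

```python
import numpy as np

def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce (n_frames, n_dims) features to n_components dimensions with PCA."""
    centered = features - features.mean(axis=0)
    # principal directions via SVD of the centered data matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```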
6. The method according to claim 1, characterized in that extracting feature parameters from the preprocessed voice signal includes extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the preprocessed voice signal, which specifically comprises:
using the mapping equation of a bilinear-transform low-pass filter cutoff frequency, calculating the frequency warping factor that aligns the third formant of each different speaker;
according to the frequency warping factor, adjusting the positions and widths of the triangular filter bank used in MFCC feature extraction by means of the bilinear transform;
calculating the vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
7. The method according to claim 1, characterized in that extracting feature parameters from the preprocessed voice signal includes extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the preprocessed voice signal, which specifically comprises:
performing a discrete Fourier transform (DFT) on each speech frame to obtain the spectrum of the speech frame;
squaring the spectral amplitude of the speech frame to obtain the power spectrum of the speech frame;
passing the power spectrum of the speech frame through a triangular filter bank uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter;
performing a logarithm operation on the outputs of all the triangular filters to obtain the log power spectrum of the speech frame;
performing a discrete cosine transform on the log power spectrum to obtain the initial MFCC feature parameters of the speech frame.
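The following numpy/scipy sketch mirrors the steps of claim 7 for a single windowed frame: power spectrum, Mel-spaced triangular filter bank, logarithm, then DCT. The filter count, FFT size and number of retained coefficients are illustrative parameters, not values fixed by the claim.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters spaced uniformly on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fb[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fb

def frame_to_mfcc(frame: np.ndarray, filterbank: np.ndarray, n_ceps: int = 13) -> np.ndarray:
    """One frame: DFT, squared amplitude, filter bank, log, DCT."""
    n_fft = 2 * (filterbank.shape[1] - 1)
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2     # power spectrum of the frame
    log_energy = np.log(filterbank @ power + 1e-12)      # log output of each filter
    return dct(log_energy, type=2, norm="ortho")[:n_ceps]  # initial MFCC coefficients
```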
8. A speech processing apparatus, characterized in that the apparatus comprises:
a preprocessing unit for preprocessing a voice signal;
a feature extraction unit for extracting feature parameters from the preprocessed voice signal;
a decoding unit for decoding the voice signal according to the feature parameters using a pre-trained speech recognition model to obtain text in units of sentences;
a summary extraction unit for extracting summary sentences from the text in units of sentences by means of a hidden Markov model (HMM).
9. A computer device, characterized in that the computer device includes a processor configured to execute a computer program stored in a memory so as to implement the speech processing method according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, it implements the speech processing method according to any one of claims 1-7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810897646.2A CN109036381A (en) 2018-08-08 2018-08-08 Method of speech processing and device, computer installation and readable storage medium storing program for executing
PCT/CN2018/108190 WO2020029404A1 (en) 2018-08-08 2018-09-28 Speech processing method and device, computer device and readable storage medium


Publications (1)

Publication Number Publication Date
CN109036381A true CN109036381A (en) 2018-12-18

Family

ID=64632382


Country Status (2)

Country Link
CN (1) CN109036381A (en)
WO (1) WO2020029404A1 (en)



Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464898B (en) * 2009-01-12 2011-09-21 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN102436809B (en) * 2011-10-21 2013-04-24 东南大学 Network speech recognition method in English oral language machine examination system
US9514739B2 (en) * 2012-06-06 2016-12-06 Cypress Semiconductor Corporation Phoneme score accelerator
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
CN108305632B (en) * 2018-02-02 2020-03-27 深圳市鹰硕技术有限公司 Method and system for forming voice abstract of conference

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006146261A (en) * 2001-08-08 2006-06-08 Nippon Telegr & Teleph Corp <Ntt> Speech processing method and program therefor
WO2004049188A1 (en) * 2002-11-28 2004-06-10 Agency For Science, Technology And Research Summarizing digital audio data
GB0400101D0 (en) * 2004-01-05 2004-02-04 Toshiba Res Europ Ltd Speech recognition system and technique
WO2010019831A1 (en) * 2008-08-14 2010-02-18 21Ct, Inc. Hidden markov model for speech processing with training method
JP2012037797A (en) * 2010-08-10 2012-02-23 Nippon Telegr & Teleph Corp <Ntt> Dialogue learning device, summarization device, dialogue learning method, summarization method, program
CN103021408A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Method and device for speech recognition, optimizing and decoding assisted by stable pronunciation section
US20160328366A1 (en) * 2015-05-04 2016-11-10 King Fahd University Of Petroleum And Minerals Systems and associated methods for arabic handwriting synthesis and dataset design
CN106446109A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Acquiring method and device for audio file abstract
CN107403619A (en) * 2017-06-30 2017-11-28 武汉泰迪智慧科技有限公司 A kind of sound control method and system applied to bicycle environment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yu Jiangde et al., "Application of hidden Markov models in natural language processing", Computer Engineering and Design *
Liu Yunzhong et al., "Text information extraction based on hidden Markov models", Journal of System Simulation *
Jin Yanshuo et al., "An information extraction method based on hidden Markov clustering", Journal of Intelligence *
Chen Ke et al., "Speech recognition of direction instructions based on MFCC and CHMM", Journal of Chengdu University (Natural Science Edition) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN109741761B (en) * 2019-03-13 2020-09-25 百度在线网络技术(北京)有限公司 Sound processing method and device
CN109741761A (en) * 2019-03-13 2019-05-10 百度在线网络技术(北京)有限公司 Sound processing method and device
CN110300001A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Conference audio control method, system, equipment and computer readable storage medium
CN110300001B (en) * 2019-05-21 2022-03-15 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN112420070A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Automatic labeling method and device, electronic equipment and computer readable storage medium
CN110738991A (en) * 2019-10-11 2020-01-31 东南大学 Speech recognition equipment based on flexible wearable sensor
CN111128178A (en) * 2019-12-31 2020-05-08 上海赫千电子科技有限公司 Voice recognition method based on facial expression analysis
CN111509841A (en) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 Cable external damage prevention early warning device with excavator characteristic quantity recognition function
CN111509842A (en) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 Cable anti-damage early warning device with cutting machine characteristic quantity recognition function
CN111509843A (en) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 Cable damage prevention early warning device with mechanical breaking hammer characteristic quantity recognition function
CN111933116A (en) * 2020-06-22 2020-11-13 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111968622A (en) * 2020-08-18 2020-11-20 广州市优普科技有限公司 Attention mechanism-based voice recognition method, system and device
CN112201253A (en) * 2020-11-09 2021-01-08 平安普惠企业管理有限公司 Character marking method and device, electronic equipment and computer readable storage medium
CN112201253B (en) * 2020-11-09 2023-08-25 观华(广州)电子科技有限公司 Text marking method, text marking device, electronic equipment and computer readable storage medium
CN112562646A (en) * 2020-12-09 2021-03-26 江苏科技大学 Robot voice recognition method
CN115063895A (en) * 2022-06-10 2022-09-16 深圳市智远联科技有限公司 Ticket selling method and system based on voice recognition

Also Published As

Publication number Publication date
WO2020029404A1 (en) 2020-02-13

Similar Documents

Publication Publication Date Title
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
Tirumala et al. Speaker identification features extraction methods: A systematic review
Arora et al. Automatic speech recognition: a review
Cutajar et al. Comparative study of automatic speech recognition techniques
Bhangale et al. A review on speech processing using machine learning paradigm
Dua et al. GFCC based discriminatively trained noise robust continuous ASR system for Hindi language
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
Khelifa et al. Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system
Bhatt et al. Feature extraction techniques with analysis of confusing words for speech recognition in the Hindi language
Karpov An automatic multimodal speech recognition system with audio and video information
Chelali et al. Text dependant speaker recognition using MFCC, LPC and DWT
Jothilakshmi et al. Large scale data enabled evolution of spoken language research and applications
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
Nandi et al. Parametric representation of excitation source information for language identification
Devi et al. An analysis on types of speech recognition and algorithms
Nedjah et al. Automatic speech recognition of Portuguese phonemes using neural networks ensemble
Grewal et al. Isolated word recognition system for English language
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Singh et al. An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
Sahoo et al. MFCC feature with optimized frequency range: An essential step for emotion recognition
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Trivedi A survey on English digit speech recognition using HMM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218