CN109036381A - Speech processing method and device, computer device, and readable storage medium - Google Patents


Info

Publication number
CN109036381A
Authority
CN
China
Prior art keywords
sentence
voice signal
characteristic parameter
unit
text
Prior art date
Legal status
Pending
Application number
CN201810897646.2A
Other languages
Chinese (zh)
Inventor
王健宗
王珏
肖京
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810897646.2A
Priority to PCT/CN2018/108190 (WO2020029404A1)
Publication of CN109036381A


Classifications

    • G10L15/02 — Speech recognition; Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/142 — Speech recognition; Speech classification or search using statistical models; Hidden Markov Models [HMMs]
    • G10L15/26 — Speech recognition; Speech to text systems
    • G10L19/02 — Speech or audio signal analysis-synthesis for redundancy reduction; using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/18 — Speech or voice analysis techniques; the extracted parameters being spectral information of each sub-band
    • G10L25/24 — Speech or voice analysis techniques; the extracted parameters being the cepstrum
    • G10L25/51 — Speech or voice analysis techniques specially adapted for particular use; for comparison or discrimination

Abstract

A speech processing method, the method comprising: pre-processing a voice signal; extracting characteristic parameters from the pre-processed voice signal; decoding the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracting summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). The present invention also provides a speech processing device, a computer device, and a computer-readable storage medium. The present invention can recognize speech and remove useless information from the speech recognition result.

Description

Speech processing method and device, computer device, and readable storage medium
Technical field
The present invention relates to the field of computer audio technology, and in particular to a speech processing method and device, a computer device, and a computer-readable storage medium.
Background technique
In intelligent conference systems, speech recognition is a key technology: it converts a speaker's voice signal into text information that a computer can recognize and outputs it.
However, existing intelligent conference systems only convert speech to text and cannot further process the recognized text information. Text converted directly from speech may contain useless information, such as sentences unrelated to the conference content.
Summary of the invention
In view of the above, it is necessary to propose a speech processing method and device, a computer device, and a computer-readable storage medium that can recognize speech and remove useless information from the speech recognition result.
A first aspect of the application provides a speech processing method, the method comprising:
pre-processing a voice signal;
extracting characteristic parameters from the pre-processed voice signal;
decoding the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
extracting summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
In a possible implementation, extracting summary sentences from the sentence-unit text using the HMM specifically includes:
obtaining an observation sequence O = {O1, O2, ..., On} of the sentence-unit text;
determining the hidden states of the HMM;
estimating the HMM parameters to obtain a trained HMM;
labelling the sentences with the Viterbi algorithm according to the trained HMM to obtain, for each sentence, a degree of conformity with a summary sentence;
extracting the sentences that meet a preset degree of conformity from the sentence-unit text to obtain the summary sentences of the sentence-unit text.
In another possible implementation, pre-processing the voice signal includes detecting the active speech in the voice signal, which specifically includes:
windowing and framing the voice signal to obtain the speech frames of the voice signal;
performing a discrete Fourier transform on each speech frame to obtain its spectrum;
calculating the accumulated energy of each frequency band from the spectrum of the speech frame;
taking the logarithm of the accumulated energy of each frequency band to obtain its log accumulated energy;
comparing the log accumulated energy of each frequency band with a preset threshold to obtain the active speech.
In another possible implementation, the characteristic parameters include initial Mel-frequency cepstral coefficient (MFCC) features, first-order difference MFCC features, and second-order difference MFCC features.
In another possible implementation, the method further includes:
performing dimensionality reduction on the characteristic parameters to obtain reduced-dimension characteristic parameters.
In another possible implementation, extracting characteristic parameters from the pre-processed voice signal includes extracting MFCC features from the pre-processed voice signal, which specifically includes:
calculating a frequency warping factor that aligns the third formant of each speaker, using the cutoff-frequency mapping equation of a bilinear-transform low-pass filter;
adjusting, according to the frequency warping factor, the positions and widths of the triangular filter bank used in MFCC feature extraction by means of the bilinear transform;
calculating vocal-tract-normalized MFCC features with the adjusted triangular filter bank.
In another possible implementation, extracting characteristic parameters from the pre-processed voice signal includes extracting MFCC features from the pre-processed voice signal, which specifically includes:
performing a discrete Fourier transform (DFT) on each speech frame to obtain the spectrum of the speech frame;
squaring the magnitude spectrum of the speech frame to obtain its power spectrum;
passing the power spectrum of the speech frame through a triangular filter bank uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter;
taking the logarithm of the outputs of all the triangular filters to obtain the log power spectrum of the speech frame;
applying a discrete cosine transform to the log power spectrum to obtain the initial MFCC features of the speech frame.
A second aspect of the application provides a speech processing device, the device comprising:
a pre-processing unit, configured to pre-process a voice signal;
a feature extraction unit, configured to extract characteristic parameters from the pre-processed voice signal;
a decoding unit, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
a summary extraction unit, configured to extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
A third aspect of the application provides a computer device, the computer device comprising a processor, where the processor implements the speech processing method when executing a computer program stored in a memory.
A fourth aspect of the application provides a computer-readable storage medium on which a computer program is stored, where the computer program implements the speech processing method when executed by a processor.
The present invention pre-processes a voice signal; extracts characteristic parameters from the pre-processed voice signal; decodes the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). The present invention not only converts voice information into text, but also extracts and outputs the summary sentences in the text, removing useless information from the speech recognition result and thus obtaining a better speech processing result.
Brief description of the drawings
Fig. 1 is a flowchart of the speech processing method provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of the speech processing device provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the computer device provided by an embodiment of the present invention.
Specific embodiments
To better understand the objects, features, and advantages of the present invention, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth to facilitate a full understanding of the present invention. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit the present invention.
Preferably, the speech processing method of the present invention is applied in one or more computer devices. A computer device is an apparatus capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or other computing equipment. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice-control device, or the like.
Embodiment one
Fig. 1 is a flowchart of the speech processing method provided by Embodiment One of the present invention. The speech processing method is applied to a computer device. The method recognizes text in units of sentences from a voice signal and then extracts summary sentences from the sentence-unit text.
As shown in Fig. 1, the speech processing method specifically includes the following steps:
Step 101: pre-process the voice signal.
The voice signal may be an analog voice signal or a digital voice signal. If the voice signal is an analog voice signal, it is converted into a digital voice signal by analog-to-digital conversion.
The present invention is intended for continuous speech recognition, i.e., processing a continuous audio stream. In one embodiment of the present invention, the speech processing method is applied in an intelligent conference system, and the voice signal is a speaker's voice signal input to the intelligent conference system through a voice input device (such as a microphone or a mobile phone microphone).
Pre-processing the voice signal may include pre-emphasizing the voice signal.
The purpose of pre-emphasis is to boost the high-frequency components of speech and flatten the signal spectrum. Because of glottal excitation and mouth-nose radiation, the energy of a voice signal drops off noticeably at high frequencies: the higher the frequency, the smaller the amplitude, with the power spectrum falling by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis, the high-frequency part of the voice signal needs to be boosted, i.e., the voice signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function may be:
H(z) = 1 − κz^(−1), 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis factor, preferably between 0.94 and 0.97.
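For illustration, a minimal pre-emphasis sketch in Python/NumPy (the function name and the default κ = 0.97 are assumptions for illustration, not the patent's reference implementation):
```python
import numpy as np

def pre_emphasis(signal: np.ndarray, k: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - k*z^-1 to boost high frequencies."""
    # y[n] = x[n] - k * x[n-1]; the first sample is passed through unchanged.
    return np.append(signal[0], signal[1:] - k * signal[:-1])
```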
Pre-processing the voice signal may also include windowing and framing the voice signal.
A voice signal is a non-stationary, time-varying signal, broadly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal, the vocal-tract parameters, and so on all vary slowly over time, and within a window of 10 ms to 30 ms the signal can be considered short-time stationary. In speech signal processing, the voice signal can therefore be divided into short segments (short-time stationary signals) for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is implemented by applying a window to the voice signal. To avoid excessive variation between two adjacent frames, the frames overlap one another. In one embodiment of the invention, each speech frame is 25 milliseconds long, with a 15-millisecond overlap between adjacent frames, i.e., a speech frame is taken every 10 milliseconds. A framing sketch follows the window definitions below.
Common window functions are the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1, 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hamming window function is:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hanning window function is:
w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
Here N is the number of sampling points contained in one speech frame.
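A short framing-and-windowing sketch under the parameters above (25 ms frames taken every 10 ms; the 16 kHz sampling rate and a signal at least one frame long are illustrative assumptions):
```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms -> 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 10 ms hop -> 15 ms overlap
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)                   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window                           # one windowed frame per row
```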
Pre-processing the voice signal may also include detecting the active speech in the voice signal.
The purpose of active speech detection is to reject non-speech segments from the voice signal and keep the speech segments, thereby reducing the computation of feature extraction, improving its accuracy, shortening the recognition time, and improving the recognition rate. Active speech detection can be performed using, for example, the short-time energy and short-time zero-crossing rate of the voice signal.
In one embodiment, let the n-th speech frame of the voice signal be x_n(m). Its short-time energy is:
E_n = Σ_{m=0}^{N−1} x_n(m)²,
and its short-time zero-crossing rate is:
Z_n = (1/2) Σ_{m=1}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0.
The beginning and end of the active speech in the voice signal can be detected with the double-threshold method, which is well known in the art and not described again here.
In another embodiment, the active speech in the voice signal can be detected by the following method (see the sketch after this list):
(1) Window and frame the voice signal to obtain the speech frames x(n). In one specific embodiment, a Hamming window may be applied with a frame length of 20 ms and a frame shift of 10 ms. If the voice signal has already been windowed and framed during pre-processing, this step is omitted.
(2) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT) on each speech frame x(n) to obtain its spectrum:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), 0 ≤ k < N.
(3) Calculate the accumulated energy of each frequency band from the spectrum X(k):
E(m) = Σ_{k=m1}^{m2} |X(k)|²,
where E(m) is the accumulated energy of the m-th frequency band and (m1, m2) are the start and end bins of the m-th band.
(4) Take the logarithm of the accumulated energy of each frequency band to obtain its log accumulated energy.
(5) Compare the log accumulated energy of each frequency band with a preset threshold to obtain the active speech: if the log accumulated energy of a band is above the preset threshold, the speech corresponding to that band is active speech.
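A hedged NumPy sketch of steps (1)-(5) above (the band boundaries and threshold value are illustrative assumptions):
```python
import numpy as np

def band_energy_vad(frames: np.ndarray, threshold_db: float = 30.0,
                    band: tuple = (4, 64)) -> np.ndarray:
    """Flag frames whose log band energy exceeds a preset threshold."""
    spectrum = np.fft.rfft(frames, axis=1)        # step (2): DFT of each frame
    power = np.abs(spectrum) ** 2
    m1, m2 = band                                 # step (3): accumulate bins m1..m2
    log_energy = 10 * np.log10(power[:, m1:m2].sum(axis=1) + 1e-10)  # step (4)
    return log_energy > threshold_db              # step (5): True = active speech
```
Combined with the framing sketch above, this yields a boolean mask selecting the active speech frames.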
Step 102: extract characteristic parameters from the pre-processed voice signal.
Feature extraction analyzes the voice signal and extracts a sequence of acoustic parameters that reflect the essential features of the speech.
The extracted characteristic parameters may include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and may also include transform-domain parameters such as linear prediction coefficients (Linear Prediction Coefficient, LPC), linear prediction cepstral coefficients (Linear Prediction Cepstrum Coefficient, LPCC), Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), and perceptual linear prediction (Perceptual Linear Predictive, PLP) coefficients.
In one embodiment of the present invention, the MFCC features of the voice signal are extracted. The steps of extracting MFCC features are as follows (a library-based sketch follows this list):
(1) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT, which may be implemented as a fast Fourier transform) on each speech frame to obtain the spectrum of the speech frame.
(2) Square the magnitude spectrum of the speech frame to obtain its power spectrum.
(3) Pass the power spectrum of the speech frame through a bank of triangular filters uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter. The center frequencies of the filter bank are evenly spaced on the Mel scale, and the two base points of each triangular filter lie at the center frequencies of the two adjacent filters. The center frequency of the m-th triangular filter is:
f(m) = (N/F_s)·B^(−1)(B(f_l) + m·(B(f_h) − B(f_l))/(M+1)),
and the frequency response of the m-th triangular filter is the piecewise-linear triangle:
H_m(k) = (k − f(m−1))/(f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); H_m(k) = (f(m+1) − k)/(f(m+1) − f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 otherwise,
where f_h and f_l are the highest and lowest frequencies of the filter bank, N is the number of Fourier transform points, F_s is the sampling frequency, M is the number of triangular filters, B(f) = 1125·ln(1 + f/700) is the Mel scale, and B^(−1)(b) = 700·(e^(b/1125) − 1) is its inverse.
(4) Take the logarithm of the outputs of all the triangular filters to obtain the log power spectrum S(m) of the speech frame.
(5) Apply a discrete cosine transform (Discrete Cosine Transform, DCT) to S(m) to obtain the initial MFCC features of the speech frame:
C(n) = Σ_{m=0}^{M−1} S(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L.
MFCC extraction introduces the triangular filter bank, whose filters are densely distributed in the low-frequency range and sparsely distributed in the high-frequency range; this matches human auditory characteristics and retains good recognition performance in noisy environments.
The steps of extracting MFCC features may further include:
(6) Extract dynamic difference MFCC features of the speech frame from its initial MFCC features. The initial MFCC features only reflect the static characteristics of the speech; the dynamic characteristics can be described by difference spectra of the static features, and combining static and dynamic features effectively improves the recognition performance of the system. First-order and/or second-order difference MFCC features are usually used.
In one embodiment, the extracted MFCC feature is a 39-dimensional feature vector, comprising 13 initial MFCC coefficients, 13 first-order difference MFCC coefficients, and 13 second-order difference MFCC coefficients.
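As an illustration, the 39-dimensional feature vector described above can be computed with librosa's standard MFCC pipeline (a sketch under an assumed input file and sampling rate, not the patent's implementation):
```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # steps (1)-(5): 13 static coefficients
d1 = librosa.feature.delta(mfcc, order=1)           # step (6): first-order differences
d2 = librosa.feature.delta(mfcc, order=2)           # step (6): second-order differences
features = np.concatenate([mfcc, d1, d2], axis=0)   # 39 x n_frames feature matrix
```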
In one implementation of the present invention, after the characteristic parameters are extracted from the pre-processed voice signal, dimensionality reduction may also be performed on the extracted characteristic parameters to obtain reduced-dimension characteristic parameters. For example, a segment-mean data reduction algorithm is applied to the characteristic parameters (such as the MFCC features) to obtain the reduced-dimension characteristic parameters, which are then used in the subsequent steps.
Step 103: decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences.
The speech recognition model may include a dynamic time warping model, a Hidden Markov Model, an artificial neural network model, a support vector machine classification model, and the like, or a combination of two or more of these models.
In one embodiment of the present invention, the speech recognition model is a Hidden Markov Model (HMM) comprising an acoustic model and a language model.
Acoustic model (Acoustic Model): phonemes are modeled with Hidden Markov Models. In speech recognition, the recognition unit is not the word but the sub-word, the basic acoustic unit of the acoustic model. In English the sub-words are phonemes; for a specific word, the corresponding acoustic model is assembled from several phonemes by looking up the pronunciation rules of a pronouncing dictionary. In Chinese the sub-words are initials and finals. Each sub-word can be modeled with an HMM of several states. For example, each phoneme can be modeled with an HMM of up to 6 states, each state fitting its observation frames with a Gaussian mixture model (GMM); the observation frames are combined in time order into an observation sequence. Each acoustic model can generate observation sequences of different lengths, i.e., a one-to-many mapping.
Language model (Language Model): during speech recognition, the language model effectively combines syntactic and semantic knowledge to improve the recognition rate and reduce the search space. Because word boundaries are hard to determine exactly and the acoustic model's ability to describe pronunciation variability is limited, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore use a language model P(W) to select the most likely word sequence from the many candidate results, compensating for the deficiencies of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules and uses them to exclude acoustically recognized results that violate those rules. A statistical language model instead describes the dependencies between words with statistical probabilities, encoding grammatical or semantic rules indirectly.
Decoding searches the state network for an optimal path such that the probability of the speech given this path is maximal. In this embodiment, the globally optimal path is found with a dynamic programming algorithm, the Viterbi algorithm.
Suppose the characteristic parameters extracted from the voice signal form the feature vector Y. The decoding algorithm finds the word sequence w_{1:L} = w_1, w_2, ..., w_L most likely to have generated Y.
That is, the decoding algorithm solves for the parameter w that maximizes the posterior probability P(w|Y):
w_best = argmax_w { P(w|Y) }.
By Bayes' theorem, this becomes:
P(w|Y) = p(Y|w)·P(w) / P(Y).
Since the observation probability P(Y) is constant for a given observation sequence, this further simplifies to:
w_best = argmax_w { p(Y|w)·P(w) },
where the prior probability P(w) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. The parameter w maximizing the posterior probability P(w|Y) is obtained from the above calculation.
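As a toy illustration of this search, a generic log-domain Viterbi over a discrete state network (a sketch only; a real decoder searches an HMM state graph composed from the acoustic and language models):
```python
import numpy as np

def viterbi(log_init: np.ndarray, log_trans: np.ndarray,
            log_emit: np.ndarray) -> list:
    """Return the state path maximizing p(Y|w)*P(w), computed in log space.

    log_init:  (S,)   log start probabilities (the language-model prior role)
    log_trans: (S, S) log transition probabilities
    log_emit:  (T, S) log emission scores of the T observations (acoustic role)
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]               # best score ending in each state
    back = np.zeros((T, S), dtype=int)           # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # score of (prev_state -> state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # trace the optimal path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```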
Step 104: extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
After step 103 the voice signal has been decoded into text in units of sentences; in a conventional speech recognition system the work would be finished here. This method further extracts summary sentences from the recognized sentence-unit text.
The purpose of extracting summary sentences is to extract the important information from the speech and reject the useless information.
This method extracts summary sentences with an HMM. Here the doubly stochastic structure of the HMM can be described as follows: one stochastic process is the emission of the sentence sequence, which is observable; the other is whether a sentence should be classified as a summary sentence, which is not observable. The process of extracting summary sentences with the HMM can thus be described as: given the sentence sequence O = {O1, O2, ..., On}, determine with maximum likelihood whether each sentence is a summary sentence. The main steps are as follows (see the sketch after this list):
(1) Obtain the observation sequence O = {O1, O2, ..., On} of the sentence-unit text.
(2) Determine the HMM hidden states. Five hidden states can be set: "1" — fits, "2" — mostly fits, "3" — neutral, "4" — mostly does not fit, "5" — does not fit, indicating in order the degree to which a sentence fits being a summary sentence.
(3) Estimate the HMM parameters. Initial probability parameters are generated at random and refined by iteration; when a set threshold is reached, the calculation stops and suitable HMM parameters are obtained.
(4) According to the trained HMM, label the sentences with the Viterbi algorithm to obtain, for each sentence, its degree of conformity with a summary sentence.
(5) Extract from the sentence-unit text the sentences meeting a preset degree of conformity (for example, at least "mostly fits") to obtain the summary sentences of the sentence-unit text.
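A hedged sketch of steps (1)-(5) using hmmlearn (the library choice, the sentence features, and the state-to-degree mapping are all illustrative assumptions; after unsupervised training the state indices must be inspected and mapped to the five degrees):
```python
import numpy as np
from hmmlearn import hmm

# Placeholder observation sequence: one feature vector per sentence
# (e.g., position in the text, sentence length, keyword counts).
sentence_feats = np.random.rand(40, 3)

model = hmm.GaussianHMM(n_components=5, n_iter=100)   # five hidden "degree" states
model.fit(sentence_feats)                             # step (3): iterative estimation
states = model.predict(sentence_feats)                # step (4): Viterbi labelling
summary_states = {0, 1}                               # assumed mapping: "fits"/"mostly fits"
summary_idx = [i for i, s in enumerate(states) if s in summary_states]  # step (5)
```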
The speech processing method of Embodiment One pre-processes a voice signal; extracts characteristic parameters from the pre-processed voice signal; decodes the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). Embodiment One not only converts voice information into text, but also extracts and outputs the summary sentences in the text, removing the useless information from the speech recognition result and obtaining a better speech processing result.
In another embodiment, vocal tract length normalization (Vocal Tract Length Normalization, VTLN) can be performed when extracting MFCC features to obtain vocal-tract-length-normalized MFCC features.
The vocal tract can be represented as a cascade of acoustic tubes, each of which can be regarded as a resonant cavity whose resonance frequency depends on the tube's length and shape. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length. For example, vocal tract length generally ranges from 13 cm (adult female) to 18 cm (adult male), so the same vowel formant frequency differs greatly between speakers. VTLN eliminates the vocal-tract-length difference between male and female speakers so that the recognition result is not affected by gender.
VTLN matches the formant frequencies of different speakers by warping and translating the frequency axis. In this embodiment, a bilinear-transform-based VTLN method can be used. Instead of folding the spectrum of the voice signal directly, this method uses the cutoff-frequency mapping equation of a bilinear-transform low-pass filter to calculate the frequency warping factor that aligns the third formant of each speaker; according to the frequency warping factor, the positions (e.g., the start point, middle point, and end point of each triangular filter) and widths of the triangular filter bank are adjusted with the bilinear transform; and the vocal-tract-normalized MFCC features are calculated with the adjusted filter bank. For example, to compress the spectrum of the voice signal, the scale of the triangular filters is stretched instead, extending and shifting the filter bank to the left; to stretch the spectrum, the scale of the triangular filters is compressed, compressing and shifting the filter bank to the right. When this bilinear-transform-based VTLN method is used to normalize the vocal tract for a specific population or person, only a linear transform of the filter-bank coefficients is needed, and the signal spectrum does not need to be folded each time features are extracted, greatly reducing the computation. Moreover, the method avoids a linear search over the frequency factor, reducing the computational complexity, and the bilinear transform keeps the warped frequency continuous without changing the bandwidth.
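A sketch of the all-pass bilinear frequency warping that such an adjustment could use (the formula is the standard bilinear warping map, given as an illustrative assumption rather than the patent's exact equations):
```python
import numpy as np

def bilinear_warp(freqs: np.ndarray, alpha: float, fs: int = 16000) -> np.ndarray:
    """Warp filter-bank edge frequencies with the all-pass bilinear transform.

    alpha is the per-speaker warping factor (e.g., estimated by aligning the
    third formant); alpha = 0 leaves the frequencies unchanged.
    """
    omega = 2 * np.pi * freqs / fs                    # normalized angular frequency
    warped = omega + 2 * np.arctan(
        alpha * np.sin(omega) / (1 - alpha * np.cos(omega)))
    return warped * fs / (2 * np.pi)                  # back to Hz; apply to filter edges
```
Because the map fixes 0 and F_s/2, the warped frequency stays continuous with unchanged overall bandwidth, matching the property noted above; applying it to the filter edge frequencies shifts and rescales the triangular filter bank without refolding the signal spectrum.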
Embodiment two
Fig. 2 is a structural diagram of the speech processing device provided by Embodiment Two of the present invention. As shown in Fig. 2, the speech processing device 10 may include: a pre-processing unit 201, a feature extraction unit 202, a decoding unit 203, and a summary extraction unit 204.
Pre-processing unit 201, configured to pre-process the voice signal.
The voice signal may be an analog voice signal or a digital voice signal. If the voice signal is an analog voice signal, it is converted into a digital voice signal by analog-to-digital conversion.
The present invention is intended for continuous speech recognition, i.e., processing a continuous audio stream. In one embodiment of the present invention, the speech processing device is applied in an intelligent conference system, and the voice signal is a speaker's voice signal input to the intelligent conference system through a voice input device (such as a microphone or a mobile phone microphone).
Pre-processing the voice signal may include pre-emphasizing the voice signal.
The purpose of pre-emphasis is to boost the high-frequency components of speech and flatten the signal spectrum. Because of glottal excitation and mouth-nose radiation, the energy of a voice signal drops off noticeably at high frequencies: the higher the frequency, the smaller the amplitude, with the power spectrum falling by about 6 dB/octave as the frequency doubles. Therefore, before spectrum analysis or vocal-tract parameter analysis, the high-frequency part of the voice signal needs to be boosted, i.e., the voice signal is pre-emphasized. Pre-emphasis is generally implemented with a high-pass filter, whose transfer function may be:
H(z) = 1 − κz^(−1), 0.9 ≤ κ ≤ 1.0,
where κ is the pre-emphasis factor, preferably between 0.94 and 0.97.
Pre-processing the voice signal may also include windowing and framing the voice signal.
A voice signal is a non-stationary, time-varying signal, broadly divided into voiced and unvoiced sounds. The pitch period of voiced sounds, the amplitude of the voiced signal, the vocal-tract parameters, and so on all vary slowly over time, and within a window of 10 ms to 30 ms the signal can be considered short-time stationary. In speech signal processing, the voice signal can be divided into short segments (short-time stationary signals) for processing; this process is called framing, and each resulting short segment is called a speech frame. Framing is implemented by applying a window to the voice signal. To avoid excessive variation between two adjacent frames, the frames overlap one another. In one embodiment of the invention, each speech frame is 25 milliseconds long, with a 15-millisecond overlap between adjacent frames, i.e., a speech frame is taken every 10 milliseconds.
Common window functions are the rectangular window, the Hamming window, and the Hanning window. The rectangular window function is:
w(n) = 1, 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hamming window function is:
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
The Hanning window function is:
w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1; w(n) = 0 otherwise.
Here N is the number of sampling points contained in one speech frame.
Pre-processing the voice signal may also include detecting the active speech in the voice signal.
The purpose of active speech detection is to reject non-speech segments from the voice signal and keep the speech segments, thereby reducing the computation of feature extraction, improving its accuracy, shortening the recognition time, and improving the recognition rate. Active speech detection can be performed using, for example, the short-time energy and short-time zero-crossing rate of the voice signal.
In one embodiment, let the n-th speech frame of the voice signal be x_n(m). Its short-time energy is:
E_n = Σ_{m=0}^{N−1} x_n(m)²,
and its short-time zero-crossing rate is:
Z_n = (1/2) Σ_{m=1}^{N−1} |sgn[x_n(m)] − sgn[x_n(m−1)]|,
where sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0.
The beginning and end of the active speech in the voice signal can be detected with the double-threshold method, which is well known in the art and not described again here.
In another embodiment, the active speech in the voice signal can be detected by the following method:
(1) Window and frame the voice signal to obtain the speech frames x(n). In one specific embodiment, a Hamming window may be applied with a frame length of 20 ms and a frame shift of 10 ms. If the voice signal has already been windowed and framed during pre-processing, this step is omitted.
(2) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT) on each speech frame x(n) to obtain its spectrum:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), 0 ≤ k < N.
(3) Calculate the accumulated energy of each frequency band from the spectrum X(k):
E(m) = Σ_{k=m1}^{m2} |X(k)|²,
where E(m) is the accumulated energy of the m-th frequency band and (m1, m2) are the start and end bins of the m-th band.
(4) Take the logarithm of the accumulated energy of each frequency band to obtain its log accumulated energy.
(5) Compare the log accumulated energy of each frequency band with a preset threshold to obtain the active speech: if the log accumulated energy of a band is above the preset threshold, the speech corresponding to that band is active speech.
Feature extraction unit 202, configured to extract characteristic parameters from the pre-processed voice signal.
Feature extraction analyzes the voice signal and extracts a sequence of acoustic parameters that reflect the essential features of the speech.
The extracted characteristic parameters may include time-domain parameters such as short-time average energy, short-time average zero-crossing rate, formants, and pitch period, and may also include transform-domain parameters such as linear prediction coefficients (Linear Prediction Coefficient, LPC), linear prediction cepstral coefficients (Linear Prediction Cepstrum Coefficient, LPCC), Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), and perceptual linear prediction (Perceptual Linear Predictive, PLP) coefficients.
In one embodiment of the present invention, the MFCC features of the voice signal are extracted. The steps of extracting MFCC features are as follows:
(1) Perform a discrete Fourier transform (Discrete Fourier Transform, DFT, which may be implemented as a fast Fourier transform) on each speech frame obtained by the pre-processing unit 201 to obtain the spectrum of the speech frame.
(2) Square the magnitude spectrum of the speech frame to obtain its power spectrum.
(3) Pass the power spectrum of the speech frame through a bank of triangular filters uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter. The center frequencies of the filter bank are evenly spaced on the Mel scale, and the two base points of each triangular filter lie at the center frequencies of the two adjacent filters. The center frequency of the m-th triangular filter is:
f(m) = (N/F_s)·B^(−1)(B(f_l) + m·(B(f_h) − B(f_l))/(M+1)),
and the frequency response of the m-th triangular filter is the piecewise-linear triangle:
H_m(k) = (k − f(m−1))/(f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); H_m(k) = (f(m+1) − k)/(f(m+1) − f(m)) for f(m) < k ≤ f(m+1); H_m(k) = 0 otherwise,
where f_h and f_l are the highest and lowest frequencies of the filter bank, N is the number of Fourier transform points, F_s is the sampling frequency, M is the number of triangular filters, B(f) = 1125·ln(1 + f/700) is the Mel scale, and B^(−1)(b) = 700·(e^(b/1125) − 1) is its inverse.
(4) Take the logarithm of the outputs of all the triangular filters to obtain the log power spectrum S(m) of the speech frame.
(5) Apply a discrete cosine transform (Discrete Cosine Transform, DCT) to S(m) to obtain the initial MFCC features of the speech frame:
C(n) = Σ_{m=0}^{M−1} S(m)·cos(πn(m + 0.5)/M), n = 1, 2, ..., L.
MFCC extraction introduces the triangular filter bank, whose filters are densely distributed in the low-frequency range and sparsely distributed in the high-frequency range; this matches human auditory characteristics and retains good recognition performance in noisy environments.
The steps of extracting MFCC features may further include:
(6) Extract the dynamic difference MFCC features of the speech frame. The initial MFCC features only reflect the static characteristics of the speech; the dynamic characteristics can be described by difference spectra of the static features, and combining static and dynamic features effectively improves the recognition performance of the system. First-order and/or second-order difference MFCC features are usually used.
In one embodiment, the extracted MFCC feature is a 39-dimensional feature vector, comprising 13 initial MFCC coefficients, 13 first-order difference MFCC coefficients, and 13 second-order difference MFCC coefficients.
In one implementation of the present invention, after the characteristic parameters are extracted from the pre-processed voice signal, dimensionality reduction may also be performed on the extracted characteristic parameters to obtain reduced-dimension characteristic parameters. For example, a segment-mean data reduction algorithm is applied to the characteristic parameters (such as the MFCC features) to obtain the reduced-dimension characteristic parameters, which are then used in the subsequent steps.
Decoding unit 203, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences.
The speech recognition model may include a dynamic time warping model, a Hidden Markov Model, an artificial neural network model, a support vector machine classification model, and the like, or a combination of two or more of these models.
In one embodiment of the present invention, the speech recognition model is a Hidden Markov Model (HMM) comprising an acoustic model and a language model.
Acoustic model (Acoustic Model): phonemes are modeled with Hidden Markov Models. In speech recognition, the recognition unit is not the word but the sub-word, the basic acoustic unit of the acoustic model. In English the sub-words are phonemes; for a specific word, the corresponding acoustic model is assembled from several phonemes by looking up the pronunciation rules of a pronouncing dictionary. In Chinese the sub-words are initials and finals. Each sub-word can be modeled with an HMM of several states. For example, each phoneme can be modeled with an HMM of up to 6 states, each state fitting its observation frames with a Gaussian mixture model (GMM); the observation frames are combined in time order into an observation sequence. Each acoustic model can generate observation sequences of different lengths, i.e., a one-to-many mapping.
Language model (Language Model): during speech recognition, the language model effectively combines syntactic and semantic knowledge to improve the recognition rate and reduce the search space. Because word boundaries are hard to determine exactly and the acoustic model's ability to describe pronunciation variability is limited, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore use a language model P(W) to select the most likely word sequence from the many candidate results, compensating for the deficiencies of the acoustic model.
In this embodiment, a rule-based language model is used. A rule-based language model summarizes grammatical and even semantic rules and uses them to exclude acoustically recognized results that violate those rules. A statistical language model instead describes the dependencies between words with statistical probabilities, encoding grammatical or semantic rules indirectly.
Decoding searches the state network for an optimal path such that the probability of the speech given this path is maximal. In this embodiment, the globally optimal path is found with a dynamic programming algorithm, the Viterbi algorithm.
Suppose the characteristic parameters extracted by the feature extraction unit 202 form the feature vector Y. The decoding algorithm finds the word sequence w_{1:L} = w_1, w_2, ..., w_L most likely to have generated Y.
That is, the decoding algorithm solves for the parameter w that maximizes the posterior probability P(w|Y):
w_best = argmax_w { P(w|Y) }.
By Bayes' theorem, this becomes:
P(w|Y) = p(Y|w)·P(w) / P(Y).
Since the observation probability P(Y) is constant for a given observation sequence, this further simplifies to:
w_best = argmax_w { p(Y|w)·P(w) },
where the prior probability P(w) is determined by the language model and the likelihood p(Y|w) is determined by the acoustic model. The parameter w maximizing the posterior probability P(w|Y) is obtained from the above calculation.
Summary extraction unit 204, configured to extract summary sentences from the sentence-unit text.
The decoding unit 203 decodes the voice signal into text in units of sentences; in a conventional speech recognition system the work would be finished here. In the present invention, the summary extraction unit 204 further extracts summary sentences from the recognized sentence-unit text.
The purpose of extracting summary sentences is to extract the important information from the speech and reject the useless information.
The summary extraction unit 204 extracts summary sentences with an HMM. Here the doubly stochastic structure of the HMM can be described as follows: one stochastic process is the emission of the sentence sequence, which is observable; the other is whether a sentence should be classified as a summary sentence, which is not observable. The process of extracting summary sentences with the HMM can thus be described as: given the sentence sequence O = {O1, O2, ..., On}, determine with maximum likelihood whether each sentence is a summary sentence. The main steps are as follows:
(1) Obtain the observation sequence O = {O1, O2, ..., On} of the sentence-unit text.
(2) Determine the HMM hidden states. Five hidden states can be set: "1" — fits, "2" — mostly fits, "3" — neutral, "4" — mostly does not fit, "5" — does not fit, indicating in order the degree to which a sentence fits being a summary sentence.
(3) Estimate the HMM parameters. Initial probability parameters are generated at random and refined by iteration; when a set threshold is reached, the calculation stops and suitable HMM parameters are obtained.
(4) According to the trained HMM, label the sentences with the Viterbi algorithm to obtain, for each sentence, its degree of conformity with a summary sentence.
(5) Extract from the sentence-unit text the sentences meeting a preset degree of conformity (for example, at least "mostly fits") to obtain the summary sentences of the sentence-unit text.
The speech processing device 10 of Embodiment Two pre-processes a voice signal; extracts characteristic parameters from the pre-processed voice signal; decodes the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences; and extracts summary sentences from the sentence-unit text using a Hidden Markov Model (HMM). Embodiment Two not only converts voice information into text, but also extracts and outputs the summary sentences in the text, removing the useless information from the speech recognition result and obtaining a better speech processing result.
In another embodiment, the feature extraction unit 202 can perform vocal tract length normalization (Vocal Tract Length Normalization, VTLN) when extracting MFCC features to obtain vocal-tract-length-normalized MFCC features.
The vocal tract can be represented as a cascade of acoustic tubes, each of which can be regarded as a resonant cavity whose resonance frequency depends on the tube's length and shape. Part of the acoustic difference between speakers is therefore due to differences in vocal tract length. For example, vocal tract length generally ranges from 13 cm (adult female) to 18 cm (adult male), so the same vowel formant frequency differs greatly between speakers. VTLN eliminates the vocal-tract-length difference between male and female speakers so that the recognition result is not affected by gender.
VTLN matches the formant frequencies of different speakers by warping and translating the frequency axis. In this embodiment, a bilinear-transform-based VTLN method can be used. Instead of folding the spectrum of the voice signal directly, this method uses the cutoff-frequency mapping equation of a bilinear-transform low-pass filter to calculate the frequency warping factor that aligns the third formant of each speaker; according to the frequency warping factor, the positions (e.g., the start point, middle point, and end point of each triangular filter) and widths of the triangular filter bank are adjusted with the bilinear transform; and the vocal-tract-normalized MFCC features are calculated with the adjusted filter bank. For example, to compress the spectrum of the voice signal, the scale of the triangular filters is stretched instead, extending and shifting the filter bank to the left; to stretch the spectrum, the scale of the triangular filters is compressed, compressing and shifting the filter bank to the right. When this bilinear-transform-based VTLN method is used to normalize the vocal tract for a specific population or person, only a linear transform of the filter-bank coefficients is needed, and the signal spectrum does not need to be folded each time features are extracted, greatly reducing the computation. Moreover, the method avoids a linear search over the frequency factor, reducing the computational complexity, and the bilinear transform keeps the warped frequency continuous without changing the bandwidth.
Embodiment three
This embodiment provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps in the above speech processing method embodiment, such as steps 101-104 shown in Fig. 1:
Step 101: pre-process the voice signal;
Step 102: extract characteristic parameters from the pre-processed voice signal;
Step 103: decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
Step 104: extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Alternatively, when executed by a processor, the computer program implements the functions of the modules/units in the above device embodiment, such as units 201-204 in Fig. 2:
pre-processing unit 201, configured to pre-process the voice signal;
feature extraction unit 202, configured to extract characteristic parameters from the pre-processed voice signal;
decoding unit 203, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
summary extraction unit 204, configured to extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Embodiment four
Fig. 3 is a schematic diagram of the computer device provided by Embodiment Four of the present invention. The computer device 1 includes a memory 20, a processor 30, and a computer program 40 (for example, a speech processing program) stored in the memory 20 and runnable on the processor 30. When executing the computer program 40, the processor 30 implements the steps in the above speech processing method embodiment, such as steps 101-104 shown in Fig. 1:
Step 101: pre-process the voice signal;
Step 102: extract characteristic parameters from the pre-processed voice signal;
Step 103: decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
Step 104: extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Alternatively, when executing the computer program 40, the processor 30 implements the functions of the modules/units in the above device embodiment, such as units 201-204 in Fig. 2:
pre-processing unit 201, configured to pre-process the voice signal;
feature extraction unit 202, configured to extract characteristic parameters from the pre-processed voice signal;
decoding unit 203, configured to decode the voice signal according to the characteristic parameters using a pre-trained speech recognition model to obtain text in units of sentences;
summary extraction unit 204, configured to extract summary sentences from the sentence-unit text using a Hidden Markov Model (HMM).
Illustratively, the computer program 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the segments describing the execution of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the preprocessing unit 201, the feature extraction unit 202, the decoding unit 203 and the summary extraction unit 204 in Fig. 2; for the specific functions of each unit, see embodiment two.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. Those skilled in the art will understand that Fig. 3 is only an example of the computer device 1 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 1 may also include input/output devices, network access devices and buses.
The processor 30 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control centre of the computer device 1 and connects the various parts of the whole computer device 1 through various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer program and/or modules/units stored in the memory 20 and by calling data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (for example a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 1 (for example audio data or a phone book). In addition, the memory 20 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may also implement all or part of the processes in the above method embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, and so on. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is only schematic; the division of the units is only a logical functional division, and there may be other division manners in actual implementation.
In addition, the functional units in the embodiments of the present invention may be integrated in the same processing unit, may each exist alone physically, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whatever point of view, the embodiments are to be regarded as illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes falling within the meaning and scope of the equivalents of the claims are therefore intended to be embraced in the present invention. Any reference signs in the claims shall not be construed as limiting the claims involved. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a computer device claim may also be implemented by the same unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate, and not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from their spirit and scope.

Claims (10)

1. A speech processing method, characterized in that the method comprises:
preprocessing a voice signal;
extracting feature parameters from the preprocessed voice signal;
decoding the voice signal according to the feature parameters using a pre-trained speech recognition model to obtain text in units of sentences;
extracting summary sentences from the text in units of sentences by means of a hidden Markov model (HMM).
2. The method according to claim 1, characterized in that extracting summary sentences from the text in units of sentences by means of the hidden Markov model HMM specifically comprises:
obtaining an observation state sequence O = {O1, O2, ..., On} of the text in units of sentences;
determining the hidden states of the HMM;
carrying out HMM parameter estimation to obtain a trained HMM;
according to the trained HMM, labelling the sentences by the Viterbi algorithm to obtain, for each sentence, a degree of conformity with a summary sentence;
extracting the sentences whose degree of conformity meets a preset level from the text in units of sentences, to obtain the summary sentences of the text in units of sentences.
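To make claim 2 concrete, the following is a minimal log-domain Viterbi sketch. It assumes two hidden states ("non-summary" = 0, "summary" = 1) and sentences already discretized to observation symbol indices; both are assumptions, since the claim fixes neither the number of states nor the observation encoding. The claim's "degree of conformity" is approximated here by a hard Viterbi label; a soft score could instead come from forward-backward posteriors.

```python
import numpy as np

def viterbi(log_pi: np.ndarray, log_A: np.ndarray, log_B: np.ndarray, obs: list) -> list:
    """Most likely hidden-state path for a discrete-observation HMM.

    log_pi: (S,) log initial state probabilities
    log_A:  (S, S) log transition probabilities
    log_B:  (S, V) log emission probabilities over observation symbols
    obs:    one observation symbol index per sentence
    """
    S, T = log_pi.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)    # best log score ending in state s at time t
    back = np.zeros((T, S), dtype=int)  # best predecessor state per (t, s)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_A[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_B[s, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):       # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]                   # one label per sentence; 1 is read as "summary"
```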
3. The method according to claim 1, characterized in that preprocessing the voice signal includes detecting the effective speech in the voice signal, which specifically comprises:
performing windowed framing on the voice signal to obtain speech frames of the voice signal;
performing a discrete Fourier transform on the speech frames to obtain the spectrum of each speech frame;
calculating the cumulative energy of each frequency band according to the spectrum of the speech frame;
performing a logarithm operation on the cumulative energy of each frequency band to obtain the logarithmic cumulative energy of each frequency band;
comparing the logarithmic cumulative energy of each frequency band with a preset threshold to obtain the effective speech.
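A minimal numpy sketch of the detection steps in claim 3 follows. The frame length, hop, band count, threshold value and the "any band above threshold" decision rule are illustrative choices the claim leaves open; in practice the threshold would be tuned to the signal scale.

```python
import numpy as np

def detect_voiced_frames(signal, sample_rate, frame_len=400, hop=160,
                         n_bands=8, threshold_db=-40.0):
    """Energy-threshold effective-speech detection; returns one bool per frame."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    voiced = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window  # windowed framing
        spectrum = np.abs(np.fft.rfft(frame)) ** 2            # DFT, power spectrum
        bands = np.array_split(spectrum, n_bands)             # fixed frequency bands
        band_energy = np.array([b.sum() for b in bands])      # cumulative band energy
        log_energy = 10.0 * np.log10(band_energy + 1e-12)     # logarithm of energy
        voiced[i] = np.any(log_energy > threshold_db)         # compare with threshold
    return voiced
```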
4. The method according to claim 1, characterized in that the feature parameters include initial Mel-frequency cepstral coefficient (MFCC) feature parameters, first-order difference MFCC feature parameters and second-order difference MFCC feature parameters.
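Claim 4's first- and second-order difference parameters can be computed as sketched below. A two-frame central difference is used here; it is one common delta formula among several (a windowed regression is another), so the exact formula is an assumption.

```python
import numpy as np

def add_deltas(mfcc: np.ndarray) -> np.ndarray:
    """Append first- and second-order difference coefficients to an
    (n_frames, n_ceps) MFCC matrix."""
    padded = np.pad(mfcc, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0                  # first-order difference
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = (padded_d[2:] - padded_d[:-2]) / 2.0             # second-order difference
    return np.hstack([mfcc, delta, delta2])                   # (n_frames, 3 * n_ceps)
```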
5. The method according to claim 1, characterized in that the method further comprises:
performing dimension reduction on the feature parameters to obtain reduced-dimension feature parameters.
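Claim 5 does not name a dimension-reduction method, so the sketch below uses plain PCA as one common, hypothetical choice.

```python
import numpy as np

def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce (n_frames, n_dims) features to n_components dimensions with PCA."""
    centered = features - features.mean(axis=0)
    # principal directions via SVD of the centered data matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```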
6. The method according to claim 1, characterized in that extracting feature parameters from the preprocessed voice signal includes extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the preprocessed voice signal, which specifically comprises:
using the mapping equation of a bilinear-transform low-pass filter cutoff frequency, calculating the frequency warping factor that aligns the third formant of each different speaker;
according to the frequency warping factor, adjusting the positions and widths of the triangular filter bank used in MFCC feature extraction by means of the bilinear transform;
calculating the vocal-tract-normalized MFCC feature parameters according to the adjusted triangular filter bank.
7. The method according to claim 1, characterized in that extracting feature parameters from the preprocessed voice signal includes extracting Mel-frequency cepstral coefficient (MFCC) feature parameters from the preprocessed voice signal, which specifically comprises:
performing a discrete Fourier transform (DFT) on each speech frame to obtain the spectrum of the speech frame;
squaring the spectral amplitude of the speech frame to obtain the power spectrum of the speech frame;
passing the power spectrum of the speech frame through a triangular filter bank uniformly distributed on the Mel frequency scale to obtain the output of each triangular filter;
performing a logarithm operation on the outputs of all the triangular filters to obtain the log power spectrum of the speech frame;
performing a discrete cosine transform on the log power spectrum to obtain the initial MFCC feature parameters of the speech frame.
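The following numpy/scipy sketch mirrors the steps of claim 7 for a single windowed frame: power spectrum, Mel-spaced triangular filter bank, logarithm, then DCT. The filter count, FFT size and number of retained coefficients are illustrative parameters, not values fixed by the claim.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters spaced uniformly on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fb[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fb

def frame_to_mfcc(frame: np.ndarray, filterbank: np.ndarray, n_ceps: int = 13) -> np.ndarray:
    """One frame: DFT, squared amplitude, filter bank, log, DCT."""
    n_fft = 2 * (filterbank.shape[1] - 1)
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2     # power spectrum of the frame
    log_energy = np.log(filterbank @ power + 1e-12)      # log output of each filter
    return dct(log_energy, type=2, norm="ortho")[:n_ceps]  # initial MFCC coefficients
```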
8. A speech processing apparatus, characterized in that the apparatus comprises:
a preprocessing unit for preprocessing a voice signal;
a feature extraction unit for extracting feature parameters from the preprocessed voice signal;
a decoding unit for decoding the voice signal according to the feature parameters using a pre-trained speech recognition model to obtain text in units of sentences;
a summary extraction unit for extracting summary sentences from the text in units of sentences by means of a hidden Markov model (HMM).
9. A computer device, characterized in that the computer device includes a processor configured to execute a computer program stored in a memory so as to implement the speech processing method according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, it implements the speech processing method according to any one of claims 1-7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810897646.2A CN109036381A (en) 2018-08-08 2018-08-08 Method of speech processing and device, computer installation and readable storage medium storing program for executing
PCT/CN2018/108190 WO2020029404A1 (en) 2018-08-08 2018-09-28 Speech processing method and device, computer device and readable storage medium


Publications (1)

Publication Number Publication Date
CN109036381A true CN109036381A (en) 2018-12-18

Family

ID=64632382


Country Status (2)

Country Link
CN (1) CN109036381A (en)
WO (1) WO2020029404A1 (en)



Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464898B (en) * 2009-01-12 2011-09-21 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN102436809B (en) * 2011-10-21 2013-04-24 东南大学 Network speech recognition method in English oral language machine examination system
US9514739B2 (en) * 2012-06-06 2016-12-06 Cypress Semiconductor Corporation Phoneme score accelerator
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
CN108305632B (en) * 2018-02-02 2020-03-27 深圳市鹰硕技术有限公司 Method and system for forming voice abstract of conference

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006146261A (en) * 2001-08-08 2006-06-08 Nippon Telegr & Teleph Corp <Ntt> Speech processing method and program therefor
WO2004049188A1 (en) * 2002-11-28 2004-06-10 Agency For Science, Technology And Research Summarizing digital audio data
GB0400101D0 (en) * 2004-01-05 2004-02-04 Toshiba Res Europ Ltd Speech recognition system and technique
WO2010019831A1 (en) * 2008-08-14 2010-02-18 21Ct, Inc. Hidden markov model for speech processing with training method
JP2012037797A (en) * 2010-08-10 2012-02-23 Nippon Telegr & Teleph Corp <Ntt> Dialogue learning device, summarization device, dialogue learning method, summarization method, program
CN103021408A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Method and device for speech recognition, optimizing and decoding assisted by stable pronunciation section
US20160328366A1 (en) * 2015-05-04 2016-11-10 King Fahd University Of Petroleum And Minerals Systems and associated methods for arabic handwriting synthesis and dataset design
CN106446109A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Acquiring method and device for audio file abstract
CN107403619A (en) * 2017-06-30 2017-11-28 武汉泰迪智慧科技有限公司 A kind of sound control method and system applied to bicycle environment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yu Jiangde et al., "Application of hidden Markov models in natural language processing", Computer Engineering and Design *
Liu Yunzhong et al., "Text information extraction based on hidden Markov models", Journal of System Simulation *
Jin Yanshuo et al., "An information extraction method based on hidden Markov clustering", Journal of Intelligence *
Chen Ke et al., "Speech recognition of direction instructions based on MFCC and CHMM", Journal of Chengdu University (Natural Science Edition) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN109741761B (en) * 2019-03-13 2020-09-25 百度在线网络技术(北京)有限公司 Sound processing method and device
CN109741761A (en) * 2019-03-13 2019-05-10 百度在线网络技术(北京)有限公司 Sound processing method and device
CN110300001A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Conference audio control method, system, equipment and computer readable storage medium
CN110300001B (en) * 2019-05-21 2022-03-15 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN112420070A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Automatic labeling method and device, electronic equipment and computer readable storage medium
CN110738991A (en) * 2019-10-11 2020-01-31 东南大学 Speech recognition equipment based on flexible wearable sensor
CN111128178A (en) * 2019-12-31 2020-05-08 上海赫千电子科技有限公司 Voice recognition method based on facial expression analysis
CN111509841A (en) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 Cable external damage prevention early warning device with excavator characteristic quantity recognition function
CN111509842A (en) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 Cable anti-damage early warning device with cutting machine characteristic quantity recognition function
CN111509843A (en) * 2020-04-14 2020-08-07 佛山市威格特电气设备有限公司 Cable damage prevention early warning device with mechanical breaking hammer characteristic quantity recognition function
CN111933116A (en) * 2020-06-22 2020-11-13 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111968622A (en) * 2020-08-18 2020-11-20 广州市优普科技有限公司 Attention mechanism-based voice recognition method, system and device
CN112201253A (en) * 2020-11-09 2021-01-08 平安普惠企业管理有限公司 Character marking method and device, electronic equipment and computer readable storage medium
CN112201253B (en) * 2020-11-09 2023-08-25 观华(广州)电子科技有限公司 Text marking method, text marking device, electronic equipment and computer readable storage medium
CN112562646A (en) * 2020-12-09 2021-03-26 江苏科技大学 Robot voice recognition method
CN115063895A (en) * 2022-06-10 2022-09-16 深圳市智远联科技有限公司 Ticket selling method and system based on voice recognition

Also Published As

Publication number Publication date
WO2020029404A1 (en) 2020-02-13

Similar Documents

Publication Publication Date Title
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
Tirumala et al. Speaker identification features extraction methods: A systematic review
Arora et al. Automatic speech recognition: a review
Cutajar et al. Comparative study of automatic speech recognition techniques
Bhangale et al. A review on speech processing using machine learning paradigm
Dua et al. GFCC based discriminatively trained noise robust continuous ASR system for Hindi language
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
Khelifa et al. Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system
Bhatt et al. Feature extraction techniques with analysis of confusing words for speech recognition in the Hindi language
Karpov An automatic multimodal speech recognition system with audio and video information
Chelali et al. Text dependant speaker recognition using MFCC, LPC and DWT
Jothilakshmi et al. Large scale data enabled evolution of spoken language research and applications
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
Nandi et al. Parametric representation of excitation source information for language identification
Devi et al. An analysis on types of speech recognition and algorithms
Nedjah et al. Automatic speech recognition of Portuguese phonemes using neural networks ensemble
Grewal et al. Isolated word recognition system for English language
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Singh et al. An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
Sahoo et al. MFCC feature with optimized frequency range: An essential step for emotion recognition
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Trivedi A survey on English digit speech recognition using HMM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218